Data Processing Flashcards

Question 1

Q

What is batch processing?

Answer

A

Batch processing handles large volumes of data at scheduled intervals. Data is collected over time, processed in groups, and the output is produced once the job finishes. Used for analytics, ETL, and reports. Examples: Hadoop MapReduce, Spark batch jobs.
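
A minimal sketch of the idea in plain Python, not tied to Hadoop or Spark: records are grouped into fixed-size batches and each batch is processed as a unit. The names `batches` and `run_batch_job` and the batch size are illustrative assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(records: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive groups of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_batch_job(records: Iterable[int], size: int) -> List[int]:
    """Process each batch as a whole; here the 'processing' is a simple sum."""
    return [sum(chunk) for chunk in batches(records, size)]


# Ten collected records, processed in groups of four.
totals = run_batch_job(range(1, 11), size=4)
print(totals)
```

In a real deployment the job would be triggered by a scheduler rather than called inline.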

Question 2

Q

What is stream processing?

Answer

A

Stream processing handles data in real time (or near real time) as it arrives. Processing is continuous and low-latency. Used for real-time analytics, monitoring, and fraud detection. Examples: Kafka Streams, Apache Flink, Apache Storm.
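
A minimal sketch of the contrast with batch processing, using a plain Python generator rather than Kafka Streams or Flink: each event produces an updated result as soon as it arrives. The function name `rolling_mean` and the window size are illustrative assumptions.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def rolling_mean(events: Iterable[float], window: int) -> Iterator[Tuple[float, float]]:
    """Emit (event, mean of the last `window` events) for every incoming event."""
    recent: Deque[float] = deque(maxlen=window)
    for value in events:
        recent.append(value)              # oldest value drops out automatically
        yield value, sum(recent) / len(recent)


# The list stands in for an unbounded source such as a message queue.
results = list(rolling_mean([10.0, 20.0, 30.0, 40.0], window=2))
for value, mean in results:
    print(value, mean)
```

Because the generator keeps only the last `window` values, memory use stays constant however long the stream runs.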

Question 3

Q

What is MapReduce?

Answer

A

MapReduce is a programming model for processing large datasets. The map phase transforms input records into key-value pairs. A shuffle step groups the pairs by key. The reduce phase aggregates the values for each key. Enables parallel, distributed processing.
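
The phases can be sketched as a single-process word count. This is an illustration of the model only, not the Hadoop API; the function names are assumptions, and a real framework would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: turn one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate the values for one key."""
    return key, sum(values)


lines = ["big data big jobs", "data jobs data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```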

Question 4

Q

What is ETL (Extract Transform Load)?

Answer

A

ETL extracts data from source systems, transforms it (cleaning, aggregating, formatting), and loads it into a target system such as a data warehouse. Used for data integration and analytics pipelines.
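
A small sketch of the three steps using only the Python standard library. The CSV text, the `sales` table, and the cleaning rules are made-up assumptions for illustration; an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical source data: note the stray space and the row with no region.
RAW_CSV = "region,amount\nnorth, 10.5\nsouth,4\nnorth,2.5\n,7\n"


def extract(text: str) -> List[Dict[str, str]]:
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: List[Dict[str, str]]) -> List[Tuple[str, float]]:
    """Transform: drop rows without a region and convert amounts to numbers."""
    cleaned = []
    for row in rows:
        region = row["region"].strip()
        if not region:
            continue
        cleaned.append((region, float(row["amount"])))
    return cleaned


def load(rows: List[Tuple[str, float]], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)
```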

Question 5

Q

What is a data warehouse?

Answer

A

A data warehouse is a centralized repository for structured data from multiple sources. Optimized for analytics and reporting with historical data. Examples: Snowflake, BigQuery, Redshift.

Question 6

Q

What is a data lake?

Answer

A

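A data lake stores raw data (structured, semi-structured, or unstructured) in its native format until needed. Uses a schema-on-read approach: structure is applied when the data is queried, not when it is written. More flexible than a data warehouse, but the data must be processed before analysis. Example: Amazon S3 combined with analytics tools.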
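
Schema-on-read can be sketched as follows: raw records are stored untouched, and a schema is imposed only by the reader. The JSON lines, field names, and defaults are illustrative assumptions.

```python
import json
from typing import Dict, Iterable, Iterator, Union

# Raw records as they might sit in a lake: inconsistent types, optional fields.
RAW_LINES = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": 5, "country": "DE"}',
    '{"user": "c"}',
]


def read_with_schema(lines: Iterable[str]) -> Iterator[Dict[str, Union[str, int]]]:
    """Apply the schema (user: str, clicks: int) at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": str(record["user"]),
            "clicks": int(record.get("clicks", 0)),  # coerce type, default to 0
        }


rows = list(read_with_schema(RAW_LINES))
print(rows)
```

A different reader could apply a different schema to the same stored lines, which is where the flexibility comes from.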

Question 7

Q

What is data partitioning in analytics?

Answer

A

Partitioning divides large datasets into smaller chunks based on a key such as date or region. Improves query performance because a query scans only the relevant partitions (partition pruning). Common in big data systems.
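
A minimal in-memory sketch of partitioning by date, with a query that touches only one partition. The sample events and function names are assumptions; real systems typically map partitions to directories or files rather than dictionary entries.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

events: List[Row] = [
    {"date": "2024-01-01", "region": "eu", "amount": 5},
    {"date": "2024-01-01", "region": "us", "amount": 7},
    {"date": "2024-01-02", "region": "eu", "amount": 3},
]


def partition_by(rows: List[Row], key: str) -> Dict[Any, List[Row]]:
    """Split rows into one list per distinct value of `key`."""
    parts: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def total_for_date(partitions: Dict[Any, List[Row]], date: str) -> int:
    """Scan only the partition for `date`; other partitions are never read."""
    return sum(row["amount"] for row in partitions.get(date, []))


partitions = partition_by(events, "date")
jan1_total = total_for_date(partitions, "2024-01-01")
print(sorted(partitions), jan1_total)
```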

Question 8

Q

What is OLTP vs OLAP?

Answer

A

OLTP (Online Transaction Processing): handles day-to-day transactions; many short reads and writes; normalized schema. OLAP (Online Analytical Processing): handles complex analytical queries; read-heavy; often denormalized; works on historical data.
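
The two access patterns can be contrasted in a small sketch. SQLite is used here only for convenience and the `orders` table is a made-up assumption; production OLTP and OLAP workloads normally run on separate systems.

```python
import sqlite3
from typing import Dict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)


def place_order(conn: sqlite3.Connection, customer: str, amount: float) -> int:
    """OLTP-style operation: one small write inside a transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )
    return cur.lastrowid


def revenue_by_customer(conn: sqlite3.Connection) -> Dict[str, float]:
    """OLAP-style operation: a read-only aggregate over all stored rows."""
    query = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    return dict(conn.execute(query))


first_id = place_order(conn, "ann", 20.0)
place_order(conn, "bob", 5.0)
place_order(conn, "ann", 10.0)
report = revenue_by_customer(conn)
print(first_id, report)
```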

Question 9

Q

What is a materialized view?

Answer

A

A materialized view is a precomputed query result stored like a table. Trades storage and update cost for faster reads. Must be refreshed when the source data changes; until then, reads can return stale results.
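
SQLite has no built-in materialized views, so the sketch below emulates one with an ordinary table that is rebuilt on demand. The table names and the `refresh_sales_by_region` helper are illustrative assumptions; it shows the stale-until-refreshed behavior.

```python
import sqlite3
from typing import Dict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("north", 10.0), ("south", 4.0)]
)
conn.commit()


def refresh_sales_by_region(conn: sqlite3.Connection) -> None:
    """Recompute the aggregate and store the result as a table."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS sales_by_region")
        conn.execute(
            "CREATE TABLE sales_by_region AS "
            "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
        )


def read_view(conn: sqlite3.Connection) -> Dict[str, float]:
    """Fast read: no aggregation happens at query time."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))


refresh_sales_by_region(conn)
before = read_view(conn)

conn.execute("INSERT INTO sales VALUES ('north', 5.0)")
conn.commit()
stale = read_view(conn)          # source changed, stored result did not

refresh_sales_by_region(conn)
after = read_view(conn)          # refresh picks up the new row
print(before, stale, after)
```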

Question 10

Q

What is data sharding in analytics?

Answer

A

Sharding distributes data across nodes for parallel processing. Each shard contains a subset of the data. Enables horizontal scaling for large datasets. The sharding key determines how data is distributed.
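
One common way to map a sharding key to a shard is hash-based routing, sketched below. The choice of `user_id` as the key, the shard count, and the function names are assumptions; range-based or directory-based schemes are equally valid.

```python
import hashlib
from typing import Dict, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(records: List[Dict[str, str]], num_shards: int) -> List[List[Dict[str, str]]]:
    """Place each record on the shard chosen by its user_id."""
    shards: List[List[Dict[str, str]]] = [[] for _ in range(num_shards)]
    for record in records:
        shards[shard_for(record["user_id"], num_shards)].append(record)
    return shards


records = [{"user_id": f"user-{i}", "event": "login"} for i in range(20)]
shards = distribute(records, num_shards=4)
print([len(shard) for shard in shards])
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash`, which is randomized per process for strings and would route the same key differently across runs.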

(10 cards)