What is batch processing?
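Batch processing handles large volumes of data at scheduled intervals rather than as each record arrives. Data is collected over a period, processed as a group, and the output is written when the job finishes. Used for analytics, ETL, and reports where a latency of minutes to hours is acceptable. Examples: Hadoop MapReduce, Spark batch jobs.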
What is stream processing?
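Stream processing handles data continuously as it arrives, record by record or in small windows, instead of waiting for a complete batch. Latency is low, typically milliseconds to seconds. Used for real-time analytics, monitoring, and fraud detection. Examples: Kafka Streams (part of Apache Kafka), Apache Flink, Apache Storm.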
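As a rough illustration of continuous processing, the sketch below counts events per fixed-size (tumbling) time window and emits each window's result as soon as a later event closes it. It is plain Python, not the API of Kafka Streams, Flink, or Storm; the function name `tumbling_window_counts` and the sample events are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]], window_seconds: float
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_index, counts) each time a window closes.

    Assumption: events are (timestamp, key) pairs in timestamp order.
    """
    current_window = None
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        if current_window is not None and window != current_window:
            # A newer window has started, so the previous one is complete.
            yield current_window, dict(counts)
            counts = defaultdict(int)
        current_window = window
        counts[key] += 1
    if current_window is not None:
        # Flush the last open window when the input ends.
        yield current_window, dict(counts)


# Hypothetical sample events: (seconds since start, event type).
SAMPLE_EVENTS = [
    (0.5, "login"),
    (1.2, "login"),
    (9.9, "error"),
    (10.1, "login"),
    (25.0, "error"),
]

for window_index, window_counts in tumbling_window_counts(SAMPLE_EVENTS, 10):
    print(window_index, window_counts)
```

Because the function is a generator, results for early windows are available before later events have been read, which is the main contrast with a batch job that waits for the full input.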
What is MapReduce?
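MapReduce is a programming model for processing large datasets in parallel across a cluster. The map phase transforms each input record into key-value pairs. A shuffle step then groups the pairs by key. The reduce phase aggregates the values for each key. Because map tasks, and reduce tasks for different keys, are independent of each other, the work can be distributed across many machines.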
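The map and reduce phases can be sketched on a single machine with the classic word-count example. This is a minimal in-memory illustration of the model, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the intermediate `shuffle` step stands in for the grouping by key that a real framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key so each key can be reduced independently."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    # In a real cluster, each map_phase call could run on a different node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["the quick fox", "the lazy dog", "The fox"]))
```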
What is ETL (Extract Transform Load)?
ETL extracts data from source systems, transforms it (cleaning, aggregating, reformatting), and loads it into a target system such as a data warehouse. Used for data integration and analytics pipelines.
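A small sketch of the three stages, using only the Python standard library: an inline CSV string stands in for the source system and an in-memory SQLite database stands in for the warehouse. The table name `orders`, the sample data, and the cleaning rules (drop rows with a missing amount, normalize region names) are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; row 2 has a missing amount, row 3 has odd casing.
RAW_CSV = """order_id,region,amount
1,north,10.50
2,south,
3,North,4.50
4,south,20.00
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, normalize region, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append(
            (int(row["order_id"]), row["region"].strip().lower(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
print(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall())
```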
What is a data warehouse?
A data warehouse is a centralized repository for structured data integrated from multiple sources. It is optimized for analytics and reporting over historical data. Examples: Snowflake, BigQuery, Redshift.
What is a data lake?
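A data lake stores raw data of any shape (structured, semi-structured, or unstructured) in its native format until it is needed. It uses a schema-on-read approach: structure is applied when the data is read, not when it is written. More flexible than a data warehouse, but the data must be processed before it can be analyzed. Example: AWS S3 combined with analytics tools.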
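Schema-on-read can be illustrated with raw JSON lines that are stored untouched and only given a structure when a reader asks for one. The sketch below is a simplified stand-in for what query engines over a data lake do; `read_with_schema`, the sample records, and the schema format (field name mapped to a cast function and a default) are all hypothetical.

```python
import json

# Raw records kept exactly as they arrived; fields and types are inconsistent.
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.5", "ts": "2024-01-01"}',
    '{"user": "bo", "amount": 3}',
    '{"user": "cy", "note": "no amount"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    schema maps field name -> (cast function, default used when missing).
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Two readers can impose different schemas on the same raw data.
billing_schema = {"user": (str, None), "amount": (float, 0.0)}
audit_schema = {"user": (str, None), "ts": (str, None)}

print(list(read_with_schema(RAW_EVENTS, billing_schema)))
print(list(read_with_schema(RAW_EVENTS, audit_schema)))
```

The stored data never changes; only the interpretation does, which is why a data lake is flexible but pushes the processing cost to read time.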
What is data partitioning in analytics?
Partitioning divides a large dataset into smaller chunks based on a criterion such as date or region. It improves query performance because a query that filters on the partition key scans only the relevant partitions (partition pruning). Common in big data systems.
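The effect of scanning only the relevant partition can be seen in a small in-memory sketch. The helper names `partition_by` and `query_partition` and the sample rows are hypothetical; real systems usually partition files or table segments on disk, but the idea is the same.

```python
from collections import defaultdict

# Hypothetical sample data.
ROWS = [
    {"date": "2024-01-01", "region": "eu", "amount": 5},
    {"date": "2024-01-01", "region": "us", "amount": 7},
    {"date": "2024-01-02", "region": "eu", "amount": 3},
    {"date": "2024-01-02", "region": "us", "amount": 9},
    {"date": "2024-01-03", "region": "eu", "amount": 4},
]


def partition_by(rows, key):
    """Split rows into partitions keyed by the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def query_partition(partitions, key_value, predicate):
    """Scan only the partition for key_value.

    Returns (matching rows, number of rows scanned).
    """
    candidates = partitions.get(key_value, [])
    return [row for row in candidates if predicate(row)], len(candidates)


partitions = partition_by(ROWS, "date")
hits, scanned = query_partition(
    partitions, "2024-01-02", lambda row: row["amount"] > 5
)
print(hits, f"scanned {scanned} of {len(ROWS)} rows")
```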
What is OLTP vs OLAP?
OLTP (Online Transaction Processing): handles day-to-day transactions with many short reads and writes, usually on a normalized schema. OLAP (Online Analytical Processing): handles complex analytical queries over historical data; read-heavy, often on a denormalized schema.
What is a materialized view?
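A materialized view is a precomputed query result stored physically, like a table. It trades storage and refresh cost for faster reads. It must be refreshed (on demand, on a schedule, or incrementally, depending on the system) when the source data changes; otherwise it returns stale results.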
What is data sharding in analytics?
Sharding distributes data across nodes so that it can be processed in parallel. Each shard holds a subset of the data. This enables horizontal scaling for large datasets. The sharding key determines which shard each record goes to, so a poorly chosen key produces unevenly loaded shards.
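One common way to map a sharding key to a shard is to hash the key and take the result modulo the number of shards. The sketch below assumes that scheme; `shard_for`, `distribute`, and the sample records are hypothetical. It uses `hashlib` instead of Python's built-in `hash()` because the built-in is randomized per process for strings, which would send the same key to different shards on different runs.

```python
import hashlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Map a sharding key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(records, key_field, num_shards):
    """Assign each record to a shard based on its sharding key."""
    shards = defaultdict(list)
    for record in records:
        shards[shard_for(str(record[key_field]), num_shards)].append(record)
    return dict(shards)


# Hypothetical sample records, sharded by user_id.
RECORDS = [
    {"user_id": "u1", "amount": 5},
    {"user_id": "u2", "amount": 7},
    {"user_id": "u1", "amount": 3},
    {"user_id": "u3", "amount": 9},
]

for shard_index, shard_records in sorted(distribute(RECORDS, "user_id", 4).items()):
    print(shard_index, shard_records)
```

All records with the same `user_id` land on the same shard, so per-user aggregations need no cross-shard communication. A limitation of plain modulo hashing is that changing the number of shards reassigns most keys.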