What is batch ingestion in data engineering?
The process of collecting and loading data in discrete chunks at scheduled intervals rather than continuously as events arrive.
What are common batch intervals used in ingestion?
Hourly, daily, weekly, or custom intervals based on business needs and source system constraints.
What is the difference between a full refresh and an incremental load?
A full refresh reloads all data from the source each time, while an incremental load only processes new or changed records since the last run.
When is a full refresh acceptable for batch ingestion?
When the dataset is small enough that reloading it on every run is simple and cheap, and the business can tolerate the extra cost and latency.
Why are incremental loads preferred for large tables?
They reduce processing time, I/O, and load on source systems by only handling new or changed data.
What is a high-water mark (watermark) in incremental loading?
A stored value representing the last successfully processed point in a sequence, such as the latest timestamp or ID seen.
How is a high-water mark used in practice?
Each batch load reads rows with a key greater than the stored watermark, then updates the watermark when the load completes successfully.
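The watermark cycle above can be sketched in a few lines of Python. This is a minimal illustration using plain lists in place of real source and target tables; all names (`incremental_load`, the `id` key) are illustrative, not from any specific library.

```python
# Sketch of watermark-based incremental loading. Source and target are
# plain lists standing in for tables; "id" plays the role of the
# monotonically increasing key.

def incremental_load(source_rows, target_rows, watermark):
    """Load rows whose id is greater than the stored watermark.

    Returns the new watermark; it is advanced only after the batch has
    been fully appended, so a failed run can safely be retried.
    """
    new_rows = [r for r in source_rows if r["id"] > watermark]
    target_rows.extend(new_rows)            # the actual load step
    if new_rows:
        watermark = max(r["id"] for r in new_rows)
    return watermark

source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = []
wm = incremental_load(source, target, watermark=0)   # first run loads ids 1-3
source.append({"id": 4})
wm = incremental_load(source, target, watermark=wm)  # next run loads only id 4
```

Note that the watermark is returned (and would be persisted) only after the load succeeds; rerunning the same batch with the old watermark simply reloads the same window, which is why downstream deduplication or upserts matter.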
What is change data capture (CDC)?
A set of techniques for capturing and replicating changes from a source system, typically by reading logs or change tables.
What are two main approaches to CDC from databases?
Log-based CDC, which reads the database's transaction log, and query-based CDC, which repeatedly polls the source with queries filtered on timestamps or change flags.
Why is log-based CDC often preferred over query-based CDC?
It captures all changes with minimal impact on the source, including deletes and updates, and avoids heavy polling queries.
What is an upsert operation in the context of batch loads?
A combined insert and update operation that inserts new rows and updates existing rows based on a key match.
Why are upserts important in incremental pipelines?
Because they allow incremental loads to keep target tables in sync with source changes without full reloads.
What is a typical upsert pattern for batch ELT?
Load new data into a staging table and then MERGE from staging into the target table based on business keys and change logic.
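The staging-then-merge pattern can be demonstrated with SQLite, which supports an upsert via `INSERT ... ON CONFLICT` (SQLite 3.24+) in place of a warehouse `MERGE` statement. Table and column names here are illustrative only.

```python
# Hedged sketch of the staging -> merge upsert pattern using SQLite's
# ON CONFLICT clause as a stand-in for a warehouse MERGE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT)")

# Existing target state, plus a fresh batch landed in staging.
conn.execute("INSERT INTO target VALUES (1, 'old'), (2, 'keep')")
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(1, "updated"), (3, "new")])

# Merge: insert rows with new keys, update rows whose key already exists.
# (SQLite requires the WHERE clause to disambiguate INSERT...SELECT
# from the ON CONFLICT clause.)
conn.execute("""
    INSERT INTO target (id, name)
    SELECT id, name FROM staging
    WHERE true
    ON CONFLICT(id) DO UPDATE SET name = excluded.name
""")
rows = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
```

In a real warehouse the same shape appears as `MERGE INTO target USING staging ON target.id = staging.id ...`, with the business keys and change logic spelled out in the `ON` and `WHEN MATCHED` clauses.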
Why is staging data before merging into the final table a good practice?
It isolates raw ingest from business logic, simplifies debugging, and allows validation before impacting curated tables.
What is ELT (Extract-Load-Transform)?
A pattern where data is extracted from sources, loaded into a central store, and then transformed there using its compute engine.
How does ELT differ from ETL (Extract-Transform-Load)?
ETL performs transformations before loading into the target store, while ELT performs most transformations after loading, inside the warehouse or lakehouse.
Why has ELT become popular with modern warehouses and lakehouses?
Because they provide scalable compute close to storage, making in-place transformations efficient and reducing the need for separate ETL servers.
What is a common layout for ELT in a warehouse?
Raw tables hold ingested data as-is, staging tables standardize and clean it, and curated tables provide modeled data for consumption.
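The three-layer layout can be made concrete with a toy example, again using SQLite as a stand-in for the warehouse engine. Views represent the transform steps; all table and column names are assumptions for illustration.

```python
# Toy sketch of the raw -> staging -> curated layering. Raw holds data
# as ingested; staging standardizes it; curated models it for use.
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw layer: data exactly as ingested, including messy casing,
# stray whitespace, and duplicates.
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                 [(1, " PLACED "), (2, "shipped"), (2, "shipped")])

# Staging layer: clean and standardize using the engine's own compute.
conn.execute("""
    CREATE VIEW stg_orders AS
    SELECT DISTINCT id, lower(trim(status)) AS status FROM raw_orders
""")

# Curated layer: modeled output ready for consumption.
conn.execute("""
    CREATE VIEW curated_order_counts AS
    SELECT status, count(*) AS n FROM stg_orders GROUP BY status
""")
rows = conn.execute(
    "SELECT status, n FROM curated_order_counts ORDER BY status").fetchall()
```

Keeping each layer separate means the raw table can always be replayed through new staging and curated logic without re-ingesting from the source.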
What is the role of a landing zone or raw layer in batch ingestion?
To capture incoming data with minimal transformation as a source of truth that can be reprocessed if downstream logic changes.
Why should raw data be treated as append-only where possible?
Append-only designs simplify consistency, auditing, and reprocessing without worrying about in-place edits to historical records.
What is deduplication in batch pipelines?
The process of identifying and removing duplicate records that may arise from retries, CDC anomalies, or source system behavior.
What keys are commonly used for deduplication?
Natural business keys combined with timestamps or monotonically increasing IDs, or surrogate event identifiers if available.
Why must deduplication logic be idempotent?
Because pipelines may rerun the same batch or reprocess overlapping windows; idempotent logic ensures consistent results without duplicates.
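An idempotent deduplication step might look like the following sketch: keep the latest record per business key, so reprocessing the same or overlapping batches converges to the same result. The field names (`order_id`, `updated_at`, `status`) are assumptions for illustration.

```python
# Illustrative idempotent deduplication: one record per business key,
# keeping the row with the highest updated_at.

def dedupe(records):
    """Return one record per order_id, keeping the latest updated_at."""
    latest = {}
    for r in records:
        key = r["order_id"]
        if key not in latest or r["updated_at"] > latest[key]["updated_at"]:
            latest[key] = r
    return sorted(latest.values(), key=lambda r: r["order_id"])

batch = [
    {"order_id": 1, "updated_at": 10, "status": "placed"},
    {"order_id": 1, "updated_at": 20, "status": "shipped"},
    {"order_id": 2, "updated_at": 15, "status": "placed"},
]
once = dedupe(batch)
twice = dedupe(batch + batch)  # simulate a rerun over overlapping data
```

Because `dedupe(batch + batch)` produces the same output as `dedupe(batch)`, a retried or overlapping run cannot introduce duplicates, which is exactly the idempotence property described above.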
What is backfilling in batch pipelines?
Reprocessing historical data for past periods, often to fix bugs, rebuild models, or populate new tables with older data.
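A common shape for a backfill is a loop over past date partitions, rerunning the daily job for each one. This is a minimal sketch in which `process_partition` is a hypothetical job entry point that rebuilds one day's partition; overwriting the partition keeps the operation idempotent.

```python
# Minimal backfill sketch: rerun the daily batch job for each date in
# a historical range. process_partition is a hypothetical callable
# that rebuilds (overwrites) a single day's partition.
from datetime import date, timedelta

def backfill(process_partition, start, end):
    """Run the daily job once per date in the inclusive range [start, end]."""
    day = start
    processed = []
    while day <= end:
        process_partition(day)   # rebuilds that day's partition
        processed.append(day)
        day += timedelta(days=1)
    return processed

runs = backfill(lambda d: None, date(2024, 1, 1), date(2024, 1, 3))
```

In practice orchestrators expose this directly (e.g. rerunning a date-parameterized job over a range), but the core idea is the same: the job is parameterized by partition, so any slice of history can be rebuilt independently.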