Batch Ingestion & ELT Flashcards

(41 cards)

1
Q

What is batch ingestion in data engineering?

A

The process of collecting and loading data in discrete chunks at scheduled intervals rather than continuously as events arrive.

2
Q

What are common batch intervals used in ingestion?

A

Hourly, daily, weekly, or custom intervals based on business needs and source system constraints.

3
Q

What is the difference between a full refresh and an incremental load?

A

A full refresh reloads all data from the source each time, while an incremental load only processes new or changed records since the last run.

4
Q

When is a full refresh acceptable for batch ingestion?

A

When the dataset is small enough that reloading it regularly is feasible and simple, and the business can tolerate the cost and latency.

5
Q

Why are incremental loads preferred for large tables?

A

They reduce processing time, I/O, and load on source systems by only handling new or changed data.

6
Q

What is a high-water mark (watermark) in incremental loading?

A

A stored value representing the last successfully processed point in a sequence, such as the latest timestamp or ID seen.

7
Q

How is a high-water mark used in practice?

A

Each batch load reads rows with a key greater than the stored watermark, then updates the watermark when the load completes successfully.

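
The read-filter-advance cycle described above can be sketched in Python. This is a hedged illustration, assuming SQLite via the stdlib `sqlite3` module; the `watermarks`, `source_events`, and `target_events` tables and their columns are hypothetical names, not a standard.

```python
import sqlite3

def incremental_load(conn: sqlite3.Connection) -> int:
    """One batch: pull rows above the stored watermark, then advance it.
    Table and column names (source_events, target_events, watermarks)
    are illustrative, not a standard."""
    cur = conn.cursor()
    # Read the last successfully processed position (0 if no run yet).
    cur.execute("SELECT value FROM watermarks WHERE name = 'events'")
    row = cur.fetchone()
    watermark = row[0] if row else 0

    # Pull only rows strictly beyond the watermark.
    cur.execute(
        "SELECT id, payload FROM source_events WHERE id > ? ORDER BY id",
        (watermark,),
    )
    rows = cur.fetchall()
    if not rows:
        return 0

    # Load into the target, then advance the watermark in the same
    # transaction so a failure leaves both unchanged and the batch
    # can simply be retried.
    cur.executemany("INSERT INTO target_events (id, payload) VALUES (?, ?)", rows)
    new_mark = rows[-1][0]
    cur.execute(
        "INSERT INTO watermarks (name, value) VALUES ('events', ?) "
        "ON CONFLICT(name) DO UPDATE SET value = excluded.value",
        (new_mark,),
    )
    conn.commit()
    return len(rows)
```

Committing the data and the watermark together is what makes a failed run safely re-runnable: either both advance or neither does.
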
8
Q

What is change data capture (CDC)?

A

A set of techniques for capturing and replicating changes from a source system, typically by reading logs or change tables.

9
Q

What are two main approaches to CDC from databases?

A

Log-based CDC, which reads database transaction logs, and query-based CDC, which uses queries on timestamps or change flags.

10
Q

Why is log-based CDC often preferred over query-based CDC?

A

It captures all changes with minimal impact on the source, including deletes and updates, and avoids heavy polling queries.

11
Q

What is an upsert operation in the context of batch loads?

A

A combined insert and update operation that inserts new rows and updates existing rows based on a key match.

12
Q

Why are upserts important in incremental pipelines?

A

Because they allow incremental loads to keep target tables in sync with source changes without full reloads.

13
Q

What is a typical upsert pattern for batch ELT?

A

Load new data into a staging table and then MERGE from staging into the target table based on business keys and change logic.

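
A minimal stand-in for this staging-then-merge pattern, using SQLite via the stdlib `sqlite3` module. SQLite's dialect has no `MERGE`, so an `INSERT ... ON CONFLICT DO UPDATE` upsert plays that role here; the `dim_customer` and `stg_customer` table names are illustrative.

```python
import sqlite3

# SQLite stand-in for MERGE: insert staged rows, and on a business-key
# collision update the existing row instead. The "WHERE true" is SQLite's
# documented workaround for a parsing ambiguity when UPSERT follows SELECT.
MERGE_FROM_STAGING = """
INSERT INTO dim_customer (customer_id, name, email)
SELECT customer_id, name, email FROM stg_customer
WHERE true
ON CONFLICT(customer_id) DO UPDATE SET
    name  = excluded.name,
    email = excluded.email
"""

def merge_from_staging(conn: sqlite3.Connection) -> None:
    # One transaction: merge into the target and clear staging together,
    # so a failure leaves staging intact for a retry.
    with conn:
        conn.execute(MERGE_FROM_STAGING)
        conn.execute("DELETE FROM stg_customer")
```

In a warehouse with a real `MERGE` statement the same shape applies: match on the business key, update on match, insert otherwise.
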
14
Q

Why is staging data before merging into the final table a good practice?

A

It isolates raw ingest from business logic, simplifies debugging, and allows validation before impacting curated tables.

15
Q

What is ELT (Extract-Load-Transform)?

A

A pattern where data is extracted from sources, loaded into a central store, and then transformed there using its compute engine.

16
Q

How does ELT differ from ETL (Extract-Transform-Load)?

A

ETL performs transformations before loading into the target store, while ELT performs most transformations after loading, inside the warehouse or lakehouse.

17
Q

Why has ELT become popular with modern warehouses and lakehouses?

A

Because they provide scalable compute close to storage, making in-place transformations efficient and reducing the need for separate ETL servers.

18
Q

What is a common layout for ELT in a warehouse?

A

Raw tables hold ingested data as-is, staging tables standardize and clean it, and curated tables provide modeled data for consumption.

19
Q

What is the role of a landing zone or raw layer in batch ingestion?

A

To capture incoming data with minimal transformation as a source-of-truth that can be reprocessed if downstream logic changes.

20
Q

Why should raw data be treated as append-only where possible?

A

Append-only designs simplify consistency, auditing, and reprocessing without worrying about in-place edits to historical records.

21
Q

What is deduplication in batch pipelines?

A

The process of identifying and removing duplicate records that may arise from retries, CDC anomalies, or source system behavior.

22
Q

What keys are commonly used for deduplication?

A

Natural business keys combined with timestamps or monotonically increasing IDs, or surrogate event identifiers if available.

23
Q

Why must deduplication logic be idempotent?

A

Because pipelines may rerun the same batch or reprocess overlapping windows; idempotent logic ensures consistent results without duplicates.
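
One way to sketch idempotent deduplication in plain Python: keep the newest record per business key, so reruns and overlapping windows converge on the same result. The `order_id` and `updated_at` field names are hypothetical.

```python
from typing import Iterable

def dedupe_latest(records: Iterable[dict]) -> list[dict]:
    """Keep one record per business key, preferring the highest timestamp.
    Running this twice, or on overlapping batches, yields the same output,
    which is what makes the step idempotent. Field names are illustrative."""
    latest: dict = {}
    for rec in records:
        key = rec["order_id"]
        # Replace only if this record is newer than the one kept so far.
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    # Deterministic ordering also helps diff-based validation of reruns.
    return sorted(latest.values(), key=lambda r: r["order_id"])
```

In SQL the equivalent is typically `ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC)` and keeping row number 1.
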

24
Q

What is backfilling in batch pipelines?

A

Reprocessing historical data for past periods, often to fix bugs, rebuild models, or populate new tables with older data.

25
Q

What considerations are important when backfilling large ranges?

A

Managing compute load, avoiding interference with current production loads, and ensuring backfill logic is consistent with current transformations.

26
Q

What is a batch window in ingestion design?

A

The time frame during which a batch job must complete to meet data freshness requirements, such as within an hour of data generation.

27
Q

Why is aligning batch windows with upstream and downstream SLAs important?

A

So that source systems are not overloaded and consumers get data updates within agreed timeframes.

28
Q

What is late-arriving data in batch systems?

A

Data that arrives after the expected processing period for its event time, such as records with old timestamps appearing in later batches.

29
Q

How can a batch system handle late-arriving data?

A

By using watermark and windowing strategies, periodic backfills, or upsert logic that can revise historical partitions.

30
Q

What is a common pattern for loading fact tables incrementally?

A

Stage new fact rows, deduplicate by key, then MERGE into the fact table based on the defined grain and keys.

31
Q

How are slowly changing dimensions typically updated in batch?

A

By detecting changes in dimension attributes and applying SCD Type 1 (overwrite) or Type 2 (add new version rows) logic in batch jobs.

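
A compact sketch of SCD Type 2 logic in Python, assuming a simple in-memory history list; real implementations usually run this as set-based SQL, and the field names here are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DimRow:
    key: str                 # business key
    attrs: dict              # tracked dimension attributes
    valid_from: str
    valid_to: Optional[str]  # None marks the current version

def apply_scd2(history: list[DimRow], incoming_key: str,
               incoming_attrs: dict, batch_date: str) -> list[DimRow]:
    """Type 2: if tracked attributes changed, close the current version
    and append a new one; if nothing changed, leave history untouched."""
    current = next(
        (r for r in history if r.key == incoming_key and r.valid_to is None),
        None,
    )
    if current is None:
        # First time we see this key: open its first version.
        history.append(DimRow(incoming_key, incoming_attrs, batch_date, None))
    elif current.attrs != incoming_attrs:
        current.valid_to = batch_date  # close the old version
        history.append(DimRow(incoming_key, incoming_attrs, batch_date, None))
    return history
```

Type 1 would instead overwrite `current.attrs` in place, keeping a single row per key.
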
32
Q

Why is batch ordering and dependency management important?

A

Downstream jobs may rely on outputs from upstream jobs; running them out of order can produce inconsistent or incomplete datasets.

33
Q

What is atomicity in the context of batch loads into target tables?

A

The property that either the entire set of changes is applied or none is, ensuring consumers never see partial updates.

34
Q

How can atomic batch loads be implemented?

A

By writing to temporary tables and then swapping or renaming them in a single operation, or using transactional table formats.

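
The swap technique can be sketched against SQLite (via the stdlib `sqlite3` module), where DDL participates in transactions, so the drop-and-rename is applied atomically; `daily_summary` is an illustrative table name.

```python
import sqlite3

def atomic_swap(conn: sqlite3.Connection, rows: list) -> None:
    """Rebuild a table under a temporary name, then swap it in within
    one transaction so readers never observe a half-loaded table."""
    with conn:  # single transaction: all-or-nothing
        conn.execute("DROP TABLE IF EXISTS daily_summary_new")
        conn.execute("CREATE TABLE daily_summary_new (day TEXT, total INTEGER)")
        conn.executemany("INSERT INTO daily_summary_new VALUES (?, ?)", rows)
        # The swap: readers see either the old table or the new one.
        conn.execute("DROP TABLE IF EXISTS daily_summary")
        conn.execute("ALTER TABLE daily_summary_new RENAME TO daily_summary")
```

Warehouses offer the same idea under names like `ALTER TABLE ... SWAP WITH` or `CREATE OR REPLACE TABLE`; transactional table formats make the commit itself the atomic switch.
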
35
Q

What role do checksums and row counts play in batch ingestion?

A

They help verify that the number of records and data integrity match expectations, serving as basic data quality checks.

36
Q

What is a typical data validation step after loading a batch?

A

Comparing row counts and key distributions to source or prior runs, checking for unexpected nulls, and ensuring basic constraints hold.

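
These checks can be sketched as a small validation function that returns failure messages rather than raising, so a pipeline can decide whether to halt or warn; the required fields and record shape are illustrative.

```python
def validate_batch(rows: list, expected_count: int,
                   required_fields: tuple = ("id", "amount")) -> list:
    """Basic post-load checks: row count versus expectation, and
    unexpected nulls in required fields. Returns a list of failure
    messages; an empty list means the batch passed."""
    failures = []
    if len(rows) != expected_count:
        failures.append(f"row count {len(rows)} != expected {expected_count}")
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        if nulls:
            failures.append(f"{nulls} null value(s) in required field '{field}'")
    return failures
```
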
37
Q

Why should batch jobs be idempotent and restartable?

A

Because failures, restarts, and partial runs are inevitable, and pipelines must recover without manual data repair whenever possible.

38
Q

What is the difference between schedule-based and event-based triggering of batch jobs?

A

Schedule-based jobs run at fixed times; event-based jobs run in response to triggers such as file arrival or a prior job’s completion.

39
Q

Why is it important to log metadata for each batch run?

A

Run metadata such as timestamps, status, input ranges, and row counts supports debugging, auditing, and observability.

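
A minimal sketch of capturing run metadata around a batch step; the fields shown are a typical minimal set, not any particular tool's schema.

```python
import time

def record_run(log: list, job: str, input_range: tuple, fn) -> dict:
    """Run one batch step and capture metadata (status, row count,
    timing, input range) for debugging and auditing. Illustrative only;
    orchestrators persist the same kind of record in their metadata DB."""
    entry = {"job": job, "input_range": input_range,
             "started_at": time.time(), "status": "running"}
    try:
        entry["row_count"] = fn()       # fn returns rows processed
        entry["status"] = "success"
    except Exception as exc:
        entry["status"] = "failed"      # keep the error for triage
        entry["error"] = str(exc)
    entry["finished_at"] = time.time()
    log.append(entry)
    return entry
```
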
40
Q

What is the role of orchestration tools in batch ELT?

A

They manage job schedules, dependencies, retries, and alerting, providing end-to-end visibility over pipelines.

41
Q

What is a good one-sentence mental model for batch ingestion and ELT?

A

Regularly move and stage new or changed data in append-only fashion, validate and deduplicate it, then transform it inside the central store into curated, incremental views of the business.