Modern Data Engineering Stack Flashcards

(31 cards)

1
Q

What is a data pipeline?

A

A data pipeline is a system that automatically moves data from one place to another and transforms it along the way so it becomes useful for analysis or applications.

2
Q

What is a data lake?

A

A data lake is a storage system that keeps large amounts of raw data in its original format until it is needed for analysis.

3
Q

What is a data warehouse?

A

A data warehouse is a database optimized for analyzing structured data and running large analytical queries quickly.

4
Q

What is the difference between a data lake and a data warehouse?

A

A data lake stores raw data in flexible formats, while a data warehouse stores cleaned, structured data optimized for analysis.

5
Q

What is a lakehouse architecture?

A

A lakehouse combines the flexibility of a data lake with the structure and performance of a data warehouse.

6
Q

What is the medallion architecture?

A

It is a data organization pattern with three layers, called Bronze, Silver, and Gold, in which raw data is gradually cleaned and refined into analytics-ready datasets.

7
Q

What is the Bronze layer in a medallion architecture?

A

The Bronze layer stores raw data exactly as it arrives from source systems with minimal processing.

8
Q

What is the Silver layer in a medallion architecture?

A

The Silver layer stores cleaned and standardized data that is ready for general analysis.

9
Q

What is the Gold layer in a medallion architecture?

A

The Gold layer contains highly refined datasets designed specifically for business reports and dashboards.
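
The three medallion layers can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, field names, and the `to_silver` and `to_gold` helpers are hypothetical, and a real lakehouse would use tables rather than in-memory lists.

```python
# Bronze: hypothetical raw records, kept exactly as they arrived (all strings).
bronze = [
    {"order_id": "1", "region": " east ", "amount": "10.50"},
    {"order_id": "2", "region": "WEST", "amount": "4.25"},
    {"order_id": "3", "region": "East", "amount": "not-a-number"},
]

def to_silver(rows):
    """Clean and standardize: parse types, normalize regions, skip unparseable rows."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumption: bad rows are skipped; they could also be quarantined
        cleaned.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().lower(),
            "amount": amount,
        })
    return cleaned

def to_gold(rows):
    """Aggregate into a report-ready shape: total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

Note that `bronze` is never modified, so the raw data stays available if the cleaning rules change later.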

10
Q

What is ETL?

A

ETL stands for Extract, Transform, Load: data is extracted from sources, transformed into the correct format, and then loaded into a destination database.
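
A minimal sketch of the three ETL steps, assuming an in-memory SQLite database as a stand-in for the destination. The `extract`, `transform`, and `load` functions, the `sales` table, and the sample rows are hypothetical names chosen for illustration.

```python
import sqlite3

def extract():
    # Stand-in for reading from an API, file, or source database.
    return [("alice", "2024-01-05", "19.99"), ("bob", "2024-01-06", "5.00")]

def transform(rows):
    # Reshape before loading: uppercase names, convert amounts to integer cents.
    return [(name.upper(), day, round(float(amount) * 100)) for name, day, amount in rows]

def load(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, day TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

etl_conn = sqlite3.connect(":memory:")
load(etl_conn, transform(extract()))
etl_rows = etl_conn.execute(
    "SELECT customer, amount_cents FROM sales ORDER BY customer"
).fetchall()
```

The transformation happens in application code before anything reaches the database, which is the defining trait of ETL.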

11
Q

What is ELT?

A

ELT stands for Extract, Load, Transform: raw data is first loaded into the warehouse, and transformations happen afterward inside the database.
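
The same idea can be sketched with an in-memory SQLite database standing in for a warehouse (an assumption made so the example runs anywhere). The table names `raw_sales` and `sales_by_customer` and the sample rows are hypothetical.

```python
import sqlite3

elt_conn = sqlite3.connect(":memory:")

# Extract + Load: raw values land untouched, with every field stored as text.
raw_rows = [("alice", "19.99"), ("bob", "5.00"), ("alice", "3.01")]
elt_conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
elt_conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)

# Transform: runs afterward, inside the database, expressed as SQL.
elt_conn.execute("""
    CREATE TABLE sales_by_customer AS
    SELECT customer, ROUND(SUM(CAST(amount AS REAL)), 2) AS total
    FROM raw_sales
    GROUP BY customer
""")

elt_rows = elt_conn.execute(
    "SELECT customer, total FROM sales_by_customer ORDER BY customer"
).fetchall()
```

Because `raw_sales` is kept as loaded, the transformation can be rewritten and rerun later without extracting the data again.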

12
Q

Why has ELT become popular?

A

Because modern warehouses are powerful enough to run transformations directly, data can be loaded quickly in raw form and processed later.

13
Q

What is dbt?

A

dbt is a tool that lets data engineers transform data inside a warehouse using version-controlled SQL models.

14
Q

Why do teams use dbt?

A

It organizes SQL transformations like software projects, with testing, documentation, and version control.

15
Q

What is Airflow?

A

Airflow is a workflow orchestration tool that schedules and manages data pipelines.

16
Q

What does orchestration mean in data engineering?

A

Orchestration means coordinating when and how different pipeline tasks run so data flows in the correct order.

17
Q

What is a DAG in Airflow?

A

A DAG, or Directed Acyclic Graph, is a graph of tasks and their dependencies, with no cycles, that defines the order in which tasks must run in a pipeline.
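
To see how a DAG determines run order, here is a small sketch using Python's standard-library `graphlib` module (Python 3.9+). It is not Airflow's API; the task names and the `has_cycle` helper are hypothetical and only illustrate the concept.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "publish_report": {"aggregate"},
}

# A topological sort yields an order in which every task runs after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

def has_cycle(graph):
    """Return True if the graph has a cycle, meaning no valid run order exists."""
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False
```

The "acyclic" part matters: if task A waited on B while B waited on A, neither could ever start, which is why `has_cycle` treats such a graph as invalid.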

18
Q

What is Kafka?

A

Kafka is a distributed streaming platform that allows systems to send and receive real-time streams of data.

19
Q

What is streaming data?

A

Streaming data is data that is processed continuously as it arrives, rather than being collected into large batches first.

20
Q

What is batch processing?

A

Batch processing handles large groups of data at scheduled intervals, such as hourly or daily jobs.
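
The contrast between batch and streaming processing can be sketched with a running total. The function names and sample events are hypothetical; a real streaming system would read from a source such as a message queue rather than a list.

```python
def batch_total(events):
    """Batch: wait until the whole group is available, then process it in one pass."""
    return sum(events)

def streaming_totals(events):
    """Streaming: update and emit the result as each event arrives."""
    total = 0
    for event in events:
        total += event
        yield total

events = [5, 3, 7]
batch_result = batch_total(events)
stream_results = list(streaming_totals(iter(events)))
```

Both approaches end at the same total, but the streaming version has an up-to-date answer after every event instead of only at the end of the interval.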

21
Q

What is a data catalog?

A

A data catalog is a searchable inventory of datasets that helps teams understand what data exists and how it can be used.

22
Q

What is data lineage?

A

Data lineage tracks where data originated and how it changed through each step of the pipeline.
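
One way to picture lineage is as a graph that maps each dataset to the datasets it was derived from. The sketch below is illustrative only: the dataset names and the `trace_sources` helper are hypothetical, and real lineage tools typically also record the transformation applied at each step.

```python
# Hypothetical lineage graph: dataset -> datasets it was derived from.
upstream = {
    "revenue_dashboard": ["orders_gold"],
    "orders_gold": ["orders_silver"],
    "orders_silver": ["orders_bronze", "customers_bronze"],
    "orders_bronze": [],
    "customers_bronze": [],
}

def trace_sources(dataset, graph):
    """Walk upstream and return the original datasets (those with no parents)."""
    parents = graph.get(dataset, [])
    if not parents:
        return {dataset}
    sources = set()
    for parent in parents:
        sources |= trace_sources(parent, graph)
    return sources

origins = trace_sources("revenue_dashboard", upstream)
```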

23
Q

Why is data lineage important?

A

It helps teams understand how reports are built and trace errors back to their source.

24
Q

What is data governance?

A

Data governance defines rules for how data should be managed, accessed, and protected.

25
Q

What is a schema?

A

A schema is the structure that defines how data is organized in a database, such as its tables, columns, and relationships.
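
A small schema can be sketched with SQLite, used here only because it ships with Python. The `customers` and `orders` tables are hypothetical; the `REFERENCES` clause expresses the relationship between them.

```python
import sqlite3

schema_conn = sqlite3.connect(":memory:")
schema_conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
""")

# PRAGMA table_info returns one row per column; index 1 holds the column name.
order_columns = [row[1] for row in schema_conn.execute("PRAGMA table_info(orders)")]
```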

26
Q

What is schema evolution?

A

Schema evolution means allowing the structure of data to change over time without breaking existing pipelines.
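
One common way to tolerate schema changes is a reader that supplies defaults for newly added fields and ignores fields it does not know. The sketch below assumes a hypothetical `currency` field added in a later schema version; the `read_order` helper and the records are illustrative.

```python
# Assumption: version 2 of the schema added an optional "currency" field.
DEFAULTS = {"currency": "USD"}

def read_order(record):
    """Read a record written under either schema version."""
    return {
        "order_id": record["order_id"],
        "amount": record["amount"],
        # Older records lack this field, so fall back to the default.
        "currency": record.get("currency", DEFAULTS["currency"]),
    }

old_record = {"order_id": 1, "amount": 9.5}
new_record = {"order_id": 2, "amount": 4.0, "currency": "EUR", "coupon": "X1"}

old_parsed = read_order(old_record)
new_parsed = read_order(new_record)
```

The unknown `coupon` field is ignored rather than rejected, so a producer can add fields without breaking this consumer.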

27
Q

What is columnar storage?

A

Columnar storage stores data by column instead of by row, which typically makes analytical queries faster.

28
Q

Why do modern warehouses use columnar storage?

A

Because analytical queries usually scan a few columns across many rows, and columnar storage reads only the needed columns.
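
The difference in layout can be sketched with plain Python structures. This is a conceptual illustration with hypothetical data, not how a warehouse stores bytes on disk.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 4.5},
    {"id": 3, "region": "east", "amount": 7.5},
]

# Column-oriented layout: each field's values are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Summing one field in the row layout has to visit every full record...
row_total = sum(row["amount"] for row in rows)

# ...while the column layout reads only the one list it needs.
column_total = sum(columns["amount"])
```

With three fields the saving is small, but on a table with hundreds of columns a query that needs two of them can skip most of the stored data.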

29
Q

What is a distributed database?

A

A distributed database spreads data across multiple machines so it can scale to handle large datasets and workloads.

30
Q

What is horizontal scaling?

A

Horizontal scaling means adding more machines to increase system capacity rather than upgrading a single machine.
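
A simple way to picture spreading data across machines is hash partitioning: each key is hashed, and the hash picks a node. The sketch below is an assumption-laden illustration; the `node_for` and `distribute` helpers are hypothetical, and plain modulo hashing is used for brevity even though production systems often prefer schemes that move fewer keys when nodes are added.

```python
import zlib

def node_for(key, node_count):
    """Pick a node index for a key using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % node_count

def distribute(keys, node_count):
    """Group keys by the node that would store them."""
    nodes = [[] for _ in range(node_count)]
    for key in keys:
        nodes[node_for(key, node_count)].append(key)
    return nodes

keys = [f"user-{i}" for i in range(1000)]
two_nodes = distribute(keys, 2)
four_nodes = distribute(keys, 4)
```

Adding nodes (going from `two_nodes` to `four_nodes`) divides the same keys among more machines, which is the essence of scaling horizontally.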

31
Q

What is fault tolerance?

A

Fault tolerance means a system continues operating even if some components fail.
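
One basic fault-tolerance technique is failover: if one replica is unavailable, the request is retried on another. The sketch below models replicas as plain functions; `read_with_failover`, `failed_replica`, and `healthy_replica` are hypothetical names used only for illustration.

```python
def read_with_failover(replicas, key):
    """Try each replica in turn and return the first successful read."""
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError:
            continue  # this replica is down; try the next one
    raise ConnectionError(f"all {len(replicas)} replicas failed")

def failed_replica(key):
    # Simulates a machine that is unreachable.
    raise ConnectionError("replica down")

def healthy_replica(key):
    # Simulates a working machine holding a copy of the data.
    return {"user-1": "alice"}[key]

value = read_with_failover([failed_replica, healthy_replica], "user-1")
```

The caller still gets an answer even though the first replica failed; only when every replica is down does the error surface.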