What is a data pipeline?
A data pipeline is a system that automatically moves data from one place to another and transforms it along the way so it becomes useful for analysis or applications.
What is a data lake?
A data lake is a storage system that keeps large amounts of raw data in its original format until it is needed for analysis.
What is a data warehouse?
A data warehouse is a database optimized for analyzing structured data and running large analytical queries quickly.
What is the difference between a data lake and a data warehouse?
A data lake stores raw, flexible data, while a data warehouse stores cleaned, structured data optimized for analysis.
What is a lakehouse architecture?
A lakehouse combines the flexibility of a data lake with the structure and performance of a data warehouse.
What is the medallion architecture?
It is a data organization pattern with layers called Bronze, Silver, and Gold, where raw data is gradually cleaned and refined into analytics-ready datasets.
What is the Bronze layer in a medallion architecture?
The Bronze layer stores raw data exactly as it arrives from source systems with minimal processing.
What is the Silver layer in a medallion architecture?
The Silver layer stores cleaned and standardized data that is ready for general analysis.
What is the Gold layer in a medallion architecture?
The Gold layer contains highly refined datasets designed specifically for business reports and dashboards.
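The three layers above can be sketched as a toy pipeline in plain Python. This is only an illustration of the pattern; the field names, cleaning rules, and aggregation are invented, not part of any standard.

```python
# Toy medallion pipeline: Bronze -> Silver -> Gold.
# Field names and cleaning rules are illustrative only.

# Bronze: raw records exactly as they arrived, warts and all.
bronze = [
    {"user": " Alice ", "amount": "10.50"},
    {"user": "bob", "amount": "3.25"},
    {"user": " Alice ", "amount": "not_a_number"},  # bad row kept in Bronze
]

def to_silver(rows):
    """Silver: cleaned, standardized records; invalid rows are dropped."""
    out = []
    for row in rows:
        try:
            out.append({
                "user": row["user"].strip().lower(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            continue  # discard rows that fail validation
    return out

def to_gold(rows):
    """Gold: an analytics-ready aggregate, e.g. total spend per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'alice': 10.5, 'bob': 3.25}
```

Note how each layer keeps a distinct job: Bronze preserves everything, Silver enforces quality, and Gold serves a specific business question.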
What is ETL?
ETL stands for Extract, Transform, Load: data is extracted from sources, transformed into the correct format, and then loaded into a destination database.
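The three ETL steps can be sketched with the Python standard library, using SQLite as a stand-in destination. The CSV data and table schema here are invented for illustration.

```python
import csv, io, sqlite3

# Toy ETL job: extract from a CSV source, transform in Python,
# then load the finished rows into a destination database.

source_csv = "name,signup_date\n Ada ,2024-01-05\nGrace,2024-02-11\n"

# Extract: read raw rows from the source.
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: clean values before they ever reach the destination.
clean = [{"name": r["name"].strip(), "signup_date": r["signup_date"]}
         for r in rows]

# Load: write the transformed rows into the destination table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, signup_date TEXT)")
db.executemany("INSERT INTO users VALUES (:name, :signup_date)", clean)
print(db.execute("SELECT name FROM users ORDER BY name").fetchall())
# [('Ada',), ('Grace',)]
```

The defining trait of ETL is that the transform step runs *before* the load, so only cleaned data ever enters the destination.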
What is ELT?
ELT stands for Extract, Load, Transform: raw data is first loaded into the warehouse, and transformations happen afterward inside the database.
Why has ELT become popular?
Because modern warehouses are powerful enough to run transformations directly, data can be loaded quickly and transformed later, inside the warehouse, as needed.
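For contrast with ETL, here is a toy ELT sketch: raw data lands in a staging table untouched, and SQL inside the database does the cleaning. SQLite stands in for a cloud warehouse; the table names and validation rule are invented.

```python
import sqlite3

# Toy ELT job: load raw data as-is, then transform *inside* the
# database with SQL.

db = sqlite3.connect(":memory:")

# Extract + Load: raw strings land in a staging table untouched.
db.execute("CREATE TABLE raw_orders (amount TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?)",
               [(" 10.50 ",), ("3.25",), ("oops",)])

# Transform: SQL inside the warehouse produces the clean table,
# keeping only values that look like decimal numbers.
db.execute("""
    CREATE TABLE orders AS
    SELECT CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*.[0-9]*'
""")
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 13.75
```

Because the raw staging table survives, the transformation can be rewritten and re-run later without re-extracting from the source, which is one reason ELT pairs well with tools like dbt.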
What is dbt?
dbt is a tool that lets data engineers transform data inside a warehouse using version-controlled SQL models.
Why do teams use dbt?
It organizes SQL transformations like software projects, with testing, documentation, and version control.
What is Airflow?
Airflow is a workflow orchestration tool that schedules and manages data pipelines.
What does orchestration mean in data engineering?
Orchestration means coordinating when and how different pipeline tasks run so data flows in the correct order.
What is a DAG in Airflow?
A DAG, or Directed Acyclic Graph, is a workflow definition that specifies the tasks in a pipeline and the order in which they must run.
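The idea can be sketched without Airflow itself: a DAG is just tasks plus "runs after" edges, and the scheduler executes tasks in a dependency-respecting order. This toy version uses Python's standard-library `graphlib` and invented task names; real Airflow DAGs are also Python files, but use Airflow's own operators and scheduler.

```python
from graphlib import TopologicalSorter

# A tiny pipeline DAG: each task maps to the tasks it depends on.
dag = {
    "extract": [],
    "transform": ["extract"],   # transform runs after extract
    "load": ["transform"],      # load runs after transform
    "notify": ["load"],         # notify runs last
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

The "acyclic" part matters: if `extract` also depended on `notify`, no valid order would exist, and `TopologicalSorter` would raise an error, just as Airflow rejects cyclic DAGs.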
What is Kafka?
Kafka is a distributed streaming platform that allows systems to send and receive real-time streams of data.
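Kafka's core abstraction is an append-only log that producers write to and consumers read from at their own pace, tracked by an offset. The sketch below is a toy in-memory imitation of that idea, not the real Kafka client API, and ignores partitioning, brokers, and persistence.

```python
# Toy imitation of Kafka's append-only log abstraction.
# Not the real Kafka API; class and method names are invented.

class ToyTopic:
    def __init__(self):
        self.log = []  # the append-only log of messages

    def produce(self, message):
        self.log.append(message)

    def consume(self, offset):
        """Return messages from `offset` on, plus the next offset."""
        new = self.log[offset:]
        return new, offset + len(new)

topic = ToyTopic()
topic.produce("order placed")
topic.produce("order shipped")

messages, offset = topic.consume(0)  # a consumer starts at offset 0
print(messages, offset)  # ['order placed', 'order shipped'] 2
```

Because consumers track their own offsets, many independent consumers can read the same stream without interfering with each other, which is a key reason Kafka decouples producing systems from consuming systems.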
What is streaming data?
Streaming data is data processed continuously as it arrives rather than waiting for large batches.
What is batch processing?
Batch processing handles large groups of data at scheduled intervals, such as hourly or daily jobs.
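The contrast between the two modes can be shown on the same events: streaming updates a result per event as it arrives, while batch waits for the whole group and processes it in one job. The event values are invented for illustration.

```python
# Toy contrast between streaming and batch processing.

events = [4, 7, 2, 9]

# Streaming: process each event as it arrives; a result is
# available immediately after every event.
running_total = 0
stream_snapshots = []
for value in events:
    running_total += value
    stream_snapshots.append(running_total)

# Batch: wait for the full group, then process it in one job.
batch_total = sum(events)

print(stream_snapshots)  # [4, 11, 13, 22]
print(batch_total)       # 22
```

Both arrive at the same final answer; the difference is latency (streaming has a fresh result after every event) versus simplicity and throughput (batch does one pass over the whole group).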
What is a data catalog?
A data catalog is a searchable inventory of datasets that helps teams understand what data exists and how it can be used.
What is data lineage?
Data lineage tracks where data originated and how it changed through each step of the pipeline.
Why is data lineage important?
It helps teams understand how reports are built and trace errors back to their source.
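Tracing errors back to their source, as described above, can be sketched as a walk over recorded lineage metadata. The dataset and step names here are invented; real lineage tools capture this automatically.

```python
# Toy lineage tracking: each transformation records where its
# input came from, so any output can be traced back to a source.

lineage = {}  # dataset name -> (source dataset, transformation)

def register(output, source, step):
    lineage[output] = (source, step)

register("silver_orders", "bronze_orders", "clean_and_dedupe")
register("gold_revenue", "silver_orders", "sum_by_month")

def trace(dataset):
    """Walk lineage back from a dataset to its original source."""
    path = [dataset]
    while path[-1] in lineage:
        path.append(lineage[path[-1]][0])
    return path

print(trace("gold_revenue"))
# ['gold_revenue', 'silver_orders', 'bronze_orders']
```

If a number in a `gold_revenue` report looks wrong, the trace shows exactly which upstream datasets and steps to inspect first.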
What is data governance?
Data governance defines rules for how data should be managed, accessed, and protected.