What is a data pipeline?
A data pipeline is a system that automatically moves data from one place to another and transforms it along the way so it becomes useful for analysis or applications.
What is a data lake?
A data lake is a storage system that keeps large amounts of raw data in its original format until it is needed for analysis.
What is a data warehouse?
A data warehouse is a database optimized for analyzing structured data and running large analytical queries quickly.
What is the difference between a data lake and a data warehouse?
A data lake stores raw, flexible data, while a data warehouse stores cleaned, structured data optimized for analysis.
What is a lakehouse architecture?
A lakehouse combines the flexibility of a data lake with the structure and performance of a data warehouse.
What is the medallion architecture?
It is a data organization pattern with layers called Bronze, Silver, and Gold, where raw data is gradually cleaned and refined into analytics-ready datasets.
What is the Bronze layer in a medallion architecture?
The Bronze layer stores raw data exactly as it arrives from source systems with minimal processing.
What is the Silver layer in a medallion architecture?
The Silver layer stores cleaned and standardized data that is ready for general analysis.
What is the Gold layer in a medallion architecture?
The Gold layer contains highly refined datasets designed specifically for business reports and dashboards.
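The three layers above can be sketched as a toy pipeline in plain Python. This is only an illustration of the pattern; the field names, cleaning rules, and aggregation are invented, not part of any standard.

```python
# Toy medallion pipeline: Bronze -> Silver -> Gold.
# Field names and cleaning rules are illustrative only.

# Bronze: raw records exactly as they arrived, warts and all.
bronze = [
    {"user": " Alice ", "amount": "10.50"},
    {"user": "bob", "amount": "3.25"},
    {"user": " Alice ", "amount": "not_a_number"},  # bad row kept in Bronze
]

def to_silver(rows):
    """Silver: cleaned, standardized records; invalid rows are dropped."""
    out = []
    for row in rows:
        try:
            out.append({
                "user": row["user"].strip().lower(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            continue  # discard rows that fail validation
    return out

def to_gold(rows):
    """Gold: an analytics-ready aggregate, e.g. total spend per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'alice': 10.5, 'bob': 3.25}
```

Note how each layer keeps a distinct job: Bronze preserves everything, Silver enforces quality, and Gold serves a specific business question.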
What is ETL?
ETL stands for Extract, Transform, Load: data is extracted from sources, transformed into the correct format, and then loaded into a destination database.
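The three ETL steps can be sketched with the Python standard library, using SQLite as a stand-in destination. The CSV data and table schema here are invented for illustration.

```python
import csv, io, sqlite3

# Toy ETL job: extract from a CSV source, transform in Python,
# then load the finished rows into a destination database.

source_csv = "name,signup_date\n Ada ,2024-01-05\nGrace,2024-02-11\n"

# Extract: read raw rows from the source.
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: clean values before they ever reach the destination.
clean = [{"name": r["name"].strip(), "signup_date": r["signup_date"]}
         for r in rows]

# Load: write the transformed rows into the destination table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, signup_date TEXT)")
db.executemany("INSERT INTO users VALUES (:name, :signup_date)", clean)
print(db.execute("SELECT name FROM users ORDER BY name").fetchall())
# [('Ada',), ('Grace',)]
```

The defining trait of ETL is that the transform step runs *before* the load, so only cleaned data ever enters the destination.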
What is ELT?
ELT stands for Extract, Load, Transform: raw data is first loaded into the warehouse, and transformations happen afterward inside the database.
Why has ELT become popular?
Because modern warehouses are powerful enough to run transformations directly, data can be loaded quickly and transformed later, inside the warehouse, as needed.
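For contrast with ETL, here is a toy ELT sketch: raw data lands in a staging table untouched, and SQL inside the database does the cleaning. SQLite stands in for a cloud warehouse; the table names and validation rule are invented.

```python
import sqlite3

# Toy ELT job: load raw data as-is, then transform *inside* the
# database with SQL.

db = sqlite3.connect(":memory:")

# Extract + Load: raw strings land in a staging table untouched.
db.execute("CREATE TABLE raw_orders (amount TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?)",
               [(" 10.50 ",), ("3.25",), ("oops",)])

# Transform: SQL inside the warehouse produces the clean table,
# keeping only values that look like decimal numbers.
db.execute("""
    CREATE TABLE orders AS
    SELECT CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*.[0-9]*'
""")
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 13.75
```

Because the raw staging table survives, the transformation can be rewritten and re-run later without re-extracting from the source, which is one reason ELT pairs well with tools like dbt.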
What is dbt?
dbt is a tool that lets data engineers transform data inside a warehouse using version-controlled SQL models.
Why do teams use dbt?
It organizes SQL transformations like software projects, with testing, documentation, and version control.
What is Airflow?
Airflow is a workflow orchestration tool that schedules and manages data pipelines.
What does orchestration mean in data engineering?
Orchestration means coordinating when and how different pipeline tasks run so data flows in the correct order.
What is a DAG in Airflow?
A DAG, or Directed Acyclic Graph, is a workflow definition that specifies the tasks in a pipeline and the order in which they must run.
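The idea can be sketched without Airflow itself: a DAG is just tasks plus "runs after" edges, and the scheduler executes tasks in a dependency-respecting order. This toy version uses Python's standard-library `graphlib` and invented task names; real Airflow DAGs are also Python files, but use Airflow's own operators and scheduler.

```python
from graphlib import TopologicalSorter

# A tiny pipeline DAG: each task maps to the tasks it depends on.
dag = {
    "extract": [],
    "transform": ["extract"],   # transform runs after extract
    "load": ["transform"],      # load runs after transform
    "notify": ["load"],         # notify runs last
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

The "acyclic" part matters: if `extract` also depended on `notify`, no valid order would exist, and `TopologicalSorter` would raise an error, just as Airflow rejects cyclic DAGs.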
What is Kafka?
Kafka is a distributed streaming platform that allows systems to send and receive real-time streams of data.
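Kafka's core abstraction is an append-only log that producers write to and consumers read from at their own pace, tracked by an offset. The sketch below is a toy in-memory imitation of that idea, not the real Kafka client API, and ignores partitioning, brokers, and persistence.

```python
# Toy imitation of Kafka's append-only log abstraction.
# Not the real Kafka API; class and method names are invented.

class ToyTopic:
    def __init__(self):
        self.log = []  # the append-only log of messages

    def produce(self, message):
        self.log.append(message)

    def consume(self, offset):
        """Return messages from `offset` on, plus the next offset."""
        new = self.log[offset:]
        return new, offset + len(new)

topic = ToyTopic()
topic.produce("order placed")
topic.produce("order shipped")

messages, offset = topic.consume(0)  # a consumer starts at offset 0
print(messages, offset)  # ['order placed', 'order shipped'] 2
```

Because consumers track their own offsets, many independent consumers can read the same stream without interfering with each other, which is a key reason Kafka decouples producing systems from consuming systems.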
What is streaming data?
Streaming data is data processed continuously as it arrives rather than waiting for large batches.
What is batch processing?
Batch processing handles large groups of data at scheduled intervals, such as hourly or daily jobs.
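The contrast between the two modes can be shown on the same events: streaming updates a result per event as it arrives, while batch waits for the whole group and processes it in one job. The event values are invented for illustration.

```python
# Toy contrast between streaming and batch processing.

events = [4, 7, 2, 9]

# Streaming: process each event as it arrives; a result is
# available immediately after every event.
running_total = 0
stream_snapshots = []
for value in events:
    running_total += value
    stream_snapshots.append(running_total)

# Batch: wait for the full group, then process it in one job.
batch_total = sum(events)

print(stream_snapshots)  # [4, 11, 13, 22]
print(batch_total)       # 22
```

Both arrive at the same final answer; the difference is latency (streaming has a fresh result after every event) versus simplicity and throughput (batch does one pass over the whole group).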
What is a data catalog?
A data catalog is a searchable inventory of datasets that helps teams understand what data exists and how it can be used.
What is data lineage?
Data lineage tracks where data originated and how it changed through each step of the pipeline.
Why is data lineage important?
It helps teams understand how reports are built and trace errors back to their source.
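Tracing errors back to their source, as described above, can be sketched as a walk over recorded lineage metadata. The dataset and step names here are invented; real lineage tools capture this automatically.

```python
# Toy lineage tracking: each transformation records where its
# input came from, so any output can be traced back to a source.

lineage = {}  # dataset name -> (source dataset, transformation)

def register(output, source, step):
    lineage[output] = (source, step)

register("silver_orders", "bronze_orders", "clean_and_dedupe")
register("gold_revenue", "silver_orders", "sum_by_month")

def trace(dataset):
    """Walk lineage back from a dataset to its original source."""
    path = [dataset]
    while path[-1] in lineage:
        path.append(lineage[path[-1]][0])
    return path

print(trace("gold_revenue"))
# ['gold_revenue', 'silver_orders', 'bronze_orders']
```

If a number in a `gold_revenue` report looks wrong, the trace shows exactly which upstream datasets and steps to inspect first.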
What is data governance?
Data governance defines rules for how data should be managed, accessed, and protected.