What is the core goal of data engineering?
To design, build, and operate reliable systems that move, transform, and organize data so it can be queried and used by downstream consumers such as analytics, machine learning, and applications.
How does data engineering differ from data science?
Data engineering focuses on infrastructure, pipelines, and reliability, while data science focuses on analysis, modeling, and decision-making using the data produced by those systems.
How does data engineering differ from software engineering?
Software engineering centers on application logic and user-facing features, whereas data engineering centers on data flows, storage, schemas, and batch or stream processing, though both share coding and reliability practices.
What are the three main planes in a modern data platform?
Ingestion, storage and modeling, and serving or consumption.
What is a data pipeline at a high level?
A defined sequence of steps that takes data from one or more sources, transforms it, and writes it to one or more targets on a schedule or in real time.
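The extract-transform-load sequence above can be sketched in plain Python; the source records, cleaning rules, and in-memory target below are all hypothetical stand-ins for real systems:

```python
# Minimal batch pipeline sketch: extract -> transform -> load.
# Source, transform rules, and target are hypothetical in-memory objects.

def extract(source):
    """Read raw records from a source (here, an in-memory list)."""
    return list(source)

def transform(records):
    """Clean and standardize: drop incomplete rows, normalize fields."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in records
        if r.get("user") and r.get("amount") is not None
    ]

def load(records, target):
    """Append transformed records to a target store; return row count."""
    target.extend(records)
    return len(records)

raw = [
    {"user": "  Alice ", "amount": "9.99"},
    {"user": None, "amount": "1.00"},   # dropped: missing user
]
warehouse = []
loaded = load(transform(extract(raw)), warehouse)
```

A real pipeline would read from external sources and run on a scheduler or stream processor, but the shape of the steps is the same.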
What is the difference between batch and streaming processing?
Batch processing works on finite chunks of data at intervals, whereas streaming processing handles data continuously as events arrive with low latency.
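A toy Python sketch of the contrast: the batch function consumes a complete, finite input in one pass, while the streaming function yields an updated result per event as it arrives (both are illustrative, not a real engine):

```python
def batch_total(records):
    # Batch: the whole finite chunk is available; process it at once.
    return sum(records)

def streaming_totals(events):
    # Streaming: events arrive one at a time; emit a running result
    # per event with low latency instead of waiting for the end.
    total = 0
    for event in events:
        total += event
        yield total

batch_result = batch_total([1, 2, 3])
stream_results = list(streaming_totals(iter([1, 2, 3])))
```

The batch job produces one answer (6); the streaming job produces an answer after every event (1, 3, 6).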
What is a data lake in conceptual terms?
A central store for raw and lightly processed data, usually built on inexpensive object storage, that supports many formats and use cases.
What is a data warehouse in conceptual terms?
A structured store optimized for analytical SQL over curated, modeled data with well-defined schemas.
What is a lakehouse architecture trying to achieve?
Combine the flexibility of a data lake with the reliability, performance, and governance of a data warehouse on a single storage layer.
Why are file formats a core concern for data engineers?
They determine how efficiently data can be stored, read, partitioned, and processed by engines; poor choices increase cost and latency.
What is the key difference between row-oriented and column-oriented storage?
Row storage stores all columns of a row together, while column storage stores each column separately across many rows.
Why are columnar formats preferred for analytics workloads?
They allow column pruning, better compression, and efficient vectorized reads, which reduce I/O and speed up analytical queries.
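A pure-Python sketch of the two layouts and of why column pruning helps; the tiny tables are made up for illustration:

```python
# Row layout: all columns of a record stored together.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 20.0},
]

# Column layout: each column stored contiguously across all rows.
columns = {
    "id": [1, 2],
    "country": ["DE", "FR"],
    "amount": [10.0, 20.0],
}

# Column pruning: an analytical query touching only `amount`
# reads a single contiguous array in the columnar layout...
total_columnar = sum(columns["amount"])

# ...but must scan every full record in the row layout.
total_row = sum(r["amount"] for r in rows)
```

At scale the columnar read also compresses better, because values of one type and one column sit next to each other.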
What is partitioning in the context of large datasets?
Splitting data into directory or logical segments based on one or more keys, such as date or region, so queries can skip irrelevant partitions.
Why does partition selection matter for performance?
Good partition keys align with common query filters and yield balanced partitions, minimizing scanned data and skew.
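A toy sketch of partition pruning, with an in-memory dict keyed by date standing in for directory-style partitions:

```python
# Partitioned layout: records grouped by a partition key (here, date).
partitions = {
    "2024-01-01": [{"region": "eu", "sales": 5}],
    "2024-01-02": [{"region": "us", "sales": 7}],
    "2024-01-03": [{"region": "eu", "sales": 3}],
}

def query(partitions, date_filter):
    # Partition pruning: only partitions whose key passes the filter
    # are scanned; the rest are skipped without reading any rows.
    scanned = [d for d in partitions if date_filter(d)]
    result = [row for d in scanned for row in partitions[d]]
    return scanned, result

scanned, result = query(partitions, lambda d: d >= "2024-01-02")
```

Because the filter matches the partition key, one of the three partitions is never touched; a filter on `region` would force a full scan here.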
What is the small files problem in data lakes?
Having many tiny files increases metadata overhead and scheduling cost, making queries and jobs slower and less efficient.
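A sketch of compaction, the usual remedy, with in-memory lists standing in for files (real compaction jobs rewrite object-store files; the target size here is arbitrary):

```python
# Ten "files" of one record each: per-file overhead (listing, opening,
# scheduling a task per file) dominates the actual data read.
small_files = [[i] for i in range(10)]

def compact(files, target_size=5):
    """Merge records into fewer, larger files of ~target_size records."""
    all_records = [record for f in files for record in f]
    return [
        all_records[i:i + target_size]
        for i in range(0, len(all_records), target_size)
    ]

compacted = compact(small_files)
```

The same ten records now live in two files, so metadata and scheduling overhead shrink while no data is lost.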
What is schema-on-read?
Storing raw or loosely structured data and applying a schema at query time when it is read by an engine.
What is schema-on-write?
Enforcing structure and types at the time data is ingested so stored data conforms to a defined schema.
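A minimal sketch of the two approaches, assuming a hypothetical two-field schema; in real systems a query engine or ingestion job plays these roles:

```python
import json

SCHEMA = {"user": str, "amount": float}

def read_with_schema(raw_line):
    # Schema-on-read: raw JSON was stored as-is; types are applied
    # only now, at query time.
    record = json.loads(raw_line)
    return {key: cast(record[key]) for key, cast in SCHEMA.items()}

def write_with_schema(record):
    # Schema-on-write: casting and validation happen at ingestion,
    # so only conforming records ever reach storage.
    try:
        return {key: cast(record[key]) for key, cast in SCHEMA.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record violates schema: {record!r}") from exc

parsed = read_with_schema('{"user": "bob", "amount": "3.5"}')
```

The same casting logic runs in both cases; what differs is when it runs and whether non-conforming data can land in storage at all.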
Why is schema-on-read attractive for raw ingestion?
It allows fast onboarding of heterogeneous sources without blocking on full modeling, which is useful for exploration and rapid prototyping.
Why is schema discipline still necessary even with schema-on-read?
Without agreed schemas for curated layers, queries against the data become brittle and data quality is hard to guarantee.
What are the typical layers in a well-structured data platform?
Raw or landing, cleaned or staging, and curated layers such as marts or feature stores, sometimes with additional sandbox layers.
Why separate raw, staging, and curated data?
To preserve original data, isolate cleaning and standardization logic, and keep curated layers stable and business-friendly.
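A toy sketch of the three layers with hypothetical field names: raw is preserved verbatim, staging applies cleaning, and curated exposes a stable, business-friendly shape:

```python
# Raw/landing layer: data exactly as received, never mutated.
raw = [{"Email": " A@X.COM ", "signup": "2024-01-01"}]

# Staging layer: cleaning and standardization logic lives here.
staging = [
    {"email": r["Email"].strip().lower(), "signup_date": r["signup"]}
    for r in raw
]

# Curated layer: stable, business-friendly columns for consumers.
curated_users = [
    {"email": r["email"], "signup_date": r["signup_date"]}
    for r in staging
]
```

If a cleaning rule changes, staging and curated can be rebuilt from raw, which is why the original data is kept untouched.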
What is data modeling in the context of data engineering?
Designing how data is structured into tables and schemas, including relationships and grain, to support efficient and understandable querying.
What is dimensional modeling?
A modeling approach using fact tables for events or measures and dimension tables for entities and context, optimized for analytical queries.
What is the role of a fact table?
To store measurements or events with foreign keys to dimensions and numeric metrics that can be aggregated.
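A toy sketch of the fact/dimension pattern, with made-up product and sales tables, showing the typical join-then-aggregate query:

```python
# Dimension table: entities and descriptive context, keyed by an id.
dim_product = {
    1: {"name": "widget", "category": "tools"},
    2: {"name": "gadget", "category": "toys"},
}

# Fact table: one row per sale event, with a foreign key into the
# product dimension and numeric measures that can be aggregated.
fact_sales = [
    {"product_id": 1, "quantity": 2, "revenue": 20.0},
    {"product_id": 2, "quantity": 1, "revenue": 15.0},
    {"product_id": 1, "quantity": 3, "revenue": 30.0},
]

# Analytical query: join facts to the dimension, aggregate by category.
revenue_by_category = {}
for f in fact_sales:
    cat = dim_product[f["product_id"]]["category"]
    revenue_by_category[cat] = revenue_by_category.get(cat, 0.0) + f["revenue"]
```

In a warehouse this is a SQL join and GROUP BY; the point is that measures live in the fact table and descriptive attributes live in the dimensions.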