Data Engineering Fundamentals Flashcards

(44 cards)

1
Q

What is the core goal of data engineering?

A

To design, build, and operate reliable systems that move, transform, and organize data so it can be queried and used by others, such as analytics, machine learning, and applications.

2
Q

How does data engineering differ from data science?

A

Data engineering focuses on infrastructure, pipelines, and reliability, while data science focuses on analysis, modeling, and decision-making using the data produced by those systems.

3
Q

How does data engineering differ from software engineering?

A

Software engineering centers on application logic and user-facing features, whereas data engineering centers on data flows, storage, schemas, and batch or stream processing, though both share coding and reliability practices.

4
Q

What are the three main planes in a modern data platform?

A

Ingestion, storage and modeling, and serving or consumption.

5
Q

What is a data pipeline at a high level?

A

A defined sequence of steps that takes data from one or more sources, transforms it, and writes it to one or more targets on a schedule or in real time.
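
A minimal sketch of this in Python; the source, transform, and target here are hypothetical in-memory stand-ins, not a specific tool's API:

```python
# A toy pipeline: extract from a source, transform, load to a target.
# In practice each step would hit real systems (APIs, object storage, a warehouse).

def extract() -> list[dict]:
    # Stand-in for reading raw records from a source system.
    return [{"user": "a", "amount": "10"}, {"user": "b", "amount": "5"}]

def transform(rows: list[dict]) -> list[dict]:
    # Cast string fields to proper types.
    return [{"user": r["user"], "amount": int(r["amount"])} for r in rows]

def load(rows: list[dict], target: list) -> None:
    # Stand-in for writing to a warehouse table.
    target.extend(rows)

warehouse_table: list[dict] = []
load(transform(extract()), warehouse_table)
print(warehouse_table)  # two typed rows
```

An orchestrator would run this sequence on a schedule; a streaming pipeline would run the same logic per event instead of per batch.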

6
Q

What is the difference between batch and streaming processing?

A

Batch processing works on finite chunks of data at intervals, whereas streaming processing handles data continuously as events arrive with low latency.

7
Q

What is a data lake in conceptual terms?

A

A central store for raw and lightly processed data, usually built on inexpensive object storage and supporting many formats and use cases.

8
Q

What is a data warehouse in conceptual terms?

A

A structured store optimized for analytical SQL over curated, modeled data with well-defined schemas.

9
Q

What is a lakehouse architecture trying to achieve?

A

Combine the flexibility of a data lake with the reliability, performance, and governance of a data warehouse on a single storage layer.

10
Q

Why are file formats a core concern for data engineers?

A

They determine how efficiently data can be stored, read, partitioned, and processed by engines; poor choices increase cost and latency.

11
Q

What is the key difference between row-oriented and column-oriented storage?

A

Row storage stores all columns of a row together, while column storage stores each column separately across many rows.
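
A small Python illustration of the two layouts; the table itself is a made-up example:

```python
# The same three-row table in both layouts.

# Row-oriented: all columns of a row sit together (good for whole-row reads/writes).
rows = [
    {"id": 1, "city": "Oslo", "amount": 10},
    {"id": 2, "city": "Lima", "amount": 20},
    {"id": 3, "city": "Oslo", "amount": 30},
]

# Column-oriented: each column is stored contiguously across rows.
columns = {
    "id": [1, 2, 3],
    "city": ["Oslo", "Lima", "Oslo"],
    "amount": [10, 20, 30],
}

# An analytic query like SUM(amount) touches only one column in columnar form:
total = sum(columns["amount"])
print(total)  # 60
```

Reading one full record favors the row layout; scanning one field across millions of records favors the column layout.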

12
Q

Why are columnar formats preferred for analytics workloads?

A

They allow column pruning, better compression, and efficient vectorized reads, which reduce I/O and speed up analytical queries.

13
Q

What is partitioning in the context of large datasets?

A

Splitting data into directory or logical segments based on one or more keys, such as date or region, so queries can skip irrelevant partitions.
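
A sketch of how this looks with Hive-style partition paths; the file names are hypothetical:

```python
# Hive-style partitioning encodes the partition key in the directory name,
# so an engine can skip whole directories that fail a filter ("partition pruning").
files = [
    "sales/dt=2024-01-01/part-0.parquet",
    "sales/dt=2024-01-01/part-1.parquet",
    "sales/dt=2024-01-02/part-0.parquet",
]

def prune(paths: list[str], dt: str) -> list[str]:
    # Keep only files whose partition directory matches the filter.
    return [p for p in paths if f"/dt={dt}/" in p]

print(prune(files, "2024-01-02"))  # only the file under dt=2024-01-02
```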

14
Q

Why does partition selection matter for performance?

A

Good partition keys align with common query filters and yield balanced partitions, minimizing scanned data and skew.

15
Q

What is the small files problem in data lakes?

A

Having many tiny files increases metadata overhead and scheduling cost, making queries and jobs slower and less efficient.
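
The usual remedy is compaction: periodically rewriting many small files into fewer larger ones. A sketch, modeling files as (name, size) pairs rather than real object-storage operations:

```python
# Compaction sketch: group small files into batches near a target size;
# each batch would be rewritten as one larger file.
small_files = [(f"part-{i}.parquet", 1_000_000) for i in range(10)]  # 10 x ~1 MB

def compact(files, target_bytes=4_000_000):
    batches, batch, size = [], [], 0
    for name, sz in files:
        if batch and size + sz > target_bytes:
            batches.append(batch)
            batch, size = [], 0
        batch.append(name)
        size += sz
    if batch:
        batches.append(batch)
    return batches  # each batch becomes one rewritten, larger file

print(len(compact(small_files)))  # 3 output files instead of 10
```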

16
Q

What is schema-on-read?

A

Storing raw or loosely structured data and applying a schema at query time when it is read by an engine.
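
A sketch of the idea in Python; the schema and records are made-up examples:

```python
import json

# Schema-on-read: raw records are stored as-is; the READER applies types at query time.
raw = ['{"id": "1", "amount": "9.5"}', '{"id": "2", "amount": "3"}']

SCHEMA = {"id": int, "amount": float}  # the schema the reader chooses to apply

def read_with_schema(lines: list[str]) -> list[dict]:
    # Parse raw text and cast each field per the reader's schema.
    return [
        {field: cast(json.loads(line)[field]) for field, cast in SCHEMA.items()}
        for line in lines
    ]

print(read_with_schema(raw))  # typed records, built only at read time
```

Under schema-on-write, the casting would instead happen once at ingestion, and the stored records would already be typed.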

17
Q

What is schema-on-write?

A

Enforcing structure and types at the time data is ingested so stored data conforms to a defined schema.

18
Q

Why is schema-on-read attractive for raw ingestion?

A

It allows fast onboarding of heterogeneous sources without blocking on full modeling, which is useful for exploration and rapid prototyping.

19
Q

Why is schema discipline still necessary even with schema-on-read?

A

Without agreed schemas for curated layers, queries become brittle and break easily, and data quality is hard to guarantee.

20
Q

What are the typical layers in a well-structured data platform?

A

Raw or landing, cleaned or staging, and curated layers such as marts or feature stores, sometimes with additional sandbox layers.

21
Q

Why separate raw, staging, and curated data?

A

To preserve original data, isolate cleaning and standardization logic, and keep curated layers stable and business-friendly.

22
Q

What is data modeling in the context of data engineering?

A

Designing how data is structured into tables and schemas, including relationships and grain, to support efficient and understandable querying.

23
Q

What is dimensional modeling?

A

A modeling approach using fact tables for events or measures and dimension tables for entities and context, optimized for analytical queries.

24
Q

What is the role of a fact table?

A

To store measurements or events with foreign keys to dimensions and numeric metrics that can be aggregated.
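
A toy star-schema aggregation in Python, showing how facts join to a dimension; both tables are made-up examples:

```python
# Toy star schema: a fact table of sales events plus a product dimension.
dim_product = {
    1: {"name": "widget", "category": "hardware"},
    2: {"name": "manual", "category": "docs"},
}
fact_sales = [  # foreign key + additive measure
    {"product_id": 1, "amount": 10},
    {"product_id": 2, "amount": 4},
    {"product_id": 1, "amount": 6},
]

# "Slice" facts by a dimension attribute: total amount per category.
totals: dict[str, int] = {}
for row in fact_sales:
    category = dim_product[row["product_id"]]["category"]
    totals[category] = totals.get(category, 0) + row["amount"]

print(totals)  # {'hardware': 16, 'docs': 4}
```

In a warehouse this is the classic fact-to-dimension join with GROUP BY on a dimension attribute.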

25
Q

What is the role of a dimension table?

A

To store descriptive attributes for entities such as customers, products, or dates, enabling slicing, filtering, and labeling of facts.

26
Q

What is denormalization and why is it common in analytics?

A

Denormalization stores redundant data to reduce joins, trading some storage and update complexity for simpler, faster analytic queries.

27
Q

What is orchestration in data engineering?

A

Coordinating the execution order, scheduling, dependencies, and error handling for multiple pipeline steps and jobs.

28
Q

Why is orchestration critical beyond just having individual scripts?

A

It ensures data arrives in the correct order, handles retries and failures systematically, and provides visibility into end-to-end workflows.

29
Q

What is idempotency for data pipelines?

A

The property that running a job multiple times with the same inputs produces the same final result without duplicates or inconsistencies.
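
A sketch of the contrast between a blind append and an idempotent keyed write; the target table here is a hypothetical in-memory stand-in:

```python
# Idempotent load sketch: write by key (upsert) rather than blind append,
# so rerunning the same batch leaves the target unchanged.
target: dict[str, dict] = {}  # stand-in for a keyed table

def load_idempotent(batch: list[dict]) -> None:
    for row in batch:
        target[row["id"]] = row  # same key -> same final row, no duplicates

batch = [{"id": "a", "v": 1}, {"id": "b", "v": 2}]
load_idempotent(batch)
load_idempotent(batch)  # simulated retry after a failure
print(len(target))  # still 2 rows, not 4
```

An append-based load (`target.append(...)` into a list) would have produced 4 rows after the retry.
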
30
Q

Why must ingestion and transformation jobs be idempotent?

A

Because failures and retries are inevitable and non-idempotent jobs can corrupt data or double-count events on reruns.

31
Q

What is data quality in a data engineering context?

A

The extent to which data is accurate, complete, timely, consistent, and valid for its intended use.

32
Q

Why should data quality checks be part of pipelines rather than manual ad hoc checks?

A

Automated checks catch issues early, prevent bad data from propagating, and make quality guarantees repeatable and observable.
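
A sketch of simple in-pipeline checks; the field names and rules are hypothetical examples:

```python
# In-pipeline data quality checks: fail fast before bad data propagates downstream.
def check_batch(rows: list[dict]) -> list[str]:
    failures = []
    if not rows:
        failures.append("empty batch")
    amounts = [r.get("amount") for r in rows]
    if any(a is None for a in amounts):
        failures.append("null amount")
    if any(a is not None and a < 0 for a in amounts):
        failures.append("negative amount")
    return failures  # a real pipeline would raise or quarantine on failures

good = [{"amount": 5}, {"amount": 0}]
bad = [{"amount": -1}, {"amount": None}]
print(check_batch(good))  # []
print(check_batch(bad))   # ['null amount', 'negative amount']
```
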
33
Q

What is observability for data systems?

A

The ability to understand system behavior via logs, metrics, traces, and data-level checks such as freshness, row counts, and distributions.

34
Q

What is lineage in data platforms?

A

Traceable relationships showing how data moves and transforms from sources through intermediate steps to final outputs.

35
Q

Why is lineage important?

A

It helps debug issues, assess the impact of changes, and satisfy audit and compliance requirements.

36
Q

What is a data contract as a mental model?

A

A clear, versioned agreement about schema, semantics, and SLAs between data producers and consumers.

37
Q

Why do data contracts matter in larger organizations?

A

They reduce accidental breaking changes, clarify responsibilities, and improve trust between teams that share data.

38
Q

What is the main role of a data engineer in machine learning systems?

A

To deliver reliable feature and label pipelines with histories and monitoring so models can be trained and served correctly.

39
Q

What is the difference between offline and online data in ML?

A

Offline data is used for training and batch scoring, while online data is served in real time for low-latency predictions.

40
Q

Why is train–serve skew a data engineering concern?

A

Differences between offline and online feature pipelines often stem from data engineering choices, and skew can cause models to underperform in production.

41
Q

What are the main reliability risks in data engineering systems?

A

Late or missing data, schema changes, bad data quality, pipeline failures, and unanticipated changes in scale or usage.

42
Q

What are the main cost drivers in a modern data platform?

A

Storage volume and retention, compute used for batch and stream jobs and queries, and data transfer between systems or clouds.

43
Q

Why should a data engineer care about cost and not just performance?

A

Because data platforms can become very expensive, and good engineering balances reliability, performance, and cost constraints.

44
Q

What is a good one-sentence mental model for data engineering?

A

Data engineering is the discipline of turning messy, evolving source data into trustworthy, well-modeled datasets and APIs that can be safely consumed at scale.