What is a design pattern in data engineering?
A reusable solution to a common data problem, described at a high level so it can be adapted to specific technologies.
What is an anti-pattern in data engineering?
A common but counterproductive approach that seems convenient initially but leads to reliability, performance, or maintainability problems over time.
Why is recognizing patterns and anti-patterns valuable for a senior data engineer?
It helps choose proven approaches quickly, avoid known pitfalls, and reason about trade-offs in complex systems.
What is the ‘raw → staging → curated’ layering pattern?
A pattern where raw data is ingested with minimal changes, staging standardizes and cleans it, and curated layers provide business-ready models.
Why is mixing raw and curated data in the same tables or paths an anti-pattern?
It blurs semantics, increases the risk of using unclean data in production, and complicates lifecycle and governance.
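The layering above can be sketched as a tiny pipeline, with one function per layer. This is a minimal illustration with made-up field names, not a prescribed implementation:

```python
# Hypothetical raw rows as they arrive: stringly typed, inconsistent casing.
RAW = [{"id": "1", "amount": " 10.50 ", "country": "us"},
       {"id": "2", "amount": "3.00", "country": "DE"}]

def to_staging(raw_rows):
    """Staging layer: standardize types and formats, one row per raw record."""
    return [
        {"id": int(r["id"]),
         "amount": float(r["amount"].strip()),
         "country": r["country"].upper()}
        for r in raw_rows
    ]

def to_curated(staging_rows):
    """Curated layer: a business-ready model (revenue per country)."""
    totals = {}
    for r in staging_rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

curated = to_curated(to_staging(RAW))  # {"US": 10.5, "DE": 3.0}
```

Keeping the layers in separate functions (and, in practice, separate tables or paths) makes it obvious which data is safe for production use.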
What is the ‘append-only logs’ pattern?
Storing events as immutable, time-ordered records and deriving state from them, which supports replay, auditability, and recovery.
Why is in-place destructive updating of historical data generally an anti-pattern in analytics?
It destroys history, hinders debugging, and can break time-based analysis and reproducibility.
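A minimal sketch of the append-only pattern: state is never mutated directly, only derived by replaying the immutable log (event names and amounts are illustrative):

```python
# Immutable, time-ordered event log; corrections are new events, not edits.
events = [
    {"ts": 1, "account": "A", "delta": +100},
    {"ts": 2, "account": "A", "delta": -30},
    {"ts": 3, "account": "B", "delta": +50},
]

def replay(log):
    """Fold the log into current balances; replaying always gives the same state."""
    state = {}
    for e in sorted(log, key=lambda e: e["ts"]):
        state[e["account"]] = state.get(e["account"], 0) + e["delta"]
    return state

balances = replay(events)  # {"A": 70, "B": 50}
```

Because the log is the source of truth, recovery after a bad deploy is just replaying from the log instead of restoring mutated tables.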
What is the ‘write-ahead log plus derived views’ pattern?
Storing raw events or transactions and building derived summary tables and materialized views from them for fast queries.
Why is ‘report directly from application OLTP databases’ usually an anti-pattern at scale?
It couples analytics to operational systems, risks performance impact on OLTP workloads, and provides inconsistent or incomplete data.
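The pattern can be sketched as a raw transaction log plus a summary that is always rebuilt from it, so the derived view can be dropped and regenerated at any time (table and field names are hypothetical):

```python
# Raw log of transactions: the durable record, written once.
log = [{"day": "2024-01-01", "amount": 10.0},
       {"day": "2024-01-01", "amount": 5.0},
       {"day": "2024-01-02", "amount": 7.5}]

def rebuild_daily_summary(transactions):
    """Derived view for fast queries; disposable and fully reproducible."""
    summary = {}
    for t in transactions:
        summary[t["day"]] = summary.get(t["day"], 0.0) + t["amount"]
    return summary

daily = rebuild_daily_summary(log)  # {"2024-01-01": 15.0, "2024-01-02": 7.5}
```

Analytics queries hit `daily`, not the operational store, which decouples reporting load from transactional workloads.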
What is the ‘single source of truth’ pattern?
Designating a canonical dataset or table as the authoritative version for a domain, with other copies derived from it.
Why is letting many teams maintain their own uncoordinated copies of the same core data an anti-pattern?
It leads to divergent definitions, conflicting numbers, and complex reconciliation work across the organization.
What is the ‘idempotent job’ pattern?
Writing jobs so that rerunning them with the same inputs produces the same output, avoiding duplicates or corruption.
Why are non-idempotent jobs an anti-pattern in scheduled and retried pipelines?
They can double-count records, leave partial updates, and require manual cleanup after failures.
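One common way to get idempotency is partition overwrite: the job replaces the whole target partition instead of appending to it, so a retry lands on the same result. A sketch with an in-memory stand-in for a warehouse:

```python
# Hypothetical warehouse: partition_date -> list of rows.
warehouse = {}

def load_partition(partition_date, rows):
    """Idempotent load: delete-then-insert the entire partition, never append."""
    warehouse[partition_date] = list(rows)

rows = [{"order_id": 1}, {"order_id": 2}]
load_partition("2024-01-01", rows)
load_partition("2024-01-01", rows)  # scheduler retry: no duplicates
```

An append-based load here would leave four rows after the retry; the overwrite leaves two, with no manual cleanup.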
What is the ‘backfill-friendly’ pattern?
Designing pipelines so they can be safely run over historical ranges without special-case logic or manual intervention.
Why is hard-coding dates or ranges in pipeline code an anti-pattern?
It makes backfills and parameterized runs difficult, forcing code changes for routine maintenance tasks.
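The fix is to make the processing window a parameter, so the daily run and a historical backfill share one code path. A sketch (the per-day work is elided):

```python
from datetime import date, timedelta

def run_pipeline(start: date, end: date):
    """Process each day in [start, end); the window is never hard-coded."""
    processed = []
    d = start
    while d < end:
        processed.append(d.isoformat())  # real per-day work would go here
        d += timedelta(days=1)
    return processed

# A routine daily run and a three-day backfill use the same entry point.
backfill = run_pipeline(date(2024, 1, 1), date(2024, 1, 4))
```

Schedulers then pass the logical date(s) in, and rerunning any historical range requires no code change.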
What is the ‘schema-first’ or contract-first pattern?
Defining schemas and contracts before building pipelines, using them as the source of truth for producers and consumers.
Why is allowing schemas to evolve silently without communication an anti-pattern?
Unannounced changes break consumers, cause subtle bugs, and undermine trust in the data platform.
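A minimal sketch of contract-first validation: the schema is declared up front, and producers check records against it before publishing (field names and types are assumptions for illustration):

```python
# The declared contract: field name -> expected Python type.
SCHEMA = {"user_id": int, "email": str, "signup_ts": str}

def validate(record, schema=SCHEMA):
    """Return a list of contract violations; empty means the record conforms."""
    errors = []
    for field, typ in schema.items():
        if field not in record:
            errors.append(f"missing {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"{field}: expected {typ.__name__}")
    return errors

ok = validate({"user_id": 1, "email": "a@b.c", "signup_ts": "2024-01-01"})
bad = validate({"user_id": "1", "email": "a@b.c"})
```

In production this role is usually played by a schema registry or formats like Avro/Protobuf, but the principle is the same: the contract, not the code, is the source of truth.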
What is the ‘slowly changing dimension (SCD) type 2’ pattern used for?
Capturing historical attribute changes in dimension tables so analyses can reflect what was known at different times.
Why is rewriting historical dimension attributes without tracking changes an anti-pattern in many analytics systems?
It loses history and may produce misleading historical reports that reflect current attributes instead of past reality.
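The SCD type 2 mechanics can be sketched as close-and-insert: the current row is expired and a new current row is added, so both versions survive (column names like `valid_from`/`valid_to` are one common convention):

```python
def apply_change(dim_rows, key, attrs, change_date):
    """Expire the current row for `key`, then insert a new current row."""
    for row in dim_rows:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = change_date  # close out the old version
    dim_rows.append({"key": key, **attrs,
                     "valid_from": change_date, "valid_to": None})

dim = [{"key": "cust-1", "city": "Paris",
        "valid_from": "2023-01-01", "valid_to": None}]
apply_change(dim, "cust-1", {"city": "Lyon"}, "2024-06-01")
# dim now holds both the Paris row (expired) and the Lyon row (current).
```

Queries join on the date range, so a 2023 report still sees Paris while current reports see Lyon.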
What is the ‘late-arriving data handling’ pattern?
Designing pipelines and models to accept and correctly integrate records that arrive after their nominal processing window.
Why is ignoring late-arriving data an anti-pattern for time-based metrics?
It can systematically undercount or misalign metrics, particularly in domains with delayed reporting or eventual consistency.
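One simple form of the pattern: bucket metrics by event time rather than arrival time, so a late record updates the day it belongs to instead of being dropped or misattributed. A sketch:

```python
# Event-time daily counts; keyed by when the event happened, not when it arrived.
daily_counts = {}

def ingest(event):
    """Late arrivals simply update their event-time bucket."""
    day = event["event_ts"][:10]
    daily_counts[day] = daily_counts.get(day, 0) + 1

ingest({"event_ts": "2024-01-01T10:00:00"})
ingest({"event_ts": "2024-01-02T09:00:00"})
ingest({"event_ts": "2024-01-01T23:59:00"})  # arrives on Jan 2, counted on Jan 1
```

Real systems add a reprocessing or watermark policy on top (how long a day stays open for corrections), but keying by event time is the prerequisite.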
What is the ‘feature store’ pattern in ML systems?
Centralizing feature definitions, storage, and serving so multiple models can reuse them consistently across training and inference.
Why is embedding one-off feature code directly into each model script an anti-pattern?
It leads to duplicated logic, inconsistent definitions, and difficult maintenance when definitions or data sources change.
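The centralization idea can be sketched as a tiny feature registry: each feature is defined once, and both training and serving build vectors through the same lookup (the registry, feature name, and input shape are all hypothetical):

```python
# Single registry of feature definitions, shared by training and inference.
FEATURES = {}

def feature(name):
    """Decorator that registers a feature function under a canonical name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("order_total")
def order_total(order):
    return sum(i["qty"] * i["price"] for i in order["items"])

def build_vector(entity, names):
    """Both training and serving call this, so definitions cannot drift."""
    return [FEATURES[n](entity) for n in names]

vec = build_vector({"items": [{"qty": 2, "price": 3.0}]}, ["order_total"])
```

A production feature store adds storage, point-in-time correctness, and low-latency serving, but the core contract is this single shared definition.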
What is the ‘small, composable jobs’ pattern?
Breaking pipelines into smaller steps that each do one thing well and can be tested, monitored, and reused independently.
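A minimal sketch of the composable-jobs idea: each step is a small function with one responsibility, wired together at the end, so every step can be tested and monitored on its own (step names and data are illustrative):

```python
def extract():
    """Step 1: fetch raw rows (stubbed here)."""
    return [{"qty": "2", "price": "5.0"}, {"qty": "1", "price": "3.5"}]

def transform(rows):
    """Step 2: type conversion and derivation, nothing else."""
    return [{"revenue": int(r["qty"]) * float(r["price"])} for r in rows]

def load(rows):
    """Step 3: aggregate and 'write' the result."""
    return sum(r["revenue"] for r in rows)

total = load(transform(extract()))  # each step testable in isolation
```

Orchestrators such as Airflow or Dagster express the same idea at the job level: small tasks with explicit dependencies rather than one monolithic script.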