What is a design pattern in data engineering?
A reusable solution to a common data problem, described at a high level so it can be adapted to specific technologies.
What is an anti-pattern in data engineering?
A common but counterproductive approach that seems convenient initially but leads to reliability, performance, or maintainability problems over time.
Why is recognizing patterns and anti-patterns valuable for a senior data engineer?
It helps choose proven approaches quickly, avoid known pitfalls, and reason about trade-offs in complex systems.
What is the ‘raw → staging → curated’ layering pattern?
A pattern where raw data is ingested with minimal changes, staging standardizes and cleans it, and curated layers provide business-ready models.
Why is mixing raw and curated data in the same tables or paths an anti-pattern?
It blurs semantics, increases the risk of using unclean data in production, and complicates lifecycle and governance.
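The layering above can be sketched as a tiny pipeline, with one function per layer. This is a minimal illustration with made-up field names, not a prescribed implementation:

```python
# Hypothetical raw rows as they arrive: stringly typed, inconsistent casing.
RAW = [{"id": "1", "amount": " 10.50 ", "country": "us"},
       {"id": "2", "amount": "3.00", "country": "DE"}]

def to_staging(raw_rows):
    """Staging layer: standardize types and formats, one row per raw record."""
    return [
        {"id": int(r["id"]),
         "amount": float(r["amount"].strip()),
         "country": r["country"].upper()}
        for r in raw_rows
    ]

def to_curated(staging_rows):
    """Curated layer: a business-ready model (revenue per country)."""
    totals = {}
    for r in staging_rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals

curated = to_curated(to_staging(RAW))  # {"US": 10.5, "DE": 3.0}
```

Keeping the layers in separate functions (and, in practice, separate tables or paths) makes it obvious which data is safe for production use.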
What is the ‘append-only logs’ pattern?
Storing events as immutable, time-ordered records and deriving state from them, which supports replay, auditability, and recovery.
Why is in-place destructive updating of historical data generally an anti-pattern in analytics?
It destroys history, hinders debugging, and can break time-based analysis and reproducibility.
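A minimal sketch of the append-only pattern: state is never mutated directly, only derived by replaying the immutable log (event names and amounts are illustrative):

```python
# Immutable, time-ordered event log; corrections are new events, not edits.
events = [
    {"ts": 1, "account": "A", "delta": +100},
    {"ts": 2, "account": "A", "delta": -30},
    {"ts": 3, "account": "B", "delta": +50},
]

def replay(log):
    """Fold the log into current balances; replaying always gives the same state."""
    state = {}
    for e in sorted(log, key=lambda e: e["ts"]):
        state[e["account"]] = state.get(e["account"], 0) + e["delta"]
    return state

balances = replay(events)  # {"A": 70, "B": 50}
```

Because the log is the source of truth, recovery after a bad deploy is just replaying from the log instead of restoring mutated tables.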
What is the ‘write-ahead log plus derived views’ pattern?
Storing raw events or transactions and building derived summary tables and materialized views from them for fast queries.
Why is ‘report directly from application OLTP databases’ usually an anti-pattern at scale?
It couples analytics to operational systems, risks performance impact on OLTP workloads, and provides inconsistent or incomplete data.
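The pattern can be sketched as a raw transaction log plus a summary that is always rebuilt from it, so the derived view can be dropped and regenerated at any time (table and field names are hypothetical):

```python
# Raw log of transactions: the durable record, written once.
log = [{"day": "2024-01-01", "amount": 10.0},
       {"day": "2024-01-01", "amount": 5.0},
       {"day": "2024-01-02", "amount": 7.5}]

def rebuild_daily_summary(transactions):
    """Derived view for fast queries; disposable and fully reproducible."""
    summary = {}
    for t in transactions:
        summary[t["day"]] = summary.get(t["day"], 0.0) + t["amount"]
    return summary

daily = rebuild_daily_summary(log)  # {"2024-01-01": 15.0, "2024-01-02": 7.5}
```

Analytics queries hit `daily`, not the operational store, which decouples reporting load from transactional workloads.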
What is the ‘single source of truth’ pattern?
Designating a canonical dataset or table as the authoritative version for a domain, with other copies derived from it.
Why is letting many teams maintain their own uncoordinated copies of the same core data an anti-pattern?
It leads to divergent definitions, conflicting numbers, and complex reconciliation work across the organization.
What is the ‘idempotent job’ pattern?
Writing jobs so that rerunning them with the same inputs produces the same output, avoiding duplicates or corruption.
Why are non-idempotent jobs an anti-pattern in scheduled and retried pipelines?
They can double-count records, leave partial updates, and require manual cleanup after failures.
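One common way to get idempotency is partition overwrite: the job replaces the whole target partition instead of appending to it, so a retry lands on the same result. A sketch with an in-memory stand-in for a warehouse:

```python
# Hypothetical warehouse: partition_date -> list of rows.
warehouse = {}

def load_partition(partition_date, rows):
    """Idempotent load: delete-then-insert the entire partition, never append."""
    warehouse[partition_date] = list(rows)

rows = [{"order_id": 1}, {"order_id": 2}]
load_partition("2024-01-01", rows)
load_partition("2024-01-01", rows)  # scheduler retry: no duplicates
```

An append-based load here would leave four rows after the retry; the overwrite leaves two, with no manual cleanup.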
What is the ‘backfill-friendly’ pattern?
Designing pipelines so they can be safely run over historical ranges without special-case logic or manual intervention.
Why is hard-coding dates or ranges in pipeline code an anti-pattern?
It makes backfills and parameterized runs difficult, forcing code changes for routine maintenance tasks.
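The fix is to make the processing window a parameter, so the daily run and a historical backfill share one code path. A sketch (the per-day work is elided):

```python
from datetime import date, timedelta

def run_pipeline(start: date, end: date):
    """Process each day in [start, end); the window is never hard-coded."""
    processed = []
    d = start
    while d < end:
        processed.append(d.isoformat())  # real per-day work would go here
        d += timedelta(days=1)
    return processed

# A routine daily run and a three-day backfill use the same entry point.
backfill = run_pipeline(date(2024, 1, 1), date(2024, 1, 4))
```

Schedulers then pass the logical date(s) in, and rerunning any historical range requires no code change.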
What is the ‘schema-first’ or contract-first pattern?
Defining schemas and contracts before building pipelines, using them as the source of truth for producers and consumers.
Why is allowing schemas to evolve silently without communication an anti-pattern?
Unannounced changes break consumers, cause subtle bugs, and undermine trust in the data platform.
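A minimal sketch of contract-first validation: the schema is declared up front, and producers check records against it before publishing (field names and types are assumptions for illustration):

```python
# The declared contract: field name -> expected Python type.
SCHEMA = {"user_id": int, "email": str, "signup_ts": str}

def validate(record, schema=SCHEMA):
    """Return a list of contract violations; empty means the record conforms."""
    errors = []
    for field, typ in schema.items():
        if field not in record:
            errors.append(f"missing {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"{field}: expected {typ.__name__}")
    return errors

ok = validate({"user_id": 1, "email": "a@b.c", "signup_ts": "2024-01-01"})
bad = validate({"user_id": "1", "email": "a@b.c"})
```

In production this role is usually played by a schema registry or formats like Avro/Protobuf, but the principle is the same: the contract, not the code, is the source of truth.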
What is the ‘slowly changing dimension (SCD) type 2’ pattern used for?
Capturing historical attribute changes in dimension tables so analyses can reflect what was known at different times.
Why is rewriting historical dimension attributes without tracking changes an anti-pattern in many analytics systems?
It loses history and may produce misleading historical reports that reflect current attributes instead of past reality.
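The SCD type 2 mechanics can be sketched as close-and-insert: the current row is expired and a new current row is added, so both versions survive (column names like `valid_from`/`valid_to` are one common convention):

```python
def apply_change(dim_rows, key, attrs, change_date):
    """Expire the current row for `key`, then insert a new current row."""
    for row in dim_rows:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = change_date  # close out the old version
    dim_rows.append({"key": key, **attrs,
                     "valid_from": change_date, "valid_to": None})

dim = [{"key": "cust-1", "city": "Paris",
        "valid_from": "2023-01-01", "valid_to": None}]
apply_change(dim, "cust-1", {"city": "Lyon"}, "2024-06-01")
# dim now holds both the Paris row (expired) and the Lyon row (current).
```

Queries join on the date range, so a 2023 report still sees Paris while current reports see Lyon.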
What is the ‘late-arriving data handling’ pattern?
Designing pipelines and models to accept and correctly integrate records that arrive after their nominal processing window.
Why is ignoring late-arriving data an anti-pattern for time-based metrics?
It can systematically undercount or misalign metrics, particularly in domains with delayed reporting or eventual consistency.
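One simple form of the pattern: bucket metrics by event time rather than arrival time, so a late record updates the day it belongs to instead of being dropped or misattributed. A sketch:

```python
# Event-time daily counts; keyed by when the event happened, not when it arrived.
daily_counts = {}

def ingest(event):
    """Late arrivals simply update their event-time bucket."""
    day = event["event_ts"][:10]
    daily_counts[day] = daily_counts.get(day, 0) + 1

ingest({"event_ts": "2024-01-01T10:00:00"})
ingest({"event_ts": "2024-01-02T09:00:00"})
ingest({"event_ts": "2024-01-01T23:59:00"})  # arrives on Jan 2, counted on Jan 1
```

Real systems add a reprocessing or watermark policy on top (how long a day stays open for corrections), but keying by event time is the prerequisite.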
What is the ‘feature store’ pattern in ML systems?
Centralizing feature definitions, storage, and serving so multiple models can reuse them consistently across training and inference.
Why is embedding one-off feature code directly into each model script an anti-pattern?
It leads to duplicated logic, inconsistent definitions, and difficult maintenance when definitions or data sources change.
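The centralization idea can be sketched as a tiny feature registry: each feature is defined once, and both training and serving build vectors through the same lookup (the registry, feature name, and input shape are all hypothetical):

```python
# Single registry of feature definitions, shared by training and inference.
FEATURES = {}

def feature(name):
    """Decorator that registers a feature function under a canonical name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("order_total")
def order_total(order):
    return sum(i["qty"] * i["price"] for i in order["items"])

def build_vector(entity, names):
    """Both training and serving call this, so definitions cannot drift."""
    return [FEATURES[n](entity) for n in names]

vec = build_vector({"items": [{"qty": 2, "price": 3.0}]}, ["order_total"])
```

A production feature store adds storage, point-in-time correctness, and low-latency serving, but the core contract is this single shared definition.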
What is the ‘small, composable jobs’ pattern?
Breaking pipelines into smaller steps that each do one thing well and can be tested, monitored, and reused independently.
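A minimal sketch of the composable-jobs idea: each step is a small function with one responsibility, wired together at the end, so every step can be tested and monitored on its own (step names and data are illustrative):

```python
def extract():
    """Step 1: fetch raw rows (stubbed here)."""
    return [{"qty": "2", "price": "5.0"}, {"qty": "1", "price": "3.5"}]

def transform(rows):
    """Step 2: type conversion and derivation, nothing else."""
    return [{"revenue": int(r["qty"]) * float(r["price"])} for r in rows]

def load(rows):
    """Step 3: aggregate and 'write' the result."""
    return sum(r["revenue"] for r in rows)

total = load(transform(extract()))  # each step testable in isolation
```

Orchestrators such as Airflow or Dagster express the same idea at the job level: small tasks with explicit dependencies rather than one monolithic script.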