What is data quality in the context of data engineering?
The degree to which data is accurate, complete, timely, consistent, and valid for its intended use.
Why must data quality be defined relative to use cases?
Different consumers have different tolerances; data that is acceptable for trend analysis may be insufficient for regulatory reporting.
What are the common dimensions of data quality?
Accuracy, completeness, consistency, timeliness, validity, and uniqueness.
What does accuracy mean for data quality?
That data values correctly reflect the real-world objects or events they represent.
What does completeness mean for data quality?
That required fields and records are present and not missing beyond acceptable thresholds.
What does consistency mean for data quality?
That data is free of contradictions, both across systems and within a single dataset, and follows agreed-upon rules.
What does timeliness mean for data quality?
That data is available within the required time window relative to when events occur or decisions must be made.
What does validity mean for data quality?
That data values conform to defined formats, types, ranges, and business rules.
What does uniqueness mean for data quality?
That records that should be unique (e.g., primary keys) are not duplicated, avoiding double counting and confusion.
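A uniqueness check can be as simple as counting key occurrences. The sketch below is illustrative; the key name `order_id` is a hypothetical example, not from the source.

```python
# Minimal uniqueness check sketch: find primary-key values that appear
# more than once in a batch of rows. Key name is hypothetical.
from collections import Counter

def duplicate_keys(rows: list[dict], key: str = "order_id") -> set:
    """Return the set of key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return {value for value, n in counts.items() if n > 1}
```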
What is the role of data quality checks in pipelines?
To automatically detect and surface quality issues early by validating data against expectations as it flows through pipelines.
What is a schema check?
A validation that incoming data matches expected column names, types, and structures.
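A schema check might look like the following sketch, which compares a row's keys and Python types against an expected schema. The column names and types are illustrative assumptions.

```python
# Hypothetical schema check: verify that an incoming row matches the
# expected column names and types. Schema contents are illustrative.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}

def check_schema(row: dict) -> list[str]:
    """Return a list of schema violations for a single row."""
    errors = []
    missing = EXPECTED_SCHEMA.keys() - row.keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    unexpected = row.keys() - EXPECTED_SCHEMA.keys()
    if unexpected:
        errors.append(f"unexpected columns: {sorted(unexpected)}")
    for col, expected_type in EXPECTED_SCHEMA.items():
        if col in row and not isinstance(row[col], expected_type):
            errors.append(
                f"{col}: expected {expected_type.__name__}, "
                f"got {type(row[col]).__name__}"
            )
    return errors
```

In practice this role is usually filled by a validation library or the warehouse's own type system; the sketch just makes the mechanics concrete.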
What are example content-level data quality checks?
Non-null constraints, range checks, allowed value lists, referential integrity checks, and pattern or format validations.
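These check types can be sketched as row-level validations. All field names, ranges, and allowed values below are hypothetical examples.

```python
# Illustrative content-level checks: non-null, range, allowed values,
# and pattern/format validation. Rules and fields are hypothetical.
import re

def check_row(row: dict) -> list[str]:
    """Return the content-level violations found in one row."""
    errors = []
    if row.get("customer_id") is None:                    # non-null constraint
        errors.append("customer_id is null")
    amount = row.get("amount")
    if amount is not None and not (0 <= amount <= 1_000_000):  # range check
        errors.append("amount out of range")
    if row.get("status") not in {"open", "shipped", "closed"}:  # allowed values
        errors.append("invalid status")
    email = row.get("email", "")
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):    # pattern check
        errors.append("malformed email")
    return errors
```

A referential integrity check would additionally verify that foreign-key values (e.g., `customer_id`) exist in the referenced table, which requires a lookup against that table rather than row-local logic.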
What is anomaly detection at the data level?
Checking for unusual changes in distributions, counts, or patterns that may indicate upstream issues or bugs.
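A minimal volume-anomaly check compares today's row count to recent history. The three-sigma threshold is a common but illustrative choice.

```python
# Sketch of a count anomaly check: flag today's row count if it deviates
# from the recent mean by more than k standard deviations (assumed k=3).
from statistics import mean, stdev

def is_count_anomalous(history: list[int], today: int, k: float = 3.0) -> bool:
    """Return True when today's count deviates more than k sigmas from history."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma
```

Real systems often use more robust methods (seasonal baselines, median absolute deviation), but the idea is the same: compare new data against learned expectations.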
Why should data quality checks be automated?
Manual checks don’t scale and are inconsistent, whereas automated checks provide consistent, repeatable validation across runs.
What is the difference between hard and soft data quality checks?
Hard checks fail or block the pipeline on violation; soft checks log warnings or raise alerts but allow processing to continue.
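One way to sketch this distinction: a hard failure raises and stops the run, while a soft failure only logs a warning. The check names and severity flag are hypothetical.

```python
# Hard vs. soft check handling sketch: hard failures raise and block the
# pipeline; soft failures are logged and processing continues.
import logging

class DataQualityError(Exception):
    """Raised when a hard (blocking) quality check fails."""

def enforce(check_name: str, passed: bool, hard: bool) -> bool:
    """Apply a check result according to its severity; return pass/fail."""
    if passed:
        return True
    if hard:
        raise DataQualityError(f"hard check failed: {check_name}")
    logging.warning("soft check failed: %s", check_name)
    return False
```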
When might soft checks be preferable?
When some issues are known to occur occasionally and should be monitored but not block the entire pipeline.
Why is it risky to use only soft checks?
Critical issues may go unaddressed, leading to bad data in downstream systems and eroding trust.
What is a data quality SLA?
An agreement specifying acceptable thresholds for quality metrics such as freshness, completeness, or error rates.
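A freshness SLA, for example, can be evaluated as a maximum allowed age of the latest load. The 2-hour threshold below is an illustrative assumption.

```python
# Hypothetical freshness SLA check: the latest load must be no older
# than max_age. The 2-hour threshold is illustrative.
from datetime import datetime, timedelta, timezone

def meets_freshness_sla(last_loaded_at: datetime,
                        max_age: timedelta = timedelta(hours=2)) -> bool:
    """Return True when the data's age is within the agreed SLA window."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age
```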
Why is monitoring data quality metrics over time important?
It reveals trends, recurring issues, and the impact of changes, making quality improvements measurable.
What is a data contract?
A formal, versioned agreement between data producers and consumers specifying schema, semantics, and service levels.
What elements typically belong in a data contract?
Schema definitions, field semantics, allowed values, refresh cadence, retention policy, and backward-compatibility rules.
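A data contract is often expressed as a versioned configuration file. The fragment below is a hypothetical sketch; every name, type, and policy value is illustrative.

```yaml
# Illustrative data contract fragment; all names and values are hypothetical.
dataset: orders
version: 2
schema:
  - name: order_id
    type: integer
    constraints: [not_null, unique]
  - name: amount
    type: decimal(10,2)
    constraints: [not_null, ">= 0"]
semantics:
  amount: "Order total in USD, including tax"
refresh_cadence: hourly
retention: 400 days
compatibility: backward   # only additive, non-breaking changes allowed
```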
How do data contracts reduce breaking changes?
They make expectations explicit and require coordinated schema evolution rather than unannounced changes.
What is backward-compatible schema change?
A change that does not break existing consumers, such as adding nullable fields or expanding enums in a controlled way.
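Backward compatibility can be checked mechanically. The sketch below assumes a simple schema representation (field name to type/nullability) and treats a change as compatible only if existing fields keep their types and any added fields are nullable; real tools apply richer rules.

```python
# Simplified backward-compatibility check between two schema versions.
# Schema representation ({field: {"type": ..., "nullable": ...}}) is
# a hypothetical simplification.
def is_backward_compatible(old: dict, new: dict) -> bool:
    """True if existing fields are unchanged and added fields are nullable."""
    for field, spec in old.items():
        if field not in new or new[field]["type"] != spec["type"]:
            return False  # dropped/renamed field or type change breaks consumers
    added = new.keys() - old.keys()
    return all(new[f].get("nullable", False) for f in added)
```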
What are examples of breaking schema changes?
Dropping or renaming fields, changing types incompatibly, or altering meaning without versioning or notice.