Data Quality, Testing, and Data Contracts Flashcards

(40 cards)

1
Q

What is data quality in the context of data engineering?

A

The degree to which data is accurate, complete, timely, consistent, and valid for its intended use.

2
Q

Why must data quality be defined relative to use cases?

A

Different consumers have different tolerances; data that is acceptable for trend analysis may be insufficient for regulatory reporting.

3
Q

What are the common dimensions of data quality?

A

Accuracy, completeness, consistency, timeliness, validity, and uniqueness.

4
Q

What does accuracy mean for data quality?

A

That data values correctly reflect the real-world objects or events they represent.

5
Q

What does completeness mean for data quality?

A

That required fields and records are present and not missing beyond acceptable thresholds.

6
Q

What does consistency mean for data quality?

A

That data does not contain contradictory information across systems or within itself and follows agreed rules.

7
Q

What does timeliness mean for data quality?

A

That data is available within the required time window relative to when events occur or decisions must be made.

8
Q

What does validity mean for data quality?

A

That data values conform to defined formats, types, ranges, and business rules.

9
Q

What does uniqueness mean for data quality?

A

That records that should be unique (e.g., primary keys) are not duplicated, avoiding double counting and confusion.

10
Q

What is the role of data quality checks in pipelines?

A

To automatically detect and surface quality issues early by validating data against expectations as it flows through pipelines.

11
Q

What is a schema check?

A

A validation that incoming data matches expected column names, types, and structures.
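A minimal Python sketch of a record-level schema check; the expected schema and sample records below are illustrative assumptions, not from any specific pipeline:

```python
# Hypothetical expected schema: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}

def check_schema(record: dict) -> list[str]:
    """Return a list of schema violations for one record (empty = pass)."""
    errors = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            errors.append(f"{column}: expected {expected_type.__name__}, "
                          f"got {type(record[column]).__name__}")
    return errors

print(check_schema({"order_id": 1, "amount": 9.99, "currency": "EUR"}))  # []
print(check_schema({"order_id": "1", "amount": 9.99}))  # type + missing-column errors
```

Real systems usually express the expected schema in a declarative format (Avro, JSON Schema, a warehouse DDL) rather than Python types, but the shape of the check is the same.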

12
Q

What are example content-level data quality checks?

A

Non-null constraints, range checks, allowed value lists, referential integrity checks, and pattern or format validations.
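Three of these checks sketched in plain Python; the field names, bounds, and allowed values are made-up assumptions for illustration:

```python
ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}  # hypothetical allowed value list

def check_row(row: dict) -> list[str]:
    """Run content-level checks on one row; return violation messages."""
    errors = []
    if row.get("customer_id") is None:                 # non-null constraint
        errors.append("customer_id must not be null")
    if not (0 <= row.get("amount", -1) <= 1_000_000):  # range check
        errors.append("amount out of range")
    if row.get("currency") not in ALLOWED_CURRENCIES:  # allowed value list
        errors.append("unknown currency")
    return errors

print(check_row({"customer_id": 7, "amount": 25.0, "currency": "EUR"}))  # []
```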

13
Q

What is anomaly detection at the data level?

A

Checking for unusual changes in distributions, counts, or patterns that may indicate upstream issues or bugs.
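One simple form of this is a z-score check on daily row counts; the history values and the 3-sigma threshold below are illustrative assumptions:

```python
import statistics

def is_count_anomalous(history: list[int], today: int,
                       z_threshold: float = 3.0) -> bool:
    """Flag today's count if it deviates from the historical mean
    by more than z_threshold standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) / stdev > z_threshold

history = [1000, 1020, 980, 1010, 990, 1005, 995]  # invented daily counts
print(is_count_anomalous(history, 1003))  # a normal day
print(is_count_anomalous(history, 100))   # likely upstream breakage
```

Production anomaly detection typically accounts for seasonality and trend, but a threshold on count deviation already catches many gross upstream failures.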

14
Q

Why should data quality checks be automated?

A

Manual checks don’t scale and are inconsistent, whereas automated checks provide consistent, repeatable validation across runs.

15
Q

What is the difference between hard and soft data quality checks?

A

Hard checks fail or block the pipeline on violation; soft checks log warnings or raise alerts but allow processing to continue.
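The distinction can be sketched as a severity flag on each check; the check names here are hypothetical:

```python
import logging

class DataQualityError(Exception):
    """Raised when a hard check fails, halting the pipeline run."""

def run_check(name: str, passed: bool, severity: str = "hard") -> None:
    if passed:
        return
    if severity == "hard":
        # Hard check: block the pipeline on violation.
        raise DataQualityError(f"hard check failed: {name}")
    # Soft check: record the problem but let processing continue.
    logging.warning("soft check failed: %s", name)

run_check("no_null_keys", passed=True)
run_check("freshness_under_2h", passed=False, severity="soft")  # logs a warning
# run_check("schema_matches", passed=False)  # would raise DataQualityError
```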

16
Q

When might soft checks be preferable?

A

When some issues are known to occur occasionally and should be monitored but not block the entire pipeline.

17
Q

Why is it risky to use only soft checks?

A

Critical issues may go unaddressed, leading to bad data in downstream systems and eroding trust.

18
Q

What is a data quality SLA?

A

An agreement specifying acceptable thresholds for quality metrics such as freshness, completeness, or error rates.
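A freshness SLA, for example, can be checked mechanically; the 2-hour threshold and timestamps below are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=2)  # hypothetical agreed threshold

def meets_freshness_sla(last_updated: datetime, now: datetime) -> bool:
    """True if the dataset was updated within the agreed SLA window."""
    return (now - last_updated) <= FRESHNESS_SLA

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(meets_freshness_sla(datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc), now))  # True
print(meets_freshness_sla(datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc), now))   # False
```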

19
Q

Why is monitoring data quality metrics over time important?

A

It reveals trends, recurring issues, and the impact of changes, making quality improvements measurable.

20
Q

What is a data contract?

A

A formal, versioned agreement between data producers and consumers specifying schema, semantics, and service levels.

21
Q

What elements typically belong in a data contract?

A

Schema definitions, field semantics, allowed values, refresh cadence, retention policy, and backward-compatibility rules.

22
Q

How do data contracts reduce breaking changes?

A

They make expectations explicit and require coordinated schema evolution rather than unannounced changes.

23
Q

What is backward-compatible schema change?

A

A change that does not break existing consumers, such as adding nullable fields or expanding enums in a controlled way.
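A simplistic compatibility check between two schema versions, under the assumed rules that existing fields must keep their types and new fields must be nullable; the dict-based schema format is a hypothetical stand-in for a real registry's representation:

```python
def is_backward_compatible(old: dict, new: dict) -> bool:
    for field, spec in old.items():
        if field not in new or new[field]["type"] != spec["type"]:
            return False  # dropped or retyped field breaks existing consumers
    for field, spec in new.items():
        if field not in old and not spec.get("nullable", False):
            return False  # a new required field breaks existing readers
    return True

v1 = {"id": {"type": "int"}}
v2 = {"id": {"type": "int"}, "email": {"type": "string", "nullable": True}}
print(is_backward_compatible(v1, v2))  # True: added a nullable field
print(is_backward_compatible(v1, {"id": {"type": "string"}}))  # False: retyped
```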

24
Q

What are examples of breaking schema changes?

A

Dropping or renaming fields, changing types incompatibly, or altering meaning without versioning or notice.

25
Q

Why should breaking changes be versioned instead of applied in place?

A

Versioning allows old and new consumers to coexist, reducing disruption while migrations are completed.

26
Q

What is a contract-first approach to data?

A

Defining schemas and contracts before implementation and using them as the source of truth for producers and consumers.

27
Q

What is a schema registry in the context of data contracts?

A

A system that stores and validates schema versions for topics or datasets, enforcing compatibility rules.

28
Q

Why is a schema registry useful for streaming systems?

A

It ensures producers serialize data according to agreed schemas and consumers can deserialize and evolve safely.

29
Q

What is producer-side validation in data contracts?

A

Checks performed by data producers to ensure they only emit events or records that conform to the contract schema.
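A sketch of a producer refusing to emit records that violate its contract; the contract format and the list standing in for a message topic are assumptions:

```python
# Hypothetical contract: field name -> required Python type.
CONTRACT = {
    "fields": {
        "event_id": str,
        "occurred_at": str,
        "amount_cents": int,
    }
}

def emit(record: dict, sink: list) -> None:
    """Validate against the contract before publishing; refuse on violation."""
    for field, ftype in CONTRACT["fields"].items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"contract violation on field {field!r}")
    sink.append(record)  # stand-in for publishing to a topic

sink = []
emit({"event_id": "e1", "occurred_at": "2024-01-01T00:00:00Z",
      "amount_cents": 500}, sink)
print(len(sink))  # 1
```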
30
Q

What is consumer-side validation?

A

Checks performed by consumers to ensure incoming data matches expectations, guarding against misbehaving producers or drift.

31
Q

What is a data quality firewall?

A

A layer that blocks or quarantines data that violates critical quality rules before it reaches curated or production tables.

32
Q

What is a quarantine table or error bucket?

A

A location where bad or suspect records are stored for later analysis and remediation instead of silently dropping them.
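The routing logic can be as simple as splitting each batch into valid and quarantined records; the non-null rule below is a placeholder assumption:

```python
def split_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route failing records to quarantine instead of dropping them."""
    valid, quarantined = [], []
    for rec in records:
        if rec.get("user_id") is not None:  # placeholder quality rule
            valid.append(rec)
        else:
            # Keep the bad record with a reason so it can be debugged
            # and replayed after the upstream issue is fixed.
            quarantined.append({"record": rec, "reason": "null user_id"})
    return valid, quarantined

valid, quarantined = split_records([{"user_id": 1}, {"user_id": None}])
print(len(valid), len(quarantined))  # 1 1
```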
33
Q

Why is it important to retain bad records separately?

A

They provide clues for debugging upstream issues and allow selective correction or replay once problems are fixed.

34
Q

What is referential integrity in data quality?

A

The guarantee that foreign keys in one table refer to valid primary keys in another, preventing orphan records.
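An orphan check sketched over in-memory tables; the fact and dimension contents are invented for illustration:

```python
# Hypothetical dimension and fact tables.
dim_customers = [{"customer_id": 1}, {"customer_id": 2}]
fact_orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},  # orphan: no such customer
]

# Collect valid primary keys, then find fact rows whose foreign key
# has no matching dimension row.
valid_keys = {row["customer_id"] for row in dim_customers}
orphans = [row for row in fact_orders if row["customer_id"] not in valid_keys]
print(orphans)  # [{'order_id': 11, 'customer_id': 99}]
```

In a warehouse this is typically a `LEFT JOIN ... WHERE dim.key IS NULL` query rather than Python, but the logic is identical.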
35
Q

Why is referential integrity important for analytics?

A

Broken references cause missing dimension attributes, inconsistent aggregates, and misleading reports.

36
Q

What is a data quality incident?

A

An event where data fails to meet agreed quality standards or SLAs, impacting consumers or downstream decisions.

37
Q

Why should data quality incidents be documented and reviewed?

A

Postmortems identify root causes and process changes that prevent recurrence, improving the system over time.

38
Q

What is the role of ownership in data quality and contracts?

A

Clear owners for each dataset or contract ensure accountability for quality, evolution, and communication with consumers.

39
Q

How do data quality and contracts relate to trust in a data platform?

A

Consistent adherence to contracts and visible quality metrics build trust that data is reliable and safe to use.

40
Q

What is a good one-sentence mental model for data quality and contracts?

A

Define explicit expectations for your data, continuously verify that reality matches those expectations, and make any change to them deliberate, versioned, and visible.