What is data quality?
Data quality refers to how accurate, complete, reliable, and relevant data is for its intended purpose.
Poor quality = poor decisions.
Define reliability in the context of data quality, with an example.
Data is consistent over time and from trusted sources.
If you measure the same thing multiple times under the same conditions, you get the same result.
Example: A sensor records the temperature as 25°C every time in the same environment — this is reliable. If readings jump randomly (25°C → 30°C → 20°C), it’s unreliable.
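The repeated-measurement idea can be sketched in Python. This is only an illustration: the function name `is_reliable` and the `max_spread` tolerance of 0.5 are assumptions made for the example, not a standard test.

```python
from statistics import pstdev

def is_reliable(readings, max_spread=0.5):
    """Return True if repeated readings stay within a small spread.

    max_spread is a hypothetical tolerance, in the same unit as the readings.
    """
    return pstdev(readings) <= max_spread

stable = [25.0, 25.1, 24.9, 25.0]   # same environment, near-identical readings
jumpy = [25.0, 30.0, 20.0]          # readings jump randomly

print(is_reliable(stable))  # True
print(is_reliable(jumpy))   # False
```

A suitable tolerance depends on the sensor and the use case.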
List methods to ensure reliability.
Use standardised measurement tools – ensures the same method or device is used each time so results are consistent and comparable.
Automate data collection to reduce human error – removes bias or mistakes that can occur from manual entry, keeping results stable.
Test data consistency over time – regularly check if the same process produces similar outcomes across different periods or datasets.
Example: A sensor records the same temperature consistently.
Define validity in the context of data quality.
Validity means the data correctly measures what it is supposed to measure.
It is related to accuracy but not the same: accuracy is how close a value is to the truth, validity is whether the right thing is being measured.
List methods to ensure validity.
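Example: A customer satisfaction survey question that asks directly about satisfaction is valid; inferring satisfaction from social media likes may not be.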
What is accuracy in data quality? Give an example.
How close the data is to the true or accepted value.
Example: GPS coordinates of a store that match exactly with Google Maps are accurate.
Define relevance in data quality, with an example.
Data must be applicable to the problem you are solving.
Example: Using sales data to predict foot traffic in a store is relevant; using social media likes may not be.
Define completeness in data quality, with an example.
All necessary data is present.
Example: A customer database missing phone numbers is incomplete.
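Completeness can be measured as the share of records that have a value for a required field. The sketch below assumes records are plain dictionaries; the `completeness` function and the sample customers are made up for illustration.

```python
# Hypothetical customer records; None or "" counts as missing.
customers = [
    {"name": "Ana", "phone": "555-0100"},
    {"name": "Ben", "phone": None},
    {"name": "Cy", "phone": ""},
    {"name": "Dee", "phone": "555-0103"},
]

def completeness(records, field):
    """Fraction of records with a non-empty value for the given field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

print(completeness(customers, "phone"))  # 0.5
```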
Define timeliness in data quality, with an example.
Data is up-to-date.
Example: Stock prices updated every second are timely; last year's prices are not.
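One way to check timeliness is to compare a record's last-update time with a maximum allowed age. This is a minimal sketch; the `is_timely` name and the 5-second limit are assumptions, and the right limit depends on the use case.

```python
from datetime import datetime, timedelta, timezone

def is_timely(last_updated, max_age, now=None):
    """Return True if the data was updated within max_age of now."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age

# A fixed "now" keeps the example reproducible.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=1)   # price updated a second ago
stale = now - timedelta(days=365)    # last year's price

print(is_timely(fresh, timedelta(seconds=5), now))  # True
print(is_timely(stale, timedelta(seconds=5), now))  # False
```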
What is the purpose of validation rules?
Prevent incorrect data entry
Example: Age field only allows 0–120.
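The age rule could be enforced at entry time with a small check like the one below. The function name `validate_age` and the error messages are illustrative assumptions; real systems often put such rules in the form or database layer.

```python
def validate_age(value):
    """Return the age as an int, or raise ValueError if it breaks the rule."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        raise ValueError(f"age must be a number, got {value!r}") from None
    if not 0 <= age <= 120:
        raise ValueError(f"age must be between 0 and 120, got {age}")
    return age

print(validate_age("34"))  # 34

try:
    validate_age(150)
except ValueError as err:
    print(err)  # age must be between 0 and 120, got 150
```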
What is the purpose of data cleaning?
Correct errors and remove duplicates.
Example: Fix missing addresses, remove duplicate entries.
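A cleaning pass might look like the sketch below. It assumes every record has an `email` field used to spot duplicates, and it marks missing addresses with the placeholder "UNKNOWN" so they can be followed up; both choices are assumptions for the example.

```python
def clean(records):
    """Drop duplicate emails (case-insensitive) and flag missing addresses."""
    seen = set()
    cleaned = []
    for r in records:
        key = r["email"].strip().lower()   # normalise before comparing
        if key in seen:
            continue                       # duplicate entry, skip it
        seen.add(key)
        address = (r.get("address") or "").strip()
        cleaned.append({**r, "email": key, "address": address or "UNKNOWN"})
    return cleaned

raw = [
    {"email": "ana@example.com", "address": "1 Main St"},
    {"email": "ANA@example.com ", "address": "1 Main St"},  # duplicate
    {"email": "ben@example.com", "address": None},          # missing address
]

for row in clean(raw):
    print(row)
```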
What is the purpose of data auditing?
Review data periodically
Example: Check monthly sales data for anomalies.
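A periodic audit can be partly automated by flagging values that sit far from the typical level. The rule used here (more than 50% away from the median) and the sales figures are made up for illustration; a real audit would choose its own threshold.

```python
from statistics import median

def flag_anomalies(monthly_sales, tolerance=0.5):
    """Return the months whose sales differ from the median by more than
    tolerance * median. The tolerance is a hypothetical threshold."""
    mid = median(monthly_sales.values())
    return [m for m, v in monthly_sales.items() if abs(v - mid) > tolerance * mid]

sales = {"Jan": 100, "Feb": 105, "Mar": 98, "Apr": 300, "May": 102}
print(flag_anomalies(sales))  # ['Apr']
```

Flagged months still need a human review: a spike may be an error or a genuine event.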
What is the purpose of automated data collection?
Reduce human error
Example: IoT sensors logging temperature.
What is the purpose of cross-verification?
Compare data from multiple sources
Example: Check customer emails in CRM vs sign-up form.
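Cross-verification between two sources can be sketched with set operations. The email addresses and the `compare_sources` helper are hypothetical.

```python
def compare_sources(crm_emails, signup_emails):
    """Report emails that appear in only one of the two sources."""
    crm = {e.strip().lower() for e in crm_emails}
    signup = {e.strip().lower() for e in signup_emails}
    return {
        "only_in_crm": sorted(crm - signup),
        "only_in_signup": sorted(signup - crm),
    }

crm = ["ana@example.com", "ben@example.com", "cy@example.com"]
signup = ["ana@example.com", "Ben@Example.com", "dee@example.com"]

print(compare_sources(crm, signup))
# {'only_in_crm': ['cy@example.com'], 'only_in_signup': ['dee@example.com']}
```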
What is the purpose of standardised protocols?
Maintain consistency
Example: Use the same units (kg, m, $).
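Using the same units often means converting incoming values to one agreed unit before storing them. The sketch below converts weights to kilograms; the `to_kg` helper and the set of accepted units are assumptions for the example.

```python
# Conversion factors to kilograms (1 lb = 0.45359237 kg by definition).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

def to_kg(value, unit):
    """Convert a weight to kilograms, rejecting units outside the protocol."""
    try:
        return value * TO_KG[unit]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None

measurements = [(2, "kg"), (500, "g"), (10, "lb")]
for value, unit in measurements:
    print(f"{value} {unit} = {to_kg(value, unit):.3f} kg")
```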
What is structured data?
Clearly organised (tables, SQL)
Easy to analyse.
What is unstructured data?
Text, images, videos
Harder to analyse; may lack reliability.
What is semi-structured data?
JSON, XML — has structure but not as strict as tables
Combines elements of both structured and unstructured data.
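A short Python sketch shows why JSON counts as semi-structured: each record has named fields, but records need not share the same fields. The sample data is made up for illustration.

```python
import json

# The second record has no "tags" field, which a strict table would not allow.
raw = '[{"id": 1, "name": "Ana", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'

records = json.loads(raw)

# Flatten into table-like rows, filling the gap with an empty value.
rows = [(r["id"], r["name"], ", ".join(r.get("tags", []))) for r in records]
print(rows)  # [(1, 'Ana', 'vip'), (2, 'Ben', '')]
```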
Give an example of a reliability issue.
Temperature sensor gives random spikes
Indicates inconsistency in data.
Give an example of a validity issue.
Using social media likes as a proxy for customer satisfaction
May not accurately reflect true customer sentiment.
Give an example of a completeness issue.
Missing transaction IDs in a sales report
Indicates lack of necessary data.
Give an example of a timeliness issue.
Using last year’s stock prices for today’s trading decisions
Data is outdated.
Why does data quality matter?
Poor data quality → bad decisions → financial loss or missed opportunities
Reliable and valid data supports trustworthy analysis.