What is data quality and list key quality dimensions.
Data quality measures fitness for purpose. Key dimensions: accuracy (correct values), completeness (no missing data), consistency (uniform format), timeliness (current), and validity (conforms to format rules).
How would you approach data cleaning?
Data cleaning involves: identifying missing values (handle via imputation or removal), detecting outliers, standardizing formats, removing duplicates, correcting errors, validating against business rules, and documenting all changes.
What is data validation and how does it differ from verification?
Validation checks if data meets required quality standards before use. Verification confirms data accurately represents source information. Validation asks ‘is it acceptable?’ while verification asks ‘is it correct?’
What are the main causes of missing data?
Causes include: data entry errors, equipment failure, non-response in surveys, system outages, and intentional removal. Understanding the cause determines the best handling strategy and whether the analysis remains valid.
Describe methods for handling missing data.
Methods include: deletion (remove rows/columns), mean/median imputation, forward/backward fill for time series, or advanced techniques like k-NN or model-based imputation. Choose based on percentage missing and analysis type.
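A minimal stdlib sketch of mean/median imputation (function and variable names are illustrative, not from any particular library):

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 28]
print(impute(ages, "mean"))  # [25, 31, 31, 40, 31, 28]
```

Note that single-value imputation shrinks variance; for anything beyond a quick fix, consider whether the missingness mechanism justifies it.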
Explain how you would identify duplicate records in a table.
Use GROUP BY with HAVING COUNT(*) > 1 to find duplicates, or a window function such as ROW_NUMBER() OVER (PARTITION BY key_columns ORDER BY a tiebreaker column, e.g. id or created_at) to flag them; ordering by the partition columns themselves is redundant, since those values are identical within each partition. Then investigate root causes and decide whether to keep or remove.
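The GROUP BY / HAVING pattern can be mirrored outside the database; a rough sketch assuming records arrive as dicts:

```python
from collections import Counter

def find_duplicates(rows, key_columns):
    """Return key tuples appearing more than once (GROUP BY ... HAVING COUNT(*) > 1)."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": "b@x.com"},
    {"id": 3, "email": "a@x.com"},
]
print(find_duplicates(rows, ["email"]))  # {('a@x.com',): 2}
```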
Explain how you would reconcile data from multiple sources.
Reconciliation involves: identifying key fields for matching, loading data from both sources, joining on keys, comparing values, investigating discrepancies, identifying duplicates or missing records, and documenting differences.
What is data standardization and why is it important?
Standardization ensures consistent format, units, and representation across data. Examples: standardizing date formats (YYYY-MM-DD), converting all text to lowercase, converting to a single unit (e.g., kilograms rather than pounds). Prevents analysis errors.
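The date-format example above can be sketched with the stdlib; the list of source formats is hypothetical and would come from profiling the actual data:

```python
from datetime import datetime

# Hypothetical source formats observed during profiling; extend as new ones appear.
# Beware ambiguity: "05/03/2024" is parsed day-first here by assumption.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def standardize_date(raw):
    """Parse a date string in any known format and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(standardize_date("March 5, 2024"))  # 2024-03-05
print(standardize_date("05/03/2024"))     # 2024-03-05 (day-first assumed)
```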
How would you handle inconsistent data types?
Identify type mismatches, convert columns to correct types using CAST or CONVERT, validate conversions work correctly, handle failed conversions carefully, and document all type changes made for reproducibility.
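A small sketch of "handle failed conversions carefully": convert what you can, and keep the failures for review instead of crashing (similar in spirit to SQL's TRY_CAST; names are illustrative):

```python
def safe_cast(values, caster=float, default=None):
    """Convert each value, collecting failures instead of raising."""
    converted, failures = [], []
    for i, v in enumerate(values):
        try:
            converted.append(caster(v))
        except (TypeError, ValueError):
            converted.append(default)
            failures.append((i, v))  # keep the row index and bad value for review
    return converted, failures

vals, bad = safe_cast(["3.14", "42", "N/A", None])
print(vals)  # [3.14, 42.0, None, None]
print(bad)   # [(2, 'N/A'), (3, None)]
```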
Explain the concept of data imputation.
Imputation fills missing values with estimated values. Methods: mean/median/mode, forward/backward fill, regression-based, or k-NN. Choose based on missing data mechanism, percentage missing, and impact on analysis.
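Forward and backward fill for time series can be sketched in a few lines (backward fill is just forward fill on the reversed series):

```python
def forward_fill(series):
    """Replace None with the most recent observed value (leading gaps stay None)."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def backward_fill(series):
    """Replace None with the next observed value (trailing gaps stay None)."""
    return forward_fill(series[::-1])[::-1]

temps = [None, 20.5, None, None, 21.0, None]
print(forward_fill(temps))   # [None, 20.5, 20.5, 20.5, 21.0, 21.0]
print(backward_fill(temps))  # [20.5, 20.5, 21.0, 21.0, 21.0, None]
```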
What is feature engineering and its importance in data preparation?
Feature engineering creates new variables from existing data to improve model performance. Examples: creating age groups from birth date, deriving ratios, encoding categorical variables. Often provides the largest performance gains.
How would you handle categorical variables in analysis?
Options include: one-hot encoding (create binary columns), label encoding (assign integers), target encoding (encode by target mean), ordinal encoding (for ordered categories). Choice depends on modeling technique used.
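One-hot encoding, sketched without any ML library (column naming is illustrative):

```python
def one_hot(values):
    """Expand a categorical column into one binary column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "blue", "red"]
print(one_hot(colors))
# [{'is_blue': 0, 'is_red': 1}, {'is_blue': 1, 'is_red': 0}, {'is_blue': 0, 'is_red': 1}]
```

High-cardinality columns make one-hot encoding explode in width, which is one reason target or ordinal encoding may be preferred.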
Explain the concept of data profiling.
Data profiling examines data to understand its characteristics. Includes: checking distributions, identifying missing values, detecting outliers, validating formats, and understanding relationships. Foundation for effective cleaning.
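A minimal per-column profile covering several of the checks above (real profiling tools report far more, but the shape is the same):

```python
from statistics import mean

def profile_column(values):
    """Summarize one column: row count, missingness, distinct values, numeric range."""
    observed = [v for v in values if v is not None]
    report = {
        "count": len(values),
        "missing": len(values) - len(observed),
        "distinct": len(set(observed)),
    }
    if observed and all(isinstance(v, (int, float)) for v in observed):
        report.update(min=min(observed), max=max(observed), mean=mean(observed))
    return report

print(profile_column([10, 12, None, 10, 15]))
# {'count': 5, 'missing': 1, 'distinct': 3, 'min': 10, 'max': 15, 'mean': 11.75}
```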
What is data lineage and why does it matter?
Data lineage traces data origin through transformations to final use. Important for: debugging data issues, ensuring reproducibility, compliance, understanding dependencies, and impact analysis of changes.
How would you detect and handle data anomalies?
Detect using: statistical methods (z-scores, IQR), business rules, visual inspection. Handle by: investigating cause, validating accuracy, deciding whether to remove/keep/adjust, and documenting decisions.
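The z-score method mentioned above, as a stdlib sketch (the threshold of 3 is a common convention, not a rule):

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score exceeds the threshold (assumes roughly normal data)."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]
print(zscore_outliers(data, threshold=2.0))  # [95]
```

Note that extreme values inflate the mean and standard deviation themselves, which is why robust alternatives like the IQR rule are often preferred.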
Explain the concept of data governance.
Data governance establishes policies, procedures, and controls for data management. Includes: defining data ownership, ensuring quality, managing access, establishing standards, and enforcing compliance.
What is the difference between data cleansing and data wrangling?
Cleansing focuses on error correction and standardization. Wrangling includes all data preparation: cleaning, transforming, and reshaping data to make it suitable for analysis. Wrangling is the broader concept.
How would you handle outliers in your dataset?
Options: investigate cause (data error vs. legitimate extreme), visualize to understand impact, use robust statistics, remove if erroneous, keep if valid and meaningful, or analyze separately.
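The IQR fence is a common robust check for deciding which points even warrant investigation; a sketch using the stdlib (the 1.5 multiplier is Tukey's conventional default):

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (robust to non-normal data)."""
    q1, _, q3 = quantiles(values, n=4)  # exclusive method by default
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 11, 9, 10, 12, 10, 11, 95]
print(iqr_outliers(data))  # [95]
```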
Explain the concept of data transformation.
Transformation changes data representation: scaling (normalization), encoding, creating new variables, aggregating, reshaping. Important for: meeting algorithm assumptions, improving interpretability, and making relationships between variables easier to detect.
What is binning and when would you use it?
Binning groups continuous values into intervals/categories. Use for: reducing noise, creating categorical features, improving interpretability, improving model performance. Methods: equal-width, equal-frequency, custom.
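Equal-width binning can be sketched in a few lines (this toy version assumes the values are not all identical, which would make the width zero):

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 over equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # clamp the maximum value into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 33, 47, 52, 64]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 1, 2, 2]
```

Equal-frequency binning would instead cut at quantiles so each bin holds roughly the same number of points.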
How would you handle time zone issues in temporal data?
Standardize all data to a single time zone (typically UTC), document the original time zones, be aware of daylight saving transitions, ensure timestamps are consistent across systems, and validate time calculations.
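The standardize-to-UTC step, sketched with the stdlib `zoneinfo` module (Python 3.9+); the example date deliberately falls just after a US daylight-saving jump:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

def to_utc(naive_timestamp, source_tz):
    """Attach the documented source time zone, then convert to UTC for storage."""
    aware = naive_timestamp.replace(tzinfo=ZoneInfo(source_tz))
    return aware.astimezone(ZoneInfo("UTC"))

# 2024-03-10 is the US spring-forward date; 09:30 local is EDT (UTC-4), not EST.
local = datetime(2024, 3, 10, 9, 30)
print(to_utc(local, "America/New_York").isoformat())  # 2024-03-10T13:30:00+00:00
```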
Explain the concept of data normalization vs standardization.
Normalization scales values to 0-1 range (or similar). Standardization (z-score) scales to mean 0, std dev 1. Both remove units of measurement. Choose based on algorithm requirements and interpretability needs.
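Both rescalings in two short functions (stdlib only):

```python
from statistics import mean, stdev

def min_max_normalize(values):
    """Rescale to the 0-1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_standardize(values):
    """Rescale to mean 0, standard deviation 1: (x - mean) / stdev."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

data = [2, 4, 6, 8]
print(min_max_normalize(data))  # [0.0, 0.333..., 0.666..., 1.0]
print(z_standardize(data))      # symmetric around 0
```

Min-max is sensitive to outliers (a single extreme value compresses everything else), while z-scores preserve the shape of the distribution.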
What is the importance of documenting data cleaning steps?
Documentation enables: reproducibility, understanding decisions made, identifying errors, enabling collaboration, facilitating audits, and supporting knowledge transfer. Essential for data integrity and credibility.
How would you validate that your data cleaning was successful?
Validate by: checking completeness (no unexpected missing values), verifying distributions make sense, spot-checking cleaned records against the original, running business-rule validations, and comparing metrics before and after cleaning.
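The before/after comparison can be automated with a small report; a sketch assuming rows are dicts (metric names are illustrative):

```python
def cleaning_report(before, after, column):
    """Compare simple quality metrics for one column before and after cleaning."""
    def metrics(rows):
        vals = [r.get(column) for r in rows]
        return {"rows": len(vals),
                "missing": sum(v is None for v in vals),
                "distinct": len({v for v in vals if v is not None})}
    return {"before": metrics(before), "after": metrics(after)}

before = [{"age": 25}, {"age": None}, {"age": 25}]
after  = [{"age": 25}, {"age": 30}, {"age": 25}]
print(cleaning_report(before, after, "age"))
# {'before': {'rows': 3, 'missing': 1, 'distinct': 1},
#  'after':  {'rows': 3, 'missing': 0, 'distinct': 2}}
```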