Data
Found everywhere and generated at an unprecedented rate. A fundamental component of society.
Data in our lives
Types of Data
Structured, Unstructured, Semi-Structured
Structured Data
Well organized, consistently formatted, and easily searchable, e.g. financial records. Usually stored in an RDBMS or in files such as CSV.
Unstructured Data
Unorganized, with no predefined format, and found in many different formats, e.g. social media posts and emails. Usually stored in file systems or a CMS that preserve the original structure.
Semi-Structured Data
A combination of the two: partially organized data that does not follow a rigid format but still has some level of structure, with a mix of fixed and variable fields. Can be found in XML or JSON files.
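A minimal sketch of the "mix of fixed and variable fields" idea, using Python's standard json module. The records and field names below are made up for illustration only.

```python
import json

# Hypothetical records: both share "id" and "name" (fixed fields),
# but the remaining fields vary from record to record.
raw = '''
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "phones": ["555-0100", "555-0101"]}
]
'''

records = json.loads(raw)

# Fixed fields: keys present in every record.
fixed_fields = set.intersection(*(set(r) for r in records))
# Variable fields: keys present in only some records.
variable_fields = set.union(*(set(r) for r in records)) - fixed_fields

print(sorted(fixed_fields))
print(sorted(variable_fields))
```

Structured data would require every record to have the same columns. Here only part of each record is predictable, which is what makes it semi-structured.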
Qualitative Data example
Gender, Nationality
Quantitative Data example
Height, Weight
Raw Data
Original source of data; hard to use for analysis. Raw data may only need to be processed once.
Processed Data
Data that is ready for analysis. Processing can include merging, subsetting, transforming, etc. All steps should be recorded.
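The processing steps named above (merging, subsetting, transforming) can be sketched in plain Python, with each step written to a log so the path from raw to processed data is recorded. The tables, field names, and the 168 cm threshold are assumptions chosen for the example.

```python
# Hypothetical raw inputs: a small table of people and a lookup of weights by id.
people = [
    {"id": 1, "name": "Ana", "height_cm": 170},
    {"id": 2, "name": "Ben", "height_cm": 182},
    {"id": 3, "name": "Cy", "height_cm": 165},
]
weights = {1: 60.0, 2: 85.0, 3: 58.0}  # id -> weight in kg

steps = []  # record of every processing step, in order

# Merge: attach each person's weight by id.
merged = [{**p, "weight_kg": weights[p["id"]]} for p in people]
steps.append("merged people with weights on id")

# Subset: keep only people taller than 168 cm.
subset = [row for row in merged if row["height_cm"] > 168]
steps.append("kept rows with height_cm > 168")

# Transform: convert height from centimetres to metres.
processed = [
    {
        "id": r["id"],
        "name": r["name"],
        "height_m": r["height_cm"] / 100,
        "weight_kg": r["weight_kg"],
    }
    for r in subset
]
steps.append("converted height_cm to height_m")

for number, step in enumerate(steps, start=1):
    print(number, step)
```

Keeping the step log next to the processed output makes it possible to repeat or audit the processing later.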
Raw Data Example
Machine-generated files, from ASCII to binary; unformatted Excel files; API responses.
Accuracy
The measure of data quality that ensures data is correct, free from errors, and represents the real-world value accurately.
Completeness
Indicates whether all required data is recorded or if some is missing/unavailable.
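A rough sketch of a completeness check: for each record, list the required fields that are missing or empty. The set of required fields and the sample records are assumptions for illustration.

```python
REQUIRED_FIELDS = ("id", "name", "email")  # hypothetical required fields

def missing_fields(record):
    """Return the required fields that are absent, None, or empty strings."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]

records = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "", "email": None},
    {"id": 3, "name": "Cy"},
]

# Map each record id to the list of fields it is missing.
report = {r["id"]: missing_fields(r) for r in records}

# Share of records with no missing required fields.
complete = sum(1 for gaps in report.values() if not gaps)
completeness_ratio = complete / len(records)

print(report)
print(completeness_ratio)
```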
Consistency
Ensures uniformity across data. Examples of issues include partially modified records or dangling updates.
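One common uniformity problem is the same yes/no value being encoded in several ways (1/0, T/F, Y/N). A minimal sketch of normalizing such a column to a single representation follows. The accepted spellings are an assumption and would need to match the real data.

```python
# Assumed mapping; extend it to match the encodings in the actual data.
TRUE_VALUES = {"1", "t", "true", "y", "yes"}
FALSE_VALUES = {"0", "f", "false", "n", "no"}

def to_bool(value):
    """Map the common yes/no encodings to a bool; raise on anything else."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognised boolean encoding: {value!r}")

raw_column = [1, "T", "y", 0, "F", "N"]
clean_column = [to_bool(v) for v in raw_column]

print(clean_column)
```

Raising on unknown values, rather than guessing, keeps unexpected encodings visible instead of silently turning them into wrong data.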
Timeliness
Refers to whether data is updated promptly to reflect the current state.
Believability
Assesses how trustworthy or credible the data is.
Interpretability
Reflects how easily the data can be understood by users.
Data Consolidation Process
Why is Data Consolidation Needed?
Disparate Data
Data is often stored in diverse locations and formats, which may include:
- Relational Databases: Used in operational systems for structured data.
- XML Files: Common in web services for hierarchical data.
- Desktop Databases: Such as Microsoft Access.
- Spreadsheets: Examples include Microsoft Excel.
- JSON: Popular for semi-structured or API-related data.
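As a small illustration of consolidating disparate sources, the sketch below reads the same kind of record from a CSV source and a JSON source and combines them into one uniform list. The sources are in-memory strings and the field names are hypothetical. A real pipeline would read from files, databases, or APIs.

```python
import csv
import io
import json

# Hypothetical sources, held in memory so the sketch is self-contained.
csv_source = "id,name\n1,Ana\n2,Ben\n"
json_source = '[{"id": 3, "name": "Cy"}]'

def from_csv(text):
    # csv.DictReader yields every value as a string, so cast id to int.
    return [
        {"id": int(row["id"]), "name": row["name"]}
        for row in csv.DictReader(io.StringIO(text))
    ]

def from_json(text):
    # Keep only the shared fields so both sources end up in the same shape.
    return [{"id": item["id"], "name": item["name"]} for item in json.loads(text)]

consolidated = from_csv(csv_source) + from_json(json_source)

for record in consolidated:
    print(record)
```

Note the type cast in the CSV reader: the same id arrives as text from one source and as a number from the other, which is a typical inconsistency that consolidation has to resolve.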
Challenges with Disparate Data
Causes of Inconsistencies
Different encodings of the same value (e.g., 1/0, T/F, Y/N).
Other Data Quality Issues
Causes of Missing Data