What are the pieces of ETL
Extraction, Transformation, and Loading
What is data cleansing
Dirty data should be removed or corrected before it goes into the data warehouse
What is GIGO
Stands for “Garbage In, Garbage Out.” GIGO is a computer science acronym that implies bad input will result in bad output.
What is dirty data
It is a relative term. It means the data does not conform to the values it is expected to have.
Who decides whether data is dirty or clean
A person who has domain knowledge
What is a toddler employee
An example of dirty data: an employee who is recorded as far too young to hold a job
What is an un-born employee
An employee whose date of joining is earlier than the date of birth (the DOB is later than the date of joining)
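The two examples above can be checked with simple date comparisons. This is a minimal sketch, not a definitive rule set: the function names and the `MIN_WORKING_AGE` threshold of 18 are assumptions made for illustration, and the real limit must come from someone with domain knowledge.

```python
from datetime import date

# Assumed threshold for illustration only; the real value is a domain decision.
MIN_WORKING_AGE = 18


def age_on(dob: date, when: date) -> int:
    """Return the number of whole years between dob and when."""
    before_birthday = (when.month, when.day) < (dob.month, dob.day)
    return when.year - dob.year - int(before_birthday)


def classify_employee(dob: date, joined: date) -> str:
    """Label a record as 'un-born', 'toddler', or 'ok'."""
    if dob > joined:
        # Joined before being born: the un-born employee.
        return "un-born"
    if age_on(dob, joined) < MIN_WORKING_AGE:
        # Too young to hold a job: the toddler employee.
        return "toddler"
    return "ok"
```

For instance, `classify_employee(date(2015, 1, 1), date(2018, 1, 1))` describes a three-year-old hire and is labelled as a toddler employee.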
What is govt decision making
An example of the serious side of dirty data: the govt invests where there is no need, which is a loss of money
What is direct mail marketing
Another example of the cost of dirty data: the advertising campaign fails and the money is lost
What is the lighter side of dirty data
- Un-born Employee
What are the 3 classes of anomalies
What are the subclasses of syntactically dirty data
- Irregularities
What are the subclasses of semantically dirty data
What are coverage anomalies
- Missing Records
What are lexical errors
Problems in the structure of the data or in how it is stored
What are irregularities
A missing unit (e.g. the salary column contains 2000 and we do not know whether it is in PKR, USD, or some other currency)
What is an integrity constraint violation
Integrity constraint violations occur when an insert, update, or delete statement violates a primary key, foreign key, check, or unique constraint or a unique index.
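Each kind of violation named above can be reproduced with Python's built-in sqlite3 module. This is only a sketch: the `dept` and `employee` tables, their columns, and the `try_sql` helper are hypothetical and exist only for this example. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# Hypothetical schema, made up for this illustration.
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute(
    """CREATE TABLE employee (
           id      INTEGER PRIMARY KEY,
           dept_id INTEGER NOT NULL REFERENCES dept(id),
           salary  REAL CHECK (salary > 0)
       )"""
)
conn.execute("INSERT INTO dept VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (1, 1, 2000)")
conn.commit()


def try_sql(sql: str) -> str:
    """Run one statement and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


print(try_sql("INSERT INTO employee VALUES (1, 1, 3000)"))   # duplicate primary key
print(try_sql("INSERT INTO employee VALUES (2, 99, 3000)"))  # no such department (foreign key)
print(try_sql("INSERT INTO employee VALUES (3, 1, -5)"))     # check constraint on salary
print(try_sql("INSERT INTO dept VALUES (2, 'Sales')"))       # unique constraint on name
print(try_sql("DELETE FROM dept WHERE id = 1"))              # delete leaves an orphaned employee
```

All five statements are rejected with `sqlite3.IntegrityError`, so the dirty row never reaches the table.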
What is a business rule contradiction
It is a violation of a business rule
How can we handle coverage anomalies
What are the 2 key-based problems
- Non-primary key problems
What are primary key problems
What are non-primary key problems
What are the 4 methods of automatic data cleansing
1- Association rules (make rules based on the statistical properties of the data)
2- Pattern based (find values whose pattern differs from the rest)
3- Statistical (use measures such as the mean to spot unusual values)
4- Clustering (group similar values together; anomalies are left on their own)
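Two of the methods above, statistical and pattern based, can be sketched with the Python standard library. This is an illustration under assumptions, not a definitive implementation: the function names, the sample values, and the cut-off of 2 standard deviations are all made up for this example. Association rules and clustering are not shown.

```python
import re
from collections import Counter
from statistics import mean, stdev


def statistical_outliers(values, max_deviations=2.0):
    """Statistical method: flag values far from the mean.

    max_deviations is an assumed cut-off, measured in standard deviations.
    """
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) > max_deviations * spread]


def shape(value: str) -> str:
    """Reduce a string to its pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def pattern_outliers(values):
    """Pattern based method: flag values that do not share the most common pattern."""
    dominant = Counter(shape(v) for v in values).most_common(1)[0][0]
    return [v for v in values if shape(v) != dominant]


# Made-up sample data for illustration.
salaries = [2000, 2100, 1950, 2050, 2200, 1900, 2000, 2150, 2050, 200000]
phones = ["0300-1234567", "0321-7654321", "03001234567", "0333-1112223"]

print(statistical_outliers(salaries))  # flags the salary of 200000
print(pattern_outliers(phones))        # flags the number written without a dash
```

A single extreme value also inflates the mean and the standard deviation, so on small samples a strict cut-off may flag nothing; this is why the sketch uses a fairly low cut-off.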