WHY DOES YOUR DATA HAVE
ERRORS
1) Data entry errors
2) Measurement errors
3) Distillation errors
4) Data integration errors
WHAT TO DO ABOUT THEM
Data entry interface design (before)
* Organizational management (before)
* Automated data auditing and cleaning (after)*
* Exploratory data analysis and cleaning (after)
* Not the focus of today, but read about it.
EXPLORATORY DATA ANALYSIS
HOW TO START: 1) Explore your data:
OUTLIERS
2) For continuous variables – are there any outliers? i.e. Some
values that are far from the center? i.e., your average age is 36,
but you also have observations with age 150, while the second
highest observation is 67. You can get a sense of this from the
descriptive statistics table (i.e., if you look at min/max and
average values).
In R you can check this two ways > boxplot and outlier test.
WHAT TO DO WITH MISSING DATA
1) Test whether it is missing at random
2) Consider:
* Do you want to impute it, either using an average or median
* Should you delete the observation entirely
* Should you simply not use that column
Concretely, you have to identify (read more in Chapter 5, 5.4, in Business Analytics: Data Science for Business Problems)
1. Which features have missing values;
2. The extent of the missingness;
3. The reasons for the missingness; and
4. What to do about them.
NON-CONTINUOUS DATA
IDs- integrity check – you should not have any duplicate IDs…
Categorical variables – check for unusually high frequencies – it could be a data
error, but even if it is not, this might impact subsequent data analysis, e.g., if
99% of your respondents are students, vs full time employees, this might not
be a good target variable/feature to include later
CLOSING REMARKS
%>%
It’s called a pipe.. Read about it here:
https://www.datacamp.com/tutorial/pipe-r-tutorial#intro
TIDYR
1 Each variable is saved in its own column
2 Each observation is saved in its own row
3 Each “type” of observation stored in a single table (here, storms)
Gather ()
Collapses multiple columns into two columns
1 A key column that contains the former column names
2 A value column that contains the former column cells
gather (cases, “year”, “n”, 2:4)
cases = data frame to reshape
year = name of the key column (a character string)
n = name of the new value column (a character string)
2:4 = names or numeric indexes of columns to collapse
Spread () vs gather ()
Make observations from variables with gather()
Make variables from observations with spread()
TIDYR
A package that reshapes the layout of data sets
Make observations from variables with gather()
Make variables from observations with spread()
Split and merge column with unite() and separate()