Phases of data preparation
Types/causes of missing data
During data extraction/collection:
a) Missing completely at random (MCAR): the probability of a value being missing is the same for every observation.
b) Missing at random (MAR): missingness in a variable is random but related to other observed input variables (e.g. data on age is missing more often for women than for men).
c) Missing depending on unobserved predictors: missingness is not random but depends on a variable that has not been recorded.
d) Missing depending on the missing value itself: the probability of a value being missing depends on that value (e.g. people with low income tend not to declare it).
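A small simulation can make the difference between these mechanisms concrete. The sketch below is illustrative only: the population, the `sex` and `income` fields, and all missingness probabilities are made-up assumptions, and case c) is omitted because it behaves like d) from the analyst's point of view (the driver is simply not in the data).

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic population (made-up numbers, for illustration only).
n = 10_000
people = [
    {"sex": random.choice(["F", "M"]), "income": random.gauss(50_000, 15_000)}
    for _ in range(n)
]

def observed_mean(rows, p_missing):
    """Mean income over the rows that are NOT dropped by p_missing(row)."""
    kept = [r["income"] for r in rows if random.random() >= p_missing(r)]
    return statistics.mean(kept)

true_mean = statistics.mean(r["income"] for r in people)

# a) Completely random: the same probability for every observation.
mcar = observed_mean(people, lambda r: 0.3)

# b) Random given another observed variable: depends on sex, not on income.
mar = observed_mean(people, lambda r: 0.5 if r["sex"] == "F" else 0.1)

# d) Depends on the value itself: low incomes are withheld more often.
mnar = observed_mean(people, lambda r: 0.6 if r["income"] < 40_000 else 0.1)

print(f"true={true_mean:.0f} mcar={mcar:.0f} mar={mar:.0f} mnar={mnar:.0f}")
```

In this setup the observed mean stays close to the true mean under a) and b), because income was generated independently of sex, while under d) it is pushed upward since low incomes are under-represented.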
How to handle missing values:
1) Delete them.
Listwise deletion - the entire observation that contains a missing value is deleted. Risk: little data remains.
Pairwise deletion - each statistic is calculated from all the values available for it. Risk: different sample sizes for each variable. When data are missing completely at random, pairwise deletion is preferred because it keeps more of the data; when missingness is not random, either form of deletion can bias the results.
2) Impute with the mean, median or mode. Use the mean or median for numeric data and the mode for categorical data.
3) Build a predictive model. Train a model on the observed data to estimate the values that will replace the missing data.
Risk: the estimated values are more regular (less variable) than the true values.
4) Impute with k-NN. A missing value is filled in from the k most similar observations. Advantage: it can be used for both qualitative and quantitative variables.
Risk: high computational cost and sensitivity to the model parameters (the choice of k and of the distance measure).
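The strategies above can be sketched on a toy table. This is a minimal, library-free illustration, not a recommended implementation: the records, the column names and the helper functions `available` and `knn_impute_age` are hypothetical, and the k-NN step uses a single numeric feature (income) as the similarity measure purely to keep the example short.

```python
import statistics

# Toy records (hypothetical values); None marks a missing entry.
rows = [
    {"age": 25,   "income": 30_000, "city": "Rome"},
    {"age": None, "income": 42_000, "city": "Milan"},
    {"age": 40,   "income": None,   "city": "Rome"},
    {"age": 31,   "income": 38_000, "city": None},
    {"age": 58,   "income": 61_000, "city": "Rome"},
]

# 1) Listwise deletion: keep only the fully observed rows.
listwise = [r for r in rows if None not in r.values()]

# 1) Pairwise deletion: each statistic uses every value available for it,
#    so the sample size can differ from variable to variable.
def available(col):
    return [r[col] for r in rows if r[col] is not None]

mean_age = statistics.mean(available("age"))        # uses 4 of the 5 rows
mean_income = statistics.mean(available("income"))  # uses a different 4 rows

# 2) Simple imputation: median for numeric columns, mode for categorical ones.
fill = {
    "age": statistics.median(available("age")),
    "income": statistics.median(available("income")),
    "city": statistics.mode(available("city")),
}
imputed = [{k: (fill[k] if v is None else v) for k, v in r.items()} for r in rows]

# 4) k-NN imputation (hand-rolled sketch): fill a missing age with the mean
#    age of the k donor rows whose income is closest to the target's income.
def knn_impute_age(target, k=2):
    donors = [r for r in rows if r["age"] is not None and r["income"] is not None]
    donors.sort(key=lambda r: abs(r["income"] - target["income"]))
    return statistics.mean(r["age"] for r in donors[:k])

knn_age = knn_impute_age(rows[1])

print(len(listwise), mean_age, fill, knn_age)
```

Note how listwise deletion leaves only two of the five rows, which is the "little data" risk, and how the k-NN result depends directly on the choice of `k` and of the distance.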
Types of outliers and origins
Univariate
Multivariate
Consequences of outliers
1) They increase the error variance and reduce the power of statistical tests.
2) If they are not randomly distributed, they can compromise the normality of the distribution.
3) They can bias estimates and influence the results of statistical tests.
4) They may violate the basic assumptions of regression, ANOVA and other statistical methods.
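The first and third consequences can be seen with a handful of numbers. The sketch below uses hypothetical measurements and adds one extreme value, then compares the mean, median and sample standard deviation with and without it.

```python
import statistics

# Hypothetical measurements; the second list adds one extreme value.
clean = [48, 50, 51, 49, 52, 50, 47, 53]
with_outlier = clean + [150]

summary = {}
for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    summary[name] = {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
    }
    print(name, summary[name])
```

A single extreme value shifts the mean and inflates the standard deviation many times over, while the median does not move, which is why the spread-based statistics used by most tests are so sensitive to outliers.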
How to identify outliers
Univariate Outliers
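These notes do not name a specific detection rule here, so the following is an assumption rather than the method intended by the notes: a common convention for a single variable is to flag values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch with hypothetical data:

```python
import statistics

# Hypothetical measurements with one suspicious value.
data = [47, 48, 49, 50, 50, 51, 52, 53, 150]

# Quartiles via the standard library (default "exclusive" method).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# 1.5 * IQR fences: a conventional threshold, not a universal rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(outliers)
```

The multiplier 1.5 is a rule of thumb; a stricter or looser fence may suit a particular data set better.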
Multivariate Outliers
How to manage outliers
Variable transformation strategies