Data Preparation Flashcards

Question 1

Q

Phases of data preparation

Answer

A

Identification of variables
Univariate analysis
Bivariate analysis
Missing data
Outlier treatment
Variable transformation
Creation of new variables

Question 2

Q

Types/causes of missing data

Answer

A

During data extraction/collection:

a) Missing completely at random (MCAR): the probability of a value being missing is the same for every observation.
b) Missing at random (MAR): missingness in a variable is random but related to another observed input variable (e.g. missing data on age is more common for women than for men).
c) Missing that depends on unobserved predictors: missingness is not random and depends on a variable that has not been recorded.
d) Missing that depends on the missing value itself: the probability of missing depends on the value of the variable (e.g. people with low income tend not to declare it).
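
A quick way to look for mechanism (b) is to compare the missing rate of one variable across the groups of another. The sketch below assumes a hypothetical list of records in which `None` marks a missing age; the helper name `missing_rate_by_group` is made up for illustration.

```python
# Toy records (hypothetical data): None marks a missing age.
records = [
    {"sex": "F", "age": None},
    {"sex": "F", "age": 34},
    {"sex": "F", "age": None},
    {"sex": "F", "age": 51},
    {"sex": "M", "age": 29},
    {"sex": "M", "age": 45},
    {"sex": "M", "age": None},
    {"sex": "M", "age": 38},
]

def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: share of rows in that group whose value_key is missing}."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        if row[value_key] is None:
            missing[group] = missing.get(group, 0) + 1
    return {group: missing.get(group, 0) / totals[group] for group in totals}

rates = missing_rate_by_group(records, "sex", "age")
print(rates)  # F: 0.5, M: 0.25
```

A clear gap between groups suggests the data is not missing completely at random. Cases (c) and (d) cannot be checked this way, because they depend on values that were never recorded.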

Question 3

Q

How to handle missing values:

Answer

A

1) Delete them.
Listwise deletion - the entire observation that contains missing data is deleted. Risk: a much smaller sample.

Pairwise deletion - each statistic is calculated with all values available for the variables involved. Risk: different sample sizes for each variable. Deletion methods suit data that is missing completely at random; in that case pairwise deletion is preferred because it keeps more data.

2) Impute with mean/median/mode. Use the mean or median for numeric data and the mode for categorical data.

Generalised replacement: one value is computed from all available data and used for every gap (one size fits all).
Similarity replacement: values are computed separately for each category and imputed accordingly. For example, the average height of men is imputed for men and the average height of women for women.

3) Build a predictive model. Train a model on the observations without missing values and use it to estimate the values that will replace the missing data.
Risk: the estimated values tend to be more regular than the true values.

4) Imputation with K-NN. Missing values are filled in from the k most similar observations. Advantage: it works for both qualitative and quantitative variables.
Risk: computationally expensive on large datasets and sensitive to the choice of k.
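
Strategies 1, 2 and 4 above can be sketched with the standard library alone. The data, the column names and the helper `knn_impute_height` are hypothetical, and the K-NN part measures similarity on a single variable (weight) only to keep the example short.

```python
from statistics import mean

# Hypothetical toy data: None marks a missing value.
rows = [
    {"sex": "M", "height": 180.0, "weight": 80.0},
    {"sex": "M", "height": None, "weight": 82.0},
    {"sex": "M", "height": 176.0, "weight": None},
    {"sex": "F", "height": 165.0, "weight": 60.0},
    {"sex": "F", "height": None, "weight": 58.0},
    {"sex": "F", "height": 161.0, "weight": 55.0},
]

# 1) Listwise deletion: keep only rows with no missing value at all.
listwise = [r for r in rows if None not in r.values()]

# 1) Pairwise deletion: each statistic uses every value available for it,
#    so different variables end up with different sample sizes.
heights = [r["height"] for r in rows if r["height"] is not None]
weights = [r["weight"] for r in rows if r["weight"] is not None]

# 2) Generalised replacement: one overall mean fills every gap.
overall_mean = mean(heights)
generalised = [
    {**r, "height": overall_mean if r["height"] is None else r["height"]}
    for r in rows
]

# 2) Similarity replacement: a separate mean per category.
group_mean = {
    sex: mean(r["height"] for r in rows
              if r["sex"] == sex and r["height"] is not None)
    for sex in {r["sex"] for r in rows}
}
similarity = [
    {**r, "height": group_mean[r["sex"]] if r["height"] is None else r["height"]}
    for r in rows
]

# 4) K-NN: average the height of the k complete rows closest in weight.
def knn_impute_height(target, k=2):
    donors = [r for r in rows
              if r["height"] is not None and r["weight"] is not None]
    donors.sort(key=lambda r: abs(r["weight"] - target["weight"]))
    return mean(r["height"] for r in donors[:k])

knn_value = knn_impute_height(rows[1])

print(len(listwise), len(heights), len(weights))  # 3 4 5
print(generalised[1]["height"], similarity[1]["height"], knn_value)  # 170.5 178.0 172.5
```

The three imputed heights for the same row differ, which shows how much the choice of method matters even on a tiny sample.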

Question 4

Q

Types of outliers and origins

Answer

A

Univariate
Multivariate

Natural Outlier.
Non-Natural / Due to errors:
Intentional outlier: occurs with self-reported sensitive data. Example: when young people are interviewed about alcohol consumption, only some of them will report the actual value.
Measurement errors: the measurement instrument is faulty. Example: a faulty weighing machine.
Experimental errors: an abnormal event affected the outcome of the experiment.
Sampling error: for example, we want to measure the height of a few athletes and by mistake include a couple of basketball players in the sample.
Data entry errors: human errors made while collecting, recording, or entering data.
Data processing errors: when data is extracted from several sources, manipulation or extraction errors can introduce outliers into the final dataset.

Question 5

Q

Consequences of outliers

Answer

A

1) They increase the error variance and reduce the power of statistical tests.
2) If outliers are not randomly distributed, they can compromise the normality of the distribution.
3) They can bias or otherwise influence estimates and test results.
4) They can violate the basic assumptions of regression, ANOVA and other statistical methods.
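
The first and third points can be seen on a small made-up sample: adding one extreme value shifts the mean and inflates the standard deviation, while the median barely moves. The numbers below are hypothetical.

```python
from statistics import mean, median, stdev

# Hypothetical sample, then the same sample plus one extreme value.
clean = [10, 11, 12, 13, 14]
with_outlier = clean + [100]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    # Compare location (mean, median) and spread (sample standard deviation).
    print(label, round(mean(data), 2), median(data), round(stdev(data), 2))
```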

Question 6

Q

How to identify outliers

Answer

A

Univariate Outliers

Visual inspection
Values more than 1.5 × IQR below the first quartile or above the third quartile / outside the 5th and 95th percentiles / more than 3 standard deviations from the mean.
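
The IQR and standard deviation rules can be sketched as follows. The sample is hypothetical, the helper name `std_outliers` is made up, and the quartiles use the default method of `statistics.quantiles`, so other tools may give slightly different fences.

```python
from statistics import mean, quantiles, stdev

# Hypothetical sample with one suspicious value.
data = [12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 45]

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < low or x > high]

# Standard deviation rule: flag values more than `threshold` std from the mean.
def std_outliers(values, threshold=3.0):
    m, s = mean(values), stdev(values)
    return [x for x in values if abs(x - m) > threshold * s]

print(iqr_outliers)             # [45]
print(std_outliers(data))       # []
print(std_outliers(data, 2.0))  # [45]
```

In this small sample the 3 standard deviation rule flags nothing, because the extreme value inflates the standard deviation itself; a looser threshold of 2 catches it. This is one reason the IQR rule is often more reliable on small samples.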

Multivariate Outliers

Mahalanobis distance (compare the squared distance against a chi-square cutoff, with degrees of freedom equal to the number of variables).
Cook’s D (calculated by removing the i-th data point from the model and refitting the regression; it summarises how much all the fitted values in the regression model change when the i-th observation is removed).
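
A minimal sketch of the Mahalanobis check for two variables, using only the standard library. The data points are hypothetical, the 0.95 probability level is an assumption, and the closed-form cutoff below holds only for two variables; with more variables a chi-square table or a statistics library would be needed.

```python
import math
from statistics import mean

# Hypothetical (height cm, weight kg) pairs; the last one breaks the pattern.
points = [(160, 55), (165, 60), (170, 65), (175, 70), (180, 75),
          (168, 64), (172, 66), (162, 58), (178, 72), (165, 85)]

xs, ys = zip(*points)
mx, my = mean(xs), mean(ys)
n = len(points)

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
det = sxx * syy - sxy ** 2

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    dx, dy = x - mx, y - my
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# With 2 degrees of freedom the chi-square quantile is -2 * ln(1 - p).
cutoff = -2 * math.log(1 - 0.95)  # about 5.99

outliers = [p for p in points if mahalanobis_sq(*p) > cutoff]
print(outliers)  # [(165, 85)]
```

Neither the height of 165 nor the weight of 85 is extreme on its own; the point is flagged because the combination is unusual given how the two variables move together.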

Question 7

Q

How to manage outliers

Answer

A

Elimination of observations.
Data transformations: take the log of the values, or reduce their impact by assigning them lower weights.
Treating them as separate group.
Replacing values.
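
Three of these options can be sketched in a few lines. The sample and the bounds `low` and `high` are hypothetical; in practice the bounds would come from whichever detection rule was used.

```python
import math

# Hypothetical sample with one extreme value.
data = [12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 45]

# Assumed bounds for what counts as a regular value.
low, high = 8.5, 20.5

# Elimination: drop values outside the bounds.
kept = [x for x in data if low <= x <= high]

# Replacing values: cap extreme values at the bounds instead of dropping them.
capped = [min(max(x, low), high) for x in data]

# Transformation: the log compresses large values more than small ones.
logged = [math.log(x) for x in data]

print(len(kept), capped[-1])  # 10 20.5
```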

Question 8

Q

Variable transformation strategies

Answer

A

Categorical variables: convert to dummy variables
Right skewed distribution: log, square root, cube root
Left skewed distribution: square, cube, exponential
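
A minimal sketch of two of these transformations, using hypothetical values. The `skewness` helper is an illustrative population-skewness formula, not a library function.

```python
import math

# Categorical to dummies: one 0/1 column per category (hypothetical values).
colors = ["red", "green", "blue", "green"]
categories = sorted(set(colors))
dummies = [{f"color_{c}": int(value == c) for c in categories}
           for value in colors]

def skewness(values):
    """Population skewness: mean cubed deviation divided by std cubed."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum((v - m) ** 3 for v in values) / (n * sd ** 3)

# Right-skewed values: the log pulls in the long right tail.
values = [1, 2, 4, 8, 16, 32, 64, 128]
logged = [math.log(v) for v in values]

print(dummies[0])  # {'color_blue': 0, 'color_green': 0, 'color_red': 1}
print(round(skewness(values), 2), round(abs(skewness(logged)), 2))
```

The values here double at each step, so their logs are evenly spaced and the skew disappears almost entirely; real data rarely transforms this cleanly. For a left-skewed variable the same check applies with `v ** 2`, `v ** 3` or `math.exp(v)` in place of the log.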

(8 cards)