What is a way to display ggplots together in R?
library(patchwork) #allows you to display ggplots together using plot1 + plot2
How do you print the number of missing values in a model?
naprint(na.action(model))
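A minimal sketch of the call above, using a made-up data frame (the names `df`, `dropped` and `msg` are illustrative only). It relies on `lm()` dropping incomplete rows by default and storing them in the model's `na.action` component.

```r
# Toy data with two incomplete rows (values invented for illustration)
df <- data.frame(y = c(1.2, 2.1, NA, 4.3, 5.1, 5.8),
                 x = c(1, NA, 3, 4, 5, 6))

# lm() removes incomplete rows by default (na.action = na.omit)
model <- lm(y ~ x, data = df)

# na.action() returns the indices of the rows that were dropped
dropped <- na.action(model)

# naprint() turns that into a short human-readable message
msg <- naprint(dropped)
print(msg)
```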
What is the code for creating a table display of the missingness patterns?
md.pattern(data)
How do you display the number of NAs in each variable of a dataset?
colSums(is.na(data))
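As a small illustration, assuming a made-up data frame: `is.na()` gives a logical matrix and `colSums()` counts the `TRUE` values per column.

```r
# Toy data (invented for illustration)
data <- data.frame(age    = c(23, NA, 31, 45),
                   income = c(NA, NA, 52000, 61000),
                   city   = c("A", "B", NA, "D"))

# Number of missing values per variable
na_counts <- colSums(is.na(data))
print(na_counts)
```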
What are embedded or model based methods for missing data?
Don't impute; handle the missing values within the prediction model itself
What are the advantages of multiple imputation? (3)
How do you perform LOCF imputation in R?
tidyr::fill(data, variable)
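To show what last observation carried forward does, here is a base R sketch. The helper `locf()` is hypothetical and written only for illustration; it mimics the default downward direction of `tidyr::fill()` for a single vector.

```r
# Hypothetical helper: carry the last observed value forward
locf <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]  # copy the previous (possibly filled) value
  }
  x
}

filled <- locf(c(NA, 2, NA, NA, 5, NA))
print(filled)
```

A leading NA stays NA because there is no earlier observation to carry forward.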
Under what conditions is listwise deletion unbiased? What happens to the standard error?
Mean, regression weight and correlation are unbiased only under NDD. Standard error is too large
What is linkage?
The dissimilarity between two clusters when one or both contain multiple observations
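A sketch of how common linkage choices turn pairwise distances into one between-cluster dissimilarity. The two groups `A` and `B` are simulated and purely illustrative.

```r
set.seed(3)
A <- matrix(rnorm(20, mean = 0), ncol = 2)  # cluster A: 10 points
B <- matrix(rnorm(20, mean = 5), ncol = 2)  # cluster B: 10 points

# All pairwise Euclidean distances between a point in A and a point in B
pair <- as.matrix(dist(rbind(A, B)))[1:10, 11:20]

single   <- min(pair)    # single linkage: closest pair
complete <- max(pair)    # complete linkage: farthest pair
average  <- mean(pair)   # average linkage: mean over all pairs
centroid <- sqrt(sum((colMeans(A) - colMeans(B))^2))  # distance between centroids

print(c(single = single, average = average,
        complete = complete, centroid = centroid))

# The same choices are available through hclust()'s method argument
hc <- hclust(dist(rbind(A, B)), method = "complete")
```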
What are the disadvantages of single and centroid and complete linkage?
How do you perform mean imputation in R? (2 ways)
library("mice")
imp
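A base R sketch of mean imputation for a numeric variable and mode imputation for a categorical one, on made-up data. This is an illustration, not necessarily the approach the card had in mind; the `mice` package also offers `method = "mean"` as another route.

```r
# Toy data (invented for illustration)
data <- data.frame(age   = c(20, NA, 40, 30),
                   group = factor(c("a", "b", NA, "b")))

# Numeric variable: replace NAs by the mean of the observed values
data$age[is.na(data$age)] <- mean(data$age, na.rm = TRUE)

# Categorical variable: replace NAs by the most frequent observed level
mode_val <- names(which.max(table(data$group)))
data$group[is.na(data$group)] <- mode_val

print(data)
```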
What are the advantages of the indicator method? (2)
How do you perform k-means clustering in R, and what does the output consist of?
means_cluster
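A minimal sketch with base R's `kmeans()` on simulated data (the data and the name `km` are illustrative). The returned list holds, among other things, the cluster assignments, the cluster centers, the cluster sizes and the within-cluster sums of squares.

```r
set.seed(1)
# Two simulated groups of 20 points each in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# nstart > 1 reruns the algorithm from several random starts and keeps the best
km <- kmeans(x, centers = 2, nstart = 20)

names(km)        # components of the output
km$cluster       # cluster label for each observation
km$centers       # cluster means
km$size          # number of observations per cluster
km$withinss      # within-cluster sum of squares, per cluster
km$tot.withinss  # total within-cluster sum of squares
```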
Under what conditions is stochastic regression imputation unbiased? What happens to the standard error?
Mean, regression weights and correlation are unbiased under SDD
Standard error is too small
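A sketch of stochastic regression imputation on simulated data: the imputed value is the regression prediction plus noise drawn with the residual standard deviation. All variable names and the missingness model are assumptions made for this example.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Make y missing with a probability that depends only on x (SDD)
miss  <- rbinom(n, 1, plogis(x)) == 1
y_obs <- ifelse(miss, NA, y)

fit  <- lm(y_obs ~ x)  # fitted on the observed cases only
pred <- predict(fit, newdata = data.frame(x = x[miss]))

# Plain regression imputation: predictions only
y_reg <- y_obs
y_reg[miss] <- pred

# Stochastic regression imputation: predictions plus residual noise
y_sri <- y_obs
y_sri[miss] <- pred + rnorm(sum(miss), mean = 0, sd = summary(fit)$sigma)

# Compare the variance of each completed variable with the full-data variance
print(c(regression = var(y_reg), stochastic = var(y_sri), full = var(y)))
```

The completed data are still treated as if they had been observed, which is why the standard error remains too small.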
What is regression imputation?
First builds a model from the observed data
Predictions for the incomplete cases are then calculated under the fitted model and serve as replacements for the missing data
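The two steps can be sketched in base R as follows, using a small made-up data frame (the variable names are illustrative).

```r
# Toy data: weight is missing for the last two cases
data <- data.frame(height = c(150, 160, 170, 180, 190, 165, 175),
                   weight = c(52, 58, 69, 80, 91, NA, NA))

# Step 1: fit the model on the observed cases (lm drops incomplete rows)
fit <- lm(weight ~ height, data = data)

# Step 2: replace the missing values by the model predictions
miss <- is.na(data$weight)
data$weight[miss] <- predict(fit, newdata = data[miss, ])

print(data)
```

Every imputed value lies exactly on the fitted regression line.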
What is mean imputation?
Replace missing data with the mean (or, for categorical data, the mode)
What are the formulas for NDD, SDD and UDD, where M indicates whether variable 2 is missing (1) or not (0)?
NDD: Pr(M=1 | var1, var2) = Pr(M=1)
SDD: Pr(M=1 | var1, var2) = Pr(M=1 | var1)
UDD: Pr(M=1 | var1, var2) can’t be reduced
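A simulation sketch of the three mechanisms. The logistic form of the missingness probabilities and the 0.3 missingness rate are assumptions chosen for illustration; only what M depends on matters.

```r
set.seed(5)
n    <- 10000
var1 <- rnorm(n)
var2 <- 0.6 * var1 + rnorm(n)

# NDD: missingness of var2 depends on nothing
m_ndd <- rbinom(n, 1, 0.3)
# SDD: missingness of var2 depends only on the observed var1
m_sdd <- rbinom(n, 1, plogis(var1))
# UDD: missingness of var2 depends on var2 itself
m_udd <- rbinom(n, 1, plogis(var2))

# Mean of var2 among the cases where it would be observed (M = 0)
obs_means <- c(full = mean(var2),
               NDD  = mean(var2[m_ndd == 0]),
               SDD  = mean(var2[m_sdd == 0]),
               UDD  = mean(var2[m_udd == 0]))
print(round(obs_means, 3))
```

Only under NDD does the mean of the observed cases stay close to the full-data mean.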
What is the k-medoids clustering algorithm?
How do you perform multiple imputation and fit a model in R?
imp
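A hedged sketch of the usual `mice` workflow (impute, analyse each completed dataset, pool the results). It uses the `nhanes` example data shipped with `mice`; the model formula, `m = 5` and the seed are arbitrary choices for illustration. The code is skipped if `mice` is not installed.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # 1. Create m = 5 imputed datasets
  imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

  # 2. Fit the same model on each imputed dataset
  fit <- with(imp, lm(chl ~ age + bmi))

  # 3. Pool the estimates across the imputed datasets
  pooled <- pool(fit)
  print(summary(pooled))
} else {
  message("Package 'mice' is not installed; skipping the example.")
}
```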
What are internal validation indices and what are some popular methods?
Under what conditions is regression imputation unbiased? What happens to the standard error?
Mean and regression weights are unbiased under SDD
- For regression weights under SDD, this holds only if the factors that influence the missingness are part of the regression model
Standard error is too small
What is a “good” k-means clustering?
One for which the within-cluster variation W(C_k) is as small as possible
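A sketch of computing the within-cluster variation by hand on simulated data. Here it is measured as the sum of squared distances to the cluster centroid, which is what `kmeans()` reports as `withinss`; other definitions (for example via pairwise distances) differ only by a constant factor per cluster.

```r
set.seed(2)
# Two simulated groups of 15 points each
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# Within-cluster sum of squares for each cluster, computed manually
wss <- sapply(1:2, function(k) {
  pts      <- x[km$cluster == k, , drop = FALSE]
  centroid <- colMeans(pts)
  sum(sweep(pts, 2, centroid)^2)  # squared distances to the centroid
})

print(rbind(manual = wss, kmeans = km$withinss))
print(sum(wss))  # the quantity k-means tries to minimise
```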
When does hierarchical clustering give worse results than k-means clustering?
When the data don't have a hierarchical structure, e.g. when the best division into 2 groups is by gender but the best division into 3 groups is by nationality
How do you differentiate between SDD and UDD?
You can’t