What are some of the issues with data collection?
Why is it necessary to be fastidious in cleaning your data?
Remembering that my data is precious and tells a story, what is the best method for checking its accuracy?
Why should I look for out of range values?
What is missing data, and why does it matter?
*Missing Data is information not available for a subject (or case) about whom other information is available.
*Usually occurs when the participant fails to answer one or more questions in a survey.
I need to consider: is it Systematic or Random?
- I need to look for patterns and relationships underlying the missing data, so that when a remedy is applied the values stay as close as possible to the original distribution.
*Impact: missing data can reduce the sample size available for analysis and can also distort results.
What are the principles of missing data screening?
I need to deal with missing data prior to Cleaning & Analysis - but how do I do this?
I can test for the level of missingness by creating a dummy-coded variable (0 = present, 1 = missing) and then using a t-test to assess whether there is a mean difference between the two groups on the DV of interest.
*If the difference is not significant, how I deal with the missing data is less critical, although the degree of missingness is still important to look at next (more than 5% is a concern).
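A minimal sketch of this dummy-coding check (assuming an illustrative made-up DV and survey item, and computing Welch's t statistic by hand; SPSS or scipy's `ttest_ind` would give the p-value directly):

```python
import statistics

# Hypothetical example: does missingness on one survey item relate to the DV?
# dv: dependent-variable scores; item: a survey item with missing (None) values.
dv   = [12, 15, 11, 14, 16, 13, 15, 12, 14, 15]
item = [3, None, 4, 2, None, 5, 3, None, 4, 3]

# Dummy-code missingness: 1 = missing, 0 = present.
missing_flag = [1 if x is None else 0 for x in item]

group_missing = [d for d, m in zip(dv, missing_flag) if m == 1]
group_present = [d for d, m in zip(dv, missing_flag) if m == 0]

# Welch's t statistic (unequal variances assumed).
m1, m2 = statistics.mean(group_missing), statistics.mean(group_present)
v1, v2 = statistics.variance(group_missing), statistics.variance(group_present)
n1, n2 = len(group_missing), len(group_present)
t = (m1 - m2) / ((v1 / n1 + v2 / n2) ** 0.5)

print(f"mean (missing)={m1:.2f}, mean (present)={m2:.2f}, Welch t={t:.2f}")
# As a rough rule of thumb, |t| well below ~2 suggests no mean difference,
# i.e. the missingness may not be systematic with respect to this DV.
```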
There are 3 kinds of missing data, what are they?
The data may be:
*Missing Completely at Random – unpredictable (MCAR)
*Missing At Random but ignorable response (MAR)
*Missing not at random or non-ignorable (MNAR)
MNAR is the worst kind :-(
What are the 3 alternatives that used to be employed to handle missing data, and why are they not the most appropriate?
Historically, missing data has been handled using listwise deletion of the entire case, pairwise deletion (excluding the case only from calculations that involve the missing item), or replacement with the mean value of the item in question.
Primarily, listwise deletion eliminates other very relevant information, and although pairwise deletion lessens this effect, it still discards substantial, valuable data. Instead, replacing the missing values (imputation) appears to be the optimal way of dealing with the issue, rather than eliminating the assessed individual's pattern of responses.
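The trade-offs between these three historical approaches can be illustrated on a small hypothetical dataset:

```python
# Hypothetical 5-case dataset with missing values (None) on two variables.
x = [2.0, None, 4.0, 5.0, 3.0]
y = [10.0, 12.0, None, 14.0, 11.0]

# Listwise deletion: drop any case with a missing value on ANY variable.
listwise = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
# Only 3 of 5 cases survive - the valid values of the dropped cases are lost.

# Pairwise deletion: each calculation uses whatever cases are complete FOR IT.
x_valid = [a for a in x if a is not None]   # 4 cases for any x-only statistic
y_valid = [b for b in y if b is not None]   # 4 cases for any y-only statistic

# Mean replacement: substitute the variable's mean - retains all cases but
# shrinks the variance and ignores each case's pattern on other variables.
x_mean = sum(x_valid) / len(x_valid)
x_imputed = [a if a is not None else x_mean for a in x]

print(len(listwise), len(x_valid), x_imputed)
```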
SPSS will use listwise or pairwise deletion to deal with missing data during most analyses as the default. Why is this not a great idea?
A better method, which avoids the loss of valuable data or inappropriate imputation, is either regression replacement or EM (expectation maximisation, a model-based method) replacement.
Systematic replacement of missing values may be used in your research using SPSS; however, remember to report any changes made to the raw data set in your results section.
Why is Regression Replacement (RP) a better option for handling missing data?
The regression option provides an assigned value that takes into account the pattern of responses from each individual case on all other variables, supplying an adjusted (predicted) value for each participant through a regression analysis.
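A minimal sketch of the idea, assuming a made-up predictor x and a variable y with one missing value, using a hand-rolled one-predictor regression; real regression replacement uses all the other variables as predictors:

```python
# Hypothetical sketch: impute a missing value on y from x via simple linear
# regression fitted on the complete cases (ordinary least squares by hand).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, None, 9.9]

pairs = [(a, b) for a, b in zip(x, y) if b is not None]
n = len(pairs)
mx = sum(a for a, _ in pairs) / n
my = sum(b for _, b in pairs) / n

# OLS slope and intercept estimated from the complete cases only.
slope = (sum((a - mx) * (b - my) for a, b in pairs)
         / sum((a - mx) ** 2 for a, _ in pairs))
intercept = my - slope * mx

# Replace each missing y with its regression prediction: unlike mean
# replacement, the imputed value reflects that case's standing on x.
y_imputed = [b if b is not None else intercept + slope * a for a, b in zip(x, y)]
print([round(v, 2) for v in y_imputed])
```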
What other things do I need to remember when considering missing data?
*Missing data under 5% for an individual case or observation can generally be ignored, except when the missing data occurs in a specific nonrandom fashion (e.g., concentration in a specific set of questions, attrition at the end of the questionnaire, etc.). Particularly relevant in a large dataset.
The number of cases with no missing data must be sufficient for the selected analysis technique to be used if replacement values are not being substituted (imputed) for the missing data.
So what if I have more than 5% data missing?
Under 10% – any of the data replacement methods can be applied when missing data is this low, although the complete-case (listwise) method has been shown to be the least preferred.
10 to 20% – with this increased presence of missing data, the all-available (pairwise) and regression methods are most preferred for MCAR data, and model-based methods are necessary with MAR missing data processes.
Why is Expectation Maximisation (EM) so highly regarded when dealing with missing data?
Expectation Maximisation (EM):
*An iterative, model-based method: it alternates between estimating the missing values from the current parameter estimates (the E-step) and re-estimating the parameters from the completed data (the M-step) until the estimates converge.
*It produces maximum-likelihood estimates of the means and covariances rather than ad hoc replacements.
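The E/M alternation can be sketched for the simplest case (one incomplete variable predicted from one complete one); this is an illustrative toy with made-up numbers, not SPSS's actual EM routine:

```python
import statistics

# Hypothetical sketch of the EM idea: y has missing (None) values and is
# predicted from a complete variable x. E-step: fill in missing y from the
# current regression estimates; M-step: re-estimate the parameters from the
# completed data; repeat until the imputed values stabilise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.2, 3.9, None, 8.1, None, 12.1]

# Start by filling missing y with the observed mean.
obs = [b for b in y if b is not None]
filled = [b if b is not None else statistics.mean(obs) for b in y]

for _ in range(50):  # iterate E- and M-steps
    # M-step: estimate the regression parameters from the completed data.
    mx, my = statistics.mean(x), statistics.mean(filled)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, filled))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    # E-step: replace the missing y values with their expected values.
    filled = [b if b is not None else intercept + slope * a
              for a, b in zip(x, y)]

print([round(v, 2) for v in filled])
```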
So, what does the Regression Method of Data replacement have to offer?
What is the difference between EM and Regression Replacement?
EM gives you the estimated maximum-likelihood replacement, whereas the regression model takes each case's pattern of responses into account for the replacement.
What does Little’s MCAR test: χ2 statistic tell us?
*This statistic tests whether the missing data are MCAR (missing completely at random): a non-significant result is consistent with MCAR, whereas a significant result suggests the data are MAR or MNAR.
What does a Little’s MCAR test p-value of >.05 indicate?
When Little’s MCAR test is NOT significant, as in this case, it “indicates that the probability that the pattern of missing values diverges from randomness [is] >.05, so that MCAR may be inferred” (p. 63). This suggests that generated (imputed) values may be used to replace the missing data.
Problems that can arise to suggest EM is not the best method of data replacement, and what to do:
What are univariate outliers?
Univariate Outliers are extreme values on a single variable, identified as standardised scores beyond ±3.29 SD (p < .001, two-tailed).
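A minimal sketch of screening with the ±3.29 criterion, using made-up scores with one planted extreme value:

```python
import statistics

# Hypothetical scores with one extreme value planted at the end.
scores = [50, 52, 48, 51, 49, 50, 53, 47, 51, 50, 49, 52, 48, 50, 51, 120]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation

# Flag cases whose standardised score exceeds +/-3.29 (p < .001, two-tailed).
outliers = [s for s in scores if abs((s - mean) / sd) > 3.29]
print(outliers)
```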
How does one manage univariate outliers?
What are the 2 kinds of multivariate outliers? How does one manage multivariate outliers?
Multivariate Outliers:
Remind me from Andy Field, what are the key values for Mahalanobis Distance?
Mahalanobis distance measures the influence of a case by examining its distance from the centroid (the means of all the variables).
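A hand-rolled sketch for two made-up variables (inverting the 2x2 covariance matrix directly); statistical packages compute this for any number of variables:

```python
import statistics

# Hypothetical bivariate data; the last case is extreme on the COMBINATION
# of x and y even though neither value alone is wildly out of range.
x = [2.0, 3.0, 4.0, 5.0, 6.0, 3.5, 4.5, 5.5, 2.5, 6.0]
y = [4.0, 6.0, 8.0, 10.0, 12.0, 7.0, 9.0, 11.0, 5.0, 2.0]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
sxx = statistics.variance(x)
syy = statistics.variance(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Invert the 2x2 covariance matrix by hand.
det = sxx * syy - sxy * sxy
inv = [[syy / det, -sxy / det], [-sxy / det, sxx / det]]

# Squared Mahalanobis distance of each case from the centroid (mx, my).
def mahalanobis_sq(a, b):
    dx, dy = a - mx, b - my
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

d2 = [round(mahalanobis_sq(a, b), 2) for a, b in zip(x, y)]
print(d2)
# D^2 is compared against a chi-square critical value with df = number of
# variables (T & F suggest the p < .001 criterion) to flag multivariate outliers.
```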
With regard to data transformation: what do T & F recommend doing prior to dealing with outliers?
*If skewness or kurtosis is causing a significant problem, transformations may be undertaken:
*square root transformation for moderate (+ / -) skew (for negative skew, reflect the variable first)
*for substantial + skewness, use log transformations
NB: Transformations often reduce the impact of outliers. Transformations are best done on ungrouped data.
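The effect of the two transformations on a hypothetical positively skewed variable can be sketched as follows (the skewness function here is a simple moment-based version for illustration):

```python
import math
import statistics

# Hypothetical positively skewed variable (e.g. reaction times in seconds).
data = [1.0, 1.2, 1.1, 1.4, 1.3, 1.6, 2.0, 2.5, 3.8, 9.0]

def skewness(xs):
    # Simple moment-based (population) skew, for illustration only.
    m = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return sum(((v - m) / sd) ** 3 for v in xs) / len(xs)

sqrt_data = [math.sqrt(v) for v in data]        # for moderate positive skew
log_data = [math.log10(v + 1) for v in data]    # for substantial positive skew
# (Adding 1 before the log guards against zeros; for negative skew, reflect
# the variable first, then transform.)

print(round(skewness(data), 2),
      round(skewness(sqrt_data), 2),
      round(skewness(log_data), 2))
# Each transformation pulls in the long right tail, reducing the skew.
```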