Lecture 5 - Data Cleaning Flashcards by Jeremy Robertson

How to convert invalid values to NA?

A quick way to treat the problem in the age and income variables is to convert the invalid values to NA, as if they were missing values.
We can use the mutate() and na_if() functions from the dplyr package.
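
A minimal sketch of this conversion, assuming dplyr is installed. The customers data frame and the rule that an age of 0 or a negative income is invalid are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: an age of 0 and a negative income are treated as invalid
customers <- data.frame(
  age    = c(34, 0, 51, 28),
  income = c(52000, 31000, -6900, 47000)
)

cleaned <- customers %>%
  mutate(
    age    = na_if(age, 0),                  # the exact value 0 becomes NA
    income = ifelse(income < 0, NA, income)  # a condition, so ifelse() is used instead of na_if()
  )
```

na_if() only matches one exact value, which is why the negative incomes are handled with ifelse() here.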

What do mutate() and na_if() do?

mutate() adds columns to a data frame or modifies existing columns.
na_if() turns a specific problematic value into NA.

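What are sentinel values?

Special values that stand in for a condition such as "unknown" or "not applicable" in an otherwise valid field (for example, an age of 0). They need to be identified and converted (e.g., to NA) before the variable can be used as a numeric value.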
REFER TO SLIDES

How to deal with outliers?

If we suspect that there are outliers in a variable (column) in our dataset,
we should consider dealing with them in the data cleaning process.
We could use boxplot.stats() to help us identify the outliers.
We could fix the problem by turning all the negative incomes to NA.
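
A small sketch of both steps in base R. The income values are made up for illustration.

```r
# Hypothetical income values, including one negative and one extreme entry
income <- c(-6900, 21000, 25000, 30000, 32000, 35000, 40000, 45000, 500000)

# boxplot.stats() flags points lying more than 1.5 times the hinge spread
# beyond the hinges (1.5 is its default coef)
outliers <- boxplot.stats(income)$out

# Negative incomes are invalid rather than merely extreme, so mark them as missing
income_clean <- ifelse(income < 0, NA, income)
```

With these values, outliers holds -6900 and 500000. Only the negative value is turned into NA; what to do with the high value still needs to be justified.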

REFER TO SLIDES

What are the rules about outliers?

Like the missing values issue that we will look at later on, we need to justify what to do with the high income values that are identified as outliers.
How many outliers are there?
If only a few, then we can omit them, e.g., set them to NA.

REFER TO SLIDES

What are missing values?

An important feature of R is that it allows for NA (“not available”).
NA represents an unknown value.
Missing values are “contagious”: the result of almost any operation involving an unknown value will also be unknown.
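
A short base R illustration of this "contagious" behaviour, using a made-up vector:

```r
x <- c(10, NA, 30)

shifted   <- x + 1                  # 11 NA 31: the NA propagates
plain_avg <- mean(x)                # NA, because one input is unknown
known_avg <- mean(x, na.rm = TRUE)  # 20, computed from the known values only

# NA == NA is itself NA, so test for missingness with is.na() instead
is_missing <- is.na(x)              # FALSE TRUE FALSE
```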

REFER TO SLIDES

How to deal with missing values?

Strategies for treating missing values vary depending on the answers to the
following two questions:
How many?
Why are they missing?
===
Fundamentally, there are two things you can do with these variables:
Drop the rows with missing values, or
Convert the missing values to a meaningful value.

When is it safe to drop rows?

When the dropped rows don't make up a significant share of the total rows.

Check missing values

REFER TO SLIDES
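
The slides are not reproduced here. As one possible approach (not necessarily the one on the slides), missing values can be counted per column in base R. The customers data frame is hypothetical.

```r
# Hypothetical data frame with some missing entries
customers <- data.frame(
  age    = c(34, NA, 51, 28),
  income = c(52000, 31000, NA, NA),
  state  = c("WA", NA, "NY", "CA")
)

na_counts <- colSums(is.na(customers))   # number of NAs in each column
na_share  <- colMeans(is.na(customers))  # fraction of NAs in each column

# summary() also reports the NA count for numeric columns
summary(customers)
```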

Missing values – To Drop or Not to Drop?

If data for a particular variable are missing from a large portion of the observations, or NAs are spread throughout the data, then consider:
- If the variable is categorical, create a new category (e.g., missing) for the variable.
- If the variable is numerical:
  - when values are missing randomly, replace them with the mean value or an appropriate estimate, a.k.a. imputing missing values;
  - when values are missing systematically, convert them to categorical and add a new category (such as missing or invalid), or replace them with zero and add a masking variable.
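
A sketch of two of these options in base R: a new category for a categorical variable, and zero plus a masking variable for a numerical one. The variable names (housing, income, income_missing) are hypothetical.

```r
# Categorical variable: turn NA into its own category
housing <- c("rented", NA, "owned", NA)
housing_fixed <- factor(ifelse(is.na(housing), "missing", housing))

# Numerical variable with systematic missingness:
# replace NA with zero and keep a masking variable that records where
income <- c(52000, NA, 31000, NA)
income_missing <- as.integer(is.na(income))       # 1 where the value was missing
income_fixed   <- ifelse(is.na(income), 0, income)
```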

What is listwise deletion?

If only a small proportion of values are missing and they tend to be for the same data points, then consider dropping those rows from your analysis. This is called listwise deletion.
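
A minimal base R sketch: first check what share of rows would be lost, then drop them. The survey data frame is made up for illustration.

```r
# Hypothetical data: the NAs fall in the same row
survey <- data.frame(
  age    = c(34, NA, 51, 28, 45),
  income = c(52000, NA, 31000, 47000, 60000)
)

# Share of rows with at least one missing value
incomplete_share <- mean(!complete.cases(survey))

# Listwise deletion: keep only the complete rows
survey_complete <- na.omit(survey)
```

Here one row out of five is incomplete, so four rows remain after deletion.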

Missing values – Numerical variables

One might believe that the data collection failed at random so the missing values are independent of other variables. In this case, the missing values can be replaced by the mean or an appropriate estimate (e.g., the median).
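
A sketch of mean and median imputation in base R, using made-up income values:

```r
income <- c(52000, NA, 31000, 47000, NA)

# Replace each NA with the mean of the observed values
income_mean <- ifelse(is.na(income), mean(income, na.rm = TRUE), income)

# Or with the median, which is less sensitive to extreme values
income_median <- ifelse(is.na(income), median(income, na.rm = TRUE), income)
```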

Missing values – imputing with better estimate

The estimate can be improved (potentially better than mean) if other variables that relate to it are used for prediction.

We can use other models, such as regression models or clustering, for this prediction.
===
Note: imputing a missing value of an input variable based on the other input variables can be applied to categorical data as well.
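
As an illustration of the regression option, the sketch below predicts missing income from age with lm(). The data and the choice of age as the only predictor are assumptions made for this example.

```r
# Hypothetical data in which income rises with age
people <- data.frame(
  age    = c(25, 30, 35, 40, 45, 50),
  income = c(30000, 35000, NA, 45000, 50000, NA)
)

# Fit on the rows where income is known (lm() drops rows with NA by default)
fit <- lm(income ~ age, data = people)

# Fill the gaps with the model's predictions
missing <- is.na(people$income)
people$income[missing] <- predict(fit, newdata = people[missing, ])
```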

What is a trick that can work well for handling missingness?

A trick that has worked well is to not only replace the NAs with the mean, but also add an additional indicator variable (e.g., isBAD) to keep track of which data points have been altered.
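
A base R sketch of this trick. The income values are made up, and the column names income_clean and income_isBAD are only a naming choice for this example.

```r
income <- c(52000, NA, 31000, 47000, NA)

# Record which entries were missing before altering anything
income_isBAD <- is.na(income)

# Then replace the NAs with the mean of the observed values
income_clean <- ifelse(income_isBAD, mean(income, na.rm = TRUE), income)

treated <- data.frame(income_clean, income_isBAD)
```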

Why can missingness indicators be useful?

If the missing values really are missing randomly, then the indicator variables are uninformative, and the model should ignore them.

If the missing values are missing systematically, then the indicator variables provide useful additional information to the modelling algorithm.
===
In many situations, the isBAD variables are even more informative and useful than the original variables!

What is the vtreat package?

vtreat is a package for automatically treating missing values. It creates a treatment plan that records all the information needed so that the data treatment process can be repeated. You then use this treatment plan
- to “prepare” or treat your training data before you fit a model, and
- then again to treat new data before feeding it into the model.

REFER TO SLIDES FOR EXAMPLES
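
The slide examples are not reproduced here. The following is only a rough sketch of the plan-then-prepare workflow, assuming vtreat is installed and using its designTreatmentsZ() and prepare() functions. The train and new_data frames are hypothetical.

```r
library(vtreat)

# Hypothetical training data with missing values
train <- data.frame(
  age    = c(34, NA, 51, 28, 45, 39),
  income = c(52000, 31000, NA, 47000, 60000, NA)
)

# Create the treatment plan from the training data
# (designTreatmentsZ() does not need an outcome variable)
plan <- designTreatmentsZ(train, varlist = c("age", "income"), verbose = FALSE)

# Treat the training data before fitting a model
train_treated <- prepare(plan, train)

# Later, treat new data with the same plan before feeding it into the model
new_data <- data.frame(age = c(NA, 62), income = c(41000, NA))
new_treated <- prepare(plan, new_data)
```

The treated frames should contain no NAs; inspect names(train_treated) to see which cleaned and indicator columns the plan produced.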

What does recoding variables involve?

Creating new values of a variable based on its existing values, for example:
- Change a continuous variable into a set of categories
- Create a pass/fail variable based on a set of cutoff scores
- Replace miscoded values with correct values
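
A base R sketch of two of these recodings. The scores data, the miscoded value 999 and the pass mark of 50 are all hypothetical.

```r
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  mark    = c(72, 48, 999, 55)
)

# Replace a miscoded value (here 999 is assumed to mean "not recorded")
scores$mark[scores$mark == 999] <- NA

# Create a pass/fail variable from a cutoff score (50 is an assumed cutoff)
scores$result <- ifelse(scores$mark >= 50, "pass", "fail")
```

The student with the missing mark gets NA for result as well, since the comparison with NA is unknown.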

Discretizing continuous variables – Motivation

REFER TO SLIDES FOR EXAMPLES

Explicit categorisation

This is adding labels to ranges of data through assignment.
REFER TO SLIDES FOR EXAMPLES
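
The slide examples are not reproduced here. One way this could look in base R is sketched below; the age bands and labels are assumptions chosen for illustration.

```r
age <- c(23, 41, 67, 80, 35)

# Explicit categorisation: assign a label to each range of values
agecat <- rep(NA_character_, length(age))
agecat[age < 35]             <- "young"
agecat[age >= 35 & age < 65] <- "middle"
agecat[age >= 65]            <- "senior"

# cut() gives the same result in one call; right = FALSE makes the
# intervals closed on the left, i.e. [0, 35), [35, 65), [65, Inf)
agecat_cut <- cut(age, breaks = c(0, 35, 65, Inf),
                  labels = c("young", "middle", "senior"), right = FALSE)
```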

How do you rename variables?

Use fix() to invoke the interactive editor.
The dplyr package has a rename() function that’s useful for altering the names of variables.
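
A short sketch of rename(), assuming dplyr is installed. The column names are hypothetical.

```r
library(dplyr)

survey <- data.frame(q1 = c(34, 51), q2 = c("a", "b"))

# rename() takes new_name = old_name pairs
survey <- rename(survey, age = q1, group = q2)

names(survey)  # "age" "group"
```

In base R the same can be done by assigning to names(), e.g. names(survey)[names(survey) == "q1"] <- "age".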

How to get the number of days in a month for a specific year?

monthDays(as.Date('2020-02-01')) will get you the number of days in that month for that year (monthDays() comes from the Hmisc package).
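
If the package providing monthDays() is not available, the same number can be worked out in base R. This is an alternative sketch, not the method from the lecture.

```r
first_day <- as.Date("2020-02-01")

# The first day of the following month
next_month <- seq(first_day, by = "month", length.out = 2)[2]

# Days between the two dates = length of the month (29, since 2020 is a leap year)
days_in_month <- as.integer(next_month - first_day)
```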

What is the format of the date?

yyyy-mm-dd by default; other layouts such as mm/dd/yyyy can be used by specifying a format.
REFER TO SLIDES
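
A small base R sketch of reading dates in the default and in a non-default layout:

```r
# Default layout: yyyy-mm-dd needs no format argument
d1 <- as.Date("2020-02-01")

# mm/dd/yyyy has to be described with a format string
d2 <- as.Date("02/01/2020", format = "%m/%d/%Y")

same_day <- identical(d1, d2)  # TRUE: both are 1 February 2020
```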

How do you get the current date?

Sys.Date() gets only the date.
date() gets you a character string with the weekday (Mon, Tue, etc.), month, day of the month, time and year.
Sys.time() gives you the date, time and timezone.
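
The three calls side by side; the printed values depend on when and where the code is run, so none are shown.

```r
today <- Sys.Date()  # a Date object
stamp <- date()      # a character string
now   <- Sys.time()  # a POSIXct date-time

class(today)
class(stamp)
class(now)
```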

How do you format the date?

Use the format() function.
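
A brief sketch of format() applied to a Date, with a few common conversion codes:

```r
d <- as.Date("2020-02-01")

us_style  <- format(d, "%m/%d/%Y")  # "02/01/2020"
year_only <- format(d, "%Y")        # "2020"
long_form <- format(d, "%d %B %Y")  # month name depends on the locale
```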

How do you get the weekdays and months?

Use the weekdays() and months() functions.
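
A short example; the names returned depend on the locale, and the comments assume an English one.

```r
d <- as.Date(c("2020-02-01", "2020-12-25"))

day_names   <- weekdays(d)  # "Saturday" "Friday" in an English locale
month_names <- months(d)    # "February" "December" in an English locale
```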

How do you extract information from date/time?

julian() returns the number of days (possibly fractional) since the origin day (1970 Jan 1st), which can be changed by the origin argument.
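
A small sketch of julian() with the default and with a custom origin:

```r
# Days since the default origin, 1970-01-01
since_epoch <- julian(as.Date("1970-01-10"))  # 9

# Days since a chosen origin
since_new_year <- julian(as.Date("2020-02-01"), origin = as.Date("2020-01-01"))  # 31
```

The result carries the origin as an attribute; wrap it in as.numeric() to get a plain number.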

Dates used for calculation

REFER TO SLIDES
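
The slides are not reproduced here. As a general illustration, Date objects support simple arithmetic in base R:

```r
start <- as.Date("2020-02-01")
end   <- as.Date("2020-03-01")

gap_days  <- end - start                            # Time difference of 29 days
gap_weeks <- difftime(end, start, units = "weeks")  # the same gap in weeks
later     <- start + 30                             # "2020-03-02"
```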

(27 cards)