Lecture 5 - Data Cleaning Flashcards

(27 cards)

1
Q

How to convert invalid values to NA?

A

A quick way to treat the problem in the age and income variables is to convert the invalid values to NA, as if they were missing values.
We can use the mutate() and na_if() functions from the dplyr package.
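
A minimal sketch of this conversion, assuming a hypothetical customers data frame in which an age of 0 and a negative income count as invalid (the data and column values are made up for illustration):

```r
library(dplyr)

# Hypothetical data: age 0 and negative income are treated as invalid.
customers <- data.frame(
  age    = c(34, 0, 51, 0),
  income = c(52000, 31000, -4500, 78000)
)

customers_clean <- customers %>%
  mutate(
    age    = na_if(age, 0),                  # turn the single value 0 into NA
    income = ifelse(income < 0, NA, income)  # turn any negative income into NA
  )

customers_clean
```

na_if() only matches one specific value, so a range condition such as "income below zero" needs ifelse() or a similar construct.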

2
Q

What do mutate() and na_if() do?

A

mutate() adds columns to a data frame or modifies existing columns.
na_if() turns a specific problematic value into NA.

3
Q

What are sentinel values?

A

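Special values (e.g., 0, -1 or 999) stored in a variable to signal a condition such as missing, unknown or not applicable data, rather than an actual measurement. They need to be detected and converted (e.g., to NA) before the variable can be used.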
REFER TO SLIDES

4
Q

How to deal with outliers?

A

If we suspect that there are outliers in a variable (column) in our dataset, we should consider dealing with them in the data cleaning process.
We can use boxplot.stats() to help us identify the outliers.
Invalid values, such as negative incomes, can be fixed by turning them into NA.

REFER TO SLIDES
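
A small sketch of both steps, assuming a made-up income vector with one negative entry and one extreme value:

```r
# Hypothetical income values (not from the lecture data).
income <- c(42000, 38500, 51000, -2000, 47000, 45500, 400000, 39000)

# boxplot.stats() returns the points lying beyond the whiskers in $out.
flagged <- boxplot.stats(income)$out
flagged

# Negative incomes are invalid, so convert them to NA.
income_clean <- ifelse(income < 0, NA, income)
income_clean
```

Here boxplot.stats() flags both the negative value and the very high one; whether the high value is removed still has to be justified separately.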

5
Q

What are the rules about outliers?

A

Like the missing values issue that we will look at later on, we need to justify what to do with the high income values that are identified as outliers.
How many outliers are there?
If only a few, then we can omit them, e.g., set them to NA.

REFER TO SLIDES

6
Q

What are missing values?

A

An important feature of R is that it allows for NA (“not available”).
NA represents an unknown value.
Missing values are “contagious”: the result of almost any operation involving an unknown value is also unknown.

REFER TO SLIDES
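
A short illustration of this behaviour in base R (the vector x is made up):

```r
x <- c(10, NA, 30)

NA + 1                  # NA: arithmetic with an unknown value is unknown
NA > 5                  # NA: so is a comparison
mean(x)                 # NA, because one element is unknown
mean(x, na.rm = TRUE)   # 20: missing values are removed before averaging
is.na(x)                # FALSE TRUE FALSE: the way to test for NA

# One of the few exceptions: the result is known whatever the NA stands for.
NA & FALSE              # FALSE
```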

7
Q

How to deal with missing values?

A

Strategies for treating missing values vary depending on the answers to the
following two questions:
How many?
Why are they missing?
===
Fundamentally, there are two things you can do with these variables:
Drop the rows with missing values, or
Convert the missing values to a meaningful value.

8
Q

When is it safe to drop rows?

A

When the dropped rows don't make up a significant share of the total rows.

9
Q

Check missing values

A

REFER TO SLIDES
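
The slides are not reproduced here. As a rough sketch, missing values can be counted per column with base R; the data frame below is hypothetical:

```r
# Hypothetical data frame with a few missing entries.
df <- data.frame(
  age    = c(34, NA, 51, NA),
  income = c(52000, 31000, NA, 78000),
  state  = c("WA", NA, "NSW", "VIC")
)

na_counts  <- colSums(is.na(df))      # number of NAs per column
na_share   <- colMeans(is.na(df))     # proportion of NAs per column
n_complete <- sum(complete.cases(df)) # rows with no NA at all

na_counts
na_share
n_complete
summary(df)                           # also reports NA counts for numeric columns
```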

10
Q

Missing values – To Drop or Not to Drop?

A

If data for a particular variable are missing from a large portion of the observations, or NAs are spread throughout the data, then consider:
- If the variable is categorical, create a new category (e.g., missing) for the variable.
- If the variable is numerical:
  - when values are missing randomly, replace them with the mean value or another appropriate estimate (a.k.a. imputing missing values);
  - when values are missing systematically, convert the variable to categorical and add a new category (e.g., missing or invalid), or replace the missing values with zero and add a masking variable.
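
A sketch of the "new category" option for both variable types, using made-up vectors; the category names and income bands are assumptions chosen for illustration:

```r
# Categorical variable: add an explicit "missing" category.
housing <- c("rented", NA, "owned", NA, "owned")
housing_fixed <- factor(ifelse(is.na(housing), "missing", housing))

# Numerical variable missing systematically: convert to bands,
# then label the NAs as their own category.
income <- c(12000, NA, 85000, 40000)
income_band <- as.character(cut(income,
                                breaks = c(0, 30000, 60000, Inf),
                                labels = c("low", "mid", "high")))
income_band[is.na(income_band)] <- "missing"
income_band <- factor(income_band, levels = c("low", "mid", "high", "missing"))

table(housing_fixed)
table(income_band)
```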

11
Q

What is listwise deletion?

A

If only a small proportion of values are missing and they tend to be for the same data points, then consider dropping those rows from your analysis; this is called listwise deletion.
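
In base R, listwise deletion can be sketched with na.omit(); the data frame is hypothetical:

```r
# Hypothetical data: row 3 is missing both age and income.
df <- data.frame(
  id     = 1:5,
  age    = c(34, 41, NA, 29, 55),
  income = c(52000, 31000, NA, 46000, 78000)
)

df_complete <- na.omit(df)              # same rows as df[complete.cases(df), ]
n_dropped   <- nrow(df) - nrow(df_complete)

df_complete
n_dropped
```

Checking n_dropped against the total number of rows is a quick way to confirm that only a small proportion was removed.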

12
Q

Missing values – Numerical variables

A

One might believe that the data collection failed at random so the missing values are independent of other variables. In this case, the missing values can be replaced by the mean or an appropriate estimate (e.g., the median).
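
A minimal sketch of mean and median imputation on a made-up age vector:

```r
age <- c(34, NA, 51, 29, NA, 50)

# Replace each NA with the mean (or median) of the observed values.
age_mean   <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
age_median <- ifelse(is.na(age), median(age, na.rm = TRUE), age)

age_mean     # NAs replaced by 41
age_median   # NAs replaced by 42
```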

13
Q

Missing values – imputing with better estimate

A

The estimate can be improved (potentially better than mean) if other variables that relate to it are used for prediction.

Other models, such as regression models or clustering, can be used to make this prediction.
===
Note: imputing a missing value of an input variable based on the other input variables can be applied to categorical data as well.
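
As one possible illustration, the sketch below imputes income from age with a linear regression. The data and the choice of a simple lm() model are assumptions, not the lecture's method:

```r
# Hypothetical data: income is missing for two customers, age is complete.
df <- data.frame(
  age    = c(25, 32, 41, 48, 56, 37),
  income = c(30000, 38000, NA, 61000, 70000, NA)
)

# lm() drops rows with missing values by default, so the model is
# fitted on the complete rows only.
fit <- lm(income ~ age, data = df)

# Predict income for the rows where it is missing.
is_missing <- is.na(df$income)
df$income_imputed <- df$income
df$income_imputed[is_missing] <- predict(fit, newdata = df[is_missing, ])

df
```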

14
Q

What is a trick that can work well when handling missing values?

A

A trick that has worked well is to not only replace the NAs with the mean, but also add an additional indicator variable (e.g., isBAD) to keep track of which data points have been altered.
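
A sketch of the trick on a made-up income vector; the column name income_isBAD simply follows the isBAD naming used in the card:

```r
income <- c(52000, NA, 31000, NA, 78000)

# Record which entries were missing before they are overwritten.
income_isBAD <- as.numeric(is.na(income))

# Replace the NAs with the mean of the observed values.
income_filled <- ifelse(is.na(income), mean(income, na.rm = TRUE), income)

data.frame(income = income_filled, income_isBAD = income_isBAD)
```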

15
Q

Why can missingness indicators be useful?

A

If the missing values really are missing randomly, then the indicator variables are uninformative, and the model should ignore them.

If the missing values are missing systematically, then the indicator variables provide useful additional information to the modelling algorithm.
===
In many situations, the isBAD variables are even more informative and useful than the original variables!

16
Q

What is the vtreat package?

A

vtreat is a package for automatically treating missing values. It creates a treatment plan that records all the information needed so that the data treatment process can be repeated. You then use this treatment plan
- to “prepare” or treat your training data before you fit a model, and
- then again to treat new data before feeding it into the model.

REFER TO SLIDES FOR EXAMPLES
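
The slide examples are not reproduced here. A rough sketch of the workflow, assuming the vtreat package is installed and using its design_missingness_treatment() and prepare() functions on made-up data:

```r
library(vtreat)

# Hypothetical training data and new data, both with missing values.
train <- data.frame(
  age    = c(34, NA, 51, 29, 46),
  income = c(52000, 31000, NA, 46000, 78000)
)
new_data <- data.frame(
  age    = c(NA, 40),
  income = c(60000, NA)
)

# Build the treatment plan once, from the training data.
plan <- design_missingness_treatment(train, varlist = c("age", "income"))

# Apply the same plan to the training data and, later, to new data.
train_prepared <- prepare(plan, train)
new_prepared   <- prepare(plan, new_data)

colnames(train_prepared)
new_prepared
```

Because the plan is fitted on the training data only, new data is treated with the values learned at training time rather than with its own statistics.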

17
Q

What does recoding variables involve?

A

Change a continuous variable into a set of categories
Create a pass/fail variable based on a set of cutoff scores
Replace miscoded values with correct values
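
Two of these cases sketched in base R; the cutoff of 50 and the gender codes are assumptions made for illustration:

```r
# Create a pass/fail variable from a cutoff score.
score  <- c(72, 45, 88, 50)
result <- ifelse(score >= 50, "pass", "fail")

# Replace miscoded values with the correct codes.
gender       <- c("M", "m", "F", "male", "f")
gender_fixed <- ifelse(gender %in% c("M", "m", "male"), "M", "F")

result
gender_fixed
```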

18
Q

Discretizing continuous variables – Motivation

A

REFER TO SLIDES FOR EXAMPLES

19
Q

Explicit categorisation

A

Assigning category labels to ranges of a variable's values through explicit assignment.
REFER TO SLIDES FOR EXAMPLES
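
A sketch of explicit categorisation on a made-up age vector, with cut() shown as an equivalent shortcut; the age bands are assumptions:

```r
age <- c(15, 34, 67, 80, 52)

# Explicit categorisation: assign a label to each range in turn.
age_cat <- rep(NA_character_, length(age))
age_cat[age < 18]             <- "minor"
age_cat[age >= 18 & age < 65] <- "adult"
age_cat[age >= 65]            <- "senior"

# The same bands with cut(); right = FALSE makes intervals [a, b).
age_cut <- cut(age, breaks = c(0, 18, 65, Inf),
               labels = c("minor", "adult", "senior"), right = FALSE)

age_cat
age_cut
```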

20
Q

How do you rename variables?

A

Use fix() to invoke the interactive editor.
The dplyr package has a rename() function that’s useful for altering the names of variables.
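
A short sketch of rename() on a hypothetical data frame, followed by a base R alternative:

```r
library(dplyr)

df <- data.frame(custid = 1:3, inc = c(52000, 31000, 78000))

# rename() takes new_name = old_name pairs.
df <- rename(df, customer_id = custid, income = inc)
names(df)

# Base R alternative: assign into names() directly.
names(df)[names(df) == "income"] <- "annual_income"
names(df)
```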

21
Q

How do you get the number of days in the month of a specific date?

A

monthDays(as.Date('2020-02-01')), from the Hmisc package, returns the number of days in that month for that year (29 here, because 2020 is a leap year).
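
If Hmisc is not available, the same result can be sketched in base R. The helper name days_in_month is hypothetical, not a built-in function:

```r
# Hypothetical helper: number of days in the month that contains date d.
days_in_month <- function(d) {
  first_day  <- as.Date(format(d, "%Y-%m-01"))                  # first of this month
  next_first <- seq(first_day, by = "month", length.out = 2)[2] # first of next month
  as.numeric(next_first - first_day)
}

days_in_month(as.Date("2020-02-01"))  # 29 (leap year)
days_in_month(as.Date("2021-02-15"))  # 28
```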

22
Q

What is the default format of a date?

A

yyyy-mm-dd by default; other layouts, such as mm/dd/yyyy, can be produced with formatting.
REFER TO SLIDES

23
Q

How do you get the current date?

A

Sys.Date() returns only the date.
date() returns a character string with the day of the week (Mon, Tue, etc.), month, day of the month, time and year.
Sys.time() returns the date, time and time zone.

24
Q

How do you format the date?

A

Use the format() function.
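
A short sketch of format() with a fixed example date, plus the reverse step of parsing a non-default layout with as.Date():

```r
d <- as.Date("2020-02-01")

format(d)               # "2020-02-01", the default yyyy-mm-dd layout
format(d, "%m/%d/%Y")   # "02/01/2020"
format(d, "%d %B %Y")   # day, full month name (locale-dependent) and year

# Reading a date that is not in the default layout.
parsed <- as.Date("02/01/2020", format = "%m/%d/%Y")
parsed
```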

25
Q

How do you get the weekdays and months?

A

Use the weekdays() and months() functions.

26
Q

How do you extract information from a date/time?

A

julian() returns the number of days (possibly fractional) since the origin day (1 January 1970), which can be changed by the origin argument.

27
Q

Dates used for calculation

A

REFER TO SLIDES
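
The slides are not reproduced here. As a rough sketch, the date functions from the last few cards and some simple date arithmetic look like this in base R (the two dates are arbitrary examples):

```r
d1 <- as.Date("2020-02-01")
d2 <- as.Date("2020-03-01")

weekdays(d1)   # name of the day of the week, in the current locale
months(d1)     # name of the month, in the current locale

julian(d1)                                   # days since 1970-01-01
julian(d1, origin = as.Date("2020-01-01"))   # 31 days since the new origin

# Dates can be used directly in calculations.
d2 - d1                                        # time difference of 29 days
as.numeric(difftime(d2, d1, units = "days"))   # 29 as a plain number
d1 + 30                                        # "2020-03-02"
```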