How to convert invalid values to NA?
A quick way to treat the problem in the age and income variables is to convert the invalid values to NA, as if they were missing values.
We can use the mutate() and na_if() functions from the dplyr package.
What do mutate() and na_if() do?
mutate() adds columns to a data frame or modifies existing columns.
na_if() turns a specific problematic value into NA.
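A minimal sketch of this approach, assuming a hypothetical data frame called customers in which an age of 0 and an income of -1 are the invalid values (the data and the choice of invalid values are made up for illustration):

```r
library(dplyr)

# Hypothetical data: age 0 and income -1 stand in for invalid entries.
customers <- data.frame(
  age    = c(25, 0, 47, 33),
  income = c(52000, 31000, -1, 78000)
)

# na_if(x, y) returns x with every element equal to y replaced by NA.
cleaned <- customers %>%
  mutate(
    age    = na_if(age, 0),
    income = na_if(income, -1)
  )

cleaned
```

Note that na_if() only matches one specific value; a condition such as "any negative income" needs a different tool, for example ifelse().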
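What are sentinel values?
Sentinel values are special values used to flag a condition such as an unknown or invalid entry (e.g., an age of 0 or an income of -1). They need to be detected and converted, typically to NA, before the variable is used as a numeric value.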
REFER TO SLIDES
How to deal with outliers?
If we suspect that there are outliers in a variable (column) in our dataset,
we should consider dealing with them in the data cleaning process.
We can use boxplot.stats() to help us identify the outliers.
For invalid values such as negative incomes, we could fix the problem by turning all of them into NA.
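As a rough illustration in base R, assuming a made-up income vector with one negative and one unusually high value:

```r
# Hypothetical income values; one is negative and one is unusually high.
income <- c(32000, 41000, 38000, 45000, 39000, -500, 250000)

# The $out component of boxplot.stats() holds the points lying more than
# 1.5 times the interquartile range beyond the hinges.
outliers <- boxplot.stats(income)$out
outliers

# Turn all the negative incomes into NA.
income_clean <- ifelse(income < 0, NA, income)
income_clean
```

Here boxplot.stats() flags both the negative and the very high value, while the ifelse() step only removes the negative one; what to do with the high value still has to be justified separately.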
REFER TO SLIDES
What are the rules about outliers?
Like the missing values issue that we will look at later on, we need to justify what to do with the high income values that are identified as outliers.
How many outliers are there?
If only a few, then we can omit them, e.g., set them to NA.
REFER TO SLIDES
What are missing values?
An important feature of R is that it allows for NA (“not available”).
NA represents an unknown value.
Missing values are “contagious”: the result of almost any operation involving an unknown value will also be unknown.
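A few base R lines illustrate this behaviour (the vector x is made up for the example):

```r
x <- c(10, NA, 30)

NA + 1                  # NA: the sum of an unknown value is unknown
NA > 5                  # NA: the comparison cannot be decided
NA | TRUE               # TRUE: an exception, the result is known either way
mean(x)                 # NA, because one element is unknown
mean(x, na.rm = TRUE)   # 20, after removing the NA
is.na(x)                # FALSE TRUE FALSE
```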
REFER TO SLIDES
How to deal with missing values?
Strategies for treating missing values vary depending on the answers to the
following two questions:
How many?
Why are they missing?
===
Fundamentally, there are two things you can do with these variables:
Drop the rows with missing values, or
Convert the missing values to a meaningful value.
When is it safe to drop rows?
When the dropped rows don't make up a significant proportion of the total rows.
Check missing values
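One possible way to check how many values are missing, sketched in base R on a hypothetical data frame (the column names and values are assumptions for illustration):

```r
# Hypothetical data frame with some missing entries.
df <- data.frame(
  age    = c(25, NA, 47, 33),
  income = c(52000, 31000, NA, NA),
  state  = c("WA", "NSW", NA, "VIC")
)

na_counts    <- colSums(is.na(df))        # number of NAs per column
na_share     <- colMeans(is.na(df))       # proportion of NAs per column
n_incomplete <- sum(!complete.cases(df))  # rows with at least one NA

na_counts
na_share
n_incomplete
```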
REFER TO SLIDES
Missing values – To Drop or Not to Drop?
If data for a particular variable are missing from a large portion of the observations, or NAs are spread throughout the data, then consider:
- If the variable is categorical, then create a new category (e.g., missing) for the variable.
- If the variable is numerical,
- when values are missing randomly, replace them with the mean value or an appropriate estimate, a.k.a. imputing missing values;
- when values are missing systematically, convert them to categorical and add a new category, or replace them with zero and add a masking variable.
- For example, the new category can be labelled missing or invalid.
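For the categorical case, a minimal base R sketch, assuming a hypothetical housing variable and the label "missing" for the new category:

```r
# Hypothetical categorical variable with missing entries.
housing <- c("rented", NA, "owned", NA, "owned")

# Replace each NA with an explicit "missing" label, then convert to a factor
# so that "missing" becomes a level of its own.
housing_fixed <- factor(ifelse(is.na(housing), "missing", housing))

table(housing_fixed)
```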
What is listwise deletion?
If only a small proportion of values are missing and they tend to be for the same data points, then consider dropping those rows from your analysis. This is called listwise deletion.
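In base R, listwise deletion can be sketched with complete.cases() or na.omit(); the data frame here is hypothetical:

```r
# Hypothetical data: row 2 is missing both age and income.
df <- data.frame(
  age    = c(25, NA, 47, 33),
  income = c(52000, NA, 61000, 78000)
)

# Keep only the rows that have no missing values.
complete_rows <- df[complete.cases(df), ]

# na.omit() drops the same rows.
also_complete <- na.omit(df)

nrow(complete_rows)
```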
Missing values – Numerical variables
One might believe that the data collection failed at random so the missing values are independent of other variables. In this case, the missing values can be replaced by the mean or an appropriate estimate (e.g., the median).
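A minimal sketch of mean imputation in base R, using a made-up income vector:

```r
# Hypothetical income values with two missing entries.
income <- c(52000, NA, 61000, 78000, NA)

# Mean of the observed values only.
mean_income <- mean(income, na.rm = TRUE)

# Replace each NA with that mean; swap in median() for a median estimate.
income_imputed <- ifelse(is.na(income), mean_income, income)
income_imputed
```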
Missing values – imputing with better estimate
The estimate can be improved (potentially better than mean) if other variables that relate to it are used for prediction.
We can use other models, such as regression models or clustering, to make the prediction.
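As one illustration of a model-based estimate, the sketch below fits a linear regression of income on age and uses it to fill in the missing incomes. The data and the choice of age as the predictor are assumptions made for the example, not a recommendation for any particular dataset.

```r
# Hypothetical data: income is missing for two people whose age is known.
df <- data.frame(
  age    = c(25, 32, 41, 47, 53, 38),
  income = c(40000, 48000, NA, 66000, 71000, NA)
)

# lm() drops the rows with a missing income by default when fitting.
fit <- lm(income ~ age, data = df)

# Predict income for the rows where it is missing, and fill those in.
is_missing <- is.na(df$income)
df$income[is_missing] <- predict(fit, newdata = df[is_missing, ])

df
```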
===
Note: imputing a missing value of an input variable based on the other input variables can be applied to categorical data as well.
What is a trick that can work well in missingness?
A trick that has worked well is to not only replace the NAs with the mean, but also add an additional indicator variable (e.g., isBAD) to keep track of which data points have been altered.
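This trick can be sketched in base R as follows; the income values and the column names income_clean and income_isBAD are hypothetical:

```r
# Hypothetical income values with two missing entries.
income <- c(52000, NA, 61000, 78000, NA)

# Indicator: 1 where the original value was missing, 0 otherwise.
income_isBAD <- as.integer(is.na(income))

# Mean-imputed version of the variable.
income_clean <- ifelse(is.na(income), mean(income, na.rm = TRUE), income)

treated <- data.frame(income_clean, income_isBAD)
treated
```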
Why can missingness indicators be useful?
If the missing values really are missing randomly, then the indicator variables are uninformative, and the model should ignore them.
If the missing values are missing systematically, then the indicator variables provide useful additional information to the modelling algorithm.
===
In many situations, the isBAD variables are even more informative and useful than the original variables!
What is the vtreat package?
vtreat is a package for automatically treating missing values. It creates a treatment plan that records all the information needed so that the data treatment process can be repeated. You then use this treatment plan
- to “prepare” or treat your training data before you fit a model, and
- then again to treat new data before feeding it into the model.
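A rough sketch of this workflow, assuming the vtreat package is installed and using its designTreatmentsZ() and prepare() functions on made-up data. The exact names of the generated columns can differ between vtreat versions, so inspect names(train_treated) rather than relying on particular column names.

```r
library(vtreat)

# Hypothetical training data and new data, both with missing values.
train <- data.frame(
  age    = c(25, NA, 47, 33, 51),
  income = c(52000, 31000, NA, 78000, 64000)
)
new_data <- data.frame(
  age    = c(NA, 40),
  income = c(45000, NA)
)

# Build the treatment plan from the training data (no outcome variable needed).
plan <- designTreatmentsZ(train, varlist = c("age", "income"), verbose = FALSE)

# Apply the same plan to the training data and, later, to new data.
train_treated <- prepare(plan, train)
new_treated   <- prepare(plan, new_data)

names(train_treated)
```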
REFER TO SLIDES FOR EXAMPLES
What does recoding variables involve?
Change a continuous variable into a set of categories
Create a pass/fail variable based on a set of cutoff scores
Replace miscoded values with correct values
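The three kinds of recoding listed above can be sketched in base R. The scores, the cutoff of 50, the band boundaries and the miscoded grade "b" are all assumptions for illustration.

```r
score <- c(45, 72, 88, 51, 30)
grade <- c("A", "B", "b", "C", "A")   # "b" is a hypothetical miscoded value

# 1. Change a continuous variable into a set of categories.
#    cut() uses the intervals (0,50], (50,75], (75,100] here.
score_band <- cut(score,
                  breaks = c(0, 50, 75, 100),
                  labels = c("low", "medium", "high"))

# 2. Create a pass/fail variable based on a cutoff score.
result <- ifelse(score >= 50, "pass", "fail")

# 3. Replace a miscoded value with the correct value.
grade[grade == "b"] <- "B"

data.frame(score, score_band, result, grade)
```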
Discretizing continuous variables – Motivation
REFER TO SLIDES FOR EXAMPLES
Explicit categorisation
This is adding labels to ranges of data through assignment
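A minimal sketch of labelling ranges through assignment, with hypothetical ages, cutoffs and labels:

```r
age <- c(15, 34, 67, 80, 42)

# Start with an empty label vector, then assign a label to each range.
age_group <- rep(NA_character_, length(age))
age_group[age < 18]             <- "minor"
age_group[age >= 18 & age < 65] <- "adult"
age_group[age >= 65]            <- "senior"

# Store the result as a factor with the levels in a sensible order.
age_group <- factor(age_group, levels = c("minor", "adult", "senior"))
age_group
```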
REFER TO SLIDES FOR EXAMPLES
How do you rename variables?
Use fix() to invoke the interactive editor.
The dplyr package has a rename() function that’s useful for altering the names of variables.
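For example, with a hypothetical data frame whose columns are named custid and inc, rename() takes pairs of the form new_name = old_name:

```r
library(dplyr)

# Hypothetical data frame with terse column names.
df <- data.frame(custid = 1:3, inc = c(52000, 31000, 78000))

# rename(new_name = old_name)
renamed <- df %>% rename(customer_id = custid, income = inc)
names(renamed)

# Base R alternative for a single column.
names(df)[names(df) == "inc"] <- "income"
names(df)
```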
How to get the number of days in the month of a specific date?
monthDays(as.Date('2020-02-01')), from the Hmisc package, will get you the number of days in that month for that year.
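monthDays() is provided by the Hmisc package. Where that package is not available, the same count can be sketched in base R, assuming the date given is the first day of the month:

```r
first_day <- as.Date("2020-02-01")

# seq() with by = "month" steps to the first day of the following month.
month_bounds <- seq(first_day, by = "month", length.out = 2)

# The difference between the two dates is the number of days in the month.
days_in_month <- as.numeric(diff(month_bounds))
days_in_month   # 29, since 2020 is a leap year
```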
What is the format of the date?
yyyy-mm-dd by default; other formats such as mm/dd/yyyy can be read by passing a format string (e.g., "%m/%d/%Y") to as.Date().
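A short base R illustration of reading the same date from both formats:

```r
# Default format: yyyy-mm-dd, no format string needed.
d1 <- as.Date("2020-02-01")

# mm/dd/yyyy needs an explicit format string.
d2 <- as.Date("02/01/2020", format = "%m/%d/%Y")

d1 == d2   # TRUE: both represent 1 February 2020
```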
REFER TO SLIDES
How do you get the current date?
Sys.Date() gets only the date
date() gets you a character string with the day of the week (Mon, Tue, etc.), month, numerical day, time and year
Sys.time() gives you the date, time and timezone
How do you format the date?
Use the format() function.
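A brief sketch of format() applied to dates. The fixed date is used so the results are reproducible; names of days and months depend on the system locale, so no output is shown for those.

```r
today <- Sys.Date()
format(today, "%d/%m/%Y")       # today's date as dd/mm/yyyy

fixed <- as.Date("2020-02-01")
formatted <- format(fixed, "%d/%m/%Y")
formatted                       # "01/02/2020"
format(fixed, "%Y")             # "2020"
format(fixed, "%A %d %B %Y")    # weekday and month names, locale-dependent
```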