What is data wrangling?
Cleaning, transforming, and organizing data for analysis
Name four data manipulation purposes.
How do you create new variables?
Apply functions to existing variables (e.g., BMI)
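As a minimal sketch of the BMI case, the base R snippet below derives a new column from two existing ones. The `patients` data frame and its column names (`weight_kg`, `height_m`) are hypothetical, made up only for illustration.

```r
# Hypothetical patient data; the column names are illustrative assumptions.
patients <- data.frame(
  weight_kg = c(70, 85, 54),
  height_m  = c(1.75, 1.80, 1.60)
)

# Derive BMI from the existing weight and height columns.
patients$bmi <- patients$weight_kg / patients$height_m^2

# transform() does the same kind of thing but returns a modified copy.
patients2 <- transform(patients, bmi_rounded = round(bmi, 1))
print(patients2)
```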
What is imputation?
Adding or updating data to fill missing values
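One common form of imputation is replacing missing values with the mean of the observed ones. The sketch below assumes a hypothetical `age` vector. Mean imputation is only one of several possible strategies and is used here because it is short to show.

```r
# Hypothetical vector of ages with two missing values.
age <- c(23, NA, 31, 40, NA)

# Mean of the observed (non-missing) values only.
age_mean <- mean(age, na.rm = TRUE)

# Replace each NA with that mean; observed values are left unchanged.
age_imputed <- ifelse(is.na(age), age_mean, age)
print(age_imputed)
```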
List two methods for data wrangling.
When is base R recommended?
Minimal dependency or restricted environments
When is dplyr recommended?
How to apply a lookup table?
Use lookup_tbl[target_vector] to relabel values
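The lookup pattern can be sketched with a named vector: the names are the codes and the values are the labels. The codes and labels below are hypothetical.

```r
# Named character vector used as a lookup table (names are the codes).
lookup_tbl <- c(m = "Male", f = "Female", u = "Unknown")

# Hypothetical coded values to relabel.
target_vector <- c("f", "m", "m", "u", "f")

# Indexing by name returns the label for each code, in order.
# unname() drops the codes that would otherwise be kept as names.
relabelled <- unname(lookup_tbl[target_vector])
print(relabelled)
```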
What is a tibble?
A modern data frame with enhanced printing and behavior
How do tibbles print differently?
They show only the first 10 rows and only the columns that fit on screen
Name a tibble creation function.
How do you check whether an object is a tibble? (2 ways)
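A short sketch of creating and checking tibbles, assuming the tibble package is installed. `tibble()`, `as_tibble()` and `is_tibble()` come from that package; `inherits()` is base R. The built-in `mtcars` data set stands in for any ordinary data frame.

```r
library(tibble)

# tibble() builds a tibble column by column.
tb <- tibble(id = 1:3, name = c("a", "b", "c"))

# as_tibble() converts an existing data frame.
tb_cars <- as_tibble(mtcars)

# Two ways to check: the dedicated predicate and the class vector.
print(is_tibble(tb))           # TRUE
print(inherits(tb, "tbl_df"))  # TRUE
print(is_tibble(mtcars))       # FALSE: a plain data frame

# Printing shows only the first 10 of the 32 rows.
print(tb_cars)
```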
What verb selects columns?
What verb filters rows?
filter(); distinct() and slice() also subset rows
How to sort rows?
arrange() (use desc() for descending)
How to add new columns?
mutate() (transmute() to keep only new ones)
How to summarise data?
summarise() with summary functions
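The verbs above can be tried one at a time on a small data set. The sketch below assumes dplyr is installed and uses the built-in `mtcars` data purely as a stand-in; the new column name `mpg_per_wt` is made up for the example.

```r
library(dplyr)

# select(): keep columns by name.
cars_small <- select(mtcars, mpg, cyl, wt)

# filter(): keep rows that match a condition.
four_cyl <- filter(cars_small, cyl == 4)

# arrange(): sort rows; desc() gives descending order.
by_mpg <- arrange(four_cyl, desc(mpg))

# mutate(): add a column derived from existing ones.
with_ratio <- mutate(by_mpg, mpg_per_wt = mpg / wt)

# summarise(): collapse the rows into summary values.
overall <- summarise(with_ratio, mean_mpg = mean(mpg), n = n())
print(overall)
```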
What is the pipe operator?
c(15, 20, 35) %>% mean()
%>% reads as "then": create the vector, then calculate its mean
What does the pipe mean?
Pass output of one command to the next
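To make the "pass output to the next command" idea concrete, the sketch below writes the same computation as a nested call and as a pipeline. It uses the native `|>` pipe, which assumes R 4.1 or later; `%>%` from magrittr (re-exported by dplyr) behaves the same way for this simple case.

```r
# Nested call: has to be read from the inside out.
nested <- round(sqrt(mean(c(15, 20, 35))), 2)

# Piped call: read left to right, each result feeds the next function.
result <- c(15, 20, 35) |> mean() |> sqrt() |> round(2)

print(result)
print(identical(nested, result))
```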
How to filter rows with base R and dplyr?
Base R: data(flights, package = "nycflights13"); subset(flights, month == 2 & day == 1) OR flights |> subset(month == 2 & day == 1)
dplyr: data(flights, package = "nycflights13"); filter(flights, month == 2 & day == 1) OR flights |> filter(month == 2 & day == 1)
How to select columns with base R and dplyr?
Base R: data(flights, package = "nycflights13"); flights[, c("year", "month", "day")]
dplyr: data(flights, package = "nycflights13"); select(flights, year, month, day)
How to group and summarise data with dplyr?
data(flights, package = "nycflights13"); flights |> group_by(month) -> flights_grouped_by_mth
data(flights, package = "nycflights13"); flights |> group_by(month) |> summarise(mean(dep_delay, na.rm = TRUE))
na.rm = TRUE ignores missing values when computing the mean
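The effect of `na.rm` is easiest to see on a tiny data set. The sketch below assumes dplyr is installed and uses a hypothetical `delays` data frame in place of `flights`, so it runs without nycflights13.

```r
library(dplyr)

# Hypothetical delays; one value is missing in month 1.
delays <- data.frame(
  month     = c(1, 1, 1, 2, 2),
  dep_delay = c(10, NA, 20, 5, 15)
)

# Without na.rm, the NA propagates into the mean for month 1.
with_na <- delays |>
  group_by(month) |>
  summarise(mean_delay = mean(dep_delay))

# With na.rm = TRUE, the missing value is dropped before averaging.
without_na <- delays |>
  group_by(month) |>
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

print(with_na)
print(without_na)
```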
How to create new columns from existing ones with dplyr?
flights |> mutate(gain = arr_delay - dep_delay, air_time_hrs = air_time / 60, speed = distance / air_time_hrs)
flights |> group_by(carrier) |> mutate(gain = arr_delay - dep_delay, air_time_hrs = air_time / 60, speed = distance / air_time_hrs)