Lecture 6 - Data Manipulation Flashcards by Wl hoe

What is data wrangling?

Cleaning, transforming, and organizing data for analysis

Name four data manipulation purposes.

Filtering (keeping only the observations that meet a condition)
Variable creation
Calculating a set of summary statistics
Imputation

How do you create new variables?

Apply functions to existing variables (e.g., BMI)
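
A minimal base R sketch of the BMI example. The data frame and its column names (people, height_m, weight_kg) are made up for illustration and are not from the lecture.

```r
# Hypothetical example data: heights in metres, weights in kilograms
people <- data.frame(
  name      = c("A", "B", "C"),
  height_m  = c(1.60, 1.75, 1.82),
  weight_kg = c(55, 70, 90)
)

# New variable computed from two existing ones: BMI = weight / height^2
people$bmi <- people$weight_kg / people$height_m^2

print(people)
```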

What is imputation?

Adding or updating data to fill missing values
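
One simple form of imputation is replacing each missing value with the mean of the observed values. The sketch below assumes that approach; the vector scores is invented for illustration, and mean imputation is only one of several possible strategies.

```r
# Hypothetical vector with two missing values
scores <- c(10, NA, 30, NA, 50)

# Mean of the observed (non-missing) values
observed_mean <- mean(scores, na.rm = TRUE)

# Replace each NA with that mean; leave observed values unchanged
imputed <- ifelse(is.na(scores), observed_mean, scores)

print(imputed)
```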

List two approaches to data wrangling in R.

dplyr
base R

When is base R recommended?

Minimal dependency or restricted environments

When is dplyr recommended?

When you need clear, readable code and optimised performance

How to apply a lookup table?

Use lookup_tbl[target_vector] to relabel values
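
A small sketch of the lookup pattern, assuming the lookup table is a named character vector. The codes and labels below are illustrative only.

```r
# Named vector: names are the codes, values are the labels
lookup_tbl <- c(AA = "American Airlines", DL = "Delta", UA = "United")

# Vector of codes to relabel
carrier <- c("UA", "AA", "DL", "AA")

# Indexing by name returns one label per code; unname() drops the names
carrier_name <- unname(lookup_tbl[carrier])

print(carrier_name)
```

Codes that are missing from the lookup table come back as NA, which can be a useful check that the table is complete.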

What is a tibble?

A modern data frame with enhanced printing and behavior

How do tibbles print differently?

Only the first 10 rows are shown, and columns that do not fit on screen are truncated

Name a tibble creation function.

tibble()
as_tibble()
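
A short sketch of both creation functions, assuming the tibble package is installed. The object names are illustrative.

```r
library(tibble)

# tibble() builds a tibble column by column; later columns may use earlier ones
tb <- tibble(x = 1:3, y = x * 2)

# as_tibble() converts an existing data frame
df  <- data.frame(a = 1:2, b = c("u", "v"))
tb2 <- as_tibble(df)

print(tb)
print(is_tibble(tb2))  # TRUE
print(is_tibble(df))   # FALSE
```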

How do you check whether an object is a tibble? (2 ways)

is_tibble() (is.tibble() is the older, deprecated spelling)
is.tbl()

What verb selects columns?

select()

What verb filters rows?

filter()

Other verbs that subset rows: distinct(), slice()
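
To show how the three row verbs differ, here is a sketch on a tiny made-up data frame (df, id and grp are illustrative names), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical data with one duplicated row
df <- data.frame(id = c(1, 1, 2, 3), grp = c("a", "a", "b", "b"))

by_condition <- filter(df, grp == "b")  # rows where the condition is TRUE
unique_rows  <- distinct(df)            # duplicate rows removed
by_position  <- slice(df, 1:2)          # rows picked by position

print(by_condition)
print(unique_rows)
print(by_position)
```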

How to sort rows?

arrange() (use desc() for descending)
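
A brief sketch of ascending and descending sorts, assuming dplyr is installed. The delays data frame is invented for illustration.

```r
library(dplyr)

# Hypothetical data
delays <- data.frame(
  carrier   = c("AA", "DL", "UA"),
  dep_delay = c(12, 3, 25)
)

ascending  <- arrange(delays, dep_delay)        # smallest delay first
descending <- arrange(delays, desc(dep_delay))  # largest delay first

print(ascending)
print(descending)
```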

How to add new columns?

mutate() (transmute() to keep only new ones)

How to summarise data?

summarise() with summary functions

What is the pipe operator?

%>% (magrittr)
|> (native R)

c(15, 20, 35) %>% mean()

Read %>% as "then": create the vector, then calculate its mean.

What does the pipe mean?

Pass output of one command to the next
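
To make this concrete, the sketch below writes the same computation as nested calls and as a pipeline. It uses the native |> pipe, which assumes R 4.1 or later; the variable names are illustrative.

```r
x <- c(15, 20, 35)

# Nested calls: read from the inside out
nested <- round(sqrt(mean(x)), 2)

# Piped: read left to right, each result becomes the first argument of the next call
piped <- x |> mean() |> sqrt() |> round(2)

print(identical(nested, piped))
```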

How to filter rows with base R and dplyr?

base R -> subset()

data(flights, package = "nycflights13")
subset(flights, month == 2 & day == 1)
OR
flights |> subset(month == 2 & day == 1)

dplyr -> filter()

library(dplyr)
data(flights, package = "nycflights13")
filter(flights, month == 2 & day == 1)
OR
flights |> filter(month == 2 & day == 1)

How to select columns with base R and dplyr?

base R -> []

data(flights, package = "nycflights13")
flights[, c("year", "month", "day")]

dplyr -> select()

library(dplyr)
data(flights, package = "nycflights13")
select(flights, year, month, day)

How to group and summarise data with dplyr?

group_by()

library(dplyr)
data(flights, package = "nycflights13")
flights |> group_by(month) -> flights_grouped_by_mth

summarise()

library(dplyr)
data(flights, package = "nycflights13")
flights |> group_by(month) |> summarise(mean(dep_delay, na.rm = TRUE))

na.rm = TRUE drops the missing values before the mean is calculated

How to create new columns from existing ones with dplyr?

mutate()

flights |>
  mutate(
    gain = arr_delay - dep_delay,
    air_time_hrs = air_time / 60,
    speed = distance / air_time_hrs)

group_by() + mutate()

flights |>
  group_by(carrier) |>
  mutate(
    gain = arr_delay - dep_delay,
    air_time_hrs = air_time / 60,
    speed = distance / air_time_hrs)

Name six dplyr join types.

Inner, Left, Right, Full, Semi, Anti
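
The six joins differ in which rows they keep. The sketch below assumes dplyr is installed and uses two tiny invented tables (flights_small and airlines_small are illustrative names, not from the lecture).

```r
library(dplyr)

# Hypothetical tables: carrier "ZZ" has no airline entry, "UA" has no flight
flights_small  <- data.frame(flight = c(1, 2, 3), carrier = c("AA", "DL", "ZZ"))
airlines_small <- data.frame(carrier = c("AA", "DL", "UA"),
                             name    = c("American", "Delta", "United"))

inner <- inner_join(flights_small, airlines_small, by = "carrier")  # matches only
left  <- left_join(flights_small, airlines_small, by = "carrier")   # all flights
right <- right_join(flights_small, airlines_small, by = "carrier")  # all airlines
full  <- full_join(flights_small, airlines_small, by = "carrier")   # everything
semi  <- semi_join(flights_small, airlines_small, by = "carrier")   # flights with a match
anti  <- anti_join(flights_small, airlines_small, by = "carrier")   # flights without a match

print(sapply(list(inner = inner, left = left, right = right,
                  full = full, semi = semi, anti = anti), nrow))
```

Semi and anti joins only filter the first table, so their results contain no columns from the second table.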

Lecture 6 - Data Manipulation Flashcards

(25 cards)