Lecture 6 - Data Manipulation Flashcards

(25 cards)

1
Q

What is data wrangling?

A

Cleaning, transforming, and organizing data for analysis

2
Q

Name four data manipulation purposes.

A
  1. Filtering (subsetting) or reordering the observations
  2. Variable creation
  3. Calculating a set of summary statistics
  4. Imputation
3
Q

How do you create new variables?

A

Apply functions to existing variables (e.g., compute BMI from weight and height)
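
Example (a minimal base R sketch; the `people` data frame and its column names are hypothetical, made up only to illustrate the idea):

```r
# Hypothetical data: weight in kilograms, height in metres
people <- data.frame(
  weight_kg = c(60, 72, 85),
  height_m  = c(1.60, 1.80, 1.75)
)

# New variable computed from two existing ones: BMI = weight / height^2
people$bmi <- people$weight_kg / people$height_m^2
```

The same pattern works for any derived column: write an expression in terms of existing columns and assign it to a new name.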

4
Q

What is imputation?

A

Replacing missing values with substituted or estimated values (e.g., the mean of the observed values)
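
Example (a minimal sketch of mean imputation in base R; the `age` vector is hypothetical, and mean imputation is only one of several possible strategies):

```r
# Hypothetical vector with two missing values
age <- c(23, NA, 31, 40, NA)

# Mean imputation: replace each NA with the mean of the observed values
age_imputed <- ifelse(is.na(age), mean(age, na.rm = TRUE), age)
```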

5
Q

List two approaches to data wrangling in R.

A
  1. dplyr
  2. base R
6
Q
A
7
Q

In what scenario is base R recommended?

A

Minimal dependency or restricted environments

8
Q

In what scenario is dplyr recommended?

A
  1. When you need clear, readable code and optimised operations
9
Q

How do you apply a lookup table?

A

Use lookup_tbl[target_vector] to relabel values
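
Example (a minimal base R sketch; the codes and labels are hypothetical, and `lookup_tbl` is assumed to be a named character vector):

```r
# Lookup table: the names are the codes, the values are the labels
lookup_tbl <- c(m = "Male", f = "Female", u = "Unknown")

# Hypothetical vector of codes to relabel
target_vector <- c("f", "m", "m", "u", "f")

# Indexing by name returns one label per code; unname() drops the codes
labels <- unname(lookup_tbl[target_vector])
```

Indexing a named vector by a character vector repeats and reorders the labels to match the codes, so no loop or join is needed.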

10
Q

What is a tibble?

A

A modern take on the data frame (class tbl_df) with tidier printing and stricter, more predictable behavior

11
Q

How do tibbles print differently?

A

Show only the first 10 rows and only the columns that fit on screen

12
Q

Name a tibble creation function.

A
  1. tibble()
  2. as_tibble()
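
Example (a minimal sketch, assuming the tibble package is installed; the `scores` data is hypothetical, while `mtcars` is a built-in R dataset):

```r
library(tibble)

# tibble(): build a tibble column by column
scores <- tibble(id = 1:3, score = c(90, 75, 82))

# as_tibble(): convert an existing data frame to a tibble
cars_tbl <- as_tibble(mtcars)
```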
13
Q

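How do you check whether an object is a tibble? (2 ways)

A
  1. is_tibble() (tibble package; is.tibble() is the older, deprecated name)
  2. is.tbl() (dplyr; tests for the tbl class)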
14
Q

What verb selects columns?

A
  1. select()
15
Q

What verb filters rows?

A
  1. filter()

Other row-subsetting verbs: distinct(), slice()
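
Example (a minimal sketch of the three row-subsetting verbs, assuming dplyr is installed; the built-in `mtcars` dataset is used only for illustration):

```r
library(dplyr)

# filter(): keep rows that satisfy a condition
four_cyl <- filter(mtcars, cyl == 4)

# distinct(): keep the unique combinations of the chosen columns
cyl_gear <- distinct(mtcars, cyl, gear)

# slice(): keep rows by position
first_three <- slice(mtcars, 1:3)
```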

16
Q

How to sort rows?

A

arrange() (use desc() for descending)
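
Example (a minimal sketch, assuming dplyr is installed; `mtcars` is a built-in dataset chosen only for illustration):

```r
library(dplyr)

# Ascending order is the default
by_mpg_asc <- arrange(mtcars, mpg)

# Wrap the column in desc() for descending order
by_mpg_desc <- arrange(mtcars, desc(mpg))
```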

17
Q

How to add new columns?

A

mutate() (transmute() to keep only new ones)
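
Example (a minimal sketch, assuming dplyr is installed; the new column name `wt_lb` is hypothetical, and `mtcars$wt` is recorded in units of 1000 lb):

```r
library(dplyr)

# mutate(): all existing columns are kept and wt_lb is appended
with_lb <- mutate(mtcars, wt_lb = wt * 1000)

# transmute(): only the newly created column is kept
only_new <- transmute(mtcars, wt_lb = wt * 1000)
```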

18
Q

How to summarise data?

A

summarise() with summary functions
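
Example (a minimal sketch, assuming dplyr is installed; the output column names `mean_mpg` and `n_cars` are hypothetical):

```r
library(dplyr)

# Collapse the whole data frame to one row of summary statistics
overall <- summarise(
  mtcars,
  mean_mpg = mean(mpg),  # average miles per gallon
  n_cars   = n()         # number of rows
)
```

Without group_by(), summarise() returns a single row; with grouping it returns one row per group.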

19
Q

What is the pipe operator?

A
  1. %>% (magrittr; also available after loading dplyr)
  2. |> (native pipe, R 4.1 and later)
c(15, 20, 35) %>% mean()

Read %>% as "then": create the vector, then calculate its mean
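
Example (a minimal sketch comparing a nested call with the native pipe; it assumes R 4.1 or later, and the variable names are hypothetical):

```r
x <- c(15, 20, 35)

# Nested calls are read inside-out
nested <- round(mean(x), 1)

# Piped calls are read left to right: take x, then mean, then round
piped <- x |> mean() |> round(1)
```

Both forms compute the same value; the pipe only changes the order in which the steps are written.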

20
Q

What does the pipe mean?

A

Pass output of one command to the next

21
Q

How to filter rows with base R and dplyr?

A
  1. base R -> subset()
data(flights, package="nycflights13")
subset(flights, month==2 & day==1)
OR
flights |> subset(month==2 & day==1)
  2. dplyr -> filter()
library(dplyr)
data(flights, package="nycflights13")
filter(flights, month==2 & day==1)
OR
flights |> filter(month==2 & day==1)
22
Q

How to select columns with base R and dplyr?

A
  1. base R -> []
data(flights, package="nycflights13")
flights[, c("year","month","day")]
  2. dplyr -> select()
library(dplyr)
data(flights, package="nycflights13")
select(flights, year, month, day)
23
Q

How to group and summarise data with dplyr?

A
  1. group_by()
data(flights, package="nycflights13")
flights_grouped_by_mth <- flights |> group_by(month)
  2. summarise()
data(flights, package="nycflights13")
flights |> group_by(month) |> summarise(mean_dep_delay = mean(dep_delay, na.rm=TRUE))

na.rm=TRUE drops the missing values (NA) before the mean is calculated

24
Q

How to create new columns from existing ones with dplyr?

A
  1. mutate()
flights |>
  mutate(
    gain = arr_delay - dep_delay,
    air_time_hrs = air_time / 60,
    speed = distance / air_time_hrs)
  2. group_by() + mutate()
flights |>
  group_by(carrier) |>
  mutate(
    gain = arr_delay - dep_delay,
    air_time_hrs = air_time / 60,
    speed = distance / air_time_hrs)
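25
Q

Name the six dplyr join types.

A
  1. Inner: inner_join()
  2. Left: left_join()
  3. Right: right_join()
  4. Full: full_join()
  5. Semi: semi_join()
  6. Anti: anti_join()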
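
Example (a minimal sketch of the join types, assuming dplyr is installed; the `students` and `grades` tables are hypothetical and share the key column `id`):

```r
library(dplyr)

# Hypothetical tables: ids 2 and 3 appear in both
students <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
grades   <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"))

inner <- inner_join(students, grades, by = "id")  # ids in both tables: 2, 3
left  <- left_join(students, grades, by = "id")   # every student: 1, 2, 3
right <- right_join(students, grades, by = "id")  # every grade: 2, 3, 4
full  <- full_join(students, grades, by = "id")   # every id: 1, 2, 3, 4

# Filtering joins keep only columns from the first table
semi <- semi_join(students, grades, by = "id")    # students with a grade: 2, 3
anti <- anti_join(students, grades, by = "id")    # students without a grade: 1
```

Inner, left, right, and full joins add columns from the second table; semi and anti joins only filter the rows of the first.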