R Flashcards

Question

Hot keys to open a new script

Answer 1

Ctrl+Shift+N on Windows

Answer 2

Stuff that is stored in R/Things that are stored in containers in R. Objects can be variables or more complicated entities like functions (and datasets).

Answer 3

Use <- OR = to assign a value to a variable.
Examples:
    a <- 1
    b <- 1
    c <- 1
OR
    a = 1
    b = 1
    c = 1

Answer 4

Type the name of the object (e.g., a) and then hit Enter, OR type print(object_name) and then hit Enter. Examples: a OR print(a)

Answer 5

ls() Note: A dataset is also considered an object, so it will also tell you if a dataset is currently loaded in your R environment Also, IDEs standardly have a tab that shows you all the variable names (e.g., in RStudio, you can look at the Environment>Global Environment pane)

Answer 6

You haven't defined x yet

Answer 7

ls
Typing ls without the parentheses does not call the function; R shows you the code for ls instead. Typing ls() calls the function and shows you all objects saved in your workspace.
Note: In general, to evaluate a function we need to use parentheses. If we type a function's name without parentheses, R shows us the code for the function. Most functions also require one or more arguments, i.e., something written inside the parentheses.

Answer 8

Rules: variable names have to start with a letter (or a dot that is not followed by a number) and they can't contain spaces. Best practice: stick to lowercase and use underscores instead of spaces. Be careful not to use any names already in use (e.g., don't use install.packages as it's already a function).

Answer 9

Any line that starts with # will not be evaluated

Answer 10

help(function_name) OR ?function_name Example: help(log) OR ?log (R is case sensitive, so ?Log will not find the help page.)

Answer 11

seq creates a sequence of numbers; sum adds them up.
Example:
    n <- 10
    x <- seq(1, n)
    sum(x)
So n is 10, and x is the vector of numbers from 1 through 10: 1 2 3 4 5 6 7 8 9 10 (think of the comma like "through" or an en dash). The output will be [1] 55 because 1+2+3+4+5+6+7+8+9+10 = 55.
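
The seq/sum example above can be run as a short script. This is only a sketch; the closed-form check n * (n + 1) / 2 is an addition for illustration and is not part of the original card.

```r
n <- 10
x <- seq(1, n)                 # the integers 1 through 10
total <- sum(x)                # adds all ten entries
print(total)                   # [1] 55

# Cross-check with the closed-form sum of the first n integers
formula_total <- n * (n + 1) / 2
print(total == formula_total)  # [1] TRUE
```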

Answer 12

log2(16)
Logarithm base 2 of 16, AKA to what power does 2 (the base) have to be raised to equal 16? 2*2=4, 4*2=8, 8*2=16, so 2*2*2*2 = 2^4 = 16 and the answer is 4 (four factors of 2 multiply to 16).

Answer 13

The square root of a number is a value which, when multiplied by itself, gives the original number. It is represented by the symbol '√'. For example, the square root of 25 is √25 = 5, because 5*5 = 25. In R, sqrt(4) returns 2 (because 2*2 = 4).

Answer 14

log(exp(x))

Answer 15

A number or variable written at the upper right of a base number that indicates how many times the base is used as a factor. Example: 2^2 means 2*2 = 4.

Answer 16

The type of object.
For example:
    a <- 2
    class(a)
    output: [1] "numeric"
Or:
    class(ls)
    output: [1] "function"

Answer 17

In R, the [1] (or [2], [3], etc.) you see at the start of a line of output is an index label that tells you the position of the first element printed on that line. R prints a vector as a stream of values, and to help you keep track of where you are in the sequence it prints an index in square brackets at the beginning of each line. [1] means "the first element printed here is element number 1 of the vector." If the output wraps to the next line, you might see [11], meaning "the first element on this line is the 11th element of the vector."
Example:
    x <- 1:15
    x
Output (where the line wraps depends on the width of your console):
     [1]  1  2  3  4  5  6  7  8  9 10
    [11] 11 12 13 14 15
On the first line, [1] tells you it's showing elements starting from the 1st. On the second line, [11] tells you it's showing elements starting from the 11th. The index is not part of your data; it is just a printing guide so you know where in the vector you are.

Answer 18

A single number is technically a vector, but in general, they have several entries.

Answer 19

Think of data frames as tables. Each column contains one variable, and each row holds one observation with a value for each column. You can have different data types in a data frame. Each column should have the same number of items, even if some are missing. We store data in a data frame, e.g., data("murders"). Then if you do class(murders), the output will make it clear that it is a "data.frame".
Example:
    library(dslabs)
    data(murders)
    # Save temperatures in an object called `temp`
    temp <- c(35, 88, 42, 84, 81, 30)
    # Store city names in a `city` object
    city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
    # Generate a data frame with city names and temperatures
    city_temps <- data.frame(name = city, temperature = temp)
    # Define a `states` variable to contain the names of the states from the `murders` data set
    states <- murders$state
    # Define a `ranks` variable to hold the rank of each state's population size
    ranks <- rank(murders$population)
    # Generate a `my_df` data frame with the names of the states and their ranks
    my_df <- data.frame(states = states, ranks = ranks)

Answer 20

str(object_name) Example: str(murders) will show the structure of the data stored inside of the data frame "murders" Output: 'data.frame': 51 obs. of 5 variables: $ state : chr "Alabama" etc. $ abb : etc. $ region: etc $ population : etc. $ total : etc. etc. (I did not list the full output above as it was lengthy)

Answer 21

observations AKA rows in a table

Answer 22

abbreviation

Answer 23

head(murders)

Answer 24

It's the accessor. Example: murders$population The output is the population column of the "murders" data frame: [1] 4779736 710231 6392017 etc. *Note: the order of the entries in the vector murders$population preserves the order of the rows in the data table.

Answer 25

names(name_of_data_frame)
Example: names(murders)
Output: [1] "state" "abb" "region" "population" "total"
OR str(object_name)
Example: str(murders) will show the structure of the data stored inside of the data frame "murders"
Output: 'data.frame': 51 obs. of 5 variables: $ state : chr "Alabama" etc. $ abb : etc. $ region: etc. $ population : etc. $ total : etc. (I did not list the full output above as it was lengthy)

Answer 26

The function length: length(vector_name)
Example:
    pop <- murders$population
    length(pop)
    output: [1] 51
because there are 51 entries, one for each of the 50 states plus the District of Columbia.

Answer 27

Note: We use quotes to distinguish between variable names and characters strings "a" will give you the character string a

Answer 28

These must be either TRUE or FALSE.
For example:
    z <- 3 == 2
    z
    output: [1] FALSE
    class(z)
    output: [1] "logical"

Answer 29

a relational operator that asks if a value equals another value For example: z <- 3 == 2 z output: [1] FALSE It's false because 3 does not equal 2.

Answer 30

Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. Factors in R are stored as a vector of integer values with a corresponding set of character values (the levels) to use when the factor is displayed. R stores each level as an integer. (This is more memory efficient than storing all of the characters.)
Note: the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order.
Factors are useful for storing categorical data. In the example below, regions are categorical.
Example:
    class(murders$region)
class tells you what kind of object it is; murders is the data frame that contains the data about the number of homicides in the U.S.; $ is the accessor, which returns the column "region" of the murders data frame. So you're asking R what type of object the column region in the murders data frame is.
    output: [1] "factor"
So region is a factor, not a character vector. To see the specific regions:
    levels(murders$region)
    output: [1] "Northeast" "South" "North Central" "West"
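
To make the "integers plus labels" idea concrete, here is a minimal sketch on a small made-up vector (the `region` values below are hypothetical and are not the murders data set). It shows the default alphabetical levels and the integer codes stored behind them.

```r
# Hypothetical toy data, not the murders data set
region <- factor(c("West", "South", "West", "Northeast", "South"))

region_levels <- levels(region)     # default: levels sorted alphabetically
region_codes  <- as.integer(region) # the integers R stores for each entry

print(region_levels)   # [1] "Northeast" "South"     "West"
print(region_codes)    # [1] 3 2 3 1 2
print(class(region))   # [1] "factor"
print(nlevels(region)) # [1] 3
```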

Answer 31

levels(data_frame_name$Column_name) Example: levels(murders$region) murders is the data frame that contains the data about the number of homicides in the U.S. $ is the accessor; output is the column "region" in the murders data frame output: [1] "Northeast" "South" "North Central" "West"

Answer 32

In the background of R, we store integers. Integers are smaller memory-wise than characters.

Answer 33

Factors can be easily confused with characters. be careful! general advice: avoid factors as much as possible, though they are sometimes necessary to fit statistical models that depend on categorical data.

Answer 34

determine the type of an object

Answer 35

tables with rows representing observations and columns representing different variables

Answer 36

an object consisting of several entries and can be a numeric vector, a character vector, or a logical vector (i.e., must either be true or false). A vector is a series of values, all of the same type. They are the most basic unit of data in R. They can store numerical, character or logical data. In R, we can create a vector with the c function, which is short for concatenate. To concatenate, we write the elements of the vector separated by a comma in parentheses. For example, a numerical vector containing costs can be created like this: cost <- c(50, 75, 90, 100, 150)

Answer 37

quotes "a" will give you the character string a

Answer 38

The function levels() can be used to determine the levels of a factor. Example: levels(murders$region) output: [1] "Northeast" "South" "North Central" "West"

Answer 39

Use it to determine the number of levels of a factor.
Example:
    # R program to get the number of levels of a factor
    # Creating a factor
    gender <- factor(c("female", "male", "male", "female"))
    gender
    # Calling nlevels() function to get the number of levels
    nlevels(gender)
Output:
    [1] female male   male   female
    Levels: female male
    [1] 2

Answer 40

p <- murders$population Or o <- murders[["population"]]

Answer 41

c() stands for concatenate: the action of combining values into a single vector. For example, the `c()` function connects several character strings into one character vector.

Answer 42

The function table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region. Example: table(murders$region) Output: Northeast South North Central West 9 17 12 13

Answer 43

codes <- c(italy = 380, canada = 124, egypt = 818) OR codes <- c("italy" = 380, "canada" = 124, "egypt" = 818) OR codes <- c(380, 124, 818) country <- c("italy","canada","egypt") names(codes) <- country

Answer 44

seq creates a sequence of numbers.
First argument: defines the start of the sequence.
Second argument: defines the end of the sequence.
Third argument (optional): tells seq how much to jump by; the default, if a third argument is not entered, is to go up in steps of one.
Examples:
    seq(1, 10)
    output: [1] 1 2 3 4 5 6 7 8 9 10
    seq(1, 10, 2)
    output: [1] 1 3 5 7 9
Note: If you want consecutive integers, you can use the following shorthand: 1:10
Notes: The second argument is a maximum, not necessarily the last value. So if we write seq(7, 50, 7), we obtain the same vector as if we had written seq(7, 49, 7). This can be useful because sometimes we want a sequence of numbers that stays below a predetermined value. When we write 1:10 or seq(1, 10), R produces integers, not numerics, because they are typically used to index something; adding a step, as in seq(1, 10, 2), gives a numeric vector.
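
The points about the optional step, the "maximum" behaviour of the second argument, and the class of the result can be checked with a few lines. This is a sketch; the object names are only for illustration.

```r
a <- seq(1, 10)        # no step given: same as 1:10
b <- seq(1, 10, 2)     # step of 2: 1 3 5 7 9
d <- seq(7, 50, 7)     # stops at 49, the last multiple of 7 not above 50

print(class(a))        # [1] "integer"
print(class(b))        # [1] "numeric"
print(b)               # [1] 1 3 5 7 9
print(max(d))          # [1] 49
print(identical(d, seq(7, 49, 7)))  # [1] TRUE
```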

Answer 45

the function c() or concatenate AND seq() which generates sequences

Answer 46

Square brackets are used for subsetting: they let you access specific elements of a vector.
Examples:
    codes[2]
    output:
    canada
       124
    codes[c(1,3)]
    output:
    italy egypt
      380   818
    codes[1:2]
    output:
     italy canada
       380    124
If the entries of a vector are named, they may be accessed by referring to their name:
    codes["canada"]
    output:
    canada
       124
    codes[c("egypt","italy")]
    output:
    egypt italy
      818   380
To access the 3rd, 4th, and 5th elements of the cost vector: cost[3:5] OR cost[c(3,4,5)]
*The : operator helps condense the code and obtain consecutive values from a range.
To access just the first item and fifth item in the cost vector: cost[c(1,5)]
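
A runnable sketch of the subsetting examples, using the `codes` vector defined on the earlier card. The negative index at the end is an extra illustration that is not in the original card.

```r
# Named numeric vector, as on the earlier card
codes <- c(italy = 380, canada = 124, egypt = 818)

second        <- codes[2]                   # by position
first_third   <- codes[c(1, 3)]             # several positions at once
by_name       <- codes[c("egypt", "italy")] # by name, in the order requested
without_first <- codes[-1]                  # a negative index drops an element

print(second)
print(first_third)
print(by_name)
print(names(without_first))                 # [1] "canada" "egypt"
```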

Answer 47

Is an attempt by R to be flexible with data types by guessing what was meant when an entry does not match as expected. For example: x <- c(1, "canada", 3) R coerced the data into characters. It guessed that because you put a character string in the vector (i.e., "canada") that you meant the 1 and 3 to be character strings (i.e., "1" and "3") Note: You won't get an error or a warning with the above example. You may not initially realize that R changed 1 and 3 to character strings In the above situation, we'd say that "R coerced the data into a character string"
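
Because coercion happens silently, checking the class of the result is a quick way to notice it. A minimal sketch (the second vector, `y`, is an added illustration of logical and integer values being promoted to numeric):

```r
x <- c(1, "canada", 3)
print(x)               # [1] "1"      "canada" "3"
class_x <- class(x)
print(class_x)         # [1] "character"

# Coercion also happens between other types: TRUE becomes 1
y <- c(1, TRUE, 3L)
class_y <- class(y)
print(y)               # [1] 1 1 3
print(class_y)         # [1] "numeric"
```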

Answer 48

as.character()
Example:
    x <- 1:5
    y <- as.character(x)
    y
    output: [1] "1" "2" "3" "4" "5"

Answer 49

as.numeric()
Example: Change "1", "2", "3", "4", "5" (characters) to 1, 2, 3, 4, 5 (numeric)
    x <- 1:5
    y <- as.character(x)
    y
    output: [1] "1" "2" "3" "4" "5"
    as.numeric(y)
    output: [1] 1 2 3 4 5

Answer 50

In R, missing data is assigned to the value NA NA is the special value for missing data. Not Available Example: When R fails to coerce something, we'll get NA. For instance, in the below example R will be able to convert the "1" and "3" to numeric values, but it won't know what to do with b. x <- c("1", "b", "3") as.numeric(x) output: [1] 1 NA 3 Warning message: NAs introduced by coercion *So the output is 1 (missing value) 3 This makes it clear that "b" is the missing value R doesn't know what to do, so instead of converting "b" to a number, it tells us that it's NA/Not Available/Missing Data It's very common to come across NA as it's used for all missing data. Be ready to see a lot of NAs
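
The failed-coercion example can be reproduced as follows. This is a sketch; `suppressWarnings()` is used here only to keep the demonstration quiet, and in real work the warning is useful.

```r
x <- c("1", "b", "3")

# "b" cannot be converted, so it becomes NA (the warning is silenced for the demo)
y <- suppressWarnings(as.numeric(x))
print(y)                    # [1]  1 NA  3

n_missing <- sum(is.na(y))  # count the missing entries
print(n_missing)            # [1] 1
```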

Answer 51

length(32:99)

Answer 52

The seq() function has another useful argument: length.out. This argument allows us to generate sequences that increment by the same value but generate a vector of a specific length. Example 1: x <- seq(0, 100, length.out = 5) output: 0, 25, 50, 75, 100. *There are only 5 numbers in the output. Example 2: seq.int(3, 30, length.out = 10) output: 3, 6, 9, 12, 15, 18, 21, 24, 27, 30 *There are only 10 numbers in the output.

Answer 53

Add the letter L after the integer. Example: class(3L) output: [1] "integer"

Answer 54

The main difference is that integers take up less space in a computer's memory. For large operations, using integers can have a substantial impact. Otherwise, for most purposes, integers and numerics are indistinguishable. For example, 3, the integer, minus 3, the number, is 0. Example: 3L-3 output: 0

Answer 55

Character vectors

Answer 56

sort()
Example: To see the numbers of gun murders sorted from least to most:
    library(dslabs)
    data(murders)
    sort(murders$total)
    output: [1] 2 4 5 5 7 8 11 [8] 12 12 16 19 21 22 27 etc.

Answer 57

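order() returns the indices that would sort the vector.
Example:
    x <- c(31, 4, 15, 92, 65)
    sort(x)
    [1] 4 15 31 65 92
    order(x)
    [1] 2 3 1 5 4
BECAUSE the entries of x sit at these positions: 31 (1st), 4 (2nd), 15 (3rd), 92 (4th), 65 (5th). The sorted vector is 4 (position 2), 15 (position 3), 31 (position 1), 65 (position 5), 92 (position 4), so order(x) is 2 3 1 5 4.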
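
One way to see what order() returns is to use its result as an index: indexing x by order(x) reproduces sort(x). A short sketch:

```r
x <- c(31, 4, 15, 92, 65)

sorted <- sort(x)      # the values, smallest to largest
idx    <- order(x)     # the positions that put x in sorted order

print(sorted)          # [1]  4 15 31 65 92
print(idx)             # [1] 2 3 1 5 4
print(x[idx])          # [1]  4 15 31 65 92, the same as sort(x)
```

The same index can be used to reorder a second vector (for example, state names) by the values of the first.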

Answer 58

rank()
rank() tells you the rank of each entry, i.e., the position each number would have if the vector were sorted from smallest to largest.
ex//
    x <- c(31, 4, 15, 92, 65)
    sort(x)
    [1] 4 15 31 65 92
    rank(x)
    [1] 3 1 2 5 4
BECAUSE in sort(x) the order is 4 (1st), 15 (2nd), 31 (3rd), 65 (4th), 92 (5th), so the entries of x have ranks 31 (3), 4 (1), 15 (2), 92 (5), 65 (4).
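
A sketch contrasting rank() with rank of the negated vector, which ranks from largest to smallest:

```r
x <- c(31, 4, 15, 92, 65)

r_up   <- rank(x)    # 1 = smallest
r_down <- rank(-x)   # 1 = largest

print(r_up)          # [1] 3 1 2 5 4
print(r_down)        # [1] 3 5 4 1 2
```

Here 92 is the largest value, so it has rank 5 in `r_up` and rank 1 in `r_down`.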

Answer 59

max() Example: max(murders$total) [1] 1257 That is, the highest number of murders listed in the total column of the murders dataset is 1257 (California).

Answer 60

which.max() Example: max(murders$total) [1] 1257 i_max <- which.max(murders$total) i_max [1] 5 murders$state[i_max] [1] "California"

Answer 61

max() returns the largest value; which.max() returns the index of the largest value.

Answer 62

min() Example: min(murders$total) [1] 2 That is, the lowest number of murders listed in the total column of the murders dataset is 2 (Vermont).

Answer 63

which.min()
Example:
    min(murders$total)
    [1] 2
    i_min <- which.min(murders$total)
    i_min
    [1] 46
    murders$state[i_min]
    [1] "Vermont"

Answer 64

the selection of objects in which the order of selection matters. Example: There are 720 permutations of the digits 1, 2, 3, 4, 5, and 6 vs. a combination, which means the selection of objects in which the order does not matter.
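
The counts can be checked in R with factorial() and choose(). This is a sketch; the "choose 3 out of 6" case is an added illustration of the permutation/combination difference.

```r
# All orderings of 6 distinct digits
n_perm <- factorial(6)
print(n_perm)        # [1] 720

# Picking 3 of the 6 digits:
n_ordered_3 <- factorial(6) / factorial(6 - 3)  # order matters (permutations)
n_comb_3    <- choose(6, 3)                     # order does not matter (combinations)
print(n_ordered_3)   # [1] 120
print(n_comb_3)      # [1] 20
```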

Answer 65

min() returns the smallest value which.min() returns the index of the smallest value

Answer 66

When we apply the is.na function to a vector, it gives us a logical vector that tells us which inputs are NA. NA: Not Available. Commonly used for missing data; a common problem in real-world data sets.
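
A small sketch of is.na() in use, on a made-up vector (`temps` is hypothetical). It also shows the logical NOT operator, !, used to keep only the observed values.

```r
temps <- c(35, NA, 42, 84, NA, 30)   # hypothetical data with two missing values

missing  <- is.na(temps)             # logical vector: TRUE where the value is NA
observed <- temps[!missing]          # keep only the non-missing entries

print(missing)                       # [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
print(observed)                      # [1] 35 42 84 30
print(mean(temps))                   # [1] NA
print(mean(temps, na.rm = TRUE))     # [1] 47.75
```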

Answer 67

Logical negation (NOT): !TRUE becomes FALSE; !FALSE becomes TRUE

Answer 68

murder_rate <- murders$total/murders$population*100000 murders$state[order(murder_rate, decreasing=TRUE)]

Answer 69

It means that if you apply an operation like +, -, *, or / to two vectors, R matches up the elements in the same positions of those vectors and performs the operation pair by pair. If one of them is a single number, the operation is applied to every entry of the other vector.
Example: Convert heights from inches to centimeters
    heights <- c(69, 62, 66, 70)
    heights * 2.54
    [1] 175.26 157.48 167.64 177.80
So 69*2.54 = 175.26, 62*2.54 = 157.48, etc.
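
A sketch showing both cases: a vector combined with a single number, and two vectors of the same length combined element by element. The vectors `a` and `b` are made up for illustration.

```r
heights_in <- c(69, 62, 66, 70)
heights_cm <- heights_in * 2.54   # every entry is multiplied by 2.54
print(heights_cm)                 # [1] 175.26 157.48 167.64 177.80

a <- c(1, 2, 3)
b <- c(10, 20, 30)
pair_sum <- a + b                 # 1+10, 2+20, 3+30
print(pair_sum)                   # [1] 11 22 33
```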

Answer 70

<   less than
<=  less than or equal to
>   greater than
>=  greater than or equal to
==  exactly equal to
!=  not equal to
!   NOT
&   AND
|   OR
NOTE: & returns TRUE only when both logicals are TRUE:
    TRUE & TRUE gives TRUE
    TRUE & FALSE gives FALSE
    FALSE & FALSE gives FALSE
Example: We want to find the states in the western U.S. with a murder rate less than or equal to 1 per 100,000 people.
    west <- murders$region == "West"
    safe <- murder_rate <= 1
    index <- safe & west
    murders$state[index]
    [1] "Hawaii" "Idaho" "Oregon" "Utah" "Wyoming"
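
The operators can be tried on a small made-up example (the `region` and `rate` vectors below are hypothetical, not the murders data):

```r
region <- c("West", "South", "West", "Northeast")  # hypothetical
rate   <- c(0.5, 0.8, 2.1, 0.9)                    # hypothetical

west <- region == "West"   # TRUE FALSE TRUE FALSE
safe <- rate <= 1          # TRUE TRUE FALSE TRUE

both   <- west & safe      # TRUE only where both are TRUE
either <- west | safe      # TRUE where at least one is TRUE
not_w  <- !west            # flips TRUE and FALSE

print(both)                # [1]  TRUE FALSE FALSE FALSE
print(either)              # [1] TRUE TRUE TRUE TRUE
print(not_w)               # [1] FALSE  TRUE FALSE  TRUE
```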

Answer 71

which() gives us the indices of the entries of a logical vector that are TRUE.
Example:
    x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
    which(x)
    [1] 2 4 5
Use case: We want to look up Massachusetts' murder rate. The function which tells us which entries of a logical vector are TRUE, so we can type:
    index <- which(murders$state == "Massachusetts")
    index
    [1] 22
    murder_rate[index]
    [1] 1.802
You could also index with the logical vector directly (see the other option), but using which makes the index a much smaller object.
Other option:
    index <- murders$state == "Massachusetts"
    murder_rate[index]
    [1] 1.802
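
A self-contained sketch of which(), including the size difference between a logical index and the index returned by which(). The `state` and `rate` vectors are made up for illustration.

```r
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
true_positions <- which(x)
print(true_positions)          # [1] 2 4 5

state <- c("Maine", "Massachusetts", "Michigan")  # hypothetical
rate  <- c(0.8, 1.8, 4.2)                         # hypothetical

logical_index <- state == "Massachusetts"  # one TRUE/FALSE per state
numeric_index <- which(logical_index)      # just the matching position

print(length(logical_index))   # [1] 3
print(numeric_index)           # [1] 2
print(rate[numeric_index])     # [1] 1.8
```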

Answer 72

Looks for entries in a vector and returns the index needed to access them.
Example: We want to find the murder rate for several different states (i.e., New York, Florida, and Texas). match tells us which indices of the second vector match each of the entries of the first vector.
    index <- match(c("New York", "Florida", "Texas"), murders$state)
    index
    [1] 33 10 44
    murders$state[index]
    [1] "New York" "Florida" "Texas"
    murder_rate[index]
    [1] 2.668 3.398 3.201
Notes: 33 is the index that matched New York, 10 matched Florida, 44 matched Texas
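
A sketch of match() on made-up vectors (`state` and `rate` are hypothetical). It also shows what match() returns when there is no match.

```r
state <- c("Florida", "New York", "Texas", "Utah")  # hypothetical
rate  <- c(3.4, 2.7, 3.2, 0.8)                      # hypothetical

idx <- match(c("New York", "Texas"), state)  # positions of each name in `state`
print(idx)                                   # [1] 2 3
print(rate[idx])                             # [1] 2.7 3.2

no_match <- match("Boston", state)           # not found, so NA
print(no_match)                              # [1] NA
```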

Answer 73

Tells you whether or not each element of a first vector is in a second vector.
Example:
    x <- c("a", "b", "c", "d", "e")
    y <- c("a", "d", "f")
    y %in% x
    [1] TRUE TRUE FALSE
Example: You aren't sure if Boston, Dakota, and Washington are states, but you want to find out.
    c("Boston", "Dakota", "Washington") %in% murders$state
    [1] FALSE FALSE TRUE
Only Washington is a state.
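
The first example runs as written; the last line is an added illustration that combines %in% with ! to pull out the entries that are not found.

```r
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")

found <- y %in% x        # one TRUE/FALSE for each element of y
print(found)             # [1]  TRUE  TRUE FALSE

not_found <- y[!found]   # elements of y that are missing from x
print(not_found)         # [1] "f"
```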

Answer 74

sum TRUE > 1 FALSE > 0 So when we sum them, we're basically counting the cases that are true Example: sum(index) 5 So 5 cases in the index are true

Answer 75

For example, if we compare a vector to a single number, R performs the test for each entry.
Example: We define the index as the states with a murder rate smaller than 0.71 per 100,000; if we want less than or equal, we use <= instead.
    index <- murder_rate < 0.71
    index <- murder_rate <= 0.71
    index
    [1] FALSE FALSE FALSE FALSE FALSE etc.
There are 51 entries that are either FALSE or TRUE. The entries that are TRUE are the cases for which the murder rate is smaller than or equal to 0.71 per 100,000 people (the first TRUE is at position 12, Hawaii). To see which these are, we can leverage the fact that vectors can be indexed with logicals.
    murders$state[index]
    [1] "Hawaii" "Iowa" "New Hampshire" "North Dakota" "Vermont"
To count how many entries are TRUE, we can use the function sum. TRUE counts as 1 and FALSE counts as 0, so when we sum them, we're basically counting the cases that are TRUE.
    sum(index)
    [1] 5
So 5 cases in the index are TRUE (i.e., 5 states have murder rates less than or equal to 0.71 per 100,000 people).

Answer 76

Exploratory data visualization

Answer 77

Exploratory Data Analysis (EDA) is an important step in data science and data analytics: it uses visualization to understand the main features of the data, find patterns, and discover how different parts of the data are connected.

Answer 78

D3 is a JavaScript library and framework for creating visualizations. It's more flexible and powerful than R, but it takes longer to generate a plot.

Answer 79

Example: population_in_millions <- murders$population/10^6 total_gun_murders <- murders$total plot(population_in_millions, total_gun_murders)

Answer 80

Histograms are graphical summaries that give you a general overview of the types of values you have. In R, they can be produced using the hist() function.
    # a histogram of murder rates
    murders <- mutate(murders, rate = total / population * 100000)
    hist(murders$rate)

Answer 81

Boxplots provide a more compact summary of a distribution than a histogram and are more useful for comparing distributions.
    # boxplots of murder rates by region
    boxplot(rate ~ region, data = murders)
    # In a single line of code, stratify state populations by region
    # and generate boxplots for the strata for the `murders` data set
    boxplot(population ~ region, data = murders)

Answer 82

Use the function plot() a simple scatterplot of total murders versus population x <- murders$population /10^6 y <- murders$total plot(x, y) Or For a quick plot that avoids accessing variables twice, we can use the with function: with(murders, plot(population, total)) The function with lets us use the murders column names in the plot function. It also works with any data frames and any function.

Answer 83

a mathematical relationship where one quantity changes by a constant multiplicative factor for a constant change in another quantity, meaning the output value is multiplied by a fixed amount each time the input increases by a fixed amount.

Answer 84

a connection between two variables that graphs as a straight line, where a change in one variable results in a proportional, constant change in the other

Answer 85

Use to add a new column to the data table or to change an existing one.
first argument = data frame
second argument = name and value of the variable
Example 1: Add the murder rates to our murders data frame.
    library(dplyr)
    library(dslabs)
    data("murders")
    murders <- mutate(murders, rate = total / population * 100000)
So the data frame is murders, and the name and value of the variable are rate = total / population * 100000.
Example 2: Take the log transformation of the population variable:
    mutate(murders, population = log10(population))
Example 3: Apply the same transformation to several variables:
    mutate(murders, across(c(population, total), log10))
Example 4: Apply the transformation to every numeric column:
    mutate(murders, across(where(is.numeric), log10))
*Notes:
-Notice that we used "total" and "population" inside the function, which are objects that are not defined in our workspace. We don't get an error, because functions inside the dplyr package know to look for variables in the data frame provided in the first argument. In the call to mutate above, total will have the values in murders$total.
-The output of the above code will show an updated table/murders object with a new column. BUT even though we have overwritten the original murders object, it does not change the object that is loaded with data(murders). So, if we load the murders data again, the original will overwrite our mutated version.
-Like the filter function, we can use the data table variable names inside the function, and R will know that we mean the columns and not objects in the workspace.

Answer 86

Use to subset the data by selecting specific COLUMNS.
    select(data_frame, column_name_1, column_name_2)
Example: select(murders, state, abb) selects the columns "state" and "abb" from the data frame "murders".
Example:
    new_table <- select(murders, state, region, rate)
    filter(new_table, rate <= 0.71)
In the above code, we select 3 columns, assign them to a new object, and then filter the object.
Use Case: We have a data table with hundreds of columns, but we just want to view a few of the columns.
*Subset: to take a specific part of an object, e.g., some columns of a data frame or, with square brackets, some elements of a vector.

Answer 87

Use to perform a series of operations

Answer 88

Use to subset the data by filtering specific ROWS.
first argument = data table
second argument = the conditional statement
Example:
    new_table <- select(murders, state, region, rate)
    filter(new_table, rate <= 0.71)
So, first we define a new object (i.e., new_table) by selecting variable names AKA columns (i.e., state, region, rate) from the murders data table/data frame, and then we filter its rows.
Notes:
-Like the mutate function, we can use the data table variable names inside the function, and R will know that we mean the columns and not objects in the workspace.
*Subset: to take a specific part of an object, e.g., some rows of a data frame or, with square brackets, some elements of a vector.

Answer 89

pipe operator: |> (the native pipe, available starting with R version 4.1.0) OR %>% (the tidyverse operator)
Normally, we have to define an intermediate object in order to use select, mutate, and filter together.
Example: We defined "new_table" (the intermediate object) below.
    new_table <- select(murders, state, region, rate)
    filter(new_table, rate <= 0.71)
BUT with the pipe, we can avoid that. We can write code that looks more like what we want to do (i.e., data > select > filter) AKA take the original data, select some columns, and then filter some rows. So, the pipe makes it possible to perform a series of operations by sending the result of one function to the next function using the pipe operator |> (or %>%).
Example:
    murders |> select(state, region, rate) |> filter(rate <= 0.71)
So: data table |> (pipe) select(variable names/columns) |> (pipe) filter(condition on rows)
Note:
-When using the pipe, we no longer need to specify the first argument (the data), because the pipe passes whatever is on its left as the first argument of the function on its right.

Answer 90

installing and loading the dplyr package install.packages("dplyr") library(dplyr)

Answer 91

rank(x) gives you the ranks from lowest to highest. rank(-x) gives you the ranks from highest to lowest.

Answer 92

!= Example: Remove Florida row no_florida <- filter(murders, state != "Florida")

Answer 93

Example:
    grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                         exam_1 = c(95, 80, 90, 85),
                         exam_2 = c(90, 85, 85, 90))
    grades
output:
      names exam_1 exam_2
    1  John     95     90
    2  Juan     80     85
    3  Jean     90     85
    4   Yao     85     90
Note:
-Before R 4.0, data.frame turned character strings into factors by default; to keep them as characters you had to add the argument stringsAsFactors = FALSE to the call. Since R 4.0 the default is stringsAsFactors = FALSE, so that is no longer necessary.
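
The effect of stringsAsFactors can be seen by setting it explicitly both ways. This is a sketch; the argument is written out here so the result does not depend on the R version in use.

```r
student <- c("John", "Juan", "Jean", "Yao")
exam_1  <- c(95, 80, 90, 85)

# Strings kept as characters (the default since R 4.0)
grades_chr <- data.frame(names = student, exam_1 = exam_1,
                         stringsAsFactors = FALSE)

# Strings converted to factors (the default before R 4.0)
grades_fct <- data.frame(names = student, exam_1 = exam_1,
                         stringsAsFactors = TRUE)

print(class(grades_chr$names))   # [1] "character"
print(class(grades_fct$names))   # [1] "factor"
```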

Answer 94

Character Type: A character vector is used to store text data. Usage: Typically used for text strings or identifiers that do not have a defined set of levels. Example: "apple", "banana", "cherry". When to Use: Use character vectors for free-text data that doesn’t have a fixed set of categories. Factor Type: A factor is used to represent categorical data. Internally, factors are stored as integer vectors with a corresponding set of character labels. Usage: Used to store categorical data that has a fixed set of possible values (levels). Example: A factor variable fruit with levels "apple", "banana", and "cherry". When to Use: Use factors for categorical data, especially when the categories are important for statistical analysis or when you need to specify the order of levels.

Answer 95

Note that if rank(x) gives you the ranks of x from smallest to largest, rank(-x) gives you the ranks from largest to smallest.

Answer 96

Count the number of rows Example: library(dplyr) library(dslabs) data(murders) no_south <- filter(murders, region != "South") nrow(no_south) You'll get the number of rows where the region does not equal "South"

Answer 97

With the |> operator:
    library(dplyr)
    library(dslabs)
    data(murders)
    my_states <- murders |>
      mutate(rate = total / population * 100000, rank = rank(-rate)) |>
      filter(region %in% c("Northeast", "West") & rate < 1) |>
      select(state, rate, rank)
vs without the |> operator:
    library(dplyr)
    library(dslabs)
    data(murders)
    murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))
    my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)
    select(my_states, state, rate, rank)

Answer 98

The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember Operations: mutate() adds new variables that are functions of existing variables select() picks variables based on their names. (COLUMNS) filter() picks cases based on their values. (ROWS) summarise() reduces multiple values down to a single summary. arrange() changes the ordering of the rows.

Answer 99

Package: a collection of R functions, data and compiled code. Library: The location where the packages are stored is called the library. Data: the data table

Answer 100

The measure of how spread out numbers are around their average.
How to calculate SD:
-subtract the mean from each number
-square the results
-add them up
-divide by the length of the list (i.e., the total number of numbers in the list)
-take the square root of the result
(Dividing by the length gives the population standard deviation; R's sd() function divides by the length minus 1 instead, so its result is slightly larger.)
Low standard deviation: the data is closely clustered around the mean/average.
High standard deviation: the data is dispersed over a wider range of values.
Use standard deviation to determine if data is standard and expected OR unusual and unexpected. A data point that is beyond a certain number of standard deviations from the mean (e.g., 3σ) represents an outcome that is well above or below the average. This can be used to judge whether a result is "statistically significant" or part of "expected variation".
Use Case: Is a bottle with an extra ounce of soda expected, or is it statistically significant and warranting additional investigation into the production line?
Notes:
-The mathematical symbol for standard deviation (not used in R code) is the lowercase Greek letter sigma (σ).
68-95-99.7 rule (or Empirical rule), for data that are approximately normally distributed:
-68% of the data fall within 1 standard deviation of the mean.
-95% of the data fall within 2 standard deviations of the mean.
-99.7% of the data fall within 3 standard deviations of the mean.
5 Sigma Results: Results that are 5 standard deviations above or below the mean. (A result that deviates this much may signify a discovery as it has only a 1 in 3.5 million chance that it is due to random fluctuation.)
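
The steps listed above can be followed literally in R and compared with the built-in sd(). This is a sketch on a made-up vector; the pnorm() lines are an added check of the 68-95-99.7 rule for a normal distribution.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data

mu       <- mean(x)                                # 5
pop_sd   <- sqrt(sum((x - mu)^2) / length(x))      # divide by n, as in the steps
sample_sd <- sd(x)                                 # R divides by n - 1

print(pop_sd)      # [1] 2
print(sample_sd)   # slightly larger, about 2.14

# Share of a normal distribution within 1, 2, and 3 standard deviations
within <- pnorm(1:3) - pnorm(-(1:3))
print(round(within, 4))   # [1] 0.6827 0.9545 0.9973
```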

Answer 101

dplyr verb. The summarize function in dplyr provides a way to compute summary statistics for your data.
Example:
    library(dplyr)
    library(dslabs)
    data(murders)
    murders <- mutate(murders, rate = total / population * 10^5)
    s <- murders |>
      filter(region == "West") |>
      summarize(minimum = min(rate), median = median(rate), maximum = max(rate))
    s
output:
      minimum median maximum
    1   0.515   1.29    3.63
Note:
-Because the resulting table s is a data frame, we can access the components with the accessor (i.e., $):
    s$median
    [1] 1.29
Example: If you wanted to know what the mean of `carat` was in the diamonds dataset (from the ggplot2 package), you could run:
    summarize(diamonds, mean_carat = mean(carat))

Answer 102

Use arrange to order entire data frames (as opposed to using order and sort, which work on individual vectors or columns).
Example: Order the states by population size:
    murders |> arrange(population) |> head()
first row of the output:
      state   abb region population total  rate
    1 Wyoming WY  West       563626     5 0.887
So, with arrange we get to decide which column to sort by. To see the states sorted by murder rate, for example, we would use arrange(rate) instead.
Note: Default is ascending order. To do descending order:
    murders |> arrange(desc(rate))

Answer 103

Example: library(dplyr) library(dslabs) data(murders) murders <- mutate(murders, rate = total / population * 10^5) s <- murders |> filter(region == "West") |> summarize(minimum = min(rate), median = median(rate), maximum = max(rate)) s minimum median maximum 1 0.515 1.29 3.63

Answer 104

Example: If we wanted to compute the murder rate for the entire country, we could not just take the average of the state rates, because that does not take into account that some states are more populous than others and need to be weighted more. Average rate: mean(murders$rate) [1] 2.78 Instead we can compute the rate using this code & the summarize function: us_murder_rate <- murders |> summarize(rate = sum(total) / sum(population) * 10^5) us_murder_rate rate 1 3.03
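
To see why the two calculations differ, here is a small base R sketch with two invented states, one tiny and one large. The numbers are hypothetical and only meant to make the gap between the simple average and the population-weighted rate obvious.

```r
# Hypothetical two-state example (numbers invented for illustration)
total      <- c(10, 100)      # murders in each state
population <- c(1e5, 1e7)     # residents in each state

# Per-state rates per 100,000 people: 10 and 1
state_rate <- total / population * 10^5

# Simple average treats the tiny state and the large state equally
simple_average <- mean(state_rate)

# Weighted version: total murders over total population
weighted_average <- sum(total) / sum(population) * 10^5

print(simple_average)    # 5.5
print(weighted_average)  # about 1.09
```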

Answer 105

Example: The line of code below returns the minimum (0), median (0.5), and maximum (1) quantiles of the vector x. minimum = 0th percentile of the vector median = 50th percentile of the vector maximum = 100th percentile of the vector quantile(x, c(0, 0.5, 1)) So, if you use a function that returns 2 or more values inside summarize (see below), summarize returns a table with one row for each value returned by the call to quantile (here, 3 rows). murders |> filter(region == "West") |> summarize(range = quantile(rate, c(0, 0.5, 1))) range 1 0.515 2 1.292 3 3.630 If you want to have them in columns, then you need to write a function that returns a data frame rather than a vector. #Define a function that returns the quantiles as a data frame my_quantile <- function(x){ r <- quantile(x, c(0, 0.5, 1)) data.frame(minimum = r[1], median = r[2], maximum = r[3]) } murders |> filter(region == "West") |> summarize(my_quantile(rate)) minimum median maximum 1 0.515 1.29 3.63

Answer 106

The dplyr pull function can be used to access values stored in data when using pipes. When a data object is piped, that object and its columns can be accessed using the pull() function. Notes: -The dplyr function summarize always returns a data frame. This may be a problem if you want to use the result with functions that require a numeric value. Use the pull function if you want a numeric value rather than a data frame. Examples: library(tidyverse) library(dplyr) library(dslabs) data(murders) murders <- mutate(murders, rate = total / population * 10^5) #average rate adjusted by population size (weighted average): us_murder_rate <- murders %>% summarize(rate = sum(total) / sum(population) * 10^5) us_murder_rate #us_murder_rate is stored as a data frame: class(us_murder_rate) #the pull function can return it as a numeric value: us_murder_rate %>% pull(rate) #using pull to save the number directly: us_murder_rate <- murders %>% summarize(rate = sum(total) / sum(population) * 10^5) %>% pull(rate) us_murder_rate #us_murder_rate is now stored as a number: class(us_murder_rate)

Answer 107

Use the accessor $ or the dplyr pull function

Answer 108

Think of the dot as a placeholder for the data that's being passed through the pipe. Note: the dot placeholder works with the magrittr pipe %>%, not with the native pipe |>. Example of a dot being used to imitate the pull function: us_murder_rate <- murders %>% summarize(rate = sum(total) / sum(population) * 10^5) %>% .$rate us_murder_rate [1] 3.03 class(us_murder_rate) [1] "numeric"

Answer 109

A common operation in data exploration: Split the data into groups and then compute summaries for each group. Use the group_by() function to do this. Example: In the code below, the summarize function will apply the summarization to each group separately. (This happens whenever summarize follows group_by.) murders |> group_by(region) |> summarize(median = median(rate))
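
The same split-then-summarize idea can be sketched without any packages. The example below uses base R's tapply() as a stand-in for group_by() followed by summarize(); it is not the dplyr approach from the card, and the data frame df is invented for illustration.

```r
# Hypothetical data frame standing in for murders
df <- data.frame(
  region = c("West", "West", "South", "South", "South"),
  rate   = c(1, 3, 2, 4, 6)
)

# Split rate into groups by region, then take the median of each group
median_by_region <- tapply(df$rate, df$region, median)

print(median_by_region)
# South median is 4 (from 2, 4, 6); West median is 2 (from 1, 3)
```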

Answer 110

A package is a set of R functions, compiled code, and sample data dplyr: A package for manipulating data frames purrr: A package for working with functions ggplot2: A graphing package

Answer 111

In statistics, an observation is one occurrence of something you're measuring. Example: You're measuring the weight of a certain species of turtle. Each turtle that you collect the weight of counts as one single observation. In R, each observation is represented as a row in a data frame.

Answer 112

Matrix: A 2-D collection of elements of the same data type (i.e., all numeric, all character, etc.) This means it has both rows and columns. Visual Example: [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 Data Frame: It's like a table in a spreadsheet; it can hold different data types in different columns (e.g., numeric, character, logical, etc.), although all values within a single column must be of the same data type. Visual Example: Data Frame (mixed types) ┌────────┬─────┬────────┐ │ Name │ Age │ Passed │ ├────────┼─────┼────────┤ │ Alice │ 25 │ TRUE │ │ Bob │ 30 │ FALSE │ │ Carol │ 22 │ TRUE │ └────────┴─────┴────────┘ Vector: a one-dimensional sequence of data elements of the same type.

Answer 113

each row represents one observation and the columns represent the different variables available for each of these observations.

Answer 114

standard deviation

Answer 115

Use arrange to decide which column to sort by. Example: murders |> arrange(rate) |> head() output: state abb region pop total rate Vermont VT Northeast 625741 2 0.320 Hawaii HI West 1360301 7 0.515 Iowa IA N Central 3046355 21 0.689 Note: In dplyr, the default is to arrange in ascending order. To see in descending order: murders |> arrange(desc(rate))

Answer 116

If we are ordering by a column with ties, we can use a second column to break the tie. Similarly, a third column can be used to break ties between first and second and so on. Example: Here we order by region, then within region we order by murder rate: murders |> arrange(region, rate) |> head() #> state abb region population total rate #> 1 Vermont VT Northeast 625741 2 0.320 #> 2 New Hampshire NH Northeast 1316470 5 0.380 #> 3 Maine ME Northeast 1328361 11 0.828 #> 4 Rhode Island RI Northeast 1052567 16 1.520 #> 5 Massachusetts MA Northeast 6547629 118 1.802 #> 6 New York NY Northeast 19378102 517 2.668

Answer 117

This function takes a data frame as its first argument, the number of rows to show in the second, and the variable to filter by in the third. Example: murders |> top_n(5, rate) #> state abb region population total rate #> 1 District of Columbia DC South 601723 99 16.45 #> 2 Louisiana LA South 4533372 351 7.74 #> 3 Maryland MD South 5773552 293 5.07 #> 4 Missouri MO North Central 5988927 321 5.36 #> 5 South Carolina SC South 4625364 207 4.48

Answer 118

A special kind of data frame. You can think of them as modern versions of data frames. Tibbles: -Never change the data type of the inputs -Never change the names of your variables -Never create row names -Make printing easier (i.e., they won't overload your console, as by default they only print the first ten rows). Differences between a data frame and a tibble: -The print method for tibbles is more readable. -If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or a scalar. With tibbles, this does not happen, which is useful since tidyverse functions require data frames as an input. -With tibbles, if you want to access the vector that defines a column, and not get back a data frame, you need to use the accessor $: class(as_tibble(murders)$population) #> [1] "numeric" -While data frame columns need to be vectors of numbers, strings, or logical values, tibbles can have more complex objects, such as lists or functions. -Tibbles can be grouped: the function group_by returns a special kind of tibble, a grouped tibble. Tibbles are the preferred format in the tidyverse, so tidyverse functions that produce a data frame from scratch return a tibble. (The functions group_by and summarize always return a tbl data frame.)

Answer 119

To convert a dataframe to a tibble. Example: as_tibble(murders)

Answer 120

. Used as a shorthand reference to the current object being passed through the pipeline. Note: the dot is the placeholder of the magrittr pipe %>%; the native pipe |> does not recognize it. Example, Implicit Placeholder: 1:5 %>% mean() %>% sqrt() visual flow: 1:5 #mean(1:5) mean(.) #sqrt(mean(1:5)) sqrt(.) Example, Explicit Placeholder Inside an Expression: 1:5 %>% mean() %>% {. - 2} visual flow: 1:5 #mean(1:5) mean(.) #(mean(1:5)) - 2 {. - 2} Example, Data Frame with dplyr: iris %>% filter(.$Species == "setosa") %>% summarize(avg = mean(.$Sepal.Length)) visual flow: iris #filter(iris, iris$Species == "setosa") filter(.$Species == "setosa") #summarize(the filtered iris, avg = mean of its Sepal.Length column) summarize(avg = mean(.$Sepal.Length)) Example, Formula Shorthand (purrr): map_dbl(1:5, ~ .^2) visual flow: Each element of 1:5 #(1)^2, (2)^2, (3)^2, ... ~ .^2

Answer 121

No. Once you install a package, it remains installed and only needs to be loaded with library.

Answer 122

If you want a quick look at a function's arguments without opening the help system, use the args() function. Example: args(log)

Answer 123

Arithmetic operators: help("+") + x - x x + y x - y x * y x / y x ^ y x %% y x %/% y Relational operators: help(">") x < y x > y x <= y x >= y x == y x != y

Answer 124

To specify arguments, we must use =, and cannot use <-. Example: To what power must I raise 2 to get 8? log(base = 2, x = 8) #> [1] 3

Answer 125

There are several datasets that are included for users to practice and test out functions. You can see all the available datasets by typing: data() This shows you the object name for these datasets. These datasets are objects that can be used by simply typing the name. For example, if you type: co2 R will show you Mauna Loa atmospheric CO2 concentration data.

Answer 126

This term is used to refer to objects with several entries. The function length tells you how many entries are in the vector. Example: the object murders$population is not one number but several, so it is a vector. Numeric Vector: In a numeric vector, every entry must be a number. Character Vector: All entries in a character vector need to be characters. Logical Vector: All entries must be either TRUE or FALSE.

Answer 127

?Comparison Description Binary operators which allow the comparison of values in atomic vectors. Usage x < y x > y x <= y x >= y x == y x != y

Answer 128

levels(): You can specify an order through the levels argument when creating the factor with the factor function. reorder(): lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. Example: Take the sum of the total murders in each region and reorder the factor following those sums. region <- murders$region value <- murders$total region <- reorder(region, value, FUN = sum) levels(region) [1] "Northeast" "North Central" "West" "South" *Note: Factors sometimes behave like characters and sometimes they don't. Confusing factors and characters is a common source of bugs. Reminder: Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. Factors/categorical variables are stored in levels. R stores each level as an integer. (This is more memory efficient than storing all of the characters.) Note: the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow an alphabetical order.
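
A small self-contained sketch of the default alphabetical level order and of reorder(), using invented values rather than the murders dataset:

```r
# Hypothetical data (not the murders dataset)
region <- factor(c("West", "South", "West", "Northeast"))
value  <- c(1, 10, 2, 5)

# Default: levels follow alphabetical order, not order of appearance
default_levels <- levels(region)
print(default_levels)    # "Northeast" "South" "West"

# Reorder the levels by the sum of value within each region
# (West = 3, Northeast = 5, South = 10)
region <- reorder(region, value, FUN = sum)
reordered_levels <- levels(region)
print(reordered_levels)  # "West" "Northeast" "South"
```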

Answer 129

Data Frames are a special case of lists. Lists allow you to store any combination of different types. Example, How to create a list: record <- list(name = "John Doe", student_id = 1234, grades = c(95, 82, 91, 97, 93), final_grade = "A")

Answer 130

With the accessor $ or with double square brackets [[ Examples: record$student_id [1] 1234 OR record[["student_id"]] [1] 1234 BUT if a list does not have names, you can only extract the elements with the brackets, not the accessor. Example: record[[1]] [1] "John Doe" Notes: What does it mean if a list does not have names? Usually when you create a list, you give names to each element. For example: $name [1] "Ken" $age [1] 34 $city [1] "Chicago" In this example, each element has a name (i.e., name, age, city) so you can access them by name: person$name person[["city"]] BUT you can also create lists without naming the elements: my_list <- list("Ken", 34, "Chicago") To access those elements, you would need to do: my_list[[1]] #"Ken" my_list[[2]] # 34
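
A runnable sketch of the access patterns above. It also shows one detail the card does not mention: single square brackets return a (shorter) list rather than the element itself.

```r
# Named list, as on the card
record <- list(name = "John Doe", student_id = 1234)

by_dollar   <- record$student_id        # accessor
by_brackets <- record[["student_id"]]   # double square brackets, by name
by_position <- record[[1]]              # double square brackets, by position

# Single brackets keep the result wrapped in a list
still_a_list <- record["name"]

# Unnamed list: only positions work
my_list <- list("Ken", 34, "Chicago")

print(by_dollar)            # 1234
print(by_position)          # "John Doe"
print(class(still_a_list))  # "list"
print(my_list[[2]])         # 34
print(names(my_list))       # NULL, because no names were given
```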

Answer 131

Use square brackets. syntax: matrix_name[row, column] Example 1: If you want the second row, third column in a matrix: mat[2, 3] [1] 10 Example 2: If you want the entire second row, you leave the column spot empty: mat[2, ] [1] 2 6 10 Example 3: If you want the entire third column, leave the row spot empty. mat[, 3] [1] 9 10 11 12 Example 4: Access more than one column or more than one row. mat[, 2:3] Example 5: You can subset both rows and columns: mat[1:2, 2:3] Reminder: Matrix: A 2-D collection of elements of the same data type (i.e., all numeric, all character, etc.) Visual Example: [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 Data Frame: It's like a table in a spreadsheet; it can hold different data types in each column (e.g., numeric, character, logical, etc.) Visual Example: Data Frame (mixed types) ┌────────┬─────┬────────┐ │ Name │ Age │ Passed │ ├────────┼─────┼────────┤ │ Alice │ 25 │ TRUE │ │ Bob │ 30 │ FALSE │ │ Carol │ 22 │ TRUE │ └────────┴─────┴────────┘
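
The card does not show how mat was created. The sketch below assumes mat <- matrix(1:12, 4, 3), which is consistent with the outputs shown in Examples 1 to 3.

```r
# Assumed definition of mat: 4 rows, 3 columns, filled column by column
mat <- matrix(1:12, nrow = 4, ncol = 3)

one_cell   <- mat[2, 3]      # second row, third column
second_row <- mat[2, ]       # entire second row
third_col  <- mat[, 3]       # entire third column
block      <- mat[1:2, 2:3]  # rows 1-2 and columns 2-3

print(one_cell)    # 10
print(second_row)  # 2 6 10
print(third_col)   # 9 10 11 12
print(dim(block))  # 2 2
```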

Answer 132

Use single square brackets. syntax: data_frame_name[row, column] Example: murders[25, 1] [1] "Mississippi" Example: murders[2:3, ] (*You get all columns, because only the rows were specified) state abb region pop total Alaska AK West 710231 19 Arizona AZ West 6392017 232

Answer 133

concatenate c() Example: codes <- c(380, 124, 818) codes [1] 380 124 818 Character Vectors: Use quotes to denote that the entries are characters rather than variable names. country <- c("italy", "canada", "egypt") You can also use single quotes: country <- c('italy', 'canada', 'egypt') Note: If you don't use quotes with characters, you will get an error. R will be looking for variables with those names and won't find any.

Answer 134

Why? It can be useful. Example: When defining a vector of country codes, you can use names to connect the codes with the country name. codes <- c(italy = 380, canada = 124, egypt = 818) codes italy canada egypt 380 124 818 OR codes <- c("italy" = 380, "canada" = 124, "egypt" = 818) codes italy canada egypt 380 124 818 OR Use the names function to assign names: codes <- c(380, 124, 818) country <- c("italy", "canada", "egypt") names(codes) <- country codes italy canada egypt 380 124 818 Note: The object codes continues to be a numeric vector: class(codes) [1] "numeric" But with names: names(codes) [1] "italy" "canada" "egypt"
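
One practical payoff of names, sketched below, is that entries can be looked up by name instead of by position. The lookup lines go slightly beyond what the card states.

```r
# Named numeric vector, as on the card
codes <- c(italy = 380, canada = 124, egypt = 818)

# Look up by name: single brackets keep the name, double brackets drop it
canada_named <- codes["canada"]
canada_value <- codes[["canada"]]

print(canada_named)   # prints the name canada above the value 124
print(canada_value)   # 124
print(class(codes))   # "numeric"
print(names(codes))   # "italy" "canada" "egypt"
```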

Answer 135

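Integers: whole numbers (i.e., numbers without a decimal point) Numeric: numbers that can contain a decimal. Note: In R, a number typed without the L suffix (e.g., 1) is stored as numeric even if it is whole; type 1L to get an integer.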

Answer 136

In R, recycling refers to how R handles operations between vectors of different lengths. When you perform arithmetic (like addition, subtraction, multiplication, etc.) on two vectors that aren’t the same length, R automatically “recycles” (repeats) the shorter vector until it matches the length of the longer one. Note: it's a common source of unnoticed errors. If you accidentally have mismatched vector lengths, R won’t stop you; it will quietly recycle values, which can give you plausible-looking but wrong results. R only issues a warning when the longer length is not a multiple of the shorter one; otherwise you receive no error or warning, so you must be careful to ensure that your vectors are the same length!
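
A short sketch of recycling with made-up vectors. The first sum recycles silently because 4 is a multiple of 2; the second triggers a warning because 3 is not a multiple of 2, and the handler below simply records that the warning happened.

```r
# Silent recycling: c(10, 20) is repeated to c(10, 20, 10, 20)
silent <- c(1, 2, 3, 4) + c(10, 20)
print(silent)   # 11 22 13 24

# Lengths 3 and 2: R still recycles, but issues a warning
warned <- FALSE
noisy <- withCallingHandlers(
  c(1, 2, 3) + c(10, 20),
  warning = function(w) {
    warned <<- TRUE                  # remember that a warning was raised
    invokeRestart("muffleWarning")   # keep the console quiet
  }
)
print(noisy)    # 11 22 13
print(warned)   # TRUE
```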

Answer 137

set actions to occur only if a condition or a set of conditions are met.

Answer 138

Conditional expressions are one of the basic features of programming. They are used for what is called flow control. The most common conditional expression is the if-else statement Example: a <- 0 if (a != 0) { print(1/a) } else{ print("No reciprocal for 0.") } #> [1] "No reciprocal for 0."

Answer 139

The ifelse function allows you to perform element-wise conditional operations on vectors or data frames. syntax: ifelse(test_expression, x, y) test_expression: an object which can be coerced to logical mode x: the value or expression to be returned when the condition is true. It can be a single value, vector, or expression. y: the value or expression to be returned when the condition is false. It can be a single value, vector, or expression.

Answer 140

modulo operator Gives you the remainder of an integer division. Example: # create a vector a <- c(5, 7, 2, 9) # check if each element in a is even or odd ifelse(a %% 2 == 0, "even", "odd") [1] "odd" "odd" "even" "odd" So, 5 divided by 2 is 2 with a remainder of 1, so 5 %% 2 is 1, and since that isn't 0, 5 is odd. Remember: Dividing by 2 can tell you if a number is even or odd. Even numbers are perfectly divisible by 2 (remainder 0). Odd numbers leave a remainder of 1.
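
A brief sketch showing the remainder (%%) next to the integer quotient (%/%) for the same vector, so the relationship between the two is visible:

```r
a <- c(5, 7, 2, 9)

remainder <- a %% 2    # what is left over after dividing by 2
quotient  <- a %/% 2   # how many whole times 2 fits

print(remainder)   # 1 1 0 1
print(quotient)    # 2 3 1 4

# quotient * 2 + remainder rebuilds the original numbers
rebuilt <- quotient * 2 + remainder
print(rebuilt)     # 5 7 2 9

parity <- ifelse(remainder == 0, "even", "odd")
print(parity)      # "odd" "odd" "even" "odd"
```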

Answer 141

any: takes a vector of logicals and returns TRUE if any of the entries are TRUE all: takes a vector of logicals and returns TRUE if all of the entries are TRUE Example: z <- c(TRUE, TRUE, FALSE) any(z) [1] TRUE all(z) [1] FALSE

Answer 142

A labeled container that keeps track of which functions and variables belong to which package or environment, so that two packages (or functions) can use the same name without clashing. Example: Both the dplyr package and the stats package have a function named filter() dplyr filter(): filters rows in a data frame stats filter(): applies a linear filter to a time series Think of it as R asking: When you call a function named filter(), which filter() do you mean: the one from dplyr or the one from stats?

Answer 143

Example: Both the dplyr package and the stats package have a function named filter() You can force the use of a specific namespace by using double colons (::) like this: stats::filter dplyr::filter

Answer 144

Use the double colon: package_name::function_name Example: stats::filter

Answer 145

In programming, a for loop is a control flow statement that allows you to execute a block of code repeatedly, once for each element of a sequence. It is commonly used when you know how many times you want to execute a block of code. For loop syntax varies by programming language, but for R it's: for (x in 1:10) { print(x) } Example: Print every item in this list. fruits <- list("apple", "banana", "cherry") for (x in fruits) { print(x) } [1] "apple" [1] "banana" [1] "cherry"
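
A for loop is often used to build up a result step by step. The sketch below adds the numbers 1 through n with a loop and compares the result with the vectorized sum(); the variable names are just for illustration.

```r
n <- 10
running_total <- 0

# Add each value of i to the running total, one pass per value
for (i in 1:n) {
  running_total <- running_total + i
}

print(running_total)   # 55
print(sum(1:n))        # 55, same answer without a loop
```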

Answer 146

Vectorization, because it results in shorter and clearer code.

Answer 147

A function that will apply the same operation to each element of a vector Example: x <- 1:10 sqrt(x) #> [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16 y <- 1:10 x*y #> [1] 1 4 9 16 25 36 49 64 81 100

Answer 148

The scalar data structure holds only a single atomic value at a time. Vectors that have a single value (length 1) are called scalars. Vectors can contain numbers, characters, factors, or logicals. But all the elements inside a vector must be of the same class. In other words, vectors can contain either numbers, characters, or logicals but not mixtures of these types of data. There is only one exception to this rule: you can include NA (this is a special type of logical) to denote missing data in vectors with other data types.

Answer 149

Functions that help us apply the same function to each entry in a vector, matrix, data frame, or list.

Answer 150

Allows you to apply any function element-wise. Example: x <- 1:10 sapply(x, sqrt) #> [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16 So, each element of x is passed on to the function sqrt and the result is returned. These results are concatenated. In this case, the result is a vector of the same length as the original x. So the for loop can be written as follows: n <- 1:25 s_n <- sapply(n, compute_s_n) Note: compute_s_n is a user-defined function (i.e., what it does must be defined earlier in the script).
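
The card leaves compute_s_n undefined. The sketch below assumes it computes the sum 1 + 2 + ... + n (that definition is an assumption for illustration) and shows the for loop and the sapply version side by side.

```r
# Assumed definition: sum of the integers from 1 to n
compute_s_n <- function(n) {
  sum(1:n)
}

n <- 1:25

# for loop version: fill an empty vector one entry at a time
s_n_loop <- vector("numeric", length(n))
for (i in n) {
  s_n_loop[i] <- compute_s_n(i)
}

# sapply version: same result in one line
s_n <- sapply(n, compute_s_n)

print(head(s_n))   # 1 3 6 10 15 21
```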

Answer 151

stands for "function" in many of R's "apply family" functions (i.e., apply, lapply, sapply, tapply, mapply, vapply, replicate). #Apply this function (FUN) to each element of X sapply(X, FUN) Example 1: #FUN = sqrt sapply(1:5, sqrt) [1] 1.00 1.41 1.73 2.00 2.24 Example 2: #X = n, FUN = compute_s_n sapply(X = n, FUN = compute_s_n) So X is the data (1:25) FUN is the function to apply (compute_s_n) FUN is just a parameter that tells R: "Hey, this is a function I want you to apply to every element of X."

Answer 152

A set of functions that allow users to apply a function to elements of a vector, list, or matrix. The Functions: apply, lapply, sapply, tapply, mapply, vapply, and replicate Note: these base R functions are still supported, but the tidyverse offers the purrr package as an alternative for this kind of looping, with more predictable return types.

Answer 153

dplyr package: used to manipulate data frames. Introduces functions that perform some of the most common operations when working with data frames. purrr package: used for working with functions ggplot2: a graphing package

Answer 154

A few of the functions: Rows: filter(): chooses rows based on column values slice(): chooses rows based on location arrange(): changes the order of rows --desc to veer from default of ascending Columns: select(): changes whether or not a column is included rename(): changes the names of the columns mutate(): changes the values of columns and creates new columns relocate(): changes the order of the columns The pipe |>

Answer 155

starts_with: matches names that begin with a given prefix (e.g., "abc") contains: matches names that contain a given string (e.g., "xyz") ends_with: matches names that end with a given suffix (e.g., "xyz") matches: selects variables that match a regular expression (for instance, a pattern for names that contain repeated characters) num_range("x", 1:3): matches x1, x2, and x3 Example, starts_with: select(penguins, starts_with("Bill")) output: bill_length_mm bill_depth_mm 39.1 18.7 39.5 17.4 Example, contains: select(penguins, contains("length")) bill_length_mm flipper_length_mm 39.1 181 39.5 186 Example, ends_with: select(penguins, ends_with("_mm"), ends_with("_g")) bill_length_mm bill_depth_mm flipper_length_mm body_mass_g 39.1 18.7 181 3750

Answer 156

Create a data frame in the tibble format: grades <- tibble(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90)) Base R (without packages loaded) has the data.frame function, which can be used to create a regular data frame rather than a tibble: grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90))

Answer 157

Includes functions similar to sapply (i.e., the function that applies the same function or procedure to each element of an object), but with predictable return types: sapply may return a vector, a matrix, or a list depending on the input, whereas purrr functions return objects of a specified type or return an error if that is not possible. Reminder: sapply allows you to apply any function element-wise.

Answer 158

Like sapply, map allows you to apply any function element-wise, but it will always return a list Example: library(purrr) s_n <- map(n, compute_s_n) class(s_n) [1] "list" But if you want a numeric vector, use map_dbl, which always returns a vector of numeric values. s_n <- map_dbl(n, compute_s_n) class(s_n) [1] "numeric"

Answer 159

The case_when function is useful for vectorizing conditional statements. It's similar to ifelse but can output any number of values (as opposed to just two). Example: Split numbers into negative, positive, and 0: x <- c(-2, -1, 0, 1, 2) case_when(x < 0 ~ "Negative", x > 0 ~ "Positive", TRUE ~ "Zero") [1] "Negative" "Negative" "Zero" "Positive" "Positive" Common Use Case: Define categorical variables based on existing variables. Example 2: Compare the murder rates in 4 groups of states: New England, West Coast, South, and Other. Start by assigning these categories to the states. murders |> mutate(group = case_when( abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England", abb %in% c("WA", "OR", "CA") ~ "West Coast", region == "South" ~ "South", TRUE ~ "Other")) |> group_by(group) |> summarize(rate = sum(total) / sum(population) * 10^5) A tibble: 4x2 group rate New England 1.72 Other 2.71 South 3.63 West Coast 2.90 Note: "vectorized" means that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.

Answer 160

Use to see if a value falls inside an interval. Example: Check to see if the elements of vector x are between a and b: between(x, a, b)

Answer 161

Used to ignore NA values when calculating Example: NHANES |> filter(AgeDecade == " 20-29" & Gender == "female") |> summarize( minbp = min(BPSysAve, na.rm = TRUE), maxbp = max(BPSysAve, na.rm = TRUE) )

Answer 162

Separate package that requires installation. #install the data.table package install.packages("data.table") #load data.table package library(data.table) #load other packages and datasets library(tidyverse) library(dplyr) library(dslabs) data(murders)

Answer 163

Convert the data frame into a data.table object using the as.data.table function. Example: murders_dt <- as.data.table(murders)

Answer 164

Selecting with data.table: murders_dt[, c("state", "region")] OR (use .() data.table notation to alert R that variables inside the parenthesis are column names, not objects in the R environment.) murders_dt[, .(state, region)] vs. Selecting with dplyr: select(murders, state, region)

Answer 165

Example: Add a new column "rate" to the table: Using data.table: murders_dt[, rate := total / population * 100000] *Note: data.table avoids new assignment (i.e., the := operator modifies data in place/it directly alters the existing table without creating a new copy). This takes up less memory than the dplyr mutate. This helps with large datasets that take up most of your computer's memory. vs. Using dplyr: murders <- mutate(murders, rate = total / population * 100000)

Answer 166

Example: murders_dt[, ":="(rate = total / population * 100000, rank = rank(population))]

Answer 167

The data.table package was designed to avoid wasting memory, so if you make a "copy" of the table in any of the following ways, you're just creating a new name for an object; you're not actually creating a new object. 1. Assignment x <- data.table(a = 1) y <- x 2. Modify x x[, a := 2] y #> a #> #> 1: 2 3. Modify y y[, a := 1] x #> a #> #> 1: 1 To create an actual copy, use copy: x <- data.table(a = 1) y <- copy(x) x[, a := 2] y #> a #> #> 1: 1

Answer 168

In data.table parlance, all set functions change their input by reference. This means that no copy is made, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=

Answer 169

Use the setDT() function to convert a data frame to a data table. Syntax: setDT(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) Example: x <- data.frame(a = 1) setDT(x) x: name of the data frame to convert to a data table keep.rownames: whether to keep the row names from the data frame in a new column key: character vector of one or more column names to pass to setkeyv check.names: whether to check names for valid formats before converting data frame to data table *Use setDT() when working with larger data sets that take up a considerable amount of RAM because the operation will modify each object in place, conserving memory. For data that is a very small percentage of RAM, using data.table's copy-and-modify is fine.

Answer 170

data.frame: -Base R data structure for tables -No package required -Speed: slower for large data -Memory usage: copies data often -Syntax: standard, verbose -Best for: small/medium datasets, beginners data.table: -Enhanced version of data.frame -data.table package is required -Speed: very fast -Memory usage: updates by reference (no copies) -Syntax: concise, powerful -Best for: large datasets, fast data manipulation

Answer 171

Example: Extract rates less than or equal to 0.7 data.table: murders[rate <= 0.7] dplyr: filter(murders, rate <= 0.7)

Answer 172

Example: select the state and rate for those with a rate less than or equal to 0.7 data.table: murders[rate <= 0.7, .(state, rate)] dplyr: murders |> filter(rate <= 0.7) |> select(state, rate)

Answer 173

library(tidyverse) library(dplyr) library(dslabs) data(murders) library(data.table) murders <- setDT(murders) murders <- mutate(murders, rate = total / population * 10^5) murders[, rate := total / population * 100000]

Answer 174

In data.table, we call functions inside .() and they will be applied to rows.

Answer 175

Example: load packages and prepare the data - heights dataset library(tidyverse) library(dplyr) library(data.table) library(dslabs) data(heights) heights <- setDT(heights) dplyr: s <- heights |> summarize(average = mean(height), standard_deviation = sd(height)) data.table: s <- heights[, .(average = mean(height), standard_deviation = sd(height))] multiple summaries in data.table (median_min_max is a user-defined function that must be defined earlier in the script): heights[, .(median_min_max(height))]

Answer 176

load packages and prepare the data - heights dataset library(tidyverse) library(dplyr) library(data.table) library(dslabs) data(heights) heights <- setDT(heights) dplyr: s <- heights |> filter(sex == "Female") |> summarize(average = mean(height), standard_deviation = sd(height)) data.table: s <- heights[sex == "Female", .(average = mean(height), standard_deviation = sd(height))]

Answer 177

load packages and prepare the data - heights dataset library(tidyverse) library(dplyr) library(data.table) library(dslabs) data(heights) heights <- setDT(heights) #get mean height and sd for males and females heights[, .(average = mean(height), standard_deviation = sd(height)), by = sex]

Answer 178

#load packages and datasets and prepare the data library(tidyverse) library(dplyr) library(data.table) library(dslabs) data(murders) murders <- setDT(murders) murders[, rate := total / population * 100000] #order by population murders[order(population)] |> head() #order by population in descending order murders[order(population, decreasing = TRUE)] #order by region and then murder rate murders[order(region, rate)]

Answer 179

A tbl (pronounced "tibble") is a special kind of data frame. Tibbles are the default data frame in the tidyverse. Tibbles display better than regular data frames. Subsets of tibbles are tibbles, which is useful because tidyverse functions require data frames as inputs. Tibbles will warn you if you try to access a column that doesn't exist. Entries in tibbles can be complex - they can be lists or functions. The function group_by() returns a grouped tibble, which is a special kind of tibble.

Answer 180

murders |> group_by(region)

Answer 181

murders |> group_by(region) |> class()

Answer 182

as_tibble(gapminder) Note: gapminder is a dataset in R saved in a data frame (or tbl) containing social and economic indicators for countries over time.

Answer 183

class(murders[,1]) class(as_tibble(murders)[,1])

Answer 184

class(as_tibble(murders)$state)

Answer 185

tibble(id = c(1, 2, 3), func = c(mean, median, sd))

Answer 186

The core idea: When you use filter() and select(), you still have a data frame (a table). But the mean() function expects a vector (just a single column of numbers). If you call mean(height) directly after a pipe, R cannot compute the mean, because the pipe passes the whole data frame as the first argument and height only exists as a column inside that data frame, not as a standalone object. What summarize() does: summarize() (or summarise()) reduces a data frame down to a single summary value per group. It takes your column (height) and applies a function to it (mean()). It converts this: height sex 61 Female 65 Female 62 Female ... ... Into this: mean_height_cm 162.5 So summarize() tells R: "Take this column and compute a summary from it."

Answer 187

General Form: if(boolean condition){ expressions } else { alternative expressions } Example: Print the state with the lowest murder rate if that rate is less than 0.5 murder_rate <- murders$total / murders$population * 100000 ind <- which.min(murder_rate) if(murder_rate[ind] < 0.5){ print(murders$state[ind]) } else{ print("No state has murder rate that low") } [1] "Vermont"

Answer 188

load packages and prepare the data library(tidyverse) library(dplyr) library(dslabs) data(murders) library(data.table) murders <- setDT(murders) murders <- mutate(murders, rate = total / population * 10^5) murders[, rate := total / population * 100000] subsetting dplyr: filter(murders, rate <= 0.7) data.table: murders[rate <= 0.7]

Answer 189

load packages and prepare the data library(tidyverse) library(dplyr) library(dslabs) data(murders) library(data.table) murders <- setDT(murders) adding the rate column dplyr: murders <- mutate(murders, rate = total / population * 10^5) data.table: murders[, rate := total / population * 100000] combining filter and select dplyr: murders %>% filter(rate <= 0.7) %>% select(state, rate) data.table: murders[rate <= 0.7, .(state, rate)]

Answer 190

This function takes 3 arguments: 1 logical argument and 2 possible answers. If the logical is true, the first answer is returned. if it's false, the second answer is returned. *Ifelse works on vectors. It examines each element of a logical vector and returns a corresponding answer. Example: If a is bigger than zero, return the reciprocal. If not, return NA. a <- 0 ifelse(a > 0, 1/a, NA) [1] NA Notes: Logical Vector: TRUE, FALSE, or NA (for missing values)
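
Because ifelse works element by element, the behavior is easier to see on a vector than on a single value. A minimal sketch, assuming a made-up vector a:

```r
# Hypothetical input vector chosen for illustration
a <- c(0, 1, 2, -4, 5)

# For each element: return the reciprocal if it is positive, otherwise NA
result <- ifelse(a > 0, 1 / a, NA)
print(result)
```

The result has the same length as a: NA for 0 and -4, and the reciprocals 1, 0.5, and 0.2 for the positive entries.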

Answer 191

Examples: #How many NAs? data(na_example) sum(is.na(na_example)) [1] 145 #Replace the NAs with 0 and then confirm they're gone. no_nas <- ifelse(is.na(na_example), 0, na_example) sum(is.na(no_nas)) [1] 0

Answer 192

The any function takes a vector of logicals and returns TRUE if any of the entries are TRUE. Example: Are there any TRUE entries in the vector z? z <- c(TRUE, TRUE, FALSE) any(z) [1] TRUE Example: Are there any TRUE entries in the vector b? b <- c(FALSE, FALSE, FALSE) any(b) [1] FALSE

Answer 193

The all function takes a vector of logicals and returns TRUE if all of the entries are TRUE. z <- c(TRUE, TRUE, FALSE) all(z) [1] FALSE z <- c(TRUE, TRUE, TRUE) all(z) [1] TRUE

Answer 194

my_function <- function(x) {operations that operate on x, which is defined by the user of the function; the value of the final line is returned} Functions can have more than one argument. my_function <- function(x, y, z) { operations that operate on x, y, z, which are defined by the user of the function; the value of the final line is returned} example of defining a function to compute the average of a vector x avg <- function(x){ s <- sum(x) n <- length(x) s/n } Notes: -Functions are objects, so they must be assigned to a variable name with the arrow operator.
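
A function with more than one argument can be sketched as follows. The second argument, arithmetic, and its default value are assumptions added for illustration; they are not part of the avg definition above.

```r
# Sketch: avg with a second argument that has a default value
avg <- function(x, arithmetic = TRUE) {
  n <- length(x)
  # The value of the last evaluated expression is what the function returns
  if (arithmetic) sum(x) / n else prod(x)^(1 / n)
}

avg(c(1, 4, 16))                      # arithmetic mean
avg(c(1, 4, 16), arithmetic = FALSE)  # geometric mean
```

Arguments with defaults can be left out of the call, so avg(x) keeps working exactly as before.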

Answer 195

Namespaces are a convention used by programming languages to be able to use the same variable names to access the values of different objects. For example, in the following code we use the same name, x, to represent two different objects, one in the Namespace inside the function and another in the Namespace outside the function. my_func <- function(x){ x <- x + 1 print(x) return(NULL) } x <- 1 my_func(x) print(x) Note that when we redefine x as x+1, this happens to the object named x in the Namespace of the my_func function. Therefore the x defined in the Namespace outside the function is not affected.

Answer 196

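R uses lexical scoping: a variable that is not defined inside a function is looked up in the environment where the function was defined, while a variable assigned inside a function is local to it and separate from any variable of the same name outside. Example: x <- 3 my_func <- function(y){ x <- 5 y print(x) } my_func(x) [1] 5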
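
To see both sides of scoping, the sketch below compares a function that only reads x with one that assigns its own x. The function names show_x and change_x are hypothetical.

```r
x <- 3

# x is not defined inside show_x, so R looks it up where the
# function was defined (here, the global environment)
show_x <- function() {
  x
}

# change_x creates its own local x; the global x is left untouched
change_x <- function() {
  x <- 5
  x
}

show_x()    # uses the global x
change_x()  # uses the local x
x           # still the global value
```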

Answer 197

For-loops perform the same task over and over again while changing a variable. They let us define the range that our variable takes, change its value on each pass, and evaluate the expression every time. general form: for(i in range of values) { operations that use i, which is changing across the range of values } example: for(i in 1:5) { print(i) } [1] 1 [1] 2 [1] 3 [1] 4 [1] 5 (Note: At the end of the loop, the value of i is the last value of the range. So, if you type i after the above for-loop, you get back 5.) i [1] 5

Answer 198

apply sapply tapply mapply
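
A minimal sketch of what each of these four functions does, using small made-up inputs:

```r
# sapply: apply a function to each element and simplify the result to a vector
squares <- sapply(1:5, function(n) n^2)

# mapply: apply a function to several vectors in parallel
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# apply: apply a function over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
row_totals <- apply(m, 1, sum)

# tapply: apply a function to groups of values defined by a grouping vector
heights <- c(150, 160, 170, 180)   # made-up values
group <- c("a", "a", "b", "b")
group_means <- tapply(heights, group, mean)
```

These functions often replace a for-loop when the same operation is applied to every element, row, or group.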

Answer 199

Examples: create an empty vector of length m (m must already be defined) s_n <- vector(length = m) create a for-loop that calculates the sum of integers from 1 to n for 10 different values of n and stores them in a vector called results: results <- vector("numeric", 10) n <- 10 for(i in 1:n){ results[i] <- sum(1:i) }

Answer 200

No. Not all spreadsheet files are in a text format. These cannot be viewed in a text editor. Examples include Google Sheets, which are rendered on a browser, and Microsoft Excel, which has its own proprietary format.

Answer 201

When creating spreadsheets with text files, a new row is defined with return and columns are separated with some predefined special character. The most common characters are comma (,), semicolon (;), space ( ), and tab (a preset number of spaces or \t).

Answer 202

Up to this point, we have been using data sets already stored as R objects. However, it is common to import data into R from either a file, a database, or other sources.

Answer 203

You can think of your computer’s filesystem as a series of nested folders, each containing other folders and files. We refer to folders as directories.

Answer 204

We refer to the folder that contains all other folders as the root directory.

Answer 205

We refer to the directory in which we are currently located as the working directory.

Answer 206

a list of directory names that can be thought of as instructions on what folders to click on and in what order to find a file

Answer 207

The path of a file is a list of directory names that can be thought of as instructions on what folders to click on and in what order to find a file. If the instructions are for finding the file starting in the working directory, we refer to it as a relative path.

Answer 208

The path of a file is a list of directory names that can be thought of as instructions on what folders to click on and in what order to find a file. If the instructions are for finding the file starting from the root directory, we refer to it as a full path.
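
The difference between a relative and a full path can be sketched in base R. The data folder below is a hypothetical location; nothing is read from disk.

```r
# The working directory is the starting point for relative paths
wd <- getwd()

# A relative path: interpreted starting from the working directory
rel_path <- file.path("data", "murders.csv")   # hypothetical location

# A full path: the same location spelled out from the root directory
full_path <- file.path(wd, rel_path)

rel_path
full_path
```

The relative path stays the same on every computer, while the full path changes with wd.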

Answer 209

Example: system.file(package = "dslabs") #> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dslabs" Note that the output will be different across different computers. The system.file function finds the full path to the files that were added to your system when you installed the dslabs package. The strings separated by slashes are the directory names. The first slash represents the root directory and we know this is a full path because it starts with a slash.

Answer 210

We can use the function list.files to show the names of files and directories in any directory. For example, here are the files in the dslabs package directory: dir <- system.file(package = "dslabs") list.files(dir) #> [1] "data" "DESCRIPTION" "extdata" "help" #> [5] "html" "INDEX" "Meta" "NAMESPACE" #> [9] "R" "script" Note that these do not start with slash which implies they are relative paths. These relative paths give us the location of the files or directories if the path stored in dir is our working directory.

Answer 211

We highly recommend only using relative paths in your code. The reason is that full paths are unique to your computer and you want your code to be portable.

Answer 212

You can get the full path of your working directory using the getwd function. wd <- getwd()

Answer 213

If you need to change your working directory, you can use the function setwd or you can change it through RStudio by clicking on “Session”.

Answer 214

The file.path function combines characters to form a complete path, ensuring compatibility with the respective operating system. This function is useful because often you want to define paths using a variable. Here is an example that constructs the full path for a spreadsheet containing the murders data. Here the variable dir contains the full path for the dslabs package and extdata/murders.csv is the relative path of the spreadsheet if dir is considered the working directory. dir <- system.file(package = "dslabs") file_path <- file.path(dir, "extdata/murders.csv")

Answer 215

You can copy the file with full path file_path to your working directory using the function file.copy: file.copy(file_path, "murders.csv") #> [1] TRUE If the file is copied successfully, this function will return TRUE.

Answer 216

When text files are used to store a spreadsheet, line breaks are used to separate rows and a predefined character, referred to as the delimiter, is used to separate columns within a row. The most common delimiters are comma (,), semicolon (;), space ( ), and tab (a preset number of spaces or \t).

Answer 217

In some cases, the delimiter can be inferred from file suffix. For example, files ending in csv or tsv are expected to be comma and tab delimited, respectively. However, it is harder to infer the delimiter for files ending in txt. As a result we recommend looking at the file rather than inferring from the suffix. You can look at any number of lines from within R using the readLines function: readLines("murders.csv", n = 3) #> [1] "state,abb,region,population,total" #> [2] "Alabama,AL,South,4779736,135" #> [3] "Alaska,AK,West,710231,19" This immediately reveals that the file is indeed comma delimited. It also reveals that the file has a header: the first row contains column names rather than data. This is also important to know. Most parsers assume the file starts with a header, but not all files have one.
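
The same look-first approach works for a txt file, where the suffix says nothing about the delimiter. A minimal sketch in base R, assuming a tiny made-up tab-delimited file written to a temporary location:

```r
# Write a tiny tab-delimited file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("state\ttotal", "Alabama\t135", "Alaska\t19"), tmp)

# Look at the first lines to find the delimiter and check for a header
readLines(tmp, n = 2)

# Having seen tabs and a header row, read the file accordingly
dat <- read.delim(tmp, sep = "\t", header = TRUE)
dat

file.remove(tmp)
```

Here readLines shows \t between the values, which tells us to read the file as tab delimited.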

Answer 218

The most common delimiters are comma (,), semicolon (;), space (), and tab (a preset number of spaces or \t). Slightly different approaches are used to read these files into R, so we need to know what delimiter was used. In some cases, the delimiter can be inferred from file suffix. For example, files ending in csv or tsv are expected to be comma and tab delimited, respectively. However, it is harder to infer the delimiter for files ending in txt. As a result we recommend looking at the file rather than inferring from the suffix. *You can look at any number of lines from within R using the readLines function: readLines("murders.csv", n = 3) #> [1] "state,abb,region,population,total" #> [2] "Alabama,AL,South,4779736,135" #> [3] "Alaska,AK,West,710231,19" This immediately reveals that the file is indeed comma delimited. It also reveals that the file has a header: the first row contains column names rather than data. This is also important to know. Most parsers assume the file starts with a header, but not all files have one.

Answer 219

Unlike text files, which are designed for human readability and have standardized conventions, binary files can adopt numerous formats specific to their data type. Opening image files such as jpg or png in a text editor or using readLines in R will not show comprehensible content because these are binary files.

Answer 220

Incorrectly identifying the file’s encoding. At its core, a computer translates everything into sequences of 0s and 1s. ASCII is an encoding system that assigns specific numbers to characters. Using 7 bits, ASCII can represent 2^7 = 128 unique symbols, sufficient for all English keyboard characters. However, many global languages contain characters outside ASCII’s range. For instance, the é in “México” isn’t in ASCII’s catalog. To address this, broader encodings, such as Unicode, emerged. Unicode offers variations using 8, 16, or 32 bits, known as UTF-8, UTF-16, and UTF-32. RStudio typically uses UTF-8 as its default. Notably, ASCII is a subset of UTF-8, meaning that if a file is ASCII-encoded, presuming it’s UTF-8 encoded won’t cause issues. However, there are other encodings, such as ISO-8859-1 (also known as Latin-1) developed for the western European languages, Big5 for Traditional Chinese, and ISO-8859-6 for Arabic.
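
The effect of an encoding on a single character can be sketched in base R. The string is written with a Unicode escape so the example does not depend on how the script file itself is saved.

```r
# "\u00e9" is the Unicode escape for the é in "México"
x <- "M\u00e9xico"

# ASCII characters have codes below 128; é does not
utf8ToInt("M")        # code point of an ASCII character
utf8ToInt("\u00e9")   # code point outside the ASCII range

# In UTF-8 the é takes two bytes, so the string has more bytes than characters
nchar(x, type = "chars")
nchar(x, type = "bytes")

# Convert the same text to Latin-1, where é is stored in a single byte
latin1 <- iconv(x, from = "UTF-8", to = "latin1")
nchar(latin1, type = "bytes")
```

The same six characters occupy a different number of bytes in each encoding, which is why a parser that guesses the wrong encoding shows garbled text.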

Answer 221

Parser, file parser, and importing function are interchangeable terms for a function that reads the contents of a file into an R object.

Answer 222

With scan, you can read-in each cell of a file. Example: x <- scan("murders.csv", sep = ",", what = "c") x[1:10] #> [1] "state" "abb" "region" "population" "total" #> [6] "Alabama" "AL" "South" "4779736" "135" Why this is useful: When reading in spreadsheets many things can go wrong. The file might have multiline headers or be missing cells. With experience you will learn how to deal with different challenges.

Answer 223

The readr package includes parsers, for reading text file spreadsheets into R. readr is part of the tidyverse, but you can load it directly using: library(readr)

Answer 224

Function | Format | Typical suffix
read_table | white space separated values | txt
read_csv | comma separated values | csv
read_csv2 | semicolon separated values | csv
read_tsv | tab delimited values | tsv
read_delim | general text file format, must define delimiter | txt
It also includes read_lines, with functionality similar to readLines, and guess_encoding, which tries to guess the encoding: guess_encoding("murders.csv")
#> # A tibble: 1 × 2
#>   encoding confidence
#>   <chr>         <dbl>
#> 1 ASCII             1

Answer 225

The readxl package provides functions to read-in Microsoft Excel formats. library(readxl)

Answer 226

The readxl package provides functions to read in Microsoft Excel formats.
Function | Format | Typical suffix
read_excel | auto detect the format | xls, xlsx
read_xls | original format | xls
read_xlsx | new format | xlsx
The excel_sheets function gives us the names of all the sheets in an Excel file.

Answer 227

a powerful and fast utility designed for reading large datasets. fread automatically detects the format of the input, whether it’s delimited text or even files compressed in formats like gzip or zip. It offers a significant speed advantage over the other parsers described here, especially for large files. library(data.table) dat <- fread("murders.csv") Note fread returns a data.table object.

Answer 228

tempdir: creates a directory with a random name that is likely to be unique. tempfile: creates a character string, not a file, that is likely to be a unique filename. So you can run a command like this which erases the temporary file once it imports the data: tmp_filename <- tempfile() download.file(url, tmp_filename) dat <- read_csv(tmp_filename) file.remove(tmp_filename)

Answer 229

You want the names you pick for objects, files, and directories to be memorable, easy to spell, and descriptive. This is actually a hard balance to achieve and it does require time and thought. One important rule to follow is do not use spaces; use underscores (_) or dashes (-) instead. Also, avoid symbols; stick to letters and numbers.

Answer 230

Write Dates as YYYY-MM-DD

Answer 231

- Choose Good Names
- Write Dates as YYYY-MM-DD: We recommend using this global ISO 8601 standard.
- No Empty Cells: Fill in all cells and use a common code for missing data.
- Put Just One Thing in a Cell: It is better to add columns to store the extra information rather than having more than one piece of information in one cell.
- Make It a Rectangle: The spreadsheet should be a rectangle.
- Create a Data Dictionary: If you need to explain things, such as what the columns are or what the labels used for categorical variables are, do this in a separate file.
- No Calculations in the Raw Data Files: Excel permits you to perform calculations. Do not make this part of your spreadsheet. Code for calculations should be in a script.
- Do Not Use Font Color or Highlighting as Data: Most import functions are not able to import this information. Encode this information as a variable instead.
- Make Backups: Make regular backups of your data.
- Use Data Validation to Avoid Errors: Leverage the tools in your spreadsheet software so that the process is as error-free and repetitive-stress-injury-free as possible.
- Save the Data as Text Files: Save files for sharing in comma or tab delimited format.
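
One practical reason for the YYYY-MM-DD rule is that such dates sort correctly even as plain text. A small sketch with made-up dates:

```r
# The same three dates written two ways (made-up examples)
iso <- c("2023-01-05", "2021-11-20", "2022-07-14")
mdy <- c("01-05-2023", "11-20-2021", "07-14-2022")

# Sorting ISO 8601 strings as plain text already gives chronological order
sort(iso)

# Sorting month-first strings as text does not
sort(mdy)

# ISO 8601 strings also convert to Date objects without a format argument
as.Date(iso)
```

With the month-first strings, the 2023 date sorts first because text sorting only compares characters from left to right.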

Answer 232

When importing data from a spreadsheet, the first step is to locate the file containing the data. You could use an approach similar to what you do to open files in Microsoft Excel (although we do not recommend it) by clicking on the RStudio “File” menu - “Import Dataset”, and then navigating through folders until you find the file. Our preference is to write code. We need to let the R functions doing the importing know where to look for the file containing the data. The simplest way to do this is to have a copy of the file in the folder in which the importing functions look by default. Example: # Copy the spreadsheet containing the US murders data (included as part of the dslabs package) filename <- "murders.csv" dir <- system.file("extdata", package = "dslabs") fullpath <- file.path(dir, filename) file.copy(fullpath, "murders.csv") Once the file is copied, import the data with a line of code. Use the read_csv function from the readr package (included in the tidyverse): library(tidyverse) dat <- read_csv(filename)

Answer 233

Example: filename <- "murders.csv" dir <- system.file("extdata", package = "dslabs") fullpath <- file.path(dir, filename)

Answer 234

Example: filename <- "murders.csv" dir <- system.file("extdata", package = "dslabs") fullpath <- file.path(dir, filename) file.copy(fullpath, "murders.csv")

Answer 235

Example: read_lines("murders.csv", n_max = 3)

Answer 236

View(dat) Note: Read-in means "imported"

Answer 237

Use the download.file function in order to have a local copy of the file. download.file(url, "murders.csv") Note: Be careful as it will overwrite files without warning

Answer 238

tmp_filename <- tempfile() download.file(url, tmp_filename) dat <- read_csv(tmp_filename) file.remove(tmp_filename) Note: A temporary file (often called a temp file) is a scratch file created by your computer during a program's execution. It is meant to store data only briefly while R is working. R has a built-in function, tempfile(), that creates a path to a file in your system's temporary directory, a special folder used just for temporary data. This folder gets cleaned automatically by your operating system or when R closes.
Step | What happens | Where is the data?
1 | A temp file path is created | No data yet
2 | File downloaded from the internet | Stored temporarily on your disk
3 | File read into R using read_csv() | Now lives in R memory
4 | The downloaded temp file is deleted | Still in R memory, so deleting the file doesn't matter
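
The four steps can be tried without an internet connection. In this sketch, writeLines stands in for download.file and base R's read.csv stands in for read_csv; both substitutions are assumptions made so the example is self-contained.

```r
# Step 1: create a path for a temporary file (no file exists yet)
tmp_filename <- tempfile(fileext = ".csv")
file.exists(tmp_filename)

# Step 2: put data in the file; writeLines stands in for download.file here
writeLines(c("state,total", "Alabama,135", "Alaska,19"), tmp_filename)

# Step 3: read the file into R's memory
dat <- read.csv(tmp_filename)

# Step 4: delete the file; the data frame in memory is unaffected
file.remove(tmp_filename)
file.exists(tmp_filename)
dat
```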

Answer 239

path <- system.file("extdata", package = "dslabs") filename <- "murders.csv" x <- scan(file.path(path, filename), sep = ",", what = "c") x[1:10]

Answer 240

A body of reusable code used to perform specific tasks in R

Answer 241

Information that a function in R needs in order to run

Answer 242

A representation of a value in R that can be stored for use later during programming. Variables can also be called objects

Answer 243

Must always start with a letter (e.g., you should not use 5penguins). It can contain numbers and underscores.

Answer 244

A group of data elements of the same type stored in a sequence in R Example: vec_1 <- c(12, 48.5, 6, 99) vec_1 [1] 12.0 48.5 6.0 99.0

Answer 245

A tool in R for expressing a sequence of multiple operations, represented with %>% or |> depending on the version used Example: ToothGrowth %>% filter(dose == 0.5) %>% arrange(len)

Answer 246

Vector: A group of data elements stored in a sequence in R
1) Atomic (homogeneous): all elements have the same type
-integer: positive and negative whole values (3)
-double: decimal values (101.175)
-logical: TRUE/FALSE
-character: string/character value ("Coding")
-complex
-raw
Numeric vectors are integers or doubles. The atomic types used most often are logical, numeric, and character; complex and raw aren't commonly used in data analysis.
2) Recursive (heterogeneous): elements can have different types
-list

Answer 247

store numeric data in a vector c(2.5, 48.5, 101.5) create a vector of integers c(1L, 5L, 15L) create a vector containing characters or logicals c("Sara", "Lisa", "Anna") c(TRUE, FALSE, TRUE) create a vector of a sequence of numbers z <- c(4:10) z

Answer 248

typeof() function is used to determine a vector's type. Examples: typeof(c("a", "b")) [1] "character" typeof(c(1L, 3L)) [1] "integer" check if a vector is a specific type by using one of the following functions: is.logical() is.double() is.integer() is.character() Examples: x <-c(2L, 5L, 11L) is.integer(x) [1] TRUE y <- c(TRUE, TRUE, FALSE) is.character(y) [1] FALSE

Answer 249

You can name elements in vectors of any type with the names() function. Example: x <- c(1, 3, 5) names(x) <- c("a", "b", "c") x output: a b c 1 3 5

Answer 250

Reference the element's position in the vector or its name with the extract operator [] Example: x <- c(1, 3, 5) names(x) <- c("a", "b", "c") x x["b"] output: a b c 1 3 5 b 3

Answer 251

now(): run it to get the current date-time today(): run it to get the current date (year, month, and day)

Answer 252

It's in the tidyverse: install.packages("tidyverse") library(tidyverse) library(lubridate) lubridate contains tools to convert strings to dates or date-times so you can perform operations on them. (Date/time data often comes as character strings, so it must be converted before operations can be performed.) To use these functions, arrange "y", "m", and "d" (i.e., year, month, and day) in the order in which they appear in the string. Example: ymd("2023-01-20") [1] "2023-01-20" mdy("January 20th, 2023") [1] "2023-01-20" dmy("20-Jan-2021") [1] "2021-01-20"

Answer 253

year month date hour minute second

Answer 254

as_date(now()) [1] "2021-01-20" (the output is the date on which the code is run)

Answer 255

A data frame is a collection of columns containing data, similar to a spreadsheet or SQL table. Each column has a name that represents a variable and includes one observation per row. Data frames summarize data and organize it into a format that is easy to read and use.

Answer 256

If you need to manually create a data frame in R, you can use the data.frame() function. data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))
  x   y
1 1 1.5
2 2 5.5
3 3 7.5

Answer 257

syntax: data_frame[row_to_extract, column_to_extract] Example: the data frame: x y 1 1.5 2 5.5 3 7.5 z <- data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5)) z[2, 1] [1] 2
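
Leaving the row or the column position empty extracts a whole column or row. A short sketch using the same data frame z:

```r
z <- data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))

z[2, 1]      # single value: row 2, column 1
z[2, ]       # whole second row (still a data frame)
z[, "y"]     # whole column y as a vector
z$y          # same column, using the $ accessor
```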

Answer 258

Use the file.create() function to create a blank file. Place the name and the type of the file in the parentheses of the function. Your file types will usually be something like .txt, .docx, or .csv. Examples: file.create("new_text_file.txt") file.create("new_word_file.docx") file.create("new_csv_file.csv")

Answer 259

Copy a file with the file.copy() function. In the parentheses, add the name of the file to be copied. Then, enter a comma, and add the name of the destination folder that you want to copy the file to. Syntax: file.copy("new_text_file.txt", "destination_folder")

Answer 260

You can delete R files with the unlink() function. Enter the file’s name in the parentheses of the function. Syntax: unlink("some_.file.csv")
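
The three file functions (file.create, file.copy, unlink) can be tried together safely inside R's temporary directory. The folder name file_demo is hypothetical.

```r
# Work inside a fresh, hypothetical folder under the temporary directory
folder <- file.path(tempdir(), "file_demo")
dir.create(folder)

original <- file.path(tempdir(), "new_text_file.txt")

created <- file.create(original)        # TRUE if the blank file was created
copied <- file.copy(original, folder)   # copy into the destination folder
list.files(folder)

# Clean up; recursive = TRUE is needed to delete a folder
unlink(original)
unlink(folder, recursive = TRUE)
```

Note that file.copy expects the destination folder to exist already, which is why dir.create comes first.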

Answer 261

To create a matrix in R, you can use the matrix() function. The matrix() function has two main arguments that you enter in the parentheses. First, add a vector. The vector contains the values you want to place in the matrix. Next, add at least one matrix dimension. You can choose to specify the number of rows or the number of columns by using the code nrow = or ncol =. For example, to create a 2x3 (two rows by three columns) matrix containing the values 3-8, enter a vector containing that series of numbers: c(3:8). Then, enter a comma. Finally, enter nrow = 2 to specify the number of rows. Run the code: matrix(c(3:8), nrow = 2) R displays a matrix with two rows and three columns (typically referred to as a “2x3”) that contains the numeric values 3, 4, 5, 6, 7, 8. By default, R fills the matrix column by column: the first value (3) goes in the uppermost row of the leftmost column, the next value (4) goes directly below it, and the sequence then continues at the top of the next column. Example 2: You can also choose to specify the number of columns (ncol = ) instead of the number of rows (nrow = ). Run the code: matrix(c(3:8), ncol = 2) R infers the number of rows automatically.
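
The order in which matrix() places the values is controlled by its byrow argument. A minimal sketch comparing the default with byrow = TRUE:

```r
# Default: values fill the matrix one column at a time
by_col <- matrix(3:8, nrow = 2)
by_col

# byrow = TRUE fills one row at a time instead
by_row <- matrix(3:8, nrow = 2, byrow = TRUE)
by_row

dim(by_col)  # number of rows, then number of columns
```

In by_col the value 4 sits below 3, while in by_row it sits to the right of 3.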

Answer 262

Used to assign values to variables and vectors

Answer 263

Used to complete math calculations + (addition) - (subtraction) * (multiplication) / (division)

Answer 264

and: & Example: You want to find observations (rows) in which conditions are both extremely sunny and windy. You define this as observations that have a Solar measurement of over 150 and a Wind measurement of over 10. This code specifies that R should return a value of TRUE for rows in which the airquality dataset’s Solar.R value is greater than 150 and its Wind value is greater than 10, and a value of FALSE otherwise. airquality[, "Solar.R"] > 150 & airquality[, "Wind"] > 10 or: | Example: you want to specify rows where it’s extremely sunny or it’s extremely windy, which you define as having a Solar measurement of over 150 or a Wind measurement of over 10. This code specifies that R should return a value of TRUE when either the airquality dataset’s Solar.R value is greater than 150 or its Wind value is greater than 10. Otherwise, R will return a value of FALSE. airquality[, "Solar.R"] > 150 | airquality[, "Wind"] > 10 not: ! (the related comparison operator != means "not equal to") Example: focus on the weather measurements for days that aren't the first day of the month. R should return a value of TRUE when the airquality dataset’s Day value is not 1 and a value of FALSE when the Day value is 1. airquality[, "Day"] != 1
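
Storing each condition in its own variable makes the three operators easier to compare. A sketch using the built-in airquality dataset; the variable names are chosen for illustration.

```r
# airquality ships with R; Solar.R contains some missing values
sunny <- airquality[, "Solar.R"] > 150
windy <- airquality[, "Wind"] > 10

both <- sunny & windy     # TRUE only when both conditions hold
either <- sunny | windy   # TRUE when at least one condition holds
not_sunny <- !sunny       # flips TRUE and FALSE

# Count the rows meeting each condition, ignoring missing values
sum(both, na.rm = TRUE)
sum(either, na.rm = TRUE)
```

Each result is a logical vector with one entry per row, so it can be used directly to subset the data frame.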

Answer 265

A declaration that if a certain condition holds, a certain event must take place.

Answer 266

The if statement sets a condition, and if the condition evaluates to TRUE, the R code associated with the if statement is executed. Syntax: In R, you place the code for the condition inside the parentheses of the if statement. The code to be executed if the condition is TRUE follows in curly braces (expr). Note that in this case, the second curly brace is placed on its own line of code and identifies the end of the code that you want to execute. if (condition) { expr } Example: if x is greater than 0, then R will print out the string "x is a positive number". x <- 4 if (x > 0) { print("x is a positive number") } Output: As x is equal to 4, the condition is TRUE (because 4 is greater than 0). Therefore, when you run the code, R prints out the string "x is a positive number". But, if you change x to a negative number, such as -4, then the condition will be FALSE because -4 is not greater than 0. If you run the code, R will not execute the print statement, so nothing is printed.

Answer 267

The else statement is used in combination with an if statement. Syntax: if (condition) { expr1 } else { expr2 } Example 1: x <- 7 if (x > 0) { print ("x is a positive number") } else { print ("x is either a negative number or zero") } Example 2: the if statement checks the Temp value for the first row in airquality. if (airquality$Temp[1] < 80) { print("It's not a hot day!") } else { print("It's a hot day.") }

Answer 268

Syntax: if (condition1) { expr1 } else if (condition2) { expr2 } else { expr3 } If the if condition (condition1) is met, then R executes the code in the first expression (expr1). If the if condition is not met and the else if condition (condition2) is met, then R executes the code in the second expression (expr2). If neither of the two conditions are met, R executes the code in the third expression (expr3).
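
A minimal sketch of the if, else if, else chain wrapped in a function. The name describe_sign is hypothetical.

```r
# Hypothetical helper that labels a number by its sign
describe_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

describe_sign(7)
describe_sign(-4)
describe_sign(0)
```

R checks the conditions in order and runs only the first branch whose condition is TRUE.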

Answer 269

AND (&), OR (|), NOT(!) Logical operators can be used to check a condition and return a logical data type. In R, logical data is presented as T or TRUE when a condition is met, and F or FALSE when it is not.

Answer 270

The str() and glimpse() functions will both return summaries of each column in your data arranged horizontally.

Answer 271

returns a list of column names from your dataset

Answer 272

Use this function to rename the columns or variables in your data. The new name goes on the left of the equals sign and the old name on the right. Example 1: Rename "carat" column to "carat_new" in the diamonds dataset. rename(diamonds, carat_new = carat) Example 2: Rename more than one variable. rename(diamonds, carat_new = carat, cut_new = cut)

Answer 273

Both "=" and "<-" are valid assignment operators in R, but they are used in different contexts. The "=" operator is typically used within function calls to give values to named arguments, while the "<-" operator is preferred for general assignments. == is exactly equal to. A logical comparator testing for equality. example: x <- 10 y <- 10 z <- 5 Check if x is equal to y x == y # Output: [1] TRUE The "=" operator also works for ordinary assignment, although it is most common for arguments within function calls. example: x = 10 print(x) [1] 10 The "<-" operator is the traditional assignment operator in R. It is specifically designed for assignment operations and is considered a best practice by many R programmers for regular variable assignments. example: y <- 20 print(y) [1] 20
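
The practical difference between the two operators shows up inside a function call. A minimal sketch:

```r
# At the top level, both operators create a variable
a <- 10
b = 10
a == b   # comparison, not assignment

# Inside a function call, = names an argument and creates no variable
mean(x = c(1, 2, 3))

# Inside a function call, <- still assigns, creating y as a side effect
mean(y <- c(1, 2, 3))
y
```

After the last call, y exists in the workspace, which is usually unintended; this is one reason "=" is kept for arguments and "<-" for assignments.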

Answer 274

To build a visual with 'ggplot2' you layer plot elements together with a '+' symbol. examples: Take the `diamonds` data, plots the carat column on the X-axis, the price column on the Y-axis, and represents the data as a scatter plot using the `geom_point()` command ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point()

Answer 275

Units of reproducible R code that include: reusable R functions, documentation about the functions, sample datasets, and tests for checking your code (to make sure that it does what you want it to do). Base R: the set of packages that is installed with R and available when you start a session.

Answer 276

Comprehensive R Archive Network An online archive with R packages, source code, manuals, and documentation. CRAN ensures that packages are authentic and valid, so if you source a package through CRAN, you can feel confident in its legitimacy.

Answer 277

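Conflicts happen when packages have functions with the same names as functions in other packages. The function from the most recently loaded package masks the others and becomes the default for the current R session. To call a masked function, use the package::function syntax (e.g., stats::filter).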

Answer 278

ggplot2: used for data visuals. tibble: works with data frames tidyr: used for data cleaning (i.e., makes tidy data). It works with wide and long data. readr: used for importing data. To read a dataset with readr, combine the function with a column specification. The column specification describes how each column should be converted to the most appropriate data type. But this isn't usually necessary as readr will figure it out for you automatically. purrr: works with functions and vectors dplyr: offers functions that help you complete common data manipulation tasks (e.g., the filter function finds cases where certain conditions are true). stringr: includes functions that make it easier to work with strings. forcats: provides tools that solve common problems with factors.

Answer 279

tidyverse_update(): Use this function to check for updates. (The packages in tidyverse change a lot.) You can then update your packages. A few options to do so: update.packages(): Use this function to update all of your packages. This will take some time. install.packages("package name"): use this function to quickly update one package. Best Practice: update packages regularly to make sure you have the latest version in your code.

Answer 280

Documentation that acts as a guide to an R package. A vignette shares details about the problem that the package is designed to solve and how the included functions can help you solve it. Use browseVignettes function to read through the vignettes of a loaded package. Example: Use browseVignettes on ggplot2 browseVignettes("ggplot2") Output: Vignettes in package ggplot2 -Aesthetic specifications - HTML source R code Extending ggplot2 - HTML source R code Using ggplot2 in packages - HTML source R code

Answer 281

installed.packages() function in R returns a matrix containing information about all packages installed in the specified libraries. This matrix includes details such as the package name, the library path where it's located, its version number, and other metadata like dependencies, imports, and suggestions.

Answer 282

In programming, "nested" describes code that performs a particular function and is contained within code that performs a broader function.

Answer 283

From the inside out.
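
A short sketch of the inside-out order, with the same steps written as a pipe for comparison. The input values are made up.

```r
values <- c(16, 4, 9)   # made-up input

# Nested calls: sqrt runs first, then sort, then rev
nested <- rev(sort(sqrt(values)))

# The same steps written with the native pipe read left to right
piped <- values |> sqrt() |> sort() |> rev()

nested
identical(nested, piped)
```

Both versions produce the same result; the pipe only changes the order in which the steps are read.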

Answer 284

ctrl+shift+m

Answer 285

-variables are organized into columns -observations are organized into rows -each value must have its own cell

Answer 286

Get the structure of the data frame. provides high-level info like the column names and the type of data contained in each column Example: Data Frame: Names Age Bob 48 Shirley 61 Liz 73 Dave 39 str(people) output: 'data.frame': 4 obs of 2 variables: $ names: chr "Bob" "Shirley" "Liz" "Dave" $ age : num 48 61 73 39
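
To run the example, the people data frame has to exist first. A minimal sketch that builds it from the values shown above, using lowercase column names to match the output:

```r
# Build the example data frame
people <- data.frame(
  names = c("Bob", "Shirley", "Liz", "Dave"),
  age = c(48, 61, 73, 39)
)

str(people)        # structure: dimensions, column names, column types
colnames(people)   # just the column names
```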

Answer 287

get the column names Example: Data Frame: Names Age Bob 48 Shirley 61 Liz 73 Dave 39 colnames(people) output: "names" "age"

Answer 288

Example: Data frame (people): names age Bob 48 Shirley 61 Liz 73 Dave 39 glimpse(people) Output: Rows: 4 Columns: 2 $ names <chr> "Bob", "Shirley", "Liz", "Dave" $ age <dbl> 48, 61, 73, 39

Answer 289

Use this function to make changes to the data frame, such as adding a new column. Part of the dplyr package, which is in the tidyverse. You need to load the tidyverse (or dplyr) to use it. Syntax: mutate(data_frame, new_column_name = expression)
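
A short sketch of mutate() in use, assuming dplyr is installed. The people data frame mirrors the one on the str() and colnames() cards, and the column name age_in_20 is just an illustrative choice.

```r
library(dplyr)

people <- data.frame(
  names = c("Bob", "Shirley", "Liz", "Dave"),
  age = c(48, 61, 73, 39)
)

# Add a new column computed from an existing one.
# mutate() returns a new data frame; `people` itself is not modified.
people_plus <- mutate(people, age_in_20 = age + 20)
people_plus
```

To keep the change, assign the result back to a name, as done here with people_plus.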

Answer 290

Use the readr functions to import data from csv, tsv, delimited, fixed-width, table, or log files. read_csv(): comma-separated values (.csv) files ex// read_csv(readr_example("mtcars.csv")) read_tsv(): tab-separated values files read_delim(): general delimited files read_fwf(): fixed-width files read_table(): tabular files where columns are separated by white-space read_log(): web log files

Answer 291

Part of the tidyverse but not a core tidyverse package, so you must load readxl explicitly with library(readxl). Use the read_excel() function to read a spreadsheet file. ex// read_excel(readxl_example("type-me.xlsx")) Use the excel_sheets() function to list the names of the individual sheets: excel_sheets(readxl_example("type-me.xlsx")) Sheets in this file: logical_coercion numeric_coercion date_coercion text_coercion You can also specify a sheet by name or number: ex// read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion") Output: R will return a tibble of the sheet.

Answer 292

rename(): changes column names. Syntax: rename(data, new_name = old_name) rename_with(): changes column names using a function. Example: rename_with(penguins, toupper) changes all of the column names to uppercase, and rename_with(penguins, tolower) changes all of the column names to lowercase, which is more common. clean_names(): part of the janitor package; ensures that there are only characters, numbers, and underscores in the names ex// clean_names(penguins) *the dataset is called "penguins"

Answer 293

Both functions return a summary of the data frame, including the number of columns and rows.

Answer 294

Do: -keep your filenames to a reasonable length -use underscores and hyphens for readability -start or end your filename with a letter or number -use a standard date format when applicable, e.g., YYYY-MM-DD -use filenames for related files that work well with default ordering (e.g., chronological order, logical order with numbers first, etc.) Don't: -use unnecessary additional characters in filenames -use spaces or illegal characters, e.g., &, %, #, <, > -start or end your filename with a symbol -use incomplete or inconsistent date formats, e.g., M-D-YY -use filenames for related files that do not work well with default ordering, e.g., a random system of numbers or date formats, using letters first

Answer 295

-Assignment: assign values to variables. x <- 2 -Arithmetic: perform basic math operations, such as addition, subtraction, multiplication, and division. + - * / %% (modulus; returns the remainder after division) %/% (integer division; returns the integer part of the quotient) ^ (exponent) -Relational/comparators: allow you to compare values. The output of a relational operator is TRUE or FALSE, which is a logical (boolean) data type. < > <= >= == != -Logical: allow you to combine logical statements and return a logical value like TRUE or FALSE. & Element-wise logical AND: all values must be TRUE for the entire operation to evaluate to TRUE. | Element-wise logical OR: at least one value must be TRUE for the entire operation to evaluate to TRUE. ! Logical NOT (e.g., !TRUE is FALSE and !FALSE is TRUE).
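
A quick base R sketch of the less familiar operators from this card. The values of x and y are arbitrary.

```r
x <- 7
y <- 2

remainder <- x %% y    # modulus: what is left over after dividing 7 by 2
quotient  <- x %/% y   # integer division: how many whole times 2 fits into 7
power     <- x ^ y     # exponent: 7 squared

# Relational operators return a logical value
is_greater <- x > y
is_equal   <- x == y

# Logical operators combine logical values
both   <- (x > 5) & (y > 5)   # AND: both sides must be TRUE
either <- (x > 5) | (y > 5)   # OR: at least one side must be TRUE
negate <- !(x > 5)            # NOT: flips TRUE to FALSE

# & and | work element by element on vectors
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
elementwise
```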

Answer 296

The separate() function (part of the tidyr package) turns a single character column into multiple columns. Example (assuming id, name, and job_title are existing vectors): employee <- data.frame(id, name, job_title) separate(employee, name, into=c('first_name', 'last_name'), sep=' ')

Answer 297

unite() function makes it possible to merge columns together. Syntax: unite(data, col, ..., sep = "_", remove = TRUE) data: The data frame col: The name of the new column as a string or symbol ...: A selection of columns. If empty, all variables are selected. You can select all variables between x and z with x:z or exclude y with '-y' sep: the separator to use between values (in the below example it's a space) Example: unite(employee, 'name', first_name, last_name, sep= ' ')
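
A small round-trip sketch showing separate() and unite() undoing each other, assuming tidyr is installed. The employee data below is hypothetical and only there to make the example runnable.

```r
library(tidyr)

# Hypothetical employee data for illustration
employee <- data.frame(
  id = 1:3,
  name = c("John Mendes", "Rob Stewart", "Rachel Abrahamson"),
  job_title = c("Professional", "Programmer", "Management"),
  stringsAsFactors = FALSE
)

# Split the name column into two columns at the space
split_names <- separate(employee, name,
                        into = c("first_name", "last_name"), sep = " ")

# Merge the two columns back into a single name column
rejoined <- unite(split_names, "name", first_name, last_name, sep = " ")

# The rejoined names match the originals
identical(rejoined$name, employee$name)
```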

Answer 298

pivot_longer(): Part of the tidyr package. Use this function to lengthen the data in a data frame by increasing the number of rows and decreasing the number of columns. pivot_wider(): Also part of the tidyr package. Use this function to widen the data so that it has more columns and fewer rows.
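
A minimal sketch of both directions, assuming tidyr is installed. The sales_wide data frame and its column names are invented for illustration.

```r
library(tidyr)

# Hypothetical wide data: one column per year
sales_wide <- data.frame(
  store = c("A", "B"),
  y2021 = c(10, 20),
  y2022 = c(15, 25)
)

# Longer: the two year columns become rows (2 rows x 3 cols -> 4 rows x 3 cols)
sales_long <- pivot_longer(sales_wide,
                           cols = c(y2021, y2022),
                           names_to = "year",
                           values_to = "sales")

# Wider: spread the year values back out into their own columns
sales_back <- pivot_wider(sales_long, names_from = year, values_from = sales)

dim(sales_long)
dim(sales_back)
```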

Answer 299

Four datasets that have nearly identical summary statistics
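
Assuming this card describes Anscombe's quartet, the claim can be checked with the anscombe data frame that ships with base R (columns x1 to x4 and y1 to y4). This is only a sketch; the variable names are illustrative.

```r
# Means and standard deviations of the four y variables
y_means <- sapply(anscombe[, paste0("y", 1:4)], mean)
y_sds   <- sapply(anscombe[, paste0("y", 1:4)], sd)

# Correlation between each x and its matching y
xy_cor <- mapply(cor,
                 anscombe[, paste0("x", 1:4)],
                 anscombe[, paste0("y", 1:4)])

round(y_means, 2)   # all approximately 7.5
round(y_sds, 2)
round(xy_cor, 2)    # all approximately 0.82
```

The summary numbers agree closely across the four sets, which is why plotting the data matters.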

Answer 300

bias computes the average amount by which actual is greater than predicted. If it returns a positive value, your model is systematically underestimating the true values. If it returns a negative value, your model is systematically overestimating the true values. Syntax: bias(actual, predicted) Example 1: Compare predicted temp with actuals. install.packages("SimDesign") library(SimDesign) actual_temp <- c(68.3, 70, 72.4, 71, 67, 70) predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69) bias(actual_temp, predicted_temp) [1] 0.7166667 Positive value so the model is underestimating the true values. That is, the prediction is biased toward lower temps. It's fairly close to zero, but it isn't as accurate as would be ideal. Example 2: Compare actual sales with stock (i.e., predicted sales): #No need to install a package and draw on library as SimDesign is already set up. actual_sales <- c(150, 203, 137, 247, 116, 287) predicted_sales <- c(200, 300, 150, 250, 150, 300) bias(actual_sales, predicted_sales) [1] -35 Negative value so the model is overestimating the true values. That is, they're ordering too much stock for release days.
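
As a cross-check on the two examples above, the same numbers can be reproduced in base R without SimDesign. The helper mean_bias() is a hypothetical stand-in written for this sketch, not part of any package.

```r
# Hypothetical helper: average amount by which actual exceeds predicted
mean_bias <- function(actual, predicted) {
  mean(actual - predicted)
}

actual_temp    <- c(68.3, 70, 72.4, 71, 67, 70)
predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69)
temp_bias <- mean_bias(actual_temp, predicted_temp)
temp_bias    # about 0.7167: positive, so the predictions run low

actual_sales    <- c(150, 203, 137, 247, 116, 287)
predicted_sales <- c(200, 300, 150, 250, 150, 300)
sales_bias <- mean_bias(actual_sales, predicted_sales)
sales_bias   # -35: negative, so the predictions run high
```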

Answer 301

In R, the sample() function allows you to take a random sample of elements from a data set. Use Case: We decided to add randomization to the position of the ads using R. We wanted to make sure that the ads with similar frequencies were near each other and to eliminate as much bias as possible. We used sample() to inject a randomization element into our R programming. Adding this piece of code randomly shuffled the rows in our data. We presented the ads to users again, and this time, the position of the ads was random and controlled for bias. Less bias meant that the survey was more effective because the data was more reliable.
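
A small sketch of shuffling rows with sample(), using base R only. The ads data frame is hypothetical, and set.seed() is there only so the shuffle can be repeated.

```r
set.seed(42)  # makes the random shuffle reproducible

# Hypothetical ad data for illustration
ads <- data.frame(
  ad_id = 1:6,
  frequency = c(5, 5, 3, 3, 1, 1)
)

# sample(n) returns the numbers 1..n in random order,
# so indexing with it shuffles the rows
shuffled <- ads[sample(nrow(ads)), ]
shuffled

# Draw 3 ads at random, without replacement (the default)
picked <- sample(ads$ad_id, size = 3)
picked
```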

Answer 302

SMOTE (Synthetic Minority Oversampling Technique) Unbalanced classification problems cause problems for many learning algorithms. These problems are characterized by an uneven proportion of cases available for each class of the problem. SMOTE (Chawla et al. 2002) is a well-known algorithm to fight this problem. The general idea of this method is to artificially generate new examples of the minority class using the nearest neighbors of these cases. Furthermore, the majority class examples are also under-sampled, leading to a more balanced dataset. more details: https://search.r-project.org/CRAN/refmans/performanceEstimation/html/smote.html Syntax: smote(form, data, perc.over = 2, k = 5, perc.under = 2) Arguments form: A formula describing the prediction problem data: A data frame containing the original (unbalanced) data set perc.over: A number that drives the decision of how many extra cases from the minority class are generated (known as over-sampling). k: A number indicating the number of nearest neighbours that are used to generate the new examples of the minority class. perc.under: A number that drives the decision of how many extra cases from the majority classes are selected for each case generated from the minority class (known as under-sampling) Use Case: In another instance of the data analysis process focusing on furniture sales, a significant issue arose when the dataset contained biased information related to the geographic representation of sales data. Certain regions were overrepresented, leading to skewed conclusions about popular furniture items. To address this bias, the furniture team employed statistical techniques to rebalance the dataset, oversampling underrepresented regions and undersampling the overrepresented ones with R programming. *The team employed SMOTE (Synthetic Minority Oversampling Technique) for oversampling underrepresented regions and the NearMiss algorithm for undersampling overrepresented regions.
The SMOTE function uses bootstrapping and k-nearest neighbors to generate additional observations of the minority class through oversampling.

Answer 303

Undersamples the majority class using the NearMiss algorithm (the package documentation describes it as generating synthetic positive instances). Syntax: nearmiss(df, var, k = 5, under_ratio = 1) df: data.frame or tibble. Must have 1 factor variable and remaining numeric variables. var: Character, name of the variable containing the factor variable. k: An integer. Number of nearest neighbors used by the algorithm.

Answer 304

FWF (fixed-width file): A text file with a specific format, which enables the saving of textual data in an organized fashion

Answer 305

A computer-generated file that records events from operating systems and other software programs

Answer 306

ggplot2: Most popular data visualization package in R. Can make scatterplots, bar charts, line diagrams, etc., and can add titles, etc. *If you need a data visualization function, ggplot2 probably has one. (There is a cheat sheet.) Plotly: General purpose package that lets you do a wide range of visualization functions. RGL: Package that focuses on 3D visuals. Other visual packages: Lattice Dygraphs Leaflet Highcharter Patchwork gganimate ggridges

Answer 307

In ggplot2, -an aesthetic is a visual property of an object in your plot. Think of it as a connection or mapping between a visual feature in your plot and a variable in your data (e.g., in a scatterplot, aesthetics include things like the size, shape, color, or location (i.e., x- or y-axis) of your data points). -a geom is a geometric object used to represent your data (e.g., you can use points to create a scatterplot, bars to create a bar chart, lines to create a line diagram, etc.) -a facet lets you display smaller groups, or subsets, of your data -the label and annotate functions let you customize your plot (e.g., you can add titles, subtitles, and captions to communicate the purpose of your plot).

Answer 308

Syntax: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<AESTHETIC MAPPINGS>))

Answer 309

To learn more about any R function, run the code ?function_name

Answer 310

Example: Existing scatterplot: ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g)) Edited scatterplot with color by species & a legend: ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species)) Edited scatterplot with shape by species: ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, shape=species)) Edited scatterplot with a different shape, color, and size for each species: ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, shape=species, color=species, size=species)) #The alpha aesthetic controls the transparency of the points. A good option when you have a dense plot with lots of data points. ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, shape=species, color=species, alpha=species)) #Just set all points to purple. We aren't mapping color to a specific variable like species, so this code needs to be outside the aes function. ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g), color="purple")

Answer 311

color: you can change the color of all points on your plot or the color of each data group size: you can change the size of the points on your plot by data group shape: you can change the shape of the points on your plot by data group

Answer 312

Example of smooth: ggplot(data=penguins)+geom_smooth(mapping=aes(x=flipper_length_mm, y=body_mass_g)) Example of smooth and points: ggplot(data=penguins)+geom_smooth(mapping=aes(x=flipper_length_mm, y=body_mass_g))+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g)) Example of plotting a different line type for each species (linetype belongs in geom_smooth, since points have no line type): ggplot(data=penguins)+geom_smooth(mapping=aes(x=flipper_length_mm, y=body_mass_g, linetype=species))+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))

Answer 313

Creates a scatterplot and then adds a small amount of random noise to each point in the plot. Jittering helps deal with overplotting (i.e., when data points in a plot overlap with one another). Jittering makes the points easier to find. By using the jitter function, we can get a better picture of the true underlying relationship between two variables in a dataset. However, we should be careful not to add too much jitter, as this can distort the original data too much. Example: ggplot(data=penguins)+ geom_jitter(mapping=aes(x=flipper_length_mm, y=body_mass_g))

Answer 314

Bar charts Example: *Note: these examples draw on the diamonds dataset, which includes data on the cut, color, and clarity of each diamond. ggplot(data=diamonds)+geom_bar(mapping=aes(x=cut)) #to add color outlines to the bar chart: ggplot(data=diamonds)+geom_bar(mapping=aes(x=cut, color=cut)) #to add color fill to the bar chart (i.e., to fully fill the columns): ggplot(data=diamonds)+geom_bar(mapping=aes(x=cut, fill=cut)) Note: If you don't specify a variable for the y-axis, geom_bar() defaults to 'count'. Example: library(ggplot2) library(tidyverse) #create the bar chart ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = hotel, fill = market_segment)) #filter the data to just include city hotels that are Online TA onlineta_city_hotels <- filter(hotel_bookings, hotel == "City Hotel" & market_segment == "Online TA") View(onlineta_city_hotels)

Answer 315

Loess smoothing: The loess smoothing process is best for smoothing plots with fewer than 1,000 points. ggplot(data, aes(x=, y=))+ geom_point() + geom_smooth(method="loess") Gam smoothing: Gam smoothing, or generalized additive model smoothing, is useful for smoothing plots with a large number of points. ggplot(data, aes(x=, y=)) + geom_point() + geom_smooth(method="gam", formula = y ~ s(x))

Answer 316

Use to facet your plot by a single variable. Example: facet_wrap() lets us create a separate plot for each species: ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+facet_wrap(~species)

Answer 317

Tilde operator is used to define the relationship between dependent variable and independent variables in a statistical model formula. The variable on the left-hand side of tilde operator is the dependent variable and the variable(s) on the right-hand side of tilde operator is/are called the independent variable(s). So, tilde operator helps to define that dependent variable depends on the independent variable(s) that are on the right-hand side of tilde operator.
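
A brief sketch of the tilde in a model formula, using lm() and the mtcars dataset that ships with base R. The choice of mpg, wt, and hp is only for illustration.

```r
# mpg (left of ~) is the dependent variable; wt (right of ~) is the independent variable
model <- lm(mpg ~ wt, data = mtcars)
coef(model)

# Several independent variables go on the right-hand side, joined with +
model2 <- lm(mpg ~ wt + hp, data = mtcars)
names(coef(model2))
```

The same one-sided form of the tilde appears in facet_wrap(~species), where there is no left-hand side.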

Answer 318

Use to facet your plot with two variables. Note: Unlike the facet_wrap() function, the facet_grid() function will include plots even if they're empty. Example: #2 variables: sex and species ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+facet_grid(sex~species)

Answer 319

Example 1: ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = distribution_channel)) + facet_wrap(~deposit_type) + theme(axis.text.x = element_text(angle = 45)) Example 2: Same as above but with a different chart for each market segment: ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = distribution_channel)) + facet_wrap(~market_segment) + theme(axis.text.x = element_text(angle = 45))

Answer 320

Chart Title: To add a title to a chart, use a label function: title = "Average product rating" Subtitle: use subtitle="Sample of Three Penguin Species" Ex// ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species") To create bars on a chart: geom_bar() To highlight underperforming products, use an aesthetics function: col = ifelse(x<2, 'blue', 'yellow') To create a scatterplot chart: geom_point() To create a trendline: geom_smooth() To compare data trends across average ratings, use a facets function (column names that contain spaces need backticks): facet_wrap(~`Average Rating`) To label the axes, use an aesthetics function: aes(x=`Average price (USD)`, y = Product) To add a caption: caption="enter caption here" Ex// ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species", caption="Data collected by Dr. Kristen Gorman") To remove the axis label: Setting labs(x = "") omits the label but still allocates space; setting labs(x = NULL) removes the label and its space.

Answer 321

To add notes to a document or diagram to explain or comment upon it. The annotate function will allow you to put text inside the grid to call out specific data points. Ex: Add info about the Gentoos to the chart in large, bold, and purple text that is tilted at a 25 degree angle. annotate("text", x=220, y=3500, label="The Gentoos are the largest") library('ggplot2') library('palmerpenguins') ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+ labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species", caption="Data collected by Dr. Kristen Gorman")+ annotate("text", x=220, y=3500, label="The Gentoos are the largest", color="purple", fontface="bold", size=4.5, angle=25) OR if you want a shorter string of code, you could assign the first portion to a variable and then tack on the annotation: library('ggplot2') library('palmerpenguins') p <- ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+ labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species", caption="Data collected by Dr. Kristen Gorman") p+annotate("text", x=220, y=3500, label="The Gentoos are the largest") There are 3 fonts that are guaranteed to work everywhere: sans (the default) serif mono There are 3 values for fontface: plain (the default) bold italic Alignment of the text: hjust(left, center, right, inward, outward) vjust(bottom, middle, top, inward, outward) check_overlap: If check_overlap = TRUE, overlapping labels will be automatically removed from the plot. The algorithm is simple: labels are plotted in the order they appear in the data frame; if a label would overlap with an existing point, it’s omitted. Notes: -more on syntax: https://ggplot2.tidyverse.org/reference/annotate.html

Answer 322

Using ggplot2, 2 main functions are available for text annotation: geom_text() to add a simple piece of text, and geom_label() to add a label (framed text). Ex// #library library(ggplot2) #basic graph p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() #a data frame with all the annotation info annotation <- data.frame(x = c(2,4.5), y = c(20,25), label = c("label 1", "label 2")) #add text p + geom_text(data=annotation, aes(x=x, y=y, label=label), color="orange", size=7, angle=45, fontface="bold") #Note: possible to shorten with annotate: p + annotate("text", x = c(2,4.5), y = c(20,25), label = c("label 1", "label 2"), color="orange", size=7, angle=45, fontface="bold") #using labels instead of text p + geom_label(data=annotation, aes(x=x, y=y, label=label), color="orange", size=7, angle=45, fontface="bold")

Answer 323

The annotate() function lets you add all kinds of shapes to a ggplot2 chart. The first argument controls which kind is used: "rect" for a rectangle, or "segment" for a segment or an arrow. #Add rectangles p + annotate("rect", xmin=c(2,4), xmax=c(3,5), ymin=c(20,10) , ymax=c(30,20), alpha=0.2, color="blue", fill="blue") #Add segments p + annotate("segment", x = 1, xend = 3, y = 25, yend = 15, colour = "purple", size=3, alpha=0.6) #Add arrow p + annotate("segment", x = 2, xend = 4, y = 15, yend = 25, colour = "pink", size=3, alpha=0.6, arrow=arrow())

Answer 324

geom_text() and geom_label() to add text, as illustrated earlier. geom_rect() to highlight interesting rectangular regions of the plot. geom_rect() has aesthetics xmin, xmax, ymin and ymax. geom_line(), geom_path() and geom_segment() to add lines. All these geoms have an arrow parameter, which allows you to place an arrowhead on the line. Create arrowheads with arrow(), which has arguments angle, length, ends and type. geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes called rules), that span the full range of the plot.

Answer 325

Useful function for saving a plot. It defaults to saving the last plot that you displayed and uses the size of the current graphics device. Example: library(ggplot2) library(palmerpenguins) ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species)) ggsave("Three Penguin Species.png") Output: Saves the graph as a png file named "Three Penguin Species.png". Visible in the Files>Cloud>Project directory (folder) on Posit Cloud (web service that delivers an IDE similar to RStudio).

Answer 326

Save ggplot into a PDF file: #Create some plots library(ggplot2) myplot1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() myplot2 <- ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() #Print plots to a pdf file (the dev.off function closes the graphics device) pdf("ggplot.pdf") print(myplot1) # Plot 1 --> in the first page of the PDF print(myplot2) # Plot 2 --> in the second page of the PDF dev.off() #Print into a png file: png("myplot.png") print(myplot1) dev.off() The same open-plot-close pattern works for base R plots (Temperature is assumed to be an existing numeric vector): #Save as jpeg image jpeg(file="saving_plot1.jpeg") hist(Temperature, col="darkgreen") dev.off() #Save as png image png(file="C:/Datamentor/R-tutorial/saving_plot2.png", width=600, height=350) hist(Temperature, col="gold") dev.off() #Save as bmp image bmp(file="saving_plot3.bmp", width=6, height=4, units="in", res=100) hist(Temperature, col="steelblue") dev.off() #Save as pdf file pdf(file="saving_plot4.pdf") hist(Temperature, col="violet") dev.off() #Save as postscript file postscript(file="saving_plot4.ps") hist(Temperature, col="violet") dev.off()
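
A self-contained sketch of the same open-plot-close pattern with base R only. It assumes Temperature can be taken from the built-in airquality dataset (the card does not say where the vector comes from), and it writes to a temporary folder rather than a fixed path.

```r
# Assumption: use the Temp column of the built-in airquality data as Temperature
Temperature <- airquality$Temp

# Write to a temporary location so the example does not clutter the project
out_file <- file.path(tempdir(), "saving_plot.pdf")

pdf(file = out_file)                # 1. open the graphics device
hist(Temperature, col = "violet")   # 2. draw the plot into it
invisible(dev.off())                # 3. close the device, which writes the file

file.exists(out_file)
```

Nothing is written to disk until dev.off() runs, so forgetting it is a common reason for empty or missing files.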

Answer 327

Plots can be saved as bitmap (raster) images, which have a fixed size, or as vector images, which are easily resizable. Raster: the type of image produced when scanning or photographing an object. Raster images are compiled using pixels containing unique color and tonal info that come together to create an image. Since raster images are pixel-based, they are resolution dependent. Most of the images we come across, like jpeg or png, are bitmap images. They have a fixed resolution and become pixelated when zoomed in far enough. Functions that help us save plots in this format are jpeg(), png(), bmp() and tiff().

Answer 328

Example: hotel_bookings <- read.csv("hotel_bookings.csv") library(ggplot2) library(tidyverse) mindate <- min(hotel_bookings$arrival_date_year) maxdate <- max(hotel_bookings$arrival_date_year) ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = market_segment)) + facet_wrap(~hotel) + theme(axis.text.x = element_text(angle = 45)) + labs(title="Comparison of market segments by hotel type for hotel bookings", caption=paste0("Data from: ", mindate, " to ", maxdate), x="Market Segment", y="Number of Bookings")

Answer 329

A visual property of an object in a plot

Answer 330

A series of functions that splits data into subsets in a matrix of panels

Answer 331

A process for smoothing plots with a large number of points

Answer 332

The geometric object used to represent data

Answer 333

A group of R functions used for customizing a plot

Answer 334

A process used for smoothing plots with fewer than 1,000 points

Answer 335

The process of matching up a specific variable in a dataset with a specific aesthetic

Answer 336

A process used to make data visualizations in R clearer and more readable

Answer 337

A line on a data visualization that uses smoothing to represent a trend

Answer 338

A file format for making dynamic documents with R. You can use an R Markdown file as a code notebook to save, organize, and document your analysis using code chunks, comments, and other features. It allows you to save and execute code & generate shareable reports for stakeholders. You can use R Markdown in notebook mode for analyst-to-analyst communication, and in report mode for analyst-to-decision-maker communication.

Answer 339

A syntax for formatting plain text files

Answer 340

lets users run your code and show the graphs that visualize that code.

Answer 341

-HTML, PDF, and Word docs -Slide Presentation -Dashboard

Answer 342

The set of markup symbols and code used to create a webpage

Answer 343

Jupyter Kaggle Google Colab AKA Colab

Answer 344

The Jupyter Notebook is an open source web application that you can use to create and share documents that contain live code, equations, visualizations, and text. These documents contain computer code and rich text elements (e.g., comments, links, or descriptions of your analysis and results). They can be useful for everything from data cleaning and transformation to statistical modeling and visualizations. They're compatible with R, so they are an alternative to R Markdown. Privacy: Because you use Jupyter in a web browser, some people are understandably concerned about using it with sensitive data. However, if you followed the standard install instructions, Jupyter is actually running on your own computer. Notes: -If you're working in Kaggle, there are two types of notebooks available: Jupyter notebooks and scripts (including R Markdown scripts). -Jupyter notebooks can be used in Google Colab

Answer 345

A notebook is a shareable document that combines computer code, plain language descriptions, data, rich visualizations like 3D models, charts, graphs and figures, and interactive controls. A notebook, along with an editor (like JupyterLab), provides a fast interactive environment for prototyping and explaining code, exploring and visualizing data, and sharing ideas with others.

Answer 346

Notebook documents contain the inputs and outputs of an interactive session as well as additional text that accompanies the code but is not meant for execution. In this way, notebook files can serve as a complete computational record of a session, interleaving executable code with explanatory text, mathematics, and rich representations of resulting objects. These documents are internally JSON files and are saved with the .ipynb extension. Since JSON is a plain text format, they can be version-controlled and shared with colleagues. *JSON: a format used to store and export data

Answer 347

a format used to store and export data

Answer 348

When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document.

Answer 349

R markdown file headings in the report are created when you include one or more hashtags (#) before the heading text, such as ## Including Plots. The more hashtags used, the smaller the heading font. # Including Plots creates a Header 1 style heading whereas ## Including Plots creates a Header 2 style heading.

Answer 350

All code chunks begin and end with delimiters. To start a code chunk, you can type three tick marks followed by a lowercase “r” in curly brackets: ```{r} To end it, type just the three tick marks: ``` Hotkeys to add code: ctrl+alt+I (PC)

Answer 351

To start a new paragraph, end a line with two spaces To apply italics to a word or phrase, place an asterisk at the beginning and at the end of the word or phrase, for example, *italics works* To apply bold to a word or phrase, place two asterisks at the beginning and at the end of the word or phrase, for example, **bold is useful** To create a header, type a hashtag (#) followed by a space and your text for example: # Getting Started with R Markdown

Answer 352

Headers will appear in blue A single hashtag is the largest header The more hashtags you add (up to six), the smaller the header

Answer 353

Yet Another Markup Language (the name now officially stands for "YAML Ain't Markup Language"). A language for data that translates it so it's readable.

Answer 354

[click here](URL)

Answer 355

![caption](image URL)

Answer 356

Code added to an .Rmd file (R Markdown file) is standardly called a code chunk

Answer 357

A character that indicates the beginning or end of a data item.

Answer 358

```{r} and ``` PC hotkeys: ctrl+alt+I

Answer 359

When working in RStudio, you can set the output of a document in R Markdown by changing the YAML header. For example, the following code creates an HTML document: --- title: "Demo" output: html_document --- And the following code creates a PDF document: --- title: "Demo" output: pdf_document --- The Knit button in the RStudio source editor renders a file to the first format listed in its output field (HTML is the default). You can render a file to additional formats by clicking the dropdown menu next to the Knit button.

Answer 360

pdf_document – This creates a PDF file with LaTeX (an open source document layout system). If you don’t already have LaTeX, RStudio will automatically prompt you to install it. word_document – This creates a Microsoft Word document (.docx). odt_document – This creates an OpenDocument Text document (.odt). rtf_document – This creates a Rich Text Format document (.rtf). md_document – This creates a Markdown document (which strictly conforms to the original Markdown specification) github_document – This creates a GitHub document which is a customized version of a Markdown document designed for sharing on GitHub.

Answer 361

beamer_presentation – for PDF presentations with beamer ioslides_presentation – for HTML presentations with ioslides slidy_presentation – for HTML presentations with Slidy powerpoint_presentation – for PowerPoint presentations revealjs::revealjs_presentation – for HTML presentations with reveal.js (a framework for creating HTML presentations; requires the revealjs package) Learn more: https://rmarkdown.rstudio.com/lesson-11.html

Answer 362

The flexdashboard package lets you publish a group of related data visualizations as a dashboard. Flexdashboard also provides tools for creating sidebars, tabsets, value boxes, and gauges.

Answer 363

Shiny is an R package that lets you build interactive web apps using R code. You can embed your apps in R Markdown documents or host them on a webpage. To call Shiny code from an R Markdown document, add runtime: shiny to the YAML header: --- title: "Shiny Web App" output: html_document runtime: shiny ---

Answer 364

The bookdown package is helpful for writing books and long-form articles. The prettydoc package provides a range of attractive themes for R Markdown documents. The rticles package provides templates for various journals and publishers.

Answer 365

A delimiter is a character that marks the beginning and end of a data item. It can mark a single line of code, or a whole section of code in an .rmd file.

R Flashcards

(402 cards)