R Flashcards

(402 cards)

1
Q

IDE

A

Integrated development environment (e.g., RStudio)
A software application that brings together all the tools you may want to use in a single place.

2
Q

R

A

R is a programming language used for statistical analysis, visualization, and other data analysis.

R is a programming language that can be used to perform tasks in every phase of the data analysis process. R can help you structure, organize, and clean your data and create detailed visuals and make dynamic documents.

R is typically used by professionals who have a statistical or research-oriented approach to solving problems; among them are scientists, statisticians, and engineers.

A few advantages of R include its:
-Popularity: R is frequently used for data analysis
-Tools: R has a convenient library of ready-to-use tools for data cleaning and analysis
-Focus: R was created with statistics in mind; data analysts can conveniently use a rich library of statistical routines
-Adaptability: R adapts well for use in both machine learning and data analysis projects
-Availability: R is an open source programming language

3
Q

What are the following languages used for?

-R
-Python
-HTML5
-CSS
-Swift
-Java
-C#
-Ruby
-PHP
-C++

A

-R: Offers statistical features for data analysis and is useful for creating advanced data visuals.

-Python: General purpose language that can be used to create what you need for data analysis.

-HTML5: Used by web designers to create structure for web pages and is used to connect to hosting platforms.

-CSS: Used for web page design and to control graphic elements (e.g., color, layout, and font) and page presentation on multiple devices (e.g., large screens, mobile screens, and printers).

-Swift: Used by mobile application developers to make apps run faster.

-Java: Official language for Android development (e.g., Android apps). Also used by web application developers to create enterprise web applications that can run on multiple clients.

-C# (C sharp): Object-oriented language used to create mobile apps in the .NET open source developer platform. It is also used by game developers to create games.

-Ruby: General-purpose, object-oriented programming language used for web application development.

-PHP: Scripting language suited for web application development.

-C++: Extension of the C programming language that is used to create console games like those for Xbox.

4
Q

Statistical Analysis

A

The science of collecting, exploring, and presenting large amounts of data to discover underlying patterns and trends.

5
Q

R Console/Console Pane

A

Program window in R where you make use of the R programming language. It is an interface that lets you view, write, edit, and execute your R code.

The R console is a simple environment in which you can write single lines of R code. It won’t save your code beyond a single session, but it is valuable for running simple functions.

(RStudio is an IDE (integrated development environment) that builds on the simplicity of the R console.)

6
Q

RGui

A

R’s own graphical user interface: the basic GUI that ships with R on Windows. It is separate from RStudio.

7
Q

How to Save in RStudio

A

If you want to save the code you execute, it is better to save it in a text file (such as an .R script) or an R Markdown (.Rmd) file.

Note: Keep in mind that everything you write in the R Console disappears after you end your session (or close the console).

8
Q

How to see instructions for how to cite R in a publication

A

Type citation() after the prompt and press Enter (Windows) or Return (Mac)

9
Q

Packages

A

Packages are units of reproducible R code.

Members of the R community create packages to keep track of the R functions that they write and reuse.

Packages offer a helpful combination of code, reusable R functions, descriptive documentation, tests for checking your code, and sample data sets.

10
Q

tidyverse

A

A collection of packages in R with a common design philosophy for data manipulation, exploration, and visualization.

For a lot of data analysts, the tidyverse is an essential tool.

11
Q

Hot key/Key Bindings Cheat Sheet

A

Help>Keyboard Shortcuts Help

11
Q

Open-source

A

Code that is freely available and may be modified and shared by the people who use it

12
Q

To start a new file

A

File>New File>R Script

OR Ctrl+Shift+N

13
Q

Base R

A

What you get after you install R. (The extra functionality comes from add-ons available from developers.)

14
Q

How to install a package

How to install multiple packages at once

A

To install a package, you use the code install.packages("package_name", dependencies = TRUE)
OR
install.packages("package_name")

*Note: You can add the option dependencies = TRUE, which tells R to install the other things that are necessary for the package or packages to run smoothly. Otherwise, you may need to install additional packages to unlock the full functionality of a package.

*Make sure to enter this code in the console in the lower left-hand pane.

Or

Tools (on the top bar)>Install Packages>enter the package name (it will auto-complete the name if you don’t know the precise spelling)

To install multiple packages at once:

install.packages(c("name1", "name2"))
install.packages(c("tidyverse", "dslabs"))

15
Q

How to load a package

A

To load a package, you use the code library(package_name).

Make sure to enter this code in the console in the lower left-hand pane.

16
Q

How to use a dataset from a package you’ve loaded

&

How to see that dataset

A

If you also want to use a dataset from a package you have loaded, then you use the code data(dataset_name).

To see the dataset, you can take the additional step of View(dataset_name).

17
Q

What command do you use to see all of the packages you’ve installed?

A

installed.packages()

18
Q

After a package has been installed, what command do we use to load the package every time we want to use it?

A

library(pkg_name)

19
Q

If you try to load a package with library(blahblah) and get a message like Error in library(blahblah) : there is no package called ‘blahblah’, it means_________

A

you need to install that package first with install.packages().

20
Q

How to tell if a dataset is currently loaded in your R environment

A

ls()

This function lists all objects currently in your global environment.

21
Q

Hot Keys to Save a Script

A

Ctrl+S on Windows and Command+S on Mac

22
Q

Hot keys to run an entire script

A

Ctrl+Shift+Enter on Windows and Command+Shift+Return on Mac

23
Q

Hot keys to run a single line of script

A

Ctrl+Enter on Windows

24
Hot keys to open a new script
Ctrl+Shift+N on Windows
25
Objects
Stuff that is stored in R/Things that are stored in containers in R. Objects can be variables or more complicated entities like functions (and datasets).
26
To assign a value to an object:
Use <- OR = to assign a value to a variable.
Examples:
a <- 1
b <- 1
c <- 1
OR
a = 1
b = 1
c = 1
27
To see the value an object has, you
enter the object's name (e.g., a) and then hit enter OR enter print(object_name) and then hit enter. Examples: a OR print(a)
28
How to see the names of the objects saved in your workspace
ls() Note: A dataset is also considered an object, so it will also tell you if a dataset is currently loaded in your R environment Also, IDEs standardly have a tab that shows you all the variable names (e.g., in RStudio, you can look at the Environment>Global Environment pane)
29
What does it mean when you get an error message like the following? Error: Object 'x' not found
You haven't defined x yet
30
Example of a function without an argument
ls() is an example of a function that can be called without an argument: with empty parentheses, it shows you all objects saved in your workspace. If you type ls without the parentheses, R does not run the function; it shows you the code for ls instead. Note: In general, to evaluate a function, we need to use parentheses. If we type a function without parentheses, R shows us the code for the function. Most functions also require an argument, i.e., something to be written inside the parentheses.
31
Variable Naming Rules
Variable names have to start with a letter and can't contain spaces. Best practice: stick to lowercase and use underscores instead of spaces. Be careful not to use any names already in use (e.g., don't use install.packages as it's already a function).
32
Comments
Any line that starts with # will not be evaluated
33
How do you access help files?
help(function_name) OR ?function_name Example: help(log) OR ?log (R is case-sensitive, so ?Log will not find the help file)
34
sum and seq do what?
seq creates a sequence of numbers; sum adds them up.
Example:
n <- 10
x <- seq(1, n)
sum(x)
So n is 10, and x is the sequence of numbers from 1 through 10: 1 2 3 4 5 6 7 8 9 10 (think of the comma like "through" or an en-dash).
The output will be [1] 55 because 1+2+3+4+5+6+7+8+9+10=55
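As a quick check of the example above, the sum of 1 through n can also be computed with the closed-form expression n(n + 1)/2. This is only a sketch; the variable names `total` and `closed_form` are illustrative and not part of the original card.

```r
# Sum the integers 1 through n two ways and compare the results.
n <- 10
x <- seq(1, n)                    # the sequence 1, 2, ..., 10
total <- sum(x)                   # adds up every element of x
closed_form <- n * (n + 1) / 2    # arithmetic-series formula for 1 + 2 + ... + n

print(total)
print(total == closed_form)
```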
35
Logarithm
log2(16) Logarithm base 2 of 16 AKA how many factors of 2 (the base) have to be multiplied together to equal 16? 2*2=4 4*2=8 8*2=16 so the answer is 4 (because 2*2*2*2 = 2^4 = 16)
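A small sketch of the logarithm card in code. The variable names are illustrative; `log2()`, `log10()`, and the `base` argument of `log()` are standard base R.

```r
# A logarithm undoes exponentiation: log2(16) asks "2 to what power gives 16?"
a <- log2(16)              # base-2 logarithm
b <- log(16, base = 2)     # same question, using the general log() function
back <- 2^a                # raising 2 to that power recovers 16
c10 <- log10(1000)         # base-10 logarithm: 10 to what power gives 1000?

print(a)
print(back)
print(c10)
```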
36
square root
The square root of a number is a value which, when multiplied by itself, gives the original number. It is represented by the symbol '√'. For example, the square root of 25 is √25 = 5, because 5*5=25. In R, sqrt(4) returns 2 (because 2*2=4).
37
What line of code will always return the value stored in x if x is a numeric variable?
log(exp(x))
38
exponent
number or variable written in the upper right of a base number that indicates how many times that base number should be multiplied by itself. 2^2 2*2=4
39
Class
The type of object.
For example:
a <- 2
class(a)
output: [1] "numeric"
Or
class(ls)
output: [1] "function"
40
Why does an output in R start with [1]?
In R, the [1] (or [2], [3], etc.) you see at the start of output is an index label that tells you the position of the first element being printed on that line. Here's why: R prints vectors, lists, and other objects in a linear stream of values. To help you keep track of where you are in the sequence, R prints an index in square brackets at the beginning of each line. [1] means "the first element printed here is element number 1 of the vector." If the output wraps to the next line, you might see [11], meaning "the first element on this line is the 11th element of the vector."
Example:
x <- 1:15
x
Output:
[1] 1 2 3 4 5 6 7 8 9 10
[11] 11 12 13 14 15
On the first line, [1] tells you it's showing elements starting from the 1st. On the second line, [11] tells you it's showing elements starting from the 11th. It's not part of your data, just a printing guide so you know where in the vector you are.
41
Numeric Vectors
A single number is technically a vector, but in general, they have several entries.
42
Data Frames
Think of data frames as tables. Each column contains one variable, and each row holds a set of values, one for each column. You can have different data types in a data frame. Each column should have the same number of items, even if some are missing.
We store data in a data frame, e.g., data("murders"). Then if you do class(murders), the output will make it clear that it is a "data.frame".
Example:
library(dslabs)
data(murders)
# Save temperatures in an object called `temp`
temp <- c(35, 88, 42, 84, 81, 30)
# Store city names in a `city` object
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
# Generate a data frame with city names and temperatures
city_temps <- data.frame(name = city, temperature = temp)
# Define a `states` variable to contain the names of the states from the `murders` data set
states <- murders$state
# Define a `ranks` variable to determine the rank of each state by population size
ranks <- rank(murders$population)
# Generate a `my_df` data frame with the names of the states and their ranks
my_df <- data.frame(states = states, ranks = ranks)
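The city and temperature part of the example can be run without the dslabs package. The sketch below reuses the card's own values and adds a few inspection calls (`class()`, `nrow()`, `names()`, `str()`) that are standard base R.

```r
# Build a small data frame from two vectors of equal length.
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro", "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)

class(city_temps)          # confirms the object is a data frame
nrow(city_temps)           # number of rows (observations)
names(city_temps)          # column names
city_temps$temperature     # one column, extracted with the $ accessor
str(city_temps)            # compact summary of the structure
```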
43
How to see the structure of an object?
str(object_name) Example: str(murders) will show the structure of the data stored inside of the data frame "murders" Output: 'data.frame': 51 obs. of 5 variables: $ state : chr "Alabama" etc. $ abb : etc. $ region: etc $ population : etc. $ total : etc. etc. (I did not list the full output above as it was lengthy)
44
What does obs. stand for?
observations AKA rows in a table
45
abb
abbreviation
46
How do you see the first six lines of the data stored inside the data frame "murders"? What script do you use?
head(murders)
47
What is $
It's the accessor. Example: murders$population The output is the population column of the data frame "murders": [1] 4779736 710231 6392017 etc. *Note: the order of the entries in the vector murders$population preserves the order of the rows in the data table.
48
How do you get the names of columns in a data frame?
names(name_of_data_frame)
Example: names(murders)
Output: [1] "state" "abb" "region" "population" "total"
OR str(object_name)
Example: str(murders) will show the structure of the data stored inside of the data frame "murders"
Output: 'data.frame': 51 obs. of 5 variables: $ state : chr "Alabama" etc. $ abb : etc. $ region: etc $ population : etc. $ total : etc. etc. (I did not list the full output above as it was lengthy)
49
What function do you use to determine how many entries are in a vector?
The function length: length(vector_name) Example: pop <- murders$population length(pop) output: [1] 51 because there are 51 entries, one for each row of the murders data set.
50
Character Vectors
Note: We use quotes to distinguish between variable names and character strings. "a" will give you the character string a
51
Logical Vectors
These must be either TRUE or FALSE. For example: z <- 3 == 2 z output: [1] FALSE class(z) output: [1] "logical"
52
==
a relational operator that asks if a value equals another value For example: z <- 3 == 2 z output: [1] FALSE It's false because 3 does not equal 2.
53
Factor
Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. Factors/categorical variables are stored in levels. R stores each level as an integer. (This is more memory efficient than storing all of the characters.) Note: the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. Factors are useful for storing categorical data. In the example below, regions are categorical.
Example: class(murders$region)
class tells you what kind of object it is. murders is the data frame that contains the data about the number of homicides in the U.S. $ is the accessor; murders$region is the column "region" in the murders data frame. So you're asking R what type of object the column region in the murders data frame is.
output: [1] "factor"
So region is a factor, not a character vector. To see the specific regions: levels(murders$region)
output: [1] "Northeast" "South" "North Central" "West"
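To see the integer storage described above, here is a minimal sketch using a small made-up factor (the `region` values below are illustrative, not the murders data).

```r
# A factor stores each category as an integer code plus a table of level labels.
region <- factor(c("West", "South", "West", "Northeast", "South"))

lv <- levels(region)         # the distinct categories, alphabetical by default
codes <- as.integer(region)  # the integer code stored for each entry
n_lv <- nlevels(region)      # how many levels there are

print(lv)
print(codes)
print(n_lv)
table(region)                # frequency of each level
```

Because the default levels are alphabetical, "Northeast" gets code 1 even though it appears fourth in the data.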
54
How to see the categories in a factor?
levels(data_frame_name$Column_name) Example: levels(murders$region) murders is the data frame that contains the data about the number of homicides in the U.S. $ is the accessor; output is the column "region" in the murders data frame output: [1] "Northeast" "South" "North Central" "West"
55
What is the storage process in R?
For factors, R stores integers in the background (one integer code per level) rather than the character strings themselves. Integers are smaller memory-wise than characters.
56
Factors vs Characters
Factors can be easily confused with characters, so be careful! General advice: avoid factors as much as possible, though they are sometimes necessary to fit statistical models that depend on categorical data.
57
the function class() helps us_______
determine the type of an object
58
data frames can be thought of as____
tables with rows representing observations and columns representing different variables
59
a vector is_____
an object consisting of several entries and can be a numeric vector, a character vector, or a logical vector (i.e., must either be true or false). A vector is a series of values, all of the same type. They are the most basic unit of data in R. They can store numerical, character or logical data. In R, we can create a vector with the c function, which is short for concatenate. To concatenate, we write the elements of the vector separated by a comma in parentheses. For example, a numerical vector containing costs can be created like this: cost <- c(50, 75, 90, 100, 150)
60
____ are useful for storing categorical data and are more efficient than storing characters
Factors
60
we use __ to distinguish between variable names and character strings
quotes "a" will give you the character string a
61
levels()
The function levels() can be used to determine the levels of a factor. Example: levels(murders$region) output: [1] "Northeast" "South" "North Central" "West"
62
nlevels()
Use it to determine the number of levels of a factor.
Example:
# Creating a factor
gender <- factor(c("female", "male", "male", "female"))
gender
# Calling nlevels() to get the number of levels
nlevels(gender)
Output:
[1] female male male female
Levels: female male
[1] 2
63
How do you extract the population column/variable from the murders dataset?
p <- murders$population Or o <- murders[["population"]]
64
concatenate
c() To concatenate is to join values together. The `c()` function combines all the values it is given (numbers, strings, or logicals) into a single vector.
65
table()
The function table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function in one line of code.
Example: table(murders$region)
Output:
Northeast: 9
South: 17
North Central: 12
West: 13
66
How to assign area codes to countries: Italy to 380, Canada to 124, Egypt to 818
codes <- c(italy = 380, canada = 124, egypt = 818) OR codes <- c("italy" = 380, "canada" = 124, "egypt" = 818) OR codes <- c(380, 124, 818) country <- c("italy","canada","egypt") names(codes) <- country
67
seq
seq creates a sequence of numbers.
First argument: defines the start of the sequence
Second argument: defines the end of the sequence
Third argument (optional): tells seq how much to jump by; if a third argument is not entered, the default is to go up in steps of one.
Examples:
seq(1, 10)
output: [1] 1 2 3 4 5 6 7 8 9 10
seq(1, 10, 2)
output: [1] 1 3 5 7 9
Note: If you want consecutive integers, you can use the following shorthand: 1:10
Notes: The second argument is a maximum, not necessarily the last value. So if we write seq(7, 50, 7), we obtain the same vector as if we had written seq(7, 49, 7). This can be useful because sometimes we want evenly spaced numbers that stay below a predetermined value.
With 1:10 or seq(1, 10), R produces integers, not numerics, because they are typically used to index something.
68
Useful functions for creating vectors
the function c() or concatenate AND seq() which generates sequences
69
Subsetting
Subsetting lets you access specific elements of a vector by using square brackets.
Examples:
codes[2]
output: canada 124
codes[c(1,3)]
output: italy egypt 380 818
codes[1:2]
output: italy canada 380 124
If the entries of a vector are named, they may be accessed by referring to their names:
codes["canada"]
output: canada 124
codes[c("egypt","italy")]
output: egypt italy 818 380
To access the 3rd, 4th, and 5th elements of the cost vector: cost[3:5] OR cost[c(3,4,5)]
*The : operator helps condense the code and obtain consecutive values from a range.
To access just the first item and fifth item in the cost vector: cost[c(1,5)]
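The subsetting examples can be run end to end with the `codes` and `cost` vectors defined on other cards in this deck. This is a sketch; the result names (`second`, `by_name`, and so on) are illustrative.

```r
# Named vector of codes and an unnamed vector of costs, as defined on other cards.
codes <- c(italy = 380, canada = 124, egypt = 818)
cost <- c(50, 75, 90, 100, 150)

second <- codes[2]                       # by position: the second element
first_third <- codes[c(1, 3)]            # several positions at once
by_name <- codes[c("egypt", "italy")]    # by name, in the order requested
middle <- cost[3:5]                      # consecutive positions with the : operator
ends <- cost[c(1, 5)]                    # first and fifth elements only

print(second)
print(by_name)
print(middle)
```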
70
Coercion
Is an attempt by R to be flexible with data types by guessing what was meant when an entry does not match as expected. For example: x <- c(1, "canada", 3) R coerced the data into characters. It guessed that because you put a character string in the vector (i.e., "canada") that you meant the 1 and 3 to be character strings (i.e., "1" and "3") Note: You won't get an error or a warning with the above example. You may not initially realize that R changed 1 and 3 to character strings In the above situation, we'd say that "R coerced the data into a character string"
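The silent coercion described above can be confirmed with `class()`. A minimal sketch:

```r
# Mixing numbers and a string in c() silently converts everything to character.
x <- c(1, "canada", 3)

print(x)          # the numbers now appear in quotes
print(class(x))   # the whole vector is of class "character"
```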
71
What function turns numbers into characters?
as.character() Example: x <- 1:5 y <- as.character(x) y output: [1] "1" "2" "3" "4" "5"
72
What function turns characters into numbers?
as.numeric() Example: Change "1", "2", "3", "4", "5" (characters) to 1, 2, 3, 4, 5 (numeric) x <- 1:5 y <- as.character(x) y output: [1] "1" "2" "3" "4" "5" as.numeric(y) output: [1] 1 2 3 4 5
73
What does NA mean in R?
In R, missing data is assigned to the value NA NA is the special value for missing data. Not Available Example: When R fails to coerce something, we'll get NA. For instance, in the below example R will be able to convert the "1" and "3" to numeric values, but it won't know what to do with b. x <- c("1", "b", "3") as.numeric(x) output: [1] 1 NA 3 Warning message: NAs introduced by coercion *So the output is 1 (missing value) 3 This makes it clear that "b" is the missing value R doesn't know what to do, so instead of converting "b" to a number, it tells us that it's NA/Not Available/Missing Data It's very common to come across NA as it's used for all missing data. Be ready to see a lot of NAs
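A short sketch of the failed-coercion example, combined with `is.na()` to locate the missing value. `suppressWarnings()` is used here only to keep the output quiet; without it R prints the "NAs introduced by coercion" warning quoted above.

```r
# "b" cannot be turned into a number, so as.numeric() yields NA in that position.
x <- c("1", "b", "3")
y <- suppressWarnings(as.numeric(x))

print(y)
missing <- is.na(y)      # logical vector: TRUE where the value is NA
n_missing <- sum(missing)  # number of missing values
print(missing)
print(n_missing)
```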
74
Practice: How do you find the length of the sequence 32:99?
length(32:99)
75
Practice: Generate a sequence of numbers from 6 to 55, in increments of 4/7, and determine its size.
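One possible way to work this practice card, offered as a sketch rather than the course's official answer, is to pass 4/7 as the third argument of `seq()` and wrap the result in `length()`.

```r
# Sequence from 6 up to at most 55, stepping by 4/7 each time.
x <- seq(6, 55, 4/7)

n <- length(x)   # size of the sequence
print(n)
print(max(x))    # the last value stays at or below 55
```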
76
length.out
The seq() function has another useful argument: length.out. This argument allows us to generate sequences that increment by the same value but generate a vector of a specific length. Example 1: x <- seq(0, 100, length.out = 5) output: 0, 25, 50, 75, 100. *There are only 5 numbers in the output. Example 2: seq.int(3, 30, length.out = 10) output: 3, 6, 9, 12, 15, 18, 21, 24, 27, 30 *There are only 10 numbers in the output.
77
How do you create a vector that is the integer class?
Add the letter L after the integer. Example: class(3L) output: [1] "integer"
78
Integers vs Numeric Classes
The main difference is that integers take up less space in a computer's memory. For large operations, using integers can have a substantial impact. Otherwise, for most purposes, integers and numerics are indistinguishable. For example, 3, the integer, minus 3, the numeric, is 0. Example: 3L-3 output: [1] 0
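A brief sketch comparing the two classes. All calls are base R; the variable names are illustrative.

```r
# The L suffix makes an integer; a bare number is numeric (double precision).
int_class <- class(3L)
num_class <- class(3)

diff_value <- 3L - 3          # arithmetic mixes the two freely
same_value <- 3L == 3         # they compare as equal in value
same_object <- identical(3L, 3)  # but they are not the same type

print(int_class)
print(num_class)
print(diff_value)
print(same_object)
```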
79
Which are enclosed in quotes: character vectors or numeric vectors?
Character vectors
80
What function sorts a vector in increasing order?
sort() Example: To see the gun murder totals listed from least to most: library(dslabs) data(murders) sort(murders$total) output: [1] 2 4 5 5 7 8 11 [8] 12 12 16 19 21 22 27 etc.
81
*What function produces the indices needed to obtain the sorted vector? That is, what function tells you how to get the numbers in ascending order?
order()
Example:
x <- c(31, 4, 15, 92, 65)
sort(x)
[1] 4 15 31 65 92
order(x)
[1] 2 3 1 5 4
BECAUSE the original positions are 31 (1st), 4 (2nd), 15 (3rd), 92 (4th), 65 (5th), so the sorted values 4, 15, 31, 65, 92 come from positions 2, 3, 1, 5, 4.
82
*What function gives us the ranks of the items in the original vector? That is, what function tells you what order the numbers in the original vector are in?
rank()
rank() tells you the position each original value would take once the vector is sorted.
Example:
x <- c(31, 4, 15, 92, 65)
sort(x)
[1] 4 15 31 65 92
rank(x)
[1] 3 1 2 5 4
BECAUSE in sorted order 4 is 1st, 15 is 2nd, 31 is 3rd, 65 is 4th, and 92 is 5th, so the original entries 31, 4, 15, 92, 65 have ranks 3, 1, 2, 5, 4.
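The three functions are easy to mix up, so here is a side-by-side sketch on the same vector used in these cards. It also shows that indexing with `order(x)` reproduces `sort(x)`.

```r
x <- c(31, 4, 15, 92, 65)

sorted <- sort(x)     # the values themselves, smallest to largest
ord <- order(x)       # positions in x that would produce the sorted vector
rnk <- rank(x)        # for each original entry, its place in the sorted vector

print(sorted)
print(ord)
print(rnk)
print(x[ord])         # indexing x by order(x) gives the same result as sort(x)
```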
83
What function returns the largest value?
max() Example: max(murders$total) [1] 1257 That is, the highest number of murders listed in the total column of the murders dataset is 1257 (California).
84
What function returns the index of the largest value?
which.max() Example: max(murders$total) [1] 1257 i_max <- which.max(murders$total) i_max [1] 5 murders$state[i_max] [1] "California"
85
What's the difference between the functions max() and which.max()?
max() returns the largest value; which.max() returns the index of the largest value
86
What function returns the smallest value?
min() Example: min(murders$total) [1] 2 That is, the lowest number of murders listed in the total column of the murders dataset is 2 (Vermont).
87
What function returns the index of the smallest value?
which.min() Example: min(murders$total) [1] 2 i_min <- which.min(murders$total) i_min [1] 46 murders$state[i_min] [1] "Vermont"
88
permutation
the selection of objects in which the order of selection matters. Example: There are 720 permutations of the digits 1, 2, 3, 4, 5, and 6 vs. a combination, which means the selection of objects in which the order does not matter.
89
What's the difference between the functions min() and which.min()
min() returns the smallest value which.min() returns the index of the smallest value
90
is.na
When we apply the is.na function to a vector, it gives us a logical vector that tells us which inputs are NA. NA: Not Available. Commonly used for missing data; a common problem in real-world data sets.
91
Note: The mean() function returns NA if it finds at least one NA
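A sketch of this behavior and two common workarounds, using a small made-up vector (`scores` is hypothetical). The `na.rm` argument is part of base R's `mean()`.

```r
scores <- c(10, NA, 30)   # hypothetical values with one missing entry

with_na <- mean(scores)                    # NA: one missing value spoils the result
dropped <- mean(scores, na.rm = TRUE)      # ignore NAs when averaging
manual <- mean(scores[!is.na(scores)])     # same idea using is.na() and the ! operator

print(with_na)
print(dropped)
print(manual)
```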
92
operator !
Logical negation (NOT): !TRUE becomes FALSE; !FALSE becomes TRUE
93
When using the data set murders (with columns total and population), how do you calculate the number of murders for every 100,000 people? And how do you order the states (column state) by murder rate in decreasing order?
murder_rate <- murders$total/murders$population*100000 murders$state[order(murder_rate, decreasing=TRUE)]
94
In R, arithmetic operations on vectors occur element-wise. What does this mean?
It means that if you apply an operation like +, -, *, or / to two vectors, R matches up the elements in the same positions of those vectors and performs the operation pair by pair. When one of them is a single number, that number is applied to every element.
Example: Convert heights from inches to centimeters
heights <- c(69, 62, 66, 70)
heights * 2.54
[1] 175.26 157.48 167.64 177.80
So 69*2.54=175.26, 62*2.54=157.48, etc.
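A sketch showing both cases: a vector combined with a single number, and two vectors of equal length combined pair by pair. The vectors `a` and `b` are made up for illustration.

```r
# Case 1: vector times a single number (each height is multiplied by 2.54).
heights <- c(69, 62, 66, 70)
heights_cm <- heights * 2.54

# Case 2: two vectors of the same length, combined position by position.
a <- c(1, 2, 3)
b <- c(10, 20, 30)
sums <- a + b        # 1+10, 2+20, 3+30
products <- a * b    # 1*10, 2*20, 3*30

print(heights_cm)
print(sums)
print(products)
```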
95
Logical Operators
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
! NOT
& AND
| OR
NOTE: & returns TRUE only when both logicals are TRUE:
TRUE & TRUE gives TRUE
TRUE & FALSE gives FALSE
FALSE & FALSE gives FALSE
Example: We want to find the states in the western U.S. with a murder rate less than or equal to 1 per 100,000 people.
west <- murders$region == "West"
safe <- murder_rate <= 1
index <- safe & west
murders$state[index]
[1] "Hawaii" "Idaho" "Oregon" "Utah" "Wyoming"
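The behavior of &, |, and ! on whole vectors can be seen in a small truth-table sketch (the vectors `a` and `b` are illustrative).

```r
# Every combination of TRUE/FALSE, lined up position by position.
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

and_ab <- a & b   # TRUE only where both are TRUE
or_ab <- a | b    # TRUE where at least one is TRUE
not_a <- !a       # flips each value

print(and_ab)
print(or_ab)
print(not_a)
```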
96
which()
Gives us the indices of the entries of a logical vector that are TRUE.
Example:
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
which(x)
[1] 2 4 5
Use case: We want to look up Massachusetts' murder rate. The function which tells us which entries of a logical vector are true, so we can type:
index <- which(murders$state == "Massachusetts")
index
[1] 22
murder_rate[index]
[1] 1.802
Other option: you could index with the logical vector directly, but using which makes the index a much smaller object.
index <- murders$state == "Massachusetts"
murder_rate[index]
[1] 1.802
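A self-contained sketch of `which()` that does not need the murders data. The `labels` vector is made up for illustration.

```r
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
labels <- c("a", "b", "c", "d", "e", "f")   # hypothetical labels, one per entry of x

true_positions <- which(x)          # integer positions where x is TRUE
via_which <- labels[true_positions] # index with the short integer vector
via_logical <- labels[x]            # index with the full logical vector

print(true_positions)
print(via_which)
print(identical(via_which, via_logical))  # both routes select the same entries
```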
97
match
Looks for entries in a vector and returns the indices needed to access them.
Example: We want to find the murder rate for several different states (i.e., New York, Florida, and Texas). This function tells us which indices of the second vector match each of the entries of the first vector.
index <- match(c("New York", "Florida", "Texas"), murders$state)
index
[1] 33 10 44
murders$state[index]
[1] "New York" "Florida" "Texas"
murder_rate[index]
[1] 2.668 3.398 3.201
Notes: 33 is the index that matched New York, 10 matched Florida, 44 matched Texas
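A sketch of `match()` on made-up data (the `fruit` and `price` vectors are hypothetical). It also shows what happens when an entry is not found: `match()` returns NA for it.

```r
fruit <- c("apple", "banana", "cherry", "date")   # hypothetical lookup vector
price <- c(1.20, 0.50, 3.00, 2.25)                # hypothetical values, same order

wanted <- c("cherry", "apple", "kiwi")
idx <- match(wanted, fruit)    # position of each wanted item in fruit; NA if absent

print(idx)
print(fruit[idx])
print(price[idx])              # values come back in the order requested
```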
98
%in%
Tells you whether or not each element of a first vector is in a second vector.
Example:
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
y %in% x
[1] TRUE TRUE FALSE
Example: You aren't sure if Boston, Dakota, and Washington are states, but you want to find out.
c("Boston", "Dakota", "Washington") %in% murders$state
[1] FALSE FALSE TRUE
Only Washington is a state
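The first example runs as is in base R. The sketch below also uses the result to pull out the matching elements; the names `in_x` and `found` are illustrative.

```r
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")

in_x <- y %in% x     # one TRUE/FALSE per element of y
found <- y[in_x]     # keep only the elements of y that appear in x

print(in_x)
print(found)
```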
99
What function does the following? To count how many entries are true, the function __ returns the sum of these entries.
sum
TRUE is treated as 1 and FALSE as 0, so when we sum a logical vector, we're basically counting the cases that are true.
Example: sum(index)
[1] 5
So 5 cases in the index are true
100
We can use logicals to index vectors
For example, if we compare a vector to a single number, R performs the test for each entry.
Example: We define the index as the murder rate being smaller than 0.71 per 100,000; if we want less than or equal to, we use <= instead.
index <- murder_rate < 0.71
index <- murder_rate <= 0.71
index
[1] FALSE FALSE FALSE FALSE TRUE etc.
There are 51 entries that are either FALSE or TRUE. The entries that are TRUE are the cases for which the murder rate is less than or equal to 0.71 per 100,000 people. To see which these are, we can leverage the fact that vectors can be indexed with logicals.
murders$state[index]
[1] "Hawaii" "Iowa" "New Hampshire" "North Dakota" "Vermont"
To count how many entries are true, the function sum returns the sum of these entries. TRUE is treated as 1 and FALSE as 0, so when we sum them, we're basically counting the cases that are true.
sum(index)
[1] 5
So 5 cases in the index are true (i.e., 5 states have murder rates less than or equal to 0.71 per 100,000 people).
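The same pattern on a tiny made-up vector, so it runs without the murders data. The names `s1` to `s4` and their rates are hypothetical, not real statistics.

```r
# Hypothetical rates for four made-up units.
rate <- c(s1 = 0.50, s2 = 2.10, s3 = 0.71, s4 = 3.00)

index <- rate <= 0.71           # the comparison is applied to every entry
selected <- names(rate)[index]  # a logical vector can index another vector
n_true <- sum(index)            # TRUE counts as 1, FALSE as 0

print(index)
print(selected)
print(n_true)
```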
101
What's the main strength of R?
Exploratory data visualization
102
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an important step in data science and data analytics: it uses visualization to understand the data's main features, find patterns, and discover how different parts of the data are connected.
103
D3
D3 is a JavaScript library and framework for creating visualizations. It's more flexible and powerful than R, but it takes longer to generate a plot.
104
In R you can take an inference (e.g., states with larger populations are likely to have more murders) and plot the two quantities together to ascertain the truth of that statement. Example
Example: population_in_millions <- murders$population/10^6 total_gun_murders <- murders$total plot(population_in_millions, total_gun_murders)
105
What are histograms best used for? Histogram example using R
Histograms are graphical summaries that give you a general overview of the types of values you have. In R, they can be produced using the hist() function.
# a histogram of murder rates (mutate() comes from the dplyr package)
murders <- mutate(murders, rate = total / population * 100000)
hist(murders$rate)
106
What are boxplots best used for? Boxplot example using R
Boxplots provide a more compact summary of a distribution than a histogram and are more useful for comparing distributions. # boxplots of murder rates by region boxplot(rate~region, data = murders) # in a single line of code, stratify state populations by region and generate boxplots for the strata for the `murders` data set boxplot(population~region, data = murders)
107
How do you create a scatterplot? Example of a simple scatterplot
Use the function plot() a simple scatterplot of total murders versus population x <- murders$population /10^6 y <- murders$total plot(x, y) Or For a quick plot that avoids accessing variables twice, we can use the with function: with(murders, plot(population, total)) The function with lets us use the murders column names in the plot function. It also works with any data frames and any function.
108
exponential relationship
a mathematical relationship where one quantity changes by a constant multiplicative factor for a constant change in another quantity, meaning the output value is multiplied by a fixed amount each time the input increases by a fixed amount.
109
linear relationship
a connection between two variables that graphs as a straight line, where a change in one variable results in a proportional, constant change in the other
110
mutate
Use to add a new column to the data table or to change an existing one. first argument = data frame second argument = name and value of the variable Example 1: add the murder rates to our murders data frame. library(dplyr) library(dslabs) data("murders") murders <- mutate(murders, rate = total/population*100000) So data frame: murders name and value of variable: rate = total/population*100000 Example 2: Take the log transformation of the population variable: mutate(murders, population = log10(population)) Example 3: apply the same transformation to several variables. mutate(murders, across(c(population, total), log10)) Example 4: apply the transformation to every numeric column. mutate(murders, across(where(is.numeric), log10)) *Notes: -Notice that we used "total" and "population" inside the function, which are objects that are not defined in our workspace. We don't get an error, because functions inside the dplyr package know to look for variables in the data frame provided in the first argument. In the call to mutate above, total will have the values in murders$total. -The output of the above code will show an updated table/murders object with a new column. BUT even though we have overwritten the original murders object, it does not change the object that is loaded with data(murders). So, if we load the murders data again, the original will overwrite our mutated version. -Like the filter function, we can use the data table variable names inside the function, and R will know that we mean the columns and not the objects in the workspace.
111
select
Use to subset the data by selecting specific COLUMNS. *Subset: To access specific parts of an object (e.g., elements of a vector or columns of a data frame). Example: new_table <- select(murders, state, region, rate) filter(new_table, rate <= 0.71) In the above code, we select 3 columns, assign them to a new object, and then filter the object. General syntax: select(data_frame, column_name_1, column_name_2) Example: select(murders, state, abb) selects the columns "state" and "abb" from the data frame "murders". Use Case: We have a data table with hundreds of columns, but we just want to view a few of the columns.
112
pipe operator
Use to perform a series of operations
113
filter
Use to subset the data by filtering specific ROWS. first argument = data table second argument = the conditional statement Example: new_table <- select(murders, state, region, rate) filter(new_table, rate <= 0.71) So, first we select columns (i.e., state, region, rate) from the murders data frame and assign the result to a new object (i.e., new_table), and then we filter its rows. Notes: -Like the mutate function, we can use the data table variable names inside the function, and R will know that we mean the columns and not the objects in the workspace. *Subset: To access specific parts of an object (e.g., elements of a vector or rows of a data frame).
114
pipe
pipe operator: |> (available starting with R version 4.1.0) OR %>% (tidyverse operator, from the magrittr package) Normally, we have to define an intermediate object in order to use select, mutate, and filter together. Example: We defined "new_table" (the intermediate object) below. new_table <- select(murders, state, region, rate) filter(new_table, rate <= 0.71) BUT with the pipe, we can avoid that. We can write code that looks more like what we want to do (i.e., data > select > filter) AKA take the original data > select some columns > and then filter some rows. So, the pipe makes it possible to perform a series of operations by sending the result of one function to the next function. Example: murders |> select(state, region, rate) |> filter(rate <= 0.71) So: data table |> (pipe) select(columns) |> (pipe) filter(condition on rows) Note: -When using the pipe, we no longer need to specify the first argument (the data), because the pipe passes whatever is on its left as the first argument of the function on its right.
115
How to install and load the dplyr package
installing and loading the dplyr package install.packages("dplyr") library(dplyr)
116
rank(x) vs rank(-x)
rank(x) gives you the ranks from lowest to highest. rank(-x) gives you the ranks from highest to lowest.
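A small worked example, using a made-up vector x, shows how negating the values flips the ranking:

```r
# Hypothetical values chosen so there are no ties
x <- c(31, 4, 15, 92, 65)

ranks_low_to_high <- rank(x)    # smallest value gets rank 1
ranks_high_to_low <- rank(-x)   # largest value gets rank 1

print(ranks_low_to_high)   # 3 1 2 5 4
print(ranks_high_to_low)   # 3 5 4 1 2
```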
117
What operator do you use to remove rows?
!= Example: Remove Florida row no_florida <- filter(murders, state != "Florida")
118
How to create data frames in r
Example: grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90)) grades output: names exam_1 exam_2 1 John 95 90 2 Juan 80 85 3 Jean 90 85 4 Yao 85 90 Note: -Before R version 4.0, data.frame turned character columns into factors by default. To keep the strings as characters, you had to write stringsAsFactors = FALSE as the last argument. Since R 4.0 that is the default, so it's no longer necessary.
119
Factor vs Character
Character Type: A character vector is used to store text data. Usage: Typically used for text strings or identifiers that do not have a defined set of levels. Example: "apple", "banana", "cherry". When to Use: Use character vectors for free-text data that doesn't have a fixed set of categories. Factor Type: A factor is used to represent categorical data. Internally, factors are stored as integer vectors with a corresponding set of character labels. Usage: Used to store categorical data that has a fixed set of possible values (levels). Example: A factor variable fruit with levels "apple", "banana", and "cherry". When to Use: Use factors for categorical data, especially when the categories are important for statistical analysis or when you need to specify the order of levels.
120
rank(x) vs rank(-x)
Note that if rank(x) gives you the ranks of x from smallest to largest, rank(-x) gives you the ranks from largest to smallest.
121
nrow()
Count the number of rows Example: library(dplyr) library(dslabs) data(murders) no_south <- filter(murders, region != "South") nrow(no_south) You'll get the number of rows where the region does not equal "South"
122
Practice: Use a single line of code to create a new data frame, called my_states, that has homicide rate and rank columns (with the rank ordered from highest to lowest), considering only states in the Northeast or West that have a homicide rate less than 1 and containing only the state, rate, and rank columns. The line must have four components separated by three |> operators: The original murders data set. A call to mutate to add homicide rate and rank. A call to filter to keep only northeastern or western states that have a homicide rate less than 1. A call to select that keeps only the columns with the state name, homicide rate, and rank. Compare with the code without the |> operator.
library(dplyr) library(dslabs) data(murders) my_states <- murders |> mutate(rate = total / population * 100000, rank = rank(-rate)) |> filter(region %in% c("Northeast", "West") & rate < 1) |> select(state, rate, rank) vs without the |> operator: library(dplyr) library(dslabs) data(murders) murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate)) my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1) my_states <- select(my_states, state, rate, rank)
123
dplyr
The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember Operations: mutate() adds new variables that are functions of existing variables select() picks variables based on their names. (COLUMNS) filter() picks cases based on their values. (ROWS) summarise() reduces multiple values down to a single summary. arrange() changes the ordering of the rows.
124
Package vs library vs data
Package: a collection of R functions, data and compiled code. Library: The location where the packages are stored is called the library. Data: the data table
125
Standard Deviation (SD)
The measure of how spread out numbers are around their average. How to calculate SD: -subtract the mean from each number -square the results -add them up -divide by the length of the list (i.e., the total number of numbers in the list) -take the square root of the result Low standard deviation: the data is closely clustered around the mean/average High standard deviation: the data is dispersed over a wider range of values Use standard deviation to determine if data is standard and expected OR unusual and unexpected. A data point that is beyond a certain number of standard deviations from the mean (e.g., 3σ) represents an outcome that is significantly above or below the average. This can be used to determine if a result is "statistically significant" or part of "expected variation". Use Case: Is a bottle with an extra ounce of soda expected, or is it statistically significant and warranting additional investigation into the production line? Notes: -The mathematics symbol (not relevant for R) for standard deviation is the lowercase Greek letter sigma (σ) 68-95-99.7 rule (or Empirical rule), for approximately normal data: -68% of the data fall within 1 standard deviation of the mean. -95% of the data fall within 2 standard deviations of the mean. -99.7% of the data fall within 3 standard deviations of the mean. 5 Sigma Results: Results that are 5 standard deviations above or below the mean. (A result that deviates this much may signify a discovery, as it has only a 1 in 3.5 million chance of being due to random fluctuation.)
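The calculation steps listed in this card can be followed by hand in R. This is a sketch with made-up numbers; note that the steps divide by n (the population SD), whereas R's built-in sd() divides by n - 1 (the sample SD), so the two results differ slightly.

```r
# Hypothetical data with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

deviations <- x - mean(x)                  # subtract the mean from each number
squared <- deviations^2                    # square the results
total <- sum(squared)                      # add them up
pop_sd <- sqrt(total / length(x))          # divide by n, then take the square root

sample_sd <- sd(x)                         # built-in version divides by n - 1

print(pop_sd)      # 2
print(sample_sd)   # a little larger than 2
```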
126
summarize
dplyr verb The summarize function in dplyr provides a way to compute summary statistics for your data. Example: library(dplyr) library(dslabs) data(murders) murders <- mutate(murders, rate = total / population * 10^5) s <- murders |> filter(region == "West") |> summarize(minimum = min(rate), median = median(rate), maximum = max(rate)) s minimum median maximum 1 0.515 1.29 3.63 Note: -Because the resulting table s is a data frame, we can access the components with the accessor (i.e., $) s$median [1] 1.29 Example: If you wanted to know the mean of `carat` in the diamonds dataset (from ggplot2), you could run: summarize(diamonds, mean_carat = mean(carat))
127
arrange
use arrange to order entire data frames (as opposed to using order and sort to order different columns). Example: Order the states by population size: murders |> arrange(population) |> head() state abb region pop total rate Wyoming WY West 563626 5 0.887 So, with arrange we get to decide which column to sort by. To see the states sorted by murder rates, for example, we would use arrange(rate) instead. Note: Default is ascending order. To do descending order: murders |> arrange(desc(rate))
128
minimum, median, and maximum
Example: library(dplyr) library(dslabs) data(murders) murders <- mutate(murders, rate = total / population * 10^5) s <- murders |> filter(region == "West") |> summarize(minimum = min(rate), median = median(rate), maximum = max(rate)) s minimum median maximum 1 0.515 1.29 3.63
129
dangers with the average
Example: If we wanted to compute the murder rate for the entire country, we could not just take the average of the state rates, because that does not take into account that some states are more populous than others and need to be weighted more. Average rate: mean(murders$rate) [1] 2.78 Instead, we can compute the rate using this code & the summarize function: us_murder_rate <- murders |> summarize(rate = sum(total) / sum(population) * 10^5) us_murder_rate rate 1 3.03
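The gap between the two numbers is easier to see on a tiny made-up table. This base R sketch uses a hypothetical two-state data frame named toy, not the real murders data.

```r
# Hypothetical data: one small state and one large state
toy <- data.frame(
  state = c("Small", "Large"),
  population = c(1e6, 9e6),
  total = c(10, 270)
)

# Per-state rates per 100,000 people
toy$rate <- toy$total / toy$population * 10^5

avg_of_rates <- mean(toy$rate)                                # treats both states equally
overall_rate <- sum(toy$total) / sum(toy$population) * 10^5   # weights by population

print(avg_of_rates)   # 2
print(overall_rate)   # 2.8
```

The overall rate sits closer to the large state's rate because most people live there.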
130
quantile()
Example: The line of code below returns the minimum (0), median (0.5), and maximum (1) values. minimum = 0th percentile of the vector median = 50th percentile of the vector maximum = 100th percentile of the vector quantile(x, c(0, 0.5, 1)) This returns the minimum, median, and maximum of the vector x. If you use a function that returns 2 or more values inside summarize (see below), you get a table with one row for each value returned by quantile (3 rows here) rather than one row with 3 columns. murders |> filter(region == "West") |> summarize(range = quantile(rate, c(0, 0.5, 1))) range 1 0.515 2 1.292 3 3.630 If you want to have them in columns, then you need to write a function that returns a data frame rather than a vector. #Define my_quantile as a function my_quantile <- function(x){ r <- quantile(x, c(0, 0.5, 1)) data.frame(minimum = r[1], median = r[2], maximum = r[3]) } murders |> filter(region == "West") |> summarize(my_quantile(rate)) minimum median maximum 1 0.515 1.29 3.63
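As a quick check in base R, the three quantiles can be compared with min, median, and max directly. The vector x below is made up for illustration.

```r
# Hypothetical numeric vector
x <- c(3, 1, 4, 1, 5, 9, 2, 6)

q <- quantile(x, c(0, 0.5, 1))   # named vector: 0%, 50%, 100%
direct <- c(min(x), median(x), max(x))

print(q)
print(direct)
```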
131
pull()
The dplyr pull function can be used to access values stored in data when using pipes. When a data object is piped, that object's columns can be accessed using the pull() function. Notes: -The dplyr function summarize always returns a data frame. This may be a problem if you want to use the result with functions that require a numeric value. Use the pull function if you want a numeric value rather than a data frame. Examples: library(tidyverse) library(dslabs) data(murders) murders <- mutate(murders, rate = total / population * 10^5) #average rate adjusted by population size (weighted average): us_murder_rate <- murders %>% summarize(rate = sum(total) / sum(population) * 10^5) us_murder_rate #us_murder_rate is stored as a data frame: class(us_murder_rate) #the pull function can return it as a numeric value: us_murder_rate %>% pull(rate) #using pull to save the number directly: us_murder_rate <- murders %>% summarize(rate = sum(total) / sum(population) * 10^5) %>% pull(rate) us_murder_rate #us_murder_rate is now stored as a number: class(us_murder_rate)
132
How do you get numeric values or vectors?
Use the accessor $ or the dplyr pull function
133
. the dot
Think of the dot as a placeholder for the data that's being passed through the pipe. The dot works with the magrittr pipe %>%, not with the native pipe |>. Example of a dot being used to imitate the pull function: us_murder_rate <- murders %>% summarize(rate = sum(total) / sum(population) * 10^5) %>% .$rate us_murder_rate [1] 3.03 class(us_murder_rate) [1] "numeric"
134
group_by ()
A common operation in data exploration: Split the data into groups and then compute summaries for each group. Use the group_by () function to do this. Example: In the below code, the summarize function will apply a summarization to each group separately. (So, this happens whenever summarize follows group_by.) murders |> group_by (region) |> summarize(median = median(rate))
135
What is a package? What are the following? dplyr purrr ggplot2
A package is a set of R functions, compiled code, and sample data. dplyr: A package for manipulating data frames purrr: A package for working with functions ggplot2: A graphing package
136
Observation
In statistics, an observation is one occurrence of something you're measuring. Example: You're measuring the weight of a certain species of turtle. Each turtle whose weight you record counts as one single observation. In R, each observation is represented as a row in a data frame.
137
Matrix vs Data Frame vs vector
Matrix: A 2-D collection of elements of the same data type (i.e., all numeric, all character, etc.). This means it has both rows and columns. Visual Example:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
Data Frame: It's like a table in a spreadsheet; different columns can hold different data types (e.g., numeric, character, logical, etc.), although all values within a single column must be of the same data type. Visual Example: Data Frame (mixed types)
┌────────┬─────┬────────┐
│ Name   │ Age │ Passed │
├────────┼─────┼────────┤
│ Alice  │ 25  │ TRUE   │
│ Bob    │ 30  │ FALSE  │
│ Carol  │ 22  │ TRUE   │
└────────┴─────┴────────┘
Vector: a one-dimensional sequence of data elements of the same type.
138
We say that a data table is in tidy format when...
each row represents one observation and the columns represent the different variables available for each of these observations.
139
sd()
standard deviation
140
arrange()
Use arrange to decide which column to sort by. Example: murders |> arrange(rate) |> head() output: state abb region pop total rate Vermont VT Northeast 625741 2 0.320 Hawaii HI West 1360301 7 0.515 Iowa IA N Central 3046355 21 0.689 Note: In dplyr, the default is to arrange in ascending order. To see in descending order: murders |> arrange(desc(rate))
141
Nested Sorting
If we are ordering by a column with ties, we can use a second column to break the tie. Similarly, a third column can be used to break ties between first and second and so on. Example: Here we order by region, then within region we order by murder rate: murders |> arrange(region, rate) |> head() #> state abb region population total rate #> 1 Vermont VT Northeast 625741 2 0.320 #> 2 New Hampshire NH Northeast 1316470 5 0.380 #> 3 Maine ME Northeast 1328361 11 0.828 #> 4 Rhode Island RI Northeast 1052567 16 1.520 #> 5 Massachusetts MA Northeast 6547629 118 1.802 #> 6 New York NY Northeast 19378102 517 2.668
142
top_n
This function takes a data frame as its first argument, the number of rows to show in the second, and the variable to filter by in the third. Note that the rows are not sorted by that variable. Example: murders |> top_n(5, rate) #> state abb region population total rate #> 1 District of Columbia DC South 601723 99 16.45 #> 2 Louisiana LA South 4533372 351 7.74 #> 3 Maryland MD South 5773552 293 5.07 #> 4 Missouri MO North Central 5988927 321 5.36 #> 5 South Carolina SC South 4625364 207 4.48
143
tbl (pronounced tibble)
A special kind of data frame. You can think of them as modern versions of data frames. Tibbles -Never change the data type of the inputs -Never change the names of your variables -Never create row names Make printing easier (i.e., they won't overload your console as they're set up to only pull up the first ten rows). Differences between a data frame and a tibble: -The print method for tibbles is more readable. -If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or a scalar. With tibble, this does not happen, which is useful since tidyverse functions require data frames as an input. -With tibbles, if you want to access the vector that defines a column, and not get back a data frame, you need to use the accessor $: class(as_tibble(murders)$population) #> [1] "numeric" -While data frame columns need to be vectors of numbers, strings, or logical values, tibbles can have more complex objects, such as lists or functions. -Tibbles can be grouped The function group_by returns a special kind of tibble: a grouped tibble. tibbles are the preferred format in the tidyverse, so tidyverse functions that produce a data frame from scratch return a tibble. (The functions group_by and summarize always return a tbl data frame.)
144
as_tibble()
To convert a dataframe to a tibble. Example: as_tibble(murders)
145
placeholder operator
. Used as a shorthand reference to the current object being passed through the pipeline. Note: the dot placeholder belongs to the magrittr pipe %>%; the native pipe |> does not recognize it.
Example, Implicit Placeholder: 1:5 %>% mean() %>% sqrt() visual flow: 1:5, then mean(.) #mean(1:5), then sqrt(.) #sqrt(mean(1:5))
Example, Explicit Placeholder Inside an Expression: 1:5 %>% mean() %>% {. - 2} visual flow: 1:5, then mean(.) #mean(1:5), then {. - 2} #mean(1:5) - 2
Example, Data Frame with dplyr: iris %>% filter(.$Species == "setosa") %>% summarize(avg = mean(.$Sepal.Length)) visual flow: iris, then filter(.$Species == "setosa") #filter(iris, iris$Species == "setosa"), then summarize(avg = mean(.$Sepal.Length)) #mean of Sepal.Length in the filtered data
Example, Formula Shorthand (purrr, not base sapply): map_dbl(1:5, ~ .^2) visual flow: each element of 1:5 is passed to ~ .^2 #(1)^2, (2)^2, (3)^2, ...
146
Do you need to reinstall packages every session?
No. Once you install a package, it remains installed and only needs to be loaded with library() in each new session.
147
args()
If you want a quick look at a function's arguments without opening the help system, use the args() function. Example: args(log)
148
How do you see the arithmetic operators? How do you see the relational operators?
Arithmetic operators: help("+") + x - x x + y x - y x * y x / y x ^ y x %% y x %/% y Relational operators: help(">") x < y x > y x <= y x >= y x == y x != y
149
To specify arguments, we must use ____ and cannot use ___.
To specify arguments, we must use =, and cannot use <-. Example: To what power must I raise 2 to get 8? log(base = 2, x = 8) #> [1] 3
150
Prebuilt Objects
There are several datasets that are included for users to practice and test out functions. You can see all the available datasets by typing: data() This shows you the object name for these datasets. These datasets are objects that can be used by simply typing the name. For example, if you type: co2 R will show you Mauna Loa atmospheric CO2 concentration data.
151
vector numeric vector character vector logical vector
This term is used to refer to objects with several entries. The function length tells you how many entries are in the vector. Example: the object murders$population is not one number but several, so it is a vector. Numeric Vector: In a numeric vector, every entry must be a number. Character Vector: All entries in a character vector need to be characters. Logical Vector: All entries must be either TRUE or FALSE.
152
How do you see all relational operators?
?Comparison Description Binary operators which allow the comparison of values in atomic vectors. Usage x < y x > y x <= y x >= y x == y x != y
153
How do you change the levels in a factor?
levels: You can specify an order through the levels argument when creating the factor with the factor function. reorder(): lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. Example: Take the sum of the total murders in each region and reorder the factor following those sums. region <- murders$region value <- murders$total region <- reorder(region, value, FUN = sum) levels(region) [1] "Northeast" "North Central" "West" "South" *Note: Factors sometimes behave like characters and sometimes they don't. Confusing factors and characters is a common source of bugs. Reminder: Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. Factors in R are stored as a vector of integer values with a corresponding set of character values to use when the factor is displayed. Factors/categorical variables are stored in levels. R stores each level as an integer. (This is more memory efficient than storing all of the characters.) Note: the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order.
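The reorder step can be tried on a small made-up factor. In this sketch the vectors region and value are hypothetical, standing in for murders$region and murders$total.

```r
# Hypothetical data: four observations in three regions
region <- factor(c("West", "South", "West", "Northeast"))
value <- c(5, 20, 3, 1)

default_levels <- levels(region)   # alphabetical by default
codes <- as.integer(region)        # the integers R stores internally

# Reorder levels by the sum of value within each region
region_by_sum <- reorder(region, value, FUN = sum)
new_levels <- levels(region_by_sum)

print(default_levels)   # "Northeast" "South" "West"
print(codes)            # 3 2 3 1
print(new_levels)       # "Northeast" "West" "South"
```

The sums are 1 for Northeast, 8 for West, and 20 for South, so the levels follow that ascending order.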
154
list()
Data Frames are a special case of lists. Lists allow you to store any combination of different types. Example, How to create a list: record <- list(name = "John Doe", student_id = 1234, grades = c(95, 82, 91, 97, 93), final_grade = "A")
155
How to extract components of a list?
With the accessor $ or with double square brackets [[ ]] Examples: record$student_id [1] 1234 OR record[["student_id"]] [1] 1234 BUT if a list does not have names, you can only extract the elements with the brackets, not the accessor. Example: record[[1]] [1] "John Doe" Notes: What does it mean if a list does not have names? Usually when you create a list, you give names to each element. For example, printing a list named person might show: $name [1] "Ken" $age [1] 34 $city [1] "Chicago" In this example, each element has a name (i.e., name, age, city), so you can access them by name: person$name person[["city"]] BUT you can also create lists without naming the elements: my_list <- list("Ken", 34, "Chicago") To access those elements, you would need to do: my_list[[1]] # "Ken" my_list[[2]] # 34
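Both kinds of list can be built and queried in a few lines. This sketch reuses the record example from the previous card alongside an unnamed list.

```r
# A named list
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")

by_dollar <- record$student_id         # accessor
by_name <- record[["student_id"]]      # double brackets with a name
by_position <- record[[1]]             # double brackets with a position

# An unnamed list: only positions work
my_list <- list("Ken", 34, "Chicago")
second <- my_list[[2]]
has_names <- !is.null(names(my_list))

print(by_dollar)
print(by_position)
print(second)
print(has_names)
```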
156
How do you access specific entries in a matrix?
Use square brackets. syntax: matrix_name[row, column] Example 1: If you want the second row, third column in a matrix: mat[2, 3] [1] 10 Example 2: If you want the entire second row, leave the column spot empty: mat[2, ] [1] 2 6 10 Example 3: If you want the entire third column, leave the row spot empty: mat[ , 3] [1] 9 10 11 12 Example 4: Access more than one column or more than one row: mat[ , 2:3] Example 5: You can subset both rows and columns: mat[1:2, 2:3] Reminder: Matrix: A 2-D collection of elements of the same data type (i.e., all numeric, all character, etc.) Visual Example:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
Data Frame: It's like a table in a spreadsheet; it can hold different data types in each column (e.g., numeric, character, logical, etc.) Visual Example: Data Frame (mixed types)
┌────────┬─────┬────────┐
│ Name   │ Age │ Passed │
├────────┼─────┼────────┤
│ Alice  │ 25  │ TRUE   │
│ Bob    │ 30  │ FALSE  │
│ Carol  │ 22  │ TRUE   │
└────────┴─────┴────────┘
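The card does not show how mat was created. The outputs it lists are consistent with a 4-by-3 matrix filled column by column with 1 to 12, so this sketch assumes that definition.

```r
# Assumed definition of mat (not given in the card)
mat <- matrix(1:12, nrow = 4, ncol = 3)

one_entry <- mat[2, 3]      # second row, third column
second_row <- mat[2, ]      # entire second row
third_col <- mat[, 3]       # entire third column
block <- mat[1:2, 2:3]      # rows 1-2 of columns 2-3

print(one_entry)    # 10
print(second_row)   # 2 6 10
print(third_col)    # 9 10 11 12
print(dim(block))   # 2 2
```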
157
What's one way to access rows and columns of a data frame?
Use single square brackets. syntax: data_frame_name[row, column] Example: murders[25, 1] [1] "Mississippi" Example: murders[2:3, ] (*You get all columns, because you only specified the rows wanted) state abb region population total Alaska AK West 710231 19 Arizona AZ West 6392017 232
158
What function do you use to create vectors? Character vectors?
concatenate c() Example: codes <- c(380, 124, 818) codes [1] 380 124 818 Character Vectors: Use quotes to denote that the entries are characters rather than variable names. country <- c("italy", "canada", "egypt") You can also use single quotes: country <- c('italy', 'canada', 'egypt') Note: If you don't use quotes with characters, R will look for variables with those names, won't find any, and will return an error.
159
How do you name entries of a vector? Why would you do that?
Why? It can be useful. Example: When defining a vector of country codes, you can use names to connect the codes with the country names. codes <- c(italy = 380, canada = 124, egypt = 818) codes italy canada egypt 380 124 818 OR codes <- c("italy" = 380, "canada" = 124, "egypt" = 818) codes italy canada egypt 380 124 818 OR Use the names function to assign names: codes <- c(380, 124, 818) country <- c("italy", "canada", "egypt") names(codes) <- country codes italy canada egypt 380 124 818 Note: The object codes continues to be a numeric vector: class(codes) [1] "numeric" But with names: names(codes) [1] "italy" "canada" "egypt"
160
numeric vs integer
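Integer: whole numbers stored as integers, written with an L suffix (e.g., 3L) or produced by a sequence such as 1:3. Numeric: the default number type in R (double precision); it can hold decimals, and a plain 3 typed at the console is numeric, not integer.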
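The distinction is easy to check with class(). A short sketch:

```r
a <- 1     # a plain number is numeric (double precision)
b <- 1L    # the L suffix makes it an integer
s <- 1:3   # the colon operator produces integers

class_a <- class(a)
class_b <- class(b)
class_s <- class(s)
same_value <- a == b            # equal as values
same_object <- identical(a, b)  # but not the same type

print(c(class_a, class_b, class_s))
print(same_value)
print(same_object)
```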
161
Recycling
In R, recycling refers to how R handles operations between vectors of different lengths. When you perform arithmetic (like addition, subtraction, multiplication, etc.) on two vectors that aren't the same length, R automatically "recycles" (repeats) the shorter vector until it matches the length of the longer one. Note: it's a common source of unnoticed errors. If you accidentally have mismatched vector lengths, R won't stop you; it'll quietly recycle values, which can give you plausible-looking but wrong results. R issues a warning only when the longer length is not a multiple of the shorter one; otherwise you receive no error or warning, so you must be careful to ensure that your vectors are the same length!
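A short sketch with made-up vectors shows both the silent case and the case where R does warn:

```r
long <- c(1, 2, 3, 4)
short <- c(10, 20)

# Length 4 is a multiple of length 2: short is repeated silently
recycled <- long + short

# Length 3 is not a multiple of length 2: R still recycles but warns.
# tryCatch captures the warning text instead of printing it.
warn_msg <- tryCatch({
  c(1, 2, 3) + c(10, 20)
  "no warning"
}, warning = function(w) conditionMessage(w))

print(recycled)   # 11 22 13 24
print(warn_msg)
```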
162
Flow control
set actions to occur only if a condition or a set of conditions are met.
163
Conditional Expressions
Conditional expressions are one of the basic features of programming. They are used for what is called flow control. The most common conditional expression is the if-else statement Example: a <- 0 if (a != 0) { print(1/a) } else{ print("No reciprocal for 0.") } #> [1] "No reciprocal for 0."
164
ifelse
The ifelse function allows you to perform element-wise conditional operations on vectors. syntax: ifelse(test_expression, x, y) test_expression: an object which can be coerced to logical mode x: the value or expression to be returned where the condition is TRUE. It can be a single value, vector, or expression. y: the value or expression to be returned where the condition is FALSE. It can be a single value, vector, or expression.
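A minimal sketch of the syntax, using a made-up vector a: return the reciprocal where an entry is positive and NA elsewhere.

```r
a <- c(0, 1, 2, -4, 5)

# The test is evaluated for each entry; the result has the same length as a
result <- ifelse(a > 0, 1 / a, NA)

print(result)   # NA 1.0 0.5 NA 0.2
```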
165
%%
modulo operator Gives you the remainder of an integer division. Example: # create a vector a <- c(5, 7, 2, 9) # check if each element in a is even or odd ifelse(a %% 2 == 0, "even", "odd") [1] "odd" "odd" "even" "odd" So, 5 / 2 = 2 with a remainder of 1, so 5 %% 2 is 1, and since that isn't 0, 5 is odd. Remember: Dividing by 2 can tell you if a number is even or odd. Even numbers are perfectly divisible by 2 (remainder 0). Odd numbers leave a remainder of 1.
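The remainder and the integer quotient can be looked at side by side. A short sketch:

```r
remainder <- 5 %% 2    # what is left over after dividing 5 by 2
quotient <- 5 %/% 2    # how many whole times 2 fits into 5

a <- c(5, 7, 2, 9)
parity <- ifelse(a %% 2 == 0, "even", "odd")

print(remainder)   # 1
print(quotient)    # 2
print(parity)      # "odd" "odd" "even" "odd"
```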
166
any function all function
any: takes a vector of logicals and returns TRUE if any of the entries are TRUE all: takes a vector of logicals and returns TRUE if all of the entries are TRUE Example: z <- c(TRUE, TRUE, FALSE) any(z) [1] TRUE all(z) [1] FALSE
167
namespace
A labeled container that keeps track of which functions and variables belong to which package or environment, so that two packages (or functions) can use the same name without clashing. Example: Both the dplyr package and the stats package have a function named filter() dplyr filter(): filters rows in a data frame stats filter(): applies a linear filter to a time series Think of it as R asking: When you call a function named filter(), which filter() do you mean: the one from dplyr or the one from stats?
168
How do you make sure that you use the right version of a function that has the same name in different packages?
Example: Both the dplyr package and the stats package have a function named filter() You can force the use of a specific namespace by using double colons ( ::) like this: stats:: filter dplyr:: filter
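As a minimal sketch, the stats version can be called explicitly with the double colon. dplyr is not loaded here, so only stats::filter is shown; the weights rep(1/3, 3) are an arbitrary choice that yields a 3-point moving average.

```r
# Explicitly request filter() from the stats namespace
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

# The result is a time series; convert it to a plain numeric vector
values <- as.numeric(smoothed)

print(values)   # the ends are NA because they lack a neighbor on one side
```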
169
How do you use a function in a package without loading the entire package?
Use the double colon package_name:: function_name Example: stats:: filter
170
For Loop
In programming, a for loop is a control flow statement that executes a block of code repeatedly, once for each element of a sequence. It is commonly used when you know how many times you want to execute a block of code. For loop syntax varies by programming language, but in R it's: for (x in 1:10) { print(x) } Example: Print every item in this list. fruits <- list("apple", "banana", "cherry") for (x in fruits) { print(x) } [1] "apple" [1] "banana" [1] "cherry"
171
Which is used more in R, for loops or vectorization?
Vectorization, because it results in shorter and clearer code.
172
Vectorized Function
A function that applies the same operation to each element of a vector. Example: x <- 1:10 sqrt(x) #> [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16 y <- 1:10 x*y #> [1] 1 4 9 16 25 36 49 64 81 100
173
Scalar, Define
The scalar data structure holds only a single atomic value at a time. Vectors that have a single value (length 1) are called scalars. Vectors can contain numbers, characters, factors, or logicals. But all the elements inside a vector must be of the same class. In other words, vectors can contain either numbers, characters, or logicals but not mixtures of these types of data. There is only one exception to this rule: you can include NA (this is a special type of logical) to denote missing data in vectors with other data types.
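A short sketch of these rules in base R. It shows that a single value is a vector of length 1, that mixing types makes R convert everything to one common type rather than store a mixture, and that NA does not change a vector's class.

```r
s <- 5
print(length(s))      # a "scalar" is a vector of length 1
print(is.vector(s))

# Mixing types does not produce a mixed vector; R coerces to a common type.
mixed <- c(1, "a", TRUE)
print(class(mixed))

# NA can sit alongside other values without changing the vector's class.
with_na <- c(1.5, NA, 3)
print(class(with_na))
```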
174
Functionals
Functions that help us apply the same function to each entry in a vector, matrix, data frame, or list.
175
sapply
Allows you to apply any function element-wise. Example: x <- 1:10 sapply(x, sqrt) #> [1] 1.00 1.41 1.73 2.00 2.24 2.45 2.65 2.83 3.00 3.16 So, each element of x is passed on to the function sqrt and the result is returned. These results are concatenated. In this case, the result is a vector of the same length as the original x. So, the for loop can be written as follows: n <- 1:25 s_n <- sapply(n, compute_s_n) Note: compute_s_n is a user-defined function (i.e., it must be defined earlier in the script).
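The card does not show compute_s_n, so the sketch below assumes a definition: a function that returns the sum of the integers from 1 to n. With that assumption, the loop and the sapply call produce the same values.

```r
# Assumed definition (not shown on the card): sum of the integers 1 through n.
compute_s_n <- function(n) {
  sum(seq_len(n))
}

n <- 1:25

# Loop version: store each result in a preallocated vector.
s_n_loop <- vector("numeric", length(n))
for (i in n) {
  s_n_loop[i] <- compute_s_n(i)
}

# sapply version: one call, no explicit loop or preallocation.
s_n <- sapply(n, compute_s_n)

print(identical(as.numeric(s_n), s_n_loop))
```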
176
177
FUN
Stands for "function" in many of R's "apply family" functions (i.e., apply, lapply, tapply, mapply, vapply, replicate). #Apply this function (FUN) to each element of X sapply(X, FUN) Example 1: #FUN = sqrt sapply(1:5, sqrt) [1] 1.00 1.41 1.73 2.00 2.24 Example 2: sapply(X = n, FUN = compute_s_n) So X is the data (1:25) and FUN is the function to apply (compute_s_n). Note that the argument name is an uppercase X. FUN is just a parameter that tells R: "Hey, this is a function I want you to apply to every element of X."
178
R Apply Family
A set of functions that allow users to apply a function to elements of a vector, list, or matrix. The functions: apply, lapply, tapply, mapply, vapply, and replicate. Some tidyverse-oriented courses treat these as legacy functionality and recommend the purrr package for new code instead.
179
dplyr package purrr package ggplot
dplyr package: used to manipulate data frames. Introduces functions that perform some of the most common operations when working with data frames. purrr package: used for working with functions. ggplot (the ggplot2 package): a graphing package.
180
dplyr package functions
A few of the functions:
Rows:
-filter(): chooses rows based on column values
-slice(): chooses rows based on location
-arrange(): changes the order of rows (ascending by default; wrap a column in desc() for descending order)
Columns:
-select(): changes whether or not a column is included
-rename(): changes the names of the columns
-mutate(): changes the values of columns and creates new columns
-relocate(): changes the order of the columns
The pipe |>
181
dplyr helper functions
starts_with(): matches names that begin with a given prefix (e.g., "abc")
contains(): matches names that contain a given string (e.g., "xyz")
ends_with(): matches names that end with a given suffix (e.g., "xyz")
matches(): selects variables that match a regular expression (e.g., matches("(.)\\1") matches any variables that contain repeated characters)
num_range("x", 1:3): matches x1, x2, and x3
Example, starts_with: select(penguins, starts_with("bill")) output: bill_length_mm bill_depth_mm 39.1 18.7 39.5 17.4
Example, contains: select(penguins, contains("length")) output: bill_length_mm flipper_length_mm 39.1 181 39.5 186
Example, ends_with: select(penguins, ends_with("_mm"), ends_with("_g")) output: bill_length_mm bill_depth_mm flipper_length_mm body_mass_g 39.1 18.7 181 3750
182
How to create data frames
Create a data frame in the tibble format (requires the tibble package, which is loaded with the tidyverse): grades <- tibble(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90)) Base R (without packages loaded) has the data.frame function, which can be used to create a regular data frame rather than a tibble: grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), exam_1 = c(95, 80, 90, 85), exam_2 = c(90, 85, 85, 90))
183
purrr package
Includes functions similar to sapply (i.e., the function that applies the same function or procedure to elements of an object), but with predictable output types: unlike sapply, they will never silently convert the result to a different type, such as character. purrr functions return objects of a specified type or return an error if that is not possible. Reminder: sapply allows you to apply any function element-wise.
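The sketch below illustrates the type-safety point without purrr, so that it runs with no extra packages. It uses base R's vapply(), which offers a comparable guarantee; this is a swapped-in base R analogue, not a purrr function.

```r
x <- list(1, 4, 9)

# sapply() picks the output type based on what the function returns.
as_numbers <- sapply(x, sqrt)                          # numeric vector
as_text    <- sapply(x, function(v) as.character(v))   # silently character

# vapply() requires the expected type and length of each result up front.
checked <- vapply(x, sqrt, FUN.VALUE = numeric(1))

# If the function returns something else, vapply() stops with an error.
result <- tryCatch(
  vapply(x, function(v) as.character(v), FUN.VALUE = numeric(1)),
  error = function(e) "error: result was not numeric"
)

print(class(as_text))
print(checked)
print(result)
```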
184
purrr functions, map vs map_dbl
Like sapply, map applies a function to each element of an object, but it always returns a list. Example: library(purrr) s_n <- map(n, compute_s_n) class(s_n) [1] "list" But if you want a numeric vector, use map_dbl, which always returns a vector of numeric values. s_n <- map_dbl(n, compute_s_n) class(s_n) [1] "numeric"
185
case_when
The case_when function is useful for vectorizing conditional statements. It's similar to ifelse but can output any number of values (as opposed to just two). Example: Split numbers into negative, positive, and 0: x <- c(-2, -1, 0, 1, 2) case_when(x < 0 ~ "Negative", x > 0 ~ "Positive", TRUE ~ "Zero") [1] "Negative" "Negative" "Zero" "Positive" "Positive" Common Use Case: Define categorical variables based on existing variables. Example 2: Compare the murder rates in 4 groups of states: New England, West Coast, South, and Other. Start by assigning these categories to the states. murders |> mutate(group = case_when( abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England", abb %in% c("WA", "OR", "CA") ~ "West Coast", region == "South" ~ "South", TRUE ~ "Other")) |> group_by(group) |> summarize(rate = sum(total)/sum(population)*10^5) A tibble: 4x2 group rate New England 1.72 Other 2.71 South 3.63 West Coast 2.90 Note: "vectorized" means that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.
186
between
Used to see if a value falls inside an interval. Example: Check to see if the elements of vector x are between a and b: between(x, a, b)
187
na.rm = TRUE
Used to ignore NA values when calculating Example: NHANES |> filter(AgeDecade == " 20-29" & Gender == "female") |> summarize( minbp = min(BPSysAve, na.rm = TRUE), maxbp = max(BPSysAve, na.rm = TRUE) )
188
data.table
Separate package that requires installation. #install the data.table package install.packages("data.table") #load data.table package library(data.table) #load other packages and datasets library(tidyverse) library(dplyr) library(dslabs) data(murders)
189
What's the first step when using data.table?
Convert the data frame into a data.table object using the as.data.table function. Example: murders_dt <- as.data.table(murders)
190
Selecting with data.table vs Selecting with dplyr
Selecting with data.table: murders_dt[, c("state", "region")] OR (use .() data.table notation to alert R that variables inside the parenthesis are column names, not objects in the R environment.) murders_dt[, .(state, region)] vs. Selecting with dplyr: select(murders, state, region)
191
Adding or transforming variables with data.table vs with dplyr
Example: Add a new column "rate" to the table: Using data.table: murders_dt[, rate := total / population * 100000] *Note: data.table avoids new assignment (i.e., the := operator modifies data in place/it directly alters the existing table without creating a new copy). This takes up less memory than the dplyr mutate. This helps with large datasets that take up most of your computer's memory. vs. Using dplyr: murders <- mutate(murders, rate = total / population * 100000)
192
How do you define new columns using data.table?
Example: murders_dt[, ":="(rate = total / population * 100000, rank = rank(population))]
193
Reference vs Copy in r
The data.table package was designed to avoid wasting memory, so if you make a "copy" of the table in any of the following ways, you're just creating a new name for an object; you're not actually creating a new object. 1. Assignment x <- data.table(a = 1) y <- x 2. Modify x x[, a := 2] y #> a #> #> 1: 2 3. Modify y y[, a := 1] x #> a #> #> 1: 1 To create an actual copy: Copy x <- data.table(a = 1) y <- copy(x) x[, a := 2] y #> a #> #> 1: 1
194
In data.table parlance, all set functions do what?
In data.table parlance, all set functions change their input by reference. This means that no copy is made, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=
195
setDT()
Use the setDT() function to convert a data frame to a data table. Syntax: setDT(x, keep.rownames=FALSE, key=NULL, check.names=FALSE) Example: x <- data.frame(a = 1) setDT(x) x: name of the data frame to convert to a data table keep.rownames: whether to keep the row names from the data frame in a new column key: character vector of one or more column names to pass to setkeyv check.names: whether to check names for valid formats before converting data frame to data table *Use setDT() when working with larger data sets that take up a considerable amount of RAM because the operation will modify each object in place, conserving memory. For data that is a very small percentage of RAM, using data.table's copy-and-modify is fine.
196
data frame vs data table
data.frame:
-Base R data structure for tables
-No package required
-Speed: slower for large data
-Memory usage: copies data often
-Syntax: standard, verbose
-Best for: small/medium datasets, beginners
data.table:
-Enhanced version of data.frame
-data.table package is required
-Speed: very fast
-Memory usage: updates by reference (no copies)
-Syntax: concise, powerful
-Best for: large datasets, fast data manipulation
197
subset in data.table vs dplyr
Example: Extract rates less than or equal to 0.7 data.table: murders[rate <= 0.7] dplyr: filter(murders, rate <= 0.7)
198
filtering and selecting in data.table vs dplyr
Example: select the state and rate for those with a rate less than or equal to 0.7 data.table: murders[rate <= 0.7, .(state, rate)] dplyr: murders |> filter(rate <= 0.7) |> select(state, rate)
199
example of how to load packages and prepare the data for data.table
library(tidyverse) library(dplyr) library(dslabs) data(murders) library(data.table) murders <- setDT(murders) murders <- mutate(murders, rate = total / population * 10^5) murders[, rate := total / population * 100000]
200
.()
In data.table, .() is shorthand for list(). We call functions inside .() and they are evaluated using the table's columns as variables.
201
summarize in dplyr vs data.table
Example: load packages and prepare the data - heights dataset library(tidyverse) library(dplyr) library(dslabs) data(heights) heights <- setDT(heights) dplyr: s <- heights |> summarize(average = mean(height), standard_deviation = sd(height)) data.table: s <- heights[, .(average = mean(height), standard_deviation = sd(height))] multiple summaries in data.table: heights[, .(median_min_max(height))] (Note: median_min_max is a user-defined function that must be defined earlier in the script.)
202
subsetting and then summarizing in data.table vs dplyr
load packages and prepare the data - heights dataset library(tidyverse) library(dplyr) library(dslabs) data(heights) heights <- setDT(heights) dplyr: s <- heights |> filter(sex == "Female") |> summarize(average = mean(height), standard_deviation = sd(height)) data.table: s <- heights[sex == "Female", .(average = mean(height), standard_deviation = sd(height))]
203
grouping and then summarizing in data.table
load packages and prepare the data - heights dataset library(tidyverse) library(dplyr) library(dslabs) data(heights) heights <- setDT(heights) #get mean height and sd for males and females heights[, .(average = mean(height), standard_deviation = sd(height)), by = sex]
204
order rows in a data frame using a data.table
#load packages and datasets and prepare the data library(tidyverse) library(dplyr) library(data.table) library(dslabs) data(murders) murders <- setDT(murders) murders[, rate := total / population * 100000] #order by population murders[order(population)] |> head() #order by population in descending order murders[order(population, decreasing = TRUE)] #order by region and then murder rate murders[order(region, rate)]
205
tbl
A tbl (pronounced "tibble") is a special kind of data frame. Tibbles are the default data frame in the tidyverse. Tibbles display better than regular data frames. Subsets of tibbles are tibbles, which is useful because tidyverse functions require data frames as inputs. Tibbles will warn you if you try to access a column that doesn't exist. Entries in tibbles can be complex - they can be lists or functions. The function group_by() returns a grouped tibble, which is a special kind of tibble.
206
tbl, view a dataset
murders |> group_by(region)
207
tbl, see the class
murders |> group_by(region) |> class()
208
tbl, compare the print output of a regular data frame to a tibble gapminder
as_tibble(gapminder) Note: gapminder is a dataset in R saved in a data frame (or tbl) containing social and economic indicators for countries over time.
209
tbl, compare subsetting a regular data frame and a tibble
class(murders[,1]) class(as_tibble(murders)[,1])
210
tbl, access a column vector not as a tibble using $ (accessor)
class(as_tibble(murders)$state)
211
create a tibble
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
212
213
dplyr, summarize. What does summarize do/why is it needed?
✅ The core idea
When you use filter() and select(), you still have a data frame (a table). But the mean() function expects a vector (just a single column of numbers). If you try to do mean(height) directly after a pipe, R gets confused because you're still working with a data frame.
💡 What summarize() does
summarize() (or summarise()) reduces a data frame down to a single summary value per group. It takes your column (height) and applies a function to it (mean()).
It converts this: height sex 61 Female 65 Female 62 Female ... ...
Into this: mean_height_cm 162.5
So summarize() tells R: "Take this column and compute a summary from it."
214
if else statement
General Form: if(boolean condition){ expressions } else { alternative expressions } Example: Check whether the state with the lowest murder rate has a rate less than 0.5 ind <- which.min(murder_rate) if(murder_rate[ind] < 0.5){ print(murders$state[ind]) } else{ print("No state has murder rate that low") } [1] "Vermont"
215
subsetting in dplyr vs data.table
load packages and prepare the data library(tidyverse) library(dplyr) library(dslabs) data(murders) library(data.table) murders <- setDT(murders) murders <- mutate(murders, rate = total / population * 10^5) murders[, rate := total / population * 100000] subsetting dplyr: filter(murders, rate <= 0.7) data.table: murders[rate <= 0.7]
216
combining filter and select in dplyr vs data.table
load packages and prepare the data library(tidyverse) library(dplyr) library(dslabs) data(murders) library(data.table) murders <- setDT(murders) murders <- mutate(murders, rate = total / population * 10^5) murders[, rate := total / population * 100000] #combining filter and select dplyr: murders %>% filter(rate <= 0.7) %>% select(state, rate) data.table: murders[rate <= 0.7, .(state, rate)]
217
ifelse function
This function takes 3 arguments: 1 logical argument and 2 possible answers. If the logical is true, the first answer is returned. if it's false, the second answer is returned. *Ifelse works on vectors. It examines each element of a logical vector and returns a corresponding answer. Example: If a is bigger than zero, return the reciprocal. If not, return NA. a <- 0 ifelse(a > 0, 1/a, NA) [1] NA Notes: Logical Vector: TRUE, FALSE, or NA (for missing values)
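Because ifelse works on vectors, the same call can handle many values at once. A minimal sketch with a made-up input vector:

```r
# Hypothetical input: a mix of zero, positive, and negative values.
a <- c(0, 1, 2, -4, 5)

# Each element is tested separately; positives get their reciprocal, the rest NA.
result <- ifelse(a > 0, 1 / a, NA)
print(result)
```

The result has the same length as a, with one answer per element.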
218
Check & replace NAs in a vector
Examples: #How many NAs? library(dslabs) data(na_example) sum(is.na(na_example)) [1] 145 #Replace NAs with 0 and then confirm they're gone. no_nas <- ifelse(is.na(na_example), 0, na_example) sum(is.na(no_nas)) [1] 0
219
any function
The any function takes a vector of logicals and returns TRUE if any of the entries are TRUE. Example: Are there any TRUE entries in the vector z? z <- c(TRUE, TRUE, FALSE) any(z) [1] TRUE Example: Are there any TRUE entries in the vector b? b <- c(FALSE, FALSE, FALSE) any(b) [1] FALSE
220
all function
Takes a vector of logicals and returns TRUE if all of the entries are TRUE. z <- c(TRUE, TRUE, FALSE) all(z) [1] FALSE z <- c(TRUE, TRUE, TRUE) all(z) [1] TRUE
221
General form of functions
my_function <- function(x) { operations that operate on x, which is defined by the user of the function; the value of the final line is returned } Functions can have more than one argument. my_function <- function(x, y, z) { operations that operate on x, y, z, which are defined by the user of the function; the value of the final line is returned } Example of defining a function to compute the average of a vector x: avg <- function(x){ s <- sum(x) n <- length(x) s/n } Notes: -Functions are objects, so to reuse one you assign it to a variable name with the arrow operator.
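A function with more than one argument might look like the sketch below. The second argument, arithmetic, and its default value are illustrative assumptions; it switches between the arithmetic and the geometric mean.

```r
# avg with a second argument that has a default value (illustrative).
avg <- function(x, arithmetic = TRUE) {
  n <- length(x)
  # The value of the last evaluated expression is returned.
  if (arithmetic) sum(x) / n else prod(x)^(1 / n)
}

m_arith <- avg(c(1, 4, 16))                      # arithmetic mean
m_geom  <- avg(c(1, 4, 16), arithmetic = FALSE)  # geometric mean

print(m_arith)
print(m_geom)
```

Because arithmetic has a default, callers who only want the usual average can ignore it.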
222
namespace
Namespaces are a convention used by programming languages to be able to use the same variable names to access the values of different objects. For example, in the following code we use the same name, x, to represent two different objects, one in the namespace inside the function and another in the namespace outside the function. my_func <- function(x){ x <- x + 1 print(x) return(NULL) } x <- 1 my_func(x) print(x) Note that when we redefine x as x+1, this happens to the object named x in the namespace of the my_func function. Therefore the x defined in the namespace outside the function is not affected.
223
lexical scoping
R uses lexical scoping, meaning variables defined inside a function are separate from those defined outside. Example: x <- 3 my_func <- function(y){ x <- 5 y print(x) } my_func(x) [1] 5
224
For-Loops
Code performs the same task over and over again while changing a variable. For-Loops let us define the range that our variable takes. They change the value as you loop and evaluate the expression every time. general form: for(i in range of values) { operations that use i, which is changing across the range of values } example: for(i in 1:5) { print(i) } [1] 1 [1] 2 [1] 3 [1] 4 [1] 5 (Note: At the end of the loop, the value of i is the last value of the range. So, if you type i after the above for-loop, you get back 5.) i [1] 5
225
Which functions are more commonly used than For-Loops in R?
apply sapply tapply mapply
226
apply sapply tapply mapply
227
split cut quantile reduce identical unique
228
Create an empty vector
Examples: create an empty vector s_n <- vector(length = m) create a for-loop that calculates the sum of integers from 1 to n for 10 different values of n and stores them in a vector called results: results <- vector("numeric", 10) n <- 10 for(i in 1:n){
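One way such a loop could be written in full is sketched below. This is a sketch under stated assumptions, not the card's original code: the helper compute_s_n and the counter name m are hypothetical.

```r
# Assumed helper: sum of the integers 1 through n.
compute_s_n <- function(n) sum(seq_len(n))

m <- 10
results <- vector("numeric", m)   # ten zeros, ready to be filled

# Fill position n with the sum 1 + 2 + ... + n.
for (n in 1:m) {
  results[n] <- compute_s_n(n)
}

print(results)
```

Creating the vector at its final length before the loop avoids growing it on every iteration.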
229
Are all spreadsheets in a text format?
No. Not all spreadsheet files are in a text format. These cannot be viewed in a text editor. Examples include Google Sheets, which are rendered on a browser, and Microsoft Excel, which has its own proprietary format.
230
When creating spreadsheets with text files, a new row is defined with ____ and columns are separated with some predefined special character like ____.
When creating spreadsheets with text files, a new row is defined with return and columns are separated with some predefined special character. The most common characters are comma (,), semicolon (;), space ( ), and tab (a preset number of spaces or \t).
231
Up to this point, we have been using data sets already stored as R objects. However, it is common to import data into R from __________
Up to this point, we have been using data sets already stored as R objects. However, it is common to import data into R from either a file, a database, or other sources.
232
directories
You can think of your computer’s filesystem as a series of nested folders, each containing other folders and files. We refer to folders as directories.
233
root directory
We refer to the folder that contains all other folders as the root directory.
234
working directory
We refer to the directory in which we are currently located as the working directory.
235
path of a file
a list of directory names that can be thought of as instructions on what folders to click on and in what order to find a file
236
relative path
The path of a file is a list of directory names that can be thought of as instructions on what folders to click on and in what order to find a file IF the instructions are for finding the file starting in the working directory, we refer to it as a relative path.
237
full path
The path of a file is a list of directory names that can be thought of as instructions on what folders to click on and in what order to find a file IF the instructions are for finding the file starting from the root directory, we refer to it as a full path
238
system.file()
Example: system.file(package = "dslabs") #> [1] "/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/dslabs" Note that the output will be different across different computers. The system.file function finds the full path to the files that were added to your system when you installed the dslabs package. The strings separated by slashes are the directory names. The first slash represents the root directory and we know this is a full path because it starts with a slash.
239
list.files()
We can use the function list.files to show the names of files and directories in any directory. For example, here are the files in the dslabs package directory: dir <- system.file(package = "dslabs") list.files(dir) #> [1] "data" "DESCRIPTION" "extdata" "help" #> [5] "html" "INDEX" "Meta" "NAMESPACE" #> [9] "R" "script" Note that these do not start with slash which implies they are relative paths. These relative paths give us the location of the files or directories if the path stored in dir is our working directory.
240
relative paths vs full paths
We highly recommend only using relative paths in your code. The reason is that full paths are unique to your computer and you want your code to be portable.
241
getwd function
If you want to know the full path of your working directory, you can use the getwd function. wd <- getwd()
242
setwd function
If you need to change your working directory, you can use the function setwd or you can change it through RStudio by clicking on “Session”.
243
file.path function
The file.path function combines characters to form a complete path, ensuring compatibility with the respective operating system. This function is useful because often you want to define paths using a variable. Here is an example that constructs the full path for a spreadsheet containing the murders data. Here the variable dir contains the full path for the dslabs package and extdata/murders.csv is the relative path of the spreadsheet if dir is considered the working directory. dir <- system.file(package = "dslabs") file_path <- file.path(dir, "extdata/murders.csv")
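A small sketch of building and taking apart a path without touching the disk. The folder names data and raw are hypothetical; nothing is created or read.

```r
# Hypothetical folder names; file.path only builds the string.
dir_name  <- file.path("data", "raw")
file_path <- file.path(dir_name, "murders.csv")

print(file_path)
print(basename(file_path))  # the file name at the end of the path
print(dirname(file_path))   # everything before the file name
```

Since the result does not start with a slash, it is a relative path and stays portable across computers.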
244
file.copy function
You can copy the file with full path file_path to your working directory using the function file.copy: file.copy(file_path, "murders.csv") #> [1] TRUE If the file is copied successfully, this function will return TRUE.
245
delimiter
When text files are used to store a spreadsheet, line breaks are used to separate rows and a predefined character, referred to as the delimiter, is used to separate columns within a row. The most common delimiters are comma (,), semicolon (;), space (), and tab (a preset number of spaces or \t).
246
How do you determine the delimiter?
In some cases, the delimiter can be inferred from file suffix. For example, files ending in csv or tsv are expected to be comma and tab delimited, respectively. However, it is harder to infer the delimiter for files ending in txt. As a result we recommend looking at the file rather than inferring from the suffix. You can look at any number of lines from within R using the readLines function: readLines("murders.csv", n = 3) #> [1] "state,abb,region,population,total" #> [2] "Alabama,AL,South,4779736,135" #> [3] "Alaska,AK,West,710231,19" This immediately reveals that the file is indeed comma delimited. It also reveals that the file has a header: the first row contains column names rather than data. This is also important to know. Most parsers assume the file starts with a header, but not all files have one.
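The inspection step can be tried on a small file created just for the purpose. In this sketch the file contents are made up, and counting candidate delimiters in the header line is only a rough heuristic, not a replacement for looking at the lines yourself.

```r
# Write a small made-up file with a .txt suffix that hides its delimiter.
tmp <- tempfile(fileext = ".txt")
writeLines(c("state;abb;total", "Alabama;AL;135", "Alaska;AK;19"), tmp)

# Look at the first lines, as recommended above.
first_lines <- readLines(tmp, n = 2)
print(first_lines)

# Count how often each candidate delimiter appears in the header line.
candidates <- c(",", ";", "\t")
header_chars <- strsplit(first_lines[1], "")[[1]]
counts <- sapply(candidates, function(d) sum(header_chars == d))
print(counts)

file.remove(tmp)
```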
247
readLines
The most common delimiters are comma (,), semicolon (;), space (), and tab (a preset number of spaces or \t). Slightly different approaches are used to read these files into R, so we need to know what delimiter was used. In some cases, the delimiter can be inferred from file suffix. For example, files ending in csv or tsv are expected to be comma and tab delimited, respectively. However, it is harder to infer the delimiter for files ending in txt. As a result we recommend looking at the file rather than inferring from the suffix. *You can look at any number of lines from within R using the readLines function: readLines("murders.csv", n = 3) #> [1] "state,abb,region,population,total" #> [2] "Alabama,AL,South,4779736,135" #> [3] "Alaska,AK,West,710231,19" This immediately reveals that the file is indeed comma delimited. It also reveals that the file has a header: the first row contains column names rather than data. This is also important to know. Most parsers assume the file starts with a header, but not all files have one.
248
binary files
Unlike text files, which are designed for human readability and have standardized conventions, binary files can adopt numerous formats specific to their data type. Opening image files such as jpg or png in a text editor or using readLines in R will not show comprehensible content because these are binary files.
249
What is a frequent issue when importing data?
Incorrectly identifying the file’s encoding. At its core, a computer translates everything into sequences of 0s and 1s. ASCII is an encoding system that assigns specific numbers to characters. Using 7 bits, ASCII can represent 2^7 = 128 unique symbols, sufficient for all English keyboard characters. However, many global languages contain characters outside ASCII’s range. For instance, the é in “México” isn’t in ASCII’s catalog. To address this, broader encodings, such as Unicode, emerged. Unicode offers variations using 8, 16, or 32 bits, known as UTF-8, UTF-16, and UTF-32. RStudio typically uses UTF-8 as its default. Notably, ASCII is a subset of UTF-8, meaning that if a file is ASCII-encoded, presuming it’s UTF-8 encoded won’t cause issues. However, there are other encodings, such as ISO-8859-1 (also known as Latin-1) developed for the western European languages, Big5 for Traditional Chinese, and ISO-8859-6 for Arabic.
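To make the difference between encodings concrete, this sketch stores the same six-character word in two encodings and counts the bytes each one needs. It uses only base R functions.

```r
x <- "M\u00e9xico"   # \u00e9 is the accented e

print(nchar(x))      # number of characters

# In UTF-8 the accented e takes two bytes, the other letters one each.
utf8_bytes <- charToRaw(enc2utf8(x))
print(length(utf8_bytes))

# In Latin-1 every character in this word takes exactly one byte.
latin1 <- iconv(x, from = "UTF-8", to = "latin1")
latin1_bytes <- charToRaw(latin1)
print(length(latin1_bytes))
```

The same bytes read under the wrong encoding give garbled characters, which is why identifying the encoding matters when importing.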
250
What is a parser?
A parser (also called a file parser or importing function) is a function that reads a file and converts its contents into an R object, such as a data frame.
251
scan function
With scan, you can read-in each cell of a file. Example: x <- scan("murders.csv", sep = ",", what = "c") x[1:10] #> [1] "state" "abb" "region" "population" "total" #> [6] "Alabama" "AL" "South" "4779736" "135" Why this is useful: When reading in spreadsheets many things can go wrong. The file might have multiline headers or be missing cells. With experience you will learn how to deal with different challenges.
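A self-contained variation that does not depend on murders.csv being in the working directory. The three-line file is made up for the sketch, and reshaping into a matrix assumes you already know the file has three columns.

```r
# Write a small made-up csv file to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("state,abb,total", "Alabama,AL,135", "Alaska,AK,19"), tmp)

# Read every cell as a character string, in row order.
cells <- scan(tmp, sep = ",", what = "character", quiet = TRUE)
print(cells)

# Reshape the flat vector into rows, knowing there are 3 columns.
m <- matrix(cells, ncol = 3, byrow = TRUE)
print(m)

file.remove(tmp)
```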
252
readr
The readr package includes parsers, for reading text file spreadsheets into R. readr is part of the tidyverse, but you can load it directly using: library(readr)
253
readr functions available to read-in text file spreadsheets
Function: Format (typical suffix)
-read_table: white space separated values (txt)
-read_csv: comma separated values (csv)
-read_csv2: semicolon separated values (csv)
-read_tsv: tab delimited separated values (tsv)
-read_delim: general text file format, must define delimiter (txt)
It also includes read_lines with similar functionality to readLines. It also includes guess_encoding, which tries to guess at encoding: guess_encoding("murders.csv") #> # A tibble: 1 × 2 #> encoding confidence #> 1 ASCII 1
254
readxl package
The readxl package provides functions to read-in Microsoft Excel formats. library(readxl)
255
readxl package functions
The readxl package provides functions to read-in Microsoft Excel formats.
Function: Format (typical suffix)
-read_excel: auto detect the format (xls, xlsx)
-read_xls: original format (xls)
-read_xlsx: new format (xlsx)
The excel_sheets function gives us the names of all the sheets in an Excel file.
256
data.table package, fread function
A powerful and fast utility designed for reading large datasets. fread automatically detects the format of the input, whether it’s delimited text or even files compressed in formats like gzip or zip. It offers a significant speed advantage over the other parsers described here, especially for large files. library(data.table) dat <- fread("murders.csv") Note: fread returns a data.table object.
257
tempdir function & tempfile function
tempdir: creates a directory with a random name that is likely to be unique. tempfile: creates a character string, not a file, that is likely to be a unique filename. So you can run a command like this which erases the temporary file once it imports the data: tmp_filename <- tempfile() download.file(url, tmp_filename) dat <- read_csv(tmp_filename) file.remove(tmp_filename)
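The same write, read, and remove pattern can be tried without a download. In this sketch the file contents are made up, and base R's read.csv stands in for read_csv so that no extra package is needed.

```r
# tempfile() only generates a name; no file exists yet.
tmp_filename <- tempfile(fileext = ".csv")
print(file.exists(tmp_filename))

# Stand-in for download.file(): write a small made-up file to that path.
writeLines(c("state,total", "Alabama,135", "Alaska,19"), tmp_filename)
existed <- file.exists(tmp_filename)

# Import the data, then erase the temporary file.
dat <- read.csv(tmp_filename)
file.remove(tmp_filename)

# The data survive in memory after the file is deleted.
print(dat)
print(file.exists(tmp_filename))
```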
258
Naming Conventions
You want the names you pick for objects, files, and directories to be memorable, easy to spell, and descriptive. This is actually a hard balance to achieve and it does require time and thought. One important rule to follow is do not use spaces; use underscores (_) or dashes (-) instead. Also, avoid symbols; stick to letters and numbers.
259
Dates
Write Dates as YYYY-MM-DD
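A short sketch of two practical benefits of this format in R: as.Date parses it without a format argument, and dates stored as text sort chronologically. The example dates are made up.

```r
# YYYY-MM-DD is parsed without specifying a format.
d <- as.Date("2024-03-15")
print(format(d, "%Y-%m-%d"))

# ISO dates stored as text sort chronologically.
iso <- c("2024-03-15", "2023-12-01", "2024-01-20")
sorted_iso <- sort(iso)
print(sorted_iso)

# Day-first text sorts by day, not by time.
dmy <- c("15-03-2024", "01-12-2023", "20-01-2024")
sorted_dmy <- sort(dmy)
print(sorted_dmy)
```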
260
Best Practices for Spreadsheets
-Choose Good Names
-Write Dates as YYYY-MM-DD: We recommend using this global ISO 8601 standard.
-No Empty Cells: Fill in all cells and use a common code for missing data.
-Put Just One Thing in a Cell: It is better to add columns to store the extra information rather than having more than one piece of information in one cell.
-Make It a Rectangle: The spreadsheet should be a rectangle.
-Create a Data Dictionary: If you need to explain things, such as what the columns are or what the labels used for categorical variables are, do this in a separate file.
-No Calculations in the Raw Data Files: Excel permits you to perform calculations. Do not make this part of your spreadsheet. Code for calculations should be in a script.
-Do Not Use Font Color or Highlighting as Data: Most import functions are not able to import this information. Encode this information as a variable instead.
-Make Backups: Make regular backups of your data.
-Use Data Validation to Avoid Errors: Leverage the tools in your spreadsheet software so that the process is as error-free and repetitive-stress-injury-free as possible.
-Save the Data as Text Files: Save files for sharing in comma or tab delimited format.
261
How do you import data from a spreadsheet?
When importing data from a spreadsheet, the first step is to locate the file containing the data. You could use an approach similar to what you do to open files in Microsoft Excel (although we do not recommend it) by clicking on the RStudio “File” menu, then “Import Dataset”, and then navigating through folders until you find the file. Our preference is to write code. We need to let the R functions doing the importing know where to look for the file containing the data. The simplest way to do this is to have a copy of the file in the folder in which the importing functions look by default.
Example: #Copy the spreadsheet containing the US murders data (included as part of the dslabs package) filename <- "murders.csv" dir <- system.file("extdata", package = "dslabs") fullpath <- file.path(dir, filename) file.copy(fullpath, "murders.csv")
Once the file is copied, import the data with a line of code. Use the read_csv function from the readr package (included in the tidyverse): library(tidyverse) dat <- read_csv(filename)
262
How do you obtain a full path without writing it out explicitly?
Example: filename <- "murders.csv" dir <- system.file("extdata", package = "dslabs") fullpath <- file.path(dir, filename)
263
How do you copy a file to the working directory?
Example: filename <- "murders.csv" dir <- system.file("extdata", package = "dslabs") fullpath <- file.path(dir, filename) file.copy(fullpath, "murders.csv")
264
How do you use the read_lines function to look at the first few lines of a file?
Example: read_lines("murders.csv", n_max = 3)
265
How do you confirm the data has been read in?
View(dat) Note: Read-in means "imported"
266
download.file
Use the download.file function in order to have a local copy of the file. download.file(url, "murders.csv") Note: Be careful as it will overwrite files without warning
267
What command do you run to erase the temporary file once it imports the data?
tmp_filename <- tempfile()
download.file(url, tmp_filename)
dat <- read_csv(tmp_filename)
file.remove(tmp_filename)
Note: A temporary file (often called a temp file) is a scratch file created by your computer during a program's execution. It's meant to store data only briefly while R is working. R has a built-in function tempfile() that creates a path to a file in your system's temporary directory, a special folder used just for temporary data. This folder gets cleaned automatically by your operating system or when R closes.
Step | What happens | Where is the data?
1 | A temp file path is created | No data yet
2 | File downloaded from the internet | Stored temporarily on your disk
3 | File read into R using read_csv() | Now lives in R memory
4 | The downloaded temp file is deleted | Doesn't matter: the data still exists in R
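The temp-file lifecycle above can be tried without an internet connection. The following is a minimal sketch, assuming a small CSV written locally stands in for the downloaded file (the data values are made up for illustration) and using base R's read.csv in place of read_csv:

```r
# Step 1: create a path in the temporary directory (no file exists yet)
tmp_filename <- tempfile(fileext = ".csv")
exists_before <- file.exists(tmp_filename)

# Step 2: stand-in for download.file(): write a small CSV to the temp path
# (the values below are made up for illustration)
writeLines(c("state,total", "A,10", "B,20"), tmp_filename)

# Step 3: read the file into memory
dat <- read.csv(tmp_filename)

# Step 4: delete the temp file; dat is unaffected because it lives in memory
removed <- file.remove(tmp_filename)
exists_after <- file.exists(tmp_filename)

print(dat)
```

After step 4, dat is still available even though the file is gone.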
268
How do you use scan() to read in each cell of a file?
path <- system.file("extdata", package = "dslabs") filename <- "murders.csv" x <- scan(file.path(path, filename), sep = ",", what = "c") x[1:10]
269
Functions
A body of reusable code used to perform specific tasks in R
270
Argument
Information that a function in R needs in order to run
271
Variable
A representation of a value in R that can be stored for use later during programming. Variables can also be called objects
272
Variable Naming Rules
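Must start with a letter (e.g., you should not use 5penguins). A name can also start with a period, as long as the period is not followed by a number. It can contain letters, numbers, periods, and underscores.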
273
assignment operator
<-
274
Vector
A group of data elements of the same type stored in a sequence in R
Example:
vec_1 <- c(12, 48.5, 6, 99)
vec_1
[1] 12.0 48.5  6.0 99.0
275
Pipe
A tool in R for expressing a sequence of multiple operations, represented with %>% or |> depending on the version used Example: ToothGrowth %>% filter(dose == 0.5) %>% arrange(len)
276
Types of Vectors
Vector: A group of data elements stored in a sequence in R
1) Atomic: homogeneous (every element has the same type)
-integer: positive and negative whole values (3L)
-double: decimal values (101.175)
-logical: TRUE/FALSE
-character: string/character value ("Coding")
-complex
-raw
Numeric vectors: integers or doubles
The atomic vectors used most often are logical, numeric, and character. *complex and raw aren't commonly used in data analysis.
2) Recursive: heterogeneous (elements can have different types)
-list
277
Store numeric data in a vector vs create a vector of integers vs create a vector containing characters or logicals vs create a vector of a sequence of numbers
Store numeric data in a vector:
c(2.5, 48.5, 101.5)
Create a vector of integers:
c(1L, 5L, 15L)
Create a vector containing characters or logicals:
c("Sara", "Lisa", "Anna")
c(TRUE, FALSE, TRUE)
Create a vector of a sequence of numbers:
z <- c(4:10)
z
278
Functions: typeof() is.logical() is.double() is.integer() is.character()
typeof() function is used to determine a vector's type. Examples: typeof(c("a", "b")) [1] "character" typeof(c(1L, 3L)) [1] "integer" check if a vector is a specific type by using one of the following functions: is.logical() is.double() is.integer() is.character() Examples: x <-c(2L, 5L, 11L) is.integer(x) [1] TRUE y <- c(TRUE, TRUE, FALSE) is.character(y) [1] FALSE
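Because atomic vectors hold a single type, combining different types in c() makes R coerce every element to one common type. A short sketch of that behavior, using typeof() and the is.*() checks from this card (the variable names are illustrative):

```r
# Mixing integer, double, and logical values yields a double vector
mixed_numbers <- c(1L, 2.5, TRUE)
type_numbers <- typeof(mixed_numbers)

# Adding a string coerces everything to character
mixed_text <- c(1, "a", TRUE)
type_text <- typeof(mixed_text)

# Numbers typed without the L suffix are doubles, not integers
plain_is_integer <- is.integer(c(1, 2))
suffixed_is_integer <- is.integer(c(1L, 2L))

print(c(type_numbers, type_text))
```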
279
names()
You can name elements in vectors of any type with the names() function.
Example:
x <- c(1, 3, 5)
names(x) <- c("a", "b", "c")
x
output:
a b c
1 3 5
280
How do you extract a subset of a vector?
Reference the element's position in the vector or its name with the extract operator [].
Example:
x <- c(1, 3, 5)
names(x) <- c("a", "b", "c")
x
x["b"]
output:
a b c
1 3 5
b
3
x[2] returns the same element by position.
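The extract operator accepts positions, names, logical conditions, and negative positions. A brief sketch using the same named vector (the result variable names are illustrative):

```r
x <- c(a = 1, b = 3, c = 5)

by_position <- x[2]            # second element
by_name <- x[c("a", "c")]      # elements named "a" and "c"
by_condition <- x[x > 2]       # elements greater than 2
without_first <- x[-1]         # everything except the first element

print(by_condition)
```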
281
now() today()
now(): run it to get the current date-time
today(): run it to get the current date (year, month, and day)
282
lubridate package
lubridate is part of the tidyverse.
install.packages("tidyverse")
library(tidyverse)
library(lubridate)
lubridate contains tools to convert strings to dates or date-times so you can perform operations on them. (Date/time data often comes as character strings, so it must be converted before operations can be performed.) To use these functions, arrange "y", "m", and "d" (i.e., year, month, and day) in the order in which they appear in your data.
Example:
ymd("2023-01-20")
[1] "2023-01-20"
mdy("January 20th, 2023")
[1] "2023-01-20"
dmy("20-Jan-2021")
[1] "2021-01-20"
283
y m d h m s
year month day hour minute second
284
as_date()
Converts a date-time to a date.
Example:
as_date(now())
[1] "2021-01-20"
(The output is the date on which the code is run.)
285
Data Frame
A data frame is a collection of columns containing data, similar to a spreadsheet or SQL table. Each column has a name that represents a variable and includes one observation per row. Data frames summarize data and organize it into a format that is easy to read and use.
286
data.frame()
If you need to manually create a data frame in R, you can use the data.frame() function.
data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))
  x   y
1 1 1.5
2 2 5.5
3 3 7.5
287
[ ] extractor operator
Syntax: data_frame[row_to_extract, column_to_extract]
Example:
z <- data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))
z
  x   y
1 1 1.5
2 2 5.5
3 3 7.5
z[2, 1]
[1] 2
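Leaving the row or column position empty extracts a whole column or row. A small sketch with the same data frame (the result variable names are illustrative):

```r
z <- data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))

single_cell <- z[2, 1]     # row 2, column 1
second_row <- z[2, ]       # all columns of row 2 (still a data frame)
column_y <- z[, "y"]       # all rows of column y (a numeric vector)
column_y_alt <- z$y        # the $ operator extracts the same column

print(second_row)
```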
288
file.create()
Use the file.create() function to create a blank file. Place the name and the type of the file in the parentheses of the function. Your file types will usually be something like .txt, .docx, or .csv. Examples: file.create("new_text_file.txt") file.create("new_word_file.docx") file.create("new_csv_file.csv")
289
file.copy()
Copy a file with the file.copy() function. In the parentheses, add the name of the file to be copied. Then, enter a comma, and add the name of the destination folder that you want to copy the file to. Syntax: file.copy("new_text_file.txt", "destination_folder")
290
unlink()
You can delete files with the unlink() function. Enter the file's name in the parentheses of the function. Syntax: unlink("some_.file.csv")
291
matrix()
To create a matrix in R, you can use the matrix() function. The matrix() function has two main arguments that you enter in the parentheses. First, add a vector. The vector contains the values you want to place in the matrix. Next, add at least one matrix dimension. You can choose to specify the number of rows or the number of columns by using the code nrow = or ncol =.
Example 1: To create a 2x3 (two rows by three columns) matrix containing the values 3-8, enter a vector containing that series of numbers: c(3:8). Then, enter a comma. Finally, enter nrow = 2 to specify the number of rows. Run the code:
matrix(c(3:8), nrow = 2)
R displays a matrix with two rows and three columns (typically referred to as a "2x3") that contains the numeric values 3, 4, 5, 6, 7, 8. R places the first value (3) of the vector in the uppermost row and leftmost column of the matrix, and by default fills the matrix column by column (top to bottom, then on to the next column). Use byrow = TRUE to fill it row by row instead.
Example 2: You can also choose to specify the number of columns (ncol = ) instead of the number of rows (nrow = ). Run the code:
matrix(c(3:8), ncol = 2)
R infers the number of rows automatically.
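A quick way to check how matrix() arranges its values is to compare the default fill with byrow = TRUE. A minimal sketch (the variable names are illustrative):

```r
# Default: values fill the matrix one column at a time
by_column <- matrix(3:8, nrow = 2)

# byrow = TRUE: values fill the matrix one row at a time
by_row <- matrix(3:8, nrow = 2, byrow = TRUE)

first_row_default <- by_column[1, ]
first_row_byrow <- by_row[1, ]

print(by_column)
print(by_row)
```

With the default fill the first row holds 3, 5, 7; with byrow = TRUE it holds 3, 4, 5.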
292
Assignment Operators
Used to assign values to variables and vectors
293
Arithmetic Operators
Used to complete math calculations + (addition) - (subtraction) * (multiplication) / (division)
294
and or not
and: &
Example: You want to find observations (rows) in which conditions are both extremely sunny and windy. You define this as observations that have a Solar measurement of over 150 and a Wind measurement of over 10. This code specifies that R should return a value of TRUE for rows in which the airquality dataset's Solar.R value is greater than 150 and its Wind value is greater than 10, and a value of FALSE otherwise.
airquality[, "Solar.R"] > 150 & airquality[, "Wind"] > 10
or: |
Example: You want to specify rows where it's extremely sunny or it's extremely windy, which you define as having a Solar measurement of over 150 or a Wind measurement of over 10. This code specifies that R should return a value of TRUE when either the airquality dataset's Solar.R value is greater than 150 or its Wind value is greater than 10. Otherwise, R will return a value of FALSE.
airquality[, "Solar.R"] > 150 | airquality[, "Wind"] > 10
not: ! (on its own, ! negates a logical value; combined with = as !=, it means "not equal to")
Example: Focus on the weather measurements for days that aren't the first day of the month. R should return a value of TRUE when the airquality dataset's Day value is not 1 and a value of FALSE when the Day value is 1.
airquality[, "Day"] != 1
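The three operators are easier to compare on a handful of values. The sketch below uses two short, made-up vectors (solar and wind are hypothetical stand-ins for the airquality columns):

```r
# Made-up measurements for four days
solar <- c(200, 100, 180, 90)
wind <- c(12, 15, 8, 5)

sunny_and_windy <- solar > 150 & wind > 10   # both conditions must hold
sunny_or_windy <- solar > 150 | wind > 10    # at least one condition must hold
not_sunny <- !(solar > 150)                  # flips TRUE and FALSE

print(sunny_and_windy)
print(sunny_or_windy)
print(not_sunny)
```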
295
Conditional Statement
A declaration that if a certain condition holds, a certain event must take place.
296
if statement
The if statement sets a condition, and if the condition evaluates to TRUE, the R code associated with the if statement is executed.
Syntax: In R, you place the code for the condition inside the parentheses of the if statement. The code to be executed if the condition is TRUE (expr) follows in curly braces. Note that in this case, the second curly brace is placed on its own line of code and identifies the end of the code that you want to execute.
if (condition) {
expr
}
Example: If x is greater than 0, then R will print out the string "x is a positive number".
x <- 4
if (x > 0) {
print("x is a positive number")
}
Output: As x is equal to 4, the condition is TRUE (because 4 is greater than 0). Therefore, when you run the code, R prints out the string "x is a positive number". But if you change x to a negative number, such as -4, then the condition will be FALSE because -4 is not greater than 0. If you run the code, R will not execute the print statement, so nothing is printed.
297
else statement
The else statement is used in combination with an if statement. Syntax: if (condition) { expr1 } else { expr2 } Example 1: x <- 7 if (x > 0) { print ("x is a positive number") } else { print ("x is either a negative number or zero") } Example 2: the if statement checks the Temp value for the first row in airquality. if (airquality$Temp[1] < 80) { print("It's not a hot day!") } else { print("It's a hot day.") }
298
else if statement
Syntax:
if (condition1) {
expr1
} else if (condition2) {
expr2
} else {
expr3
}
If the if condition (condition1) is met, then R executes the code in the first expression (expr1). If the if condition is not met and the else if condition (condition2) is met, then R executes the code in the second expression (expr2). If neither of the two conditions is met, R executes the code in the third expression (expr3).
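As a concrete instance of the syntax above, the sketch below wraps an if / else if / else chain in a small helper function (classify_number is a hypothetical name chosen for illustration):

```r
# Returns a label describing the sign of a single number
classify_number <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

print(classify_number(4))
print(classify_number(-4))
print(classify_number(0))
```

Only the first branch whose condition is TRUE runs; the else branch runs when no condition is met.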
299
Logical Operators
AND (&), OR (|), NOT(!) Logical operators can be used to check a condition and return a logical data type. In R, logical data is presented as T or TRUE when a condition is met, and F or FALSE when it is not.
300
str(diamonds) glimpse(diamonds)
The str() and glimpse() functions will both return summaries of each column in your data arranged horizontally.
301
colnames()
Returns the column names of your dataset as a character vector
302
rename()
Use this function to rename the columns or variables in your data. The new name goes on the left of the equals sign and the existing name on the right.
Example 1: Rename the "carat" column to "carat_new" in the diamonds dataset.
rename(diamonds, carat_new = carat)
Example 2: Rename more than one variable.
rename(diamonds, carat_new = carat, cut_new = cut)
303
= vs <- vs ==
Both "=" and "<-" are valid assignment operators in R, but they are conventionally used in different contexts. The "=" operator is typically used to pass named arguments within function calls, while the "<-" operator is preferred for general assignments. "==" means "exactly equal to": it is a logical comparator that tests for equality.
Example of ==:
x <- 10
y <- 10
z <- 5
# Check if x is equal to y
x == y
[1] TRUE
Example of = (it also works for a top-level assignment, although "<-" is preferred there):
x = 10
print(x)
[1] 10
The "<-" operator is the traditional assignment operator in R. It is specifically designed for assignment operations and is considered a best practice by many R programmers for regular variable assignments.
Example of <-:
y <- 20
print(y)
[1] 20
304
ggplot2, '+' symbol
To build a visual with 'ggplot2' you layer plot elements together with a '+' symbol. examples: Take the `diamonds` data, plots the carat column on the X-axis, the price column on the Y-axis, and represents the data as a scatter plot using the `geom_point()` command ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point()
305
Packages(R)
Units of reproducible R code that include: reusable R functions, documentation about the functions, sample datasets, and tests for checking your code (to make sure that it does what you want it to do). Base R: the set of packages that come with R and are available when you start a session.
306
CRAN
Comprehensive R Archive Network An online archive with R packages, source code, manuals, and documentation. CRAN ensures that packages are authentic and valid, so if you source a package through CRAN, you can feel confident in its legitimacy.
307
Conflicts when loading packages
Conflicts happen when packages have functions with the same names as functions in other packages. The function from the most recently loaded package masks the others and becomes the default for the current R session.
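Masking can be demonstrated without installing anything. The sketch below is an illustration under a simplifying assumption: instead of loading two packages, it defines a function named mean in the workspace, which masks base R's mean in the same way a newly loaded package would. The package::function form then reaches the original.

```r
# A new definition with the same name masks base::mean for this session
mean <- function(x) "masked!"
masked_result <- mean(c(1, 2, 3))

# package::function bypasses the masking and calls the base version
explicit_result <- base::mean(c(1, 2, 3))

# Remove the masking definition so mean() works normally again
rm(mean)
restored_result <- mean(c(1, 2, 3))

print(masked_result)
print(explicit_result)
```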
308
8 core tidyverse packages
ggplot2: used for data visuals. tibble: works with data frames tidyr: used for data cleaning (i.e., makes tidy data). It works with wide and long data. readr: used for importing data. To read a dataset with readr, combine the function with a column specification. The column specification describes how each column should be converted to the most appropriate data type. But this isn't usually necessary as readr will figure it out for you automatically. purrr: works with functions and vectors dplyr: offers functions that help you complete common data manipulation tasks (e.g., the filter function finds cases where certain conditions are true). stringr: includes functions that make it easier to work with strings. forcats: provides tools that solve common problems with factors.
309
tidyverse_update() update.packages() install.packages("package name")
tidyverse_update(): Use this function to check for updates. (The packages in tidyverse change a lot.) You can then update your packages. A few options to do so: update.packages(): Use this function to update all of your packages. This will take some time. install.packages("package name"): use this function to quickly update one package. Best Practice: update packages regularly to make sure you have the latest version in your code.
310
Vignette
Documentation that acts as a guide to an R package. A vignette shares details about the problem that the package is designed to solve and how the included functions can help you solve it. Use browseVignettes function to read through the vignettes of a loaded package. Example: Use browseVignettes on ggplot2 browseVignettes("ggplot2") Output: Vignettes in package ggplot2 -Aesthetic specifications - HTML source R code Extending ggplot2 - HTML source R code Using ggplot2 in packages - HTML source R code
311
installed.packages()
installed.packages() function in R returns a matrix containing information about all packages installed in the specified libraries. This matrix includes details such as the package name, the library path where it's located, its version number, and other metadata like dependencies, imports, and suggestions.
312
Nested
In programming, "nested" describes code that performs a particular function and is contained within code that performs a broader function.
313
How do we read nested functions?
From the inside out.
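A short sketch of reading nested calls from the inside out, next to the same steps written with the native pipe. This assumes R 4.1 or later, where |> is available; the values are arbitrary.

```r
x <- c(4, 9, 16)

# Nested: read from the inside out (sqrt first, then mean, then round)
nested <- round(mean(sqrt(x)), 1)

# Piped: the same steps read left to right
piped <- x |> sqrt() |> mean() |> round(1)

print(nested)
print(piped)
```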
314
pipe hot key
ctrl+shift+m
315
Tidy data standards
-variables are organized into columns -observations are organized into rows -each value must have its own cell
316
str()
Get the structure of the data frame. Provides high-level info like the column names and the type of data contained in each column.
Example:
Data frame (people):
names age
Bob 48
Shirley 61
Liz 73
Dave 39
str(people)
Output:
'data.frame': 4 obs. of 2 variables:
$ names: chr "Bob" "Shirley" "Liz" "Dave"
$ age : num 48 61 73 39
317
colnames()
Get the column names.
Example:
Data frame (people):
names age
Bob 48
Shirley 61
Liz 73
Dave 39
colnames(people)
Output:
"names" "age"
318
glimpse()
Example:
Data frame (people):
names age
Bob 48
Shirley 61
Liz 73
Dave 39
glimpse(people)
Output:
Rows: 4
Columns: 2
$ names <chr> "Bob", "Shirley", "Liz", "Dave"
$ age <dbl> 48, 61, 73, 39
319
mutate()
Use this function to add new columns to a data frame or change existing ones. Part of the dplyr package, which is in the tidyverse. You need to load the tidyverse library to use it. Syntax: mutate(data_name_to_change, name_of_the_new_col_to_create = expression)
320
readr functions
Use to import data from a csv, tsv, delimited, fixed-width, table, or log file.
read_csv(): comma-separated values (.csv) files
ex// read_csv(readr_example("mtcars.csv"))
read_tsv(): tab-separated values files
read_delim(): general delimited files
read_fwf(): fixed-width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files
321
readxl package
Part of the tidyverse but not a core tidyverse package, so you must load readxl in R by using the library() function.
Use the read_excel() function to read a spreadsheet file.
ex// read_excel(readxl_example("type-me.xlsx"))
Use the excel_sheets() function to list the names of individual sheets.
ex// excel_sheets(readxl_example("type-me.xlsx"))
The sheets in this example file are logical_coercion, numeric_coercion, date_coercion, and text_coercion.
You can also specify a sheet by name or number.
ex// read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion")
Output: R will return a tibble of the sheet.
322
rename() rename_with() clean_names()
rename(): changes individual column names.
rename_with(): applies a function to the column names.
Example: rename_with(penguins, toupper) changes all of the column names to uppercase, and rename_with(penguins, tolower) changes all of the column names to lowercase, which is more common.
clean_names(): from the janitor package; ensures that there are only characters, numbers, and underscores in the names.
ex// clean_names(penguins)
*The dataset is called "penguins".
323
skim_without_charts() glimpse()
Both functions return a summary of the data frame, including the number of columns and rows.
324
File Naming Conventions, Dos & Don'ts
Do:
-Keep your filenames to a reasonable length
-Use underscores and hyphens for readability
-Start or end your filename with a letter or number
-Use a standard date format when applicable, e.g., YYYY-MM-DD
-Use filenames for related files that work well with default ordering (e.g., chronological order, logical order with numbers first, etc.)
Don't:
-Use unnecessary additional characters in filenames
-Use spaces or illegal characters, e.g., &, %, #, <, >
-Start or end your filename with a symbol
-Use incomplete or inconsistent date formats, e.g., M-D-YY
-Use filenames for related files that do not work well with default ordering, e.g., a random system of numbers or date formats, using letters first
325
4 Types of Operators
-Assignment: assign values to variables.
x <- 2
-Arithmetic: perform basic math operations, such as addition, subtraction, multiplication, and division.
+ (addition), - (subtraction), * (multiplication), / (division), %% (modulus; returns the remainder after division), %/% (integer division; returns an integer value after division), ^ (exponent)
-Relational/comparators: allow you to compare values. The output for relational operators is TRUE or FALSE, which is a logical data type or boolean data type.
< > <= >= == !=
-Logical: allow you to combine logical statements and return a logical value like TRUE or FALSE.
& Element-wise logical AND: all values must be TRUE for the entire operation to evaluate to TRUE
| Element-wise logical OR: at least one value must be TRUE for the entire operation to evaluate to TRUE
! Logical NOT (e.g., !TRUE is FALSE and !FALSE is TRUE)
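The less familiar arithmetic operators can be checked with small numbers. A brief sketch (the variable names are illustrative):

```r
remainder <- 7 %% 2      # modulus: what is left after dividing 7 by 2
whole_part <- 7 %/% 2    # integer division: how many times 2 fits into 7
power <- 2 ^ 3           # exponent: 2 raised to the third power

# A relational operator returns a logical value
is_even <- (10 %% 2) == 0

print(c(remainder, whole_part, power))
print(is_even)
```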
326
separate()
The separate() function turns a single character column into multiple columns. employee <- data.frame(id, name, job_title) separate(employee, name, into=c('first_name', 'last_name'), sep=' ')
327
unite()
unite() function makes it possible to merge columns together. Syntax: unite(data, col, ..., sep = "_", remove = TRUE) data: The data frame col: The name of the new column as a string or symbol ...: A selection of columns. If empty, all variables are selected. You can select all variables between x and z with x:z or exclude y with '-y' sep: the separator to use between values (in the below example it's a space) Example: unite(employee, 'name', first_name, last_name, sep= ' ')
328
pivot_longer() pivot_wider()
pivot_longer(): Part of the tidyr package. Use this function to lengthen the data in a data frame by increasing the number of rows and decreasing the number of columns.
pivot_wider(): Also part of the tidyr package. Use this function to widen the data in a data frame by increasing the number of columns and decreasing the number of rows.
329
Anscombe quartet
Four datasets that have nearly identical summary statistics but look very different when graphed
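R ships with the quartet as the built-in anscombe data frame (columns x1 to x4 and y1 to y4), so the near-identical statistics can be checked directly. A minimal sketch using base R only:

```r
# Means of the four x columns and the four y columns
x_means <- sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)
y_means <- sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)

# Correlation between each x column and its matching y column
correlations <- mapply(cor, anscombe[, 1:4], anscombe[, 5:8])

print(x_means)
print(round(y_means, 2))
print(round(correlations, 2))
```

Plotting each pair (for example, plot(anscombe$x1, anscombe$y1)) shows how different the four datasets are despite the matching numbers.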
330
bias()
bias computes the average amount by which actual is greater than predicted. If it returns a positive value, your model is systematically underestimating the true values. If it returns a negative value, your model is systematically overestimating the true values. Syntax: bias(actual, predicted) Example 1: Compare predicted temp with actuals. install.packages("SimDesign") library(SimDesign) actual_temp <- c(68.3, 70, 72.4, 71, 67, 70) predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69) bias(actual_temp, predicted_temp) [1] 0.7166667 Positive value so the model is underestimating the true values. That is, the prediction is biased toward lower temps. It's fairly close to zero, but it isn't as accurate as would be ideal. Example 2: Compare actual sales with stock (i.e., predicted sales): #No need to install a package and draw on library as SimDesign is already set up. actual_sales <- c(150, 203, 137, 247, 116, 287) predicted_sales <- c(200, 300, 150, 250, 150, 300) bias(actual_sales, predicted_sales) [1] -35 Negative value so the model is overestimating the true values. That is, they're ordering too much stock for release days.
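The arithmetic behind both results can be reproduced in base R as the mean of actual minus predicted. This is a sketch of the calculation described on the card, not the package's own implementation:

```r
# Example 1: temperatures
actual_temp <- c(68.3, 70, 72.4, 71, 67, 70)
predicted_temp <- c(67.9, 69, 71.5, 70, 67, 69)
temp_bias <- mean(actual_temp - predicted_temp)

# Example 2: sales versus stock
actual_sales <- c(150, 203, 137, 247, 116, 287)
predicted_sales <- c(200, 300, 150, 250, 150, 300)
sales_bias <- mean(actual_sales - predicted_sales)

print(temp_bias)    # positive: predictions run low
print(sales_bias)   # negative: predictions run high
```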
331
sample()
In R, the sample() function allows you to take a random sample of elements from a data set.
Use Case: We decided to add randomization to the position of the ads using R. We wanted to make sure that the ads with similar frequencies were near each other and to eliminate as much bias as possible. We used sample() to inject a randomization element into our R programming. Adding this piece of code randomly shuffled the rows in our data. We presented the ads to users again, and this time, the position of the ads was random and controlled for bias. Less bias meant that the survey was more effective because the data was more reliable.
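Shuffling the rows of a data frame with sample() can be sketched as follows. The ads data frame and the seed value are made up for illustration; set.seed() only makes the random order repeatable.

```r
set.seed(42)  # arbitrary seed so the shuffle can be reproduced

# Made-up ad data for illustration
ads <- data.frame(ad_id = 1:6, frequency = c(5, 5, 3, 3, 1, 1))

# sample(n) returns the numbers 1..n in random order;
# using them as row positions shuffles the rows
shuffled_ads <- ads[sample(nrow(ads)), ]

print(shuffled_ads)
```

Every row is kept exactly once; only the order changes.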
332
smote()
SMOTE (Synthetic Minority Oversampling Technique)
Unbalanced classification problems cause difficulties for many learning algorithms. These problems are characterized by an uneven proportion of cases available for each class. SMOTE (Chawla et al. 2002) is a well-known algorithm to fight this problem. The general idea of this method is to artificially generate new examples of the minority class using the nearest neighbors of these cases. Furthermore, the majority class examples are also under-sampled, leading to a more balanced dataset.
More details: https://search.r-project.org/CRAN/refmans/performanceEstimation/html/smote.html
Syntax: smote(form, data, perc.over = 2, k = 5, perc.under = 2)
Arguments:
form: A formula describing the prediction problem
data: A data frame containing the original (unbalanced) data set
perc.over: A number that drives the decision of how many extra cases from the minority class are generated (known as over-sampling)
k: A number indicating the number of nearest neighbours that are used to generate the new examples of the minority class
perc.under: A number that drives the decision of how many extra cases from the majority classes are selected for each case generated from the minority class (known as under-sampling)
Use Case: In another instance of the data analysis process focusing on furniture sales, a significant issue arose when the dataset contained biased information related to the geographic representation of sales data. Certain regions were overrepresented, leading to skewed conclusions about popular furniture items. To address this bias, the furniture team employed statistical techniques in R to rebalance the dataset, oversampling underrepresented regions and undersampling the overrepresented ones.
*The team employed SMOTE for oversampling underrepresented regions and the NearMiss algorithm for undersampling overrepresented regions.
The SMOTE function uses bootstrapping and k-nearest neighbors to generate additional observations of the underrepresented (minority) class through oversampling.
333
NearMiss algorithm
NearMiss is an under-sampling algorithm: it rebalances a dataset by removing majority-class cases, keeping those that are closest to the minority class.
Syntax: nearmiss(df, var, k = 5, under_ratio = 1)
df: data.frame or tibble. Must have 1 factor variable and remaining numeric variables.
var: Character, name of variable containing factor variable.
k: An integer. Number of nearest neighbors used by the algorithm.
334
FWF
FWF (fixed-width file): A text file with a specific format, which enables the saving of textual data in an organized fashion
335
Log file
A computer-generated file that records events from operating systems and other software programs
336
data visual packages
ggplot2: The most popular data visualization package in R. Can make scatterplots, bar charts, line diagrams, etc., and can add titles, etc. *If you need a data visual function, ggplot2 probably has one. (There is a cheat sheet.)
Plotly: General purpose package that lets you do a wide range of visualization functions.
RGL: Package that focuses on 3D visuals.
Other visual packages:
-Lattice
-Dygraphs
-Leaflet
-Highcharter
-Patchwork
-gganimate
-ggridges
337
ggplot2: -aesthetic -geom -facet -label and annotations
In ggplot2,
-an aesthetic is a visual property of an object in your plot. Think of it as a connection or mapping between a visual feature in your plot and a variable in your data (e.g., in a scatterplot, aesthetics include things like the size, shape, color, or location (i.e., x- or y-axis) of your data points).
-a geom is a geometric object used to represent your data (e.g., you can use points to create a scatterplot, bars to create a bar chart, lines to create a line diagram, etc.)
-a facet lets you display smaller groups, or subsets, of your data
-the label and annotate functions let you customize your plot (e.g., you can add titles, subtitles, and captions to communicate the purpose of your plot).
338
ggplot2, scatterplot code
Syntax: ggplot(data = <DATA>) + geom_point(mapping = aes(x = <X_VARIABLE>, y = <Y_VARIABLE>))
339
? function_name
To learn more about any r function run the code ? function_name
340
How to add color to scatterplot
Example:
Existing scatterplot:
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
Edited scatterplot with color by species & a legend:
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species))
Edited scatterplot with shape by species:
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, shape = species))
Edited scatterplot with a different shape, color, and size for each species:
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, shape = species, color = species, size = species))
# The alpha aesthetic controls the transparency of the points. A good option when you have a dense plot with lots of data points.
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, shape = species, color = species, alpha = species))
# Just set all points to purple. We aren't mapping color to a specific variable like species, so this code needs to be outside the aes function.
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g), color = "purple")
341
3 aesthetics in ggplot2
color: you can change the color of all points on your plot or the color of each data group size: you can change the size of the points on your plot by data group shape: you can change the shape of the points on your plot by data group
342
smooth plot vs plot points
Example of smooth:
ggplot(data = penguins) + geom_smooth(mapping = aes(x = flipper_length_mm, y = body_mass_g))
Example of smooth and points:
ggplot(data = penguins) + geom_smooth(mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
Example of plotting a different line type for each species (linetype belongs in the geom_smooth mapping because points have no line type):
ggplot(data = penguins) + geom_smooth(mapping = aes(x = flipper_length_mm, y = body_mass_g, linetype = species)) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))
343
geom_jitter function
Creates a scatterplot and then adds a small amount of random noise to each point in the plot. Jittering helps deal with overplotting (i.e., when data points in a plot overlap with one another) and makes the points easier to find. By using the jitter function, we can get a better picture of the true underlying relationship between two variables in a dataset. However, we should be careful not to add too much jitter, as this can distort the original data too much.
Example:
ggplot(data = penguins) + geom_jitter(mapping = aes(x = flipper_length_mm, y = body_mass_g))
344
geom_bar
Bar charts
Example: *Note: These examples draw on the diamonds data set, which includes data on the cut of each diamond.
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
# To add color outlines to the bar chart:
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, color = cut))
# To add color fill to the bar chart (i.e., to fully fill the columns):
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = cut))
Note: If you don't specify a variable for the y-axis, the code defaults to 'count'.
Example:
library(tidyverse) # loads ggplot2 and dplyr
# Create the bar chart
ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = hotel, fill = market_segment))
# Filter the data to just include city hotels whose market segment is "Online TA"
onlineta_city_hotels <- filter(hotel_bookings, hotel == "City Hotel" & market_segment == "Online TA")
View(onlineta_city_hotels)
345
Loess smoothing vs Gam smoothing
Loess smoothing The loess smoothing process is best for smoothing plots with less than 1000 points. ggplot(data, aes(x=, y=))+ geom_point() + geom_smooth(method="loess") The gam smoothing, or generalized additive model smoothing, is useful for smoothing plots with a large number of points. ggplot(data, aes(x=, y=)) + geom_point() + geom_smooth(method="gam", formula = y ~s(x))
346
facet_wrap()
Use to facet your plot by a single variable.
Example: facet_wrap() lets us create a separate plot for each species:
ggplot(data = penguins) + geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) + facet_wrap(~species)
347
tilde operator ~
Tilde operator is used to define the relationship between dependent variable and independent variables in a statistical model formula. The variable on the left-hand side of tilde operator is the dependent variable and the variable(s) on the right-hand side of tilde operator is/are called the independent variable(s). So, tilde operator helps to define that dependent variable depends on the independent variable(s) that are on the right-hand side of tilde operator.
348
facet_grid()
Use to facet your plot with two variables. Note: Unlike the facet_wrap() function, the facet_grid() function will include plots even if they're empty. Example: #2 variables: sex and species ggplot(data=penguins)+geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+facet_grid(sex~species)
349
How to create a plot with rotated labels
Example 1: ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = distribution_channel)) + facet_wrap(~deposit_type) + theme(axis.text.x = element_text(angle = 45)) Example 2: Same as above but with a different chart for each market segment: ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = distribution_channel)) + facet_wrap(~market_segment) + theme(axis.text.x = element_text(angle = 45))
350
Chart Labeling/Chart Creation functions
Chart Title: To add a title to a chart, use a label function: title = "Average product rating" Subtitle: use subtitle="Sample of Three Penguin Species" Ex// ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species") To create bars on a chart: geom_bar() To highlight underperforming products, use an aesthetics function: col = ifelse(x<2, 'blue', 'yellow') To create a scatterplot chart: geom_point() To create a trendline: geom_smooth() To compare data trends across average ratings, use a facets function: facet_wrap(~`Average Rating`) To label the axes, use an aesthetics function: aes(x=`Average price (USD)`, y = Product) Note: column names that contain spaces or parentheses must be wrapped in backticks. To add a caption: caption="enter caption here" Ex// ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species", caption="Data collected by Dr. Kristen Gorman") To remove the axis label: Setting labs(x = "") omits the label but still allocates space; setting labs(x = NULL) removes the label and its space.
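Column names such as Average Rating and Average price (USD) are not valid R names on their own, so they need backticks inside aes() and facet_wrap(). The sketch below uses a small made-up data frame (the products object and its values are hypothetical) to show the pieces together.

```r
library(ggplot2)

# Hypothetical product data; check.names = FALSE keeps the spaces in the names
products <- data.frame(
  Product = c("A", "B", "C", "D"),
  `Average price (USD)` = c(12.5, 8.0, 23.0, 15.5),
  `Average Rating` = c(4, 2, 4, 2),
  check.names = FALSE
)

# Backticks let aes() and facet_wrap() refer to the non-standard names
rating_plot <- ggplot(data = products) +
  geom_point(mapping = aes(x = `Average price (USD)`, y = Product)) +
  facet_wrap(~`Average Rating`) +
  labs(title = "Average product rating",
       x = "Average price (USD)",
       y = "Product")
```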
351
Annotate
To add notes to a document or diagram to explain or comment upon it. The annotate function will allow you to put text inside the grid to call out specific data points. Ex: Add info about the Gentoos to the chart in large, bold, and purple text that is tilted at a 25 degree angle. annotate("text", x=220, y=3500, label="The Gentoos are the largest") library('ggplot2') library('palmerpenguins') ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+ labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species", caption="Data collected by Dr. Kristen Gorman")+ annotate("text", x=220, y=3500, label="The Gentoos are the largest", color="purple", fontface="bold", size=4.5, angle=25) OR if you want a shorter string of code, you could assign the first portion to a variable and then tack on the annotation: library('ggplot2') library('palmerpenguins') p <- ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species))+ labs(title="Palmer Penguins: Body Mass vs. Flipper Length", subtitle="Sample of Three Penguin Species", caption="Data collected by Dr. Kristen Gorman") p+annotate("text", x=220, y=3500, label="The Gentoos are the largest") There are 3 fonts that are guaranteed to work everywhere: sans (the default) serif mono There are 3 values for fontface: plain (the default) bold italic Alignment of the text: hjust(left, center, right, inward, outward) vjust(bottom, middle, top, inward, outward) check_overlap: If check_overlap = TRUE, overlapping labels will be automatically removed from the plot. The algorithm is simple: labels are plotted in the order they appear in the data frame; if a label would overlap with an existing point, it’s omitted. Notes: -more on syntax: https://ggplot2.tidyverse.org/reference/annotate.html
352
Annotation Text Types
Using ggplot2, 2 main functions are available for this kind of annotation: geom_text() to add a simple piece of text; geom_label() to add a label (framed text). Ex// #library library(ggplot2) #basic graph p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() #a data frame with all the annotation info annotation <- data.frame( x = c(2,4.5), y = c(20,25), label = c("label 1", "label 2") ) #Add text p + geom_text(data=annotation, aes(x=x, y=y, label=label), color="orange", size=7, angle=45, fontface="bold") Note: possible to shorten with annotate: p + annotate("text", x = c(2,4.5), y = c(20,25), label = c("label 1", "label 2"), color="orange", size=7, angle=45, fontface="bold") #Using labels (framed text) p + geom_label(data=annotation, aes(x=x, y=y, label=label), color="orange", size=7, angle=45, fontface="bold")
353
Add Shapes with Annotate()
The annotate() function lets you add all kinds of shapes to a ggplot2 chart. The first argument controls which kind is used: "rect" for a rectangle, "segment" for a segment or an arrow. #Add rectangles p + annotate("rect", xmin=c(2,4), xmax=c(3,5), ymin=c(20,10) , ymax=c(30,20), alpha=0.2, color="blue", fill="blue") #Add segments p + annotate("segment", x = 1, xend = 3, y = 25, yend = 15, colour = "purple", size=3, alpha=0.6) #Add arrow p + annotate("segment", x = 2, xend = 4, y = 15, yend = 25, colour = "pink", size=3, alpha=0.6, arrow=arrow())
354
Custom Annotations
geom_text() and geom_label() to add text, as illustrated earlier. geom_rect() to highlight interesting rectangular regions of the plot. geom_rect() has aesthetics xmin, xmax, ymin and ymax. geom_line(), geom_path() and geom_segment() to add lines. All these geoms have an arrow parameter, which allows you to place an arrowhead on the line. Create arrowheads with arrow(), which has arguments angle, length, ends and type. geom_vline(), geom_hline() and geom_abline() allow you to add reference lines (sometimes called rules), that span the full range of the plot.
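The geoms listed above can be combined on one plot. The following is a minimal sketch on the mtcars dataset; the highlight and pointer data frames and their coordinates are made up for illustration.

```r
library(ggplot2)

# Hypothetical region to highlight and arrow to draw
highlight <- data.frame(xmin = 1.5, xmax = 2.5, ymin = 25, ymax = 35)
pointer <- data.frame(x = 4.5, y = 30, xend = 5.2, yend = 12)

annotated_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Rectangle: inherit.aes = FALSE stops the layer from looking for wt and mpg
  geom_rect(data = highlight,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            inherit.aes = FALSE, alpha = 0.2, fill = "blue") +
  # Line segment with an arrowhead created by arrow()
  geom_segment(data = pointer,
               aes(x = x, y = y, xend = xend, yend = yend),
               inherit.aes = FALSE,
               arrow = arrow(length = unit(0.3, "cm"), type = "closed")) +
  # Reference lines spanning the full range of the plot
  geom_hline(yintercept = mean(mtcars$mpg), linetype = "dashed") +
  geom_vline(xintercept = mean(mtcars$wt), linetype = "dashed")
```

Passing each annotation as its own small data frame keeps the layer from being drawn once per row of mtcars.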
355
ggsave()
Useful function for saving a plot. It defaults to saving the last plot that you displayed and uses the size of the current graphics device. Example: library(ggplot2) library(palmerpenguins) ggplot(data=penguins)+ geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g, color=species)) ggsave("Three Penguin Species.png") Output: saves the graph as Three Penguin Species as a png. Visible in the Files>Cloud>Project directory (folder) on Posit Cloud (web service that delivers an IDE similar to RStudio).
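Relying on "the last plot displayed" can be fragile in scripts. A sketch of a more explicit call follows, assuming a hypothetical output path in the session's temporary directory and the mtcars dataset; the plot, size, and units are passed by name.

```r
library(ggplot2)

scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output location; the file extension selects the format
out_file <- file.path(tempdir(), "mtcars_scatter.pdf")

# Name the plot and the size explicitly instead of using the defaults
ggsave(filename = out_file, plot = scatter,
       width = 6, height = 4, units = "in")

print(file.exists(out_file))
```

Changing the extension in out_file (for example, to .png) switches the output format without other changes to the call.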
356
Save a ggplot
Save ggplot into a PDF file: #Create some plots library(ggplot2) myplot1 <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() myplot2 <- ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() #Print plots to a pdf file (dev.off function closes the graphic device) pdf("ggplot.pdf") print(myplot1) # Plot 1 --> in the first page of the PDF print(myplot2) # Plot 2 --> in the second page of the PDF dev.off() #Print into a png file: png("myplot.png") print(myplot1) dev.off() The same devices work for base R plots (Temperature is assumed to be a numeric vector defined elsewhere): #Save as jpeg image jpeg(file="saving_plot1.jpeg") hist(Temperature, col="darkgreen") dev.off() #Save as png image png(file="C:/Datamentor/R-tutorial/saving_plot2.png", width=600, height=350) hist(Temperature, col="gold") dev.off() #Save as bmp image bmp(file="saving_plot3.bmp", width=6, height=4, units="in", res=100) hist(Temperature, col="steelblue") dev.off() #Save as pdf file pdf(file="saving_plot4.pdf") hist(Temperature, col="violet") dev.off() #Save as postscript file postscript(file="saving_plot4.ps") hist(Temperature, col="violet") dev.off()
357
Plots can be saved as ___ or ____
Plots can be saved as bitmap (raster) images, which have a fixed size, OR as vector images, which are easily resizable. Raster: the type of image produced when scanning or photographing an object. Raster images are built from pixels, each holding color and tonal information, that together form the image. Because raster images are pixel-based, they are resolution dependent. Most images we come across, such as jpeg or png files, are bitmap images: they have a fixed resolution and look pixelated when zoomed in far enough. Functions that save plots in this format are jpeg(), png(), bmp() and tiff(). Functions that save plots as vector images include pdf(), svg() and postscript().
358
Define dynamic variables so that the text (in this example, the caption) updates automatically when the data is updated.
Example: hotel_bookings <- read.csv("hotel_bookings.csv") library(ggplot2) library(tidyverse) mindate <- min(hotel_bookings$arrival_date_year) maxdate <- max(hotel_bookings$arrival_date_year) ggplot(data = hotel_bookings) + geom_bar(mapping = aes(x = market_segment)) + facet_wrap(~hotel) + theme(axis.text.x = element_text(angle = 45)) + labs(title="Comparison of market segments by hotel type for hotel bookings", caption=paste0("Data from: ", mindate, " to ", maxdate), x="Market Segment", y="Number of Bookings")
359
Aesthetic (R):
A visual property of an object in a plot
360
Facets (R):
A series of functions that splits data into subsets in a matrix of panels
361
GAM (generalized additive model) smoothing (R):
A process for smoothing plots with a large number of points
362
Geom (R):
The geometric object used to represent data
363
Labels and annotations (R):
A group of R functions used for customizing a plot
364
Loess smoothing (R):
A process used for smoothing plots with fewer than 1,000 points
365
Mapping (R):
The process of matching up a specific variable in a dataset with a specific aesthetic
366
Smoothing (R):
A process used to make data visualizations in R clearer and more readable
367
Smoothing line (R):
A line on a data visualization that uses smoothing to represent a trend
368
R Markdown
A file format for making dynamic documents with R. You can use an R Markdown file as a code notebook to save, organize, and document your analysis using code chunks, comments, and other features. It allows you to save and execute code & generate shareable reports for stakeholders. You can use R Markdown in notebook mode for analyst-to-analyst communication, and in report mode for analyst-to-decision-maker communication.
369
Markdown
A syntax for formatting plain text files
370
R Notebook
An R Notebook lets users run your code and displays the graphs that the code produces.
371
R Notebook lets you convert your files into the following formats:
-HTML, PDF, and Word docs -Slide Presentation -Dashboard
372
HTML
The set of markup symbols and code used to create a webpage
373
Other than R Notebook, there are the following notebooks:
Jupyter, Kaggle, Google Colab (AKA Colab)
374
Jupyter Notebooks
The Jupyter Notebook is an open source web application that you can use to create and share documents that contain live code, equations, visualizations, and text. These documents contain computer code and rich text elements (e.g., comments, links, or descriptions of your analysis and results). They can be useful for everything from data cleaning and transformation to statistical modeling and visualizations. They're compatible with R, so they are an alternative to R Markdown. Privacy: Because you use Jupyter in a web browser, some people are understandably concerned about using it with sensitive data. However, if you followed the standard install instructions, Jupyter is actually running on your own computer. Notes: -If you're working in Kaggle, there are two types of notebooks available: Jupyter notebooks and scripts (including R Markdown scripts). -Jupyter notebooks can be used in Google Colab.
375
Notebook
A notebook is a shareable document that combines computer code, plain language descriptions, data, rich visualizations like 3D models, charts, graphs and figures, and interactive controls. A notebook, along with an editor (like JupyterLab), provides a fast interactive environment for prototyping and explaining code, exploring and visualizing data, and sharing ideas with others.
376
Jupyter Notebook Documents
Notebook documents contain the inputs and outputs of an interactive session as well as additional text that accompanies the code but is not meant for execution. In this way, notebook files can serve as a complete computational record of a session, interleaving executable code with explanatory text, mathematics, and rich representations of resulting objects. These documents are internally JSON files and are saved with the .ipynb extension. Since JSON is a plain text format, they can be version-controlled and shared with colleagues. *JSON: a format used to store and export data
377
JSON
a format used to store and export data
378
R Notebook, Knit button
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
379
.RMD file, hashtags
R markdown file headings in the report are created when you include one or more hashtags (#) before the heading text, such as ## Including Plots. The more hashtags used, the smaller the heading font. # Including Plots creates a Header 1 style heading whereas ## Including Plots creates a Header 2 style heading.
380
R Notebook How can you tell when a code chunk ends? How to start and end a code chunk?
All code chunks begin and end with delimiters. To start a code chunk, type three backticks followed by a lowercase "r" in curly brackets: ```{r} To end it, type just the three backticks: ``` Hotkeys to add a code chunk: ctrl+alt+I (PC)
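To see the delimiters in context, the sketch below writes a tiny R Markdown file from R and prints it back. The file name demo.Rmd and its contents are hypothetical; strrep() builds the three-backtick delimiter so it can be stored in a string.

```r
# Three backticks, built programmatically
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  'title: "Demo"',
  "output: html_document",
  "---",
  "",
  "# Getting Started with R Markdown",
  "",
  "Some *italic* and **bold** text.",
  "",
  paste0(fence, "{r}"),  # opening delimiter of the code chunk
  "summary(cars)",
  fence                  # closing delimiter of the code chunk
)

# Hypothetical file in the session's temporary directory
rmd_path <- file.path(tempdir(), "demo.Rmd")
writeLines(rmd_lines, rmd_path)

# Show the file as it would appear in the editor
cat(readLines(rmd_path), sep = "\n")
```

Everything between the two delimiter lines is treated as R code when the file is knit; the rest is Markdown text.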
381
R Notebook Basic Formatting:
To start a new paragraph, end a line with two spaces. To apply italics to a word or phrase, place an asterisk at the beginning and at the end of the word or phrase, for example, *italics works*. To apply bold to a word or phrase, place two asterisks at the beginning and at the end of the word or phrase, for example, **bold is useful**. To create a header, type a hashtag (#) followed by a space and your text, for example: # Getting Started with R Markdown
382
R Notebook Creating headers:
Headers will appear in blue. A single hashtag is the largest header. The more hashtags you add (up to six), the smaller the header.
383
YAML
YAML Ain't Markup Language (originally "Yet Another Markup Language"). A language for data that translates it so it's readable.
384
To include a link in an RMarkdown document, use the following syntax:
[click here](URL)
385
What is the correct syntax to add an image with a caption to an RMarkdown document?
![caption](image URL)
386
Code chunk
Code added to a .rmd file (r markdown file) is standardly called a code chunk
387
Delimiter
A character that indicates the beginning or end of a data item.
388
R Notebook Code chunk delimiters
```{r} and ``` PC hotkeys: ctrl+alt+I
389
You can create a template in R if you need to create and deliver a report on a regular basis.
390
Change the output of a document in R Markdown
When working in RStudio, you can set the output of a document in R Markdown by changing the YAML header. For example, the following code creates an HTML document: --- title: "Demo" output: html_document --- And the following code creates a PDF document: --- title: "Demo" output: pdf_document --- The Knit button in the RStudio source editor renders a file to the first format listed in its output field (HTML is the default). You can render a file to additional formats by clicking the dropdown menu next to the Knit button.
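The YAML header is plain text bracketed by two --- lines, so it can also be generated. Below is a minimal sketch with a hypothetical helper, make_yaml_header(), that builds the header for a chosen output format; it is an illustration, not part of any package.

```r
# Hypothetical helper: build the YAML header lines for a given output format
make_yaml_header <- function(title, output = "html_document") {
  c(
    "---",                           # opening delimiter
    sprintf('title: "%s"', title),
    sprintf("output: %s", output),
    "---"                            # closing delimiter
  )
}

# Header for a PDF document
pdf_header <- make_yaml_header("Demo", "pdf_document")
cat(pdf_header, sep = "\n")
```

If the rmarkdown package is installed, a call such as rmarkdown::render("report.Rmd", output_format = "pdf_document") is another way to choose the format from the console; the file name here is hypothetical.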
391
In addition to the default HTML output (html_document), you can create other types of documents in R Markdown using the following output settings:
pdf_document: This creates a PDF file with LaTeX (an open source document layout system). If you don't already have LaTeX, RStudio will automatically prompt you to install it. word_document: This creates a Microsoft Word document (.docx). odt_document: This creates an OpenDocument Text document (.odt). rtf_document: This creates a Rich Text Format document (.rtf). md_document: This creates a Markdown document (which strictly conforms to the original Markdown specification). github_document: This creates a GitHub document, which is a customized version of a Markdown document designed for sharing on GitHub.
392
R Markdown renders files to specific presentation formats when you use the following output settings:
beamer_presentation: for PDF presentations with beamer. ioslides_presentation: for HTML presentations with ioslides. slidy_presentation: for HTML presentations with Slidy. powerpoint_presentation: for PowerPoint presentations. revealjs::revealjs_presentation: for HTML presentations with reveal.js (a framework for creating HTML presentations; requires the revealjs package). Learn more: https://rmarkdown.rstudio.com/lesson-11.html
393
flexdashboard
The flexdashboard package lets you publish a group of related data visualizations as a dashboard. Flexdashboard also provides tools for creating sidebars, tabsets, value boxes, and gauges.
394
Shiny
Shiny is an R package that lets you build interactive web apps using R code. You can embed your apps in R Markdown documents or host them on a webpage. To call Shiny code from an R Markdown document, add runtime: shiny to the YAML header: --- title: "Shiny Web App" output: html_document runtime: shiny ---
395
Other packages provide even more output formats:
The bookdown package is helpful for writing books and long-form articles. The prettydoc package provides a range of attractive themes for R Markdown documents. The rticles package provides templates for various journals and publishers.
396
A delimiter is a character that marks the beginning and end of a _________.
A delimiter is a character that marks the beginning and end of a data item. It can mark a single line of code, or a whole section of code in an .rmd file.
397
Which combination of text characters can be used to embed an image in a markdown document?
![]()
398
What symbol can be used to add bullet points in R Markdown?
Asterisks
399
What delimiter is used to indicate the YAML metadata in an R Markdown notebook?
---
400