key terms Flashcards

(77 cards)

1
Q

variable

A

name assigned to a value, and stored in global environment

RStudio shortcut for the assignment operator <- is alt + -

2
Q

naming

A

legal names in R must begin with a letter (or with a . that is not followed by a number)

. _ and numbers are all allowed after the first character

case sensitive

3
Q

functions

A

carry out a calculation

no side-effects –> don’t permanently change their arguments

args() prints a summary of a function's main arguments and their default values

4
Q

arguments

A

parts inside the parentheses of a function

a function may have multiple arguments

separated by ,

argument names don’t have to be specified if given in the default order
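
As a small illustration of positional versus named arguments, the sketch below calls base R's seq() three ways. The variable names are made up for the example.

```r
# seq()'s first three arguments are from, to, by (in that order)
by_position <- seq(1, 10, 2)                   # names omitted, default order
by_name     <- seq(from = 1, to = 10, by = 2)  # name-value pairs
reordered   <- seq(by = 2, to = 10, from = 1)  # named arguments can go in any order

print(by_position)
identical(by_position, by_name)
identical(by_name, reordered)
```

All three calls should give the same vector, because naming the arguments only changes how they are matched, not what is calculated.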

5
Q

name-value pairs

A

arguments given as name-value pairs

‘name=value’

6
Q

function nesting

A

read from inside out
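
A minimal sketch of reading a nested call from the inside out, using base R functions. The vector and variable names are hypothetical.

```r
values <- c(1, 4, 9)

# Nested call: evaluated from the inside out (sum -> sqrt -> round)
nested <- round(sqrt(sum(values)), 1)

# The same calculation written one step at a time
total    <- sum(values)    # innermost call runs first
root     <- sqrt(total)
stepwise <- round(root, 1) # outermost call runs last

print(nested)
identical(nested, stepwise)
```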

7
Q

vectors

A

1-dimensional data structure for storing a set of values

element = a single value in a vector (length() gives the no. of elements)

c() combines vectors
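
A short sketch of building and inspecting a vector with c(). The values and names are illustrative only.

```r
heights <- c(1.5, 2.1, 3.4)       # a vector with three elements
more    <- c(heights, 4.0, 5.2)   # c() also joins existing vectors and new values

length(more)  # number of elements in the vector
more[2]       # extract the second element
```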

8
Q

atomic vectors

A

contain data of 1 type
(e.g. all integers or all characters)

[1] at the start of printed output is the index of the first element shown on that line
(e.g. [1] 2)
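
One way to see the "1 type" rule is to mix types inside c() and check what R does. This is a sketch with hypothetical variable names.

```r
ints  <- c(1L, 2L, 3L)     # all integers
mixed <- c(1, "a", TRUE)   # mixed input is coerced to a single type

typeof(ints)   # "integer"
typeof(mixed)  # "character"
print(mixed)   # every element is now a character string
```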

9
Q

numeric vectors

A

elements are numbers (integer or double)

10
Q

character vectors

A

character strings

have to put matching single (') or double (") quotes around each string

11
Q

logical vectors

A

elements take only 2 values: ‘TRUE’ or ‘FALSE’

12
Q

relational operators

A

x < y

x > y

x <= y

x >= y

x == y

x != y
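
Relational operators work element by element and return a logical vector. A minimal sketch with made-up values:

```r
x <- c(1, 5, 10)
y <- c(2, 5, 8)

less      <- x < y    # TRUE FALSE FALSE
equal     <- x == y   # FALSE TRUE FALSE
not_equal <- x != y   # TRUE FALSE TRUE
at_least  <- x >= 5   # a single value is compared against every element

print(less)
print(at_least)
```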

13
Q

statistical variables

A

anything we can control or measure

14
Q

data frames

A

table-like object with rows and columns

columns = statistical variables (each a vector of the same length (<chr>, <dbl>, or <int>))

rows = related observations

data.frame() creates a data frame

15
Q

extracting variables

A

use double square brackets around the variable name (name needs quotes)

or use $ (name doesn’t need quotes)
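
A small sketch of both extraction styles on a hypothetical data frame (the name tree_data and its columns are invented for the example):

```r
tree_data <- data.frame(
  species  = c("oak", "birch", "pine"),
  height_m = c(21.5, 14.2, 18.0)
)

with_brackets <- tree_data[["height_m"]]  # name in quotes
with_dollar   <- tree_data$height_m       # no quotes needed

identical(with_brackets, with_dollar)
```

Both forms return the column as a plain vector rather than as a one-column data frame.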

16
Q

packages

A

collection of folders and files combining code, data, and documents for sharing between computers

CRAN website hosts most publicly released packages
Task Views list which packages are useful for your type of data analysis

packages must be installed (once) and loaded and attached (each session)

17
Q

tidyverse

A

collection of R packages

18
Q

data wrangling

A

cleaning and manipulating data ready for analysis

19
Q

tidy data

A

1 variable = 1 column
each row has 1 unique observation

e.g. if biomass was measured at 2 time points, can’t have a T1 biomass column and T2 biomass column as this splits biomass across 2 columns
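
The biomass example can be sketched in base R as below. The data values and column names are hypothetical, and in the tidyverse this reshaping is usually done with tidyr's pivot_longer() rather than by hand.

```r
# Untidy: biomass is split across two columns
wide <- data.frame(
  plant      = c("a", "b"),
  biomass_T1 = c(1.2, 0.8),
  biomass_T2 = c(2.5, 1.9)
)

# Tidy: one biomass column, plus a column saying when it was measured
long <- data.frame(
  plant   = rep(wide$plant, times = 2),
  time    = rep(c("T1", "T2"), each = nrow(wide)),
  biomass = c(wide$biomass_T1, wide$biomass_T2)
)

print(long)
```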

20
Q

tibbles

A

tidyverse version of a data frame

21
Q

dplyr

A

helps manipulate rectangular data

functions:
- glimpse
- select
- mutate
- filter
- arrange
- summarise

22
Q

select

A

selects certain variables and (optionally) renames them

variable names don’t need quotes, but if a name has a space, wrap it in backticks (``)

select all but certain variables using ! before their name

rename using name-value (<new> = <variable>)
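
A minimal sketch of these select() uses, assuming dplyr (version 1.0 or later) is installed. The data frame penguins_demo and its values are made up for illustration.

```r
library(dplyr)

penguins_demo <- data.frame(
  species        = c("Adelie", "Gentoo"),
  island         = c("Torgersen", "Biscoe"),
  bill_length_mm = c(39.1, 46.1),
  `body mass`    = c(3750, 4500),
  check.names    = FALSE   # keep the space in the column name
)

kept    <- select(penguins_demo, species, `body mass`)           # backticks for the space
dropped <- select(penguins_demo, !island)                        # everything except island
renamed <- select(penguins_demo, species, bill = bill_length_mm) # <new> = <variable>

names(kept)
names(dropped)
names(renamed)
```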

23
Q

mutate

A

creates new variables from pre-existing ones, and keeps original variables

don’t use quotes

name each new variable at the same time using name-value (<new> = <expression>)

can make multiple new variables at the same time, separated by ,
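
A short mutate() sketch, assuming dplyr is installed. The data frame and the new variable names are hypothetical.

```r
library(dplyr)

bills <- data.frame(
  bill_length_mm = c(39.1, 46.1),
  bill_depth_mm  = c(18.7, 13.2)
)

# Two new variables in one call; the original columns are kept
with_new <- mutate(
  bills,
  bill_ratio     = bill_length_mm / bill_depth_mm,
  bill_length_cm = bill_length_mm / 10
)

names(with_new)
```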

24
Q

rename

A

renames variables when you only want to rename, not select as well (<new> = <old>)

25
transmute
creates new variables from pre-existing ones and drops all the original variables
to keep some original variables unaltered, add them as arguments
26
:
from:to gives the range from the first value to the second (e.g. 1:5)
27
filter
keeps only the observations (rows) that fit criteria built from relational operators and logical operators
variables don't need '', but character string values do
28
logical operators
link relational operators
x & y (and)
x | y (or)
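
A sketch of combining relational and logical operators inside dplyr's filter(), assuming dplyr is installed. The data are invented for the example.

```r
library(dplyr)

bills <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 41.0, 46.1, 49.0)
)

# & keeps rows where both conditions are TRUE
long_adelie <- filter(bills, species == "Adelie" & bill_length_mm > 40)

# | keeps rows where at least one condition is TRUE
either <- filter(bills, species == "Gentoo" | bill_length_mm > 48)

print(long_adelie)
print(either)
```

The variable names (species, bill_length_mm) are unquoted, while the string values ("Adelie", "Gentoo") are quoted.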
29
arrange
arranges rows according to the values of one or more variables
use desc() for descending order
30
summarise
summarises variables according to the functions used
tibble returned doesn't need to be the same length as the initial vectors
can name the new variables at the same time using name-value
31
na.rm = TRUE
ignores NA values
32
group_by
groups the data so that later calculations are performed on individual groups - e.g. mean bill length by species
can group by multiple variables
missing values create extra groups
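
A minimal sketch of group_by() followed by summarise(), assuming dplyr is installed. The data are hypothetical and deliberately include a missing species and a missing measurement.

```r
library(dplyr)

bills <- data.frame(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", NA),
  bill_length_mm = c(39.1, NA, 46.1, 50.0, 45.0)
)

by_species <- summarise(
  group_by(bills, species),
  mean_bill = mean(bill_length_mm, na.rm = TRUE),  # NA measurements are ignored
  n         = n()                                  # rows in each group
)

print(by_species)
```

The result should have one row per group, including an extra group for the row whose species is NA.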
33
pipelines
pipe operator %>%
shortcut = ctrl + shift + m
%>% means take the result of the expression on the LHS and use it as the argument where the . placeholder is
can leave out . and R assumes it to be the first argument
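
The sketch below compares a nested call with the equivalent pipeline, assuming dplyr is installed (it makes %>% available). The data frame is made up for illustration.

```r
library(dplyr)

bills <- data.frame(
  species        = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 46.1, 49.0)
)

# Nested version: read from the inside out
nested <- arrange(filter(bills, bill_length_mm > 40), desc(bill_length_mm))

# Piped version: read from left to right, LHS becomes the first argument
piped <- bills %>%
  filter(bill_length_mm > 40) %>%
  arrange(desc(bill_length_mm))

# Explicit . placeholder: the LHS is used as the by argument instead
steps <- 2 %>% seq(1, 10, by = .)

isTRUE(all.equal(nested, piped))
print(steps)
```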
34
helper functions
used with dplyr functions
starts_with, contains, case_when, etc.
35
types of variables
numeric - continuous or discrete (continuous variables can be made discrete by measuring apparatus)
categorical - ordinal or nominal
36
ratio vs interval scales
ratio = meaningful 0 (represents the absence of a quantity)
- can add/subtract and multiply/divide on the scale
- often for physical quantities
- e.g. one tree is twice as tall as another
interval = arbitrary 0 (doesn't represent the absence of a quantity)
- can add/subtract on scale, but not multiply/divide
- e.g. date (would not say 1000AD is twice as late as 500AD)
37
populations vs samples
sample = small group drawn from the wider population
exploratory data analysis explores the properties of samples
38
descriptive statistics
central tendency (averages), dispersion (variance, interquartile range, and standard deviation), and associations (Pearson's correlation coefficient if linear, Spearman's rank correlation if monotonic but nonlinear, etc)
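
These statistics are all available in base R. A minimal sketch with made-up values, where y rises with x but not in a straight line:

```r
x <- c(1, 2, 3, 4, 5)
y <- x^3   # increases with x, but nonlinearly

# Central tendency
mean(x)
median(x)

# Dispersion
var(x)
sd(x)
IQR(x)

# Association
pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # based on ranks

print(pearson)
print(spearman)
```

Because the ranks of x and y agree perfectly, Spearman's coefficient comes out as 1 here, while Pearson's is below 1.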
39
cross-tabulation - xtabs()
shows which combinations of categories are common in categorical data
returns a contingency table
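
A small sketch of xtabs() using the built-in mtcars dataset, treating the number of cylinders (cyl) and gears (gear) as categories:

```r
# Count how many cars fall into each cylinder/gear combination
tab <- xtabs(~ cyl + gear, data = mtcars)

print(tab)
sum(tab)   # every car is counted once
```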
40
graphical statistics - data visualisation
graphics package - standard R package, flexible
lattice package - good for multivariable relationships
ggplot2 package - easy to make sophisticated plots
41
ggplot2 objects - ggplot()
at least 1 layer (associated with data and rules on how to display it)
a scale for each aesthetic mapping
a coordinate system per plot
a facet specification if using a multi-panel plot
(add graphical objects using +)
42
layers
5 components:
- data (in form of data frame / tibble)
- aesthetic mappings (describe how data is associated with aesthetics such as position, colour, size of points)
- geometric object - aka geom (how to present info - i.e. graph type)
- statistical transformation - aka stat (transforms the raw data (if not summarised first in dplyr))
- position adjustment - tweak position of layer elements (how info for categories is separated - e.g. if bars are stacked or side-by-side)
43
scales
how variable info is mapped to aesthetic properties (every aesthetic mapping used must have a scale so that data can be mapped onto it)
if multiple plots, must all have same scale for shared aesthetic mappings
44
coordinate system
takes position of objects (points, lines, etc) and maps them onto the plot
45
faceting
breaks dataset into subsets, with a different plot for each
each plot has same layers and scales, etc
46
aesthetic mappings
aes()
47
geoms
geom_TYPE
e.g. scattergraphs are geom_point
48
stat and position
stat = "identity" and position = "identity" when we want to plot data without modification
49
standard workflow
ggplot(data, aes()) + geom_TYPE()
add comments using # to narrate what you're doing at each step
50
adding extra aesthetic mappings
e.g. aes(x=bill_length_mm, y=bill_depth_mm, colour=species)
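
A minimal sketch of the workflow with an extra colour mapping, assuming ggplot2 is installed. The data frame is a hypothetical stand-in using the column names from the example above.

```r
library(ggplot2)

bills <- data.frame(
  bill_length_mm = c(39.1, 41.0, 46.1, 49.0),
  bill_depth_mm  = c(18.7, 18.0, 13.2, 19.5),
  species        = c("Adelie", "Adelie", "Gentoo", "Chinstrap")
)

# Step 1: data and aesthetic mappings (x, y, and colour by species)
# Step 2: add a geom to say how the data should be drawn
p <- ggplot(bills, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point()

# print(p) draws the scatter plot
```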
51
facets
operates on the whole figure - e.g. can split data up by species or by island or both
facet_wrap() - wraps a 1D sequence of panels into a 2D matrix with rows and columns (no empty panels), used for single grouping variable
facet_grid() - 2D matrix of panels in rows and columns (empty panels if combo doesn't exist), 2+ categorical variables
need ~ before the variable you want to facet by
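
A short sketch of both faceting functions, assuming ggplot2 is installed. The data are made up for the example.

```r
library(ggplot2)

bills <- data.frame(
  bill_length_mm = c(39.1, 41.0, 46.1, 49.0),
  bill_depth_mm  = c(18.7, 18.0, 13.2, 19.5),
  species        = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  island         = c("Torgersen", "Biscoe", "Biscoe", "Dream")
)

base_plot <- ggplot(bills, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

# One grouping variable: one panel per species
wrapped <- base_plot + facet_wrap(~ species)

# Two grouping variables: islands in rows, species in columns
gridded <- base_plot + facet_grid(island ~ species)
```

With facet_grid(), combinations that don't occur in the data (e.g. Gentoo on Dream here) still get a panel, which is left empty.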
52
multiple layers
e.g. make summary layers of means over the raw data point layers
53
customising plots - geom properties
specify arguments in geom_TYPE function
e.g. shape, size, transparency (alpha), etc
colours() prints available colour choices
54
customising plots - scale
breaks argument of scale_AES_TYPE adjusts the intervals/guides
55
labels
feature of whole plot, not a layer
so have to use the labels function (labs()), not an argument of geom
56
customising plots - themes
adjusts all visual elements not adjusted by geom or scales (i.e. the non-data parts such as background colour, grid lines, label positions, fonts)
use theme() function
so many adjustments are possible that people usually google specific adjustments as and when needed
some standard themes, e.g. theme_bw() (put any additional theme changes after setting the standard theme type)
57
making a plot
start with basic skeleton of a plot and build it up, adding more customisation
58
histograms
useful for viewing distributions of large samples (>100)
need to bin data first
can bin in dplyr, or get ggplot2 to do it for us (using stat facility stat_bin())
(pick appropriate bin width using binwidth = )
increasing binwidth smooths the histogram
only need to define x axis of histogram as ggplot2 does y axis for us
fill and colour to customise
59
dot plots
good for visualising distributions of small samples
bins not evenly spaced
no. stacked dots represents the height along the y axis
60
bar plots
good for counting frequency of occurrences within each category
use count()
geom_bar - counts number in each category, don't supply y aesthetic, use on raw data
geom_col - uses y values (not counting), so useful when data already summarised and want to plot exact values
can adjust labels, widths of bars, colours
can reorder the bars: make a character vector of species names in order, then use the limits = argument when adjusting the scale
coord_flip flips the axes
if want to treat numerical values as categorical, convert to character vector using as.character()
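
The difference between geom_bar and geom_col can be sketched as below, assuming dplyr and ggplot2 are installed. The data are hypothetical.

```r
library(dplyr)
library(ggplot2)

raw <- data.frame(species = c("Adelie", "Adelie", "Gentoo"))

# geom_bar: give it the raw data and it does the counting (no y aesthetic)
p_bar <- ggplot(raw, aes(x = species)) +
  geom_bar()

# geom_col: count first with count(), then plot the stored values as y
counts <- count(raw, species)   # columns: species, n
p_col <- ggplot(counts, aes(x = species, y = n)) +
  geom_col()

print(counts)
```

Both plots should show the same bar heights; they differ only in where the counting happens.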
61
associations between numeric variables
usually shown by scatter plot
problem: scatter plots don't reveal over-plotting (when points very close due to large dataset or many identical points due to coarse scale)
solutions: use small points, reduce opacity, or change geom type
62
geom_bin_2d / geom_hex solves problem of closely packed points in large datasets
like histograms but in 2 dimensions (divides plane into rectangles or hexagons, darker fill colour = more cases at that point)
63
geom_count solves problem of overlapping datapoints
gives scatter graph where point size scaled according to no. cases
64
associations between categorical variables
bar charts usually used
separate bar for each combo of categories in the 2 variables
have to define 2 aesthetic mappings, which produces a stacked bar chart by default
to ensure all variables are treated as categorical, convert to factors (factor()) (used by R to represent categorical variables)
use levels argument to set order
to make side-by-side, set the position argument to "dodge"
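
A minimal base R sketch of factor() and its levels argument. The values are made up for illustration.

```r
sizes <- c("medium", "small", "large", "small")

default_order <- factor(sizes)   # levels sorted alphabetically
custom_order  <- factor(sizes, levels = c("small", "medium", "large"))

levels(default_order)   # "large" "medium" "small"
levels(custom_order)    # "small" "medium" "large"

# Numbers can be turned into categories the same way
year_as_category <- factor(c(2007, 2008, 2007))
levels(year_as_category)
```

The level order is what a plot uses to order the categories, so setting levels is one way to control the order of bars or legend entries.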
65
categorical-numerical associations
usually use box plot
looks at distribution of numerical variable within categories
highlights outliers
alternative = make multiple histograms
66
multiple histograms
overlay: set position argument to identity and increase transparency --> shows overlaps
faceting: use facet_wrap so each is on a different plot
67
multivariate associations (associations between more than 2 variables)
either use faceting to make a multi-panel plot (na.omit function gets rid of NA)
or add extra aesthetic mapping
68
comparing descriptive statistics
to plot means: bar chart showing just means -> calc means using summarise, plot using geom_col
more useful to plot error bars too
69
error bars
shows uncertainty
first calc means and standard errors (standard error = sd() / sqrt(n))
use geom_col and add geom_errorbar
needs ymin and ymax aesthetic mappings - e.g. +/- 1 standard error
position = position_dodge(0.9) makes error bars at centre of each bar
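
A sketch of the means-plus-error-bars recipe, assuming dplyr and ggplot2 are installed. The data are invented, and the standard error is computed as the standard deviation divided by the square root of the group size.

```r
library(dplyr)
library(ggplot2)

bills <- data.frame(
  species        = rep(c("Adelie", "Gentoo"), each = 4),
  bill_length_mm = c(38, 39, 40, 41, 45, 46, 47, 48)
)

# Step 1: mean and standard error for each species
bill_summary <- bills %>%
  group_by(species) %>%
  summarise(
    mean_bill = mean(bill_length_mm),
    se_bill   = sd(bill_length_mm) / sqrt(n())
  )

# Step 2: bars for the means, error bars for +/- 1 standard error
p <- ggplot(bill_summary, aes(x = species, y = mean_bill)) +
  geom_col() +
  geom_errorbar(
    aes(ymin = mean_bill - se_bill, ymax = mean_bill + se_bill),
    width = 0.2
  )

print(bill_summary)
```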
70
means and error bars as points
use geom_pointrange
adds means and errors as single layer
71
adding text annotations
use geom_text(label = )
put labels in data frame (separate or same as one used for plot)
72
saving plots
ggsave function
first argument = path and name of file
set file type by device = argument
73
multipanel plots with different visualisation on each panel
use external package - cowplot
make individual plots first and assign names
use plot_grid() to print in order
control number of rows (nrow) and columns (ncol)
rel_widths and rel_heights control plot widths and heights
74
help
help files for each package give examples of uses of functions
help.start() shows help files for each installed package:
- manuals
- reference
- packages
- search engine & key words
- miscellaneous material
- user manuals
- frequently asked questions
75
help files for specific functions
help(topic = ) only searches in loaded packages
or just put ? before the function name
76
sections of help files
description (function overview)
usage
arguments
details (how it behaves)
value (what object is returned)
references
see also (related functions)
examples
77
package vignettes
give account of a package's features
use vignette() function to view all available vignettes
vignette(package = ) or vignette(topic = , package = ) to be more specific
browseVignettes() opens them in browser