week 2 - data wrangling Flashcards

(16 cards)

1
Q

Statistics & Samples

A
  • Types of Data
    • Data = measurements of 1 or more variables made on a sample of individuals.
2
Q

Categorical Variables

A
  • Describe membership in a category/group.
  • Describe qualitative characteristics of individuals that do not correspond to a degree of diff. on a numerical scale.
  • Categorical variables = attribute/qualitative variables.
  • E.g. survival (alive or dead).
  • Nominal if the diff. categories = no inherent order.
  • Nominal = name.
  • Values of an ordinal categorical variable = ordered.
  • Magnitude of the diff betw. consecutive values = not known.
  • Ordinal = having an order
3
Q

Numerical Variables

A
  • When measurements of individuals = quantitative & have magnitude.
  • Numbers.
  • E.g. core body temp. (measured in degrees Celsius [°C])
  • Either continuous or discrete.
  • Continuous numerical data
    • Take on any real-number value within some range.
    • Betw any 2 values of a continuous variable, an infinite number of other values = possible.
    • In practice, continuous data = rounded to a predetermined number of digits, set for convenience.
  • Discrete numerical data
    • Data come in indivisible units (e.g. counts).
    • Often analyzed as though they = continuous, if large # of possible values.
  • Numbers might also be used to name categories → such data = still categorical, not numerical
  • Numerical data can be reduced to categorical data by grouping → result contains less info
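A minimal sketch of reducing numerical data to categorical data by grouping. The temperatures and the 38.0 °C cutoff are made-up assumptions for illustration, not values from the course or a clinical rule.

```python
# Hypothetical core body temperatures in °C (illustrative values only).
temps_c = [36.4, 36.9, 37.2, 38.1, 39.0]


def temp_category(temp_c, fever_cutoff=38.0):
    """Reduce a continuous measurement to a two-level category.

    The cutoff is an assumption chosen for the example.
    """
    return "fever" if temp_c >= fever_cutoff else "no fever"


categories = [temp_category(t) for t in temps_c]
print(categories)
```

After grouping, 38.1 and 39.0 both become "fever", so the size of the difference between them is lost. That is the sense in which the result contains less info.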
4
Q

Explanatory & Response Variables

A
  • Many studies aim to relate 1 variable to another by examining associations betw. variables & diffs. betw. groups.
  • Measuring an association = measuring a difference
  • Goal = to assess how well 1 of the variables (explanatory variable) predicts/affects the other variable (response variable).
  • Treatment variable = manipulated by the researcher → explanatory variable
  • Measured effect of the treatment → response variable.
  • Neither variable = manipulated by the researcher → association might still be described by the “effect” of 1 variable (explanatory) on the other (response) → not direct evidence for causation.
  • Independent variable (IV) = explanatory
  • Dependent variable (DV) = response
5
Q

Frequency Distributions

A
  • Diff. individuals in a sample = diff. measurements.
  • This variability → seen w/ a freq. dist.
  • Freq. of a specific measurement in a sample = # of observations having a particular value of the measurement.
  • Freq. dist shows how often each value of the variable occurs in the sample
  • Informs us about the dist of the variable in the pop. it came from.
  • Gives intuitive understanding of the variable.
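A small sketch of tabulating a frequency distribution. The sample values are hypothetical, and the name `litter_sizes` is only an illustrative assumption.

```python
from collections import Counter

# Hypothetical sample of a discrete variable (made-up values).
litter_sizes = [2, 3, 3, 4, 2, 3, 5, 3, 4, 2]

# Frequency = number of observations having each value.
freq = Counter(litter_sizes)

# Relative frequency = proportion of the sample with each value.
rel_freq = {value: count / len(litter_sizes) for value, count in freq.items()}

for value in sorted(freq):
    print(value, freq[value], rel_freq[value])
```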
6
Q

Probability Distribution

A
  • Distribution of a variable in the whole pop. = prob dist.
  • Real prob. dist. in nature = almost never known.
  • Researchers use theoretical prob. dists. to approx. the real prob. dist.
7
Q

Normal Distribution

A
  • Normal dist = “bell curve.”
  • Most important prob. dist. in stats.
8
Q

Describing Data - Sample Mean

A
  • Avg. of the measurements in the sample
  • Sum of all observations divided by # observations
  • x̄ (symbol)
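As a quick illustration, the sample mean can be computed directly from its definition. The measurements below are made up for the example.

```python
import statistics

# Hypothetical measurements (illustrative values only).
measurements = [4.0, 7.0, 5.0, 8.0, 6.0]

# Sum of all observations divided by the number of observations.
sample_mean = sum(measurements) / len(measurements)

# Cross-check against the standard library.
library_mean = statistics.mean(measurements)

print(sample_mean, library_mean)
```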
9
Q

Describing Data - Variance & Standard deviation (SD)

A
  • Used to measure the spread of a dist.
  • How far from the avg. the observations are
  • SD = large → most observations = far from mean
  • SD = small → most observations = close to mean
  • Calc. from variance
  • SD = square root of variance
  • SD = easier to interpret than variance b/c = same units as variable
  • Deviation from mean = diff. betw. measurement & mean
  • -ve deviations cancel +ve deviations → avg. of raw deviations = 0
  • → Need to avg. squared deviations (sample variance divides by n − 1)
  • Squaring → deviations above & below the mean both contribute +ly to variance
  • SD = never -ve & same units as OG observations (variance = squared units)
  • SD = connected to freq. dist.: if freq. dist. = bell-shaped, ~⅔ of observations lie within 1 SD of the mean & ~95% lie within 2 SD
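The steps above can be sketched as follows, using made-up measurements. The n − 1 divisor is the usual sample-variance convention and is stated here as an assumption about which formula the course uses.

```python
import math

# Hypothetical measurements (illustrative values only).
measurements = [4.0, 7.0, 5.0, 8.0, 6.0]
n = len(measurements)
mean = sum(measurements) / n

# Raw deviations: negative and positive values cancel, so they sum to 0.
deviations = [x - mean for x in measurements]

# Squared deviations are never negative, so nothing cancels.
squared_deviations = [d ** 2 for d in deviations]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(squared_deviations) / (n - 1)

# SD = square root of variance, back in the units of the measurements.
sd = math.sqrt(variance)

print(sum(deviations), variance, sd)
```

Here the deviations are -2, 1, -1, 2, 0, so the squared deviations sum to 10 and the variance is 10 / 4 = 2.5.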
10
Q

Describing Data - Coefficient of Variation (CV)

A
  • Expresses the SD as a % of the mean: CV = 100% × SD / mean
  • High CV = more variability
  • Low CV = individuals = more similar to one another, relative to mean
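A minimal sketch of the CV calculation. The helper name `coefficient_of_variation` and both groups are hypothetical, chosen so the two groups share a mean but differ in spread.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample SD expressed as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


# Two made-up groups with the same mean (50) but different spread.
group_a = [48, 50, 52]   # SD = 2
group_b = [30, 50, 70]   # SD = 20

cv_a = coefficient_of_variation(group_a)
cv_b = coefficient_of_variation(group_b)
print(cv_a, cv_b)
```

The CV only makes sense for variables with a meaningful, non-zero mean, since the mean is the divisor.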
11
Q

Median

A
  • Middle observation in a data set
  • Found by sorting from smallest to largest → median divides the data set into 2 halves
  • Even # of observations → avg. of the 2 middle values
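The procedure can be sketched as a short function. This is an illustrative implementation with made-up inputs; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Return the middle value of the sorted data.

    With an even number of observations, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([5, 1, 3])       # sorted: 1, 3, 5
even_result = median([4, 1, 3, 2])   # sorted: 1, 2, 3, 4
print(odd_result, even_result)
```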
12
Q

IQR

A
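  • Quartiles divide the sorted data into quarters
  • Q1 = middle value of the measurements lying below the median
  • Q2 = median
  • Q3 = middle value of the measurements lying above the median
  • IQR (interquartile range) = Q3 - Q1 → span of the middle half of the data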
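A sketch of the "median of each half" approach described above. One assumption to flag: when the number of observations is odd, this version leaves the median out of both halves. Other textbooks and software use slightly different quartile conventions, so results may differ a little from a stats package.

```python
def median(values):
    """Middle value of the sorted data (average of two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumes at least 2 observations. For odd n, the median itself is
    excluded from both halves (an assumed convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]
    return median(lower), median(ordered), median(upper)


# Made-up data for illustration.
q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```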
13
Q

Box Plots

A
  • Shows median & IQR
  • Lower & upper edges of the box = Q1 & Q3 → span of the box = IQR
  • Line inside the box = median
14
Q

Measuring Spread & Location Comparison - Mean vs median

A
  • Median = middle measurement of a dist.
  • Mean = center of gravity of all points, including outliers → balance point
  • Mean = sensitive to extreme outliers
  • Median = largely unaffected by outliers
  • Mean = displaced from the location of the typical measurement when freq. dist. = strongly skewed or has extreme values
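A small illustration of this sensitivity, using made-up numbers: adding a single extreme value shifts the mean a lot and the median barely at all.

```python
import statistics

# Hypothetical measurements (illustrative values only).
typical = [10, 11, 12, 13, 14]
with_outlier = typical + [100]   # one extreme observation added

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)
```

Before the outlier, mean and median are both 12. Afterwards the median moves only to 12.5, while the mean jumps to about 26.7, well above every typical measurement.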
15
Q

Measuring Spread & Location Comparison - SD vs. IQR

A
  • SD = calc. from squared deviations → more sensitive to extreme observations than IQR
  • IQR = better indicator of the spread of the main part of the dist. when data = strongly skewed or have extreme values
  • SD reflects variation among all data points
16
Q

Estimating w/ Uncertainty

A
  • Estimation = process of inferring a pop parameter from sample data.
  • Every estimate has a sampling dist. → prob. dist. of all the values of the estimate that might be obtained under random sampling w/ a given sample size.
  • Standard error (SE) of an estimate = SD of its sampling dist.
  • SE measures precision.
  • Smaller SE = more precise estimate
  • SE & CI assume that sampling = random
  • SE of estimate declines w/ increase in sample size
  • CI = range of values calc. from sample data → likely to contain within its span the value of the target parameter.
  • 95% CIs calc. from independent random samples will include the value of the parameter ~19 times out of 20
  • 2SE rule = rough approx. to the 95% CI for a mean: mean ± 2 SE
  • Error bars on graphs can rep. SEs or CIs
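A rough sketch of the SE of a mean and the 2SE rule. The cards do not give the formula, so the standard one (SE of the mean = SD / √n) is used here as an assumption, and the sample values are made up.

```python
import math
import statistics

# Hypothetical random sample (illustrative values only).
sample = [4.0, 7.0, 5.0, 8.0, 6.0]
n = len(sample)

mean = statistics.mean(sample)
sd = statistics.stdev(sample)   # sample SD (n - 1 divisor)

# Standard error of the mean: SD of the sampling dist. of the mean,
# estimated as SD / sqrt(n). It shrinks as the sample size grows.
se = sd / math.sqrt(n)

# 2SE rule: rough approximation to the 95% CI for the mean.
ci_low = mean - 2 * se
ci_high = mean + 2 * se

print(mean, se, ci_low, ci_high)
```

With a sample this small the 2SE rule is only a crude approximation. It is shown here to make the arithmetic concrete, not as a recommended interval for n = 5.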