Statistical Foundations + Analysis Flashcards

(30 cards)

1
Q

What is the difference between descriptive and inferential statistics?

A

Descriptive statistics summarize data through measures like mean, median, and standard deviation. Inferential statistics use sample data to make estimates or test hypotheses about a larger population.

2
Q

Explain what a correlation coefficient measures.

A

Correlation coefficient (Pearson's r) measures the strength and direction of the linear relationship between two variables, ranging from -1 to +1. Values near +1 indicate strong positive correlation, values near -1 indicate strong negative correlation, and values near 0 indicate little or no linear relationship (a nonlinear relationship may still exist).
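
As a minimal sketch of the idea, the coefficient can be computed directly from deviations around each mean. The function name and the data below are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations; denominator: root of the
    # product of the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]          # hypothetical study hours
scores = [52, 55, 61, 64, 70]    # hypothetical exam scores
r_positive = pearson_r(hours, scores)
r_negative = pearson_r(hours, [-s for s in scores])

# A perfect U-shape: y depends entirely on x, but not linearly, so r is 0
xs = [-2, -1, 0, 1, 2]
r_nonlinear = pearson_r(xs, [x * x for x in xs])
```

The last case shows why r near 0 only rules out a linear relationship.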

3
Q

What is an outlier and how should you handle it?

A

An outlier is a data point significantly different from others. Handle by: investigating the cause, validating data quality, using robust statistical methods, removing if erroneous, or analyzing separately if valid and meaningful.

4
Q

Explain what a distribution is and why it matters in data analysis.

A

Distribution describes how data values are spread across a range and how often each occurs. Understanding the distribution reveals patterns, helps choose appropriate statistical tests, and identifies skewness or kurtosis that can affect analysis validity.

5
Q

What is the normal distribution and its key properties?

A

Normal distribution is a symmetric, bell-shaped curve where mean, median, and mode are equal. Properties: about 68% of data fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3. Many statistical tests assume normality.
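
The 68/95/99.7 figures can be checked from the normal cumulative distribution function. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```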

6
Q

Explain the difference between mean, median, and mode.

A

Mean is arithmetic average of all values. Median is middle value when sorted. Mode is most frequently occurring value. Mean is affected by outliers, median is robust, mode useful for categorical data.
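
A short illustration of how one extreme value moves the mean but barely moves the median, using the standard `statistics` module. The salary figures are hypothetical.

```python
import statistics

salaries = [40, 42, 42, 45, 48]      # hypothetical values, in thousands
with_outlier = salaries + [400]      # same data plus one extreme value

mean_before = statistics.mean(salaries)          # 43.4
mean_after = statistics.mean(with_outlier)       # about 102.83
median_before = statistics.median(salaries)      # 42
median_after = statistics.median(with_outlier)   # 43.5
mode_after = statistics.mode(with_outlier)       # 42
```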

7
Q

What is standard deviation and how does it relate to variance?

A

Standard deviation measures spread of data around mean. Variance is square of standard deviation. Higher standard deviation indicates wider spread from mean. Both measure data variability and dispersion.
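
The relationship can be seen numerically with the standard `statistics` module. The data set is hypothetical; note that the population versions divide by n while the sample versions divide by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5

# Population versions divide by n
pop_variance = statistics.pvariance(data)   # 4
pop_std = statistics.pstdev(data)           # 2.0

# Sample versions divide by n - 1, so they come out slightly larger
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)
```

In both cases the standard deviation is the square root of the variance, which puts it back in the same units as the data.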

8
Q

Explain what a percentile is and its practical use.

A

Percentile indicates value below which a percentage of data falls. 90th percentile means 90% of data is below this value. Used for ranking, identifying outliers, setting performance benchmarks, and understanding data distribution.

9
Q

What is skewness and how does it affect data interpretation?

A

Skewness measures asymmetry of distribution. Positive (right) skew has a longer tail on the right, with extreme high values; negative (left) skew has a longer tail on the left, with extreme low values. Skew typically pulls the mean toward the tail, away from the median, and influences which statistical tests are appropriate.

10
Q

Explain the concept of kurtosis.

A

Kurtosis measures how heavy the tails of a distribution are relative to a normal distribution; it is often loosely described as peakedness. High kurtosis indicates heavy tails (more extreme values). Low kurtosis indicates light tails (fewer extreme values). Affects probability of extreme events.
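
Skewness and kurtosis are both standardized moments, so one small helper covers them. This is a sketch using the population (divide by n) form; statistical packages often apply bias corrections, so their numbers may differ slightly. Function names and data are hypothetical.

```python
def standardized_moment(data, k):
    """k-th standardized central moment: the mean of z-scores raised to the k-th power."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return sum((x - mean) ** k for x in data) / n / variance ** (k / 2)

def skewness(data):
    return standardized_moment(data, 3)

def excess_kurtosis(data):
    # Subtract 3 so that a normal distribution scores 0
    return standardized_moment(data, 4) - 3

symmetric = [1, 2, 3, 4, 5]           # evenly spread, no tails
right_tailed = [1, 1, 2, 2, 3, 20]    # hypothetical data with one extreme high value
```

Here `skewness(symmetric)` is 0 and `excess_kurtosis(symmetric)` is negative (light tails), while the right-tailed data gives positive values for both.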

11
Q

What is probability and how does it relate to statistics?

A

Probability measures the likelihood of an event occurring, on a scale from 0 (impossible) to 1 (certain). Statistics uses probability theory to make inferences from data. Probability is the theoretical foundation; statistics applies it to real data.

12
Q

Explain the central limit theorem and its importance.

A

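Central limit theorem states that the distribution of sample means approaches a normal distribution as sample size grows, regardless of the underlying population distribution (given independent observations and finite variance). Enables use of parametric tests on means even with non-normal data, provided the sample is large enough.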
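
A small simulation can make this concrete. The sketch below draws from an exponential distribution (strongly right-skewed, mean 1, standard deviation 1) and looks at the means of many samples. The sample size of 50 and the 2000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

sample_size = 50
means = [sample_mean(sample_size) for _ in range(2000)]

center = statistics.fmean(means)   # expected to be close to the population mean, 1.0
spread = statistics.stdev(means)   # expected to be close to 1 / sqrt(50), about 0.141
```

Plotting `means` as a histogram should show a roughly bell-shaped curve even though the individual draws are far from normal.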

13
Q

What is a confidence interval and how is it interpreted?

A

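Confidence interval is a range of values, computed from sample data, that is intended to contain the true population parameter. The confidence level (e.g., 95%) describes the method: across repeated samples, about 95% of intervals built this way would contain the parameter. Wider intervals indicate more uncertainty; narrower intervals indicate more precise estimates.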
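
One common construction for a mean is estimate ± critical value × standard error. The sketch below uses the normal approximation; for a sample this small a t-based critical value would usually be preferred, so treat it as an illustration rather than a recommended procedure. The function name and measurements are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error

measurements = [9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3, 10.1, 9.9]  # hypothetical
low_95, high_95 = mean_confidence_interval(measurements, 0.95)
low_99, high_99 = mean_confidence_interval(measurements, 0.99)
```

Asking for more confidence (99% instead of 95%) widens the interval, which is the trade-off between confidence and precision.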

14
Q

Explain statistical significance and p-values.

A

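Statistical significance indicates a result is unlikely to be due to chance alone. P-value is the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true; it is not the probability that the null hypothesis is true. P < 0.05 is a common threshold for statistical significance. Statistical significance doesn’t equal practical significance.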
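
As a worked example under assumed numbers, suppose a coin lands heads 60 times in 100 flips and the null hypothesis is that the coin is fair. The p-value is the probability of a result at least that extreme under the null, which can be computed exactly from the binomial distribution. The function name and the experiment are hypothetical.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical experiment: 60 heads in 100 flips. H0: the coin is fair.
one_sided_p = binomial_tail(100, 60)
# The fair-coin distribution is symmetric, so doubling gives the two-sided p-value
two_sided_p = min(1.0, 2 * one_sided_p)
```

The one-sided value comes out near 0.028 and the two-sided value near 0.057, so the same data fall on opposite sides of the 0.05 threshold depending on which alternative hypothesis was chosen in advance.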

15
Q

What is a hypothesis test and the null hypothesis?

A

Hypothesis test evaluates evidence for/against a claim about population. Null hypothesis (H0) assumes no effect or difference. Alternative hypothesis (H1) assumes effect exists. Tests determine whether to reject null.

16
Q

Explain Type I and Type II errors.

A

Type I error (false positive) rejects a true null hypothesis. Type II error (false negative) fails to reject a false null hypothesis. Significance level (alpha) is the probability of a Type I error. Beta is the probability of a Type II error; power is 1 - beta.

17
Q

What is the difference between correlation and causation?

A

Correlation describes relationship between variables but doesn’t imply causation. Causation means one variable directly affects another. Confounding variables can create correlation without causation.

18
Q

Explain what regression analysis is used for.

A

Regression models the relationship between a dependent variable and one or more independent variables. Used for forecasting, quantifying relationships, and understanding variable importance.

19
Q

What is linear regression and its key assumptions?

A

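Linear regression models a linear relationship between a dependent variable and one or more predictors. Assumes: linearity, independence of errors, homoscedasticity (constant error variance), normality of errors, and no strong multicollinearity among predictors. Violations can bias estimates or invalidate standard errors and p-values.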

20
Q

Explain the concept of R-squared and its interpretation.

A

R-squared measures proportion of variance in the dependent variable explained by the model, ranging 0 to 1. Higher R-squared indicates a closer fit to the data. However, high R-squared doesn’t guarantee good predictions or causal relationships.
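
A minimal sketch of where the number comes from: fit a least-squares line, then compare the residual sum of squares with the total sum of squares. The function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ad_spend = [1, 2, 3, 4, 5]            # hypothetical predictor
sales = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical response
slope, intercept = fit_line(ad_spend, sales)
fit_quality = r_squared(ad_spend, sales, slope, intercept)
```

For this nearly linear data the slope is about 1.99 and R-squared is above 0.99. That describes fit on these five points only, not how well the line would predict new data.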

21
Q

What is a histogram and when would you use one?

A

Histogram shows distribution of continuous numeric data in bins/intervals. Use to identify distribution shape, detect outliers, find gaps, or check normality. Different from bar chart which shows categorical data.

22
Q

Explain the concept of percentiles and quartiles.

A

Percentiles divide ordered data into 100 equal parts; nth percentile means n% of data falls below it. Quartiles are the three percentiles that split the data into four equal parts: Q1 (25th), Q2/median (50th), Q3 (75th). Used for understanding distribution and identifying outliers.

23
Q

What is the interquartile range (IQR) and its uses?

A

IQR is the difference Q3 - Q1, capturing the middle 50% of data. Use to identify outliers: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are typically considered outliers. Robust measure, largely unaffected by extreme values.
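
The 1.5 × IQR rule can be sketched with `statistics.quantiles` from the standard library. Quartile conventions differ between tools, so the exact fences may vary slightly elsewhere. The function name and data are hypothetical.

```python
import statistics

def iqr_outliers(data):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

response_times = [2, 4, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data
flagged = iqr_outliers(response_times)          # [30]
```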

24
Q

Explain the difference between population and sample.

A

Population is entire group being studied. Sample is subset of population. Population parameters (μ, σ) are fixed but usually unknown; sample statistics (x̄, s) vary from sample to sample. Sample statistics are used to estimate population parameters.

25
Q

What is sampling bias and how does it affect analysis?

A

Sampling bias occurs when sample doesn't represent population accurately. Types include selection bias, non-response bias, and self-selection bias. Biased samples lead to incorrect conclusions that don't generalize.

26
Q

Explain the concept of effect size and its importance.

A

Effect size measures magnitude of difference or relationship, independent of sample size. Complements p-values by indicating practical significance. A large effect size indicates a meaningful difference; a small effect size may be practically unimportant even if statistically significant.
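
One widely used effect size for comparing two group means is Cohen's d: the difference in means divided by the pooled standard deviation. This is a sketch of that one measure (others exist, such as correlation-based effect sizes). The scores are hypothetical.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

treatment = [23, 25, 28, 30, 32]   # hypothetical scores
control = [20, 22, 25, 27, 29]     # same spread, shifted down by 3 points
d = cohens_d(treatment, control)
```

Here d is about 0.82, which is usually read as a large effect by Cohen's rough guidelines (0.2 small, 0.5 medium, 0.8 large). The value does not depend on how many observations were collected.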

27
Q

What is Bayes' theorem and when would you use it?

A

Bayes' theorem calculates posterior probability given prior probability and evidence. Used for classification, spam filtering, diagnostic testing. Formula: P(A|B) = P(B|A) × P(A) / P(B).
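
A worked diagnostic-testing example, with P(B) expanded by the law of total probability. The prevalence, sensitivity, and specificity figures are hypothetical, as are the function and parameter names.

```python
def posterior_probability(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem, with P(B) = P(B|A)P(A) + P(B|not A)P(not A)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical diagnostic test: 1% prevalence, 95% sensitivity, 90% specificity
p_disease_given_positive = posterior_probability(
    prior=0.01,                 # P(disease)
    likelihood=0.95,            # P(positive | disease)
    false_positive_rate=0.10,   # P(positive | no disease) = 1 - specificity
)
```

The posterior is only about 0.088: because the condition is rare, most positive results come from the much larger healthy group.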

28
Q

Explain the concept of sensitivity and specificity.

A

Sensitivity (true positive rate) measures proportion of actual positives correctly identified. Specificity (true negative rate) measures proportion of actual negatives correctly identified. Trade-off exists between them.

29
Q

What is precision and recall in classification?

A

Precision is proportion of positive predictions that were correct. Recall is proportion of actual positives correctly identified. Both important; balance between them depends on use case (e.g., cancer detection vs. spam filtering).
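
These rates all come from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts; note that recall is the same quantity as sensitivity.

```python
def classification_metrics(tp, fp, fn, tn):
    """Rates derived from the four cells of a binary confusion matrix."""
    return {
        "precision": tp / (tp + fp),     # correct share of positive predictions
        "recall": tp / (tp + fn),        # same quantity as sensitivity
        "specificity": tn / (tn + fp),   # correct share of actual negatives
    }

# Hypothetical counts for a spam filter evaluated on 1,000 messages
metrics = classification_metrics(tp=80, fp=20, fn=40, tn=860)
```

With these counts precision is 0.80 and recall is about 0.67: the filter is usually right when it flags a message but misses a third of the spam.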

30
Q

Explain the concept of power in hypothesis testing.

A

Power is probability of rejecting null hypothesis when it's false (detecting a true effect). Increases with sample size, effect size, and significance level. A common target is 80% power. Power = 1 - beta, where beta is the probability of a Type II error.
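
For the simplest case, a one-sided, one-sample z-test with known standard deviation, power has a closed form. The sketch below assumes that specific test; other tests need other formulas or simulation. The function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test with known standard deviation.

    effect_size is the true mean shift in standard-deviation units.
    """
    z = NormalDist()
    critical_value = z.inv_cdf(1 - alpha)
    # Under the alternative, the test statistic is centered at effect_size * sqrt(n)
    return 1 - z.cdf(critical_value - effect_size * math.sqrt(n))

power_small_n = z_test_power(effect_size=0.5, n=10)
power_large_n = z_test_power(effect_size=0.5, n=25)
```

With an effect of half a standard deviation, 10 observations give power a little under 0.5, while 25 observations reach roughly the 80% target.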