What is the difference between descriptive and inferential statistics?
Descriptive statistics summarize data through measures like mean, median, standard deviation. Inferential statistics use sample data to make predictions or test hypotheses about larger populations.
Explain what a correlation coefficient measures.
Correlation coefficient (Pearson's r) measures the strength and direction of the linear relationship between two variables, ranging from -1 to +1. Values near +1 indicate strong positive correlation, near -1 strong negative correlation, and near 0 no linear relationship (a non-linear relationship may still exist).
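A minimal sketch of the calculation, using made-up data and a hypothetical helper named pearson_r. It contrasts a perfectly linear relationship with a strong but non-linear one, for which the coefficient comes out as zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]           # hypothetical data
scores = [2, 4, 6, 8, 10]         # exactly 2 * hours
r_linear = pearson_r(hours, scores)

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]         # y depends on x, but not linearly
r_curved = pearson_r(xs, ys)

print(r_linear)   # 1.0
print(r_curved)   # 0.0
```

The second result shows why a coefficient near 0 should be read as "no linear relationship" rather than "no relationship".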
What is an outlier and how should you handle it?
An outlier is a data point significantly different from others. Handle by: investigating the cause, validating data quality, using robust statistical methods, removing if erroneous, or analyzing separately if valid and meaningful.
Explain what a distribution is and why it matters in data analysis.
Distribution describes how data values spread across a range. Understanding distribution reveals patterns, helps choose appropriate statistical tests, and identifies skewness or kurtosis that affects analysis validity.
What is the normal distribution and its key properties?
Normal distribution is a symmetric, bell-shaped curve where mean, median, and mode are equal. Properties: about 68% of data falls within 1 standard deviation of the mean, 95% within 2, 99.7% within 3. Many statistical tests assume normality.
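The 68/95/99.7 figures can be checked against the standard normal distribution. The sketch below uses NormalDist from Python's statistics module; the helper name coverage_within is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: round(coverage_within(k), 4) for k in (1, 2, 3)}
print(coverage)   # {1: 0.6827, 2: 0.9545, 3: 0.9973}
```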
Explain the difference between mean, median, and mode.
Mean is the arithmetic average of all values. Median is the middle value when sorted. Mode is the most frequently occurring value. Mean is affected by outliers; median is robust to them; mode is useful for categorical data.
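A small illustration with invented salary figures, showing how a single extreme value moves the mean much more than the median or mode.

```python
from statistics import mean, median, mode

salaries = [40, 42, 42, 45, 48]        # hypothetical, in thousands
with_outlier = salaries + [400]        # one extreme value added

before = (mean(salaries), median(salaries), mode(salaries))
after = (mean(with_outlier), median(with_outlier), mode(with_outlier))

print(before)   # mean 43.4, median 42, mode 42
print(after)    # mean rises to about 102.8; median 43.5, mode 42
```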
What is standard deviation and how does it relate to variance?
Standard deviation measures spread of data around mean. Variance is square of standard deviation. Higher standard deviation indicates wider spread from mean. Both measure data variability and dispersion.
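A quick check of the relationship on made-up numbers. Python's statistics module offers population versions (dividing by n) and sample versions (dividing by n - 1); both pairs satisfy variance = standard deviation squared.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values with mean 5

pop_var = pvariance(data)     # population variance, divides by n
pop_sd = pstdev(data)         # population standard deviation
sample_var = variance(data)   # sample variance, divides by n - 1
sample_sd = stdev(data)       # sample standard deviation

print(pop_var, pop_sd)                    # variance 4, standard deviation 2
print(abs(sample_sd ** 2 - sample_var))   # effectively zero
```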
Explain what a percentile is and its practical use.
Percentile indicates value below which a percentage of data falls. 90th percentile means 90% of data is below this value. Used for ranking, identifying outliers, setting performance benchmarks, and understanding data distribution.
What is skewness and how does it affect data interpretation?
Skewness measures asymmetry of distribution. Positive skew has a longer tail on the right (extreme high values); negative skew has a longer tail on the left (extreme low values). Skew typically pulls the mean toward the tail, away from the median, and influences the choice of statistical tests.
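One common way to quantify this is the moment-based coefficient (third central moment divided by variance to the power 1.5). The sketch below assumes that definition; other estimators with small-sample corrections exist. Data and the function name are illustrative.

```python
from statistics import fmean, median

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # hypothetical, one large value
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)                   # True: positive skew
print(skewness(symmetric))                          # 0.0
print(fmean(right_tailed), median(right_tailed))    # mean 4.75 sits above median 3
```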
Explain the concept of kurtosis.
Kurtosis measures how heavy the tails of a distribution are, often loosely described as peakedness. High kurtosis indicates heavy tails (more extreme values); low kurtosis indicates light tails. Affects probability of extreme events.
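A sketch using excess kurtosis (fourth central moment over squared variance, minus 3, so that a normal distribution scores 0). This is one convention among several; the data sets are invented to contrast light and heavy tails.

```python
from statistics import fmean

def excess_kurtosis(values):
    """Fourth central moment over variance squared, minus 3 (normal = 0)."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m4 = fmean([(v - m) ** 4 for v in values])
    return m4 / m2 ** 2 - 3

flat = list(range(1, 11))            # evenly spread values, light tails
heavy = [0] * 18 + [-10, 10]         # mostly zeros with two extreme values

print(excess_kurtosis(flat) < 0)     # True: lighter tails than normal
print(excess_kurtosis(heavy))        # 7.0: much heavier tails than normal
```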
What is probability and how does it relate to statistics?
Probability measures the likelihood of an event occurring, expressed as a number between 0 and 1. Statistics uses probability theory to make inferences from data. Probability is the theoretical foundation; statistics is applied to real data.
Explain the central limit theorem and its importance.
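Central limit theorem states that the distribution of sample means approaches a normal distribution as sample size grows, regardless of the underlying population distribution, provided observations are independent and the population has finite variance. This often justifies parametric tests on means even with non-normal data.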
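A small simulation can make this visible. The sketch below draws from an exponential population (strongly right-skewed) and compares the skewness of raw draws with the skewness of means of 50 draws. The sample size, repeat count, and seed are arbitrary choices for illustration.

```python
import random
from statistics import fmean

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

rng = random.Random(0)      # fixed seed so the run is repeatable
SAMPLE_SIZE = 50            # illustrative choice
REPEATS = 2000

# Population: exponential with mean 1.
raw_draws = [rng.expovariate(1.0) for _ in range(REPEATS)]
sample_means = [
    fmean([rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(REPEATS)
]

print("skewness of raw draws:   ", round(skewness(raw_draws), 2))
print("skewness of sample means:", round(skewness(sample_means), 2))
print("mean of sample means:    ", round(fmean(sample_means), 3))
```

The sample means should come out far less skewed than the raw draws and centred near the population mean of 1, which is the behaviour the theorem describes.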
What is a confidence interval and how is it interpreted?
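Confidence interval is a range of values, computed from sample data, that is likely to contain the true population parameter. Confidence level (e.g., 95%) means about 95% of intervals built this way from repeated samples would contain the parameter. Wider intervals indicate more uncertainty; narrower intervals indicate more precise estimates.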
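The repeated-sampling interpretation can be sketched with a simulation. The code assumes a normal-approximation interval (z-based), which is reasonable for the sample size of 100 used here; small samples would normally call for a t-based interval instead. The population parameters and the helper name mean_ci are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

def mean_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * stdev(sample) / math.sqrt(len(sample))
    centre = fmean(sample)
    return centre - margin, centre + margin

rng = random.Random(1)               # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 50.0, 10.0      # hypothetical population parameters
TRIALS = 1000

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(100)]
    low, high = mean_ci(sample)
    hits += low <= TRUE_MEAN <= high   # did this interval capture the true mean?

coverage = hits / TRIALS
print("share of intervals containing the true mean:", coverage)
```

The printed share should land close to 0.95, and raising the confidence level widens each interval.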
Explain statistical significance and p-values.
Statistical significance indicates a result is unlikely to be due to chance alone. P-value is the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true. P < 0.05 is a common threshold for statistical significance. Doesn’t equal practical significance.
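As a worked example of "at least as extreme", the sketch below computes an exact two-sided p-value for a coin-flip experiment under the null hypothesis that the coin is fair. The function name and the flip counts are made up for illustration.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """Exact two-sided p-value for the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every outcome at least as far from the expected number of
    # heads as the observed one, weighted by its probability under the null.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme / 2 ** flips

p_60 = two_sided_binomial_p(60, 100)   # 60 heads in 100 flips
p_65 = two_sided_binomial_p(65, 100)   # 65 heads in 100 flips

print(round(p_60, 4))   # just above 0.05, so not significant at that threshold
print(round(p_65, 4))   # well below 0.05
```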
What is a hypothesis test and the null hypothesis?
Hypothesis test evaluates evidence for/against a claim about population. Null hypothesis (H0) assumes no effect or difference. Alternative hypothesis (H1) assumes effect exists. Tests determine whether to reject null.
Explain Type I and Type II errors.
Type I error (false positive) rejects true null hypothesis. Type II error (false negative) fails to reject false null hypothesis. Significance level (alpha) controls Type I risk. Power (1-beta) relates to Type II risk.
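The link between alpha, power, and sample size can be sketched for the simplest case: a one-sided z-test on a mean with known standard deviation. That setting, the function name, and the effect sizes are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Probability of rejecting H0 in a one-sided z-test when the true
    mean exceeds the null mean by `effect` (sigma known)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)          # rejection threshold
    shift = effect * math.sqrt(n) / sigma           # true shift in z units
    return 1 - std_normal.cdf(z_crit - shift)

# With no real effect, the rejection rate is the Type I error rate (alpha).
type_i_rate = z_test_power(effect=0.0, sigma=1.0, n=30)

# With a real effect, power = 1 - beta, and it grows with sample size.
power_small_n = z_test_power(effect=0.5, sigma=1.0, n=10)
power_large_n = z_test_power(effect=0.5, sigma=1.0, n=50)

print(round(type_i_rate, 3))                 # 0.05
print(power_small_n < power_large_n)         # True
```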
What is the difference between correlation and causation?
Correlation describes relationship between variables but doesn’t imply causation. Causation means one variable directly affects another. Confounding variables can create correlation without causation.
Explain what regression analysis is used for.
Regression models the relationship between variables, predicting a dependent variable from one or more independent variables. Used for forecasting, identifying relationships, and understanding variable importance.
What is linear regression and its key assumptions?
Linear regression models a linear relationship between variables. Assumes: linearity, independence of errors, homoscedasticity (constant variance), normality of errors, and no multicollinearity (when there are multiple predictors). Violations affect model validity.
Explain the concept of R-squared and its interpretation.
R-squared measures proportion of variance explained by model, ranging 0 to 1. Higher R-squared indicates better fit. However, high R-squared doesn’t guarantee good predictions or causal relationships.
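A minimal sketch of a one-predictor least-squares fit and its R-squared, computed as 1 minus the ratio of residual to total sum of squares. The data and the helper names fit_line and r_squared are hypothetical.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = fmean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]                  # hypothetical predictor
ys = [2.1, 3.9, 6.2, 7.8, 10.1]       # hypothetical response, roughly 2 * x

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(round(slope, 2), round(intercept, 2))   # 1.99 0.05
print(r2 > 0.99)                              # True
```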
What is a histogram and when would you use one?
Histogram shows distribution of continuous numeric data in bins/intervals. Use to identify distribution shape, detect outliers, find gaps, or check normality. Different from bar chart which shows categorical data.
Explain the concept of percentiles and quartiles.
Percentiles divide ordered data into 100 equal parts; nth percentile means n% of data falls below it. Quartiles are specific percentiles: Q1 (25th), Q2/median (50th), Q3 (75th). Used for understanding distribution and identifying outliers.
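The correspondence between quartiles and percentiles can be checked with statistics.quantiles. The sketch uses the "inclusive" interpolation method; other conventions give slightly different cut points on small data sets. The data is invented.

```python
from statistics import quantiles

data = [10, 12, 13, 15, 16, 18, 19, 21, 60]   # hypothetical values

# n=4 returns the three quartile cut points; n=100 returns 99 percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
percentiles = quantiles(data, n=100, method="inclusive")
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(q1, q2, q3)       # 13.0 16.0 19.0
print(p25, p50, p75)    # 13.0 16.0 19.0
```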
What is the interquartile range (IQR) and its uses?
IQR is the range between Q1 and Q3 (Q3 - Q1), capturing the middle 50% of data. Use to identify outliers: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are typically considered outliers. Robust measure, largely unaffected by extreme values.
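The 1.5 × IQR rule can be sketched as follows, assuming the "inclusive" quartile method from Python's statistics module. The helper name iqr_outliers and the data are illustrative.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [10, 12, 13, 15, 16, 18, 19, 21, 60]   # hypothetical values
outliers = iqr_outliers(data)
print(outliers)   # [60]
```

Here Q1 is 13 and Q3 is 19, so the fences sit at 4 and 28 and only 60 falls outside them.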
Explain the difference between population and sample.
Population is the entire group being studied. Sample is a subset of the population. Population parameters (μ, σ) are fixed; sample statistics (x̄, s) vary from sample to sample. Sample statistics are used to estimate population parameters.
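A short simulation of the distinction, using an invented population of 10,000 values. The population mean is computed once and stays fixed, while the mean of each random sample of 50 differs.

```python
import random
from statistics import fmean

rng = random.Random(42)   # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values spread between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(10_000)]
mu = fmean(population)    # population parameter: a single fixed number

# Five independent samples of size 50, each giving its own estimate of mu.
sample_means = [fmean(rng.sample(population, 50)) for _ in range(5)]

print("population mean:", round(mu, 2))
print("sample means:   ", [round(m, 2) for m in sample_means])
```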