What is statistics and what are its two types?
The science of collecting and analyzing data for drawing conclusions and making decisions
1. Descriptive statistics
2. Inferential statistics
What is descriptive statistics?
It is a method of organizing, summarizing and presenting data in a convenient and informative way
- For example through graphs or numbers
What is inferential statistics?
What is the difference between probability and statistics?
Probability is deductive: given what is in the box (the population), you reason about what is likely to end up in your hand (the sample)
Statistics is inductive: given what is in your hand (the sample), you infer what is likely to be in the box (the population)
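The box-and-hand contrast can be sketched in a few lines of Python. This is only an illustration: the box contents (10 balls, 4 red), the hand size (3) and the observed hand (2 red) are made-up numbers, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

# Probability (deductive): the box is known, reason about the hand.
# Assumed box: 10 balls, 4 of them red. Draw 3 without replacement.
box_size, box_red, hand_size = 10, 4, 3

def prob_red_in_hand(k):
    """Hypergeometric probability of exactly k red balls in the hand."""
    ways = comb(box_red, k) * comb(box_size - box_red, hand_size - k)
    return Fraction(ways, comb(box_size, hand_size))

p_two_red = prob_red_in_hand(2)

# Statistics (inductive): the hand is known, reason about the box.
# Assumed hand: 3 balls drawn, 2 of them red.
observed_red = 2
estimated_red_share = Fraction(observed_red, hand_size)

print(p_two_red)            # 3/10
print(estimated_red_share)  # 2/3, a point estimate of the red share in the box
```

The first half computes an exact answer from a known box; the second half only produces an estimate, which is why statistics attaches uncertainty to its conclusions.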
Qualitative (categorical) data representation vs quantitative data representation
Qualitative data representation means data is grouped into non-numerical and descriptive categories, and is then used to compare categories or proportions. Ex: Car colors
Quantitative data representation involves data that includes numbers and measurable quantities, and is used to analyze distributions, patterns or correlations. Ex: Car speeds
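A small sketch of the two kinds of summary, using the car examples above. The colors and speeds are made-up values for illustration only.

```python
from collections import Counter
from statistics import mean, median, stdev

# Made-up observations, for illustration only.
car_colors = ["red", "blue", "red", "white", "blue", "red"]
car_speeds = [62.0, 70.5, 55.0, 80.0, 66.5, 72.0]

# Qualitative data: count categories and compare proportions.
color_counts = Counter(car_colors)
color_shares = {color: n / len(car_colors) for color, n in color_counts.items()}

# Quantitative data: summarize the distribution with numbers.
speed_summary = {
    "mean": mean(car_speeds),
    "median": median(car_speeds),
    "stdev": stdev(car_speeds),  # sample standard deviation
}

print(color_shares)
print(speed_summary)
```

Averaging colors would be meaningless, while counting distinct speeds says little; the type of data decides which summaries make sense.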
Common tools for qualitative data representation
Common tools for quantitative data representation
What are three good practices when presenting data?
What are 10 good practices when visualizing data?
What are two key concepts in numerical representation of data?
What are box plots and why are they useful?
Shows the median, quartiles and outliers (important!)
- A graphical representation of dispersion, skewness, outliers and other prominent features in data using quartiles
They are useful because if the median is closer to the bottom or top of the box, it suggests skewness
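The numbers behind a box plot can be computed directly. The sketch below makes two assumptions that the notes do not state: quartiles use the "inclusive" interpolation method of Python's statistics.quantiles, and outliers are flagged with the common 1.5 x IQR rule. The data are made up.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the quantities a box plot draws (assumes the 1.5 * IQR outlier rule)."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,
        # Positive: median sits nearer the bottom of the box (hints at right skew).
        "median_offset": (q3 - med) - (med - q1),
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

Here the value 30 falls above the upper fence and is reported as an outlier, while the whiskers stop at 1 and 9.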
How do you compare boxplots?
How to construct boxplots
What are main concepts in inferential statistics?
What is a sample?
An observed subset of a population
What does statistical inference include?
What does 95% confidence level mean?
If we repeat the sampling process many times and build an interval from each sample, about 95% of those intervals would contain the true mean
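The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population is normal with a known standard deviation (so a z-interval applies), and the true mean, sample size, number of repeats and seed are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, mean

def coverage(true_mean=50.0, sigma=10.0, n=30, repeats=2000, level=0.95, seed=0):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    half_width = z * sigma / n ** 0.5          # known-sigma z-interval
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / repeats

print(coverage())  # should land close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes the procedure over many repetitions.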
What is the purpose of hypothesis testing and how do you formulate a hypothesis?
The purpose is to determine whether there is enough statistical evidence in favor of a certain belief about a parameter.
Method
- A null hypothesis, H0, is formulated and is assumed to be true
- The alternative hypothesis, Ha, is a claim contrary to H0
- The possible conclusions from a hypothesis test are then to REJECT H0 (if there is enough statistical evidence that it is not true) or FAIL TO REJECT H0 (if there is not enough statistical evidence to conclude that H0 is not true)
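The method above can be sketched as a two-sided one-sample z-test. The choice of test, the known standard deviation, the significance level of 0.05 and the sample values are all assumptions made for illustration; the notes do not name a specific test.

```python
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mean == mu0 against Ha: mean != mu0 (sigma assumed known)."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision

# Made-up measurements, for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.4, 5.7, 5.3, 5.5]

print(z_test(sample, mu0=5.0, sigma=0.4))  # sample mean is far from 5.0
print(z_test(sample, mu0=5.4, sigma=0.4))  # sample mean is close to 5.4
```

Note that the second result is "fail to reject H0", not "accept H0": a lack of evidence against H0 is not evidence for it.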
What are two concepts of modeling analysis?
Describe the steps in model development and two types of models
Two types
1. Linear (positive or negative linear relationship)
2. Non-linear
Regression analysis
A simple method of supervised learning that models the relationship between variables (causal only if the study design supports that reading) and provides predictions
- Explains the effect of the independent variable X on the dependent variable Y
Linear regression is a type of regression analysis and has a deterministic component and a probabilistic component (the error term).
- Assumes that the dependence of Y on X is linear
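A minimal sketch of simple linear regression, assuming the model Y = b0 + b1 * X + error and ordinary least squares as the estimation method (the notes do not say which method is used). The symbols b0 and b1 and the data points are illustrative assumptions.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares estimates (intercept b0, slope b1) for Y = b0 + b1 * X."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]               # deterministic component
residuals = [y - f for y, f in zip(ys, fitted)]  # estimates of the error term

print(b0, b1)
print(residuals)
```

The fitted line is the deterministic part; the residuals are what remains for the probabilistic part to explain.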
What three aspects determine the estimation of the coefficients in linear regression?
Error term assumptions in regression models
For a valid model, some assumptions on error terms must be fulfilled
- Independent and identically distributed (IID): Errors are independent of each other and have the same distribution across all observations
- Normally distributed (N): Errors follow a normal distribution with mean 0 and constant variance
What are key considerations for variable selection in regression models?
Explanatory power: Variables should significantly explain the variation in the dependent variable
- Can be assessed using measures like R2 (coefficient of determination; a higher R2 means more explanatory power) or hypothesis tests on the coefficients
Explanatory power includes
- Causality: Causal relationship between independent and dependent variable
- Model performance: Evaluate the model's fit and accuracy using adjusted R2, residual analysis or other performance metrics
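R2 and adjusted R2 can be computed from observed and predicted values with their standard definitions. The data and the line used for the predictions (intercept 0.05, slope 1.99) are assumed values for illustration.

```python
def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up data and an assumed fitted line, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
predictions = [0.05 + 1.99 * x for x in xs]

r2 = r_squared(ys, predictions)
adj_r2 = adjusted_r_squared(r2, n=len(ys), p=1)

print(r2, adj_r2)
```

Adjusted R2 penalizes every added variable, so it only rises when a new variable improves the fit by more than chance alone would; this makes it the safer measure when comparing models with different numbers of variables.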