Statistical Learning
A vast set of tools for modeling and understanding complex datasets.
Quantitative Data
The value is numerical. It’s the result of counting or measuring something.
Examples include:
* A person’s age
* The temperature of a room
* The height of a building
Qualitative Data
The value is non-numerical. It represents a quality, attribute, or characteristic. It’s often referred to as categorical data.
Examples include:
* A person’s eye color
* The type of car someone drives
* A product’s rating (e.g., “excellent,” “good,” “fair”)
* The brand of a cell phone
Regression
A supervised learning problem with a quantitative (numerical) response variable.
The goal is to predict a numerical value based on a set of input variables.
Examples include predicting:
* The price of a house
* The temperature tomorrow
* A person’s salary
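The house-price example above can be sketched with a simple linear fit. The square-footage and price numbers below are made up purely for illustration:

```python
# Minimal regression sketch: predict house price (a quantitative response)
# from square footage with a least-squares line. Toy data, not real prices.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

sqft = [1000, 1500, 2000, 2500]               # predictor (input variable)
price = [200_000, 275_000, 350_000, 425_000]  # quantitative response

slope, intercept = fit_line(sqft, price)
prediction = intercept + slope * 1800  # predicted price for an 1800 sq ft house
```

Because the response is a number (a price) rather than a category, this is a regression problem.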
Classification
A supervised learning problem with a qualitative (categorical) response variable.
The goal is to predict which category an observation belongs to, based on a set of input variables.
Examples include predicting whether:
* A person will click on an advertisement (Yes/No)
* A customer will buy a product (Yes/No)
* An email is spam or not (Spam/Not Spam)
* The digit in an image is a 0, 1, 2, …, or 9
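The spam example can be sketched with a toy rule-based classifier. The keyword list is an invented assumption, not a real spam filter:

```python
# Minimal classification sketch: assign an email to the category "Spam" or
# "Not Spam" (a qualitative response) using a toy keyword rule.

SPAM_WORDS = {"winner", "free", "prize"}  # illustrative keywords only

def classify(email_text):
    """Predict the category of an email from its words."""
    words = set(email_text.lower().split())
    return "Spam" if words & SPAM_WORDS else "Not Spam"

classify("You are a winner claim your free prize")  # -> "Spam"
classify("Meeting moved to 3pm")                    # -> "Not Spam"
```

Because the response is a category rather than a number, this is a classification problem.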
Input Variables
The variables used to predict a response variable.
They are also commonly referred to as predictors, features, independent variables, or covariates.
In a regression or classification problem, these are the inputs to the model. For example, when predicting a house’s price, the input variables would be the square footage, number of bedrooms, and location.
Output Variables
The variable that a statistical learning model is designed to predict.
Also commonly referred to as the response variable, dependent variable, or target variable.
The nature of the output variable determines whether a problem is a regression or a classification problem.
True Model Function
Y = f(X) + epsilon, where f is the true but unknown relationship between the predictors X and the response Y, and epsilon is a random error term with mean zero that is independent of X.
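The equation Y = f(X) + epsilon can be simulated directly. Here the "true" f and the noise level are arbitrary assumptions chosen for illustration:

```python
# Simulate observations from the true model Y = f(X) + epsilon.
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def f(x):
    # An assumed "true" relationship, chosen only for illustration.
    return 2 * x + 1

def generate_observation(x, noise_sd=0.5):
    # Systematic part f(x) plus irreducible random error epsilon.
    epsilon = random.gauss(0, noise_sd)
    return f(x) + epsilon

sample = [generate_observation(x) for x in range(5)]
```

No model fit to these data can remove the epsilon term; that is the irreducible (non-systematic) error described below.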
Non-systematic Error
An unavoidable error that is inherent in the data itself.
This error, also known as Irreducible Error, cannot be eliminated by using a better model. It arises from the noise or random variation in the measurement process and from unmeasured variables that influence the output.
No matter how well you estimate the relationship between the input and output variables, there will always be some error that is independent of your model.
Systematic Error
The error introduced by approximating a complex, real-life problem with a much simpler model.
This error, also known as reducible error, can in principle be reduced by choosing a more appropriate or more flexible statistical learning technique. Simply collecting more training data will not remove it if the model's assumed form is wrong.
Estimated Model Function
Y_hat = f_hat(X), where f_hat is our estimate of the true function f, and Y_hat is the resulting prediction for Y.
Prediction
In a statistical or machine learning context, prediction is the process of using a trained model to forecast the value of a target variable for a new, unseen input. It involves feeding new data into a model that has learned patterns from existing data, and then receiving an output that represents the model’s best guess for the outcome.
Inference
In statistics and machine learning, inference is the process of using data analysis to deduce properties of an underlying population or data-generating process. Unlike prediction, which focuses on forecasting new outcomes, inference aims to understand the relationships between variables, estimate parameters, and test hypotheses.
For example, a model built for inference might be used to determine if a specific drug has a statistically significant effect on a disease, while a model for prediction might simply be used to guess whether a new patient will have the disease based on their symptoms, without necessarily explaining why.
Q: Use cases for inference?
Parametric Methods
A two-step modeling approach. First, an assumption is made about the functional form or shape of the relationship between the inputs and output (e.g., assuming the relationship is linear). Second, the training data is used to fit the model, which simplifies the problem to estimating a set of parameters (e.g., the coefficients in a linear model). The main disadvantage is that the chosen model may not accurately reflect the true underlying relationship, leading to poor predictions.
Non-Parametric Methods
Non-parametric methods are a class of statistical learning models that do not make explicit assumptions about the functional form of the relationship between the predictors and the response. Instead, they try to estimate the function f without assuming a predefined shape.
This approach offers greater flexibility as it can accurately fit a wider range of possible shapes for f. The main disadvantage is that non-parametric methods often require a very large number of observations to obtain an accurate estimate of the function. This can lead to a more complex and computationally expensive model with a higher risk of overfitting to the training data.
Examples of non-parametric methods include thin-plate splines, support vector machines, and K-Nearest Neighbors.
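K-Nearest Neighbors, one of the non-parametric methods listed above, can be sketched in a few lines: classify a point by majority vote among its K closest training points. The 2-D points and labels below are made up for illustration:

```python
# Minimal K-Nearest Neighbors sketch: no functional form is assumed for f;
# predictions come directly from the nearby training observations.
from collections import Counter
import math

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B")]

def knn_predict(point, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

knn_predict((1.2, 1.1))  # -> "A" (two of its three nearest neighbors are "A")
```

Note how the method needs the full training set at prediction time, hinting at why non-parametric methods tend to need many observations.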
(Ordinary) Least Squares
A method used to fit a linear regression model. It works by finding the unique line (or hyperplane) that minimizes the sum of the squared residuals. A residual is the difference between an observed data point’s actual value and the value predicted by the model on the regression line.
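The least-squares criterion can be made concrete by computing the residual sum of squares (RSS) for a candidate line and checking that the OLS line beats a perturbed one. The toy data are invented:

```python
# Sketch of the least-squares criterion on toy data.
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

def rss(slope, intercept):
    # Sum of squared residuals: (observed - predicted)^2, summed over the data.
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form OLS estimates for the slope and intercept.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
beta0 = my - beta1 * mx

best = rss(beta1, beta0)
worse = rss(beta1 + 0.1, beta0)  # any other line has strictly larger RSS
```

The OLS line is the unique minimizer of `rss`, which is what "least squares" means.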
Overfitting
A phenomenon where a statistical model learns the training data too well, to the point that it begins to model the random noise and inaccuracies in the data rather than the true underlying relationship. An overfit model will have a very low error rate on the training data but a high error rate on new, unseen test data, as it fails to generalize. This typically happens when the model is too complex or flexible for the amount of data available.
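Overfitting is easy to demonstrate: fit a low-degree and a high-degree polynomial to a few noisy points drawn from a line. The degree-7 fit passes through all 8 training points, driving training error to essentially zero by chasing the noise; on fresh data it would generalize poorly. The data are simulated with an arbitrary seed:

```python
# Overfitting sketch: training error of a simple vs. an overly flexible fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 2 * x + rng.normal(0, 0.2, size=x.shape)  # truth is linear plus noise

def train_mse(degree):
    """Training-set mean squared error of a degree-`degree` polynomial fit."""
    coefs = np.polyfit(x, y, degree)
    pred = np.polyval(coefs, x)
    return np.mean((y - pred) ** 2)

mse_linear = train_mse(1)    # small, but not zero: it ignores the noise
mse_flexible = train_mse(7)  # near zero: degree 7 interpolates all 8 points
```

The near-zero training error of the flexible fit is exactly the warning sign described above, not evidence of a better model.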
Supervised Learning
A type of machine learning where a model is trained on labeled data. This means each observation has both predictor variables and a corresponding known response variable. The goal is to build a model that can predict the response for new, unseen observations. The two main types of supervised learning are regression for quantitative responses and classification for qualitative responses.
Unsupervised Learning
Unsupervised Learning is a type of machine learning where there is no associated response variable to supervise the algorithm. Unlike supervised learning, the dataset consists only of a set of features for each observation. The goal is not to make predictions but to discover interesting patterns, relationships, or groupings within the data.
Common tasks in unsupervised learning include:
* Clustering, which groups observations into distinct subgroups
* Dimensionality reduction (e.g., principal component analysis), which summarizes the features with a smaller number of informative dimensions
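Clustering, a common unsupervised task, can be sketched with a tiny one-dimensional k-means (k = 2). Note there is no response variable; the toy data are unlabeled, and the goal is only to discover the two groupings:

```python
# Minimal 1-D k-means sketch with k=2: alternate between assigning points
# to the nearest center and recomputing each center as its group's mean.

def kmeans_1d(points, iters=10):
    c1, c2 = min(points), max(points)  # initialize centers at the extremes
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]  # unlabeled observations
low, high = kmeans_1d(data)            # two discovered groups
```

No labels supervised the algorithm; the structure was inferred from the features alone.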
Mean Squared Error (MSE)
A common metric used to evaluate the performance of a regression model. It is calculated by taking the average of the squared differences between the predicted values and the actual observed values. Because it squares the errors, it gives more weight to large differences. A lower MSE indicates that the model’s predictions are closer to the true values and thus performing better.
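The definition translates directly to code; the numbers below are arbitrary toy values:

```python
# MSE: average of squared differences between predictions and actual values.

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])  # (0.25 + 0 + 4) / 3
```

The squaring makes the single large miss (2.0 vs. 4.0) dominate the average, illustrating how MSE penalizes big errors.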
Error Rate
The proportion of misclassified observations for a classifier. It is calculated by dividing the number of incorrect predictions by the total number of observations. For example, if a model correctly classifies 95 out of 100 observations, its error rate is 5%. It is a simple and common metric for evaluating the performance of a classification model.
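The 95-out-of-100 example from the text can be reproduced directly (the spam/ham labels are just illustrative stand-ins for any two classes):

```python
# Error rate: number of misclassified observations divided by the total.

def error_rate(actual, predicted):
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

actual = ["spam"] * 95 + ["ham"] * 5
predicted = ["spam"] * 100  # the 5 "ham" emails are misclassified

error_rate(actual, predicted)  # -> 0.05
```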
Q: Why do we estimate f?
We estimate f, the function representing the relationship between the predictors and the response, for two main reasons:
* Prediction: using the estimate f_hat to forecast the response Y for new observations, often treating f_hat as a black box
* Inference: understanding how Y changes as a function of the individual predictors, which requires knowing the form of f_hat
Q: How do we estimate f?
There are two main approaches to estimating the function f:
* Parametric methods, which assume a functional form for f and reduce the problem to estimating a set of parameters
* Non-parametric methods, which estimate f directly without assuming a predefined shape