Introduction_to_Statistical_Learning Flashcards

DONE: 01,02,03 TODO: 04 (86 cards)

1
Q

Statistical Learning

A

A vast set of tools for modeling and understanding complex datasets.

2
Q

Quantitative Data

A

The value is numerical. It’s the result of counting or measuring something.

Examples include:
* A person’s age
* The temperature of a room
* The height of a building

3
Q

Qualitative Data

A

The value is non-numerical. It represents a quality, attribute, or characteristic. It’s often referred to as categorical data.

Examples include:
* A person’s eye color
* The type of car someone drives
* A product’s rating (e.g., “excellent,” “good,” “fair”)
* The brand of a cell phone

4
Q

Regression

A

A supervised learning problem with a quantitative (numerical) response variable.

The goal is to predict a numerical value based on a set of input variables.

Examples include predicting:
* The price of a house
* The temperature tomorrow
* A person’s salary

5
Q

Classification

A

A supervised learning problem with a qualitative (categorical) response variable.

The goal is to predict which category an observation belongs to, based on a set of input variables.

Examples include predicting:
* Whether a person will click on an advertisement (Yes/No)
* Whether a customer will buy a product (Yes/No)
* Whether an email is spam (Spam/Not Spam)
* Which digit (0-9) an image contains

6
Q

Input Variables

A

The variables used to predict a response variable.

They are also commonly referred to as predictors, features, independent variables, or covariates.

In a regression or classification problem, these are the inputs to the model. For example, when predicting a house’s price, the input variables would be the square footage, number of bedrooms, and location.

7
Q

Output Variables

A

The variable that a statistical learning model is designed to predict.

Also commonly referred to as the response variable, dependent variable, or target variable.

The nature of the output variable determines whether a problem is a regression or a classification problem.

  • In regression, the output variable is quantitative (e.g., a house price).
  • In classification, the output variable is qualitative (e.g., spam or not spam).
8
Q

True Model Function

A

Y = f(X) + ε

where f is the fixed but unknown function relating the predictors X to the response Y, and ε (epsilon) is a random error term with mean zero that is independent of X.
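
To make the decomposition concrete, here is a small simulation (the choice of f and the noise level are made up for illustration) showing observed responses scattering around a hypothetical true f with zero-mean noise:

```python
import random

random.seed(42)

def f(x):
    # Hypothetical "true" function -- in practice f is unknown
    return 2.0 + 3.0 * x

xs = [i / 10 for i in range(50)]
# Each observation is the systematic part f(x) plus random noise epsilon
ys = [f(x) + random.gauss(0.0, 1.0) for x in xs]

# epsilon = y - f(x); its average is approximately zero
noise = [y - f(x) for x, y in zip(xs, ys)]
print(sum(noise) / len(noise))
```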

9
Q

Non-systematic Error

A

An unavoidable error that is inherent in the data itself.

This error, also known as Irreducible Error, cannot be eliminated by using a better model. It arises from the noise or random variation in the measurement process and from unmeasured variables that influence the output.

No matter how well you estimate the relationship between the input and output variables, there will always be some error that is independent of your model.

10
Q

Systematic Error

A

The error introduced by approximating a complex, real-life problem with a much simpler model.

This error, also known as reducible error, can potentially be reduced by choosing a more appropriate statistical learning technique to estimate f. Because it stems from the model's assumptions rather than the data, simply collecting more training data will not eliminate it.

11
Q

Estimated Model Function

A

Y_hat = f_hat(X)

where f_hat is our estimate of the unknown function f, and Y_hat is the resulting prediction for the response Y.

12
Q

Prediction

A

In a statistical or machine learning context, prediction is the process of using a trained model to forecast the value of a target variable for a new, unseen input. It involves feeding new data into a model that has learned patterns from existing data, and then receiving an output that represents the model’s best guess for the outcome.

13
Q

Inference

A

In statistics and machine learning, inference is the process of using data analysis to deduce properties of an underlying population or data-generating process. Unlike prediction, which focuses on forecasting new outcomes, inference aims to understand the relationships between variables, estimate parameters, and test hypotheses.

For example, a model built for inference might be used to determine if a specific drug has a statistically significant effect on a disease, while a model for prediction might simply be used to guess whether a new patient will have the disease based on their symptoms, without necessarily explaining why.

14
Q

Q: Use cases for inference?

A
  • Which predictors are associated with the response?
  • What is the relationship between the response and each predictor?
  • Can the relationship between Y and each predictor be adequately summarized using a linear equation or is the relationship more complicated?
15
Q

Parametric Methods

A

A two-step modeling approach. First, an assumption is made about the functional form or shape of the relationship between the inputs and output (e.g., assuming the relationship is linear). Second, the training data is used to fit the model, which simplifies the problem to estimating a set of parameters (e.g., the coefficients in a linear model). The main disadvantage is that the chosen model may not accurately reflect the true underlying relationship, leading to poor predictions.

16
Q

Non-Parametric Methods

A

Non-parametric methods are a class of statistical learning models that do not make explicit assumptions about the functional form of the relationship between the predictors and the response. Instead, they try to estimate the function f without assuming a predefined shape.

This approach offers greater flexibility as it can accurately fit a wider range of possible shapes for f. The main disadvantage is that non-parametric methods often require a very large number of observations to obtain an accurate estimate of the function. This can lead to a more complex and computationally expensive model with a higher risk of overfitting to the training data.

Examples of non-parametric methods include thin-plate splines, support vector machines, and K-Nearest Neighbors.

17
Q

(Ordinary) Least Squares

A

A method used to fit a linear regression model. It works by finding the unique line (or hyperplane) that minimizes the sum of the squared residuals. A residual is the difference between an observed data point’s actual value and the value predicted by the model on the regression line.
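
As a sketch (with a made-up toy dataset), the least squares coefficients for a single predictor can be computed directly from their closed-form formulas:

```python
# Toy dataset, invented for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form least squares estimates: the slope/intercept pair that
# minimizes the sum of squared residuals
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# A defining property of least squares: the residuals sum to zero
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(slope, intercept)  # approximately 1.96 and 0.14 for this data
```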

18
Q

Overfitting

A

A phenomenon where a statistical model learns the training data too well, to the point that it begins to model the random noise and inaccuracies in the data rather than the true underlying relationship. An overfit model will have a very low error rate on the training data but a high error rate on new, unseen test data, as it fails to generalize. This typically happens when the model is too complex or flexible for the amount of data available.
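
One way to see overfitting numerically: a 1-nearest-neighbor model (a deliberately flexible choice; the data here is simulated) memorizes the training set, so its training error is zero while its test error is not:

```python
import random

random.seed(1)

def sample(n):
    """Simulated observations from y = 2x + noise."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + random.gauss(0.0, 0.5)) for x in xs]

def predict_1nn(train, x):
    # 1-nearest neighbor: return the response of the closest training x
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(data, train):
    return sum((y - predict_1nn(train, x)) ** 2 for x, y in data) / len(data)

train, test = sample(30), sample(30)
print("training MSE:", mse(train, train))  # exactly 0.0 -- memorized
print("test MSE:", mse(test, train))       # clearly larger than zero
```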

19
Q

Supervised Learning

A

A type of machine learning where a model is trained on labeled data. This means each observation has both predictor variables and a corresponding known response variable. The goal is to build a model that can predict the response for new, unseen observations. The two main types of supervised learning are regression for quantitative responses and classification for qualitative responses.

20
Q

Unsupervised Learning

A

Unsupervised Learning is a type of machine learning where there is no associated response variable to supervise the algorithm. Unlike supervised learning, the dataset consists only of a set of features for each observation. The goal is not to make predictions but to discover interesting patterns, relationships, or groupings within the data.

Common tasks in unsupervised learning include:

  • Clustering: The process of partitioning data into distinct groups, or clusters, based on similarity. Examples include customer segmentation based on purchasing behavior or grouping similar documents.
  • Dimensionality Reduction: The process of reducing the number of variables or features in a dataset while retaining as much information as possible. This is often used for visualization or as a preprocessing step for other models.
21
Q

Mean Squared Error (MSE)

A

A common metric used to evaluate the performance of a regression model. It is calculated by taking the average of the squared differences between the predicted values and the actual observed values. Because it squares the errors, it gives more weight to large differences. A lower MSE indicates that the model’s predictions are closer to the true values and thus performing better.
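
The definition translates directly to code (the numbers below are toy values for illustration):

```python
def mse(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]
# Squared errors are 0.25, 0.0, 1.0, so the mean is 1.25 / 3
print(mse(actual, predicted))
```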

22
Q

Error Rate

A

The proportion of misclassified observations for a classifier. It is calculated by dividing the number of incorrect predictions by the total number of observations. For example, if a model correctly classifies 95 out of 100 observations, its error rate is 5%. It is a simple and common metric for evaluating the performance of a classification model.
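
A direct implementation of this definition (the labels are made up):

```python
def error_rate(actual, predicted):
    """Proportion of observations the classifier got wrong."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
print(error_rate(actual, predicted))  # 1 of 5 misclassified -> 0.2
```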

23
Q

Q: Why do we estimate f?

A

We estimate f, the function representing the relationship between the predictors and the response, for two main reasons:

  • Prediction: We can use the estimated f to predict the value of the response variable Y for a new observation where the predictor variables are known. This is useful when the primary goal is to make accurate forecasts.
  • Inference: We can use the estimated f to understand how the response variable Y is affected by the predictors. This allows us to determine which predictors are important, how they relate to the response, and the nature of that relationship (e.g., is it positive or negative).
24
Q

Q: How do we estimate f?

A

There are two main approaches to estimating the function f:

  1. Parametric Methods: This approach involves a two-step process. First, we make an assumption about the functional form or shape of f (for example, assuming f is a linear function). Second, we use the training data to fit or train this model. This simplifies the problem of estimating f down to estimating a set of parameters (like the coefficients in a linear model). This method is less flexible but requires less data.
  2. Non-Parametric Methods: These methods do not make explicit assumptions about the functional form of f. Instead, they seek to find an estimate of f that gets as close to the data points as possible without being too rough or wiggly. This approach is more flexible and can fit a wider range of shapes for f, but it generally requires a much larger number of observations to obtain an accurate estimate.
25
Q

Q: What is the Trade-Off Between Prediction Accuracy and Model Interpretability?

A

The trade-off refers to the inverse relationship between a model's prediction accuracy and its interpretability.

* **Interpretable Models** (e.g., linear regression) are simple and easy to understand. We can clearly see how a change in a predictor affects the response. However, their simplicity means they often make strong assumptions and may not be flexible enough to achieve the highest possible prediction accuracy.
* **Flexible Models** (e.g., boosting, support vector machines) can capture complex, non-linear relationships and often achieve superior prediction accuracy. However, due to their complexity, they are often considered "black boxes," making it very difficult to understand how they arrive at a particular prediction.

The choice between an interpretable and a flexible model depends on whether the primary goal is **inference** (understanding the relationship) or **prediction** (achieving the highest possible accuracy).

26
Q

Q: Supervised vs. Unsupervised Learning

A

The key difference between supervised and unsupervised learning lies in the presence of a **response variable**.

In **Supervised Learning**, the goal is to predict a response variable for new observations. The training data includes both the predictor variables and the corresponding, known response variable. Common tasks are **regression** (for quantitative responses) and **classification** (for qualitative responses).

In **Unsupervised Learning**, there is no response variable. The training data consists only of the predictor variables. The goal is not to predict an outcome but to discover interesting patterns, relationships, or structure within the data itself. Common tasks include **clustering** and **dimensionality reduction**.

27
Q

Q: Regression vs Classification Problems

A

The distinction between regression and classification problems is based on the nature of the response variable.

**Regression Problems** involve predicting a **quantitative** response. The response variable is a continuous, numerical value. For example, predicting the price of a house, the height of a person, or the score on a test.

**Classification Problems** involve predicting a **qualitative** response. The response variable belongs to one of several categories or classes. For example, predicting whether an email is spam or not spam, whether a tumor is benign or malignant, or which of the ten possible digits a handwritten image shows.

28
Q

Q: How do we assess model accuracy/fit?

A

We assess a model's accuracy by evaluating its performance on a **test set** of data that was not used during training. This provides an estimate of how well the model will perform on new, unseen data. The specific metric depends on the type of problem:

* For **regression problems**, which predict a quantitative response, accuracy is commonly measured using the **Mean Squared Error (MSE)**; a lower MSE indicates higher accuracy. Another important measure is the R^2 statistic, which measures the proportion of variance in the response variable that is explained by the model; an R^2 value closer to 1 indicates a better fit.
* For **classification problems**, which predict a qualitative response, accuracy is commonly measured using the **Error Rate**, the proportion of misclassified observations; a lower error rate indicates higher accuracy.

It's important to be cautious: a model that fits the training data too well may be overfitting and will not perform well on new, unseen data.

29
Q

Q: Model Accuracy vs Model Fit

A

**Model fit** refers to how well a model performs on the **training data** it was built on. It is measured by the training error, such as Mean Squared Error (MSE) for regression or the error rate for classification. A good model fit means the model has captured the relationships present in the data it has seen.

**Model accuracy** refers to how well a model performs on **new, unseen test data**. This is the ultimate goal, as it measures the model's ability to generalize to new situations. It is measured by the test error.

The key distinction is that a model can have a very good **fit** on the training data (low training error) but poor **accuracy** on the test data (high test error). This phenomenon is known as **overfitting**.

30
Q

Bias-Variance Trade-Off

A

The **Bias-Variance Trade-Off** is a fundamental concept that describes the relationship between two sources of error in a statistical learning model.

**Bias** is the error introduced by approximating a complex, real-life problem with a simpler model. A model with high bias is inflexible and makes strong assumptions about the data, leading to **underfitting**.

**Variance** is the amount by which a model's prediction would change if it were trained on a different dataset. A model with high variance is very flexible and learns the random noise in the training data, leading to **overfitting**.

The trade-off is that as you increase a model's flexibility to reduce bias, you tend to increase its variance. The goal is to find the level of flexibility that minimizes the expected test error, which decomposes into the squared bias, the variance, and the irreducible error.

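
The trade-off can be illustrated by simulation. The sketch below (the true function, noise level, and models are all invented for illustration) compares a rigid model (always predict the mean response) with a flexible one (1-nearest neighbor) at a single test point, across many simulated training sets:

```python
import random

random.seed(0)

def true_f(x):
    return x * x  # hypothetical true relationship

def training_set(n=20):
    xs = [random.uniform(0.0, 2.0) for _ in range(n)]
    return [(x, true_f(x) + random.gauss(0.0, 0.5)) for x in xs]

x0 = 1.5  # fixed test point
rigid, flexible = [], []
for _ in range(500):
    data = training_set()
    # Rigid model: ignore x entirely and predict the mean response
    rigid.append(sum(y for _, y in data) / len(data))
    # Flexible model: 1-nearest neighbor
    flexible.append(min(data, key=lambda p: abs(p[0] - x0))[1])

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# The rigid model is badly biased but stable; the flexible model is
# nearly unbiased but its predictions vary much more across datasets.
print("bias rigid:", mean(rigid) - true_f(x0))
print("bias flexible:", mean(flexible) - true_f(x0))
print("variance rigid:", variance(rigid))
print("variance flexible:", variance(flexible))
```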
31
Q

Bayes Classifier

A

The **Bayes classifier** is a theoretical, ideal classifier that achieves the lowest possible test error rate. It is not a practical method because it requires knowing the true conditional distribution of the response variable given the predictors, which is never known in real-world applications.

The classifier assigns each observation to the class for which the conditional probability of the response, given the predictors, is highest. This is based on Bayes' theorem. The test error rate of this classifier is called the **Bayes error rate**, and it represents the absolute minimum achievable error for a given dataset and set of predictors. All other classifiers can be thought of as trying to approximate the Bayes classifier.

32
Q

K-Nearest Neighbors

A

K-Nearest Neighbors (KNN) is a non-parametric method used for both classification and regression. To make a prediction for a new observation, KNN first identifies the K closest points to that observation in the training dataset.

* For classification, it takes a majority vote among these K neighbors: the new observation is assigned to the class that is most common among them.
* For regression, it averages the response values of the K neighbors to produce the prediction.

The choice of K is a critical parameter that controls the model's flexibility and the bias-variance trade-off. A small K results in a very flexible, high-variance model, while a large K results in a less flexible, high-bias model.

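
A minimal KNN classifier, written from scratch in Python (the training points and choice of k are invented for the example):

```python
import math
from collections import Counter

def knn_classify(train, point, k):
    """train: list of ((x, y), label) pairs. Majority vote among the
    k training points closest to `point` (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]

print(knn_classify(train, (1.1, 0.9), k=3))  # two "A" neighbors win the vote
```

For KNN regression, the vote would be replaced by the mean of the k neighbors' response values.
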
33
Linear Regression
34
Simple Linear Regression
35
Intercept
36
Slope
37
Coefficient
38
Parameter
39
Q: How do we estimate the coefficients?
40
Residual
41
Residual Sum of Squares
42
Q: How do we assess the accuracy of our coefficient estimates?
43
Population Regression Line
44
Least Squares Line
45
Bias
46
Variance
47
Standard Error
48
Residual Standard Error
49
Confidence Interval
50
Hypothesis Test
51
Null Hypothesis
52
Alternative Hypothesis
53
t-statistic
54
p-value
55
Q: How do we assess the accuracy of the model?
56
57
R^2
58
Multiple Linear Regression
59
F-statistic
60
Variable Selection
61
Forward Variable Selection
62
Backward Variable Selection
63
Mixed Variable Selection
64
Prediction Interval
65
Dummy Variable
66
Baseline
67
Interaction/Synergy Effect
68
Hierarchy Principle
69
Main Effect
70
Q: Potential Problems with Linear Regression
71
P1: Non-Linearity of the response-predictor relationships
72
P2: Correlation of error terms
73
P3: Non-constant variance of error terms
74
P4: Outliers
75
P5: High-Leverage points
76
P6: Collinearity
77
Residual Plot
78
Tracking
79
Heteroscedasticity
80
High-Leverage
81
Leverage Statistic
82
Multi-collinearity
83
Variance Inflation Factor (VIF)
84
Q: How does linear regression compare and contrast to K-Nearest Neighbors (KNN)?
85
Curse of Dimensionality
86