What is the variance-bias trade-off?
The variance-bias trade-off is a fundamental concept in machine learning that describes how well a model fits data and generalizes to new data. Two properties, bias and variance, characterize a model's fit, and there is a trade-off between them: decreasing one tends to increase the other. The main aim of machine learning algorithms and various techniques (e.g. ensembling and regularization) is to keep both low.
Let’s consider the meaning of bias and variance:
Bias: Bias is the difference between the average prediction of your model and the true value you are trying to predict. High bias means your model is overly simplistic. It has strong assumptions about the data which prevents it from learning the real underlying patterns. This leads to underfitting.
Variance: Variance measures how much your model's predictions change with different training datasets. High variance means the model is highly sensitive to the specific data it is trained on, capturing noise and peculiarities rather than the generalizable trend. This leads to overfitting.
The Trade-Off:
Let’s understand the trade-off in relation to the complexity of the model’s decision boundary.
Complex Models: Flexible models (like polynomial regression models, deep neural networks, decision trees with many splits) have the capacity to fit complex patterns in the data. This reduces bias but can lead to high variance if they start fitting the noise in the training data rather than the true underlying trend.
Simple Models: Linear models or shallow decision trees have high bias because of their simplifying assumptions. However, they are less prone to overfitting and tend to have lower variance.
Techniques to Address the Trade-Off
1. Regularization: Adds penalties to complex models to favor simpler explanations (e.g. L1/L2 regularization)
2. Hyper-parameter tuning using cross-validation: Evaluates combinations of parameters across multiple splits of the data to find the model complexity that best balances variance and bias.
3. Ensemble Methods: Combines multiple models to reduce variance (e.g. bagging, random forests).
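To see the trade-off concretely, here is a minimal NumPy sketch (synthetic data and illustrative degrees, not anything prescribed by the text) that fits polynomials of increasing complexity to noisy samples of a sine curve. Training error always falls with complexity, while error on a fresh sample eventually rises:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)        # training sample
y_test = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)   # fresh noisy sample

def errors(degree):
    # fit a polynomial of the given complexity, score on both samples
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x) - y_test) ** 2)
    return train_mse, test_mse

# degree 1 underfits (high bias); degree 15 overfits (high variance)
for d in (1, 3, 15):
    tr, te = errors(d)
    print(d, round(tr, 3), round(te, 3))
```

The training error at degree 15 is lower than at degree 1, yet the degree-15 fit chases the noise, which is exactly the variance side of the trade-off.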
What is cross-validation? How does it work?
Suppose that you randomly sample and allocate 70% of your data for training the model and the remaining 30% for testing it. Let's also assume that the prediction problem is regression, so mean squared error (MSE) is used to evaluate the model. You compute the MSE of the model's predictions on the testing data and obtain 679.34.
This approach seems to work, but there is a drawback.
Suppose that model performance is evaluated on new testing data, resulting in 824.44 MSE. This discrepancy suggests that there is variability in the model prediction error from one set of data to the next. Cross-validation reduces this uncertainty in estimating the error by averaging multiple testing errors across multiple folds of the data.
The procedure of cross-validation is simple. You choose a number of folds, K, which partitions the entire dataset into K folds, as shown below.
Each fold takes a turn as the validation set while the rest are used for training the model. This process is repeated K times so that every fold is used for validation exactly once and an error is computed on each.
Finally, the errors are averaged to produce a single error score with reduced uncertainty.
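The procedure above can be sketched in a few lines of NumPy. The `fit`/`predict` callables and the least-squares line fit are illustrative choices, not something the text prescribes:

```python
import numpy as np

def kfold_mse(X, y, K, fit, predict):
    # hold out each fold once, train on the rest, and average the K errors
    idx = np.arange(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        errs.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return np.mean(errs)   # one score with less uncertainty than a single split

# usage with a simple least-squares line fit on synthetic data
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 100)
y = 2.0 * X + rng.normal(0, 1, 100)
score = kfold_mse(X, y, K=5,
                  fit=lambda X, y: np.polyfit(X, y, 1),
                  predict=lambda m, X: np.polyval(m, X))
print(round(score, 3))
```

In practice a library routine (e.g. a cross-validation helper from an ML framework) would replace this, but the mechanics are the same.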
Cross-validation AUC is 0.90. However, when the model is productionized, AUC drops to 0.75. Why?
The discrepancy between a model's offline and online performance often happens when a holdout test set wasn't used to measure performance. Cross-validation works well when the observations in the data are independent, meaning that past observations do not influence future ones. However, in many modeling exercises, observations are autocorrelated (e.g. (1) a fraud user previously banned will reappear with new accounts and new behaviors to avoid detection; (2) an online shopper's spending patterns change over time). Given this dependence, you can see how it's problematic for cross-validation to use future data to predict and evaluate past data. Hence, the better design is time-series cross-validation: always train on historical data and predict and evaluate on future data.
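A minimal sketch of expanding-window time-series splits; the equal-block fold sizing here is one simple convention among several:

```python
import numpy as np

def time_series_splits(n, n_splits):
    # expanding window: always train on the past, validate on the next block
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        yield np.arange(0, k * fold), np.arange(k * fold, min((k + 1) * fold, n))

# every training index precedes every validation index
for train, val in time_series_splits(10, 4):
    print(train.tolist(), "->", val.tolist())
```

Unlike standard K-fold, no split ever trains on observations that come after the ones it is evaluated on.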
What happens if you increase the number of folds in cross validation?
If you increase the number of folds, each fold's training set becomes larger and closer to the full dataset, which decreases the bias of the error estimate. However, the training sets across folds overlap more and become highly correlated, so the per-fold errors are less independent, which tends to increase the variance of the estimate. It also increases computation, since more models must be trained.
Does cross-validation improve model performance?
No, cross validation is not designed to improve model performance. It’s designed to improve the measurement of model performance.
How would you handle multicollinearity?
To begin answering this question, let’s first provide an explanation of the definition of multicollinearity.
Multicollinearity is the presence of two or more highly correlated features in a model. A best practice in building a highly predictive, interpretable machine learning model is to reduce multicollinearity.
Multicollinearity can harm:
1. The predictive performance of a model because of overfitting.
2. The interpretability of the model: the variable importance of a feature that correlates with another can be inaccurate.
3. The maintainability of large correlated features in a production environment.
Now, let's address the interviewer's question on treating multicollinearity in a model. You can list techniques:
1. Use Pearson and Spearman correlations to identify correlated variables. Use Pearson correlation if the relationship between two variables is linear. If not, use Spearman.
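A small NumPy sketch of both correlations, using the rank-transform definition of Spearman (a no-ties sketch; variable names are illustrative). On a monotonic but nonlinear relationship, Spearman is a perfect 1.0 while Pearson is not:

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # Spearman is Pearson computed on the ranks of the values
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))

x = np.arange(1, 21, dtype=float)
y = x ** 3                        # monotonic but nonlinear
print(round(pearson(x, y), 3))    # < 1: the relationship is not linear
print(round(spearman(x, y), 3))   # 1.0: the relationship is perfectly monotonic
```

Pairs whose correlation exceeds a chosen cutoff would then be candidates for removal or combination.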
Suppose that a feature set contains 800 variables in a supervised model. How would you handle multicollinearity?
Handling multicollinearity is a combination of art and best practice. There are many approaches. Here is one approach that could work. Assume that the 800 variables break down into 300 categorical variables and 500 numerical variables. To make de-correlation easy, transform the 300 categorical variables into numerical variables using numerical encodings, such as weight-of-evidence, mutual information, or class probability. Prior to computing correlations on pairs of variables, cap outliers and standardize the numerical range.
Suppose you are building a credit fraud model with the minority class being less than 1%. How would you build a classification model that can handle an extremely imbalanced dataset?
Always relate back to the problem which is credit fraud. Just simply listing techniques for handling imbalanced datasets is not enough.
This solution will cover popular techniques and lightly touch on the theory behind how each technique works. There is deep statistical underpinning for why these techniques work and when they fail, but this guide focuses on how to respond to the interviewer's question.
When the class distribution is extremely imbalanced, you should not simply apply a binary classification model out of the box. Doing so can provide a baseline performance to beat, but you need to apply best practices to tackle the imbalance.
Common techniques include:
1. Choosing the right criterion to measure model performance
2. Re-sampling data to balance the class
3. Applying cost-sensitive learning
Before exploring each technique, let's add substance to the context that the interviewer posed to you. This further demonstrates to the interviewer that you have a framework for approaching the problem. You can say something along the lines of:
“I’m going to assume that I have access to historical data with records from 2015 through 2017. Let’s also assume that there are 2.4 million credit card transactions, and about 0.5% are known fraudulent transactions. That’s merely 12,000 known bads compared to millions of known goods in the dataset.”
Best Practice #1 - Choose the right metric for evaluating model performance.
Do not use accuracy, which is:
(True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)
The class distribution is heavily skewed toward the good population (the negatives). Suppose your model yields the results below; based on accuracy, the model performance is:
True Positive: 1,200
True Negative: 2,388,000
False Positive: 0
False Negative: 10,800
(1,200 + 2,388,000) / (1,200 + 2,388,000 + 0 + 10,800) = 99.6%
For a model that correctly predicts only 1,200 of the 12,000 actual bads, or 10%, an accuracy of 99.6% makes it appear that the model is doing well. But, in reality, it is not.
Do not use ROC-Curve.
When the class distribution is extremely skewed toward goods over bads, ROC-Curve becomes ineffective in evaluating the performance of a classification model.
Consider that ROC-curve is the plot formed by true-positive rates (TPR) and false-positive rates (FPR) across the threshold range from 0 to 1, inclusive.
TPR = True Positive / (True Positive + False Negative)
FPR = False Positive / (True Negative + False Positive)
Consider the simple example below, which consists of 10,000 observations: 9,990 goods and 10 bads. The deciles represent the probability thresholds applied to the testing data. The size represents the total number of observations with probability scores at or above the threshold. At each threshold, the corresponding true positives (TP), false negatives (FN), true negatives (TN), false positives (FP), TPR and FPR are computed.
When you observe the numerical summary, the poor prediction of true positives is overshadowed by the extremely disproportionate number of true negatives. Note that most of the true negatives mass at the lower range of probability scores, below that of the true positives.
When you examine the decile threshold at, let's say, 0.6, TPR is 0.8 and FPR is 0.3003. At this threshold, the model seems to do quite well: 80% of the bads are predicted accurately and only about 30% of the goods are misclassified as bad. However, this overlooks the volume of total negatives misclassified in relation to the true positives. In other words, this view is missing precision.
Use PR-Curve:
The best practice is to evaluate the model using PR-Curve, which uses precision and recall, evaluated across model score range from 0 to 1, inclusive. Precision and recall both target true-positive as a measure for model performance.
Recall = True Positive / (True Positive + False Negatives)
Precision = True Positive / (True Positive + False Positive)
Note that recall is synonymous with TPR. Precision, on the other hand, includes false positives, which is the key to evaluating a model under an extremely imbalanced problem. Reviewing threshold 0.6 in the numerical summary, you observe that precision is merely 0.27%. This is a far different picture of the model’s ability to predict positives. Even the curve provides a different picture than ROC-curve.
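Plugging the threshold-0.6 numbers from the example into the formulas shows how precision exposes what TPR and FPR hide:

```python
# Threshold 0.6 from the example: 10 bads, 9,990 goods, TPR 0.8, FPR 0.3003
tp = 0.8 * 10          # bads correctly flagged
fp = 0.3003 * 9990     # goods incorrectly flagged (~3,000)

recall = tp / 10                 # same as TPR: 0.8
precision = tp / (tp + fp)       # ~0.0027, i.e. ~0.27%
print(round(recall, 2), round(100 * precision, 2))
```

Eight caught bads are drowned out by roughly three thousand false alarms, which recall alone never reveals.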
Best practice #2 - Conduct resampling of minority and majority class
There are three widely-known variants of resampling techniques - downsampling, oversampling, and SMOTE. We will cover downsampling and oversampling in this solution. For SMOTE, there is plenty of academic literature that covers how it works.
The intuition behind downsampling and oversampling is simple.
The majority class is dispersed across the areas of the minority class, which makes it difficult for any classification model to draw a decision boundary that separates the two classes.
When there's too much noise from the majority class (orange) around the minority class (blue) as shown in the illustration, the boundary between the orange and blue data points becomes fuzzy. Down-sampling can reduce the noise, thereby helping a classification model form a better boundary, as shown on the right.
With a similar objective in mind, oversampling the minority class can also be performed. Essentially, this is bootstrapping the minority class to improve the balance between the ratio of goods and bads.
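A minimal index-based sketch of both resampling schemes in NumPy; the 99:1 toy labels and the 5:1 downsampling target are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 990 + [1] * 10)        # toy labels: 99:1 imbalance
majority = np.where(y == 0)[0]
minority = np.where(y == 1)[0]

# downsample the majority class to a 5:1 ratio against the minority
keep = rng.choice(majority, size=5 * len(minority), replace=False)
down_idx = np.concatenate([keep, minority])

# oversample (bootstrap) the minority class up to the majority size
boot = rng.choice(minority, size=len(majority), replace=True)
over_idx = np.concatenate([majority, boot])

print(len(down_idx), len(over_idx))
```

Either index set would then be used to select the training rows; the model is still evaluated on the original, untouched distribution.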
Best Practice #3 - Try using cost-sensitive learning.
In credit fraud modelling, there are two types of decisions and four types of outcomes. In terms of decisions, the model labels them as fraud or non-fraud. But, based on the actual label, there are four outcomes each associated with its own cost:
1 (Pred) 1 (Actual) - Cost(1,1)
1 (Pred) 0 (Actual) - Cost(1,0)
0 (Pred) 1 (Actual) - Cost(0,1)
0 (Pred) 0 (Actual) - Cost(0,0)
There are many forms of cost-sensitive learning. One main type you should be aware of is cost-sensitive learning with respect to threshold determination.
When class is highly imbalanced, you never want to choose 0.5 as the threshold for predicting class as fraud. Some measure incorporating cost is required.
Let's assume, for the sake of simplicity, that Cost(1,1) and Cost(0,0) are 0. If you predict fraud or non-fraud accurately, then you should not expect to incur any cost. However, there are costs associated with misclassifying.
Cost(1,0) is the cost of a false positive (FP), while Cost(0,1) is the cost of a false negative (FN). You can determine your threshold, P*, based on the following:
P* = Cost(1,0) / (Cost(1,0) + Cost(0,1)) = Cost(FP) / (Cost(FP) + Cost(FN))
Predict class as fraud if P(Fraud|X) >= P*.
The derivation of the formula is out of scope in the solution. There is plenty of literature that covers the topic of what is an optimal model threshold based on costs. Given that not all fraud problems are the same and the cost of each determination is different from company to company, cost-sensitive classification is subject to change.
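A sketch of the threshold rule under the simplifying zero-cost-for-correct-decisions assumption above; the 9:1 cost ratio is hypothetical:

```python
def fraud_threshold(cost_fp, cost_fn):
    # P* = Cost(FP) / (Cost(FP) + Cost(FN)), with zero cost for correct decisions
    return cost_fp / (cost_fp + cost_fn)

# hypothetical costs: a missed fraud (FN) hurts 9x as much as a false alarm (FP)
p_star = fraud_threshold(cost_fp=1.0, cost_fn=9.0)
print(p_star)  # 0.1 -> flag as fraud whenever P(fraud|x) >= 0.1, far below 0.5
```

Note that equal misclassification costs recover the familiar 0.5 threshold, which is exactly why 0.5 is wrong whenever the costs are asymmetric.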
What is the curse of dimensionality? How do you prevent it?
WHAT IS CURSE OF DIMENSIONALITY:
The curse of dimensionality refers to a set of challenges and problems that arise when working with high-dimensional data (datasets with many features). The core issue is sparsity:
As the number of dimensions increases, the volume of the data space increases exponentially. This means data points become increasingly scattered and far apart, making it difficult to find patterns.
As the data points become sparse and less clustered, the decision boundary begins to overfit, which decreases generalization. Additionally, the curse of dimensionality unnecessarily increases model training and inference time.
HOW DO YOU MITIGATE CURSE OF DIMENSIONALITY:
The following methods can mitigate the curse of dimensionality:
Dimension reduction: Techniques like PCA to project data into lower dimensions.
Feature Selection: Identifying the most important features and dropping the rest
Regularization parameters: Every common ML algorithm contains parameters that mitigate overfitting:
- Decision tree: pruning
- Random forest: bootstrap, number of trees, column and row sampling
- XGBoost: bootstrap, column and row sampling, L1/L2 regularization terms
- Neural network: dropout, L1/L2 regularization terms
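As an illustration of the dimension-reduction route, here is a minimal PCA projection via SVD on synthetic data; `pca_project` is a sketch, not a library API:

```python
import numpy as np

def pca_project(X, n_components):
    # center the features, then project onto the top right-singular vectors
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # 200 points in 50 dimensions
Z = pca_project(X, n_components=5)   # the same points in 5 dimensions
print(Z.shape)  # (200, 5)
```

Downstream models then train on the compact representation `Z` instead of the sparse 50-dimensional space.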
What is AUC? How is it helpful when labels are imbalanced?
Suppose there are a total of 100 observations: 99 goods and 1 bad. Your classification model predicts all 100 observations as good, resulting in 99% accuracy. Treating good as the positive class, the model is great at classifying goods with a 100% true positive rate, but horrendous at catching the bad: the single bad is also flagged as good, a 100% false positive rate.
AUC is the metric to apply in such a case when the labels are imbalanced. AUC stands for “Area Under the Curve” and it is typically used in the context of the ROC curve, or Receiver Operating Characteristic curve, in statistics and machine learning. Here’s a breakdown of each term:
AUC (Area Under the Curve):
1. Overview: AUC refers to the area under a curve in a graph. In the context of classification problems in machine learning and statistics, it usually refers to the area under the ROC curve, which is a graphical representation of a model’s diagnostic ability.
2. Significance: AUC provides a single scalar value that represents the likelihood that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It's often used to evaluate the performance of binary classification algorithms, though it can be extended to multi-class classification.
3. Range: AUC ranges from 0 to 1, with a value of 0.5 representing a model that performs no better than random and a value of 1 representing a perfect model. Generally, an AUC above 0.7 is considered acceptable, but this threshold may vary depending on the application.
ROC Curve (Receiver Operating Characteristic Curve)
1. Plot: The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1-specificity).
2. Interpretation: Each point on the ROC curve represents a different threshold used to convert the model’s real-valued predictions into binary classifications. The curve illustrates the trade-off between sensitivity and specificity at various thresholds.
3. Usage: It’s a commonly used tool for evaluating the performance of classification algorithms, especially in the field of medical decision-making, and for comparing different classifiers.
Consider a model that is designed to predict whether a user will make a purchase after visiting a product sales page. From a total of 10,000 users who visited the page, 1,500 made a purchase. Among the users the model predicted as potential buyers, 1,000 were accurately identified as buyers but 700 were falsely predicted as buyers. Given this information, how can we assess the performance of this model?
For an imbalanced dataset like this, where the number of conversions is much smaller than non-conversions, relying solely on accuracy can overstate the model’s effectiveness. Precision, recall, and the F1/F2 scores offer a more balanced and insightful evaluation of the model’s performance, highlighting areas for improvement in predicting user conversions on the sales page.
First, let’s clarify and organize the provided data:
- Total users who actually made a purchase (Converted): 1500
- Users correctly predicted to have made a purchase (True positives, TP): 1,000
- Users incorrectly predicted to have made a purchase (False Positives, FP): 700
- Total users who visited the page: 10,000
Now, we calculate the missing components for our evaluation:
- False Negatives (FN), those who made a purchase but were not predicted as buyers: This is the total number of actual conversions minus the True Positives, resulting in 500.
- True Negatives (TN), those who were not predicted to buy and did not buy: This is derived as the total number of users minus the True Positives, False Positives, and False Negatives, which gives us 7,800.
Given these calculations, we can now discuss the model’s performance using various metrics:
1. Accuracy: Calculated as (True Positives + True Negatives) / Total Users, which gives 88% (8,800 / 10,000). However, accuracy can be misleading in cases of imbalanced classes, such as when the number of conversions is significantly less than non-conversions.
2. Precision and Recall: These metrics offer a more nuanced view of the model’s performance.
- Precision (the proportion of predicted conversions that were correct) is calculated as TP / (TP + FP), yielding approximately 59%.
- Recall (the proportion of actual conversions that were correctly identified) is TP / (TP + FN), yielding approximately 67%.
3. F1/F2 Scores: These scores balance precision and recall, with the F1 score providing a harmonic mean and the F2 score weighting recall higher than precision. These are more suitable metrics when dealing with imbalanced datasets.
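Plugging the counts above into these formulas; the F2 expression is the general F-beta form with beta = 2:

```python
tp, fp, fn, tn = 1000, 700, 500, 7800

accuracy = (tp + tn) / (tp + fp + fn + tn)              # 0.88
precision = tp / (tp + fp)                              # ~0.588
recall = tp / (tp + fn)                                 # ~0.667
f1 = 2 * precision * recall / (precision + recall)      # harmonic mean
f2 = 5 * precision * recall / (4 * precision + recall)  # weights recall higher
print(round(accuracy, 2), round(precision, 3), round(recall, 3),
      round(f1, 3), round(f2, 3))
```

The F1 of 0.625 and F2 of roughly 0.65 tell a far more sober story than the 88% accuracy.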
What is precision and how is it calculated?
Precision is the proportion of predicted conversions that were correct and is calculated as TP / (TP + FP)
What is recall and how is it calculated?
Recall is the proportion of actual conversions that were correctly identified and is TP / (TP + FN). Also known as True Positive Rate or Sensitivity.
How do you evaluate whether a model is underfitting or overfitting?
A way to determine whether a model is underfitted or overfitted is to look at the model's error curves on the train and validation datasets, as seen below. Consider the number of trees (iterations) sequentially constructed in an XGBoost model. While the errors of both train and validation are on a downward trajectory as iterations increase, the model is still underfitting and there is room to reach the optimal fit.
But, as the iterations increase beyond the optimal fit, the validation error increases while the train error continues to decrease. This is a sign that the model is overfitting on the training data, and that the iterations need to be cut back or some other regularization method (e.g. an L1/L2 term, column and row sampling) should be considered to reduce overfitting.
How do you conduct feature selection when building models?
To conduct feature selection for machine learning, here are a couple of key points to consider:
Types of Feature Selection Methods

1. Filter Methods:
- Correlation analysis: Calculate correlation coefficients (Pearson, Spearman, etc) to identify strong linear and monotonic relationships.
- Information Gain, Mutual Information: Measure how much information a feature provides about the target.
- Chi-Square Test: Evaluate the independence between a feature and the target variable (for categorical data).
- Variance Threshold: Remove features with low variance (i.e., that barely change).
- P Values: Use linear regression and select the variables based on their p-values. For features with small p-values (generally <= 0.05), you can reject the null hypothesis and mark that feature as important.
Pros: Fast, model agnostic, good for initial screening.
Cons: Might miss features that only become important when combined with others.
2. Wrapper Methods:
Recursive Feature Elimination (RFE): Fit a model, rank feature importance, and recursively eliminate the least important features until the optimal subset remains.
Forward Selection: Start with no features, then iteratively add the most useful one at a time until performance stops improving.
Backward Elimination: Start with all features, then iteratively remove the least useful one at a time until performance stops improving.
Note:
Backward Elimination removes one feature at a time and re-evaluates model performance (e.g. accuracy) after each removal to decide what to drop next. RFE uses the model’s internal feature importance scores (e.g. coefficients in logistic regression, feature importances in a random forest) to rank and eliminate features, rather than re-evaluating performance each time.
How many features they drop at once:
Backward Elimination always removes one feature per step. RFE can be configured to eliminate multiple features per iteration, making it faster.
Pros: Take feature interactions into account, can lead to better performing models.
Cons: Computationally expensive, risk of overfitting to the specific model.
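To make the wrapper idea concrete, here is a minimal RFE-style sketch on synthetic data, using scale-adjusted least-squares coefficients as the importance score; a real pipeline would typically use a library implementation:

```python
import numpy as np

def rfe(X, y, n_select):
    # fit least squares, drop the feature with the smallest scale-adjusted
    # coefficient, refit, and repeat until n_select features remain
    kept = list(range(X.shape[1]))
    while len(kept) > n_select:
        Xk = X[:, kept]
        A = np.column_stack([Xk, np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        importance = np.abs(coef[:-1]) * Xk.std(axis=0)
        kept.pop(int(np.argmin(importance)))
    return kept

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(0, 0.5, 200)  # only features 0 and 4 matter
print(sorted(rfe(X, y, n_select=2)))  # expected: [0, 4]
```

Note that, in line with the distinction above, this ranks by the model's internal coefficients rather than re-evaluating held-out performance after each removal, which is what separates RFE from backward elimination.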
3. Embedded Methods:
- L1 Regularization (Lasso): Adds a penalty term to the model that forces coefficients of less important features towards zero.
- Decision Trees and Random Forests: These algorithms inherently provide feature importance scores (e.g. impurity-based importances or Shapley values).
Pros: Efficient, integrated into the modelling process. In addition, the regularization approach can help reduce multicollinearity in the features.
Cons: Specific to the type of model used.
4. Dimensionality Reduction Methods:
- PCA/ICA: These approaches use matrix factorizations to reduce the overall dimensions of the feature data.
- Autoencoder - This is a neural network based model that trains on the input set of features, and attempts to recreate the input. The hidden layer embodies the dense, lower dimensional representation of the input data.
Pros: PCA/ICA are easy to implement in capturing the lower dimension representation of the initial data. This can prevent overfitting.
Cons: Loses the interpretability of the feature data as the model trains on the output of the dimensionality reduction models.
Choosing the right techniques
Consider these factors when deciding on feature selection methods:
1. Dataset Size: Filter methods are generally fast, suitable for large datasets. Wrapper methods might be too computationally expensive for very high-dimensional problems.
2. Model type: Embedded methods are often convenient if your primary model choice is already something like a decision tree or a penalized regression model.
3. Goal: If your goal is primarily to understand the most important features, filter or embedded models are good starting points. If maximizing predictive performance is the priority, wrapper methods might yield better results.
Additional Tips:
1. Domain Knowledge: Incorporate your understanding of the problem to guide feature selection.
2. Remove highly correlated features: Identify and potentially remove features that are essentially duplicates of each other.
3. Iteration: Feature selection is often an iterative process; try different methods and evaluate their impact.
What are the differences between Euclidean and Manhattan distances?
Euclidean and Manhattan distances are two common ways to measure the distance between data points. These measures are often used in algorithms that rely on distance calculations, such as k-nearest neighbors (k-NN) and clustering algorithms. Additionally, these measures are used in recommender systems (e.g. collaborative filtering).
Euclidean Distance
Euclidean Distance is the straight-line distance between two points in Euclidean space. It is the most common measure of distance in geometry and generalizes to any number of dimensions. It's the square root of the sum of the squared differences between each component of a pair of vectors:
Euclidean(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
The problem with Euclidean distance is that it is sensitive to outliers given that one of the dimensions may produce a large difference, which may inflate the overall calculation of the Euclidean distance value.
Manhattan Distance
Manhattan Distance is another distance measure that sums the absolute values of the differences, rather than the squared differences as seen in the Euclidean function. It is less sensitive to outliers.
Manhattan(x,y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
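Both measures in a few lines of pure Python; the second pair of points shows how a single outlier dimension inflates Euclidean distance far more than Manhattan:

```python
def euclidean(x, y):
    # square root of the sum of squared component differences
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    # sum of absolute component differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean([0, 0], [3, 4]))    # 5.0
print(manhattan([0, 0], [3, 4]))    # 7

# one outlier coordinate dominates the squared sum
print(euclidean([0, 0], [1, 100]), manhattan([0, 0], [1, 100]))
```

In the outlier case the Euclidean value is almost entirely determined by the single large coordinate, while Manhattan still reflects both.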
How do you handle missing values in data when building models?
Handling missing values is a crucial part of data preparation for machine learning. Here’s a breakdown of the most common techniques, along with considerations on when to use them:
Methods:
Row Deletion - Remove entire rows (samples) if they contain missing values.
Column Deletion - Remove the column if it contains a specified proportion of missing values.
When to use:
When only a small fraction of rows contain missing values (for row deletion), or when a column is mostly missing (for column deletion).
When missingness is completely random (not correlated with other features or the target variable).
Risks: Significant loss of information, potential introduction of bias if missingness isn't random.
Methods:
Mean/Median Imputation: Replace missing values with the mean (for continuous variables) or median (for continuous or ordinal variables) of the feature.
Mode Imputation: Replace missing values with the most frequent value (for categorical features)
Predictive Modelling: Create a model (e.g. KNN) to predict the missing values based on other features.
When to use:
To preserve the sample size
When you have some understanding of the underlying distribution of the features or the relationships between them.
Risks:
Imputed values are estimations, potentially reducing the variance in your data.
Simple imputation (mean/median) can distort the data’s distribution.
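A minimal NumPy sketch of mean vs. median imputation; note how the outlier drags the mean-imputed value while the median stays robust:

```python
import numpy as np

def impute(col, strategy="median"):
    # fill NaNs with the column median or mean (a minimal sketch)
    mask = np.isnan(col)
    fill = np.nanmedian(col) if strategy == "median" else np.nanmean(col)
    out = col.copy()
    out[mask] = fill
    return out

col = np.array([1.0, 2.0, np.nan, 4.0, 100.0])  # 100 is an outlier
print(impute(col, "median"))  # NaN -> 3.0, robust to the outlier
print(impute(col, "mean"))    # NaN -> 26.75, dragged up by the outlier
```

In production, the fill value should be computed on the training split only and then reused at inference time, to avoid leaking information from the test data.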
Methods:
Create a new category “missing” for categorical features.
For numerical features, sometimes a special value (e.g. -999) can indicate missingness
When to use:
When missingness itself might hold predictive information.
Risks:
Carefully consider if it makes sense in the context of your problem.
Choosing the Right Strategy
1. Understand the source of the missingness. First, identify whether the missingness comes from a data pipeline issue, or from a field that is optional, deprecated, or newly created.
2. Measure the proportion of missing values. If the proportion of missing values is high, this may necessitate removing the column, as it would carry low predictive power for the model. If it's a moderate amount, then consider strategies such as imputation or replacement with a unique category.
3. Test different strategies. Ultimately, the best strategy for missingness depends on the model performance. Measure the baseline performance. Iterate through the model building process with different missing value handling approaches.
How do you evaluate regression model metrics?
There are several metrics for regression models that we can evaluate, including MSE, RMSE, MAE, and R-Squared.
Choosing the Right Metrics
The choice of metrics depends on several factors:
Problem Context: If the absolute magnitude of errors is crucial (e.g., predicting financial losses), MAE might be better. If understanding the proportion of variance explained is important, R^2 is a good choice.
Outliers: If your data has outliers, MAE is often more robust than MSE or RMSE.
Scale of the Target Variable: Metrics like MSE and RMSE are affected by the scale of your target variable. Consider normalization if the scale is significantly different from the range of predicted values.
What is the formula for Mean Squared Error?
MSE = 1/n * summation(y_i - y_hat_i)^2 where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i
What is the formula for Root Mean Squared Error?
RMSE = sqrt(1/n * summation(y_i - y_hat_i)^2) where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i
What is the formula for Mean Absolute Error?
MAE = 1/n * summation(|y_i - y_hat_i|) where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i
What is the formula for R-Squared (also known as Coefficient of Determination)?
R^2 = 1 - ( summation(y_i - y_hat_i)^2 / summation(y_i - y_mean)^2 ) where
y_i = actual value for sample i
y_hat_i = predicted value for sample i
y_mean is the average of the actual values
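All four metrics computed together on a tiny hand-checkable example (a NumPy sketch):

```python
import numpy as np

def regression_metrics(y, y_hat):
    err = y - y_hat
    mse = np.mean(err ** 2)                                   # mean squared error
    rmse = np.sqrt(mse)                                       # same units as y
    mae = np.mean(np.abs(err))                                # robust to outliers
    r2 = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)   # variance explained
    return mse, rmse, mae, r2

y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])
mse, rmse, mae, r2 = regression_metrics(y, y_hat)
print(mse, rmse, mae, r2)  # 0.125, ~0.354, 0.25, 0.975
```

The errors here are [0.5, 0, -0.5, 0], so each value can be verified by hand against the formulas above.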
How do you conduct hyperparameter tuning?
Hyperparameter tuning is the process of finding the optimal settings for a machine learning model’s hyperparameters. Here are three common techniques for hyperparameter tuning:
Choosing the right technique:
Grid Search: A good choice for low-dimensional hyperparameter spaces (few hyperparameters) or when interpretability of the search process is important. It guarantees exploration of the entire defined grid.
Random Search: A strong alternative to grid search, especially for high-dimensional spaces. It’s generally faster and less prone to getting stuck in bad regions of the search space.
Bayesian Optimization: Ideal for expensive-to-evaluate models or complex hyperparameter spaces. Its efficiency comes from focusing on promising areas and avoiding redundant evaluations. However, it requires more expertise to set up and interpret the results.
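A minimal grid search sketch over a single hyperparameter (polynomial degree) with a simple holdout split on synthetic data; real tuning would typically search several hyperparameters with cross-validation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 120)
y = 1 + 2 * x - 3 * x ** 2 + rng.normal(0, 0.2, 120)  # true curve is quadratic
x_tr, y_tr, x_va, y_va = x[:80], y[:80], x[80:], y[80:]

# grid search: try every candidate degree, keep the best validation MSE
scores = {}
for d in range(1, 9):
    coefs = np.polyfit(x_tr, y_tr, d)
    scores[d] = np.mean((np.polyval(coefs, x_va) - y_va) ** 2)
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Random search would sample degrees (or continuous hyperparameters) instead of enumerating them, and Bayesian optimization would choose each next candidate based on the scores seen so far.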
How do L1 and L2 regularization terms work?
L1 and L2 regularization are techniques used in machine learning to prevent overfitting. An overfit model learns the training data too well, adapting to noise and peculiarities, which leads to poor performance on new, unseen data.
Here’s how they work:
L1 Regularization (Lasso)
Concept: L1 regularization adds a penalty term to the model’s cost function that is proportional to the ABSOLUTE VALUE of the model’s coefficients (weights).
Effect: It encourages coefficients to become zero, effectively performing feature selection. Only the most important features will have large, non-zero coefficients. This can help create simpler and more interpretable models.
Sparsity: L1 regularization introduces sparsity into the model. A sparse model has many coefficients set to zero.
L2 Regularization (Ridge)
Concept: L2 regularization adds a penalty term to the model’s cost function that is proportional to the SQUARE of the coefficients.
Effect: It discourages large coefficients by penalizing them more heavily, but it doesn’t force them to become exactly zero. This leads to smaller, more spread-out weights.
Prevents Overfitting: L2 regularization helps avoid situations where single features get large weights and dominate the model’s predictions.
Visualizing the Difference:
Observe the plot showing the L1 and L2 regularization constraint regions.
When to Use Which
L1 for Feature Selection: If you want to reduce the number of features in your model and improve interpretability, L1 regularization is often preferable.
L2 for Preventing Overfitting: Generally, L2 is a good default choice for preventing large weights and reducing overfitting, especially when you don’t need explicit feature selection.
Elastic Net: Combines L1 and L2, useful when you want the benefits of both kinds of regularization.
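A small NumPy sketch contrasting the two penalties on synthetic data: closed-form ridge shrinks all coefficients smoothly, while the soft-thresholding operator at the heart of lasso's coordinate descent sets small coefficients exactly to zero. `ridge_fit` and `soft_threshold` are illustrative helpers, not library functions:

```python
import numpy as np

def ridge_fit(X, y, lam):
    # closed-form ridge: (X^T X + lam*I)^-1 X^T y (no intercept; X assumed centered)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def soft_threshold(z, t):
    # lasso's building block: shrink toward zero, snapping small values to exactly 0
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(0, 0.1, 100)

print(ridge_fit(X, y, lam=0.0))     # ~ [2, 0, -1]: plain least squares
print(ridge_fit(X, y, lam=100.0))   # shrunk toward 0, but none exactly 0
print(soft_threshold(np.array([2.0, 0.03, -1.0]), 0.1))  # middle value -> 0.0
```

This is the sparsity difference in miniature: the L2 path never produces exact zeros, while the L1 operator does, which is why lasso doubles as feature selection.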