ML Breadth Depth Qs Flashcards

(230 cards)

1
Q

What is the variance-bias trade-off?

A

The bias-variance trade-off is a fundamental concept in machine learning that describes how well a model fits data and generalizes to new predictions. Two properties, bias and variance, assess model fit, and there is a trade-off between them: reducing one tends to increase the other. A central aim of machine learning algorithms and techniques (e.g. ensembling and regularization) is to keep both low.

Let’s consider the meaning of bias and variance:
Bias: Bias is the difference between the average prediction of your model and the true value you are trying to predict. High bias means your model is overly simplistic: it makes strong assumptions about the data, which prevent it from learning the real underlying patterns. This leads to underfitting.

Variance: Variance measures how much your model’s predictions change with different training datasets. High variance means the model is highly sensitive to the specific data it’s trained on, capturing noise and peculiarities rather than the generalizable trend. This leads to overfitting.

The Trade-Off:
Let’s understand the trade-off in relation to the complexity of the model’s decision boundary.
Complex Models: Flexible models (like polynomial regression, deep neural networks, or decision trees with many splits) have the capacity to fit complex patterns in the data. This reduces bias but can lead to high variance if they start fitting the noise in the training data rather than the true underlying trend.
Simple Models: Linear models or shallow decision trees have high bias because of their simplifying assumptions. However, they are less prone to overfitting and tend to have lower variance.

Techniques to Address the Trade-Off
1. Regularization: Adds penalties to complex models to favor simpler explanations (e.g. L1/L2 regularization)
2. Hyper-parameter tuning using cross-validation: Evaluates combinations of parameters across multiple splits of the data to find the model complexity that best balances variance and bias.
3. Ensemble Methods: Combines multiple models to reduce variance (e.g. bagging, random forests).
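The trade-off can be seen empirically by varying model complexity. A minimal sketch, assuming scikit-learn is available; the noisy sine-wave dataset and the polynomial degrees are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # noisy sine wave

results = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Mean 5-fold CV error; sklearn returns negated MSE, so flip the sign.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    results[degree] = mse
    print(f"degree={degree:2d}  mean CV MSE={mse:.3f}")
```

Degree 1 underfits (high bias), a very high degree tends to overfit (high variance), and an intermediate degree typically minimizes the cross-validated error.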

2
Q

What is cross-validation? How does it work?

A

Suppose that you randomly sample and allocate 70% of your data for training the model and the remaining 30% for testing it. Let’s also assume that the prediction problem is regression, so mean squared error (MSE) is used to evaluate the model. You compute the MSE of the model’s predictions on the testing data and obtain 679.34.

This approach seems to work, but there is a drawback.

Suppose that model performance is evaluated on new testing data, resulting in an MSE of 824.44. This discrepancy suggests that there is variability in the model’s prediction error from one set of data to the next. Cross-validation reduces this uncertainty in estimating the error by averaging testing errors across multiple folds of the data.

The procedure of cross-validation is simple. You choose K, the number of folds, which partitions the entire dataset into K folds.

Each fold serves as validation data once while the remaining folds are used for training the model. This process is repeated K times so that an error is computed on every fold.

Finally, the errors are averaged to produce a single error score with reduced uncertainty.
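The steps above can be sketched as follows; scikit-learn’s KFold is assumed, and the linear dataset and model are placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=500)

fold_errors = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    # Each fold takes a turn as validation data; the rest trains the model.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_errors.append(mean_squared_error(y[val_idx], preds))

cv_mse = np.mean(fold_errors)  # single averaged error with reduced uncertainty
print(f"per-fold MSE: {np.round(fold_errors, 2)}, mean: {cv_mse:.2f}")
```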

3
Q

Cross-validation AUC is 0.90. However, when the model is productionalized, AUC drops to 0.75. Why?

A

The discrepancy between offline and online model performance often happens when a proper holdout test wasn’t used to measure performance. Cross-validation works well when the observations in the data are independent, meaning that past observations do not influence future ones. However, in many modeling exercises, observations are autocorrelated: (1) a fraud user previously banned will re-appear with new accounts and new behaviors to avoid detection; (2) an online shopper’s spending patterns change over time. Given this dependence, you can see how it’s problematic to use future data to predict and evaluate past data in cross-validation. Hence, the better design is time-series cross-validation: always train on historical data, and predict and evaluate on future data.
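A minimal sketch of the time-ordered splits, assuming scikit-learn’s TimeSeriesSplit; the ten observations are placeholders ordered by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 observations in chronological order

splits = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Training indices always precede validation indices: past predicts future.
    splits.append((train_idx.tolist(), val_idx.tolist()))
    print(f"train={train_idx.tolist()} -> validate={val_idx.tolist()}")
```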

4
Q

What happens if you increase the number of folds in cross validation?

A

If you increase the number of folds, each model is trained on a larger fraction of the data, so the performance estimate becomes less biased: it more closely reflects training on the full dataset. At the same time, the training sets overlap more and each validation fold gets smaller, which tends to increase the variance of the estimate. Computation cost also grows with the number of folds.

5
Q

Does cross-validation improve model performance?

A

No, cross validation is not designed to improve model performance. It’s designed to improve the measurement of model performance.

6
Q

How would you handle multicollinearity?

A

To begin answering this question, let’s first define multicollinearity.

Multicollinearity is the presence of two or more correlated features in a model. The best practice in building a highly-predictive, interpretable machine learning model is to remove multicollinearity.

Multicollinearity can harm:
1. The predictive performance of a model because of overfitting.
2. The interpretability of the model such that the variable importance of a feature correlating with another would be inaccurate.
3. The maintainability of large correlated features in a production environment.

Now, let’s address the interviewer’s question on treating multicollinearity in a model. You can list techniques:
1. Use Pearson and Spearman correlations to identify correlated variables. Use Pearson correlation if the relationship between two variables is linear; if not, use Spearman.
2. Employ the variance inflation factor (VIF) to identify correlated variables in a regression model.
3. Apply wrapper methods such as backward, forward, or stepwise selection to build a model whose feature set has a low presence of multicollinearity.
4. Use regularized regression such as elastic net, lasso, or ridge. You can use the constructed model as the final model for prediction, or extract features with nonzero coefficients as decorrelated features for a final model such as a GBM or neural network.
5. Use principal component analysis to compress a feature set into a smaller set of decorrelated features.
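One of the techniques above, VIF, can be computed by hand: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing feature j on the remaining features. A sketch on synthetic data, assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = x1 * 0.95 + rng.normal(0, 0.3, size=1000)  # strongly correlated with x1
x3 = rng.normal(size=1000)                      # independent
X = np.column_stack([x1, x2, x3])

vifs = []
for j in range(X.shape[1]):
    # Regress feature j on all other features, then convert R^2 to a VIF.
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vifs.append(1.0 / (1.0 - r2))
print([round(v, 2) for v in vifs])  # a VIF above roughly 5-10 signals a problem
```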
7
Q

Suppose that a feature set contains 800 variables in a supervised model. How would you handle multicollinearity?

A

Handling multicollinearity is a combination of art and best practice. There are many approaches; here is one that could work. Assume that the 800 variables break down into 300 categorical variables and 500 numerical variables. To make de-correlation easy, transform the 300 categorical variables into numerical variables with numerical encodings, such as weight-of-evidence, mutual information, or class probability. Prior to computing correlations on pairs of variables, scale the features to temper outliers and standardize the numerical range.
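One hedged sketch of the correlation step after encoding: compute pairwise Spearman correlations and greedily drop one member of each highly correlated pair. The 0.9 threshold and the toy columns are assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"a": rng.normal(size=300)})
df["b"] = df["a"] * 2 + rng.normal(0, 0.1, size=300)  # near-duplicate of "a"
df["c"] = rng.normal(size=300)                        # independent

corr = df.corr(method="spearman").abs()
to_drop = set()
cols = list(df.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        # Keep the first member of a correlated pair, drop the second.
        if corr.iloc[i, j] > 0.9 and cols[i] not in to_drop and cols[j] not in to_drop:
            to_drop.add(cols[j])
kept = [c for c in cols if c not in to_drop]
print("dropped:", sorted(to_drop), "kept:", kept)
```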

8
Q

Suppose you are building a credit fraud model with the minority class being less than 1%. How would you build a classification model that can handle an extremely imbalanced dataset?

A

Always relate back to the problem, which is credit fraud. Simply listing techniques for handling imbalanced datasets is not enough.

This solution covers popular techniques and lightly touches on the theory behind how each one works. There are depths of statistical underpinning on why the techniques work and when they fail, but this guide should show how to respond to the interviewer’s question.

When the class distribution is extremely imbalanced, you should never simply apply a binary classification model as-is. Although doing so can provide a baseline performance to beat, you need to apply best practices to tackle this problem.

Common techniques include:
1. Choosing the right criterion to measure model performance
2. Re-sampling data to balance the class
3. Applying cost-sensitive learning

Before exploring each technique, let’s add substance to the context that the interviewer posed. This further demonstrates that you have a framework for approaching the problem. You can say something along the lines of:

“I’m going to assume that I have access to historical data with records from 2015 through 2017. Let’s also assume that there are 2.4 million credit card transactions, and about 0.5% are known fraudulent transactions. That’s merely 12,000 known bads compared to millions of known goods in the dataset.”

Best Practice #1 - Choose the right metric for evaluating model performance.

Do not use accuracy, which is:
(True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)

The class distribution is heavily skewed toward the good population (negatives). Suppose your model yields the results below; based on accuracy, the model performance is:

True Positive: 1,200
True Negative: 2,280,000
False Positive: 0
False Negative: 10,800

(1,200 + 2,280,000) / (1,200 + 10,800 + 2,280,000) = 99.5%

For a model that correctly predicts only 1,200 of the 12,000 actual bads, or 10%, an accuracy of 99.5% makes it appear that the model is doing well. But, in reality, it is not.

Do not use ROC-Curve.

When the class distribution is extremely skewed toward goods over bads, ROC-Curve becomes ineffective in evaluating the performance of a classification model.

Consider that the ROC curve is formed by plotting true-positive rates (TPR) against false-positive rates (FPR) across the threshold range from 0 to 1, inclusive.

TPR = True Positive / (True Positive + False Negative)
FPR = False Positive / (True Negative + False Positive)

Consider a simple example consisting of 10,000 observations: 9,990 goods and 10 bads. The deciles represent the probability thresholds applied to the testing data. The size represents the total number of observations with probability scores at or above the threshold. At each threshold, the corresponding true positives (TP), false negatives (FN), true negatives (TN), false positives (FP), TPR, and FPR are computed.

When you observe the numerical summary, the poor prediction of true positives is overshadowed by the extremely disproportionate number of true negatives. Consider that the majority of the true negatives mass at a lower range of probability scores than the true positives do.

When you examine a decile threshold of, let’s say, 0.6, TPR is 0.8 and FPR is 0.3003. At this threshold, the model seems to do quite well given that 80% of the bads will be predicted accurately and only about 30% of the negatives will be misclassified as fraud. However, this overlooks the volume of total negatives misclassified in relation to the true positives. In other words, this metric is missing precision.

Use PR-Curve:

The best practice is to evaluate the model using PR-Curve, which uses precision and recall, evaluated across model score range from 0 to 1, inclusive. Precision and recall both target true-positive as a measure for model performance.

Recall = True Positive / (True Positive + False Negatives)
Precision = True Positive / (True Positive + False Positive)

Note that recall is synonymous with TPR. Precision, on the other hand, includes false positives, which is the key to evaluating a model on an extremely imbalanced problem. Reviewing threshold 0.6 in the numerical summary: with roughly 8 true positives against about 3,000 false positives, precision is merely 0.27%. This is a far different picture of the model’s ability to predict positives, and the PR curve likewise looks very different from the ROC curve.
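The gap between the two metrics can be demonstrated on synthetic imbalanced data; this sketch assumes scikit-learn, and the score distributions are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 9990, 10
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
# Positives score somewhat higher on average, but overlap with negatives.
scores = np.concatenate([rng.normal(0.3, 0.15, n_neg),
                         rng.normal(0.6, 0.15, n_pos)])

roc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)  # area under the PR curve
print(f"ROC AUC={roc:.3f}  average precision={ap:.3f}")
```

The ROC AUC looks strong while the average precision stays low, because the PR curve is punished by the flood of false positives that the ROC curve barely registers.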

Best practice #2 - Conduct resampling of minority and majority class

There are three widely known variants of resampling techniques: downsampling, oversampling, and SMOTE. We will cover downsampling and oversampling in this solution. For SMOTE, there is plenty of academic literature that covers how it works.

The intuition behind downsampling and oversampling is simple.

When the majority class is dispersed across the regions occupied by the minority class, any classification model has difficulty drawing a decision boundary that separates the two classes.

When there is too much noise from the majority class around the minority class, the boundary between the two sets of data points becomes fuzzy. Downsampling can reduce the noise and thereby help a classification model form a better boundary.

With a similar objective in mind, oversampling the minority class can also be performed. Essentially, this is bootstrapping the minority class to improve the balance between the ratio of goods and bads.
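A minimal sketch of both resampling schemes with plain NumPy; the 990:10 class ratio and the 5:1 target ratio are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.array([0] * 990 + [1] * 10)  # 990 goods, 10 bads
neg_idx = np.where(y == 0)[0]
pos_idx = np.where(y == 1)[0]

# Downsample the majority class to a 5:1 good-to-bad ratio.
down_neg = rng.choice(neg_idx, size=5 * len(pos_idx), replace=False)
downsampled = np.concatenate([down_neg, pos_idx])

# Oversample (bootstrap) the minority class up to the same 5:1 ratio.
up_pos = rng.choice(pos_idx, size=len(neg_idx) // 5, replace=True)
oversampled = np.concatenate([neg_idx, up_pos])

print(len(downsampled), len(oversampled))  # 60 and 1188
```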

Best Practice #3 - Try using cost-sensitive learning.

In credit fraud modelling, there are two types of decisions and four types of outcomes. In terms of decisions, the model labels them as fraud or non-fraud. But, based on the actual label, there are four outcomes each associated with its own cost:

1 (Pred) 1 (Actual) - Cost(1,1)
1 (Pred) 0 (Actual) - Cost(1,0)
0 (Pred) 1 (Actual) - Cost(0,1)
0 (Pred) 0 (Actual) - Cost(0,0)

There is a multitude of cost-sensitive learning methods. One main type you should be aware of is cost-sensitive learning with respect to threshold determination.

When class is highly imbalanced, you never want to choose 0.5 as the threshold for predicting class as fraud. Some measure incorporating cost is required.

Let’s assume, for the sake of simplicity, that Cost(1,1) and Cost(0,0) are 0: if you predict fraud or non-fraud accurately, you incur no cost. However, there are costs associated with misclassifying.

Cost(1,0) is the cost of a false positive (FP), while Cost(0,1) is the cost of a false negative (FN). You can determine your threshold, P*, as follows:

P* = Cost(1,0) / (Cost(1,0) + Cost(0,1))
Predict class as fraud if P(Fraud|X) >= P*.
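Plugging hypothetical costs into the threshold rule; the cost values below are illustrative, not from any real fraud program:

```python
# Assumes Cost(1,1) = Cost(0,0) = 0, so only misclassification costs matter.
cost_fp = 10.0   # Cost(1,0): cost of flagging a good transaction as fraud
cost_fn = 490.0  # Cost(0,1): cost of missing a fraudulent transaction

p_star = cost_fp / (cost_fp + cost_fn)  # cost-derived decision threshold
print(f"P* = {p_star:.2f}")             # far below the naive 0.5

def predict(p_fraud, threshold=p_star):
    # Flag as fraud whenever the model's P(fraud | x) reaches the threshold.
    return int(p_fraud >= threshold)
```

Because missing a fraud costs far more than a false alarm here, the optimal threshold lands near 0.02 rather than 0.5.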

The derivation of the formula is out of scope for this solution; there is plenty of literature on choosing an optimal model threshold based on costs. Given that not all fraud problems are the same and the cost of each determination differs from company to company, the cost-sensitive threshold will vary as well.

9
Q

What is the curse of dimensionality? How do you prevent it?

A

WHAT IS CURSE OF DIMENSIONALITY:
The curse of dimensionality refers to a set of challenges and problems that arise when working with high-dimensional data (datasets with many features). The core issue is sparsity:

As the number of dimensions increases, the volume of the data space increases exponentially. This means data points become increasingly scattered and far apart, making it difficult to find patterns.

As the data points become less clustered and more sparse, the decision boundary begins to overfit, which decreases generalization. Additionally, the curse of dimensionality unnecessarily increases model training and inference time.

HOW DO YOU MITIGATE CURSE OF DIMENSIONALITY:
The following methods help mitigate the curse of dimensionality:
Dimension reduction: Techniques like PCA to project data into lower dimensions.
Feature Selection: Identifying the most important features and dropping the rest.
Regularization parameters: Every common ML algorithm contains parameters that mitigate overfitting:
Decision tree - pruning
Random forest - bootstrap, number of trees, column and row sampling
XGBoost - bootstrap, column and row sampling, L1/L2 regularization term
Neural Network - Dropout, L1/L2 regularization term
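The sparsity effect can be observed directly: as dimensionality grows, distances between random points concentrate, so the relative gap between the nearest and farthest neighbor shrinks. A sketch with NumPy (the sample sizes and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))
    # Distances from the first point to all the others.
    dists = np.linalg.norm(X[0] - X[1:], axis=1)
    # Relative contrast: (max - min) / min; it decays as d grows.
    ratios[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast={ratios[d]:.2f}")
```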

10
Q

What is AUC? How is it helpful when labels are imbalanced?

A

Suppose there are a total of 100 observations: 99 goods and 1 bad. Your classification model predicts all 100 observations as good, resulting in 99% accuracy. Treating good as the positive class, the model is great at classifying goods with a 100% true positive rate, but horrendous at classifying the bad: the single bad is misclassified, a 100% false positive rate.

AUC is the metric to apply in such a case when the labels are imbalanced. AUC stands for “Area Under the Curve” and it is typically used in the context of the ROC curve, or Receiver Operating Characteristic curve, in statistics and machine learning. Here’s a breakdown of each term:

AUC (Area Under the Curve):
1. Overview: AUC refers to the area under a curve in a graph. In the context of classification problems in machine learning and statistics, it usually refers to the area under the ROC curve, which is a graphical representation of a model’s diagnostic ability.
2. Significance: AUC provides a single scalar value that represents the likelihood that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It’s often used to evaluate the performance of binary classification algorithms, though it can be extended to multi-class classification.
3. Range: AUC ranges from 0 to 1, with a value of 0.5 representing a model that performs no better than random and a value of 1 representing a perfect model. Generally, an AUC above 0.7 is considered acceptable, but this threshold may vary depending on the application.

ROC Curve (Receiver Operating Characteristic Curve)
1. Plot: The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1-specificity).
2. Interpretation: Each point on the ROC curve represents a different threshold used to convert the model’s real-valued predictions into binary classifications. The curve illustrates the trade-off between sensitivity and specificity at various thresholds.
3. Usage: It’s a commonly used tool for evaluating the performance of classification algorithms, especially in the field of medical decision-making, and for comparing different classifiers.
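The ranking interpretation in point 2 can be checked directly on a toy example (scikit-learn assumed; the labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.8, 0.7, 0.9])

auc = roc_auc_score(y_true, scores)

# Equivalent rank-based check: the fraction of (positive, negative) pairs
# in which the positive outscores the negative (ties count half).
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(auc, sum(pairs) / len(pairs))  # both 0.875: 7 of 8 pairs ranked correctly
```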

11
Q

Consider a model that is designed to predict whether a user will make a purchase after visiting a product sales page. From a total of 10,000 users who visited the page, 1,500 made a purchase. Among the users the model predicted as potential buyers, 1,000 were accurately identified as buyers but 700 were falsely predicted as buyers. Given this information, how can we assess the performance of this model?

A

For an imbalanced dataset like this, where the number of conversions is much smaller than non-conversions, relying solely on accuracy can overstate the model’s effectiveness. Precision, recall, and the F1/F2 scores offer a more balanced and insightful evaluation of the model’s performance, highlighting areas for improvement in predicting user conversions on the sales page.

First, let’s clarify and organize the provided data:
- Total users who actually made a purchase (Converted): 1500
- Users correctly predicted to have made a purchase (True positives, TP): 1,000
- Users incorrectly predicted to have made a purchase (False Positives, FP): 700
- Total users who visited the page: 10,000

Now, we calculate the missing components for our evaluation:
- False Negatives (FN), those who made a purchase but were not predicted as buyers: the total number of actual conversions minus the True Positives, resulting in 500.
- True Negatives (TN), those who were not predicted to buy and did not buy: the total number of users minus the True Positives, False Positives, and False Negatives, which gives us 7,800.

Given these calculations, we can now discuss the model’s performance using various metrics:
1. Accuracy: Calculated as (True Positives + True Negatives) / Total Users, which gives 88% (8,800 / 10,000). However, accuracy can be misleading in cases of imbalanced classes, such as when the number of conversions is significantly less than non-conversions.
2. Precision and Recall: These metrics offer a more nuanced view of the model’s performance.
- Precision (the proportion of predicted conversions that were correct) is calculated as TP / (TP + FP), yielding approximately 59%.
- Recall (the proportion of actual conversions that were correctly identified) is TP / (TP + FN), yielding approximately 67%.
3. F1/F2 Scores: These scores balance precision and recall, with the F1 score providing a harmonic mean and the F2 score weighting recall higher than precision. These are more suitable metrics when dealing with imbalanced datasets.
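The arithmetic above, worked in code; the counts come directly from the question, with the F1 score included:

```python
# Counts from the question: 1,000 TP, 700 FP, 1,500 actual buyers, 10,000 users.
tp, fp, actual_pos, total = 1000, 700, 1500, 10000
fn = actual_pos - tp       # 500 buyers the model missed
tn = total - tp - fp - fn  # 7,800 correctly ignored non-buyers

accuracy = (tp + tn) / total  # 0.88
precision = tp / (tp + fp)    # ~0.588
recall = tp / (tp + fn)       # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 2), round(precision, 3), round(recall, 3), round(f1, 3))
```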

12
Q

What is precision and how is it calculated?

A

Precision is the proportion of predicted conversions that were correct and is calculated as TP / (TP + FP)

13
Q

What is recall and how is it calculated?

A

Recall is the proportion of actual conversions that were correctly identified and is TP / (TP + FN). Also known as True Positive Rate or Sensitivity.

14
Q

How do you evaluate whether a model is underfitting or overfitting?

A

A way to determine whether a model is underfitting or overfitting is to look at the model’s error curves on the train and validation datasets. Consider the number of trees (iterations) sequentially constructed in an XGBoost model. While the errors on both train and validation data are on a downward trajectory as iterations increase, the model is still underfitting and there is room to reach the optimal fit.

But as the iterations increase beyond the optimal fit, the validation error rises while the train error keeps decreasing. This is a sign that the model is overfitting on the training data, and that the iterations need to be cut back, or some other regularization method (e.g. L1/L2 terms, column and row sampling) should be considered to reduce overfitting.
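A sketch of reading these curves from staged predictions; scikit-learn’s GradientBoostingRegressor stands in for XGBoost here, and the dataset is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.5, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                  random_state=0)
model.fit(X_tr, y_tr)
# Validation error after each boosting iteration; it falls while the model
# underfits, then creeps back up once the model starts overfitting.
val_errors = [mean_squared_error(y_val, p) for p in model.staged_predict(X_val)]
best_iter = int(np.argmin(val_errors)) + 1
print(f"validation error bottoms out at iteration {best_iter} of 300")
```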

15
Q

How do you conduct feature selection when building models?

A

To conduct feature selection for machine learning, here are a couple of key points to consider:

Types of Feature Selection

  1. Filter Methods - Assess individual features based on their statistical relationship to the target variable, independent of the chosen model.

Methods:
- Correlation analysis: Calculate correlation coefficients (Pearson, Spearman, etc) to identify strong linear and monotonic relationships.
- Information Gain, Mutual Information: Measure how much information a feature provides about the target.
- Chi-Square Test: Evaluate the independence between a feature and the target variable (for categorical data).
- Variance Threshold: Remove features with low variance (i.e., that barely change).
- P-values: Fit a linear regression and select variables based on their p-values. For features with small p-values (generally <= 0.05), you can reject the null hypothesis that the coefficient is zero and mark the feature as important.

Pros: Fast, model agnostic, good for initial screening.
Cons: Might miss features that only become important when combined with others.

  2. Wrapper Methods - Use a machine learning model itself to evaluate the importance of features by iteratively adding and removing them.

Methods:
Recursive Feature Elimination (RFE): Fit a model, rank feature importance, and recursively eliminate the least important features until the optimal subset remains.

Forward Selection: Start with no features, then iteratively add the most useful one at a time until performance stops improving.

Backward Elimination: Start with all features, then iteratively remove the least useful one at a time until performance stops improving.

Note:

Backward Elimination removes one feature at a time and re-evaluates model performance (e.g. accuracy) after each removal to decide what to drop next. RFE uses the model’s internal feature importance scores (e.g. coefficients in logistic regression, feature importances in a random forest) to rank and eliminate features, rather than re-evaluating performance each time.

How many features they drop at once:

Backward Elimination always removes one feature per step. RFE can be configured to eliminate multiple features per iteration, making it faster.

Pros: Take feature interactions into account, can lead to better performing models.
Cons: Computationally expensive, risk of overfitting to the specific model.

  3. Embedded Methods - Feature selection is built directly into the learning algorithm.

Methods:
- L1 Regularization (Lasso): Adds a penalty term to the model that forces coefficients of less important features towards zero.
- Decision Trees and Random Forests: These algorithms inherently provide feature importance scores (e.g. impurity-based importances; SHAP values can also be computed from the trees).

Pros: Efficient, integrated into the modelling process. In addition, the regularization approach can help reduce multicollinearity in the features.
Cons: Specific to the type of model used.

  4. Dimensionality Reduction - Reduce the dimensions of the feature space using dimensionality reduction methods like PCA, ICA, and autoencoders.

Methods:
- PCA/ICA: These approaches use matrix factorizations to reduce the overall dimensions of the feature data.
- Autoencoder - This is a neural network based model that trains on the input set of features, and attempts to recreate the input. The hidden layer embodies the dense, lower dimensional representation of the input data.

Pros: PCA/ICA are easy to implement in capturing the lower dimension representation of the initial data. This can prevent overfitting.
Cons: Loses the interpretability of the feature data as the model trains on the output of the dimensionality reduction models.

Choosing the right techniques

Consider these factors when deciding on feature selection methods:
1. Dataset Size: Filter methods are generally fast, suitable for large datasets. Wrapper methods might be too computationally expensive for very high-dimensional problems.
2. Model type: Embedded methods are often convenient if your primary model choice is already something like a decision tree or a penalized regression model.
3. Goal: If your goal is primarily to understand the most important features, filter or embedded models are good starting points. If maximizing predictive performance is the priority, wrapper methods might yield better results.

Additional Tips:
1. Domain Knowledge: Incorporate your understanding of the problem to guide feature selection.
2. Remove highly correlated features: Identify and potentially remove features that are essentially duplicates of each other.
3. Iteration: Feature selection is often an iterative process; try different methods and evaluate their impact.
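As one concrete example of the wrapper methods above, RFE from scikit-learn on synthetic data; the estimator choice and dataset sizes are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# With shuffle=False, the 3 informative features are the first 3 columns.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Recursively drop the lowest-ranked feature until 3 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X, y)
print("selected feature mask:", rfe.support_)
```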

16
Q

What are the differences between Euclidean and Manhattan distances?

A

Euclidean and Manhattan distances are two common ways to measure the distance between data points. These measures are often used in algorithms that rely on distance calculations, such as k-nearest neighbors (k-NN) and clustering algorithms. Additionally, these measures are used in recommender systems (e.g. collaborative filtering).

Euclidean Distance

Euclidean distance is the straight-line distance between two points in Euclidean space. It is the most common notion of distance in geometry and generalizes to any number of dimensions. It is the square root of the sum of squared differences between corresponding components of the two vectors:

Euclidean(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2)

The problem with Euclidean distance is that it is sensitive to outliers: a single dimension with a very large difference can inflate the overall distance value.

Manhattan Distance

Manhattan distance is another distance measure that sums the absolute values of the differences, rather than the squared differences seen in the Euclidean formula. As a result, it is less sensitive to outliers.

Manhattan(x,y) = |x1 - y1| + |x2 - y2|
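Both formulas on a concrete pair of points:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(x - y))          # 3 + 4 = 7.0
print(euclidean, manhattan)
```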

17
Q

How do you handle missing values in data when building models?

A

Handling missing values is a crucial part of data preparation for machine learning. Here’s a breakdown of the most common techniques, along with considerations on when to use them:

  1. Deletion

Methods:
Row Deletion - Remove entire rows (samples) if they contain missing values.
Column Deletion - Remove the column if it contains a specified proportion of missing values.

When to use:
When a substantial percentage of your data is missing.
When missingness is completely random (not correlated with other features or the target variable).

Risks: Significant loss of information, potential introduction of bias if the missingness isn’t random.

  2. Imputation

Methods:
Mean/Median Imputation: Replace missing values with the mean (for continuous variables) or median (for continuous or ordinal variables) of the feature.
Mode Imputation: Replace missing values with the most frequent value (for categorical features)
Predictive Modelling: Create a model (e.g. KNN) to predict the missing values based on other features.

When to use:
To preserve the sample size
When you have some understanding of the underlying distribution of the features or the relationships between them.

Risks:
Imputed values are estimations, potentially reducing the variance in your data.
Simple imputation (mean/median) can distort the data’s distribution.

  3. Treat Missing Values as a Unique Category

Methods:
Create a new category “missing” for categorical features.
For numerical features, sometimes a special value (e.g. -999) can indicate missingness

When to use:
When missingness itself might hold predictive information.

Risks:
Carefully consider if it makes sense in the context of your problem.

Choosing the Right Strategy
1. Understand the source of the missingness. First, identify whether the missingness comes from a data pipeline issue or from a field that is optional, deprecated, or newly created.
2. Measure the proportion of missing values. If the proportion of missing values is high, this may necessitate removing the column, as it would carry low predictive power for the model. If it’s a moderate amount, consider strategies such as imputation or treating missingness as a unique category.
3. Test different strategies. Ultimately, the best strategy for missingness depends on the model performance. Measure the baseline performance. Iterate through the model building process with different missing value handling approaches.
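A sketch of the three strategies on a toy pandas DataFrame; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, 55_000, np.nan, 61_000],
    "city":   ["NY", None, "SF", "NY"],
})

# 1. Deletion: drop rows with any missing value.
dropped = df.dropna()

# 2. Imputation: median for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# 3. Missingness as its own category.
flagged = df.copy()
flagged["city"] = flagged["city"].fillna("missing")
print(imputed)
```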

18
Q

How do you evaluate regression model metrics?

A

There are five metrics for regression models that we can evaluate.

  1. Mean Squared Error (MSE) - The average squared difference between the actual values (y_i) and the predicted values (y_hat_i) over all samples. Lower MSE indicates a better fit, but MSE is sensitive to outliers since squaring amplifies large errors.
  2. Root Mean Squared Error (RMSE) - The square root of MSE. Expresses the error in the original units of the target (e.g., dollars for housing prices), making it easier to interpret.
  3. Mean Absolute Error (MAE) - The average absolute difference between actual and predicted values. Less sensitive to outliers than MSE.
  4. R-Squared (Coefficient of Determination) - The proportion of variance in the target variable explained by the model. Ranges from 0 to 1, with a higher value indicating a better fit. However, it’s important to note that R^2 can increase simply by adding more features, even if they are not relevant.
  5. Adjusted R-Squared - A modification of R^2 that penalizes adding more features to the model, helping to avoid overfitting. Generally considered a more reliable measure of fit than the standard R^2.

Choosing the Right Metrics

The choice of metrics depends on several factors:

Problem Context: If the absolute magnitude of errors is crucial (e.g., predicting financial losses), MAE might be better. If understanding the proportion of variance explained is important, R^2 is a good choice.
Outliers: If your data has outliers, MAE is often more robust than MSE or RMSE.
Scale of the Target Variable: Metrics like MSE and RMSE depend on the scale of your target variable, so they are not comparable across targets with different scales. Consider normalizing the target, or using a scale-free metric like R^2, when you need such comparisons.
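The five metrics above can be computed directly; a minimal NumPy sketch on toy predictions (the arrays and feature count are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])
n, p = len(y_true), 1  # p = number of features, assumed 1 here

mse = np.mean((y_true - y_pred) ** 2)           # average squared error
rmse = np.sqrt(mse)                             # back in original units
mae = np.mean(np.abs(y_true - y_pred))          # robust to outliers

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
# Adjusted R^2 penalizes additional features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```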

19
Q

What is the formula for Mean Squared Error?

A

MSE = 1/n * summation(y_i - y_hat_i)^2 where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i

20
Q

What is the formula for Root Mean Squared Error?

A

RMSE = sqrt(1/n * summation(y_i - y_hat_i)^2) where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i

21
Q

What is the formula for Mean Absolute Error?

A

MAE = 1/n * summation(absolute value(y_i - y_hat_i)) where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i

22
Q

What is the formula for R-Squared (also known as Coefficient of Determination)?

A

R^2 = 1 - ( summation(y_i - y_hat_i)^2 / summation(y_i - y_mean)^2 ) where
y_i = actual value for sample i
y_hat_i = predicted value for sample i
y_mean is the average of the actual values

23
Q

How do you conduct hyperparameter tuning?

A

Hyperparameter tuning is the process of finding the optimal settings for a machine learning model’s hyperparameters. Here are three common techniques for hyperparameter tuning:

  1. Grid Search:
Concept: Grid search evaluates a model's performance on a predefined grid of hyperparameter values. This grid is created by specifying a range and number of steps for each hyperparameter.
    Exhaustive Search: It tries every single combination of hyperparameter values within the defined grid. This can be computationally expensive, especially for models with many hyperparameters.
    Finding the Best: The combination that yields the best performance metric (e.g., accuracy, F1-score) on a validation set is considered the optimal hyperparameter configuration.
  2. Random Search:
    Concept: Similar to grid search, random search also evaluates a model on different hyperparameter combinations. However, instead of an exhaustive grid, it randomly samples values from predefined ranges for each hyperparameter.
    More Efficient: Random search is often more computationally efficient than grid search, especially for large hyperparameter spaces. It avoids evaluating unnecessary combinations that might occur in a grid search with a dense grid.
    Stochastic Approach: Although random, it ensures each hyperparameter value has a chance of being selected, preventing biases towards specific regions of the search space.
  3. Bayesian Optimization
    Concept: Bayesian optimization is a more sophisticated approach that uses a probabilistic model to guide the search for optimal hyperparameters. It iteratively selects the most promising hyperparameter combinations to evaluate based on past evaluations and a statistical model of the objective function (e.g., loss function).
    Intelligent Search: It prioritizes regions of the search space that are more likely to contain good hyperparameter combinations based on past evaluations. This makes it efficient, especially when dealing with expensive-to-evaluate models.
    Requires More Setup: Compared to grid and random search, it requires more initial setup to define the statistical model and the acquisition functions used to select the next hyperparameter configuration.

Choosing the right technique:
Grid Search: A good choice for low-dimensional hyperparameter spaces (few hyperparameters) or when interpretability of the search process is important. It guarantees exploration of the entire defined grid.
Random Search: A strong alternative to grid search, especially for high-dimensional spaces. It’s generally faster and less prone to getting stuck in bad regions of the search space.
Bayesian Optimization: Ideal for expensive-to-evaluate models or complex hyperparameter spaces. Its efficiency comes from focusing on promising areas and avoiding redundant evaluations. However, it requires more expertise to set up and interpret the results.
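A minimal scikit-learn sketch of the first two techniques; the dataset, estimator, and grid values are illustrative assumptions, not recommendations:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for real training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Grid search: exhaustively tries every combination in the grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=3, scoring="accuracy",
)
grid.fit(X, y)

# Random search: samples a fixed budget of combinations from distributions
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(10, 200),
                         "max_depth": randint(2, 10)},
    n_iter=5, cv=3, scoring="accuracy", random_state=0,
)
rand.fit(X, y)
```

Bayesian optimization needs a separate library (e.g., Optuna or scikit-optimize), which is why it takes more setup.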

24
Q

How do L1 and L2 regularization terms work?

A

L1 and L2 regularization are techniques used in machine learning to prevent overfitting. An overfit model learns the training data too well, adapting to noise and peculiarities, which leads to poor performance on new, unseen data.

Here’s how they work:

L1 Regularization (Lasso)
Concept: L1 regularization adds a penalty term to the model’s cost function that is proportional to the ABSOLUTE VALUE of the model’s coefficients (weights).
Effect: It encourages coefficients to become exactly zero, effectively performing feature selection. Only the most important features retain large, non-zero coefficients. This can help create simpler and more interpretable models.
Sparsity: L1 regularization introduces sparsity into the model. A sparse model has many coefficients set to zero.

L2 Regularization (Ridge)
Concept: L2 regularization adds a penalty term to the model’s cost function that is proportional to the SQUARE of the coefficients.
Effect: It discourages large coefficients by penalizing them more heavily, but it doesn’t force them to become exactly zero. This leads to smaller, more spread-out weights.
Prevents Overfitting: L2 regularization helps avoid situations where single features get large weights and dominate the model’s predictions.

Visualizing the Difference:
Picture the L1 and L2 regularization constraint regions:

  • L1 regularization’s constraint region is diamond-shaped, so the loss contours tend to touch it at a corner, pushing coefficients exactly to zero.
  • L2 regularization’s constraint region is circular, resulting in smaller coefficients that rarely become exactly zero.

When to Use Which
L1 for Feature Selection: If you want to reduce the number of features in your model and improve interpretability, L1 regularization is often preferable.
L2 for Preventing Overfitting: Generally, L2 is a good default choice for preventing large weights and reducing overfitting, especially when you don’t need explicit feature selection.
Elastic Net: Combines L1 and L2, useful when you want the benefits of both kinds of regularization.
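The sparsity difference can be seen directly with scikit-learn; a sketch on synthetic data where only two of ten features matter (the data and alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Target depends only on the first two features; the other 8 are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# L1 drives irrelevant coefficients exactly to zero (sparsity);
# L2 merely shrinks them toward zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```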

25
Q

How do you make your models more robust to outliers?

A

Data Preparation Techniques:
- Outlier detection and treatment:
  * Use techniques like box plots, z-scores, or isolation forests to identify potential outliers.
  * Consider removing outliers if they are clearly erroneous, or cap extreme values at a reasonable threshold.
- Robust Scaling: Use robust scalers based on statistics like the median and interquartile range (or median absolute deviation, MAD) that are less affected by outliers than standard scaling methods.
- Transformations: Applying transformations like the log or square root to your data can help reduce the impact of outliers.

Robust Modeling Algorithms
- Tree-Based Models: Decision trees, random forests, and gradient boosting machines are generally less sensitive to outliers than linear models.
- Robust Regression: Use robust regression techniques like RANSAC, the Theil-Sen estimator, or Huber regression, which are designed to handle outliers.

Regularization
- L1 Regularization (Lasso): Can shrink the coefficients of less important features toward zero, potentially reducing the impact of outliers.
- L2 Regularization (Ridge): Shrinks all coefficients, which can make the model less sensitive to individual data points (outliers).

Ensemble Methods:
- Bagging: Training multiple models on different subsets of data and averaging the predictions helps reduce the impact of individual outliers.
- Boosting: Sequentially building models, with each new model focusing on the errors of the previous models, can improve robustness.
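One of the simplest treatments mentioned, capping extreme values at a percentile threshold (winsorizing), can be sketched with NumPy; the data and the 1st/99th percentile cutoffs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50.0, scale=5.0, size=1000)
x[:3] = [500.0, -400.0, 999.0]   # inject gross outliers

# Cap values outside the 1st-99th percentile range
lo, hi = np.percentile(x, [1, 99])
x_capped = np.clip(x, lo, hi)    # the 999.0 and -400.0 spikes are capped
```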
26
Q

How do you build a predictive model end-to-end?

A

1. Business Understanding - This initial phase involves defining the business problem and goals. You need to identify what you're hoping to achieve with the model. Some factors to consider include:
  * Business Goals: What specific outcomes are you hoping to achieve?
  * Business Impact: How will the model impact the business financially?
  * Latency Requirements: How quickly do you need the model to generate predictions?
  * Data Requirements: What data is available to train the model, and how much of it is required?
2. Data Preparation - Once you have a clear understanding of the business goals, you can start collecting and preparing the data for modeling. This stage involves several steps, including:
  * Data Collection: Gathering data from relevant sources
  * Data Cleaning: Identifying and fixing errors and inconsistencies in the data
  * Data Parsing: Formatting the data into a usable form for modeling
  * Data Transformation: Transforming the data to create new features that might be useful for modeling
3. Feature Engineering - Involves creating new features from the existing data. The goal is to improve the model's ability to learn patterns from the data. Common feature engineering techniques include:
  * Aggregation: Combining data points into summaries
  * Encoding: Converting categorical data into numerical data
  * Feature Binning: Grouping similar values into bins
4. Feature Selection - Selecting the most relevant features for modeling. Several techniques are considered in feature selection:
  * Model-Based: Feature importance built into the model itself
  * Filtering: Using univariate statistics like Pearson correlation to rank signals in order of importance
  * Wrapper: Forward selection and backward elimination, which test various combinations of signals to identify the combination that produces the best-performing model
  * PCA / ICA: Dimensionality reduction of the features into a lower-dimensional space
  * Feature Clustering: Grouping of features into a single signal
5. Model Training - This stage involves splitting your data into train, validation, and test sets, then training your model on the training set and optimizing parameters on the validation set. There are many different algorithms available; the best choice for your project will depend on your modeling task:
  * Regression: Used for predicting continuous outcomes (e.g., predicting house prices)
  * Classification: Used for predicting categorical outcomes (e.g., predicting whether a customer will churn)
  * Clustering: Used to segment (e.g., creating customer archetypes based on customer profile and card transaction data)
6. Model Evaluation - This stage involves testing your model on the test set with metrics based on your modeling task:
  * Regression: MSE, RMSE, MAPE
  * Classification: Accuracy, AUC, F1-Score, Precision, Recall
  * Clustering: Accuracy (External Validation), Sum of Squares (Internal Validation), Silhouette Coefficient (Internal Validation)
7. Model Deployment - After training and evaluating the model, you deploy it to production. Depending on the prediction task, the inference is batch or real-time.
  * Batch Inference: Predictions are queued and then provided on an hourly, daily, weekly, or monthly cadence. This is often used for providing reports with forecasts.
  * Real-Time Inference: This involves creating a model API service using REST API frameworks like FastAPI to generate real-time predictions from user activity and profile data. This also involves model staging and versioning: tracking DEV, UAT, and PROD versions of the model, then versioning the model as it is updated. Lastly, the model needs to be continually monitored for performance consistency and retrained if performance degrades over time.
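Steps 5 and 6 of the lifecycle can be sketched with scikit-learn; the synthetic dataset stands in for data already prepared in steps 1-4, and the model choice is only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for prepared data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 5: training, with scaling bundled in a Pipeline so it is
# fit on the training split only (no leakage into the test set)
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression())])
model.fit(X_train, y_train)

# Step 6: evaluation on the held-out test set
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```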
27
Q

How can you conduct feature engineering?

A

Feature engineering involves conducting exploratory data analysis to create and identify new variables from raw ones that can produce signals to improve model performance. The methods differ by variable type:

1. Continuous Variables:
  * Discretization: Bins continuous variables into a finite number of intervals. This can be useful for models that don't handle continuous data well or for creating features that represent specific ranges of values.
  * Log Transformation: Transforms skewed (asymmetrical) data toward a more normal distribution. This can improve the performance of some machine learning models.
  * Scaling: Standardizes features to have a mean of zero and a standard deviation of one. This can be important for some machine learning models, especially those that use distance-based metrics.
2. Categorical Variables:
  * One-Hot Encoding: Converts categorical variables into binary vectors, with a new feature created for each category. This is useful for machine learning models that can't handle categorical data directly.
  * Label Encoding: Assigns a numerical value to each category. However, this can mislead the model into interpreting the difference between categories as ordinal, when it may not be.
3. Text Variables:
  * Bag-of-Words: Represents text data as a collection of words, ignoring grammar and word order. Each word is a feature, with its value indicating its frequency in the text.
  * TF-IDF: Similar to bag-of-words, but it weights the importance of words based on how common they are in the dataset overall and how frequently they appear within a specific document. This can help identify words that are distinctive and informative.
  * Text Embedding: Transforms text data into numerical vectors that capture the semantic meaning of the words. This allows machine learning models to perform operations on text data like similarity comparisons.
4. Time Variables:
  * Date/Time Decomposition: Extracts features like year, month, day, hour, minute, and second from a date/time variable. This can be useful for tasks like modeling seasonal patterns.
  * Sin/Cos Transformation: Encodes cyclical patterns. For example, you could apply sine and cosine transforms to the hour of the day to capture daily seasonality.
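Two of the transformations above, the log transform and the sin/cos encoding, as a NumPy sketch (the income and hour values are made up for illustration):

```python
import numpy as np

# Log transform for a right-skewed continuous variable
income = np.array([20_000.0, 35_000.0, 50_000.0, 1_000_000.0])
log_income = np.log1p(income)   # log1p also handles zeros safely

# Sin/cos encoding of hour-of-day, so hour 23 and hour 0 end up
# close together in feature space (unlike the raw values 23 and 0)
hours = np.array([0, 6, 12, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
```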
28
Q

How do you handle categorical variables in modeling?

A

Handling categorical variables properly is vital, as the wrong method can cause the model to overfit and/or underperform. Several factors to consider:
* Type of categorical variable: nominal or ordinal
* Cardinality of the feature: number of unique categories
* Relationship with the target variable
* Model you're using: some models handle categorical variables natively, while others require encoding

Here are some common techniques for handling categorical variables:

1. Encoding Techniques
  * One-Hot Encoding: Creates new binary columns for each unique category in the original feature.
    Pros: Easy to implement, avoids introducing ordinality into categorical data.
    Cons: Can significantly increase dimensionality (especially for high-cardinality features), potentially leading to overfitting.
  * Label Encoding: Assigns a unique integer to each category.
    Pros: Simple, computationally efficient.
    Cons: Introduces a false sense of order/magnitude, potentially misleading models (especially linear and distance-based ones; tree-based models are less affected).
  * Target Encoding (Mean Encoding): Replaces each category with the mean target value (for regression) or the proportion of positive outcomes (for classification) within that category.
    Pros: Keeps dimensionality low even for high-cardinality features, and can capture meaningful relationships between the category and the target.
    Cons: Prone to overfitting and target leakage if the number of samples within each category is small; smoothing or computing the encoding on out-of-fold data helps.
2. Feature Engineering Techniques
  * Frequency Encoding: Replaces categories with their frequency (how often they occur in the data). Often used for handling rare categories or when category frequency may be relevant.
    Pros: Can handle unseen categories, useful for rare categories.
    Cons: Loses information about the original categories.
  * Hashing Trick: Reduces dimensionality by hashing categories into a smaller number of buckets. Useful for very high-cardinality features where one-hot encoding is impractical.
3. Specialized Models
  * Tree-based models (Decision Trees, Random Forests, Gradient Boosting): Some implementations can handle categorical variables without explicit encoding.
  * Embeddings (mainly in Neural Networks): Map categories to dense numerical representations, capturing semantic relationships between categories.
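Target (mean) encoding is simple enough to sketch with the standard library alone; the toy feature and target are made up for illustration, and in practice the mapping should be fit on training folds only to avoid leakage:

```python
from collections import defaultdict

# Toy categorical feature and binary target
cities = ["NY", "SF", "NY", "LA", "SF", "NY"]
y      = [1,    0,    1,    0,    1,    0]

# Target (mean) encoding: replace each category with its mean target
sums, counts = defaultdict(float), defaultdict(int)
for c, t in zip(cities, y):
    sums[c] += t
    counts[c] += 1
encoding = {c: sums[c] / counts[c] for c in sums}

encoded = [encoding[c] for c in cities]  # NY -> 2/3, SF -> 0.5, LA -> 0.0
```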
29
Q

How does model ensembling work?

A

Model ensembling is a powerful technique in machine learning where you combine the predictions from multiple models to improve overall performance. Here's how it works:

Core Concepts:
1. Base Models: Ensembling begins with training several different models, often referred to as "base models" or "weak learners". These can be different types of algorithms (e.g., decision tree, linear regression, neural network) or the same algorithm trained on different subsets of the data or with different hyperparameters.
2. Diversity: The key to successful ensembling is ensuring diversity among your base models. The models should make different kinds of errors so that their strengths and weaknesses complement each other.
3. Combination Techniques: Once you have predictions from your base models, you need a way to combine them into a final prediction. Common techniques:
  * Averaging (Regression): Simply calculate the average of the predictions from each model. This works well for regression problems.
  * Voting (Classification): Each model "votes" for a class label; the final prediction is the class that receives the most votes.
  * Weighted Averaging/Voting: Assign a weight to each model based on its performance on the validation set, and compute a weighted average of predictions or a weighted majority vote.
  * Stacking: Train a meta-model that learns how to best combine the predictions of the base models.

Why Ensembling Works:
* Bias-Variance Tradeoff: Different models have different biases (tendencies to consistently under- or overfit) and variances (sensitivity to changes in training data). By combining multiple models, ensembling can reduce overall variance and potentially reduce bias, leading to more stable and generalizable models.
* Wisdom of the Crowd: The idea that the collective judgement of a group is often better than individual judgement applies in machine learning as well. Combining multiple models can leverage their individual strengths.

Common Ensembling Methods:
* Bagging (e.g., Random Forests): Trains multiple models in parallel on different random subsets of the data, reducing variance.
* Boosting (e.g., Gradient Boosting, AdaBoost): Trains models sequentially, where each new model focuses on correcting the errors of the previous one, reducing bias.
* Stacking: Combines models through a meta-learner that figures out how to best combine their individual predictions.
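The averaging and voting combination techniques can be sketched in NumPy; the probability arrays below are hypothetical outputs of three base models, made up for illustration:

```python
import numpy as np

# Hypothetical class-1 probabilities from three base models
# on four samples
p1 = np.array([0.9, 0.4, 0.2, 0.6])
p2 = np.array([0.8, 0.6, 0.1, 0.4])
p3 = np.array([0.7, 0.3, 0.3, 0.7])

# Soft voting: average the probabilities, then threshold
avg = (p1 + p2 + p3) / 3
soft_pred = (avg >= 0.5).astype(int)

# Hard voting: each model votes with its thresholded label,
# majority (2 of 3) wins
votes = (np.stack([p1, p2, p3]) >= 0.5).astype(int)
hard_pred = (votes.sum(axis=0) >= 2).astype(int)
```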
30
Q

How do you handle text features in prediction tasks?

A

Handling text features in prediction tasks involves transforming raw text into meaningful numerical representations that machine learning models can understand. Here's a breakdown of the key steps:

1. Preprocessing
  a. Tokenization: Break the text into individual units like words, phrases, or characters.
  b. Normalization: Convert text to lowercase, remove punctuation, handle misspellings.
  c. Stop word removal: Filter out common words like "the", "and", "is" that may carry less semantic value.
  d. Stemming/Lemmatization: Reduce words to their root form (e.g., "running" -> "run") to group similar words together.
2. Feature Representation
  * Bag-of-Words (BoW): Create a vocabulary of all unique words; represent each document as a vector where each element is the count of a word within the document.
  * TF-IDF (Term Frequency-Inverse Document Frequency): Extension of BoW that downweights words common across documents while emphasizing distinctive words. TF-IDF is often a better representation for predictive tasks.
  * Word Embeddings: Represent words as dense, low-dimensional vectors that capture semantic and syntactic relationships between words (e.g., Word2Vec, GloVe). This approach allows models to understand context and similarity.
3. Feature Engineering (Optional)
  * n-grams: Consider sequences of words (e.g., bigrams, trigrams) in addition to single words to capture phrases and context.
  * Topic Modeling: Use techniques like Latent Dirichlet Allocation (LDA) to identify underlying topics within a collection of documents.
  * Sentiment Analysis: Extract sentiment scores (positive, negative, neutral) from text to use in predictive models.
4. Modeling
Choose an appropriate model based on your task:
  * Classification: Naive Bayes, Support Vector Machines (SVMs), Random Forests, Gradient Boosting, and Neural Networks (with text embedding layers) are commonly used for text classification tasks.
  * Regression: Models like Linear Regression, Lasso/Ridge Regression, and Neural Networks can handle text features for regression tasks (e.g., predicting sentiment scores).

Example: Sentiment Analysis
Say you want to predict whether a movie review is positive or negative:
1. Preprocessing: Tokenize the reviews, normalize text, remove stop words.
2. Representation: Create a TF-IDF matrix representation of the reviews.
3. Modeling: Train a classification model (e.g., Support Vector Machine, Logistic Regression, or Neural Network) using the TF-IDF representation as input and the sentiment labels (positive/negative) as targets.
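The sentiment analysis example can be sketched end-to-end with scikit-learn; the four training reviews are made up for illustration, far too few for a real model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = positive, 0 = negative
reviews = ["great movie loved it",
           "terrible plot boring acting",
           "loved the acting great fun",
           "boring terrible waste of time"]
labels = [1, 0, 1, 0]

# TF-IDF representation feeding a linear classifier; the vectorizer
# handles tokenization and lowercasing internally
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)

pred = clf.predict(["loved it great"])[0]
```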
31
Q

What is the pseudocode of the Logistic Regression model?

A

Initialize weights w to zeros (or small random values)
Initialize bias b to 0
Set learning rate α and number of epochs

For each epoch:
  For each training example (x, y):
    # Forward pass
    z = w · x + b                  # linear combination
    ŷ = σ(z) = 1 / (1 + e^(-z))    # sigmoid activation
    # Compute gradient (using BCE loss)
    dw = (ŷ - y) · x
    db = (ŷ - y)
    # Update parameters (gradient descent)
    w = w - α · dw
    b = b - α · db
  (Optional) Compute and log loss:
    L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

Prediction:
  If ŷ ≥ 0.5 → class 1
  Else → class 0

Key pieces to remember: The sigmoid function squashes the linear output into a probability between 0 and 1. The loss function is Binary Cross-Entropy (log loss). The gradients turn out to be elegantly simple — (ŷ - y) — which is identical in form to linear regression's gradient, just with ŷ produced by the sigmoid rather than the raw linear output.
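The pseudocode above translated into a runnable NumPy sketch, using batch gradient descent rather than per-example updates, on a made-up linearly separable toy dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Toy labels: class 1 when x0 + x1 > 0
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights initialized to zeros
b = 0.0           # bias initialized to 0
alpha = 0.5       # learning rate

for epoch in range(500):
    y_hat = sigmoid(X @ w + b)         # forward pass
    dw = X.T @ (y_hat - y) / len(y)    # BCE gradient w.r.t. w
    db = np.mean(y_hat - y)            # BCE gradient w.r.t. b
    w -= alpha * dw                    # gradient descent updates
    b -= alpha * db

pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = np.mean(pred == y)
```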
32
Q

What is the pseudocode of the decision tree algorithm?

A

Decision tree training involves recursively splitting the dataset into smaller subsets based on the (feature, threshold) pair that best separates the different classes. The process starts with all data at the root node. The algorithm examines all the features and possible split points (thresholds), calculating measures like Gini impurity to determine the (feature, threshold) pair that creates the most homogeneous subsets (ideally containing mostly examples of a single class). The data is divided based on the selected (feature, threshold) pair and branches are created. This splitting process is repeated recursively on these subsets until stopping criteria are met (like reaching a maximum tree depth or having too few examples in a node). The final nodes, called leaves, hold the predictions. In regression, predictions are the averaged target values in the leaves; in classification, predictions are based on the proportion of each class found in the leaves.

Decision Tree (CART) Pseudocode:

Function BuildTree(dataset, depth):
  # Base cases (stopping criteria)
  If all samples belong to same class → return Leaf(class)
  If no features remaining → return Leaf(majority class)
  If depth ≥ max_depth → return Leaf(majority class)
  If num_samples < min_samples_split → return Leaf(majority class)

  # Find the best split
  best_feature, best_threshold = None
  best_score = +∞ (or −∞ for info gain)
  For each feature f:
    For each unique threshold t in f's values:
      left = samples where f ≤ t
      right = samples where f > t
      score = impurity(left, right)
      If score is better than best_score:
        best_score = score
        best_feature = f
        best_threshold = t

  # If no valid split improves purity → return leaf
  If no improvement → return Leaf(majority class)

  # Recurse
  left_split = samples where best_feature ≤ best_threshold
  right_split = samples where best_feature > best_threshold
  left_child = BuildTree(left_split, depth + 1)
  right_child = BuildTree(right_split, depth + 1)
  return Node(best_feature, best_threshold, left_child, right_child)

--- Impurity Measures ---
Gini(S) = 1 − Σ pᵢ²
Entropy(S) = − Σ pᵢ · log₂(pᵢ)
Weighted impurity of a split:
Score = (|left|/|total|)·Impurity(left) + (|right|/|total|)·Impurity(right)

--- Prediction ---
Function Predict(x, node):
  If node is Leaf → return node.class
  If x[node.feature] ≤ node.threshold:
    return Predict(x, node.left_child)
  Else:
    return Predict(x, node.right_child)

Key points to remember: The algorithm is a greedy, recursive partitioning strategy — at each node it picks the locally optimal split, with no backtracking. The two most common impurity measures are Gini impurity (the CART/sklearn default) and information gain / entropy (used by ID3/C4.5). Stopping criteria (max depth, min samples, min impurity decrease) act as a form of regularization to prevent overfitting.
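The impurity measures and the weighted split score from the pseudocode, as a small runnable sketch (the label lists are made up for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2) over class proportions p_i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(left, right):
    """Weighted impurity of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A pure split scores 0; a maximally mixed binary split scores 0.5
pure = split_score([0, 0], [1, 1])
mixed = split_score([0, 1], [0, 1])
```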
33
Q

What's the difference between random forest and logistic regression?

A

Both logistic regression and random forest are popular models for performing binary classification, but their methodologies differ greatly.

Random forest leverages a technique called bagging, which averages predictions across several models fitted on bootstrap samples of the data. This technique reduces the variance of the model, thereby increasing model generalization. The equation for bagging is: f_hat = 1/T * summation from t=1 to T of f_t(x) (cf. image). Each f_t(x) is a prediction from one of a forest of T decision trees; averaging across the T trees generates the random forest prediction, f_hat.

Logistic regression, on the other hand, is a member of the generalized linear models (GLM) family. The model uses a sigmoid function, f(x) = 1 / (1 + e^(-x)), to map a linear combination of features onto a probability between 0 and 1. Equivalently, the log-odds of the response are modeled as a linear equation of coefficients and an intercept: ln(p/(1-p)) = beta_0 + beta_1*X_1 + beta_2*X_2 + ... + beta_k*X_k (cf. image)

Random Forest Procedure
1. For b = 1 to B:
  (a) Draw a bootstrap sample Z* of size N from the training data.
  (b) Grow a decision tree T_b on the bootstrap sample, considering a random selection of M variables at each split.
2. For each observation, collect the predicted class across the trained trees and choose the majority.

Logistic Regression Procedure
1. Initialize parameters (coefficients + intercept), e.g., with standard normal values.
2. Apply gradient descent to optimize the parameters; repeat until a stopping rule is met (cf. image with application of gradients).

Model Tuning
When you discuss model tuning, relate how the tuning affects model performance. Even better, relate each tuning parameter to the variance-bias trade-off. Note that a standard way to tune hyperparameters is grid search with cross-validation: you pick the parameters that produce the best cross-validated performance.

Random Forest:
1. Number of Trees - Determines the number of bootstrapped trees to train. As the number of trees increases, variance decreases while bias stays roughly constant; think of how the variance of a sample mean decreases as the sample size increases. Similarly, with more trees, prediction scores converge toward a mean with less variability. More trees benefit generalization, though returns diminish and training cost grows.
2. Tree Depth - Depth determines how closely each tree fits the training data. Too much depth increases variance and decreases bias. Finding an optimal balance between tree depth and number of trees is a must to build a high-performance random forest model.
3. Column and Row Sampling - If two trees are trained on the same set of observations and features, the resulting predictions are the same. This defeats the purpose of a "random" forest, which is designed to decrease variance by averaging diverse tree estimators. Sampling observations and columns decorrelates the trees, which prevents overfitting and increases generalization.
4. Minimum Samples Leaf and Minimum Samples Split - These two parameters control whether a node of a tree should be split. Suppose minimum samples split equals 8: a node containing 9 observations is eligible for splitting. However, if minimum samples leaf is 2 and the candidate split would produce leaf nodes with 1 and 8 observations, the split will not occur. As the thresholds for both decrease, bias decreases while variance increases.

Logistic Regression
Unlike random forest, logistic regression contains a smaller set of parameters to tweak. The key parameters involve regularization, which reduces model overfitting. The two main types of regularization are Ridge and Lasso, which shrink weak coefficients toward 0 (cf. image for the Ridge and Lasso definitions). There are two main differences between Ridge and Lasso shrinkage. Unlike Ridge, Lasso can reduce the coefficients of weak predictors exactly to 0; therefore, Lasso performs feature selection, where strong predictors remain in the model while weak predictors are removed. The other main difference is that Ridge has a closed-form solution, meaning the beta vector can be solved with matrix algebra, while the Lasso betas cannot, so convex optimization (e.g., coordinate or gradient descent) is required to estimate them. Note that both Ridge and Lasso forgo unbiased estimation of the beta coefficients in exchange for reduced variance. It's important to understand this trade-off: accepting some increase in bias to decrease variance can improve generalization to unseen data.

Model Interpretability
Both random forest and logistic regression models provide variable importance. Random forest uses mean Gini decrease; logistic regression uses p-values, with lower values conveying higher predictor importance. Additionally, logistic regression provides interpretability that random forest does not: for each predictor, it provides an odds ratio comparing a focal group to a reference group (e.g., male versus female).
34
Q

What's the difference between bagging vs boosting?

A

A single decision tree is prone to overfitting as the tree depth increases: the flexibility of the decision boundary grows, which in turn increases the variance of the model. To reduce the model variance, we can combine predictions from multiple trees trained on the same data. Two such techniques are bagging and boosting.

Bagging: Bagging is quite simply the averaging of models trained on random bootstrap samples of the same original data. Random forest, for instance, first trains K decision trees independently, then averages the predictions from the trees to produce a final prediction. This technique is helpful in reducing the variance of the model.

Boosting: Boosting involves iteratively training weak learners such as shallow decision trees. The errors of one weak learner determine the sample weights (or residual targets) used to train the next learner in the iteration. The weak learners' predictions are then combined in a weighted sum to produce the final prediction. Because the errors from the previous learner influence the training of the next model, boosting primarily decreases the bias of the model.
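The variance-reduction effect of bagging can be illustrated with a stylized NumPy simulation; it assumes the base learners' errors are independent (real bagged trees are correlated, so the actual reduction is smaller):

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 5.0
n_trials, B = 10_000, 25

# Each "weak learner" prediction = truth + independent noise
single = true_value + rng.normal(0.0, 2.0, size=n_trials)

# A bagged prediction averages B such learners per trial
bagged = (true_value
          + rng.normal(0.0, 2.0, size=(n_trials, B))).mean(axis=1)

# Averaging B independent estimators divides the variance by ~B,
# while the mean (and hence the bias) is unchanged
```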
35
1. How is the random forest trained? 2. How does increasing depth affect the model in terms of variance and bias trade-off? What about increasing the number of trees?
Random forest is an average of decision trees independently trained on bootstrapped datasets. A single decision tree is prone to overfitting. Averaging trees, a technique called bagging in machine learning, is a way to reduce the variability of the model, thereby reducing overfitting. When you increase the depth of the trees in a random forest, the variance of the model increases as the training data is split further, leaving fewer samples to average in the terminal leaves. In exchange, the bias of the model will decrease, per the variance-bias trade-off. When you increase the number of trees in the random forest model, the variance decreases while the bias stays essentially unchanged: averaging more independently trained trees does not change each tree's expected prediction, it only smooths out their variability. As discussed in the intuition behind the random forest, as more trees are averaged, the model's variance decreases.
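A quick sketch (synthetic data) of the number-of-trees effect, using the out-of-bag R² score as a built-in validation signal; adding trees should stabilize or improve the score rather than trade off against bias.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

scores = {}
for n in (25, 200):
    rf = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=0)
    rf.fit(X, y)
    # Out-of-bag score: each tree is evaluated on the samples it never saw.
    scores[n] = rf.oob_score_
```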
36
Explain decision tree, random forest and boosted trees in terms of variance and bias.
Key Points: 1. Discuss how each of the three models fits data 2. Discuss variance and bias 3. Compare and contrast the trees based on the criteria This is not an open-ended question. You are evaluated on the accuracy of your response. If your explanation is unclear or incorrect, the mistake will cost you greatly. Ensure that you understand how each of the three algorithms works, as they are common questions across companies (not just the top ones). Also, do not misinterpret variance and bias. Decision tree, the simplest of the three, uses Gini impurity, a measure of heterogeneity in a sub-space of the data, to choose splits. It optimizes for the split that results in the highest purity, such that ideally one partition contains all the class 0's and the adjacent one contains all the class 1's. The issue with a decision tree is overfitting, meaning that its prediction on unseen data performs poorly. Random forest addresses the issue with bagging: bootstrapping samples of data and averaging the predictions across multiple decision trees. Boosted trees, such as XGBoost, AdaBoost, GBM, CatBoost, and LightGBM, all share the same idea, in that the trees are constructed sequentially such that the observation weights (or residual targets) and the weight of the current model are based on the misclassification of the tree constructed in the previous iteration. Ultimately, unlike random forest, which assumes independence and averages with equal weights across trees, boosting assumes sequential dependence and weighted averaging. Consider that the decision tree has the lowest bias and highest variance, resulting in overfitting on training data. Random forest addresses the high variance issue with bagging of several trees. With a decrease in variance, bias will slightly increase, as there is a trade-off between bias and variance. Boosted trees provide the best performance out-of-the-box as bias is low and variance is moderate (assuming that the hyperparameters are tuned properly).
37
Suppose you apply K-Means to cluster two different types of datasets - one raw and one scaled. How does clustering change after scaling?
Key Points: 1. Briefly outline K-Means 2. Explain scaling techniques and the impacts on clustering 3. Provide reasoning on recommendation. The interview question assumes that you are familiar with the K-Means algorithm, a basic machine learning technique. Oftentimes, the interviewer will ask you a scenario-based question that tests beyond basics. Start with the definition of K-Means to demonstrate that you grasp the basics. K-Means randomly initializes K centroids and assigns each data point to the nearest centroid. The position of each centroid is updated based on the multivariate mean of data points assigned to its cluster. This process is repeated until convergence (e.g., the centroids stop moving, or an internal or external validity score stabilizes). Next, discuss the benefit of scaling on K-Means. Note that, generally, Euclidean distance is used in K-Means to assign each data point to the nearest centroid, and the centroid update is a mean, which is sensitive to outliers. For instance, suppose that feature set A contains 1, 2, 3, 3, 4, and 10. The mean without the outlier 10 is 2.6. With the outlier, the mean skews to about 3.8. The bottom line is that outliers skew the quality of clusters. However, note that not all scalings handle outliers equally, as some still employ averaging; use techniques such as robust scaling, which leverages quantiles. The other benefit to scaling is that when two variables greatly differ in range, the centroids will shift toward the feature with the highest variance. Suppose that X1 contains 5 and 6 as the 10th and 90th percentiles respectively, while X2 contains 5,000 and 16,000. Additionally, the variance of X1 is 0.23 while the variance of X2 is 2,540. The high variance of X2 will dominate the distance computation and distort the centroid positioning. Therefore, scaling is a must.
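A sketch (synthetic data) of the range effect described above: the clusters differ only along x1, but x2 has a far larger variance, so K-Means on raw data splits along x2; after standardization it recovers the true clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clusters separated only along x1; x2 is pure noise on a huge scale.
a = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])
b = np.column_stack([rng.normal(10, 1, 100), rng.normal(0, 1000, 100)])
X = np.vstack([a, b])
true_labels = np.array([0] * 100 + [1] * 100)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

ari_raw = adjusted_rand_score(true_labels, raw)        # near 0: split by x2
ari_scaled = adjusted_rand_score(true_labels, scaled)  # near 1: split by x1
```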
38
What is Principal Component Analysis (PCA)?
The Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features (dimensions) in a dataset while retaining as much of the important information (variance) as possible. It creates new, uncorrelated features called "principal components". These are linear combinations of the original features. The first principal component explains the most variance, the second explains the second-most, and so on. This allows you to discard less important components. Cf. the image. Why do we use PCA? 1. Reduce Overfitting: High-dimensional data can lead to overfitting in machine learning models. PCA lowers dimensionality, helping to manage this. 2. Improved Visualization: It's easier to visualize data in 2D or 3D. PCA helps project high-dimensional data into lower dimensions for visualization. 3. Faster Computation: Machine learning models generally train and run faster with fewer features. PCA Procedure 1. Standardize the data: Subtract the mean from each feature and scale to have unit variance. 2. Calculate the covariance matrix: This matrix shows how the features in your data are related. 3. Eigenvectors and Eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions of maximum variance in the data and eigenvalues represent the amount of variance explained by each eigenvector. 4. Choose components: Sort eigenvectors by decreasing eigenvalues and select the top "k" that explain a desired amount of variance. Typically, choose the k components that capture at least 80 to 90% of the variance. 5. Transform the data: Project the original data onto the selected eigenvectors to get the new lower-dimensional representation.
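The procedure above, sketched with sklearn on synthetic 3-D data that actually lies near a 2-D plane: standardize, fit, inspect explained variance, project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 3-D data generated from 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + rng.normal(scale=0.05, size=(200, 3))

X_std = StandardScaler().fit_transform(X)   # Step 1: standardize
pca = PCA(n_components=2).fit(X_std)        # Steps 2-4: covariance, eigen, select
X_reduced = pca.transform(X_std)            # Step 5: project

explained = pca.explained_variance_ratio_.sum()  # variance retained by 2 PCs
```

Because the data is nearly two-dimensional, two components retain almost all of the variance here.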
39
How does the K-Means algorithm work? What is the pseudocode?
C.f. image. K-Means is an unsupervised clustering algorithm that groups data points into distinct clusters based on their similarity. Here's the core idea:

1. Initialization:
* Choose the number of clusters you want to create (this is the K in K-Means)
* Randomly place K centroids (cluster centers) in the data space
2. Assignment: For each data point in your dataset:
* Calculate the distance between the data point and each of the K centroids
* Assign the data point to the cluster whose centroid is the closest
3. Centroid Update: For each cluster:
* Recalculate the centroid by taking the average (mean) of all the data points assigned to that cluster
4. Repeat:
* Repeat steps 2 and 3 until the centroids stop moving significantly or a set number of iterations is reached

Pseudocode

Function KMeans(dataset X, k, max_iters):
    # Step 1: Initialize k centroids
    centroids = randomly select k points from X (or use K-Means++ initialization)
    For iter = 1 to max_iters:
        # Step 2: Assignment Step
        For each data point xᵢ:
            Compute distance to each centroid cⱼ
            Assign xᵢ to cluster of nearest centroid: clusterᵢ = argmin_j ‖xᵢ − cⱼ‖²
        # Step 3: Update Step
        For each cluster j = 1 to k:
            cⱼ = mean of all points assigned to cluster j
               = (1/|Cⱼ|) · Σ xᵢ for xᵢ ∈ Cⱼ
        # Step 4: Check convergence
        If centroids have not changed (or change < ε): break
    return centroids, cluster_assignments

Important Considerations
* Determining K: Choosing the right number of clusters is a non-trivial part of K-Means. Methods like the Elbow Method or Silhouette Scores can help.
* Initialization Sensitivity: Due to random initialization, K-Means might converge to different clustering solutions on different runs. Running it multiple times with different starting points can help.
* Distance Measures: Euclidean distance is most common, but other distance metrics (e.g., Manhattan distance) can be used depending on your data.
Key points to remember: The algorithm alternates between two steps — assign points to nearest centroid, then recompute centroids as cluster means. It is guaranteed to converge (inertia decreases monotonically), but only to a local minimum, not the global one. That's why initialization matters so much: K-Means++ gives an O(log k) approximation guarantee and is the sklearn default. The objective being minimized is WCSS (Within-Cluster Sum of Squares), also called inertia.
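The pseudocode above can be sketched in NumPy (plain random initialization rather than K-Means++, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means (plain random init; empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence check
            break
        centroids = new_centroids
    return centroids, labels
```

Because of the local-minimum issue, a production implementation would rerun with several initializations (or use K-Means++) and keep the run with the lowest inertia.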
40
How do you find the optimal K in K-Means clustering?
The two common methods in K-Means are the Elbow and Silhouette Methods. Check the images and refer to the text below. 1. The Elbow Method - You plot the within-cluster sum of squares (WCSS) as a function of the number of clusters (K). The goal is to find the point where adding more clusters no longer leads to a significant improvement in data representation. This point is "the elbow". Here's the procedure: 1. Run K-Means for different values of K (e.g., K = 1 to 10) 2. For each K, calculate the within-cluster sum of squares (WCSS) - this measures how compact the clusters are. 3. Plot WCSS vs K. Look for the "elbow" - the point where the rate of decrease in WCSS sharply slows down. As seen in the image, the inflection point is at K=4, which indicates the optimal K under the Elbow Method. 2. The Silhouette Method * Intuition: Measures how well a data point fits within its assigned cluster compared to how well it would fit in neighboring clusters. A high silhouette score indicates good clustering. * Steps: a. Run K-Means for different values of K b. For each K, calculate the average silhouette score across all data points c. Plot average silhouette score vs K. Choose the K with the highest peak.
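Both procedures can be sketched with sklearn (synthetic blobs): record the inertia (WCSS) for the elbow plot and pick the K with the highest average silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data drawn from 4 blobs (illustrative).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

wcss, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                      # elbow: plot WCSS vs. k
    sil[k] = silhouette_score(X, km.labels_)   # silhouette: pick the peak

best_k = max(sil, key=sil.get)
```

Note that WCSS always decreases as K grows, which is exactly why the elbow method looks for a bend rather than a minimum; the silhouette score, by contrast, has a genuine peak.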
41
How do you evaluate regression modelling?
The common metrics in regression modeling are the following: 1. Mean Squared Error (MSE) - The average squared error. It's heavily affected by outliers. Lower MSE is better. In MSE, the units are squared, making interpretation less intuitive. 2. Root Mean Squared Error (RMSE) - The square root of MSE, giving error in the same units as the target variable. Lower RMSE is better. RMSE is still affected by outliers, though less so than MSE. 3. Mean Absolute Error (MAE) - The average absolute error. Less sensitive to outliers than MSE/RMSE. A drawback is that MAE doesn't indicate whether errors are generally underestimates or overestimates. 4. R-Squared (R^2) - The proportion of variance in the target variable explained by the model. Higher R^2 (closer to 1) is better. The drawback is that it can be misleading: R-squared never decreases when you add more features, even if they are irrelevant (adjusted R-squared corrects for this). 5. Mean Absolute Percentage Error (MAPE) - Average absolute percentage error. Useful when comparing models with different target variable scales. Lower MAPE is better. MAPE can be unstable with small target values and is not defined when the actual value is zero.
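A small sketch computing each metric with sklearn on hypothetical predictions (values chosen so the arithmetic is easy to check by hand):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)    # squared units
rmse = np.sqrt(mse)                         # same units as the target
mae = mean_absolute_error(y_true, y_pred)   # less outlier-sensitive
r2 = r2_score(y_true, y_pred)               # proportion of variance explained
mape = mean_absolute_percentage_error(y_true, y_pred)
```

Here the absolute errors are 0.5, 0, 1.5, and 1, so MAE = 0.75 and MSE = (0.25 + 0 + 2.25 + 1) / 4 = 0.875.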
42
What is the pseudocode of KNN?
Here's the pseudocode for the K-Nearest Neighbors (KNN) algorithm, broken down for clarity.

# KNN has NO training step
# It is a lazy learner: just store the dataset

Function KNN_Predict(X_train, y_train, x_query, k):
    # Step 1: Compute distances
    For each training point xᵢ:
        dᵢ = distance(x_query, xᵢ)
    # Step 2: Find k nearest neighbors
    neighbors = select k points with smallest dᵢ
    # Step 3: Aggregate
    # Classification: prediction = mode(labels of neighbors)
    #   (optional) weighted vote: weight each neighbor by 1/dᵢ
    # Regression: prediction = mean(values of neighbors)
    #   (optional) weighted average: prediction = Σ (1/dᵢ)·yᵢ / Σ (1/dᵢ)
    return prediction

Common Distance Metrics
* Euclidean: d = √(Σ (xᵢ − qᵢ)²)
* Manhattan: d = Σ |xᵢ − qᵢ|
* Minkowski: d = (Σ |xᵢ − qᵢ|ᵖ)^(1/p)
* Cosine: d = 1 − (x · q) / (‖x‖·‖q‖)

Key points to remember: KNN is a lazy learner — there is no training phase; all computation happens at prediction time, making training O(1) but inference O(n·d) per query. Feature scaling is critical because distance metrics are sensitive to magnitude differences — always standardize or normalize first. The choice of k controls the bias-variance tradeoff: small k → low bias, high variance (overfits to noise); large k → high bias, low variance (over-smooths boundaries).
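The pseudocode above, sketched in NumPy for the classification case (Euclidean distance, unweighted majority vote):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Minimal KNN classifier (Euclidean distance, unweighted majority vote)."""
    # Step 1: distance from the query to every training point.
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest neighbors.
    nearest = np.argsort(d)[:k]
    # Step 3: majority vote among the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]
```

A production implementation would use a KD-tree or ball tree to avoid the O(n·d) scan per query, as sklearn's KNeighborsClassifier does.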
43
What is the pseudocode of gradient boosted trees (GBTs)?
Regression: High-Level Intuition: Each tree fixes the mistakes of all previous trees. By adding trees gradually with a small learning rate, the model steadily improves without overfitting too quickly. It's essentially gradient descent in function space — where each step is a tree instead of a parameter update.

Pseudocode:
1. Initialize with a simple prediction:
   F₀(x) = arg min_γ Σ L(yᵢ, γ)
   (e.g., mean of targets for regression, log-odds for classification)
2. For each round m = 1 to M:
   a. Compute pseudo-residuals for each sample:
      rᵢ = −∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)
      In plain terms: "How wrong is the model, and in what direction?" For MSE regression this simplifies to (true − predicted).
   b. Fit a small decision tree hₘ(x) to those residuals. The new tree learns to predict the mistakes. This creates leaf regions where multiple samples land together.
   c. Compute the optimal output value for each leaf:
      γⱼ = arg min_γ Σ L(yᵢ, Fₘ₋₁(xᵢ) + γ)
      "For all samples landing in this leaf, what single value minimizes the loss?" For MSE regression: just the mean of residuals in the leaf. For log-loss classification: γⱼ = Σrᵢ / Σpᵢ(1−pᵢ) — a Newton-Raphson step using both gradient and curvature, which is why XGBoost explicitly uses the Hessian.
   d. Update the model:
      Fₘ(x) = Fₘ₋₁(x) + ν × hₘ(x)
      Add the new tree's predictions, scaled down by learning rate ν.
3. Final model: F_M(x) = sum of all trees

Three knobs interviewers love to ask about:
* Number of trees (M): More trees = more capacity, but risk overfitting.
* Learning rate (ν): Smaller = slower learning, needs more trees, but generalizes better. There's a well-known tradeoff between learning rate and number of trees.
* Tree depth: Shallow trees (depth 3–5) act as weak learners — many weak learners combined > one strong one.

Key distinction: For regression, residuals are simply (true − predicted).
For classification, they become the negative gradient of the loss function (e.g., log-loss), which is why it's called "gradient" boosting — the framework generalizes to any differentiable loss.
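A sketch of the regression case (MSE loss, so the pseudo-residuals are simply y minus the current prediction), using sklearn trees as the weak learners; the function and parameter names here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbt_fit(X, y, n_rounds=50, lr=0.1, max_depth=3):
    """Sketch of MSE gradient boosting; names are illustrative."""
    f0 = y.mean()                        # step 1: initial constant prediction
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred             # step 2a: pseudo-residuals (MSE case)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += lr * tree.predict(X)     # step 2d: shrunken update
        trees.append(tree)
    return f0, trees

def gbt_predict(f0, trees, X, lr=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred
```

For MSE, step 2c comes for free: a squared-error regression tree already outputs the mean residual in each leaf, so no separate leaf-value optimization is needed.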
44
What are the hyperparameters of gradient boosted trees (GBTs)?
Here's a breakdown of the most important hyperparameters commonly tuned in Gradient Boosted Tree (GBT) models:

Core Hyperparameters:
* n_estimators: The number of boosting rounds (the number of trees in the ensemble). Larger values generally lead to better performance but increase overfitting risk.
* learning_rate: Shrinks the contribution of each individual tree. Smaller learning rates slow down learning, requiring more trees, but often improve generalization and prevent overfitting.
* max_depth: The maximum depth of each tree. Limits the complexity of each tree and can help prevent overfitting. Shallower trees are weaker learners.
* subsample: The proportion of samples randomly selected to train each tree. This introduces randomness and helps prevent overfitting (similar to random forest).
* min_samples_split: The minimum number of samples required to split a node in a tree. Larger values prevent overly complex trees.
* min_samples_leaf: The minimum number of samples required in a terminal node or leaf. Helps control tree complexity and avoid overfitting to noisy data.
* colsample_bytree, colsample_bynode, colsample_bylevel: Controls the proportion of features randomly selected for each tree, node, or level (XGBoost naming). Introduces further randomness and helps prevent overfitting.
* gamma: Minimum loss reduction required for a node split (XGBoost). Larger gamma favors more conservative models.
* max_features: Limits the maximum number of features considered at each split, similar to random forests.
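A sketch showing where these knobs appear in sklearn's GradientBoostingClassifier (the values are illustrative, not tuned; gamma and the colsample_* parameters are XGBoost-specific names, with max_features being the closest sklearn analogue for column subsampling):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=200,       # boosting rounds
    learning_rate=0.05,     # shrinkage: smaller -> needs more trees
    max_depth=3,            # shallow trees act as weak learners
    subsample=0.8,          # row subsampling (stochastic gradient boosting)
    min_samples_split=10,
    min_samples_leaf=5,
    max_features="sqrt",    # feature subsampling at each split
    random_state=0,
)
acc = cross_val_score(model, X, y, cv=3).mean()
```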
45
Why do you need activation functions? What are the pros and cons of Sigmoid, Tanh, and ReLU activation functions?
Activation functions are essential components of neural networks as they introduce non-linearity. This allows neural networks to model complex relationships between inputs and outputs, which are crucial for many tasks like image recognition or natural language processing. Common activation functions in neural networks are Sigmoid, Tanh, and ReLU as seen in the image. Here's a breakdown of the pros and cons of these activation functions: Sigmoid Pros: * Smooth output: Sigmoid's output ranges between 0 and 1, making it suitable for representing probabilities. * Easy to understand and implement Cons: * Vanishing gradients: For large negative or positive inputs, the gradient of the sigmoid function approaches 0. This can make it difficult for the network to learn during backpropagation. Tanh Pros: * Zero-centered output: Tanh's output ranges between -1 and 1, which can be helpful for some neural network architectures. * Smooth output: Similar to sigmoid, tanh's output is smooth and continuous. Cons: Vanishing gradients: Similar to sigmoid, tanh also suffers from vanishing gradients for large positive or negative inputs. * Computationally expensive: Compared to other activation functions, tanh is computationally expensive due to the involvement of exponential operations. ReLU (Rectified Linear Unit) Pros: * Fast computation: ReLU is computationally efficient because it only involves a simple threshold operation. * Avoids vanishing gradients: ReLU does not suffer from vanishing gradients for positive inputs. Cons: * Dying ReLU: ReLU neurons can die if they receive a large negative update during backpropagation, causing them to output zero permanently. * Non zero-centered: The output of ReLU is not zero centered. In choosing an activation function, it's important to consider the specific task and the network architecture. Sigmoid is a good choice for output layers requiring probability-like outputs (e.g., logistic regression). 
Tanh can also be useful in hidden layers, but it may not be the best choice for deep networks due to vanishing gradients. ReLU is a popular choice for hidden layers due to its computational efficiency, but can suffer from the "dying ReLU" problem.
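The saturation argument can be made concrete by comparing the derivatives (the definitions below are the standard ones, written out in NumPy):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z): return np.maximum(0.0, z)

# Derivatives, to illustrate saturation / vanishing gradients.
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)
def d_tanh(z): return 1.0 - np.tanh(z) ** 2
def d_relu(z): return np.where(z > 0, 1.0, 0.0)

# At a large pre-activation, sigmoid and tanh gradients are nearly zero
# (saturation), while ReLU's gradient is exactly 1 for any positive input.
grads_at_10 = (d_sigmoid(10.0), d_tanh(10.0), d_relu(10.0))
```

During backpropagation these derivatives are multiplied layer by layer, so near-zero values at saturated units are exactly what starves early layers of gradient signal.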
46
Why should you normalize data when training neural networks?
The benefits of normalization are that it helps neural network models converge more efficiently, prevents exploding/vanishing gradients, provides a mild regularization effect on the weights, and improves the interpretation of feature signals. Here are common normalization techniques. * Min-Max Scaling: Transforms data to typically be between 0 and 1 * Standardization (Z-Score): Subtracts the mean and divides by the standard deviation, resulting in zero mean and unit variance. Let's discuss the benefits of normalization in detail. 1. Faster Convergence and Stability * Gradient Descent Optimization: Neural networks learn by optimizing weights using gradient descent algorithms. When features are on different scales, the loss function has an elongated shape with different curvatures in different directions. This makes it difficult for gradient descent to find the optimal path, slowing down convergence and potentially leading to getting stuck in local minima. * Normalization Helps: Normalizing features to have similar ranges makes the loss function more symmetrical and smoother, leading to faster and more stable convergence in gradient descent. 2. Preventing Exploding/Vanishing Gradients * Deep Architectures: In deep neural networks, gradients can either explode (become very large) or vanish (become very small) as they are propagated back through the layers. This makes learning difficult, especially for early layers. * Input Scaling Matters: If the input features have large variances, the magnitudes of weights can explode to compensate, leading to the exploding gradient problem. Similarly, if the input features are very small, weights shrink to compensate, leading to the vanishing gradient problem. * Normalization Mitigates: Normalization helps keep features within a reasonable range, preventing weights from exploding or shrinking during backpropagation. This allows for better gradient flow and easier training. 3.
Regularization Effect: * Weight Decay: Normalization indirectly creates a form of regularization. During weight decay, large weights are penalized more. Normalization helps control the magnitude of weights, which complements weight decay. 4. Feature Interpretation * Relative Importance: When features are on different scales, it's hard to interpret the relative importance of each feature based on their raw weights. Normalization makes feature weights more comparable for interpretability.
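The two techniques named above can be sketched directly in NumPy (synthetic features on very different scales):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 10_000, 100)])

# Min-Max scaling: maps each feature to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

In practice the scaler must be fit on the training split only and then applied to validation/test data, to avoid leaking test statistics into training.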
47
What are the differences between Gradient Descent, Stochastic Gradient Descent, and ADAM optimizers?
Let's break down the key differences between Gradient Descent (GD), Stochastic Gradient Descent (SGD) and Adam optimizers, focusing on how they update weights and their pros and cons: 1. Gradient Descent (GD) * How it works: - Calculates the gradient of the loss function with respect to all parameters for the entire training dataset. - Updates weights in the opposite direction of the gradient: weights = weights - learning_rate * gradient * Pros: - Theoretically guaranteed to find the global minimum if the loss function is convex * Cons: - Computationally expensive for large datasets as it processes the entire dataset for each update - Can get stuck in local minima 2. Stochastic Gradient Descent (SGD) * How it works - Calculates the gradients for a single example or a small batch of examples (called a mini-batch) - Updates weights more frequently: weights = weights - learning_rate * gradient_of_minibatch * Pros: - Faster iteration due to smaller computations - Noisy updates help escape local minima * Cons: - Noisy updates can lead to oscillations around the optimal point instead of smooth convergence 3. Adam (Adaptive Moment Estimation) * How it works: - Combines ideas of momentum and adaptive learning rates (RMSprop) - Keeps track of exponentially decaying averages of past gradients (m) and squared gradients (v) - Adjusts the learning rate for each parameter based on these averages Here's the formula for Adam: Notation: t: Time step (current iteration) theta: Model parameters (weights and biases) g_t: Gradient of the loss function with respect to the parameters at time step t alpha: learning rate m_t: Exponentially decaying average of past gradients (momentum) v_t: Exponentially decaying average of past squared gradients (adaptive learning rate) beta1, beta2: Hyperparameters controlling the decay rates for the moving averages (usually set to 0.9 and 0.999 respectively) Adam Update Rule: 1. Calculate gradient: Compute the gradient g_t for a mini-batch of data. 2.
Update biased first moment estimate (momentum): m_t = beta1 * m_{t-1} + (1 - beta1) * g_t 3. Update biased second moment estimate (adaptive learning rate): v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 (element-wise squaring) 4. Compute bias-corrected first and second moment estimates: m_hat_t = m_t / (1 - beta1^t), v_hat_t = v_t / (1 - beta2^t) 5. Update parameters: theta_{t+1} = theta_t - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon) (epsilon is a small value added for numerical stability) Pros: - Often converges faster than GD or SGD, especially in early stages of training - Handles sparse gradients effectively - Relatively robust to the choice of learning rate Cons: - Can sometimes overshoot the optimal point, especially in later stages of training. - Requires more memory to keep track of moments. Which one to choose? SGD: Often a good starting point due to its simplicity and computational efficiency. Adam: Often the default go-to in many deep learning scenarios. It generally works very well out of the box. GD: Practical for small datasets or convex optimization problems where converging to the global minimum is crucial Additional Considerations 1. Learning Rate: All optimizers are sensitive to learning rate. Adam can be less sensitive, but tuning is still important. 2. Dataset Size: SGD and its variants shine with very large datasets. 3. Sparsity: Adam handles sparse data effectively (common in text-based problems).
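The five-step update rule translates almost line-for-line into code; here it is on a toy scalar problem (hyperparameter values are the conventional defaults, with a larger illustrative alpha):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the rule above."""
    m = beta1 * m + (1 - beta1) * g          # biased first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2     # biased second moment
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy use: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, alpha=0.05)
```

The bias-correction terms matter most in early steps, when the moving averages m and v are still dominated by their zero initialization.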
48
How do you hyperparameter tune dense neural networks?
Tuning dense neural networks involves the following hyperparameters, and these can be tuned using strategies such as grid search, random search, and Bayesian optimization. 1. Network Architecture: - Number of Layers: How many hidden layers are in the network - Neurons per Layer: How many neurons (processing units) in each hidden layer. 2. Activation Functions: Nonlinear functions (e.g., ReLU, Sigmoid, Tanh) applied to neuron outputs, enabling complex decision boundaries. 3. Optimizer: The algorithm used for updating weights (e.g. Adam, SGD, RMSprop) - Learning Rate: How much to adjust weights with each update step. 4. Regularization: Techniques to reduce overfitting - L1/L2 Regularization: Penalize large weights. - Dropout: Randomly drop neurons during training 5. Batch Size: Number of samples used per training iteration. 6. Number of Epochs: Number of times the network sees the entire training dataset.
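A small grid-search sketch over a few of these knobs, using sklearn's MLPClassifier on synthetic data (the grid values are illustrative, and real tuning would use a far larger budget and random or Bayesian search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Illustrative grid over a few of the knobs listed above.
grid = {
    "hidden_layer_sizes": [(16,), (32, 16)],   # architecture
    "alpha": [1e-4, 1e-2],                     # L2 regularization strength
    "learning_rate_init": [1e-3, 1e-2],        # optimizer learning rate
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), grid, cv=3)
search.fit(X, y)
best = search.best_params_
```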
49
How does gradient descent work?
Gradient descent is an algorithm that aims to minimize a cost function. Imagine you're lost on a hilly landscape and want to find the lowest valley. Gradient descent helps you find that valley by iteratively moving in the direction of the steepest descent. Mathematical Formulation 1. Cost Function: Let's denote our cost function as J(theta), where theta represents the parameters (weights) of our model. Our goal is to find the values of theta that minimize J(theta). 2. Gradient: The gradient of the cost function, denoted by ∇J(theta), is a vector that points in the direction of the steepest increase of the function. Importantly, its negative points in the direction of the steepest decrease. 3. Update Rule: The core of gradient descent is the following update: theta = theta - alpha * ∇J(theta) - alpha is the learning rate, a hyperparameter that controls the size of the step we take in each update. - By subtracting the gradient (scaled by the learning rate) from our current parameter values, we take a step in the direction that decreases the cost function the most. - We repeat this process iteratively, with each step bringing us closer to a (local) minimum of the cost function.
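The update rule on a one-dimensional toy cost function (chosen so the gradient is trivial to write down):

```python
# Gradient descent on J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
theta = 0.0    # starting point
alpha = 0.1    # learning rate
for _ in range(100):
    grad = 2.0 * (theta - 3.0)
    theta = theta - alpha * grad   # update rule: theta <- theta - alpha * grad
# theta converges toward the minimizer, theta = 3
```

Too large an alpha makes the iterates overshoot and diverge (here, any alpha above 1 would); too small an alpha makes convergence needlessly slow.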
50
How does backpropagation work?
Backpropagation is an approach to updating the neural network weights and biases with the aim of minimizing a cost function (e.g. mean squared error, cross-entropy loss). Backpropagation calculates the gradient of the cost function with respect to each weight/bias in the network. These gradients tell us how to adjust the weights to reduce the error. Here's the breakdown: 1. Forward Pass: Input data is fed through the network layer by layer, from input to output. Each layer applies its weights and biases, then passes the result through an activation function, producing the network's final prediction. 2. Error Calculation: The cost function (e.g., MSE, cross-entropy loss) measures how far the network's prediction is from the true target, producing a single scalar loss value. 3. Backward Propagation of Error: Starting from the output layer, the derivative of the cost function with respect to the network's output is computed. Then, this gradient is propagated backward through each layer using the chain rule — multiplying the partial derivatives of each layer's output with respect to its inputs (involving the derivatives of the activation functions and the weights at each layer). Through this process, we obtain ∂J/∂w and ∂J/∂b for every weight and bias in the network. Each gradient signals how much a tiny change in that parameter would affect the overall loss. 4. Weight Update: Parameters are updated using gradient descent or its variants: w = w − α · (∂J/∂w), b = b − α · (∂J/∂b), where α is the learning rate. Parameters are adjusted in the direction opposite their gradient to reduce the loss. Mathematical Example (Single Neuron) Let's consider a single neuron with a sigmoid activation function: z = w·x + b (weighted sum of inputs), a = σ(z) (output of the neuron after activation). If our cost function is J, then: ∂J/∂a comes from the specific cost function used; ∂a/∂z = σ(z)·(1 − σ(z)) is the derivative of the sigmoid function; ∂z/∂w = x is the derivative of the weighted sum with respect to the weight. Using the chain rule: ∂J/∂w = (∂J/∂a) · (∂a/∂z) · (∂z/∂w) (c.f. image)
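The single-neuron example, written out as a runnable loop (squared-error loss on one hypothetical training example; initial values and learning rate are illustrative):

```python
import math

# One sigmoid neuron trained on a single example with squared-error loss
# J = (a - y)^2, mirroring the chain-rule decomposition above.
w, b = 0.5, 0.0        # initial parameters
x, y = 1.5, 1.0        # hypothetical input and target
alpha = 0.5            # learning rate

for _ in range(200):
    # Forward pass
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))      # a = sigma(z)
    # Backward pass (chain rule)
    dJ_da = 2.0 * (a - y)               # from the cost function
    da_dz = a * (1.0 - a)               # sigmoid derivative
    dJ_dz = dJ_da * da_dz
    dJ_dw = dJ_dz * x                   # dz/dw = x
    dJ_db = dJ_dz                       # dz/db = 1
    # Gradient descent update
    w -= alpha * dJ_dw
    b -= alpha * dJ_db

a_final = 1.0 / (1.0 + math.exp(-(w * x + b)))
```

Each iteration nudges w and b so that the neuron's output moves toward the target of 1.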
51
Your neural network model is overfitting. What are the signs? How do you prevent it?
Signs of overfitting: To assess whether the neural network is overfitting, you can compare the training and validation errors. When the validation error is much higher than the training error, the neural network is most likely overfitting. You can look at the error plots across training epochs to assess overfitting. The sign to look for is when the validation curve sits higher and diverges away from the training error, as seen in the image. How to prevent overfitting: Here are common techniques used to combat overfitting: 1. Regularization - L1 and L2 Regularization: Penalty terms added to the cost function that discourage overly large weights, favoring simpler models. - Dropout: Randomly drop neurons (and their connections) during training, preventing the network from relying too heavily on specific neurons. 2. Data Augmentation: Artificially expand your dataset by applying random transformations (rotation, flipping, noise, etc.) to existing examples. This helps reduce overfitting to specific variations seen in the training data. 3. Early Stopping: Monitor validation set performance during training. Stop training before the validation error starts to increase, preventing the model from memorizing the training data too closely. 4. Reduce Model Complexity: - Fewer layers - Fewer neurons per layer - This limits the model's capacity to memorize the training data 5. More Data: When possible, the most reliable way to prevent overfitting is to collect more diverse and representative training data.
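The early-stopping idea can be sketched in a few lines of framework-agnostic Python; `val_losses` here is a hypothetical stand-in for the per-epoch validation losses a real training loop would produce:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Sketch of early-stopping logic: stop once validation loss has not
    improved for `patience` consecutive epochs; return the best epoch.
    `val_losses` stands in for a real training loop's per-epoch values."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break   # validation error stopped improving: stop training
    return best_epoch, best_loss

# Hypothetical curve: validation loss falls, then rises as overfitting begins.
losses = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.52, 0.60]
```

In practice one also checkpoints the model weights at the best epoch and restores them after stopping, as Keras's EarlyStopping callback does with restore_best_weights.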
52
What's the difference between encoder-decoder?
Encoder-decoder is a neural network structure often seen in recurrent neural networks and Transformers, as seen below. Encoder-decoders are core components of models used in machine translation, text summarization, and many other natural language processing tasks. Information flows from the encoder to the decoder, where the encoded representation serves as the foundation for generating a sequential output. Encoder - Processes an input sequence (text, audio, etc.) and compresses it into a fixed-length context vector. This vector aims to capture the essence or meaning of the input. - Operation: * Reads the input sequence one element at a time * Maintains an internal hidden state that's updated with each input element * The final hidden state becomes the context vector that summarizes the input Decoder - Decodes the context vector generated by the encoder to produce an output sequence, generating the output one element at a time. - Operation: * Takes the context vector as input * Its internal state is initialized with the information from the context vector * Generates the output sequence element by element, using previous outputs to guide the generation of the next element. Example: Machine Translation - Encoder: Processes a sentence in the source language (e.g. French), producing a context vector - Decoder: Takes the context vector and generates the translated sentence in the target language (e.g. English)
53
What's the difference between RNN and CNN?
Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) are two types of neural network architectures that serve different purposes and are structured differently to address distinct types of problems in machine learning. In general, RNNs are more suited for data where the sequence is vital (e.g. time series forecasting and machine translation), and CNNs are optimal for data where spatial relationships and patterns are important (e.g. computer vision). Let's do a deep dive on each architecture:

Recurrent Neural Networks (RNNs)
1. Purpose and Applications
- RNNs are designed to handle sequential data. They are particularly useful for tasks where the input is inherently sequential, such as natural language processing, time series prediction, and speech recognition.
- They can process inputs of variable length, making them ideal for applications like language translation and generating text.
2. Structure
- The key feature of RNNs is their internal memory, which captures information about what has been processed so far in a sequence. This allows them to exhibit temporal dynamic behavior.
- An RNN has loops within its architecture that allow information to persist. In theory, RNNs can retain information in this loop over long sequences, but in practice, they often struggle due to issues like vanishing or exploding gradients.
3. Challenges
- RNNs are hard to train effectively due to the long-range dependencies in sequences, which often lead to vanishing and exploding gradient problems. Techniques like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells have been developed to mitigate these issues.

Convolutional Neural Networks (CNNs)
1. Purpose and Applications
- CNNs are primarily used for processing data that has a grid-like topology, such as images. They are also used in video analysis, image classification, and areas where recognizing patterns from spatial data is crucial.
- They excel at tasks that require identifying and extracting spatial hierarchies in features, such as recognizing faces, objects, or scenes in images.
2. Structure
- CNNs use convolutional layers that apply convolution operations to the input. These layers use filters (or kernels) to capture spatial hierarchies and features like edges, textures, and shapes in parts of the input image.
- The architecture typically includes pooling layers that reduce the dimensions of the data, simplifying the amount of computation required while still preserving essential features.
3. Advantages
- CNNs are relatively efficient to train, and weight sharing plus pooling make them robust to translation (and, to a lesser extent, small changes in scale) of objects in an image, helping with different viewpoints or variations in appearance.
- They can automatically learn and generalize features from raw data, minimizing the need for manual feature extraction.

Key Differences
1. Data Handling: RNNs are better for sequential data, while CNNs excel with spatial data (like images).
2. Memory and Processing: RNNs can remember previous inputs due to their recurrent structure, which is useful for tasks that depend on historical inputs. CNNs, conversely, are better at perceiving patterns in a static input, where the location of a feature is key to classification.
3. Common Use Cases: RNNs are common in speech recognition, language modeling, and text generation. CNNs are prevalent in image and video recognition tasks.
54
How would you define the prediction point of a machine learning model?
In most machine learning tutorials, you are provided with a dataset with labels. In such cases, machine learning becomes a simple exercise requiring feature engineering, algorithm selection, and hyperparameter tuning. However, real-life projects are not simple. Often, you are not told when a model should predict. Your job as a data scientist or machine learning engineer is to define the prediction point of a model. In other words, at what point should your model predict? Should it produce a prediction at the onset of profile creation, or after behavioral data has been collected about the user? Your choice should depend on your modelling strategy.

Let's consider this scenario. Suppose you interview for a risk data scientist role. The interviewer asks you to define the prediction point of a bad actor on an eCommerce platform. Let's assume that the bad actor is a spammer on Facebook's Marketplace. So, you have the following datasets:

Profile:
1. Profile ID
2. Profile feature X
3. Profile feature Y
4. Profile feature Z

Posts:
1. Post ID
2. Profile ID
3. Post Content X
4. Post Content Y
5. Includes_External_Link_Indicator

A naive response would be that you predict based on posts that are flagged as spam, then extrapolate that the author is a spammer. This is problematic. Suppose for a user X, your model flags spam or not based on the following (cf. 1st image table). This user generated five posts, posts 1 and 3 being spam. The corresponding probability scores that the user is a spammer at those events are 0.8 and 0.6, respectively. Should you flag user X as a spammer? What if the user, as you collect more data in real time, yields the following behaviour (cf. 2nd image table)? Now, it's not too clear whether user X should be flagged as a spammer, right?

To simplify the problem, you need to choose a slice in time at which the user should be scored as a spammer. Suppose you choose the third post as the prediction point for users. That means that your training and test data will only contain the third posts across users. Your feature set could contain the following information:
1. User profile information
2. Third event post information
3. Aggregations of the past two events

Obviously, each feature set will contain a multitude of features. Now, how should you pick your prediction point? The choice should depend on various conditions involving the business objective, behavioral data, and volume of flags. Quite simply, if the problem is, let's say, new account origination (NAO), then predict at the time of sign-up before any actions. Essentially, your model data won't contain any behavioral aggregations; rather, it will strictly be based on profile information. If your user classification is based on transactions, then choose the event prediction point and use aggregations up to the prediction point for your predictions.

This framework can work across various problems, not just in risk and fraud.
Conversion: at the time of sign-up, predict the likelihood that a customer will purchase a good.
Retention: at the time of sign-up, predict whether the user will stay on the platform a year later.
Recommender system: based on the 10th purchase, predict the user's next set of purchases.
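The third-post prediction point can be sketched in a few lines; the field names and aggregations below are hypothetical, chosen only to mirror the example above:

```python
# Hypothetical sketch: fix the third post as the prediction point and build,
# for each user, one training row containing the third post's features plus
# aggregations over the two posts before it. Field names are illustrative.
def build_training_rows(posts_by_user, prediction_event=3):
    rows = []
    for user, posts in posts_by_user.items():
        if len(posts) < prediction_event:
            continue  # user never reached the prediction point
        history = posts[:prediction_event - 1]       # events before the point
        current = posts[prediction_event - 1]        # the prediction-point event
        rows.append({
            "user": user,
            "post_has_link": current["has_link"],
            "prior_posts": len(history),
            "prior_link_rate": sum(p["has_link"] for p in history) / len(history),
        })
    return rows

posts_by_user = {
    "u1": [{"has_link": 1}, {"has_link": 0}, {"has_link": 1}, {"has_link": 1}],
    "u2": [{"has_link": 0}, {"has_link": 0}],  # dropped: only two posts
}
rows = build_training_rows(posts_by_user)
```

Profile features would be joined onto each row by Profile ID in the same pass.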
55
An offline classification model scored a 0.90 AUC. But, the production model generated 0.75 AUC score. What does this mean? How would you address this?
Key Points:
1. Explain the possibility of overfitting.
2. Explain the possibility of behavioral change.
3. Explain the possibility of skipping unit testing.

Various reasons could explain the difference between offline and online model performance. The drop from 0.90 AUC offline to 0.75 online is severe, requiring investigation into one of the following three: overfitting on offline data, behavioral change, and absence of offline model testing.

Let's briefly discuss the cycle from offline model development to online production. Typically, raw offline data is downloaded and wrangled, then feature engineered into a dataset that is used to train an offline model. Then, you evaluate the model using either cross-validation or LOOCV (leave-one-out cross-validation). If the result looks good, you push the offline model to production and wire the feature engineering such that the same processing is applied on real-time data. Finally, let's assume that you evaluate your model online over the first three months since production.

The first possibility is overfitting on the offline data, meaning the offline model was not evaluated properly. Cross-validation, in many cases, is not the best way to evaluate a model for productionalization (see the interview question on cross-validation). The best approach is train/validation/test splits segmented on time periods that reflect how online testing is performed. For instance, suppose you have one year of training data, January 2019 through December 2019. Allocate the first eight months for training, the next two months for validation, and the last two months for testing, which you completely leave out until final evaluation. Use the validation set for your hyperparameter tuning. The test result will provide a better indication of how well the model generalizes online.

Next, behavioral change is another issue. Suppose there's a change in the platform because of a major feature release, glitches in the application, or a demographic shift among customers. Such a change can impact the model's efficacy.

The last point to consider is failing to conduct unit testing on features created both offline and online. Suppose numerical encoding is applied on high-cardinality data. For the categorical value "A", a numerical encoding of 56 is used. Is the encoding consistent offline and online? Any inconsistency in preprocessing between offline and online could lead to inconsistent results.
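The time-segmented train/validation/test split described above can be sketched as follows; the dates and record shapes are illustrative:

```python
# Hypothetical sketch of a time-based split for one year of data
# (Jan-Dec 2019): first 8 months train, next 2 validation, last 2 held out
# as test, mirroring how the model will be evaluated online.
from datetime import date

def time_split(records, train_end=date(2019, 9, 1), valid_end=date(2019, 11, 1)):
    train = [r for r in records if r["ts"] < train_end]
    valid = [r for r in records if train_end <= r["ts"] < valid_end]
    test = [r for r in records if r["ts"] >= valid_end]
    return train, valid, test

# One toy record per month of 2019.
records = [{"ts": date(2019, m, 15), "y": m % 2} for m in range(1, 13)]
train, valid, test = time_split(records)
```

Unlike a random split, nothing from the test months can leak into training, which is exactly the property that makes the offline estimate resemble online performance.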
56
How do you deploy a model on AWS?
There are several ways to deploy a machine learning model on AWS. The best choice depends on factors like your model framework, the scale of your application, and your specific requirements. Here's a breakdown of the most common methods:

1. AWS SageMaker
- Fully managed service: SageMaker is a comprehensive platform, simplifying the process of building, training, and deploying machine learning models.
- Steps:
a. Model Packaging: Prepare your model artifact (code, dependencies) in a format compatible with SageMaker.
b. Create a Model: Upload your model artifact to S3 and create a SageMaker model, specifying the artifact location and container image.
c. Create an Endpoint Configuration: Define the instance type(s) and the number of instances for your endpoint.
d. Deploy the Endpoint: Create a SageMaker endpoint using the model and configuration.
- Pros: Streamlined process; handles infrastructure, scaling, and monitoring.
- Cons: Some level of dependency on the SageMaker ecosystem.

2. Serverless Deployment
- AWS Lambda: Ideal for models that need to be invoked on demand or can handle small inference workloads.
- Steps:
a. Package your model as a Lambda function, along with dependencies.
b. Create a Lambda function and upload your packaged model.
c. API Gateway (optional): Create an API Gateway endpoint to trigger your Lambda function, allowing external requests.
- Pros: Cost effective (pay per execution), automatic scaling, easy setup.
- Cons: Limited memory and execution time, potential cold starts (latency for initial requests).

3. Containerized Deployment
- Flexibility: Docker containers provide a portable way to package your model and its environment.
- Options:
* AWS Elastic Container Service (ECS): Manage containerized applications at scale.
* AWS Elastic Kubernetes Service (EKS): Deploy and manage Kubernetes clusters on AWS.
* AWS Fargate: Serverless compute for containers; simplifies deployment.
- Pros: Control over the environment; suitable for complex deployments or integrating into existing microservice architectures.
- Cons: More infrastructure management overhead.

4. Batch Predictions
- AWS Batch: For predictions on large datasets where real-time inference isn't needed.
- Steps:
a. Containerize your model: Create a Docker image for your model code.
b. Define a Batch Job: Specify the container image, data location, and compute resources.
c. Submit the job: AWS Batch handles the provisioning and execution of the job.
- Pros: Handles large-scale predictions efficiently; cost-effective for non-real-time processes.
- Cons: Not suitable for real-time or low-latency requirements.

Additional Considerations:
1. Model Format: Ensure your model is saved in a format compatible with your chosen deployment method (e.g. TensorFlow SavedModel, ONNX).
2. Monitoring and Retraining: Implement systems to monitor the performance of your deployed model and retrain when necessary to address model drift.
57
What are the primary challenges in deploying and maintaining machine learning models in production?
There are three primary areas of challenges when deploying models to production:

1. Challenges from Development to Deployment
* Model-Environment Mismatch: Models built in lab settings often don't translate seamlessly to the real world. Issues include data distribution differences, scalability constraints, and latency requirements.
* Data Dependencies: Production data may differ in format, quality, or distribution (concept drift) from what was used during training. Robust data cleaning and preprocessing pipelines are essential.
* Computational Overhead: Large, complex models can be computationally expensive to run, leading to high costs and latency issues in production.
* Reproducibility: Ensuring experiments and the model development process are well-documented and reproducible can be difficult, especially in larger teams.

2. Challenges in Production
* Monitoring: Model performance can degrade over time due to:
a. Concept drift: Changes in the underlying real-world patterns that the model was trained on.
b. Data drift: Changes in the distribution of input data.
c. System issues: Upstream data problems, infrastructure failures.
* Feedback loops: Collecting reliable performance and usage data from production systems to retrain and improve the model is often difficult to implement effectively.
* Continuous Integration and Delivery (CI/CD): Machine learning models necessitate a streamlined process for updating models as new data is received or as performance degrades.

3. Operational Challenges
* Scalability: Handling sudden spikes in demand or scaling to handle large volumes of data can be a complex engineering challenge, especially for real-time inference.
* Security: ML models and their data are potential attack vectors. Secure deployment and monitoring are critical.
* Governance: Establishing clear processes for model updates, approvals, and ethical use becomes crucial, especially in larger organizations.

Addressing These Challenges
Many of these challenges are addressed through a robust MLOps framework. Key elements include:
1. Data and Model Versioning
2. Experiment Tracking
3. Automated CI/CD Pipelines for ML
4. Model Monitoring and Alerting
5. Tools for Model Serving and Infrastructure Management
58
How do you handle model failure in production?
When a machine learning model that performed well in development suddenly fails in production, consider these steps:

Troubleshooting Steps
1. Isolate the Issue
- Data Changes: Check for differences in the distribution, format, or quality of production data compared to training data. This is a very common culprit.
- Code Discrepancies: Ensure the code used in production exactly matches the development version. Look for data preprocessing errors, model loading mistakes, or configuration issues.
- System Issues: Investigate external factors like infrastructure problems, network errors, or resource constraints that might affect the model's performance.
2. Gather Information
- Metrics: Compare performance metrics (accuracy, precision, recall, F1-score, etc.) between the development environment and production.
- Error Logs: Analyze any error logs generated by the system. These often contain valuable clues about the root cause.
- Data Samples: Examine specific instances where the model fails in production and the input data associated with them.
3. Root Cause Analysis
- Concept Drift: Determine if the underlying relationships your model learned during training have changed in the real world. Data changes are often responsible.
- Overfitting: If the model performed exceptionally well in development but poorly in production, consider overfitting. It means the model memorized training data instead of learning generalizable patterns.
- Training-Serving Skew: Verify that your data preprocessing pipelines in production are identical to those used during training. Inconsistent preprocessing can lead to wildly different inputs for the model.
- Hidden Biases: Assess whether the data used for training was sufficiently representative. Biased training sets can lead to models that fail on specific segments of real-world data.
4. Action
- Data Adjustment: If concept drift or data quality are issues, you'll likely need to collect new data and retrain the model, potentially with more diverse examples.
- Code Fixes: Verify and correct any code-related discrepancies found in your analysis.
- Model Simplification: If overfitting is suspected, try these techniques:
a. Regularization (L1, L2)
b. Dropout
c. Early Stopping
- Infrastructure: Address any shortcomings in processing power or memory that might be affecting the model.

Tips on Creating a Robust Production Environment
- Robust Monitoring: Implement production monitoring systems to get real-time alerts when models show signs of performance degradation. This is key to catching issues early on.
- Continuous Learning Pipelines: Set up automated processes to retrain or recalibrate models as new production data becomes available.
- Pre-Deployment Testing: Have a thorough validation stage before models go live. This includes testing with data that mimics the expected production environment.
- Gradual Deployment: Consider techniques like canary deployment (new model on a small subset of traffic) or shadow deployment (run the new model alongside the old one but don't use its outputs) to mitigate risk.
59
What is the role of Kubernetes in Model Deployment?
What is Kubernetes?
At its core, Kubernetes is a powerful open-source system designed to automate the deployment, scaling, and management of containerized applications. It provides an abstraction layer over your underlying infrastructure (physical machines, virtual machines, or cloud instances), letting you focus on application logic rather than managing individual servers. Kubeflow, a common ML toolkit, runs on top of Kubernetes.

Key Concepts
Kubernetes operates with concepts like:
- Pods, the smallest deployable unit, often containing one container.
- Deployments, which manage scaling and updates for sets of Pods.
- Services, which provide networking abstractions for Pods, making them accessible.
- Nodes, the worker machines where your Pods run.

Why use Kubernetes for Model Deployment?
Scalability:
- Handles dynamic scaling of your ML models based on demand. If traffic spikes, Kubernetes can automatically spin up more replicas of your model.
- Efficiently uses resources by spreading your models across multiple nodes.
Resilience:
- Self-healing capabilities automatically restart failed containers or pods.
- Distributes model replicas across nodes, ensuring high availability.
Deployment Automation:
- Versioning and rollbacks of model deployments become simple.
- Allows for canary deployments (releasing a new model version to a subset of users) or blue-green deployments (zero-downtime updates) to minimize risk.
Portability:
- Packages models and dependencies in containers, making them runnable anywhere Kubernetes is supported (local machines, cloud, etc.).
Complex Workflows:
- Kubernetes can orchestrate more complex model serving workflows, involving model preprocessing, post-processing, A/B testing, and feedback loops.

How Kubernetes is used
1. Containerizing Your Model: Package your trained ML model, code, and dependencies into a Docker image.
2. Creating Kubernetes Resources: Write YAML configuration files to define Deployments, Services, and other necessary Kubernetes objects for your model.
3. Deploying to a Kubernetes Cluster: Use tools like kubectl to deploy these configurations to your Kubernetes cluster. Kubernetes handles the rest!
60
How does model orchestration work?
What is Model Orchestration?
Model orchestration is the process of automating and managing the entire lifecycle of machine learning models. It addresses the challenges of putting models into production, including:
- Workflow Coordination: Orchestrating complex pipelines involving data preparation, model training, evaluation, deployment, and continuous monitoring.
- Dependency Management: Ensuring all components (code, data, libraries, etc.) are compatible and up-to-date.
- Scalability: Handling the scaling requirements when models experience increased workloads.
- Monitoring and Retraining: Tracking model performance in production and triggering retraining when accuracy degrades.

Key Components of Model Orchestration
1. Orchestrator: This is the core software responsible for scheduling, sequencing, and managing the various tasks within a machine learning workflow. Popular orchestrators include:
a. Apache Airflow: A highly versatile tool for authoring workflows as DAGs (Directed Acyclic Graphs).
b. Kubeflow Pipelines: Part of the Kubeflow project, designed for ML workflows on Kubernetes.
c. Flyte: A cloud-native, type-safe orchestration platform focused on data and ML pipelines.
d. MLflow: Includes components for experiment tracking and model management, providing orchestration aspects.
2. Workflow Definition: Orchestration workflows are typically defined either through code (e.g. Python with Airflow) or configuration files (YAML). They specify:
a. Steps/Tasks: The individual components of the pipeline (data preprocessing, training, deployment, etc.).
b. Dependencies: How steps relate to each other and which tasks need to be completed before others can start.
c. Execution Logic: Conditional branching, error handling, retries, etc.
3. Task Execution:
- The orchestrator spins up containers or jobs to execute each step of the workflow.
- Tasks might involve running training scripts, deploying models as APIs, or triggering data quality checks.
4. Resource Management:
- The orchestrator interacts with infrastructure (on-premises or cloud) to allocate compute resources, memory, and storage as needed for various tasks.
- Integration with Kubernetes is common for dynamic resource scaling.
5. Monitoring and Logging:
- The orchestrator keeps track of task execution, logs errors and warnings, and collects performance metrics.
- Monitoring dashboards help detect issues like model drift or data quality problems.

Example Workflow
A simplified ML orchestration workflow might look like this:
1. Data Preprocessing: Clean and prepare new data.
2. Model Training: Train or retrain a model using the updated data.
3. Evaluation: Evaluate the model's performance on a validation set.
4. Deployment: If the model meets performance criteria, deploy it to a production environment.
5. Monitoring: Track model performance in production and trigger retraining if performance degrades.
61
How would you perform hyperparameter tuning of a computer vision model?
Use a train/holdout split with cross-validation. Run a parameter search such as grid, random, or Bayesian search. Typical hyperparameters to tune: learning rate, batch size, number of epochs, layer sizes and counts, activation functions, and optimizer choice.
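A minimal sketch of random search over an assumed search space; `evaluate` here is a placeholder for training the vision model and scoring it on a holdout (or cross-validation) split:

```python
# Hypothetical sketch of random search. The search space and the scoring
# function are illustrative stand-ins for a real training pipeline.
import random

space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
    "num_layers": [2, 4, 6],
}

def evaluate(params):
    # Placeholder score: a real version would train and validate the model.
    return -abs(params["learning_rate"] - 1e-3) - abs(params["num_layers"] - 4)

def random_search(space, evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(space, evaluate)
```

Grid search enumerates every combination instead of sampling; Bayesian search replaces `rng.choice` with a model of which region of the space looks promising.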
62
How would you build an image search model? A user uploads an image and the search retrieves similar images. Don’t worry about scaling and system design.
Use a pre-trained model like ResNet or VGG and extract vectors from the feature extraction layer so that all images have vector representations. For the target image's vector, run a nearest-neighbor (KNN) search to identify the top-K closest vectors.
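A minimal sketch of the retrieval step, assuming the embedding extraction has already happened; the three-dimensional vectors and file names are toy stand-ins for real CNN features:

```python
# Hypothetical sketch: rank indexed images by cosine similarity to the
# query embedding and return the top-K file names.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, index, k=2):
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

# Toy "embeddings" for three indexed images.
index = {
    "sofa.jpg": [0.9, 0.1, 0.0],
    "chair.jpg": [0.8, 0.3, 0.1],
    "beach.jpg": [0.0, 0.2, 0.9],
}
matches = top_k([1.0, 0.1, 0.0], index, k=2)
```

At scale, the exhaustive `sorted` call would be replaced by an approximate nearest-neighbor index, but the similarity logic is the same.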
63
Airbnb wants to list furniture in the host’s home given the images provided on the hosting page. How would you build this? Don’t worry about scaling/system design.
Gather a dataset of furniture images with labels/tags (e.g. sofa, bed, dining table); these should already be provided by hosts. Preprocess the data with resizing and normalization, and augment it with techniques like flips and crops. Use a CNN (or fine-tune a pre-trained model) to predict the furniture items found in an image.
64
How does the choice of activation function in a neural network affect its ability to model complex patterns? For instance, compare the effects of using sigmoid, ReLU, and tanh activation functions.
Having different activation functions can affect the network's convergence speed, training stability, and ability to capture complex patterns in the data.

Activation functions
Sigmoid
The range is between 0 and 1. Due to its formula, it suffers from the "vanishing gradient" problem, where the gradients become very small, which leads to slow learning of the network.
ReLU
If x is positive, it returns x; otherwise, it returns 0. It helps mitigate the vanishing gradient problem seen with sigmoid, but it suffers from the "dying ReLU" effect, in which certain neurons become inactive during training.
Tanh
The range is between -1 and +1. It can be used in cases where zero-centered outputs are preferred. It also mitigates the vanishing gradient problem to some extent.
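For concreteness, the three activations written out directly, with two gradients; the gradient values at a large input show why sigmoid saturates while ReLU does not:

```python
# The three activation functions and (for contrast) two of their gradients.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # range (0, 1)

def tanh(x):
    return math.tanh(x)  # range (-1, 1), zero-centered

def relu(x):
    return max(0.0, x)  # 0 for all negatives: the source of "dying ReLU"

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # saturates: tiny for large |x| (vanishing gradient)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # stays exactly 1 for any positive input
```

Evaluating `sigmoid_grad(10)` gives a value on the order of 1e-5, while `relu_grad(10)` is still 1.0 — the gradient signal sigmoid loses, ReLU preserves.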
65
Explain the problem of vanishing and exploding gradients in deep neural networks. How do these issues affect the training process, and what are some common strategies used to mitigate them?
Vanishing gradients
- Gradients can become extremely small as they propagate backwards during training, a problem aggravated by activation functions such as sigmoid and tanh.
How these issues affect the training process: gradients "disappear", the network virtually stops training, and it does not converge to an optimal solution.
Common strategies used to mitigate them:
1. Use a different set of activation functions (e.g. ReLU)
2. Batch normalization
3. Proper weight initialization
4. Residual connections

Exploding gradients
- Gradients can become very large during backpropagation, which can lead to divergence during training.
How these issues affect the training process: the model struggles to converge to a solution.
Common strategies used to mitigate them:
1. Proper weight initialization
2. Batch normalization
3. Use another set of activation functions with smaller gradients
4. Gradient clipping
5. Residual connections
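Gradient clipping (a mitigation for exploding gradients) can be sketched as clipping by global norm; the gradient values below are illustrative:

```python
# Hypothetical sketch of gradient clipping by global norm: if the L2 norm
# of the gradient vector exceeds `max_norm`, rescale it onto that norm,
# preserving its direction but bounding the update size.
import math

def clip_by_global_norm(grads, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads  # already within bounds, leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm was 5.0
```

The clipped vector points the same way as the original but has norm 1.0, so a single exploding batch cannot throw the parameters far off course.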
66
Discuss the role of regularization techniques like dropout and L1/L2 regularization in preventing overfitting in neural networks. How do these techniques alter the learning process?
Regularization prevents overfitting in neural nets by constraining what the network can learn during training.

Dropout
During training, certain neurons are randomly selected to be dropped out by setting their outputs to 0. This prevents the network from relying on specific neurons and promotes the learning of better, more robust features.
L1/L2 regularization
Applies an L1 or L2 penalty to the values of the weights, which penalizes large weights. This prevents the model from fitting the training data too closely.
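A minimal sketch of (inverted) dropout on a flat list of activations; the drop probability and inputs are illustrative:

```python
# Hypothetical sketch of inverted dropout: each activation is zeroed with
# probability p during training, and survivors are scaled by 1/(1-p) so the
# expected activation matches inference-time behavior (no dropout at test).
import random

def dropout(activations, p=0.5, rng=None, training=True):
    if not training or p == 0.0:
        return list(activations)  # inference: pass through unchanged
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
```

Each surviving activation is doubled (scaled by 1/0.5), so across many batches each unit's expected output equals its undropped value.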
67
Compare and contrast the effects of different optimization algorithms (such as SGD, Momentum, RMSprop, and Adam) on the training dynamics of a neural network. How do these algorithms influence the convergence rate and stability of training?
SGD updates parameters along the negative gradient, scaled by a fixed learning rate. It's simple and low-overhead, but converges slowly — especially near saddle points or local minima — because it's sensitive to gradient noise. Stability depends heavily on learning rate scheduling; too high and it oscillates, too low and it stalls.

Momentum augments SGD by accumulating an exponentially decaying moving average of past gradients, adding "inertia" to updates. This accelerates convergence along consistent gradient directions and dampens oscillations across noisy or high-curvature directions. It helps escape shallow local minima and traverse flat regions faster than vanilla SGD, but introduces an additional hyperparameter (the momentum coefficient, typically ~0.9).

RMSprop adapts the learning rate per-parameter by dividing by a running average of recent squared gradients. This normalizes updates so that frequently-updated parameters get smaller steps and infrequent ones get larger steps. It converges faster than SGD on noisy or non-stationary problems (e.g., RNNs), and the gradient normalization helps stabilize training, though it doesn't fully eliminate vanishing/exploding gradient issues.

Adam combines RMSprop's per-parameter adaptive rates with Momentum's first-moment estimate. It maintains both first-moment (mean) and second-moment (variance) running averages, with bias correction for both. This gives it fast early convergence and strong performance on large-scale, high-dimensional, or sparse-gradient problems. However, the adaptive rates can sometimes lead to convergence to suboptimal solutions compared to well-tuned SGD.

Key tradeoffs: SGD generalizes best but requires careful tuning. Momentum improves SGD's convergence speed at minimal added complexity. RMSprop adds per-parameter adaptivity, helping on rugged loss landscapes. Adam converges fastest with minimal tuning but may find sharper (less generalizable) minima. In practice, Adam is the default starting choice, but SGD+Momentum with a good learning rate schedule often wins for final model quality.
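Under simplifying assumptions (a single scalar parameter, typical default hyperparameters), the four update rules can be written out directly:

```python
# One-parameter update rules for each optimizer, so the differences are
# visible side by side. Hyperparameter defaults are the commonly used ones.
import math

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def momentum_step(w, g, v, lr=0.1, beta=0.9):
    v = beta * v + g                      # decaying average of past gradients
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g * g     # running average of squared gradients
    return w - lr * g / (math.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first moment (mean)
    v = b2 * v + (1 - b2) * g * g         # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction for step t >= 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

Reading the bodies top to bottom shows the lineage: Momentum adds the first-moment average to SGD, RMSprop adds the second-moment scaling, and Adam combines both with bias correction.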
68
Explain the purpose of batch normalization in neural network training. How does it help in accelerating training and improving performance?
Batch normalization centers and scales the inputs to a layer across the current mini-batch so that the mean becomes 0 and the standard deviation becomes 1, then applies a learnable scale and shift. Doing so ensures that a feature with a higher range of values, let's say -100 to 100, does not overpower error propagation compared to a feature with a smaller range, let's say -1 to 1. Furthermore, normalization stabilizes the input distributions feeding into activation functions, which can otherwise suffer from vanishing or exploding gradients; this allows higher learning rates and accelerates training.
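A minimal sketch of the batch-norm forward pass for a single feature across a mini-batch; gamma and beta are the learnable scale and shift, and the sample values echo the -100 to 100 example:

```python
# Hypothetical sketch of batch normalization for one feature across a
# mini-batch: standardize to zero mean / unit variance, then apply the
# learnable scale (gamma) and shift (beta).
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# A feature ranging over -100..100 is brought onto the same scale as any other.
out = batch_norm([-100.0, 0.0, 100.0])
```

At inference time, running estimates of the mean and variance collected during training are used in place of the per-batch statistics.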
69
In what scenarios is transfer learning particularly effective for neural networks? Discuss the advantages of using pre-trained models as a starting point for new tasks.
Out-of-the-box pre-trained models like GPT-3 and BERT (text) or ViT and CNNs pre-trained on ImageNet (images) are already trained on a large volume of data. This means that these models have learned the patterns of text (representing a word token as an embedding) and images (identifying edges and textures). This makes them an easy starting point for business-specific tasks like classifying furniture types in Airbnb home images, especially when task-specific labeled data is scarce.
70
How can you interpret the feature importance in a convolutional neural network (CNN)? Discuss methods like feature visualization.
1. Visualize filters applied at each layer
- Helps in understanding the types of patterns that each filter is sensitive to.
- It can reveal low-level features (e.g. edges, textures, colors) as well as higher-level semantic features.
2. Look at Class Activation Mapping (CAM)
- Generates heatmaps of the regions of an input image that contribute the most to a CNN's prediction.
3. Consider Gradient-weighted CAM (Grad-CAM)
- Extends CAM by using the gradients of the predicted class score w.r.t. the feature maps of the last convolutional layer.
- It generates more localized and accurate heatmaps than CAM.
71
What are the primary challenges in training Recurrent Neural Networks (RNNs), and how do architectures like LSTM and GRU address these challenges?
Challenges
- Vanishing / exploding gradients.
- Difficulty in capturing long-term dependencies in the data.
* As the sequence length increases, the network's ability to remember information from distant past steps diminishes (due to the vanishing gradient problem).
- Memory constraints
* RNNs have limited memory capacity, which depends on the length of the sequence. This can prevent the network from retaining information from earlier time steps for much longer.

Solutions
- LSTM (Long Short-Term Memory)
* Addresses the vanishing gradient problem. Introduces the "memory cell" and other gating mechanisms to address the issue with long-term dependencies (see above).
- GRU (Gated Recurrent Unit)
* Simpler architecture than LSTM, with fewer parameters.
* Controls the flow of information via a gating mechanism.
* Computationally more efficient than LSTM, but still effective in capturing long-term dependencies.
72
Discuss the concept of neural network pruning. How can reducing the size of a neural network model improve its performance, and what are the trade-offs involved?
Definition
- A technique to reduce the size of a NN model by removing unnecessary neurons, layers, or connections while trying to maintain or improve its performance.

Pros
- Reduces the computational cost of the model, improves efficiency, and provides a regularization effect (i.e. can help reduce overfitting).

Trade-offs
- Loss of performance: aggressive pruning can remove important parameters, decreasing model performance. This can be addressed by fine-tuning the model after pruning.
- Sensitivity to initialization and training: depending on the initialization of the model, different pruning techniques can lead to different outcomes.
- Increased pipeline complexity: pruning adds another stage to the training pipeline.
- Loss of interpretability: if certain layers or portions of the network are removed, it can be challenging to understand the behavior of the pruned model.
73
How does data augmentation impact the performance of a neural network in tasks like image classification? What are some effective data augmentation techniques and their limitations?
Pros
- Increases the robustness of the model.
- Helps the model generalize.
- Improves the performance of the model.

Cons
- Increases training time, since more training samples are used.
- Loss of information (not always): certain augmentation techniques may remove helpful information from the original data.
- Artifacts: augmentation can introduce unrealistic features into the data, which can decrease model performance.

Typical techniques
- Rotation
- Flipping (horizontal or vertical mirroring)
- Translation
- Scaling and cropping
- Noise injection
- Color jittering
74
What increases the training time of a neural network? Increasing the number of hidden layers or number of nodes in a hidden layer?
We can make this concrete by using the number of weights involved in feedforward and backpropagation as a proxy for "training time". In the feedforward pass, each weight is used once: h_ij = activation_function(w_ij * x_ij). In backpropagation, each weight is updated once: w_ij_new = w_ij_old - alpha * gradient_w_ij. Consider the image: the number of weights is larger for architecture A, which has more units per hidden layer rather than more depth. This is because the weight count between two layers is (units in) x (units out), so widening layers grows it multiplicatively, while adding a layer only adds one more such product. This can be tested on other architectures; the number of calculations involved is indeed higher for the wider network.
75
How do you find out which features are important given weights from a neural network?
Assuming all else is equal, meaning the feature inputs are all scaled to have the same mean and standard deviation, we should expect inputs that are more influential in predicting the output to have the largest weights. Consider a simple linear regression example (c.f. 1st image): w1 = 5 is larger than w2 = 0.1, which means x1 has more influence on predicting y than x2. We can take a similar approach to extracting important signals from a neural network: the more important variables should have larger weights linked to them (c.f. 2nd image).
76
Explain False Positive Rate in simple English and provide the formula for calculating.
The False Positive Rate (FPR) is the probability of incorrectly classifying something as positive when it is actually negative. In other words, it measures how often a test incorrectly identifies a negative case as positive. Think of a security system that detects intruders: * A false positive happens when the system incorrectly thinks a friendly visitor is an intruder. * The False Positive Rate tells us how often the system makes this mistake out of all the actual friendly visitors. Formula for False Positive Rate (FPR) 𝐹𝑃𝑅 = False Positives (FP) / (False Positives (FP) + True Negatives (TN) ) Breaking it Down: False Positives (FP) = Cases where the model incorrectly predicted positive (e.g., the system flagged a friendly visitor as an intruder). True Negatives (TN) = Cases where the model correctly predicted negative (e.g., the system correctly ignored a friendly visitor). Denominator (FP + TN) = Total actual negative cases. A low FPR means the system rarely makes false alarms, while a high FPR means it often makes incorrect positive predictions.
77
Explain True Positive Rate in simple English and provide the formula for calculating.
The True Positive Rate (TPR), also called Recall or Sensitivity, measures how well a model correctly identifies actual positive cases. It answers the question: "Out of all the real positive cases, how many did the model correctly detect?" Example: Imagine a medical test for a disease: * A true positive happens when the test correctly identifies a sick person as sick. * The True Positive Rate tells us how often the test correctly detects sick people out of all the people who are actually sick. Formula for True Positive Rate (TPR) 𝑇𝑃𝑅 = True Positives (TP) / (True Positives (TP) + False Negatives (FN) ) Breaking it Down: True Positives (TP) = Cases where the model correctly predicted positive (e.g., the test correctly detected a sick person). False Negatives (FN) = Cases where the model missed a positive case (e.g., the test incorrectly said a sick person is healthy). Denominator (TP + FN) = Total actual positive cases. A high TPR means the model is good at detecting real positives, while a low TPR means it often misses them.
78
Explain True Negative Rate in simple English and provide the formula for calculating.
The True Negative Rate (TNR), also known as Specificity, measures how well a model correctly identifies actual negative cases. It answers the question: "Out of all the real negative cases, how many did the model correctly classify as negative?" Formula for True Negative Rate (TNR) 𝑇𝑁𝑅 = True Negatives (TN) / (True Negatives (TN) + False Positives (FP) ) Breaking it Down: True Negatives (TN) = Cases where the model correctly predicted negative (e.g., a security system correctly ignored a friendly visitor). False Positives (FP) = Cases where the model incorrectly predicted positive (e.g., a security system mistakenly flagged a friendly visitor as an intruder). Denominator (TN + FP) = Total actual negative cases. A high TNR means the model is good at avoiding false alarms, while a low TNR means it frequently misclassifies negatives as positives.
79
Explain False Negative Rate in simple English and provide the formula for calculating.
The False Negative Rate (FNR) measures how often a model misses actual positive cases. It answers the question: "Out of all the real positive cases, how many did the model incorrectly classify as negative?" Example: Imagine a medical test for a disease: * A false negative happens when the test incorrectly says a sick person is healthy. * The False Negative Rate tells us how often the test fails to detect sick people out of all the people who actually have the disease. Formula for False Negative Rate (FNR) 𝐹𝑁𝑅 = False Negatives (FN) / (True Positives (TP) + False Negatives (FN) ) ​Breaking it Down: False Negatives (FN) = Cases where the model missed a positive case (e.g., the test incorrectly said a sick person is healthy). True Positives (TP) = Cases where the model correctly predicted positive (e.g., the test correctly detected a sick person). Denominator (TP + FN) = Total actual positive cases. A low FNR means the model rarely misses real positive cases, while a high FNR means it often fails to detect positives.
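A minimal NumPy sketch tying the four rates (TPR, FNR, FPR, TNR) together; the labels here are hypothetical, with 1 = positive and 0 = negative:

```python
import numpy as np

# Hypothetical ground-truth labels and predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly flagged positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # missed positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false alarms
tn = np.sum((y_true == 0) & (y_pred == 0))  # correctly ignored negatives

tpr = tp / (tp + fn)  # recall / sensitivity
fnr = fn / (tp + fn)  # miss rate (= 1 - TPR)
fpr = fp / (fp + tn)  # false alarm rate
tnr = tn / (fp + tn)  # specificity (= 1 - FPR)
```

Note that the four rates form two complementary pairs over the same denominators: TPR + FNR = 1 over actual positives, and FPR + TNR = 1 over actual negatives.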
80
What are the 9 guiding principles for system design?
1. Scalability 2. Latency 3. Availability 4. Reliability 5. Consistency 6. Fault Tolerant 7. Maintainability 8. Security 9. Cost Effective
81
What is Momentum, why does it help, and what can go wrong?
Momentum introduces a velocity term, v, that replaces the raw gradient in the weight update, accumulating an exponentially weighted average of past gradients.

```
v = beta * v + (1 - beta) * grad
w = w - lr * v
```

Why it helps: On loss surfaces shaped like narrow valleys (steep walls on the sides, gentle slope along the bottom), vanilla GD oscillates: large gradients on the steep walls cause it to ping-pong side to side while progress along the valley floor is slow. Momentum damps those oscillations (side-to-side gradients cancel out in the rolling average) and accelerates progress in consistent directions (along-the-valley gradients keep accumulating).

Failure modes: Setting beta too high (-> 1.0) causes the optimizer to overshoot minima and adapt slowly when the gradient direction changes - you're coasting on stale history. Too low (-> 0) degenerates back to vanilla GD. A subtler issue is cold start: because v is initialized at zero, early steps are artificially small - you're averaging true gradients with zeros.
82
Starting from vanilla gradient descent, what state variable does RMSprop add per parameter, how is it updated, and how does it change the weight update rule?
RMSprop adds one state variable per parameter, `v`, a running average of squared gradients, initialized to `0`. Each step:

```
v_w = beta * v_w + (1 - beta) * dw**2       # EMA of squared gradients
w -= lr / (np.sqrt(v_w) + epsilon) * dw     # adaptive update
```

Typical values: `beta=0.9`, `lr=0.001`, `epsilon=1e-8`. The effect: parameters with historically large gradients get a smaller effective learning rate; parameters with smaller gradients get a larger one - all automatically with no change to how gradients are computed.
83
What is the Variance Inflation Factor, what is its formula, how can you interpret VIF values, and what does VIF not tell you?
VIF is a measure of how much the variance of a regression coefficient is inflated due to multicollinearity with other predictors. It quantifies how much a predictor can be explained by other predictors in the model. The formula is: VIF_j = 1 / (1 - R^2_j), where R^2_j is the R^2 from regressing the predictor X_j on all other predictors in the model. You can interpret VIF values like this: VIF = 1: No correlation with other predictors VIF = 1-5: Moderate correlation (generally acceptable) VIF = 5-10: High correlation (concern) VIF > 10 : Severe multicollinearity (action needed) VIF doesn't identify which pair of variables is causing the problem, only that a given variable is correlated with the others as a group. Use a correlation matrix alongside VIF for pairwise diagnosis.
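A minimal NumPy sketch of the formula above: each predictor is regressed on the others (with an intercept) via least squares, and VIF_j = 1 / (1 - R²_j). The data below is hypothetical, constructed so x2 is nearly a copy of x1 while x3 is independent:

```python
import numpy as np

def vif(X):
    """VIF for each column of X (shape n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # regress column j on all other columns, plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # near-duplicate of x1
x3 = rng.normal(size=200)             # independent predictor
v = vif(np.column_stack([x1, x2, x3]))
# v[0] and v[1] are severe (>10); v[2] stays near 1
```

Notice the limitation from the answer in action: VIF flags x1 and x2 as problematic, but only the correlation matrix tells you they are problematic *with each other*.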
84
Adam combines Momentum and RMSprop — what two state variables does it maintain per parameter, how are they updated, what is bias correction and why is it needed, and what can go wrong?
Adam maintains two state variables per parameter:

m — first moment: EMA of gradients (like Momentum)
v — second moment: EMA of squared gradients (like RMSprop)

```
m = beta1 * m + (1 - beta1) * grad       # first moment
v = beta2 * v + (1 - beta2) * grad**2    # second moment
m_hat = m / (1 - beta1**t)               # bias-corrected
v_hat = v / (1 - beta2**t)               # bias-corrected
w -= lr * m_hat / (np.sqrt(v_hat) + epsilon)
```

Typical values: beta1=0.9, beta2=0.999, lr=0.001, epsilon=1e-8.

Bias correction: Both m and v are initialized at zero. In early steps, the EMAs are heavily biased toward zero — you're averaging true signal with a lot of zeros. Dividing by (1 - beta**t) rescales the estimates upward to compensate; the correction factor shrinks toward 1.0 as t grows and becomes irrelevant.

Failure modes:
- Generalization gap: Adam often converges faster than SGD+Momentum but to a sharper minimum, which can hurt test performance. SGD with momentum tends to find flatter minima that generalize better — this is an active research area (see AMSGrad, AdamW).
- Weight decay coupling: Naive L2 regularization in Adam doesn't behave like true weight decay because the gradient of the penalty gets scaled by v_hat just like any other gradient. AdamW fixes this by decoupling the decay step.
- Epsilon sensitivity: A larger epsilon damps the adaptive behavior (pushing Adam toward plain Momentum); too small and near-zero v_hat causes exploding updates in rarely-updated parameters.
85
When should you use stratified sampling over uniform sampling?
When your dataset has a significant class imbalance. Uniform sampling may under-represent rare classes by chance, while stratified sampling guarantees each class is represented at a specified proportion.
86
What is the key failure mode of naive stratified sampling with very rare classes?
When a class is very rare (e.g., 0.1% of the data), `round(proportion * n)` can round to 0, meaning the rare class gets no samples at all. Fix by enforcing a minimum sample count per class or specifying proportions manually.
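A minimal NumPy sketch of the failure and the fix (function name, labels, and seed are hypothetical): naive rounding allocates the 0.1% class zero samples, so a per-class floor is enforced after rounding.

```python
import numpy as np

def stratified_sample(y, n, min_per_class=1, seed=0):
    """Sample ~n indices from labels y, preserving class proportions
    while guaranteeing at least min_per_class draws from every class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    alloc = np.round(counts / len(y) * n).astype(int)  # naive allocation
    alloc = np.maximum(alloc, min_per_class)           # enforce the floor
    picks = []
    for c, k in zip(classes, alloc):
        members = np.flatnonzero(y == c)
        picks.append(rng.choice(members, size=min(k, len(members)), replace=False))
    return np.concatenate(picks)

# 0.1% rare class: round(0.001 * 100) = 0 without the floor
y = np.array([0] * 999 + [1])
idx = stratified_sample(y, n=100)   # rare class is guaranteed at least 1 slot
```

One design note: enforcing the floor can push the total slightly above n (here 101 rather than 100); a stricter sketch would subtract the overflow from the majority class.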
87
What problem does reservoir sampling solve, and what is the core invariant it maintains?
It allows uniform random sampling of `k` items from a stream too large to fit in memory. The invariant is that at every point in the stream, each item seen so far has an equal probability of being in the reservoir.
88
In reservoir sampling (Algorithm R), when item `i` arrives, what is the probability it enters the reservoir, and why?
`k/i` - when the 1-indexed item `i` arrives, a random integer `j` is drawn uniformly from the `i` values `[0, i-1]`, and the item enters if `j < k`, replacing `reservoir[j]`. This preserves the invariant that every item seen so far sits in the reservoir with probability `k/i`.
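A minimal Python sketch of Algorithm R (the stream and seed here are hypothetical):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform random sample of k items from an
    iterable of unknown length, using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)   # fill phase: keep the first k items
        else:
            j = rng.randrange(i)     # j uniform over the i values [0, i-1]
            if j < k:                # happens with probability k/i
                reservoir[j] = item  # evict a uniformly chosen resident
    return reservoir

sample = reservoir_sample(range(1000), k=10)
```

Each new item enters with probability k/i, and existing residents survive the step with probability 1 - 1/i each, which is exactly what keeps every item's overall inclusion probability equal.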
89
How does Bayesian Optimization work, step by step?
Bayesian Optimization treats hyperparameter tuning as a problem of efficiently searching an unknown landscape. Imagine you're trying to find the lowest point in a dark, hilly terrain, but every step you take costs you 30 minutes. You want to be strategic about where you step next. The two key components: 1. Surrogate Model (usually a Gaussian Process): This is a statistical model that approximates the true objective function (e.g., validation loss as a function of hyperparameters). After each evaluation, it updates its beliefs about what the landscape looks like, maintaining both a predicted value and an uncertainty estimate at every point. Where you've evaluated, uncertainty is low. Where you haven't, uncertainty is high. 2. Acquisition Function: This is the decision rule that picks the next point to evaluate. It balances exploitation (sampling where the surrogate predicts good performance) with exploration (sampling where uncertainty is high, because a hidden optimum might be lurking there). Common acquisition functions include Expected Improvement (EI) and Upper Confidence Bound (UCB). The loop: Start by evaluating a few random hyperparameter configurations. Fit the surrogate model to all results so far. Use the acquisition function to choose the next most promising configuration. Evaluate that configuration (train the model, measure validation performance). Update the surrogate model with the new result. Repeat until your budget runs out. Why it beats Grid/Random Search: Grid and Random search treat every trial as independent — they don't learn from previous results. Bayesian Optimization uses every past evaluation to make a smarter decision about what to try next, so it typically finds strong configurations in far fewer iterations. This matters most when each evaluation is expensive (e.g., training a deep neural network for hours). 
Key tradeoff: The surrogate model itself becomes harder to fit accurately as the number of hyperparameters grows (roughly beyond 10–20 dimensions), which is why Bayesian Optimization is best suited for tuning a moderate number of continuous hyperparameters rather than massive discrete search spaces.
90
What is R-squared in regression modeling, and what is its main drawback?
R² measures the proportion of variance in the target variable that is explained by the model. A higher value (closer to 1) is better. Its main drawback is that it always increases when you add more features, even irrelevant ones, which can be misleading about the model's true quality.
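A minimal NumPy sketch of R² from its definition, 1 - SS_res / SS_tot (the toy target values are hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                         # 1.0: zero residuals
baseline = r_squared(y, np.full(4, y.mean()))     # 0.0: predicting the mean
```

The baseline case shows why R² = 0 is the "no better than guessing the mean" floor; a model can even score below 0 if it predicts worse than the mean.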
91
What distinguishes variance from standard deviation, and in what contexts would you use each?
Variance is the average squared deviation from the mean (σ²); standard deviation is its square root (σ), expressed in the same units as the data. Use variance when: Doing mathematical derivations or proofs (it's algebraically cleaner — variances of independent variables add directly) Working with statistical models internally (e.g., computing loss functions, PCA, linear regression) Use standard deviation when: Communicating results to stakeholders (interpretable units match the data) Describing spread in feature distributions or model error (e.g., "predictions are off by ±2.3 kg") Key tradeoff: Variance penalizes outliers more heavily due to squaring, making it sensitive to extreme values — worth keeping in mind during feature analysis.
92
How should p-values be interpreted, and what are frequent misconceptions about them?
A p-value is the probability of observing results at least as extreme as your data, assuming the null hypothesis (H₀) is true. It measures how surprising your data is under H₀ — not how true any hypothesis is. Correct interpretation: A small p-value (e.g., < 0.05) means the data is unlikely under H₀, giving grounds to reject it. It is a continuous measure of evidence against H₀, not a binary pass/fail. p < 0.05 ≠ 95% chance the result is real. p > 0.05 ≠ no effect exists. Low p ≠ large or important effect (conflates sample size with magnitude). 0.05 is a convention, not a threshold of truth. ML relevance: In feature selection or A/B testing, relying solely on p-values without considering effect size, confidence intervals, and practical significance is a common and costly mistake.
93
What is Gini impurity, its formula, and NumPy implementation? Finally, what is the weighted Gini of a split in a decision tree?
Gini impurity measures the probability of misclassifying a randomly chosen sample if labeled according to the node's class distribution. A pure node scores 0; a maximally mixed binary node scores 0.5.

Formula: G = 1 - Σ pᵢ² where pᵢ is the proportion of samples belonging to class i.

```
def gini(y):
    # compute proportion of each class
    p = np.bincount(y) / len(y)
    # apply formula: 1 - sum(pi^2)
    return 1 - np.sum(p ** 2)
```

Sanity checks: gini([0,0,0,0]) → 0.0 (pure node); gini([0,0,1,1]) → 0.5 (maximally mixed).

The weighted Gini of a split is just: G_split = (n_left/n) * G_left + (n_right/n) * G_right
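The weighted-split formula can be sketched the same way (example labels are hypothetical; gini is restated so the snippet stands alone):

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - np.sum(p ** 2)

def weighted_gini(y_left, y_right):
    """Impurity of a candidate split: child Ginis weighted by child sizes."""
    n = len(y_left) + len(y_right)
    return len(y_left) / n * gini(y_left) + len(y_right) / n * gini(y_right)

pure = weighted_gini(np.array([0, 0]), np.array([1, 1]))    # 0.0: perfect split
mixed = weighted_gini(np.array([0, 1]), np.array([0, 1]))   # 0.5: useless split
```

A decision tree evaluates this quantity for every candidate threshold and picks the split that minimizes it.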
94
You have a NumPy array `[1.4, 10.5, 3.5, 2.1]`. How can you calculate the thresholds between consecutive values in the array?
Use np.unique(values) to get the values sorted ascending, then slice so each value lines up with its successor, add the two slices element-wise, and divide by 2:

```
thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
```
95
What are some tools for running Machine Learning algorithms in parallel?
Some of the tools, software or hardware, used to execute Machine Learning algorithms in parallel include GPUs, MapReduce, and Spark.
96
What is Spark and how is it useful for ML?
Apache Spark is an open-source, distributed computing framework designed for large-scale data processing. Spark provides an in-memory computation engine that makes it significantly faster than disk-based frameworks like Hadoop MapReduce. Why it matters for ML: Spark is useful because real-world ML often involves datasets too large for a single machine. Spark lets you distribute data processing and model training across a cluster. Its key ML-relevant components include: Spark SQL / DataFrames — for cleaning, joining, and transforming large datasets (the bulk of any ML pipeline). MLlib — Spark's built-in library with distributed implementations of common algorithms (linear regression, random forests, k-means, ALS for recommendations, etc.). Spark Streaming — enables near-real-time feature engineering and inference on streaming data. PySpark — a Python API that lets data scientists use familiar syntax while leveraging distributed compute under the hood. Key advantages to mention in an interview: In-memory processing — avoids repeated disk I/O, making iterative algorithms (like gradient descent) much faster than MapReduce. Lazy evaluation + DAG optimizer — Spark builds a directed acyclic graph of transformations and optimizes the execution plan before running anything, reducing unnecessary shuffles. Scalability — scales horizontally by adding nodes; you can go from gigabytes to petabytes without rewriting code. Unified pipeline — you can do data ingestion, feature engineering, model training, and evaluation all within one framework.
97
What are the different Machine Learning approaches?
1. Supervised Learning - where the output variable (the one you want to predict) is labeled in the training dataset. Includes Regression and Classification.
2. Unsupervised Learning - where the training dataset does not contain the output variable. The objective is to group similar data together instead of predicting any specific value. Includes Clustering, Dimensionality Reduction, and Anomaly Detection.
3. Semi-supervised Learning: This technique falls between Supervised and Unsupervised Learning because it has a small amount of labeled data with a relatively large amount of unlabeled data. You can find its applications in problems such as Web Content Classification and Speech Recognition, where it is very hard to get labeled data but you can easily get lots of unlabeled data.
4. Reinforcement Learning: RL focuses on finding a balance between Exploration (of unknown new territory) and Exploitation (of current knowledge). It monitors the response of actions taken through trial and error and measures the response against a reward. The goal is to take such actions for the new data so that the long-term reward is maximized.
98
What is the difference between Causation and Correlation?
Causation is a relationship between two variables such that one of them directly brings about the other. Correlation, on the other hand, is a statistical association between two variables (they tend to move together) that does not by itself imply one causes the other.
99
What is the difference between Online and Offline (Batch) learning? Highlight 5 key differences.
Online learning updates the model incrementally as each new data point (or small mini-batch) arrives. The model learns continuously and adapts to new patterns without needing access to the full historical dataset. Examples include stochastic gradient descent on a stream of data, or recommendation systems that update in real time as users interact. Offline (batch) learning trains the model on the entire dataset at once. You collect all the data, train the model, evaluate it, and deploy it. When new data becomes available, you retrain from scratch (or from a checkpoint) on the full updated dataset. Key differences to highlight in an interview: Data access — online sees one example at a time; batch sees everything at once. Adaptability — online adapts quickly to distributional shifts (concept drift); batch requires retraining to incorporate changes. Compute/memory — online is lightweight per update and doesn't need to store the full dataset in memory; batch can be expensive and requires the full dataset to be available. Stability — batch training tends to be more stable and reproducible; online learning can be sensitive to data ordering and noisy examples. Use cases — online is ideal for streaming data, non-stationary environments, or when data is too large to store (e.g., ad click prediction). Batch is preferred when you have a fixed dataset, need reproducibility, or the data distribution is stable.
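The online style can be sketched with the simplest possible example: linear regression fit one observation at a time via SGD, with each example discarded after its update. The stream below (a noiseless y = 2x + 1 relation) and the learning rate are hypothetical:

```python
import numpy as np

def online_sgd(stream, lr=0.1):
    """Fit y ≈ w*x + b incrementally under squared-error loss."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # error on this single example
        w -= lr * err * x       # one SGD step, then the example is gone
        b -= lr * err
    return w, b

rng = np.random.default_rng(0)
stream = [(x, 2 * x + 1) for x in rng.uniform(-1, 1, 10000)]
w, b = online_sgd(stream)       # recovers w ≈ 2, b ≈ 1
```

Note the contrast with batch learning: the full dataset never needs to be held in memory, and the model could keep updating indefinitely as new points arrive.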
100
Define Sampling. Why do we need it?
Sampling is a process of choosing a subset from a target population which would serve as its representative. We use the data from the sample to understand the pattern in the population as a whole. Sampling is necessary because often we can not gather or process the complete data in a reasonable time. There are many ways to perform sampling, some commonly used techniques are Random Sampling, Stratified Sampling, and Cluster Sampling.
101
Define Confidence Interval
A confidence interval is an interval estimate, calculated from a sample dataset, which is likely to include an unknown population parameter. Note that it does not mean you are "completely sure" the true value lies in the range: a 95% confidence interval means that if you repeated the sampling procedure many times, about 95% of the intervals constructed this way would contain the true parameter.
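A minimal NumPy sketch of a normal-approximation 95% CI for a mean (the sample here is hypothetical: 400 draws from N(10, 1)):

```python
import numpy as np

def mean_ci_95(x):
    """Normal-approximation 95% confidence interval for the population mean."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))   # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se    # z = 1.96 for 95% coverage

rng = np.random.default_rng(1)
lo, hi = mean_ci_95(rng.normal(10, 1, 400))
```

With n = 400 and sd ≈ 1, the interval is roughly ±0.1 around the sample mean; quadrupling n would halve its width, since the standard error shrinks as 1/√n.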
102
What do you mean by i.i.d. assumption?
We often assume that the instances in the training dataset are independent and identically distributed (i.i.d.), i.e., they are mutually independent of each other and follow the same probability distribution. It means that the order in which the training instances are supplied should not affect your model and that the instances are not related to each other. If the instances do not follow an identical distribution, patterns learned from one part of the data may not carry over to the rest, making the data fairly difficult to interpret.
103
Why do we call it GLM (Generalized Linear Model) when it is clearly non-linear?
The Generalized Linear Model (GLM) is a generalization of ordinary linear regression in which the response variables have error distribution models other than a normal distribution. The "linear" component in GLM means that the predictor is a linear combination of the parameters, and it is related to the response variable via a link function.
104
Define Conditional Probability.
Conditional probability is the probability of an event A occurring given that another event B has already occurred (or is known to be true). It's written as P(A|B) and read as "the probability of A given B." The formula: P(A|B) = P(A ∩ B) / P(B), where P(B) > 0 This says: to find the probability of A given B, take the probability that both A and B happen, and divide by the probability of B. You're essentially narrowing the sample space to only the outcomes where B is true, then asking how often A also occurs within that subset. Intuitive example: Suppose you're drawing from a standard deck of 52 cards. The probability of drawing a king is 4/52. But if someone tells you the card is a face card (given B), you've narrowed the space to 12 cards, and now P(King | Face card) = 4/12 = 1/3. Why it matters for ML: 1. Bayes' theorem is built directly on conditional probability: P(A|B) = P(B|A) · P(A) / P(B). This is the foundation of Naive Bayes classifiers, Bayesian inference, and probabilistic graphical models. 2. Classification is fundamentally about estimating P(class | features) — a conditional probability. 3. Chain rule of probability decomposes joint distributions into a product of conditionals, which underpins language models (predicting the next word given all previous words). 4. Feature independence assumptions in models like Naive Bayes are statements about conditional probabilities.
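The card example can be verified by brute-force enumeration, which makes the "narrowed sample space" intuition concrete:

```python
from itertools import product

ranks = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
suits = ['spades', 'hearts', 'diamonds', 'clubs']
deck = list(product(ranks, suits))                    # 52 cards

face = [c for c in deck if c[0] in ('J', 'Q', 'K')]   # event B: 12 face cards
king_and_face = [c for c in face if c[0] == 'K']      # A ∩ B: the 4 kings

# P(King | Face) = |A ∩ B| / |B| = 4/12 = 1/3
p_king_given_face = len(king_and_face) / len(face)
```

Conditioning on B literally means iterating over `face` instead of `deck`: the denominator changes from 52 to 12.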
105
Are you familiar with Bayes Theorem? Can you tell me why it is useful?
Bayes' Theorem provides a way to reverse conditional probabilities. If you know P(B|A), it lets you compute P(A|B). The formula is: ```P(A|B) = P(B|A) · P(A) / P(B)``` Each term has a name that's worth knowing: P(A|B) — Posterior: what you want to know — the updated belief about A after observing B. P(B|A) — Likelihood: how probable the observed evidence B is if A were true. P(A) — Prior: your belief about A before seeing any evidence. P(B) — Evidence (marginal likelihood): the total probability of observing B under all possible hypotheses. Acts as a normalizing constant. The core intuition: Bayes' theorem is a principled framework for updating beliefs with evidence. You start with a prior belief, observe data, and arrive at a posterior belief. The more data you observe, the more the posterior is shaped by the likelihood rather than the prior. Why it's useful in ML: 1. Naive Bayes classifiers — directly apply Bayes' theorem to classify text, spam, sentiment, etc. by computing P(class | features) using the likelihood of each feature given the class. 2. Bayesian inference — instead of learning a single point estimate for model parameters, you maintain a full posterior distribution, which naturally quantifies uncertainty. 3. Bayesian optimization — used for hyperparameter tuning, where you build a probabilistic model of the objective function and update it as you evaluate new configurations. 4. Medical/fraud/anomaly detection — Bayes' theorem helps reason about rare events. A positive test result doesn't mean high probability of disease unless you account for the prior (base rate). This is the classic "base rate fallacy" that Bayes corrects for. 5. Probabilistic graphical models — Bayesian networks use Bayes' theorem to perform inference over complex joint distributions.
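The base-rate fallacy in point 4 is worth working through numerically. A sketch with hypothetical numbers (1% prevalence, 95% sensitivity, 90% specificity):

```python
# Hypothetical test characteristics
p_disease = 0.01                 # prior: 1% prevalence
p_pos_given_disease = 0.95       # likelihood: sensitivity
p_pos_given_healthy = 0.10       # false positive rate (1 - specificity)

# Evidence: total probability of testing positive (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# ≈ 0.088: under 9%, despite the test's 95% sensitivity
```

The low prior dominates: almost all positives come from the large healthy population, which is exactly the correction Bayes' theorem enforces.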
106
How can you get an unbiased estimate of the accuracy of the learned model?
Use Training-Validation-Test datasets. Also leverage Cross-Validation to reduce the variance in your estimate.
107
What is the definition of Bias (in bias-variance tradeoff, model bias)
This refers to the error introduced by approximating a complex real-world problem with a simplified model. A high-bias model makes strong assumptions about the data and tends to underfit — it misses relevant patterns. For example, fitting a linear model to data that has a quadratic relationship produces high bias. Formally, it's the difference between the expected prediction of the model and the true value.
108
What is the definition of Variance (in bias-variance tradeoff, model variance)
Variance in ML refers to the amount by which the model's predictions would change if it were trained on a different dataset drawn from the same distribution. It measures how sensitive the model is to the specific training data it saw.
109
What is a probabilistic graphical model?
A probabilistic graphical model is a powerful framework which represents the conditional dependency among random variables in a graph structure. It can be used in modeling a large number of random variables having complex interactions with each other.
110
Define non-negative matrix factorization. Give an example of its application.
Matrix factorization means factorizing a matrix into 2 or more matrices such that the product of these matrices approximates the actual matrix. This technique can greatly simplify complex matrix operations and can be used to find the latent features in given data. An example of this is in Recommendation Systems, where it could be used to find the similarities between two users. In non-negative matrix factorization, a matrix is factorized into 2 sub-matrices such that all 3 matrices have no negative elements.
111
What does it mean to fit a model? How do the hyperparameters relate?
Fitting a model is the process of learning the parameters of a model using the training dataset. Parameters help define the mathematical formulas behind the Machine Learning models. Hyperparameters are the "high-level" parameters that cannot be learned from the data. They define the properties of a model, such as the model complexity or learning rate.
112
When would you use standard Gradient Descent over SGD and vice-versa?
Gradient Descent theoretically minimizes the error function better than SGD, however SGD converges much faster once the dataset becomes large. Thus GD is preferable for small datasets, while SGD is preferable for larger ones. In practice, SGD is often used because it minimizes the error function well enough while being much faster and more memory efficient for large datasets.
113
What could be the reason for GD to converge slowly or not converge at all in various ML algorithms?
1. If the learning rate is too small, convergence can take a very long time. If it is too large, the updates may jump around the optimum value and never converge.
2. Convergence can be slow when the (Hessian) matrix is Symmetric Positive Definite (SPD) with very different eigenvalues. The eigenvalues determine the curvature of the function; in the SPD case they are all positive and generally different, which leads to non-circular contours. Because of this, converging to the optimal point takes many steps. In short, the more circular the contours, the faster the algorithm converges.
3. Sometimes, due to rounding errors, GD may not terminate at all. GD generally stops when the expected cost/error is either zero or very small. However, rounding errors might prevent the error from ever becoming exactly zero, in which case the algorithm keeps iterating.
4. If the function does not have a minimum, GD will continue to descend forever.
5. Some functions are not differentiable in certain regions, and the gradient cannot be calculated at those points.
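Point 1 can be demonstrated on the simplest possible objective, f(w) = w² with gradient 2w, where each update multiplies w by (1 - 2·lr); the two learning rates below are hypothetical:

```python
def gd(lr, steps=50, w0=1.0):
    """Gradient descent on f(w) = w^2 (gradient is 2w)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # w is scaled by (1 - 2*lr) each step
    return w

small = gd(lr=0.1)   # factor |1 - 0.2| = 0.8 < 1  -> converges to 0
big   = gd(lr=1.1)   # factor |1 - 2.2| = 1.2 > 1  -> oscillates and diverges
```

The stability condition here is |1 - 2·lr| < 1, i.e. lr < 1; for general quadratics the bound is set by the largest eigenvalue of the Hessian, which connects directly to the curvature discussion in point 2.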
114
How much data should you allocate for training, validation, and test datasets?
There's no correct answer, it truly depends. It is common to do an 80:20 train-test split, and to further split the training 80% into training and validation sets. With massive data for deep learning, that 80:20 split can be more skewed, to 90:10 or even 95:5.
115
What do you mean by paired t-test? Where would you use it?
A paired t-test is a statistical procedure which is used to determine whether the mean difference between two sets of observations is zero or not. It has 2 hypotheses, the null hypothesis and the alternative hypothesis. The null hypothesis (H_0) assumes that the true mean difference between the paired samples is zero. Conversely, the alternative hypothesis assumes that the true mean difference is not equal to zero. We use paired t-test to compare the means of the two samples in which the observations in one sample can be paired with the observations in the other sample.
116
Define F-Test. Where would you use it?
An F-test is any statistical hypothesis test where the test statistic follows an F-distribution under the null hypothesis. If you have 2 models that have been fitted to a dataset, you can use F-test to identify the model which best fits the sample population.
117
What is a chi-squared test?
A chi-squared test is any statistical hypothesis test where the test statistic follows a chi-squared distribution (a distribution of the sum of squared standard normal deviates) under the null hypothesis. It measures how well the observed distribution of data fits the expected distribution under the assumption that the variables are independent.
118
What is an F1 score?
The F1 score is a measure of a model's performance. It is the weighted average of the precision and recall of a model. The result ranges from 0 to 1 with 0 being the worst and 1 being the best model. F1 score is widely used in the fields of Information Retrieval and NLP. F1 Score = 2*Precision*Recall/(Precision + Recall)
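A minimal sketch of the formula, with hypothetical confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 8 true positives, 2 false positives, 2 false negatives
# precision = recall = 0.8, so F1 = 0.8
```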
119
What do you understand by Type I and Type II errors?
Type I error occurs if you reject the null hypothesis when it was true, also known as a False Positive. Type II error occurs if you accept the null hypothesis when it was false, also known as False Negative.
120
What is a Bayesian Classifier?
A Bayesian classifier is a probabilistic model which tries to minimize the probability of misclassification. From the training dataset, it calculates the probabilities of the features, given the class labels and uses this information in the test dataset to predict the class given (some of) the feature values by using the Bayes rule.
121
How can you use Naive Bayes classifier for categorical features? What if some features are numerical?
You can use any kind of predictor in a Naive Bayes classifier. All it needs is the conditional probability of a feature given the class, i.e., P(F|Class). For categorical features, you can estimate P(F|Class) using a distribution such as the multinomial or Bernoulli. For numerical features, you can estimate P(F|Class) using a continuous distribution such as the Gaussian (Normal). Since Naive Bayes assumes the conditional independence of features, it can use different types of features together: you calculate each feature's conditional probability and multiply them together to get the final prediction.
122
Why is Naive Bayes called "naive"?
Naive Bayes assumes all the features in a dataset are equally important and conditionally independent of each other. These assumptions are rarely true in real world scenarios which is why Naive Bayes is called "Naive".
123
Compare the time complexity of training Naive Bayes vs Logistic Regression.
Let n be the number of training examples and d the number of features. Naive Bayes trains in a single pass over the data, O(nd), since it only counts feature statistics per class. Logistic Regression has no closed-form solution, so it must be fit iteratively (e.g., with gradient descent) at O(nd) per iteration, making its total training time substantially higher in practice.
124
What is the difference between a generative approach and a discriminative approach? Give an example of each.
A generative model learns the joint probability distribution P(x, y) whereas a discriminative model learns the conditional probability distribution P(y|x), where y is the output class label and x is the input variable. Generative models learn the distribution of the individual classes whereas discriminative models learn the boundary between classes. Naive Bayes is a generative approach as it generates the joint probability distribution of the features and the output label using P(Y) and P(X|Y), whereas Logistic Regression is a discriminative approach because it tries to find a hyperplane which separates the classes.
125
Explain prior probability, likelihood and marginal likelihood in the context of Naive Bayes algorithm.
Prior probability is the proportion of each class of the (binary) dependent variable in the dataset. It is the closest guess you can make about a class without any further information. For example, say the dependent variable is binary, spam or not spam, with 75% spam and 25% not spam; you can then estimate a 75% chance that any new email is spam. Likelihood is the probability of an observation given a class. For example, the probability that the word "CASH" appears in a message, given that the message is spam, is a likelihood. Marginal likelihood is the probability that the word "CASH" is used in any message, regardless of class.
126
Define laplace estimate. What is m-estimate?
The Laplace estimate (add-one smoothing) prevents zero probabilities by adding 1 to every count: ```P(x) = (count(x) + 1) / (N + d)```. It's essential for Naive Bayes where a single zero probability would zero out the entire prediction. The m-estimate generalizes this as ```P(x) = (count(x) + m·p) / (N + m)```, where m controls smoothing strength and p is the prior. Laplace is the special case where ```m = d``` and ```p = 1/d```. The m-estimate offers more flexibility when you have domain knowledge or want to tune the bias-variance tradeoff in probability estimates.
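The two estimates can be sketched directly (a minimal illustration, not a full Naive Bayes implementation):

```python
def laplace_estimate(count, total, num_values):
    """Add-one smoothing: P(x) = (count(x) + 1) / (N + d)."""
    return (count + 1) / (total + num_values)

def m_estimate(count, total, m, prior):
    """General smoothing: P(x) = (count(x) + m*p) / (N + m)."""
    return (count + m * prior) / (total + m)

# A zero count no longer yields a zero probability:
# laplace_estimate(0, 10, 4) gives 1/14 instead of 0
```

Setting m = d and prior = 1/d in the m-estimate reproduces the Laplace estimate, as the card states.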
127
What is a confusion matrix? Explain it for a 2-class problem.
A confusion matrix is a table layout which describes the performance of a model on the test dataset for which the true values are known. For a binary or 2-class classification, where the label can take two values, 0 (false) and 1 (true), it is a 2x2 table whose rows are actual classes and whose columns are predicted classes. The four cells are True Positives (actual 1, predicted 1), False Negatives (actual 1, predicted 0), False Positives (actual 0, predicted 1), and True Negatives (actual 0, predicted 0).
128
Compare Logistic Regression with Decision Trees.
Decision Trees partition the feature space into smaller and smaller subspaces, whereas Logistic Regression fits a single hyper-surface to divide the feature space into two. When the classes are not well separated, Decision Trees are susceptible to overfitting, whereas Logistic Regression, being simpler and having less variance, is less prone to overfitting and generalizes better. So, for datasets with very high dimensionality, it is better to use Logistic Regression to avoid the Curse of Dimensionality.
129
How can you choose a classifier based on the size of training set?
If the training set is small, high bias/low variance models such as Naive Bayes tend to perform better because they are less likely to overfit. If the training set is large, low bias/high variance models such as Decision Trees can perform better because they can reflect more complex relationships.
130
What do you understand by the term "decision boundary"?
A decision boundary or decision surface is a hypersurface which divides the underlying feature space into two subspaces, one for each class. If the decision boundary is a hyperplane, then the classes are linearly separable.
131
What are some reasons where you would want to use a Decision Tree?
When you fit a Decision Tree to a training dataset, the top few nodes on which the tree is split are basically the most important features in the dataset, and thus you can use it for feature selection to select the most relevant features. Decision Trees are also insensitive to outliers, since splitting happens based on the proportion of samples within the split ranges and not on absolute values. Finally, their tree-like structure makes them very easy to understand and interpret. They do not need data to be normalized and work well even when features have nonlinear relationships with the target.
132
What are some of the disadvantages of using a Decision Tree algorithm?
1. Even a small change in input data can, at times, cause large changes in the tree, as it may drastically impact the information gain used by Decision Trees to select features. 2. Decision trees, moreover, examine only a single feature at a time, leading to rectangular classification boxes which may not correspond well with the actual distribution of records in the decision space. 3. Decision Trees are inadequate when it comes to regression and predicting continuous variables: a continuous variable can take an infinite number of values within an interval, which is very hard to capture in a tree with only a finite number of branches and leaves. 4. There is a possibility of duplication, with the same sub-tree appearing on different paths, leading to complex trees. 5. Every feature in the tree is forced to interact with every feature further up the tree, which is extremely inefficient if there are features with no or weak interactions.
133
Define entropy. Then, provide the Numpy implementation.
Entropy is a measure of uncertainty associated with a random variable, Y. It is the expected number of bits required to communicate the value of the variable. It is calculated as −Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of class i in a node. It ranges from 0 (pure node) to log₂(k) for k classes.
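The card asks for a NumPy implementation; a minimal version over an array of class labels:

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i), over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# A 50/50 split gives 1 bit; a pure node gives 0 bits
```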
134
What is meant by "information gain"?
Information gain is used to identify the best feature to split a given training dataset. It selects the split S that most reduces the conditional entropy of output Y for the training set D. In simple terms, the Information Gain is the change in the Entropy, H, from a prior state to a new state when split on a feature. Formally: IG(D, S) = H(D) − H(D|S), where H(D) is the entropy of the dataset before splitting and H(D|S) is the conditional entropy after splitting on feature S. When splitting for a decision tree, we use the weighted average of child entropies to calculate the information gain for the split (much like with Gini impurity): Formula: IG(D, S) = H(D) − Σ (|Dᵥ| / |D|) · H(Dᵥ) Where H(X) = −Σ p(xᵢ) · log₂(p(xᵢ)) is the entropy, Dᵥ are the subsets after splitting on feature S, and |Dᵥ|/|D| is the weight for each subset.
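A minimal NumPy sketch of the formula, assuming a categorical feature (the entropy helper is included for self-containment):

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, feature):
    """IG(D, S) = H(D) - sum_v (|D_v|/|D|) * H(D_v)."""
    y, feature = np.asarray(y), np.asarray(feature)
    weighted_child = sum(
        (feature == v).mean() * entropy(y[feature == v])
        for v in np.unique(feature)
    )
    return entropy(y) - weighted_child

# A split that perfectly separates the classes gains the full 1 bit;
# a split that leaves each child 50/50 gains nothing
```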
135
How can Information Gain be biased or less optimal?
Information Gain is biased towards the tests with many outcomes. For instance, consider a feature that uniquely identifies each training sequence. Splitting on this feature would result in many branches, each of which is "pure" (has instances of only one class), i.e., maximal information gain, and this affects the model's generalization accuracy. To address this limitation, the C4.5 algorithm uses a splitting criterion known as the Gain Ratio. Gain Ratio normalizes the Information Gain by dividing it by the entropy of the split being considered, thereby avoiding the unjustified favoritism of Information Gain: GainRatio(D, S) = IG(D, S) / SplitInfo(D, S), where SplitInfo(D, S) = −Σ (|Dᵥ| / |D|) · log₂(|Dᵥ| / |D|) is the entropy of the split itself.
136
What are 4 splitting rules used by different Decision Tree algorithms?
1. Information Gain 2. Gain Ratio 3. Gini Impurity 4. Multi-variate split - Multivariate decision trees can use splits that contain more than one attribute at each internal node.
137
Is using an ensemble like Random Forest always good?
Always using an ensemble may seem like a better approach than a single Decision Tree, but Random Forests have their own limitations. These include: 1. Ensembles generally do not perform well when the relationship between dependent and independent variables is highly linear. 2. Unlike Decision Trees, the classification made by Random Forests is difficult to interpret easily. 3. Random Forest ensembles are more computationally expensive than a single Decision Tree (though their trees can be trained and evaluated in parallel, unlike Gradient Boosted Machines, which are trained sequentially).
138
What is pruning (in decision trees) and why is it important?
Pruning is a technique which reduces the complexity of the final classifier by removing sub-trees whose existence does not impact the accuracy of the model. In pruning, you grow the complete tree and then iteratively prune back some nodes until further pruning is harmful. This is done by evaluating the impact of pruning each node on the tuning (validation) dataset accuracy and greedily removing the one that most improves the tuning dataset accuracy. One simple way of pruning a Decision Tree is to impose a minimum number of training examples that must reach a leaf. Pruning keeps the tree simple without affecting the overall accuracy. It helps solve the overfitting issue by reducing both the size and the complexity of the tree.
139
What are four advantages and four disadvantages of using K-Nearest Neighbors?
Advantages: 1. Simple to understand and implement. KNN requires no explicit training phase — it simply stores the training data and defers all computation to prediction time (this is what "lazy learner" means). 2. Flexible choice of distance metrics and features. You can adapt KNN to different problem types by choosing appropriate distance functions (e.g., Euclidean, Manhattan, cosine similarity). 3. Naturally handles multi-class classification. Unlike some algorithms that require special adaptations for more than two classes, KNN works seamlessly with any number of classes. 4. No assumptions about data distribution. KNN is non-parametric, meaning it makes no assumptions about the underlying shape of the data, which makes it versatile across many problem types. Disadvantages: 1. High memory usage. Because KNN stores the entire training dataset and uses it at prediction time, it can be very memory-intensive for large datasets. 2. Slow prediction time at scale. For each new prediction, KNN must compute distances to every training point, making inference slow when the training set is large. 3. Sensitive to irrelevant or poorly scaled features. If features aren't carefully selected or normalized, irrelevant dimensions can dominate the distance calculation and hurt accuracy. 4. Requires a large, representative training set. KNN needs sufficient data density to find meaningful nearest neighbors — with sparse or small datasets, predictions can be unreliable.
140
How do you choose the optimal k in k-NN?
Here are some solid methods for choosing optimal k in k-NN: 1. Cross-validation (most reliable): train with different k values and evaluate using k-fold cross-validation; pick the k that minimizes validation error. This is the gold standard approach. 2. The √n rule of thumb: start with k ≈ √n, where n is the number of training samples; it's a quick baseline, not a final answer. 3. Elbow method: plot error rate vs. k and look for the "elbow", the point where error stops decreasing meaningfully; beyond that point you're gaining little while increasing bias. 4. Odd k for binary classification: always prefer odd k values when you have 2 classes to avoid tie-breaking ambiguity. And here are some practical considerations: 1. Small k → low bias, high variance (overfitting). 2. Large k → high bias, low variance (underfitting). 3. Larger datasets generally tolerate larger k. 4. Use weighted k-NN (distance-weighted votes) to reduce sensitivity to the choice of k.
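The cross-validation approach can be sketched with a plain-NumPy k-NN and leave-one-out error; the two-cluster dataset is a hypothetical illustration (real code would typically use a library such as scikit-learn):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

def loo_error(X, y, k):
    """Leave-one-out cross-validation error rate for a given k."""
    idx = np.arange(len(y))
    wrong = sum(
        knn_predict(X[idx != i], y[idx != i], X[i], k) != y[i]
        for i in range(len(y))
    )
    return wrong / len(y)

# Hypothetical two-cluster dataset; pick the k with the lowest LOO error
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
best_k = min([1, 3, 5, 7], key=lambda k: loo_error(X, y, k))
```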
141
What is Hamming Distance?
Hamming Distance is the number of positions at which two equal-length sequences differ. Examples: 1011101 vs 1001001 → 2 positions differ → Hamming distance = 2 "karolin" vs "kathrin" → 3 positions differ → Hamming distance = 3
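A one-function sketch of the definition:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# hamming_distance("1011101", "1001001") -> 2
# hamming_distance("karolin", "kathrin") -> 3
```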
142
If you have a lot of noise in your dataset, how would you vary k for k-NN?
You should increase k to handle any noise. A large k value would average out or nullify noise or outliers in a given dataset.
143
What is a t-distribution?
A t-distribution is a probability distribution similar to the normal distribution but with heavier tails, used when working with small sample sizes or when the population standard deviation is unknown. One liner: "It's the normal distribution, but more uncertain - because you're estimating variance from data, not assuming you know it."
144
What are the two ways you can speed up the k-NN's computation (including both training and testing) time?
1. Edited nearest neighbors - Instead of retaining all the training instances, select a subset of them which can still provide accurate classifications. Use either forward selection or backward elimination to select the subset of the instances which can still represent other instances. 2. K-dimensional Tree - It is a smart data structure used to perform nearest neighbor and range searches. A k-d tree is similar to a decision tree except that each internal node stores one data instance (i.e., each node is a k-dimensional data point) and splits on the median value of the feature having the highest variance.
145
Define logistic regression.
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome that can take only a limited number of values, i.e. the response variable is categorical in nature. Logistic regression is a go-to method for classification problems when the response (output) variable is binary.
146
What are three ways you can evaluate a Logistic Regression model?
1. AUROC: You can use the AUROC curve along with a confusion matrix, plus recall, precision, accuracy, and F1 score. 2. AIC (Akaike Information Criterion): analogous to adjusted R^2 in linear regression. AIC is a measure of fit that penalizes the model for the number of model coefficients. We prefer a model with a minimum AIC value. 3. Deviance: Deviance represents the goodness of fit for a model; we prefer a model with a lower deviance value. Null deviance is the deviance of a model with only an intercept, while residual deviance is the deviance of the fitted model with its full weight vector.
147
What is a link function in Logistic Regression?
A link function provides the relationship between the expected value of the response variable and the linear predictor. Logistic Regression uses the logit as its link function: logit(p) = log(p/(1−p)) = wx, which inverts to P(y) = 1/(1+e^(−wx)).
148
What is the range of Logistic Regression?
(0, 1)
149
When is Logistic Regression multinomial?
Logistic Regression is multinomial when the number of classes to separate are more than two. A Multinomial Logistic Regression algorithm predicts the probabilities of each possible class as the outcome.
150
What is One vs All Logistic Regression?
In One Vs All, if there are n classes, then you have n different independent classification problems, one for each class. For the ith classification problem, you learn all the points which belong to class i, and all the other points are assumed to belong to a pseudo class "not i". For new test data, you run all n classifiers and predict the class whose classifier outputs the highest confidence score.
151
What can you do to speed up your logistic regression training without compromising a lot on the model's accuracy, if your training dataset is huge?
Reducing the number of iterations during gradient descent would reduce the training time, but it will hamper the accuracy as well. Instead, you can increase the learning rate to speed up convergence while still maintaining similar accuracy. Alternatively, you can use learning rate decay to keep it high for fast initial convergence, then reduce it to settle into a minimum. You can also consider Momentum, RMSProp, and Adam.
152
What do you understand by "maximal margin classifier"? Why is it beneficial?
(This is related to Support Vector Machines). A margin gives the distance of a data instance from the decision boundary. In the case of Support Vector Machines, the decision boundary is a hyperplane separating the two class labels. A "maximal margin classifier" draws the separating hyperplane so that its distance to the nearest instances of both classes is maximal, i.e., the hyperplane is at an equal distance from the closest points of each class. The maximal margin hyperplane is the optimal separating hyperplane and is less prone to overfitting.
153
How do you train a Support Vector Machine (SVM)? What about hard SVM and soft SVM?
Core Idea: An SVM finds the optimal hyperplane that separates classes by maximizing the margin — the distance between the hyperplane and the nearest data points from each class (called support vectors). Hard-Margin SVM: Assumption: Data is perfectly linearly separable — no misclassifications allowed. Soft-Margin SVM: Assumption: Data may not be perfectly separable — allows some misclassifications. Quick Comparison: Hard-SVM: 1. Must be linearly separable 2. No slack variables 3. No hyperparameter 4. Highly sensitive to outliers 5. Rare in real-world use Soft-SVM: 1. Works with overlapping classes 2. Slack variables ξᵢ ≥ 0 (xi, pronounced "ksai" or "zai") 3. The C hyperparameter is used for regularization; if C = infinity, you recover Hard-SVM. Small C tolerates more violations, giving a wider margin and better generalization; large C penalizes violations heavily, giving a narrower margin and less tolerance. 4. Outlier sensitivity controlled by C 5. Standard in real-world use. Soft-SVM is typically always used IRL because real data is noisy and rarely perfectly separable.
154
What is a kernel? Explain the Kernel trick.
A kernel is a function K(xi, xj) that computes the dot product between two data points in a higher-dimensional feature space, without explicitly transforming the data into that space: ```K(xi, xj) = ϕ(xi)⋅ϕ(xj)``` where ϕ is a mapping to a higher-dimensional space - but you never actually compute ϕ. The Problem Kernels Solve: Many real-world datasets are not linearly separable in their original space. The naive solution is to map data to a higher-dimensional space where it becomes linearly separable: ```x ∈ R^d ⟶ ϕ ⟶ ϕ(x) ∈ R^D (D ≫ d)``` But this can be quite expensive if D is huge (or infinite!) and become computationally intractable. The Kernel Trick: The key insight is that the SVM dual formulation only ever uses the data through dot products xi⋅xj. So you can substitute: ```xi⋅xj ⟶ K(xi, xj) = ϕ(xi)⋅ϕ(xj)``` You get the power of a high-dimensional transformation at the cost of a simple pairwise function evaluation. This is the kernel trick: you never compute ϕ(x) explicitly - you only ever evaluate K(xi, xj) directly in the original space.
155
What are four common Kernels, their formulas, and their use cases?
1. Linear: ```xi⋅xj``` - used for linearly separable data. 2. Polynomial: ```(xi⋅xj + c)^d``` - use when there is moderate non-linearity. 3. RBF/Gaussian: ```exp(−‖xi − xj‖^2 / (2σ^2))``` - general purpose, the most popular. 4. Sigmoid: ```tanh(α xi⋅xj + c)``` - used for neural-net-like boundaries. Note that the RBF kernel maps data to an infinite-dimensional space - yet computing it is just a single exponential evaluation!
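The RBF kernel can be sketched in NumPy; the sample points below are hypothetical. The resulting Gram matrix is symmetric with ones on the diagonal, since every point is at distance 0 from itself:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Hypothetical sample points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X)
```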
156
Recall the concrete example of why the kernel trick works.
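A classic concrete example (one common choice; the author's original example may have differed) is the degree-2 polynomial kernel in R^2. Take ```K(x, z) = (x⋅z)^2``` with x = (x1, x2) and z = (z1, z2). Expanding: ```(x1 z1 + x2 z2)^2 = x1^2 z1^2 + 2 x1 x2 z1 z2 + x2^2 z2^2 = ϕ(x)⋅ϕ(z)``` where ```ϕ(x) = (x1^2, √2 x1 x2, x2^2)```. So squaring a single 2-D dot product is exactly equivalent to mapping both points into a 3-D feature space and taking the dot product there - yet K never constructs ϕ(x). For a degree-d polynomial kernel on n features, the explicit feature space has combinatorially many dimensions, while evaluating K remains one dot product plus a power.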
157
When training a Support Vector Machine, what value are you optimizing for?
The SVM problem can have many possible hyperplanes which separate the positive and negative instances, but the goal is to choose the hyperplane which maximizes the margin between the classes. The reasoning behind this optimization is that a hyperplane which not only separates the training instances but is also as far away from them as possible generalizes the best and does not result in overfitting.
158
How does a kernel method scale with the number of instances? (e.g. a Gaussian rbf kernel)?
A kernel method generally constructs the kernel matrix of order R^(N x N), where N is the number of data instances. Hence, the complexity of a kernel method depends on the number of data instances, not on the number of features. A kernel method scales quadratically with N for constructing the Gram matrix, and cubically for operations such as matrix inversion.
159
List three ways to overcome scaling issues with SVMs.
1. Nystrom Method - Kernel matrix computation varies quadratically with N, the number of data instances, which becomes a bottleneck when N becomes very large. To alleviate this issue, the Nystrom approximation is used, which generates a low-rank kernel matrix approximation of rank d << N. 2. Random features with approximate nearest-neighbor queries - map the input data to a randomized low-dimensional feature space such that inner products of the transformed data approximately equal the kernel values in the original space. 3. Distributed/Parallel training algorithms and applying multiple SVM classifiers together.
160
What are the pros and cons of using Gaussian processes or general kernel methods approach to learning for SVMs?
Pros: General kernel methods can work well with non-linearly separable data, are non-parametric, and more accurate in general. Cons: They do not scale well with the number of data instances and require hyperparameter tuning.
161
Can you find the solutions in SVMs which are globally optimal?
Yes, since the learning task is framed as a convex optimization problem, which is bound to have one optimum solution only, and that is the global minimum. There is only a single global minimum in the case of SVM, as opposed to a multi-layer neural network, which has multiple local minima; the solution achieved there may or may not be a global minimum, depending upon the initial weights.
162
What is an Artificial Neural Network?
An Artificial Neural Network (ANN) is a computational model inspired by biological neural networks, used as a general-purpose function approximation tool. Typically, ANNs are organized in layers. The first layer consists of input neurons, which pass the input data on to hidden layers (where each neuron is called a hidden unit), which in turn pass their outputs on to the final output layer.
163
What are five advantages and four disadvantages of using an ANN?
Advantages: 1. It is a nonlinear model that is easy to use and understand as compared to statistical methods. 2. It has the ability to implicitly detect complex nonlinear relationships between the dependent (output) and independent (input) variables. 3. It can easily train on a large amount of data. 4. It can be easily run in a parallel architecture, thereby drastically reducing the computation time. 5. It is a non-parametric model (does not assume the data distribution to be based on any finite set of parameters such as mean or variance) which does not need a lot of statistics background. Disadvantages: 1. Because of its black-box nature, it is difficult to interpret how the output is generated from the input. 2. It cannot extrapolate the results. One reason for this shortcoming can be its non-parametric nature. 3. It can suffer from overfitting easily. Due to a large number of hidden units (neurons), ANNs can be very complex models which often leads to overfitting on the training dataset and poor performance on the test dataset. Regularization and early stopping can help generalize the model and reduce overfitting. 4. ANNs generally converge slowly. Can be sped up with Momentum, RMSProp, and Adam.
164
What is a "perceptron"?
A perceptron is an algorithm which learns a binary classifier by directly mapping the input vector "x" to the output response "y" with no hidden layers. ```y = 1 if w.x + b > 0 else y = 0``` where w is a vector of real-valued weights representing the slope and b is the bias, representing the horizontal shift of the output vs input curve from the origin.
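The prediction rule above, plus the classic perceptron learning rule, can be sketched as follows (the AND-gate data is an illustrative choice, since AND is linearly separable):

```python
import numpy as np

def perceptron_predict(w, b, x):
    """y = 1 if w.x + b > 0 else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def perceptron_train(X, y, lr=1.0, epochs=10):
    """Classic perceptron rule: for each misclassified point, move the
    weights toward positives and away from negatives."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - perceptron_predict(w, b, x_i)
            w += lr * error * x_i
            b += lr * error
    return w, b

# Hypothetical example: learn the (linearly separable) AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
```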
165
What is the role of hidden units in ANNs?
Hidden units transform the input space into a new space where the perceptrons suffice. They numerically represent new features constructed from the original features in the input layer. Each hidden layer (consisting of hidden units) transforms its input layer in a new feature space which is easier for the output layer to interpret. For instance, you have a raw image supplied as the input layer, and the first hidden layer transforms the raw pixel data into the edges in the image, the second hidden layer detects shapes from the edges, and the output layer performs object recognition on those shapes.
166
What is an activation function?
An activation function, also known as the transfer function, computes the output of a hidden or output unit from its weighted input sum. It can be the identity, the sigmoid function, tanh, ReLU, etc.
167
Does gradient descent converge to a global minimum in a single-layered network? What about a multi-layered network?
Since the error surface of a single-layered neural network is convex, gradient descent is bound to converge to a global minimum. On the contrary, the error surface of a multi-layered neural network is not convex, and hence gradient descent may or may not converge to a global minimum, depending upon the initial weights.
168
How should you initialize weights for sigmoid units?
The weights should be initialized with small values so that the activations are in the range where the derivative is large (learning is quicker), and random values to ensure symmetry breaking (i.e., if all weights are the same, the hidden units will all represent the same thing). Typical initial weights are in the range of [-0.01, 0.01].
169
How should you set the value of the learning rate?
You can set the learning rate either using hyperparameter tuning or through the "hit and trial" method, depending on the particular problem (the hit and trial method is a problem-solving technique involving repeated, educated guesses). If the learning rate is set too small, convergence takes very long; if it is too large, you get divergence.
170
Can backpropagation work well with multiple hidden layers?
With many layers, back propagation can struggle to work well as the increase in layers can lead to vanishing or exploding gradients. We can mitigate this with residual connections.
171
What is the loss function in an Artificial Neural Network?
A loss function is a function which maps the values of one or more variables onto a real number that represents the "cost" associated with those values. For backpropagation, the loss function calculates the difference between the actual output value and its expected output. The loss function is also sometimes referred to as the cost function or error function.
172
How does an Artificial Neural Network with three layers (one input layer, one hidden layer, and one output layer) compare to a Logistic Regression?
Logistic Regression, in general, can be thought of as a single layer Artificial Neural Network. It is mostly used in cases where the classes are more or less linearly separable, whereas an ANN can solve much more complex problems. One of the nice properties of Logistic Regression is that the Logistic cost function is convex, which means that you are guaranteed to find the global minimum. But, in the case of a multi-layer neural network, you lose this convexity and may end up at a local minimum, depending upon the initial weights.
173
What do you understand by Rectified Linear Units?
Rectified linear unit (ReLU) is an activation function, given by f(x) = max(0, x). Because of its linear form, it greatly speeds up the convergence of stochastic gradient descent. It makes the activation sparse and efficient as it yields 0 activation for negative inputs. But ReLU can be fragile during training; that is, ReLU units can irreversibly die during training since they can get knocked off the data manifold. With a large learning rate, a large gradient can "kill" a ReLU such that its input becomes negative for every example. This leaves the ReLU in the f(x) = 0 region, making the gradient 0 and leading to no changes in the weights. Neurons which enter this state stop responding to any change in the error and hence "die".
174
Can you explain the Tangent Activation Function? How is it better than the sigmoid function?
The tangent activation function, also known as the tanh function, is a hyperbolic activation function often used in Neural Networks. Its formula is: ```g_tanh(x) = (e^x - e^-x)/(e^x + e^-x)``` The output of the tanh function lies in the range (-1, 1). This provides an advantage over sigmoid, whose output lies in (0, 1): tanh is zero-centered, so its activations do not push all downstream gradients in the same direction, which typically speeds up convergence. Note that both tanh and sigmoid saturate for large-magnitude inputs, where their gradients approach zero; only inputs near zero keep tanh in its high-gradient regime.
175
Why is the softmax function used as an output layer in neural networks?
In a neural network, the output variable is usually modeled as a probability distribution where the output nodes (the different values that the output variable can take) are mutually exclusive of each other. The softmax function is a generalization of the logistic function. It squashes the k-dimensional output vector into a k-dimensional probability distribution where each entry is the probability of the output variable taking that value. Hence, each output node takes a value in the range (0, 1) and the sum of the values of all the entries is 1.
176
What are 5 good steps to take when training a Deep Neural Network?
1. Deep Neural Networks are mostly data-hungry, so the more data you have, the better predictions you may get from them. 2. Hidden units - Having more hidden units is still acceptable, but if you have fewer than the optimal number of hidden units, your model may suffer from underfitting. 3. Use back-propagation with Rectified Linear Units (ReLU activation functions). 4. Always initialize the weights with small random numbers to break the symmetry between different units. 5. You can try a gradually decreasing learning rate, which reduces after every epoch or every few hundred instances, in order to speed up convergence. (Honestly, there are a lot more, but these 5 are listed in the book.)
177
Name three regularization methods that can be applied to Artificial Neural Networks.
Regularization is an approach used to prevent overfitting of a model. Three ways to perform regularization in Artificial Neural Networks are: 1. Early Stopping - This is an upper bound on the number of iterations to run before the model begins to overfit. 2. Dropout - This is a technique where you randomly drop units (along with their connections) from the neural network during training. This prevents the units from co-adapting too much and helps reduce overfitting. 3. L1 or L2 penalty terms - L1 and L2 are regularization techniques which add a penalty term, weighted by a parameter lambda, to shrink the coefficients and discourage overfitting.
178
What are autoencoders?
Autoencoders are artificial neural networks which belong to Unsupervised Learning Algorithms and are used to learn the encoding of the given dataset, typically for the purpose of Dimensionality Reduction. They consist of 2 parts: 1. Encoding (converting the higher-dimensional input to much lower-dimensional hidden layer(s)) 2. Decoding (converting the hidden layer(s) to the output). Autoencoders try to learn an approximation to the input, and not actually predict any output. They are extremely useful as they find the low-dimensional representation of the given dataset and also remove any redundancy present in it.
179
Describe Convolutional Neural Networks (CNNs) and their 4 primary "building blocks".
CNNs are well suited for tasks such as image recognition, in which the input has spatial structure. They are based on 4 building blocks: 1. Convolution - The primary purpose of Convolution is to extract features from the input image. A small matrix, known as a filter or kernel, slides over the image and the dot product is computed. This dot product is called the Convolved Feature or Feature Map. By varying the filter, you can achieve different results such as Edge Detection, Blur, etc. 2. Rectified Linear Units - The purpose of ReLU is to introduce nonlinearity, since most real-world data is nonlinear. It is applied after every convolution step. 3. Pooling or Subsampling - Spatial Pooling (also called downsampling) reduces the dimensionality of each feature map. 4. Classification (Fully Connected Layer) - It is a traditional Multi-layer Perceptron. The term "fully connected" implies that every neuron in the previous layer is connected to every neuron in the next layer. The outputs from the convolution and pooling layers represent high-level features of the input image. The purpose of the Fully Connected Layer is to use these features for classifying the input image into various classes based on the training dataset.
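The sliding dot product in step 1 can be sketched in a few lines of NumPy (an illustrative `conv2d` of our own, not a library API; real CNN layers add channels, strides, and padding):

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" cross-correlation, as used in CNN convolution layers:
    # slide the kernel over the image and take dot products
    h, w = kernel.shape
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.array([[1., 2., 0.],
                  [3., 4., 1.],
                  [0., 1., 2.]])
edge = np.array([[1., -1.],
                 [1., -1.]])  # a simple vertical-edge filter
fmap = conv2d(image, edge)    # the resulting feature map is (2, 2)
```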
180
Which one is better - random weights or same weights assignment to the units in the hidden layer?
The weights should be initialized with random values to ensure symmetry breaking (i.e. if all weights are the same, the hidden units will all represent the same thing). Typical initial weights are in the range [-0.01, 0.01].
181
If the weights oscillate a lot over training iterations (often swinging between positive and negative values), what parameter do you need to tune to address this issue?
The Learning Rate. If the learning rate is too high, it will cause the result to jump over the optimal point resulting in the weights oscillating between positive and negative. If it is too low, it may take a very long time to converge.
182
Tell me about Recurrent Neural Networks (RNNs)
The idea behind RNNs is to make use of sequential information. They are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. They have applications in various NLP tasks such as Speech Recognition, Image Captioning, and Language Modelling. Unlike traditional Neural Networks, RNNs have loops in them, allowing information to persist. The figure shows an RNN being unrolled into a full network, which simply means writing out the network for the complete sequence, where x_i is the input at time i and h_i is the corresponding output. The output at time i depends on the previous information. For instance, predicting the next word in a sentence would depend on the words seen so far.
183
What is regression analysis?
Regression analysis is a set of statistical processes that estimate the relationship between the independent and dependent variables. The most common approach is to estimate the conditional expectation of the dependent variable given the independent variable (based on the assumption that the independent variables are linearly independent).
184
How does Regression belong to the Supervised Learning approach?
Regression belongs to the Supervised Learning category because it learns the model from a labeled dataset to predict continuous or discrete variables.
185
What are three types of regression?
1. Linear Regression - It tries to fit a straight line to model the relationship between the dependent variable and the independent variable. 2. Logistic Regression - It finds the probability of success. It is used when the dependent variable is binary. 3. Polynomial Regression - It fits a curve between dependent and independent variables, where the dependent variable is a polynomial function of the independent variable.
186
Can you think of a scenario where a learning algorithm with low bias and high variance may be suitable?
Low bias and high variance can be used in K-Nearest Neighbors. They have low bias because they do not assume anything special about the data distribution and high variance because they can easily change their prediction in response to the composition of the training set.
187
What can you interpret from regression coefficient estimates?
First off, you get an Intercept coefficient, B_0, and a set of B_i coefficients. The regression equation can be written as: ``` Y = B_0 + Sum(i=1 to N) of (B_i * X_i) ``` The Intercept, B_0, can be interpreted as the predicted value of the response variable when all the predictor values are 0. The Coefficients for Continuous predictors - For each continuous predictor X_i, its corresponding coefficient B_i represents the difference in the response variable's predicted value for each one-unit difference in X_i, keeping all other X_j constant. The Coefficients for Categorical predictors - For each categorical predictor X_i, since its levels can be coded as 0, 1, 2, etc., a one-unit difference in X_i represents switching from one category to another, keeping all other X_j constant.
188
What are the downfalls of using too many or too few variables for performing regression?
Too many variables can cause overfitting. If you have too many variables in your regression model, then your model may suffer from a lack of degrees of freedom and have some variables correlated with each other. Having too few variables, on the other hand, will lead to underfitting, as you won't have enough predictors to learn from the training dataset.
189
What is linear regression? Why is it called linear?
In linear regression, the dependent variable y is the linear combination of the parameters. For instance, if x is the independent variable, and Beta_0 and Beta_1 are two parameters: ``` y = Beta_0 + Beta_1*x ``` Note that instead of x, you can have any function of x, such as x^2. In that case: ``` y = Beta_0 + Beta_1*x^2 ``` This is still a linear regression as y is still a linear combination of the parameters (Beta_0 and Beta_1).
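A quick NumPy sketch of this point: fitting y = Beta_0 + Beta_1*x^2 by least squares, which remains linear in the parameters even though the feature is quadratic in x (the data and coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
# True relationship is quadratic in x, but linear in the Betas
y = 1.5 + 0.8 * x**2 + rng.normal(0, 0.1, size=100)

# Design matrix: a column of ones (intercept) and a column of x^2
X = np.column_stack([np.ones_like(x), x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates of Beta_0 and Beta_1
```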
190
What is an embedding layer, and how does it relate mathematically to one-hot encoding followed by a matrix multiply?
An embedding layer is a trainable lookup table that maps integer indices to dense vectors. Mathematically, it's equivalent to multiplying a one-hot vector by a weight matrix W — but since that multiplication just selects the ith row of W (where i is the index of the 1 in the one-hot vector), an embedding layer skips the multiply entirely and directly indexes W[i]. Same result, much cheaper computation.
191
What two problems do embeddings solve compared to using one-hot encoded vectors as features?
First, dimensionality — one-hot vectors grow to the size of the vocabulary (e.g., 50,000 for words), which is wasteful since they're almost entirely zeros. Embeddings compress this to a chosen dense dimension (e.g., 256). Second, similarity — one-hot vectors imply zero similarity between all categories (every pair is equidistant), while embedding vectors are learned during training so that semantically similar items end up with nearby vectors in the embedding space.
192
Walk through the full pipeline of how a categorical feature becomes an embedding vector.
Three steps: (1) Map the category string to an integer index via a vocabulary mapping (e.g., "blue" → 1). (2) Use that integer to index into the embedding weight matrix W of shape (vocab_size, embed_dim). (3) Return W[index], which is a dense, trainable vector. During training, backpropagation updates only the rows of W that were looked up in each batch.
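The three steps, and the equivalence to a one-hot matrix multiply, can be sketched in NumPy (the vocabulary and dimensions are illustrative):

```python
import numpy as np

# (1) Vocabulary mapping: category string -> integer index
vocab = {"red": 0, "blue": 1, "green": 2}
rng = np.random.default_rng(42)
W = rng.normal(size=(len(vocab), 4))  # (vocab_size, embed_dim)

idx = vocab["blue"]

# One-hot vector followed by a matrix multiply...
one_hot = np.zeros(len(vocab))
one_hot[idx] = 1.0
via_matmul = one_hot @ W

# ...is identical to (2)-(3): directly indexing row idx of W
via_lookup = W[idx]
print(np.allclose(via_matmul, via_lookup))
```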
193
How can you check if a regression model fits data well?
You can use the following statistics to test the model's fitness: 1. R-squared - measures how much of the variation in your outcome variable is explained by your model's predictors. 2. F-Test - evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative hypothesis that at least one is not. It is used to identify the best model which fits the given dataset. 3. Root Mean Squared Error (RMSE) - The square root of the variance of the residuals. It measures the average deviation of the estimates from the observed value.
194
When/how can you use k-Nearest Neighbors for regression?
You can use K-NN in regression to estimate continuous variables. One common approach is to predict a weighted average of the k nearest neighbors, weighted by the inverse of their distance.
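A minimal 1-D NumPy sketch of inverse-distance-weighted K-NN regression (our own illustrative function, with a small epsilon to avoid division by zero):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=3):
    # Find the k training points closest to the query x
    dists = np.abs(X_train - x)
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights: closer neighbors count more
    w = 1.0 / (dists[nearest] + 1e-8)
    return np.sum(w * y_train[nearest]) / np.sum(w)

X_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
print(knn_regress(X_train, y_train, 1.4, k=2))
```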
195
In regression, do you always need the intercept term? When do you need it and when don't you?
The intercept term signifies the response variable's shift from the origin. It ensures that the model is unbiased, i.e., the residual mean is 0. If you omit the intercept term, then your model is forced to go through the origin and the slope would become steeper (and biased). Hence, you should not remove the intercept term unless you are completely sure that it is 0. For instance, if you are calculating the area of a rectangle, with height and width as the predictor variables, you can omit the intercept term since you know that the area should be 0 when both height and width are 0.
196
What is meant by "collinearity"?
Collinearity is a phenomenon in which two predictor variables are linearly related to each other. Let X1 and X2 be two variables; then: ```X1 = lambda_0 + lambda_1*X2``` where lambda_0 and lambda_1 are constants. X1 and X2 are perfectly collinear when this relationship holds exactly for every observation.
197
Explain multicollinearity.
Multicollinearity is a phenomenon in regression where one predictor (independent variable X_i) can be predicted as a linear combination of the other predictors with significant accuracy. Perfect multicollinearity means an exact linear relationship: ``` lambda_0 + lambda_1*X1 + lambda_2*X2 + ... + lambda_k*Xk = 0 ``` The issue with perfect multicollinearity is that it makes X.transpose @ X non-invertible, and the Ordinary Least Squares method needs to invert this matrix to find the optimal estimates. So, you would need to first remove the redundant feature and then perform OLS. Note that OLS is a method for estimating the unknown parameters in a linear regression model by minimizing the sum of the squares of the differences between the observed and predicted response variable.
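The non-invertibility is easy to see numerically; a small NumPy sketch with a deliberately redundant feature (the data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 2.0 * x1 + 1.0  # x2 is an exact linear function of x1

# Design matrix with intercept column: rank-deficient by construction
X = np.column_stack([np.ones(50), x1, x2])

# X^T X is singular under perfect multicollinearity, so the OLS
# normal equations have no unique solution
rank = np.linalg.matrix_rank(X.T @ X)
print(rank)  # rank 2 for a 3x3 matrix, i.e. singular
```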
198
What are the five assumptions that standard linear regression models with standard estimation techniques make?
The standard linear regression model makes the following five assumptions: 1. A linear relationship between the parameters and response variable exists. 2. The residuals follow the normal distribution (A residual, e_i, is the difference between the predicted value and the true value of the corresponding dependent (response) variable, y_i, where i represents the specific example in the data.) 3. No perfect multicollinearity exists among the predictors. 4. The number of observations is greater than the number of predictors. 5. The mean of the residuals is zero.
199
What is Regularization?
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function that discourages model complexity. The core idea: Instead of just minimizing training loss, you minimize: Total Loss = Training Loss + λ · Penalty where λ (lambda) controls how strongly complexity is penalized. Common types: L1 (Lasso) — penalizes the sum of absolute weights (|w|). Drives some weights to exactly zero, producing sparse models and acting as built-in feature selection. L2 (Ridge) — penalizes the sum of squared weights (w²). Shrinks all weights toward zero but rarely to exactly zero. Most common in practice. Elastic Net — combines L1 + L2, balancing sparsity and shrinkage. Dropout (neural nets) — randomly deactivates neurons during training, forcing the network to learn redundant representations. Early stopping — halts training before the model overfits the training data. Why it works: The penalty discourages the model from assigning large weights to any single feature, forcing it to find simpler, more generalizable patterns rather than memorizing training noise. Key tradeoff: Higher λ → more regularization → lower variance but higher bias. Tuning λ is typically done via cross-validation.
200
When does Regularization become necessary in Machine Learning?
Regularization becomes important when the model begins to either overfit or underfit. Another scenario where regularization is useful is when you want to optimize two competing functions simultaneously. In that case, there is a trade-off between them, and a regularization/penalty term is used to optimize the more important function at the cost of the less important one.
201
Q: What is softmax?
Softmax is a function that converts a vector of raw scores (logits) into a probability distribution over multiple classes. The formula: σ(zᵢ) = e^zᵢ / Σ e^zⱼ For each score zᵢ, exponentiate it and divide by the sum of all exponentiated scores. Key properties: 1. All outputs are in the range (0, 1) 2. All outputs sum to exactly 1 → interpretable as probabilities 3. Amplifies differences — the largest logit gets disproportionately more probability mass (due to the exponential) 4. Generalizes the sigmoid function to multiple classes (sigmoid is just softmax for 2 classes) Where it's used: 1. Output layer of multi-class classifiers (e.g., image classification with 1000 categories) 2. Attention mechanisms in Transformers — softmax normalizes attention scores so they sum to 1 across tokens 3. Policy networks in RL — converts raw action scores into a probability distribution for sampling
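A minimal NumPy sketch of the formula, including the standard max-subtraction trick for numerical stability (which leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow in exp without changing
    # the output, since the shift cancels in the ratio
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)        # the largest logit gets the most probability mass
print(p.sum())  # the entries sum to 1
```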
202
What do you understand by Ridge Regression? Why do you need it? How is it different from OLS Regression?
Ridge Regression is a linear regression technique that adds an L2 penalty to the OLS loss function to shrink coefficients and reduce overfitting. The objective: ``` Loss = Σ(yᵢ − ŷᵢ)² + λ Σwⱼ² ``` OLS minimizes residuals alone; Ridge minimizes residuals plus the sum of squared weights. Why you need it: Multicollinearity — when features are highly correlated, OLS coefficient estimates become unstable and high-variance. Ridge stabilizes them. Overfitting — in high-dimensional settings (many features, relatively few samples), OLS overfits. Ridge constrains the model. Ill-conditioned systems — OLS requires inverting XᵀX, which can be singular or near-singular. Ridge adds λI to make it always invertible: ``` (XᵀX + λI)⁻¹ Xᵀy ``` The bias-variance tradeoff: Ridge intentionally introduces bias (coefficients are shrunk, not exact) in exchange for lower variance — predictions generalize better to unseen data. Key interview point: Ridge never sets coefficients to exactly zero (unlike Lasso), so it keeps all features in the model. If sparsity/feature selection matters, Lasso or Elastic Net is preferred. λ is tuned via cross-validation. For Ridge vs OLS key differences, check the image.
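The closed-form solution above is a one-liner in NumPy; a small sketch with synthetic data showing that a larger λ shrinks the coefficient vector (data and λ values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

w_ols   = ridge_fit(X, y, lam=0.0)   # lam = 0 recovers OLS
w_ridge = ridge_fit(X, y, lam=10.0)  # larger lam shrinks the weights
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))
```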
203
What is Lasso Regression? How is it different from OLS?
Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a linear regression technique that adds an L1 penalty to the OLS loss function, shrinking some coefficients to exactly zero — effectively performing automatic feature selection. The objective: ``` Loss = Σ(yᵢ − ŷᵢ)² + λ Σ|wⱼ| ``` Why it's powerful — sparsity: Unlike Ridge (which shrinks weights toward zero), Lasso can shrink weights to zero. This means it eliminates irrelevant features entirely, producing a simpler, more interpretable model. Lasso vs. Ridge — when to use which: Use Lasso when you suspect only a few features truly matter and want a sparse model Use Ridge when most features are relevant and multicollinearity is the main concern Use Elastic Net when you want both sparsity and stability under multicollinearity Key interview point: Lasso has no closed-form solution because the absolute value function is non-differentiable at zero. It's solved using methods like coordinate descent or subgradient methods. λ is tuned via cross-validation. Check the image for a comparison between Lasso and OLS.
204
How does Ridge Regression differ from Lasso regression?
Both of them are regularization techniques, with the difference in their penalty functions. The penalty in Ridge regression is the sum of the squares of the coefficients whereas, for Lasso, it is the sum of the absolute values of the coefficients. Lasso regression is used to achieve sparse solutions by driving some of the coefficients to exactly zero. Ridge regression tends to smooth the solution: it keeps all the coefficients but shrinks the sum of their squares. A related point often raised here concerns losses rather than penalties: an L1 loss is more robust to outliers than an L2 loss, because squaring the error makes any outlier's term huge. L2 also produces a unique solution, whereas with L1 you can have multiple solutions.
205
Why does L1 produce zeros but L2 doesn't?
This comes down to geometry. The L1 constraint region is a diamond (sharp corners), and the loss function's elliptical contours are likely to touch it at a corner — where one or more weights are exactly zero. The L2 constraint is a sphere (no corners), so the contours touch it along a smooth edge, leaving all weights nonzero.
206
Why and where do you use Cluster Analysis?
Cluster analysis is the task of grouping (clustering) a set of objects in such a way that objects in the same cluster are much more similar to each other than those in other clusters. In some cases, Cluster analysis can be used for the initial analysis of the given dataset based on the different target attributes. For data lacking output labels, you can use a clustering technique to automatically find the class label by grouping input data instances into different clusters and then assigning a unique label to each cluster.
207
What are two examples of Cluster analysis methods?
Two of the most commonly used Clustering methods are: 1. Hierarchical Clustering - This produces a hierarchy of clusters, either by merging smaller clusters into larger ones or dividing larger clusters into smaller ones. The merging or splitting of clusters depends on the metric used for measuring the dissimilarity between sets of data instances. Some of the commonly used metrics are Euclidean distance, Manhattan distance, and Hamming distance. 2. K-Means Clustering - It assigns the data points to k clusters such that each data point belongs to the cluster with the closest mean. It is suitable when you have a large number of data points, and it uses far fewer iterations than Hierarchical Clustering.
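The K-Means assign/update loop can be sketched in NumPy (a minimal illustration on two synthetic blobs; for clarity the centers are seeded from chosen data points, whereas a real run would use random or k-means++ initialization):

```python
import numpy as np

def kmeans(X, k, init_idx, n_iter=20):
    centers = X[init_idx].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest mean
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob around (0, 0)
               rng.normal(5, 0.3, (20, 2))])  # blob around (5, 5)
labels, centers = kmeans(X, k=2, init_idx=[0, 20])
```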
208
Provide 3 differences between Partitioning method and Hierarchical method for clustering.
1. A partitional clustering is a division of the set of data objects into non-overlapping clusters such that each object is in exactly one cluster, whereas a hierarchical clustering is a set of nested clusters organized as a tree. 2. Hierarchical clustering does not require any input parameters, whereas partitional clustering algorithms typically require the number of clusters up front (K-Means does, though density-based methods like HDBSCAN do not). 3. Partitional clustering is generally faster than hierarchical clustering.
209
How do you evaluate the quality of clusters that are generated by a run of K-means?
One way to evaluate cluster quality is to resample the data (via bootstrap or by adding small noise) and compute the closeness of the resulting partitions, measured by Jaccard similarity. This approach allows you to estimate the frequency with which similar clusters can be recovered after resampling. Jaccard Similarity measures the overlap between two clusters by comparing shared members to total members: ``` J(A,B) = |A ∩ B| / |A ∪ B| ``` It ranges from 0 (no overlap) to 1 (identical). It is useful when comparing cluster assignments across runs or against ground-truth labels.
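The Jaccard formula is a few lines of plain Python (cluster memberships here are made-up point IDs for illustration):

```python
def jaccard(a, b):
    # |A intersect B| / |A union B| over the clusters' member sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# The same cluster recovered from two runs, differing by one member
run1 = {1, 2, 3, 4}
run2 = {2, 3, 4, 5}
print(jaccard(run1, run2))  # 3 shared members out of 5 total
```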
210
How would you assess the quality of a Clustering technique? (Hint, there are four common assessment approaches).
Cluster evaluation is a hard problem, and most of the time there is no perfect solution to it. Otherwise, it would be a classification problem where each cluster represents one class. Four common assessment approaches are: 1. Internal Evaluation, where the clustering result is assessed using only the clustered data itself, e.g. via measures of cluster cohesion and separation such as the silhouette score. 2. External Evaluation, where the result of the clustering is compared to an existing "ground truth". However, obtaining an external reference result is not straightforward in most cases. 3. Manual Evaluation by a human expert. 4. Indirect Evaluation by evaluating the utility of the clustering in its intended application.
211
What is Dimensionality Reduction and why do you need it?
As the name suggests, Dimensionality Reduction means finding a lower-dimensional representation of the dataset such that the original dataset is preserved as much as possible even after reducing the number of dimensions. Dimensionality Reduction reduces time and storage space required. It also addresses multi-collinearity which improves the performance of the ML model. Many high-dimensional datasets such as videos, human genes, etc are difficult to process as is. For such data, you need to remove the unnecessary and redundant features and keep only the most informative ones to better learn from them.
212
Are Dimensionality Reduction techniques supervised or unsupervised?
Generally, you use Dimensionality Reduction for Unsupervised Learning tasks, but it can also be used in Supervised Learning. One of the standard methods of Supervised Dimensionality Reduction is Linear Discriminant Analysis (LDA). It is designed to find low-dimensional projections that maximize class separation. Another approach is Partial Least Squares (PLS), which looks for the projection having the highest covariance with the group labels.
213
List five ways of reducing the dimensionality of a given dataset.
1. Principal Component Analysis 2. Backward Feature Elimination 3. Forward Feature Selection 4. Linear Discriminant Analysis 5. Generalized Discriminant Analysis
214
Is feature selection a Dimensionality Reduction technique?
Feature selection is a special case of Dimensionality Reduction in which the set of features made by feature selection must be a subset of the original feature set. In Dimensionality Reduction, it is not always the case that the new features are a subset of the original features (consider PCA which reduces the dimensionality by making new synthetic features from the linear combination of the original ones).
215
What is the difference between density-sparse and dimensionally-sparse data?
Density sparse data means that a high percentage of the data contains 0 or null values. Dimensionally sparse data is the one which has a large feature space, in which some of the features are redundant, correlated, etc.
216
Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
Reducing the number of features will definitely reduce the computational complexity of the model but it may not improve the performance of the SVM model, because SVM automatically uses regularization to avoid overfitting. So, performing dimensionality reduction before SVM modelling may not improve the performance of the SVM classifier.
217
Suppose you have a very sparse matrix with highly dimensional rows. Is projecting these rows on a random vector of relatively small dimensionality a valid dimensionality reduction technique?
Although it may not sound intuitive, random projection is a valid dimensionality reduction method. It is a computationally efficient way to reduce dimensionality by trading a controlled amount of error for smaller model sizes and faster processing times. Random projection is based on the idea that if the data points in a sparse feature space have a very high dimension, then you can project them into a lower-dimensional space in a way that approximately preserves the pairwise distances between the points.
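The distance-preservation claim is easy to check empirically; a small NumPy sketch using a Gaussian random projection matrix (dimensions here are illustrative, and the 1/sqrt(k) scaling keeps expected norms unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 2000, 400
X = rng.normal(size=(n, d))

# Project onto k random directions, scaled to preserve expected norms
R = rng.normal(size=(d, k)) / np.sqrt(k)
X_low = X @ R

# The pairwise distance between two points is approximately preserved
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(X_low[0] - X_low[1])
ratio = proj / orig
print(ratio)  # concentrates near 1 as k grows
```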
218
What is Independent Component Analysis? What is the difference between ICA and PCA?
Independent Component Analysis is a statistical technique in Unsupervised Learning which decomposes a multi-variate signal into independent non-Gaussian components. It defines a generative model in which the data variables are assumed to be linear or nonlinear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed to be non-Gaussian and mutually independent, and they are called the independent components of the observed data. These independent components are found by ICA. ICA has been used for Facial Recognition and Stock Prediction. PCA helps to find the low-rank representation of the dataset such that the first vector of the PCA is the one that best explains the variability of your data (the principal direction), the second vector is the second best explanation and is orthogonal to the first one, and so on. ICA finds a representation of the dataset as independent sub-elements. You can think of the data as a mixed signal, consisting of independent vectors.
219
What is Fisher Discriminant Analysis? Is it Supervised or Unsupervised? How is it different from PCA?
Fisher Discriminant Analysis is a Supervised Learning technique, which tries to find the components in such a way that the class separation is maximized while minimizing the within class variance. Both PCA and FDA techniques are used for feature reduction by finding the eigenvalues and eigenvectors to project the existing feature space into new dimensions. The major difference is that PCA falls under Unsupervised Learning and tries to find the components such that the variance in the complete dataset is maximized whereas FDA tries to maximize the separation between classes.
220
What are the differences between Factor Analysis and Principal Component Analysis?
PCA involves transforming the given data into a smaller set of components such that they are linearly uncorrelated with each other. Factor Analysis is a generalization of PCA which is based on maximum likelihood. PCA is used when you want to simply reduce your correlated observed variables to a smaller set of important, independent, orthogonal variables. Factor Analysis is used when you want to test a theoretical model of latent factors causing the observed variables.
221
How is Singular Value Decomposition (SVD) mathematically related to EVD for PCA?
PCA is usually performed using Eigenvalue Decomposition, but you can also use Singular Value Decomposition to perform PCA. The link: SVD on a centered X and EVD on the covariance matrix XᵀX/(n−1) are equivalent operations. The right singular vectors V of X are identical to the eigenvectors of the covariance matrix, and the singular values relate to its eigenvalues by λᵢ = σᵢ² / (n−1).
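This identity can be verified directly in NumPy on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)  # center the data first
n = X.shape[0]

# EVD route: eigenvalues of the covariance matrix, descending
cov = X.T @ X / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# SVD route: singular values of X relate via lam_i = s_i^2 / (n-1)
s = np.linalg.svd(X, compute_uv=False)
print(np.allclose(eigvals, s**2 / (n - 1)))
```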
222
Why do you need to center the data for PCA and what can happen if you do not do it?
Centering the data means bringing the mean to the origin by subtracting it from the data. It is required to ensure that the first principal component is indeed in the direction of maximum variance. Centered data (zero mean) is used to find a basis that minimizes the mean squared error. If you do not perform centering, then the first component might instead be misleading and correspond to the mean of the data. Centering is not required if you are performing PCA on a correlation matrix, since the data would already be centered after calculating the correlations.
223
Do you need to normalize the data for PCA? Why or why not?
PCA is about transforming the given data to the space which maximizes the variance. If the data is not normalized then PCA may select some features with the highest variance in the dataset, making them more important. For instance, if you use "grams" for a feature instead of "kgs", then its variance would increase and PCA might think that it has more impact, which may not be correct. Hence, it is very important to normalize the data for PCA.
224
What role does orthogonality play in PCA, and is post-hoc rotation (e.g., Varimax) necessary?
Orthogonality is fundamental to PCA — the principal components are constrained to be perpendicular to one another, which guarantees they capture uncorrelated, independent directions of variance. This is not optional; it's what makes PCA mathematically well-defined and ensures each component adds non-redundant information. Post-hoc rotation (e.g., Varimax, Oblimin), however, is not necessary and is an optional step borrowed from factor analysis. There's an important trade-off: Without rotation: Components are ordered by variance explained, which is ideal for dimensionality reduction. With rotation (Varimax): Variance is spread more evenly across components, making loadings easier to interpret — but you lose the clean variance-ordering guarantee. Key distinction: Orthogonality is a structural constraint built into PCA. Rotation is a post-hoc choice that trades predictive/compression power for interpretability.
225
Is PCA a linear model or not? Explain
PCA projects the original dataset onto a lower-dimensional linear subspace, called a hyperplane. All the mappings, rotations, and transformations performed are linear and can be expressed in terms of linear algebraic operations: a mean subtraction followed by a fixed matrix multiplication. Thus, PCA is a linear method for Dimensionality Reduction.
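This linearity is easy to verify (synthetic data assumed): scikit-learn's `transform` is exactly a mean shift followed by a matrix multiplication, reproducible by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# The transform is affine-linear: subtract the mean, multiply by a fixed matrix.
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))
```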
226
Have you heard of Kernel PCA or other non-linear Dimensionality Reduction techniques? Can you explain any one of them?
Kernel PCA extends standard PCA to handle non-linearly separable data by implicitly mapping the data into a high-dimensional feature space using the kernel trick, then performing PCA there.
The Core Idea: standard PCA finds linear directions of maximum variance. But if your data lies on a curved manifold (e.g., a Swiss roll), linear projections lose structure. Kernel PCA solves this without ever explicitly computing the high-dimensional mapping.
How it Works:
1. Choose a kernel, such as RBF
2. Compute the kernel matrix K
3. Center K in feature space
4. Eigendecompose K and project the data onto the top eigenvectors
The kernel trick means you compute dot products in the high-dimensional space without ever going there: the cost is O(n^2) in the number of samples, not in the (possibly infinite) feature-space dimension.
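The four steps above can be sketched in a few lines of numpy (synthetic data and an assumed `gamma=0.5` for the RBF kernel), with scikit-learn's `KernelPCA` as a sanity check; note the kernel matrix has shape n x n, which is where the O(n^2) sample cost comes from.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
gamma = 0.5  # assumed RBF bandwidth parameter

# 1.-2. Kernel matrix: implicit dot products in the RBF feature space.
K = rbf_kernel(X, gamma=gamma)

# 3. Center K in feature space: K_c = K - 1_n K - K 1_n + 1_n K 1_n.
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# 4. Eigendecompose and project onto the top-2 components.
eigvals, eigvecs = np.linalg.eigh(K_c)       # ascending order
idx = np.argsort(eigvals)[::-1][:2]
Z = eigvecs[:, idx] * np.sqrt(eigvals[idx])  # projections of the training points

# Sanity check against scikit-learn (components can differ by sign).
Z_sk = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit_transform(X)
match = bool(np.allclose(np.abs(Z), np.abs(Z_sk), atol=1e-6))
print(match, Z.shape)
```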
227
What Dimensionality Reduction techniques can be used for preprocessing your data?
Core Idea: preprocessing with Dimensionality Reduction is about removing noise, redundancy, and curse-of-dimensionality issues before feeding the data to your actual model, not just about visualization. Dimensionality Reduction can be broadly divided into Feature Extraction and Feature Selection, both of which are used for preprocessing; the resulting dataset is then used for learning. Here are three main categories of techniques:
1. Linear Methods
- PCA: remove correlated features, keep the top-k variance-explaining components. Fast, interpretable, a great default.
- SVD/Truncated SVD: same idea but works on sparse matrices (e.g., TF-IDF for NLP); used under the hood in PCA.
- LDA: supervised DR; maximizes class separability. Useful when labels are available at preprocessing time.
2. Feature Selection
- Variance Thresholding: drop near-zero-variance features outright.
- Correlation filtering: drop one of any highly correlated feature pair.
- L1 Regularization (Lasso): embeds selection into model training; drives irrelevant weights to zero.
3. Non-linear Methods
- Autoencoders: learn a compressed bottleneck representation; great for images/text with complex structure.
- UMAP: faster than t-SNE and better at preserving global structure; usable as preprocessing (unlike t-SNE).
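Several of these steps compose naturally into one scikit-learn pipeline; a minimal sketch on synthetic data (the constant column and the chosen component count are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = 3.0  # a constant (zero-variance) feature to be dropped

prep = Pipeline([
    ("drop_constant", VarianceThreshold(threshold=0.0)),  # feature selection
    ("scale", StandardScaler()),                          # put features on one scale
    ("pca", PCA(n_components=5)),                         # feature extraction
])
Z = prep.fit_transform(X)
print(Z.shape)  # (200, 5)
```

The same fitted pipeline can then be reused via `prep.transform` on validation or test data, so the reduction learned on the training set is applied consistently.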
228
What is the difference between Feature Selection and Feature Extraction?
Both of these techniques are used to avoid the Curse of Dimensionality, simplify models by removing redundant and irrelevant features, and reduce overfitting. But the difference lies in how they achieve it. Feature Selection means selecting a subset of the given features based on some criterion; Forward Selection and Backward Elimination are two ways to perform it. Feature Extraction means projecting the given feature space into a new feature space, as in SVD and PCA.
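A small side-by-side sketch on synthetic regression data (dataset sizes and the 3-feature target are assumptions): forward selection keeps 3 of the original columns, while PCA builds 3 new columns as linear mixes of all of them.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

# Feature Selection: forward selection keeps a subset of the ORIGINAL columns.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3, direction="forward")
X_sel = sfs.fit_transform(X, y)

# Feature Extraction: PCA constructs NEW columns from all 8 original ones.
X_ext = PCA(n_components=3).fit_transform(X)
print(X_sel.shape, X_ext.shape, sfs.get_support())
```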
229
What are four Feature Extraction techniques used for Dimensionality Reduction?
1. Independent Component Analysis (ICA)
2. Principal Component Analysis (PCA)
3. Kernel Based PCA
4. Singular Value Decomposition (SVD)
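All four have scikit-learn implementations with the same fit/transform interface; a quick sketch on synthetic data (shapes and component counts are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA, KernelPCA, PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

models = [
    FastICA(n_components=2, random_state=0),  # ICA: statistically independent sources
    PCA(n_components=2),                      # PCA: orthogonal max-variance directions
    KernelPCA(n_components=2, kernel="rbf"),  # Kernel PCA: non-linear extension
    TruncatedSVD(n_components=2),             # SVD: also works on sparse, uncentered data
]
shapes = {type(m).__name__: m.fit_transform(X).shape for m in models}
print(shapes)  # each technique maps 6 features down to 2
```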
230