ML Breadth Depth Qs Flashcards

(230 cards)

1
Q

What is the variance-bias trade-off?

A

The bias-variance trade-off is a fundamental concept in machine learning that describes how well a model fits data and generalizes to new predictions. Two properties, bias and variance, assess model fit, and there is a trade-off between them: reducing one tends to increase the other. A central aim of machine learning algorithms and techniques (e.g. ensembling and regularization) is to keep both low.

Let’s consider the meaning of bias and variance:
Bias: Bias is the difference between the average prediction of your model and the true value you are trying to predict. High bias means your model is overly simplistic: it makes strong assumptions about the data, which prevent it from learning the real underlying patterns. This leads to underfitting.

Variance: Variance measures how much your model’s predictions change with different training datasets. High variance means the model is highly sensitive to the specific data it’s trained on, capturing noise and peculiarities rather than the generalizable trend. This leads to overfitting.

The Trade-Off:
Let’s understand the trade-off in relation to the complexity of the model’s decision boundary.
Complex Models: Flexible models (like polynomial regression, deep neural networks, or decision trees with many splits) have the capacity to fit complex patterns in the data. This reduces bias but can lead to high variance if they start fitting the noise in the training data rather than the true underlying trend.
Simple Models: Linear models or shallow decision trees have high bias because of their simplifying assumptions. However, they are less prone to overfitting and tend to have lower variance.

Techniques to Address the Trade-Off
1. Regularization: Adds penalties to complex models to favor simpler explanations (e.g. L1/L2 regularization)
2. Hyper-parameter tuning using cross-validation: Evaluates combinations of parameters across multiple splits of the data to find the model complexity that best balances variance and bias.
3. Ensemble Methods: Combines multiple models to reduce variance (e.g. bagging, random forests).
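The trade-off can be seen empirically by varying model complexity. A minimal sketch, assuming scikit-learn is available; the noisy sine-wave dataset and the polynomial degrees are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # noisy sine wave

results = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Mean 5-fold CV error; sklearn returns negated MSE, so flip the sign.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    results[degree] = mse
    print(f"degree={degree:2d}  mean CV MSE={mse:.3f}")
```

Degree 1 underfits (high bias), a very high degree tends to overfit (high variance), and an intermediate degree typically minimizes the cross-validated error.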

2
Q

What is cross-validation? How does it work?

A

Suppose that you randomly sample and allocate 70% of your data for training the model and the remaining 30% for testing it. Let’s also assume that the prediction problem is regression, so mean squared error (MSE) is used to evaluate the model. You compute the MSE of the model’s predictions on the testing data and obtain 679.34.

This approach seems to work, but there is a drawback.

Suppose that model performance is evaluated on new testing data, resulting in an MSE of 824.44. This discrepancy suggests that there is variability in the model’s prediction error from one set of data to the next. Cross-validation reduces this uncertainty in estimating the error by averaging testing errors across multiple folds of the data.

The procedure of cross-validation is simple. You choose K, the number of folds, which partitions the entire dataset into K folds.

Each fold serves as validation data once while the remaining folds are used for training the model. This process is repeated K times so that an error is computed on every fold.

Finally, the errors are averaged to produce a single error score with reduced uncertainty.
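The steps above can be sketched as follows; scikit-learn’s KFold is assumed, and the linear dataset and model are placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=500)

fold_errors = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    # Each fold takes a turn as validation data; the rest trains the model.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_errors.append(mean_squared_error(y[val_idx], preds))

cv_mse = np.mean(fold_errors)  # single averaged error with reduced uncertainty
print(f"per-fold MSE: {np.round(fold_errors, 2)}, mean: {cv_mse:.2f}")
```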

3
Q

Cross-validation AUC is 0.90. However, when the model is productionalized, AUC drops to 0.75. Why?

A

The discrepancy between offline and online model performance often happens when a proper holdout test wasn’t used to measure performance. Cross-validation works well when the observations in the data are independent, meaning that past observations do not influence future ones. However, in many modeling exercises, observations are autocorrelated: (1) a fraud user previously banned will re-appear with new accounts and new behaviors to avoid detection; (2) an online shopper’s spending patterns change over time. Given this dependence, you can see how it’s problematic to use future data to predict and evaluate past data in cross-validation. Hence, the better design is time-series cross-validation: always train on historical data, and predict and evaluate on future data.
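A minimal sketch of the time-ordered splits, assuming scikit-learn’s TimeSeriesSplit; the ten observations are placeholders ordered by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 observations in chronological order

splits = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Training indices always precede validation indices: past predicts future.
    splits.append((train_idx.tolist(), val_idx.tolist()))
    print(f"train={train_idx.tolist()} -> validate={val_idx.tolist()}")
```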

4
Q

What happens if you increase the number of folds in cross validation?

A

If you increase the number of folds, each model is trained on a larger fraction of the data, so the performance estimate becomes less biased: it more closely reflects training on the full dataset. At the same time, the training sets overlap more and each validation fold gets smaller, which tends to increase the variance of the estimate. Computation cost also grows with the number of folds.

5
Q

Does cross-validation improve model performance?

A

No, cross validation is not designed to improve model performance. It’s designed to improve the measurement of model performance.

6
Q

How would you handle multicollinearity?

A

To begin answering this question, let’s first define multicollinearity.

Multicollinearity is the presence of two or more correlated features in a model. The best practice in building a highly-predictive, interpretable machine learning model is to remove multicollinearity.

Multicollinearity can harm:
1. The predictive performance of a model because of overfitting.
2. The interpretability of the model such that the variable importance of a feature correlating with another would be inaccurate.
3. The maintainability of large correlated features in a production environment.

Now, let’s address the interviewer’s question on treating multicollinearity in a model. You can list techniques:
1. Use Pearson and Spearman correlations to identify correlated variables. Use Pearson correlation if the relationship between two variables is linear; if not, use Spearman.
2. Employ the variance inflation factor (VIF) to identify correlated variables in a regression model.
3. Apply wrapper methods such as backward, forward, or stepwise selection to build a model whose feature set has a low presence of multicollinearity.
4. Use regularized regression such as elastic net, lasso, or ridge. You can use the constructed model as the final model for prediction, or extract features with nonzero coefficients as decorrelated features for a final model such as a GBM or neural network.
5. Use principal component analysis to compress a feature set into a smaller set of decorrelated features.
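One of the techniques above, VIF, can be computed by hand: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing feature j on the remaining features. A sketch on synthetic data, assuming scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = x1 * 0.95 + rng.normal(0, 0.3, size=1000)  # strongly correlated with x1
x3 = rng.normal(size=1000)                      # independent
X = np.column_stack([x1, x2, x3])

vifs = []
for j in range(X.shape[1]):
    # Regress feature j on all other features, then convert R^2 to a VIF.
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vifs.append(1.0 / (1.0 - r2))
print([round(v, 2) for v in vifs])  # a VIF above roughly 5-10 signals a problem
```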
7
Q

Suppose that a feature set contains 800 variables in a supervised model. How would you handle multicollinearity?

A

Handling multicollinearity is a combination of art and best practice. There are many approaches; here is one that could work. Assume that the 800 variables break down into 300 categorical variables and 500 numerical variables. To make de-correlation easy, transform the 300 categorical variables into numerical variables with numerical encodings, such as weight-of-evidence, mutual information, or class probability. Prior to computing correlations on pairs of variables, scale the features to temper outliers and standardize the numerical range.
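One hedged sketch of the correlation step after encoding: compute pairwise Spearman correlations and greedily drop one member of each highly correlated pair. The 0.9 threshold and the toy columns are assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"a": rng.normal(size=300)})
df["b"] = df["a"] * 2 + rng.normal(0, 0.1, size=300)  # near-duplicate of "a"
df["c"] = rng.normal(size=300)                        # independent

corr = df.corr(method="spearman").abs()
to_drop = set()
cols = list(df.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        # Keep the first member of a correlated pair, drop the second.
        if corr.iloc[i, j] > 0.9 and cols[i] not in to_drop and cols[j] not in to_drop:
            to_drop.add(cols[j])
kept = [c for c in cols if c not in to_drop]
print("dropped:", sorted(to_drop), "kept:", kept)
```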

8
Q

Suppose you are building a credit fraud model with the minority class being less than 1%. How would you build a classification model that can handle an extremely imbalanced dataset?

A

Always relate back to the problem, which is credit fraud. Simply listing techniques for handling imbalanced datasets is not enough.

This solution covers popular techniques and lightly touches on the theory behind how each one works. There are depths of statistical underpinning on why the techniques work and when they fail, but this guide should show how to respond to the interviewer’s question.

When the class distribution is extremely imbalanced, you should never simply apply a binary classification model as-is. Although doing so can provide a baseline performance to beat, you need to apply best practices to tackle this problem.

Common techniques include:
1. Choosing the right criterion to measure model performance
2. Re-sampling data to balance the class
3. Applying cost-sensitive learning

Before exploring each technique, let’s add substance to the context that the interviewer posed. This further demonstrates that you have a framework for approaching the problem. You can say something along the lines of:

“I’m going to assume that I have access to historical data with records from 2015 through 2017. Let’s also assume that there are 2.4 million credit card transactions, and about 0.5% are known fraudulent transactions. That’s merely 12,000 known bads compared to millions of known goods in the dataset.”

Best Practice #1 - Choose the right metric for evaluating model performance.

Do not use accuracy, which is:
(True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)

The class distribution is heavily skewed toward the good population (negatives). Suppose your model yields the results below; based on accuracy, the model performance is:

True Positive: 1,200
True Negative: 2,280,000
False Positive: 0
False Negative: 10,800

(1,200 + 2,280,000) / (1,200 + 10,800 + 2,280,000) = 99.5%

For a model that correctly predicts only 1,200 of the 12,000 actual bads, or 10%, an accuracy of 99.5% makes it appear that the model is doing well. But, in reality, it is not.

Do not use ROC-Curve.

When the class distribution is extremely skewed toward goods over bads, ROC-Curve becomes ineffective in evaluating the performance of a classification model.

Consider that the ROC curve is formed by plotting true-positive rates (TPR) against false-positive rates (FPR) across the threshold range from 0 to 1, inclusive.

TPR = True Positive / (True Positive + False Negative)
FPR = False Positive / (True Negative + False Positive)

Consider a simple example consisting of 10,000 observations: 9,990 goods and 10 bads. The deciles represent the probability thresholds applied to the testing data. The size represents the total number of observations with probability scores at or above the threshold. At each threshold, the corresponding true positives (TP), false negatives (FN), true negatives (TN), false positives (FP), TPR, and FPR are computed.

When you observe the numerical summary, the poor prediction of true positives is overshadowed by the extremely disproportionate number of true negatives. Consider that the majority of the true negatives mass at a lower range of probability scores than the true positives do.

When you examine a decile threshold of, let’s say, 0.6, TPR is 0.8 and FPR is 0.3003. At this threshold, the model seems to do quite well given that 80% of the bads will be predicted accurately and only about 30% of the negatives will be misclassified as fraud. However, this overlooks the volume of total negatives misclassified in relation to the true positives. In other words, this metric is missing precision.

Use PR-Curve:

The best practice is to evaluate the model using PR-Curve, which uses precision and recall, evaluated across model score range from 0 to 1, inclusive. Precision and recall both target true-positive as a measure for model performance.

Recall = True Positive / (True Positive + False Negatives)
Precision = True Positive / (True Positive + False Positive)

Note that recall is synonymous with TPR. Precision, on the other hand, includes false positives, which is the key to evaluating a model on an extremely imbalanced problem. Reviewing threshold 0.6 in the numerical summary: with roughly 8 true positives against about 3,000 false positives, precision is merely 0.27%. This is a far different picture of the model’s ability to predict positives, and the PR curve likewise looks very different from the ROC curve.
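The gap between the two metrics can be demonstrated on synthetic imbalanced data; this sketch assumes scikit-learn, and the score distributions are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 9990, 10
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
# Positives score somewhat higher on average, but overlap with negatives.
scores = np.concatenate([rng.normal(0.3, 0.15, n_neg),
                         rng.normal(0.6, 0.15, n_pos)])

roc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)  # area under the PR curve
print(f"ROC AUC={roc:.3f}  average precision={ap:.3f}")
```

The ROC AUC looks strong while the average precision stays low, because the PR curve is punished by the flood of false positives that the ROC curve barely registers.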

Best practice #2 - Conduct resampling of minority and majority class

There are three widely known variants of resampling techniques: downsampling, oversampling, and SMOTE. We will cover downsampling and oversampling in this solution. For SMOTE, there is plenty of academic literature that covers how it works.

The intuition behind downsampling and oversampling is simple.

When the majority class is dispersed across the regions occupied by the minority class, any classification model has difficulty drawing a decision boundary that separates the two classes.

When there is too much noise from the majority class around the minority class, the boundary between the two sets of data points becomes fuzzy. Downsampling can reduce the noise and thereby help a classification model form a better boundary.

With a similar objective in mind, oversampling the minority class can also be performed. Essentially, this is bootstrapping the minority class to improve the balance between the ratio of goods and bads.
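A minimal sketch of both resampling schemes with plain NumPy; the 990:10 class ratio and the 5:1 target ratio are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
y = np.array([0] * 990 + [1] * 10)  # 990 goods, 10 bads
neg_idx = np.where(y == 0)[0]
pos_idx = np.where(y == 1)[0]

# Downsample the majority class to a 5:1 good-to-bad ratio.
down_neg = rng.choice(neg_idx, size=5 * len(pos_idx), replace=False)
downsampled = np.concatenate([down_neg, pos_idx])

# Oversample (bootstrap) the minority class up to the same 5:1 ratio.
up_pos = rng.choice(pos_idx, size=len(neg_idx) // 5, replace=True)
oversampled = np.concatenate([neg_idx, up_pos])

print(len(downsampled), len(oversampled))  # 60 and 1188
```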

Best Practice #3 - Try using cost-sensitive learning.

In credit fraud modelling, there are two types of decisions and four types of outcomes. In terms of decisions, the model labels them as fraud or non-fraud. But, based on the actual label, there are four outcomes each associated with its own cost:

1 (Pred) 1 (Actual) - Cost(1,1)
1 (Pred) 0 (Actual) - Cost(1,0)
0 (Pred) 1 (Actual) - Cost(0,1)
0 (Pred) 0 (Actual) - Cost(0,0)

There is a multitude of cost-sensitive learning methods. One main type you should be aware of is cost-sensitive learning with respect to threshold determination.

When class is highly imbalanced, you never want to choose 0.5 as the threshold for predicting class as fraud. Some measure incorporating cost is required.

Let’s assume, for the sake of simplicity, that Cost(1,1) and Cost(0,0) are 0: if you predict fraud or non-fraud accurately, you incur no cost. However, there are costs associated with misclassifying.

Cost(1,0) is the cost of a false positive (FP), while Cost(0,1) is the cost of a false negative (FN). You can determine your threshold, P*, as follows:

P* = Cost(1,0) / (Cost(1,0) + Cost(0,1))
Predict class as fraud if P(Fraud|X) >= P*.
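Plugging hypothetical costs into the threshold rule; the cost values below are illustrative, not from any real fraud program:

```python
# Assumes Cost(1,1) = Cost(0,0) = 0, so only misclassification costs matter.
cost_fp = 10.0   # Cost(1,0): cost of flagging a good transaction as fraud
cost_fn = 490.0  # Cost(0,1): cost of missing a fraudulent transaction

p_star = cost_fp / (cost_fp + cost_fn)  # cost-derived decision threshold
print(f"P* = {p_star:.2f}")             # far below the naive 0.5

def predict(p_fraud, threshold=p_star):
    # Flag as fraud whenever the model's P(fraud | x) reaches the threshold.
    return int(p_fraud >= threshold)
```

Because missing a fraud costs far more than a false alarm here, the optimal threshold lands near 0.02 rather than 0.5.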

The derivation of the formula is out of scope for this solution; there is plenty of literature on choosing an optimal model threshold based on costs. Given that not all fraud problems are the same and the cost of each determination differs from company to company, the cost-sensitive threshold will vary as well.

9
Q

What is the curse of dimensionality? How do you prevent it?

A

WHAT IS CURSE OF DIMENSIONALITY:
The curse of dimensionality refers to a set of challenges and problems that arise when working with high-dimensional data (datasets with many features). The core issue is sparsity:

As the number of dimensions increases, the volume of the data space increases exponentially. This means data points become increasingly scattered and far apart, making it difficult to find patterns.

As the data points become less clustered and more sparse, the decision boundary begins to overfit, which decreases generalization. Additionally, the curse of dimensionality unnecessarily increases model training and inference time.

HOW DO YOU MITIGATE CURSE OF DIMENSIONALITY:
The following methods help mitigate the curse of dimensionality:
Dimension reduction: Techniques like PCA to project data into lower dimensions.
Feature Selection: Identifying the most important features and dropping the rest.
Regularization parameters: Every common ML algorithm contains parameters that mitigate overfitting:
Decision tree - pruning
Random forest - bootstrap, number of trees, column and row sampling
XGBoost - bootstrap, column and row sampling, L1/L2 regularization term
Neural Network - Dropout, L1/L2 regularization term
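The sparsity effect can be observed directly: as dimensionality grows, distances between random points concentrate, so the relative gap between the nearest and farthest neighbor shrinks. A sketch with NumPy (the sample sizes and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))
    # Distances from the first point to all the others.
    dists = np.linalg.norm(X[0] - X[1:], axis=1)
    # Relative contrast: (max - min) / min; it decays as d grows.
    ratios[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast={ratios[d]:.2f}")
```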

10
Q

What is AUC? How is it helpful when labels are imbalanced?

A

Suppose there are a total of 100 observations: 99 goods and 1 bad. Your classification model predicts all 100 observations as good, resulting in 99% accuracy. Treating good as the positive class, the model is great at classifying goods with a 100% true positive rate, but horrendous at classifying the bad: the single bad is misclassified, a 100% false positive rate.

AUC is the metric to apply in such a case when the labels are imbalanced. AUC stands for “Area Under the Curve” and it is typically used in the context of the ROC curve, or Receiver Operating Characteristic curve, in statistics and machine learning. Here’s a breakdown of each term:

AUC (Area Under the Curve):
1. Overview: AUC refers to the area under a curve in a graph. In the context of classification problems in machine learning and statistics, it usually refers to the area under the ROC curve, which is a graphical representation of a model’s diagnostic ability.
2. Significance: AUC provides a single scalar value that represents the likelihood that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It’s often used to evaluate the performance of binary classification algorithms, though it can be extended to multi-class classification.
3. Range: AUC ranges from 0 to 1, with a value of 0.5 representing a model that performs no better than random and a value of 1 representing a perfect model. Generally, an AUC above 0.7 is considered acceptable, but this threshold may vary depending on the application.

ROC Curve (Receiver Operating Characteristic Curve)
1. Plot: The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1-specificity).
2. Interpretation: Each point on the ROC curve represents a different threshold used to convert the model’s real-valued predictions into binary classifications. The curve illustrates the trade-off between sensitivity and specificity at various thresholds.
3. Usage: It’s a commonly used tool for evaluating the performance of classification algorithms, especially in the field of medical decision-making, and for comparing different classifiers.
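The ranking interpretation in point 2 can be checked directly on a toy example (scikit-learn assumed; the labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.35, 0.8, 0.7, 0.9])

auc = roc_auc_score(y_true, scores)

# Equivalent rank-based check: the fraction of (positive, negative) pairs
# in which the positive outscores the negative (ties count half).
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(auc, sum(pairs) / len(pairs))  # both 0.875: 7 of 8 pairs ranked correctly
```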

11
Q

Consider a model that is designed to predict whether a user will make a purchase after visiting a product sales page. From a total of 10,000 users who visited the page, 1,500 made a purchase. Among the users the model predicted as potential buyers, 1,000 were accurately identified as buyers but 700 were falsely predicted as buyers. Given this information, how can we assess the performance of this model?

A

For an imbalanced dataset like this, where the number of conversions is much smaller than non-conversions, relying solely on accuracy can overstate the model’s effectiveness. Precision, recall, and the F1/F2 scores offer a more balanced and insightful evaluation of the model’s performance, highlighting areas for improvement in predicting user conversions on the sales page.

First, let’s clarify and organize the provided data:
- Total users who actually made a purchase (Converted): 1500
- Users correctly predicted to have made a purchase (True positives, TP): 1,000
- Users incorrectly predicted to have made a purchase (False Positives, FP): 700
- Total users who visited the page: 10,000

Now, we calculate the missing components for our evaluation:
- False Negatives (FN), those who made a purchase but were not predicted as buyers: the total number of actual conversions minus the True Positives, resulting in 500.
- True Negatives (TN), those who were not predicted to buy and did not buy: the total number of users minus the True Positives, False Positives, and False Negatives, which gives us 7,800.

Given these calculations, we can now discuss the model’s performance using various metrics:
1. Accuracy: Calculated as (True Positives + True Negatives) / Total Users, which gives 88% (8,800 / 10,000). However, accuracy can be misleading in cases of imbalanced classes, such as when the number of conversions is significantly less than non-conversions.
2. Precision and Recall: These metrics offer a more nuanced view of the model’s performance.
- Precision (the proportion of predicted conversions that were correct) is calculated as TP / (TP + FP), yielding approximately 59%.
- Recall (the proportion of actual conversions that were correctly identified) is TP / (TP + FN), yielding approximately 67%.
3. F1/F2 Scores: These scores balance precision and recall, with the F1 score providing a harmonic mean and the F2 score weighting recall higher than precision. These are more suitable metrics when dealing with imbalanced datasets.
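The arithmetic above, worked in code; the counts come directly from the question, with the F1 score included:

```python
# Counts from the question: 1,000 TP, 700 FP, 1,500 actual buyers, 10,000 users.
tp, fp, actual_pos, total = 1000, 700, 1500, 10000
fn = actual_pos - tp       # 500 buyers the model missed
tn = total - tp - fp - fn  # 7,800 correctly ignored non-buyers

accuracy = (tp + tn) / total  # 0.88
precision = tp / (tp + fp)    # ~0.588
recall = tp / (tp + fn)       # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(accuracy, 2), round(precision, 3), round(recall, 3), round(f1, 3))
```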

12
Q

What is precision and how is it calculated?

A

Precision is the proportion of predicted conversions that were correct and is calculated as TP / (TP + FP)

13
Q

What is recall and how is it calculated?

A

Recall is the proportion of actual conversions that were correctly identified and is TP / (TP + FN). Also known as True Positive Rate or Sensitivity.

14
Q

How do you evaluate whether a model is underfitting or overfitting?

A

A way to determine whether a model is underfitting or overfitting is to look at the model’s error curves on the train and validation datasets. Consider the number of trees (iterations) sequentially constructed in an XGBoost model. While the errors on both train and validation data are on a downward trajectory as iterations increase, the model is still underfitting and there is room to reach the optimal fit.

But as the iterations increase beyond the optimal fit, the validation error rises while the train error keeps decreasing. This is a sign that the model is overfitting on the training data, and that the iterations need to be cut back, or some other regularization method (e.g. L1/L2 terms, column and row sampling) should be considered to reduce overfitting.
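A sketch of reading these curves from staged predictions; scikit-learn’s GradientBoostingRegressor stands in for XGBoost here, and the dataset is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.5, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1,
                                  random_state=0)
model.fit(X_tr, y_tr)
# Validation error after each boosting iteration; it falls while the model
# underfits, then creeps back up once the model starts overfitting.
val_errors = [mean_squared_error(y_val, p) for p in model.staged_predict(X_val)]
best_iter = int(np.argmin(val_errors)) + 1
print(f"validation error bottoms out at iteration {best_iter} of 300")
```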

15
Q

How do you conduct feature selection when building models?

A

To conduct feature selection for machine learning, here are a couple of key points to consider:

Types of Feature Selection

  1. Filter Methods - Assess individual features based on their statistical relationship to the target variable, independent of the chosen model.

Methods:
- Correlation analysis: Calculate correlation coefficients (Pearson, Spearman, etc) to identify strong linear and monotonic relationships.
- Information Gain, Mutual Information: Measure how much information a feature provides about the target.
- Chi-Square Test: Evaluate the independence between a feature and the target variable (for categorical data).
- Variance Threshold: Remove features with low variance (i.e., that barely change).
- P-values: Fit a linear regression and select variables based on their p-values. For features with small p-values (generally <= 0.05), you can reject the null hypothesis that the coefficient is zero and mark the feature as important.

Pros: Fast, model agnostic, good for initial screening.
Cons: Might miss features that only become important when combined with others.

  2. Wrapper Methods - Use a machine learning model itself to evaluate the importance of features by iteratively adding and removing them.

Methods:
Recursive Feature Elimination (RFE): Fit a model, rank feature importance, and recursively eliminate the least important features until the optimal subset remains.

Forward Selection: Start with no features, then iteratively add the most useful one at a time until performance stops improving.

Backward Elimination: Start with all features, then iteratively remove the least useful one at a time until performance stops improving.

Note:

Backward Elimination removes one feature at a time and re-evaluates model performance (e.g. accuracy) after each removal to decide what to drop next. RFE uses the model’s internal feature importance scores (e.g. coefficients in logistic regression, feature importances in a random forest) to rank and eliminate features, rather than re-evaluating performance each time.

How many features they drop at once:

Backward Elimination always removes one feature per step. RFE can be configured to eliminate multiple features per iteration, making it faster.

Pros: Take feature interactions into account, can lead to better performing models.
Cons: Computationally expensive, risk of overfitting to the specific model.

  3. Embedded Methods - Feature selection is built directly into the learning algorithm.

Methods:
- L1 Regularization (Lasso): Adds a penalty term to the model that forces coefficients of less important features towards zero.
- Decision Trees and Random Forests: These algorithms inherently provide feature importance scores (e.g. impurity-based importances; SHAP values can also be computed from the trees).

Pros: Efficient, integrated into the modelling process. In addition, the regularization approach can help reduce multicollinearity in the features.
Cons: Specific to the type of model used.

  4. Dimensionality Reduction - Reduce the dimensions of the feature space using dimensionality reduction methods like PCA, ICA, and autoencoders.

Methods:
- PCA/ICA: These approaches use matrix factorizations to reduce the overall dimensions of the feature data.
- Autoencoder - This is a neural network based model that trains on the input set of features, and attempts to recreate the input. The hidden layer embodies the dense, lower dimensional representation of the input data.

Pros: PCA/ICA are easy to implement in capturing the lower dimension representation of the initial data. This can prevent overfitting.
Cons: Loses the interpretability of the feature data as the model trains on the output of the dimensionality reduction models.

Choosing the right techniques

Consider these factors when deciding on feature selection methods:
1. Dataset Size: Filter methods are generally fast, suitable for large datasets. Wrapper methods might be too computationally expensive for very high-dimensional problems.
2. Model type: Embedded methods are often convenient if your primary model choice is already something like a decision tree or a penalized regression model.
3. Goal: If your goal is primarily to understand the most important features, filter or embedded models are good starting points. If maximizing predictive performance is the priority, wrapper methods might yield better results.

Additional Tips:
1. Domain Knowledge: Incorporate your understanding of the problem to guide feature selection.
2. Remove highly correlated features: Identify and potentially remove features that are essentially duplicates of each other.
3. Iteration: Feature selection is often an iterative process; try different methods and evaluate their impact.
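As one concrete example of the wrapper methods above, RFE from scikit-learn on synthetic data; the estimator choice and dataset sizes are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# With shuffle=False, the 3 informative features are the first 3 columns.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Recursively drop the lowest-ranked feature until 3 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X, y)
print("selected feature mask:", rfe.support_)
```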

16
Q

What are the differences between Euclidean and Manhattan distances?

A

Euclidean and Manhattan distances are two common ways to measure the distance between data points. These measures are often used in algorithms that rely on distance calculations, such as k-nearest neighbors (k-NN) and clustering algorithms. Additionally, these measures are used in recommender systems (e.g. collaborative filtering).

Euclidean Distance

Euclidean distance is the straight-line distance between two points in Euclidean space. It is the most common notion of distance in geometry and generalizes to any number of dimensions. It is the square root of the sum of squared differences between corresponding components of the two vectors:

Euclidean(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2)

The problem with Euclidean distance is that it is sensitive to outliers: a single dimension with a very large difference can inflate the overall distance value.

Manhattan Distance

Manhattan distance is another distance measure that sums the absolute values of the differences, rather than the squared differences seen in the Euclidean formula. As a result, it is less sensitive to outliers.

Manhattan(x,y) = |x1 - y1| + |x2 - y2|
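Both formulas on a concrete pair of points:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(9 + 16) = 5.0
manhattan = np.sum(np.abs(x - y))          # 3 + 4 = 7.0
print(euclidean, manhattan)
```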

17
Q

How do you handle missing values in data when building models?

A

Handling missing values is a crucial part of data preparation for machine learning. Here’s a breakdown of the most common techniques, along with considerations on when to use them:

  1. Deletion

Methods:
Row Deletion - Remove entire rows (samples) if they contain missing values.
Column Deletion - Remove the column if it contains a specified proportion of missing values.

When to use:
When a substantial percentage of your data is missing.
When missingness is completely random (not correlated with other features or the target variable).

Risks: Significant loss of information, potential introduction of bias if the missingness isn’t random.

  2. Imputation

Methods:
Mean/Median Imputation: Replace missing values with the mean (for continuous variables) or median (for continuous or ordinal variables) of the feature.
Mode Imputation: Replace missing values with the most frequent value (for categorical features)
Predictive Modelling: Create a model (e.g. KNN) to predict the missing values based on other features.

When to use:
To preserve the sample size
When you have some understanding of the underlying distribution of the features or the relationships between them.

Risks:
Imputed values are estimations, potentially reducing the variance in your data.
Simple imputation (mean/median) can distort the data’s distribution.

  3. Treat Missing Values as a Unique Category

Methods:
Create a new category “missing” for categorical features.
For numerical features, sometimes a special value (e.g. -999) can indicate missingness

When to use:
When missingness itself might hold predictive information.

Risks:
Carefully consider if it makes sense in the context of your problem.

Choosing the Right Strategy
1. Understand the source of the missingness. First, identify whether the missingness comes from a data pipeline issue or from a field that is optional, deprecated, or newly created.
2. Measure the proportion of missing values. If the proportion of missing values is high, this may necessitate removing the column, as it would carry low predictive power for the model. If it’s a moderate amount, consider strategies such as imputation or treating missingness as a unique category.
3. Test different strategies. Ultimately, the best strategy for missingness depends on the model performance. Measure the baseline performance. Iterate through the model building process with different missing value handling approaches.
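A sketch of the three strategies on a toy pandas DataFrame; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, 55_000, np.nan, 61_000],
    "city":   ["NY", None, "SF", "NY"],
})

# 1. Deletion: drop rows with any missing value.
dropped = df.dropna()

# 2. Imputation: median for the numeric column, mode for the categorical one.
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# 3. Missingness as its own category.
flagged = df.copy()
flagged["city"] = flagged["city"].fillna("missing")
print(imputed)
```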

18
Q

How do you evaluate regression model metrics?

A

There are five metrics for regression models that we can evaluate.

  1. Mean Squared Error (MSE) - The average squared difference between the actual values (y_i) and the predicted values (y_hat_i) over all samples. Lower MSE indicates a better fit, but MSE is sensitive to outliers since squaring amplifies large errors.
  2. Root Mean Squared Error (RMSE) - The square root of MSE. Expresses the error in the original units of the target (e.g., dollars for housing prices), making it easier to interpret.
  3. Mean Absolute Error (MAE) - The average absolute difference between actual and predicted values. Less sensitive to outliers than MSE.
  4. R-Squared (Coefficient of Determination) - The proportion of variance in the target variable explained by the model. Ranges from 0 to 1, with a higher value indicating a better fit. However, it’s important to note that R^2 can increase simply by adding more features, even if they are not relevant.
  5. Adjusted R-Squared - A modification of R^2 that penalizes adding more features to the model, helping to avoid overfitting. Generally considered a more reliable measure of fit than the standard R^2.

Choosing the Right Metrics

The choice of metrics depends on several factors:

Problem Context: If the absolute magnitude of errors is crucial (e.g., predicting financial losses), MAE might be better. If understanding the proportion of variance explained is important, R^2 is a good choice.
Outliers: If your data has outliers, MAE is often more robust than MSE or RMSE.
Scale of the Target Variable: Metrics like MSE and RMSE depend on the scale of your target variable, so they are not comparable across targets with different scales. Consider normalizing the target, or using a scale-free metric like R^2, when you need such comparisons.
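The five metrics above can be computed directly; a minimal NumPy sketch on toy predictions (the arrays and feature count are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])
n, p = len(y_true), 1  # p = number of features, assumed 1 here

mse = np.mean((y_true - y_pred) ** 2)           # average squared error
rmse = np.sqrt(mse)                             # back in original units
mae = np.mean(np.abs(y_true - y_pred))          # robust to outliers

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
# Adjusted R^2 penalizes additional features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```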

19
Q

What is the formula for Mean Squared Error?

A

MSE = 1/n * summation(y_i - y_hat_i)^2 where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i

20
Q

What is the formula for Root Mean Squared Error?

A

RMSE = sqrt(1/n * summation(y_i - y_hat_i)^2) where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i

21
Q

What is the formula for Mean Absolute Error?

A

MAE = 1/n * summation(absolute value(y_i - y_hat_i)) where
n = number of samples
y_i = actual value for sample i
y_hat_i = predicted value for sample i

22
Q

What is the formula for R-Squared (also known as Coefficient of Determination)?

A

R^2 = 1 - ( summation(y_i - y_hat_i)^2 / summation(y_i - y_mean)^2 ) where
y_i = actual value for sample i
y_hat_i = predicted value for sample i
y_mean is the average of the actual values

23
Q

How do you conduct hyperparameter tuning?

A

Hyperparameter tuning is the process of finding the optimal settings for a machine learning model’s hyperparameters. Here are three common techniques for hyperparameter tuning:

  1. Grid Search:
Concept: Grid search evaluates a model's performance on a predefined grid of hyperparameter values. This grid is created by specifying a range and number of steps for each hyperparameter.
    Exhaustive Search: It tries every single combination of hyperparameter values within the defined grid. This can be computationally expensive, especially for models with many hyperparameters.
    Finding the Best: The combination that yields the best performance metric (e.g., accuracy, F1-score) on a validation set is considered the optimal hyperparameter configuration.
  2. Random Search:
    Concept: Similar to grid search, random search also evaluates a model on different hyperparameter combinations. However, instead of an exhaustive grid, it randomly samples values from predefined ranges for each hyperparameter.
    More Efficient: Random search is often more computationally efficient than grid search, especially for large hyperparameter spaces. It avoids evaluating unnecessary combinations that might occur in a grid search with a dense grid.
    Stochastic Approach: Although random, it ensures each hyperparameter value has a chance of being selected, preventing biases towards specific regions of the search space.
  3. Bayesian Optimization
    Concept: Bayesian optimization is a more sophisticated approach that uses a probabilistic model to guide the search for optimal hyperparameters. It iteratively selects the most promising hyperparameter combinations to evaluate based on past evaluations and a statistical model of the objective function (e.g., loss function).
    Intelligent Search: It prioritizes regions of the search space that are more likely to contain good hyperparameter combinations based on past evaluations. This makes it efficient, especially when dealing with expensive-to-evaluate models.
    Requires More Setup: Compared to grid and random search, it requires more initial setup to define the statistical model and the acquisition functions used to select the next hyperparameter configuration.

Choosing the right technique:
Grid Search: A good choice for low-dimensional hyperparameter spaces (few hyperparameters) or when interpretability of the search process is important. It guarantees exploration of the entire defined grid.
Random Search: A strong alternative to grid search, especially for high-dimensional spaces. It’s generally faster and less prone to getting stuck in bad regions of the search space.
Bayesian Optimization: Ideal for expensive-to-evaluate models or complex hyperparameter spaces. Its efficiency comes from focusing on promising areas and avoiding redundant evaluations. However, it requires more expertise to set up and interpret the results.
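A minimal scikit-learn sketch of the first two techniques; the dataset, estimator, and grid values are illustrative assumptions, not recommendations:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for real training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Grid search: exhaustively tries every combination in the grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=3, scoring="accuracy",
)
grid.fit(X, y)

# Random search: samples a fixed budget of combinations from distributions
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(10, 200),
                         "max_depth": randint(2, 10)},
    n_iter=5, cv=3, scoring="accuracy", random_state=0,
)
rand.fit(X, y)
```

Bayesian optimization needs a separate library (e.g., Optuna or scikit-optimize), which is why it takes more setup.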

24
Q

How do L1 and L2 regularization terms work?

A

L1 and L2 regularization are techniques used in machine learning to prevent overfitting. An overfit model learns the training data too well, adapting to noise and peculiarities, which leads to poor performance on new, unseen data.

Here’s how they work:

L1 Regularization (Lasso)
Concept: L1 regularization adds a penalty term to the model’s cost function that is proportional to the ABSOLUTE VALUE of the model’s coefficients (weights).
Effect: It encourages coefficients to become exactly zero, effectively performing feature selection. Only the most important features retain large, non-zero coefficients. This can help create simpler and more interpretable models.
Sparsity: L1 regularization introduces sparsity into the model. A sparse model has many coefficients set to zero.

L2 Regularization (Ridge)
Concept: L2 regularization adds a penalty term to the model’s cost function that is proportional to the SQUARE of the coefficients.
Effect: It discourages large coefficients by penalizing them more heavily, but it doesn’t force them to become exactly zero. This leads to smaller, more spread-out weights.
Prevents Overfitting: L2 regularization helps avoid situations where single features get large weights and dominate the model’s predictions.

Visualizing the Difference:
Picture the L1 and L2 regularization constraint regions:

  • L1 regularization’s constraint region is diamond-shaped, so the loss contours tend to touch it at a corner, pushing coefficients exactly to zero.
  • L2 regularization’s constraint region is circular, resulting in smaller coefficients that rarely become exactly zero.

When to Use Which
L1 for Feature Selection: If you want to reduce the number of features in your model and improve interpretability, L1 regularization is often preferable.
L2 for Preventing Overfitting: Generally, L2 is a good default choice for preventing large weights and reducing overfitting, especially when you don’t need explicit feature selection.
Elastic Net: Combines L1 and L2, useful when you want the benefits of both kinds of regularization.
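The sparsity difference can be seen directly with scikit-learn; a sketch on synthetic data where only two of ten features matter (the data and alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Target depends only on the first two features; the other 8 are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# L1 drives irrelevant coefficients exactly to zero (sparsity);
# L2 merely shrinks them toward zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```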

25
Q

How do you make your models more robust to outliers?

A

Data Preparation Techniques:
- Outlier detection and treatment:
  * Use techniques like box plots, z-scores, or isolation forests to identify potential outliers.
  * Consider removing outliers if they are clearly erroneous, or cap extreme values at a reasonable threshold.
- Robust Scaling: Use robust scalers based on statistics like the median and interquartile range (or median absolute deviation, MAD) that are less affected by outliers than standard scaling methods.
- Transformations: Applying transformations like the log or square root to your data can help reduce the impact of outliers.

Robust Modeling Algorithms
- Tree-Based Models: Decision trees, random forests, and gradient boosting machines are generally less sensitive to outliers than linear models.
- Robust Regression: Use robust regression techniques like RANSAC, the Theil-Sen estimator, or Huber regression, which are designed to handle outliers.

Regularization
- L1 Regularization (Lasso): Can shrink the coefficients of less important features toward zero, potentially reducing the impact of outliers.
- L2 Regularization (Ridge): Shrinks all coefficients, which can make the model less sensitive to individual data points (outliers).

Ensemble Methods:
- Bagging: Training multiple models on different subsets of data and averaging the predictions helps reduce the impact of individual outliers.
- Boosting: Sequentially building models, with each new model focusing on the errors of the previous models, can improve robustness.
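One of the simplest treatments mentioned, capping extreme values at a percentile threshold (winsorizing), can be sketched with NumPy; the data and the 1st/99th percentile cutoffs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50.0, scale=5.0, size=1000)
x[:3] = [500.0, -400.0, 999.0]   # inject gross outliers

# Cap values outside the 1st-99th percentile range
lo, hi = np.percentile(x, [1, 99])
x_capped = np.clip(x, lo, hi)    # the 999.0 and -400.0 spikes are capped
```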
26
Q

How do you build a predictive model end-to-end?

A

1. Business Understanding - This initial phase involves defining the business problem and goals. You need to identify what you're hoping to achieve with the model. Some factors to consider include:
  * Business Goals: What specific outcomes are you hoping to achieve?
  * Business Impact: How will the model impact the business financially?
  * Latency Requirements: How quickly do you need the model to generate predictions?
  * Data Requirements: What data is available to train the model, and how much of it is required?
2. Data Preparation - Once you have a clear understanding of the business goals, you can start collecting and preparing the data for modeling. This stage involves several steps, including:
  * Data Collection: Gathering data from relevant sources
  * Data Cleaning: Identifying and fixing errors and inconsistencies in the data
  * Data Parsing: Formatting the data into a usable form for modeling
  * Data Transformation: Transforming the data to create new features that might be useful for modeling
3. Feature Engineering - Involves creating new features from the existing data. The goal is to improve the model's ability to learn patterns from the data. Common feature engineering techniques include:
  * Aggregation: Combining data points into summaries
  * Encoding: Converting categorical data into numerical data
  * Feature Binning: Grouping similar values into bins
4. Feature Selection - Selecting the most relevant features for modeling. Several techniques are considered in feature selection:
  * Model-Based: Feature importance built into the model itself
  * Filtering: Using univariate statistics like Pearson correlation to rank signals in order of importance
  * Wrapper: Forward selection and backward elimination, which test various combinations of signals to identify the combination that produces the best-performing model
  * PCA / ICA: Dimensionality reduction of the features into a lower-dimensional space
  * Feature Clustering: Grouping of features into a single signal
5. Model Training - This stage involves splitting your data into train, validation, and test sets, then training your model on the training set and optimizing parameters on the validation set. There are many different algorithms available; the best choice for your project will depend on your modeling task:
  * Regression: Used for predicting continuous outcomes (e.g., predicting house prices)
  * Classification: Used for predicting categorical outcomes (e.g., predicting whether a customer will churn)
  * Clustering: Used to segment (e.g., creating customer archetypes based on customer profile and card transaction data)
6. Model Evaluation - This stage involves testing your model on the test set with metrics based on your modeling task:
  * Regression: MSE, RMSE, MAPE
  * Classification: Accuracy, AUC, F1-Score, Precision, Recall
  * Clustering: Accuracy (External Validation), Sum of Squares (Internal Validation), Silhouette Coefficient (Internal Validation)
7. Model Deployment - After training and evaluating the model, you deploy it to production. Depending on the prediction task, the inference is batch or real-time.
  * Batch Inference: Predictions are queued and then provided on an hourly, daily, weekly, or monthly cadence. This is often used for providing reports with forecasts.
  * Real-Time Inference: This involves creating a model API service using REST API frameworks like FastAPI to generate real-time predictions from user activity and profile data. This also involves model staging and versioning: tracking DEV, UAT, and PROD versions of the model, then versioning the model as it is updated. Lastly, the model needs to be continually monitored for performance consistency and retrained if performance degrades over time.
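Steps 5 and 6 of the lifecycle can be sketched with scikit-learn; the synthetic dataset stands in for data already prepared in steps 1-4, and the model choice is only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for prepared data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 5: training, with scaling bundled in a Pipeline so it is
# fit on the training split only (no leakage into the test set)
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression())])
model.fit(X_train, y_train)

# Step 6: evaluation on the held-out test set
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```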
27
Q

How can you conduct feature engineering?

A

Feature engineering involves conducting exploratory data analysis to create and identify new variables from raw ones that can produce signals to improve model performance. The methods differ by variable type:

1. Continuous Variables:
  * Discretization: Bins continuous variables into a finite number of intervals. This can be useful for models that don't handle continuous data well or for creating features that represent specific ranges of values.
  * Log Transformation: Transforms skewed (asymmetrical) data toward a more normal distribution. This can improve the performance of some machine learning models.
  * Scaling: Standardizes features to have a mean of zero and a standard deviation of one. This can be important for some machine learning models, especially those that use distance-based metrics.
2. Categorical Variables:
  * One-Hot Encoding: Converts categorical variables into binary vectors, with a new feature created for each category. This is useful for machine learning models that can't handle categorical data directly.
  * Label Encoding: Assigns a numerical value to each category. However, this can mislead the model into interpreting the difference between categories as ordinal, when it may not be.
3. Text Variables:
  * Bag-of-Words: Represents text data as a collection of words, ignoring grammar and word order. Each word is a feature, with its value indicating its frequency in the text.
  * TF-IDF: Similar to bag-of-words, but it weights the importance of words based on how common they are in the dataset overall and how frequently they appear within a specific document. This can help identify words that are distinctive and informative.
  * Text Embedding: Transforms text data into numerical vectors that capture the semantic meaning of the words. This allows machine learning models to perform operations on text data like similarity comparisons.
4. Time Variables:
  * Date/Time Decomposition: Extracts features like year, month, day, hour, minute, and second from a date/time variable. This can be useful for tasks like modeling seasonal patterns.
  * Sin/Cos Transformation: Encodes cyclical patterns. For example, you could apply sine and cosine transforms to the hour of the day to capture daily seasonality.
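Two of the transformations above, the log transform and the sin/cos encoding, as a NumPy sketch (the income and hour values are made up for illustration):

```python
import numpy as np

# Log transform for a right-skewed continuous variable
income = np.array([20_000.0, 35_000.0, 50_000.0, 1_000_000.0])
log_income = np.log1p(income)   # log1p also handles zeros safely

# Sin/cos encoding of hour-of-day, so hour 23 and hour 0 end up
# close together in feature space (unlike the raw values 23 and 0)
hours = np.array([0, 6, 12, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
```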
28
Q

How do you handle categorical variables in modeling?

A

Handling categorical variables properly is vital, as the wrong method can cause the model to overfit and/or underperform. Several factors to consider:
* Type of categorical variable: nominal or ordinal
* Cardinality of the feature: number of unique categories
* Relationship with the target variable
* Model you're using: some models handle categorical variables natively, while others require encoding

Here are some common techniques for handling categorical variables:

1. Encoding Techniques
  * One-Hot Encoding: Creates new binary columns for each unique category in the original feature.
    Pros: Easy to implement, avoids introducing ordinality into categorical data.
    Cons: Can significantly increase dimensionality (especially for high-cardinality features), potentially leading to overfitting.
  * Label Encoding: Assigns a unique integer to each category.
    Pros: Simple, computationally efficient.
    Cons: Introduces a false sense of order/magnitude, potentially misleading models (especially linear and distance-based ones; tree-based models are less affected).
  * Target Encoding (Mean Encoding): Replaces each category with the mean target value (for regression) or the proportion of positive outcomes (for classification) within that category.
    Pros: Keeps dimensionality low even for high-cardinality features, and can capture meaningful relationships between the category and the target.
    Cons: Prone to overfitting and target leakage if the number of samples within each category is small; smoothing or computing the encoding on out-of-fold data helps.
2. Feature Engineering Techniques
  * Frequency Encoding: Replaces categories with their frequency (how often they occur in the data). Often used for handling rare categories or when category frequency may be relevant.
    Pros: Can handle unseen categories, useful for rare categories.
    Cons: Loses information about the original categories.
  * Hashing Trick: Reduces dimensionality by hashing categories into a smaller number of buckets. Useful for very high-cardinality features where one-hot encoding is impractical.
3. Specialized Models
  * Tree-based models (Decision Trees, Random Forests, Gradient Boosting): Some implementations can handle categorical variables without explicit encoding.
  * Embeddings (mainly in Neural Networks): Map categories to dense numerical representations, capturing semantic relationships between categories.
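Target (mean) encoding is simple enough to sketch with the standard library alone; the toy feature and target are made up for illustration, and in practice the mapping should be fit on training folds only to avoid leakage:

```python
from collections import defaultdict

# Toy categorical feature and binary target
cities = ["NY", "SF", "NY", "LA", "SF", "NY"]
y      = [1,    0,    1,    0,    1,    0]

# Target (mean) encoding: replace each category with its mean target
sums, counts = defaultdict(float), defaultdict(int)
for c, t in zip(cities, y):
    sums[c] += t
    counts[c] += 1
encoding = {c: sums[c] / counts[c] for c in sums}

encoded = [encoding[c] for c in cities]  # NY -> 2/3, SF -> 0.5, LA -> 0.0
```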
29
Q

How does model ensembling work?

A

Model ensembling is a powerful technique in machine learning where you combine the predictions from multiple models to improve overall performance. Here's how it works:

Core Concepts:
1. Base Models: Ensembling begins with training several different models, often referred to as "base models" or "weak learners". These can be different types of algorithms (e.g., decision tree, linear regression, neural network) or the same algorithm trained on different subsets of the data or with different hyperparameters.
2. Diversity: The key to successful ensembling is ensuring diversity among your base models. The models should make different kinds of errors so that their strengths and weaknesses complement each other.
3. Combination Techniques: Once you have predictions from your base models, you need a way to combine them into a final prediction. Common techniques:
  * Averaging (Regression): Simply calculate the average of the predictions from each model. This works well for regression problems.
  * Voting (Classification): Each model "votes" for a class label; the final prediction is the class that receives the most votes.
  * Weighted Averaging/Voting: Assign a weight to each model based on its performance on the validation set, and compute a weighted average of predictions or a weighted majority vote.
  * Stacking: Train a meta-model that learns how to best combine the predictions of the base models.

Why Ensembling Works:
* Bias-Variance Tradeoff: Different models have different biases (tendencies to consistently under- or overfit) and variances (sensitivity to changes in training data). By combining multiple models, ensembling can reduce overall variance and potentially reduce bias, leading to more stable and generalizable models.
* Wisdom of the Crowd: The idea that the collective judgement of a group is often better than individual judgement applies in machine learning as well. Combining multiple models can leverage their individual strengths.

Common Ensembling Methods:
* Bagging (e.g., Random Forests): Trains multiple models in parallel on different random subsets of the data, reducing variance.
* Boosting (e.g., Gradient Boosting, AdaBoost): Trains models sequentially, where each new model focuses on correcting the errors of the previous one, reducing bias.
* Stacking: Combines models through a meta-learner that figures out how to best combine their individual predictions.
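The averaging and voting combination techniques can be sketched in NumPy; the probability arrays below are hypothetical outputs of three base models, made up for illustration:

```python
import numpy as np

# Hypothetical class-1 probabilities from three base models
# on four samples
p1 = np.array([0.9, 0.4, 0.2, 0.6])
p2 = np.array([0.8, 0.6, 0.1, 0.4])
p3 = np.array([0.7, 0.3, 0.3, 0.7])

# Soft voting: average the probabilities, then threshold
avg = (p1 + p2 + p3) / 3
soft_pred = (avg >= 0.5).astype(int)

# Hard voting: each model votes with its thresholded label,
# majority (2 of 3) wins
votes = (np.stack([p1, p2, p3]) >= 0.5).astype(int)
hard_pred = (votes.sum(axis=0) >= 2).astype(int)
```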
30
Q

How do you handle text features in prediction tasks?

A

Handling text features in prediction tasks involves transforming raw text into meaningful numerical representations that machine learning models can understand. Here's a breakdown of the key steps:

1. Preprocessing
  a. Tokenization: Break the text into individual units like words, phrases, or characters.
  b. Normalization: Convert text to lowercase, remove punctuation, handle misspellings.
  c. Stop word removal: Filter out common words like "the", "and", "is" that may carry less semantic value.
  d. Stemming/Lemmatization: Reduce words to their root form (e.g., "running" -> "run") to group similar words together.
2. Feature Representation
  * Bag-of-Words (BoW): Create a vocabulary of all unique words; represent each document as a vector where each element is the count of a word within the document.
  * TF-IDF (Term Frequency-Inverse Document Frequency): Extension of BoW that downweights words common across documents while emphasizing distinctive words. TF-IDF is often a better representation for predictive tasks.
  * Word Embeddings: Represent words as dense, low-dimensional vectors that capture semantic and syntactic relationships between words (e.g., Word2Vec, GloVe). This approach allows models to understand context and similarity.
3. Feature Engineering (Optional)
  * n-grams: Consider sequences of words (e.g., bigrams, trigrams) in addition to single words to capture phrases and context.
  * Topic Modeling: Use techniques like Latent Dirichlet Allocation (LDA) to identify underlying topics within a collection of documents.
  * Sentiment Analysis: Extract sentiment scores (positive, negative, neutral) from text to use in predictive models.
4. Modeling
Choose an appropriate model based on your task:
  * Classification: Naive Bayes, Support Vector Machines (SVMs), Random Forests, Gradient Boosting, and Neural Networks (with text embedding layers) are commonly used for text classification tasks.
  * Regression: Models like Linear Regression, Lasso/Ridge Regression, and Neural Networks can handle text features for regression tasks (e.g., predicting sentiment scores).

Example: Sentiment Analysis
Say you want to predict whether a movie review is positive or negative:
1. Preprocessing: Tokenize the reviews, normalize text, remove stop words.
2. Representation: Create a TF-IDF matrix representation of the reviews.
3. Modeling: Train a classification model (e.g., Support Vector Machine, Logistic Regression, or Neural Network) using the TF-IDF representation as input and the sentiment labels (positive/negative) as targets.
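The sentiment analysis example can be sketched end-to-end with scikit-learn; the four training reviews are made up for illustration, far too few for a real model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = positive, 0 = negative
reviews = ["great movie loved it",
           "terrible plot boring acting",
           "loved the acting great fun",
           "boring terrible waste of time"]
labels = [1, 0, 1, 0]

# TF-IDF representation feeding a linear classifier; the vectorizer
# handles tokenization and lowercasing internally
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)

pred = clf.predict(["loved it great"])[0]
```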
31
Q

What is the pseudocode of the Logistic Regression model?

A

Initialize weights w to zeros (or small random values)
Initialize bias b to 0
Set learning rate α and number of epochs

For each epoch:
  For each training example (x, y):
    # Forward pass
    z = w · x + b                  # linear combination
    ŷ = σ(z) = 1 / (1 + e^(-z))    # sigmoid activation
    # Compute gradient (using BCE loss)
    dw = (ŷ - y) · x
    db = (ŷ - y)
    # Update parameters (gradient descent)
    w = w - α · dw
    b = b - α · db
  (Optional) Compute and log loss:
    L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

Prediction:
  If ŷ ≥ 0.5 → class 1
  Else → class 0

Key pieces to remember: The sigmoid function squashes the linear output into a probability between 0 and 1. The loss function is Binary Cross-Entropy (log loss). The gradients turn out to be elegantly simple — (ŷ - y) — which is identical in form to linear regression's gradient, just with ŷ produced by the sigmoid rather than the raw linear output.
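The pseudocode above translated into a runnable NumPy sketch, using batch gradient descent rather than per-example updates, on a made-up linearly separable toy dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Toy labels: class 1 when x0 + x1 > 0
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # weights initialized to zeros
b = 0.0           # bias initialized to 0
alpha = 0.5       # learning rate

for epoch in range(500):
    y_hat = sigmoid(X @ w + b)         # forward pass
    dw = X.T @ (y_hat - y) / len(y)    # BCE gradient w.r.t. w
    db = np.mean(y_hat - y)            # BCE gradient w.r.t. b
    w -= alpha * dw                    # gradient descent updates
    b -= alpha * db

pred = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = np.mean(pred == y)
```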
32
Q

What is the pseudocode of the decision tree algorithm?

A

Decision tree training involves recursively splitting the dataset into smaller subsets based on the (feature, threshold) pair that best separates the different classes. The process starts with all data at the root node. The algorithm examines all the features and possible split points (thresholds), calculating measures like Gini impurity to determine the (feature, threshold) pair that creates the most homogeneous subsets (ideally containing mostly examples of a single class). The data is divided based on the selected (feature, threshold) pair and branches are created. This splitting process is repeated recursively on these subsets until stopping criteria are met (like reaching a maximum tree depth or having too few examples in a node). The final nodes, called leaves, hold the predictions. In regression, predictions are the averaged target values in the leaves; in classification, predictions are based on the proportion of each class found in the leaves.

Decision Tree (CART) Pseudocode:

Function BuildTree(dataset, depth):
  # Base cases (stopping criteria)
  If all samples belong to same class → return Leaf(class)
  If no features remaining → return Leaf(majority class)
  If depth ≥ max_depth → return Leaf(majority class)
  If num_samples < min_samples_split → return Leaf(majority class)

  # Find the best split
  best_feature, best_threshold = None
  best_score = +∞ (or −∞ for info gain)
  For each feature f:
    For each unique threshold t in f's values:
      left = samples where f ≤ t
      right = samples where f > t
      score = impurity(left, right)
      If score is better than best_score:
        best_score = score
        best_feature = f
        best_threshold = t

  # If no valid split improves purity → return leaf
  If no improvement → return Leaf(majority class)

  # Recurse
  left_split = samples where best_feature ≤ best_threshold
  right_split = samples where best_feature > best_threshold
  left_child = BuildTree(left_split, depth + 1)
  right_child = BuildTree(right_split, depth + 1)
  return Node(best_feature, best_threshold, left_child, right_child)

--- Impurity Measures ---
Gini(S) = 1 − Σ pᵢ²
Entropy(S) = − Σ pᵢ · log₂(pᵢ)
Weighted impurity of a split:
Score = (|left|/|total|)·Impurity(left) + (|right|/|total|)·Impurity(right)

--- Prediction ---
Function Predict(x, node):
  If node is Leaf → return node.class
  If x[node.feature] ≤ node.threshold:
    return Predict(x, node.left_child)
  Else:
    return Predict(x, node.right_child)

Key points to remember: The algorithm is a greedy, recursive partitioning strategy — at each node it picks the locally optimal split, with no backtracking. The two most common impurity measures are Gini impurity (the CART/sklearn default) and information gain / entropy (used by ID3/C4.5). Stopping criteria (max depth, min samples, min impurity decrease) act as a form of regularization to prevent overfitting.
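The impurity measures and the weighted split score from the pseudocode, as a small runnable sketch (the label lists are made up for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2) over class proportions p_i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(left, right):
    """Weighted impurity of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A pure split scores 0; a maximally mixed binary split scores 0.5
pure = split_score([0, 0], [1, 1])
mixed = split_score([0, 1], [0, 1])
```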
33
Q

What's the difference between random forest and logistic regression?

A

Both logistic regression and random forest are popular models for performing binary classification, but their methodologies differ greatly.

Random forest leverages a technique called bagging, which averages predictions across several models fitted on bootstrap samples of the data. This technique reduces the variance of the model, thereby increasing model generalization. The equation for bagging is: f_hat = 1/T * summation from t=1 to T of f_t(x) (cf. image). Each f_t(x) is a prediction from one of a forest of T decision trees; averaging across the T trees generates the random forest prediction, f_hat.

Logistic regression, on the other hand, is a member of the generalized linear models (GLM) family. The model uses a sigmoid function, f(x) = 1 / (1 + e^(-x)), to map a linear combination of features onto a probability between 0 and 1. Equivalently, the log-odds of the response are modeled as a linear equation of coefficients and an intercept: ln(p/(1-p)) = beta_0 + beta_1*X_1 + beta_2*X_2 + ... + beta_k*X_k (cf. image)

Random Forest Procedure
1. For b = 1 to B:
  (a) Draw a bootstrap sample Z* of size N from the training data.
  (b) Grow a decision tree T_b on the bootstrap sample, considering a random selection of M variables at each split.
2. For each observation, collect the predicted class across the trained trees and choose the majority.

Logistic Regression Procedure
1. Initialize parameters (coefficients + intercept), e.g., with standard normal values.
2. Apply gradient descent to optimize the parameters; repeat until a stopping rule is met (cf. image with application of gradients).

Model Tuning
When you discuss model tuning, relate how the tuning affects model performance. Even better, relate each tuning parameter to the variance-bias trade-off. Note that a standard way to tune hyperparameters is grid search with cross-validation: you pick the parameters that produce the best cross-validated performance.

Random Forest:
1. Number of Trees - Determines the number of bootstrapped trees to train. As the number of trees increases, variance decreases while bias stays roughly constant; think of how the variance of a sample mean decreases as the sample size increases. Similarly, with more trees, prediction scores converge toward a mean with less variability. More trees benefit generalization, though returns diminish and training cost grows.
2. Tree Depth - Depth determines how closely each tree fits the training data. Too much depth increases variance and decreases bias. Finding an optimal balance between tree depth and number of trees is a must to build a high-performance random forest model.
3. Column and Row Sampling - If two trees are trained on the same set of observations and features, the resulting predictions are the same. This defeats the purpose of a "random" forest, which is designed to decrease variance by averaging diverse tree estimators. Sampling observations and columns decorrelates the trees, which prevents overfitting and increases generalization.
4. Minimum Samples Leaf and Minimum Samples Split - These two parameters control whether a node of a tree should be split. Suppose minimum samples split equals 8: a node containing 9 observations is eligible for splitting. However, if minimum samples leaf is 2 and the candidate split would produce leaf nodes with 1 and 8 observations, the split will not occur. As the thresholds for both decrease, bias decreases while variance increases.

Logistic Regression
Unlike random forest, logistic regression contains a smaller set of parameters to tweak. The key parameters involve regularization, which reduces model overfitting. The two main types of regularization are Ridge and Lasso, which shrink weak coefficients toward 0 (cf. image for the Ridge and Lasso definitions). There are two main differences between Ridge and Lasso shrinkage. Unlike Ridge, Lasso can reduce the coefficients of weak predictors exactly to 0; therefore, Lasso performs feature selection, where strong predictors remain in the model while weak predictors are removed. The other main difference is that Ridge has a closed-form solution, meaning the beta vector can be solved with matrix algebra, while the Lasso betas cannot, so convex optimization (e.g., coordinate or gradient descent) is required to estimate them. Note that both Ridge and Lasso forgo unbiased estimation of the beta coefficients in exchange for reduced variance. It's important to understand this trade-off: accepting some increase in bias to decrease variance can improve generalization to unseen data.

Model Interpretability
Both random forest and logistic regression models provide variable importance. Random forest uses mean Gini decrease; logistic regression uses p-values, with lower values conveying higher predictor importance. Additionally, logistic regression provides interpretability that random forest does not: for each predictor, it provides an odds ratio comparing a focal group to a reference group (e.g., male versus female).
34
Q

What's the difference between bagging vs boosting?

A

A single decision tree is prone to overfitting as the tree depth increases: the flexibility of the decision boundary grows, which in turn increases the variance of the model. To reduce the model variance, we can combine predictions from multiple trees trained on the same data. Two such techniques are bagging and boosting.

Bagging: Bagging is quite simply the averaging of models trained on random bootstrap samples of the same original data. Random forest, for instance, first trains K decision trees independently, then averages the predictions from the trees to produce a final prediction. This technique is helpful in reducing the variance of the model.

Boosting: Boosting involves iteratively training weak learners such as shallow decision trees. The errors of one weak learner determine the sample weights (or residual targets) used to train the next learner in the iteration. The weak learners' predictions are then combined in a weighted sum to produce the final prediction. Because the errors from the previous learner influence the training of the next model, boosting primarily decreases the bias of the model.
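The variance-reduction effect of bagging can be illustrated with a stylized NumPy simulation; it assumes the base learners' errors are independent (real bagged trees are correlated, so the actual reduction is smaller):

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 5.0
n_trials, B = 10_000, 25

# Each "weak learner" prediction = truth + independent noise
single = true_value + rng.normal(0.0, 2.0, size=n_trials)

# A bagged prediction averages B such learners per trial
bagged = (true_value
          + rng.normal(0.0, 2.0, size=(n_trials, B))).mean(axis=1)

# Averaging B independent estimators divides the variance by ~B,
# while the mean (and hence the bias) is unchanged
```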
35
1. How is the random forest trained? 2. How does increasing depth affect the model in terms of variance and bias trade-off? What about increasing the number of trees?
Random forest is an average of decision trees independently trained on bootstrapped datasets. A single decision tree is prone to overfitting. Averaging trees, a technique called bagging in machine learning, is a way to reduce the variability of the model, thereby reducing overfitting. When you increase the depth of the trees in a random forest, the variance of the model increases as the training data is split further, leaving fewer samples to average in the terminal leaves. In exchange, the bias of the model will decrease, per the variance-bias trade-off. When you increase the number of trees in the random forest model, the variance decreases while the bias stays essentially unchanged: averaging more independently trained trees does not change each tree's expected prediction, it only smooths out their variability. As discussed in the intuition behind the random forest, as more trees are averaged, the model's variance decreases.
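A quick sketch (synthetic data) of the number-of-trees effect, using the out-of-bag R² score as a built-in validation signal; adding trees should stabilize or improve the score rather than trade off against bias.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

scores = {}
for n in (25, 200):
    rf = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=0)
    rf.fit(X, y)
    # Out-of-bag score: each tree is evaluated on the samples it never saw.
    scores[n] = rf.oob_score_
```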
36
Explain decision tree, random forest and boosted trees in terms of variance and bias.
Key Points: 1. Discuss how each of the three models fits data 2. Discuss variance and bias 3. Compare and contrast the trees based on the criteria This is not an open-ended question. You are evaluated on the accuracy of your response. If your explanation is unclear or incorrect, the mistake will cost you greatly. Ensure that you understand how each of the three algorithms works, as they are common questions across companies (not just the top ones). Also, do not misinterpret variance and bias. Decision tree, the simplest of the three, uses Gini impurity, a measure of heterogeneity in a sub-space of the data, to choose splits. It optimizes for the split that results in the highest purity, such that ideally one partition contains all the class 0's and the adjacent one contains all the class 1's. The issue with a decision tree is overfitting, meaning that its prediction on unseen data performs poorly. Random forest addresses the issue with bagging: bootstrapping samples of data and averaging the predictions across multiple decision trees. Boosted trees, such as XGBoost, AdaBoost, GBM, CatBoost, and LightGBM, all share the same idea, in that the trees are constructed sequentially such that the observation weights (or residual targets) and the weight of the current model are based on the misclassification of the tree constructed in the previous iteration. Ultimately, unlike random forest, which assumes independence and averages with equal weights across trees, boosting assumes sequential dependence and weighted averaging. Consider that the decision tree has the lowest bias and highest variance, resulting in overfitting on training data. Random forest addresses the high variance issue with bagging of several trees. With a decrease in variance, bias will slightly increase, as there is a trade-off between bias and variance. Boosted trees provide the best performance out-of-the-box as bias is low and variance is moderate (assuming that the hyperparameters are tuned properly).
37
Suppose you apply K-Means to cluster two different types of datasets - one raw and one scaled. How does clustering change after scaling?
Key Points: 1. Briefly outline K-Means 2. Explain scaling techniques and the impacts on clustering 3. Provide reasoning on recommendation. The interview question assumes that you are familiar with the K-Means algorithm, a basic machine learning technique. Oftentimes, the interviewer will ask you a scenario-based question that tests beyond basics. Start with the definition of K-Means to demonstrate that you grasp the basics. K-Means randomly initializes K centroids and assigns each data point to the nearest centroid. The position of each centroid is updated based on the multivariate mean of data points assigned to its cluster. This process is repeated until convergence (e.g., the centroids stop moving, or an internal or external validity score stabilizes). Next, discuss the benefit of scaling on K-Means. Note that, generally, Euclidean distance is used in K-Means to assign each data point to the nearest centroid, and the centroid update is a mean, which is sensitive to outliers. For instance, suppose that feature set A contains 1, 2, 3, 3, 4, and 10. The mean without the outlier 10 is 2.6. With the outlier, the mean skews to about 3.8. The bottom line is that outliers skew the quality of clusters. However, note that not all scalings handle outliers equally, as some still employ averaging; use techniques such as robust scaling, which leverages quantiles. The other benefit to scaling is that when two variables greatly differ in range, the centroids will shift toward the feature with the highest variance. Suppose that X1 contains 5 and 6 as the 10th and 90th percentiles respectively, while X2 contains 5,000 and 16,000. Additionally, the variance of X1 is 0.23 while the variance of X2 is 2,540. The high variance of X2 will dominate the distance computation and distort the centroid positioning. Therefore, scaling is a must.
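A sketch (synthetic data) of the range effect described above: the clusters differ only along x1, but x2 has a far larger variance, so K-Means on raw data splits along x2; after standardization it recovers the true clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clusters separated only along x1; x2 is pure noise on a huge scale.
a = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])
b = np.column_stack([rng.normal(10, 1, 100), rng.normal(0, 1000, 100)])
X = np.vstack([a, b])
true_labels = np.array([0] * 100 + [1] * 100)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

ari_raw = adjusted_rand_score(true_labels, raw)        # near 0: split by x2
ari_scaled = adjusted_rand_score(true_labels, scaled)  # near 1: split by x1
```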
38
What is Principal Component Analysis (PCA)?
The Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features (dimensions) in a dataset while retaining as much of the important information (variance) as possible. It creates new, uncorrelated features called "principal components". These are linear combinations of the original features. The first principal component explains the most variance, the second explains the second-most, and so on. This allows you to discard less important components. Cf. the image. Why do we use PCA? 1. Reduce Overfitting: High-dimensional data can lead to overfitting in machine learning models. PCA lowers dimensionality, helping to manage this. 2. Improved Visualization: It's easier to visualize data in 2D or 3D. PCA helps project high-dimensional data into lower dimensions for visualization. 3. Faster Computation: Machine learning models generally train and run faster with fewer features. PCA Procedure 1. Standardize the data: Subtract the mean from each feature and scale to have unit variance. 2. Calculate the covariance matrix: This matrix shows how the features in your data are related. 3. Eigenvectors and Eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions of maximum variance in the data and eigenvalues represent the amount of variance explained by each eigenvector. 4. Choose components: Sort eigenvectors by decreasing eigenvalues and select the top "k" that explain a desired amount of variance. Typically, choose the k components that capture at least 80 to 90% of the variance. 5. Transform the data: Project the original data onto the selected eigenvectors to get the new lower-dimensional representation.
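The procedure above, sketched with sklearn on synthetic 3-D data that actually lies near a 2-D plane: standardize, fit, inspect explained variance, project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 3-D data generated from 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + rng.normal(scale=0.05, size=(200, 3))

X_std = StandardScaler().fit_transform(X)   # Step 1: standardize
pca = PCA(n_components=2).fit(X_std)        # Steps 2-4: covariance, eigen, select
X_reduced = pca.transform(X_std)            # Step 5: project

explained = pca.explained_variance_ratio_.sum()  # variance retained by 2 PCs
```

Because the data is nearly two-dimensional, two components retain almost all of the variance here.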
39
How does the K-Means algorithm work? What is the pseudocode?
C.f. image. K-Means is an unsupervised clustering algorithm that groups data points into distinct clusters based on their similarity. Here's the core idea:

1. Initialization:
* Choose the number of clusters you want to create (this is the K in K-Means)
* Randomly place K centroids (cluster centers) in the data space
2. Assignment: For each data point in your dataset:
* Calculate the distance between the data point and each of the K centroids
* Assign the data point to the cluster whose centroid is the closest
3. Centroid Update: For each cluster:
* Recalculate the centroid by taking the average (mean) of all the data points assigned to that cluster
4. Repeat:
* Repeat steps 2 and 3 until the centroids stop moving significantly or a set number of iterations is reached

Pseudocode

Function KMeans(dataset X, k, max_iters):
    # Step 1: Initialize k centroids
    centroids = randomly select k points from X (or use K-Means++ initialization)
    For iter = 1 to max_iters:
        # Step 2: Assignment Step
        For each data point xᵢ:
            Compute distance to each centroid cⱼ
            Assign xᵢ to cluster of nearest centroid: clusterᵢ = argmin_j ‖xᵢ − cⱼ‖²
        # Step 3: Update Step
        For each cluster j = 1 to k:
            cⱼ = mean of all points assigned to cluster j
               = (1/|Cⱼ|) · Σ xᵢ for xᵢ ∈ Cⱼ
        # Step 4: Check convergence
        If centroids have not changed (or change < ε): break
    return centroids, cluster_assignments

Important Considerations
* Determining K: Choosing the right number of clusters is a non-trivial part of K-Means. Methods like the Elbow Method or Silhouette Scores can help.
* Initialization Sensitivity: Due to random initialization, K-Means might converge to different clustering solutions on different runs. Running it multiple times with different starting points can help.
* Distance Measures: Euclidean distance is most common, but other distance metrics (e.g., Manhattan distance) can be used depending on your data.
Key points to remember: The algorithm alternates between two steps — assign points to nearest centroid, then recompute centroids as cluster means. It is guaranteed to converge (inertia decreases monotonically), but only to a local minimum, not the global one. That's why initialization matters so much: K-Means++ gives an O(log k) approximation guarantee and is the sklearn default. The objective being minimized is WCSS (Within-Cluster Sum of Squares), also called inertia.
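The pseudocode above can be sketched in NumPy (plain random initialization rather than K-Means++, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means (plain random init; empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence check
            break
        centroids = new_centroids
    return centroids, labels
```

Because of the local-minimum issue, a production implementation would rerun with several initializations (or use K-Means++) and keep the run with the lowest inertia.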
40
How do you find the optimal K in K-Means clustering?
The two common methods in K-Means are the Elbow and Silhouette Methods. Check the images and refer to the text below. 1. The Elbow Method - You plot the within-cluster sum of squares (WCSS) as a function of the number of clusters (K). The goal is to find the point where adding more clusters no longer leads to a significant improvement in data representation. This point is "the elbow". Here's the procedure: 1. Run K-Means for different values of K (e.g., K = 1 to 10) 2. For each K, calculate the within-cluster sum of squares (WCSS) - this measures how compact the clusters are. 3. Plot WCSS vs K. Look for the "elbow" - the point where the rate of decrease in WCSS sharply slows down. As seen in the image, the inflection point is at K=4, which indicates the optimal K under the Elbow Method. 2. The Silhouette Method * Intuition: Measures how well a data point fits within its assigned cluster compared to how well it would fit in neighboring clusters. A high silhouette score indicates good clustering. * Steps: a. Run K-Means for different values of K b. For each K, calculate the average silhouette score across all data points c. Plot average silhouette score vs K. Choose the K with the highest peak.
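Both procedures can be sketched with sklearn (synthetic blobs): record the inertia (WCSS) for the elbow plot and pick the K with the highest average silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data drawn from 4 blobs (illustrative).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

wcss, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                      # elbow: plot WCSS vs. k
    sil[k] = silhouette_score(X, km.labels_)   # silhouette: pick the peak

best_k = max(sil, key=sil.get)
```

Note that WCSS always decreases as K grows, which is exactly why the elbow method looks for a bend rather than a minimum; the silhouette score, by contrast, has a genuine peak.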
41
How do you evaluate regression modelling?
The common metrics in regression modeling are the following: 1. Mean Squared Error (MSE) - The average squared error. It's heavily affected by outliers. Lower MSE is better. In MSE, the units are squared, making interpretation less intuitive. 2. Root Mean Squared Error (RMSE) - The square root of MSE, giving error in the same units as the target variable. Lower RMSE is better. RMSE is still affected by outliers, though less so than MSE. 3. Mean Absolute Error (MAE) - The average absolute error. Less sensitive to outliers than MSE/RMSE. A drawback is that MAE doesn't indicate whether errors are generally underestimates or overestimates. 4. R-Squared (R^2) - The proportion of variance in the target variable explained by the model. Higher R^2 (closer to 1) is better. The drawback is that it can be misleading: R-squared never decreases when you add more features, even if they are irrelevant (adjusted R-squared corrects for this). 5. Mean Absolute Percentage Error (MAPE) - Average absolute percentage error. Useful when comparing models with different target variable scales. Lower MAPE is better. MAPE can be unstable with small target values and is not defined when the actual value is zero.
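A small sketch computing each metric with sklearn on hypothetical predictions (values chosen so the arithmetic is easy to check by hand):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)    # squared units
rmse = np.sqrt(mse)                         # same units as the target
mae = mean_absolute_error(y_true, y_pred)   # less outlier-sensitive
r2 = r2_score(y_true, y_pred)               # proportion of variance explained
mape = mean_absolute_percentage_error(y_true, y_pred)
```

Here the absolute errors are 0.5, 0, 1.5, and 1, so MAE = 0.75 and MSE = (0.25 + 0 + 2.25 + 1) / 4 = 0.875.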
42
What is the pseudocode of KNN?
Here's the pseudocode for the K-Nearest Neighbors (KNN) algorithm, broken down for clarity.

# KNN has NO training step
# It is a lazy learner: just store the dataset

Function KNN_Predict(X_train, y_train, x_query, k):
    # Step 1: Compute distances
    For each training point xᵢ:
        dᵢ = distance(x_query, xᵢ)
    # Step 2: Find k nearest neighbors
    neighbors = select k points with smallest dᵢ
    # Step 3: Aggregate
    # Classification: prediction = mode(labels of neighbors)
    #   (optional) weighted vote: weight each neighbor by 1/dᵢ
    # Regression: prediction = mean(values of neighbors)
    #   (optional) weighted average: prediction = Σ (1/dᵢ)·yᵢ / Σ (1/dᵢ)
    return prediction

Common Distance Metrics
* Euclidean: d = √(Σ (xᵢ − qᵢ)²)
* Manhattan: d = Σ |xᵢ − qᵢ|
* Minkowski: d = (Σ |xᵢ − qᵢ|ᵖ)^(1/p)
* Cosine: d = 1 − (x · q) / (‖x‖·‖q‖)

Key points to remember: KNN is a lazy learner — there is no training phase; all computation happens at prediction time, making training O(1) but inference O(n·d) per query. Feature scaling is critical because distance metrics are sensitive to magnitude differences — always standardize or normalize first. The choice of k controls the bias-variance tradeoff: small k → low bias, high variance (overfits to noise); large k → high bias, low variance (over-smooths boundaries).
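The pseudocode above, sketched in NumPy for the classification case (Euclidean distance, unweighted majority vote):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Minimal KNN classifier (Euclidean distance, unweighted majority vote)."""
    # Step 1: distance from the query to every training point.
    d = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest neighbors.
    nearest = np.argsort(d)[:k]
    # Step 3: majority vote among the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]
```

A production implementation would use a KD-tree or ball tree to avoid the O(n·d) scan per query, as sklearn's KNeighborsClassifier does.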
43
What is the pseudocode of gradient boosted trees (GBTs)?
Regression: High-Level Intuition: Each tree fixes the mistakes of all previous trees. By adding trees gradually with a small learning rate, the model steadily improves without overfitting too quickly. It's essentially gradient descent in function space — where each step is a tree instead of a parameter update.

Pseudocode:
1. Initialize with a simple prediction:
   F₀(x) = arg min_γ Σ L(yᵢ, γ)
   (e.g., mean of targets for regression, log-odds for classification)
2. For each round m = 1 to M:
   a. Compute pseudo-residuals for each sample:
      rᵢ = −∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ)
      In plain terms: "How wrong is the model, and in what direction?" For MSE regression this simplifies to (true − predicted).
   b. Fit a small decision tree hₘ(x) to those residuals. The new tree learns to predict the mistakes. This creates leaf regions where multiple samples land together.
   c. Compute the optimal output value for each leaf:
      γⱼ = arg min_γ Σ L(yᵢ, Fₘ₋₁(xᵢ) + γ)
      "For all samples landing in this leaf, what single value minimizes the loss?" For MSE regression: just the mean of residuals in the leaf. For log-loss classification: γⱼ = Σrᵢ / Σpᵢ(1−pᵢ) — a Newton-Raphson step using both gradient and curvature, which is why XGBoost explicitly uses the Hessian.
   d. Update the model:
      Fₘ(x) = Fₘ₋₁(x) + ν × hₘ(x)
      Add the new tree's predictions, scaled down by learning rate ν.
3. Final model: F_M(x) = sum of all trees

Three knobs interviewers love to ask about:
* Number of trees (M): More trees = more capacity, but risk overfitting.
* Learning rate (ν): Smaller = slower learning, needs more trees, but generalizes better. There's a well-known tradeoff between learning rate and number of trees.
* Tree depth: Shallow trees (depth 3–5) act as weak learners — many weak learners combined > one strong one.

Key distinction: For regression, residuals are simply (true − predicted).
For classification, they become the negative gradient of the loss function (e.g., log-loss), which is why it's called "gradient" boosting — the framework generalizes to any differentiable loss.
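A sketch of the regression case (MSE loss, so the pseudo-residuals are simply y minus the current prediction), using sklearn trees as the weak learners; the function and parameter names here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbt_fit(X, y, n_rounds=50, lr=0.1, max_depth=3):
    """Sketch of MSE gradient boosting; names are illustrative."""
    f0 = y.mean()                        # step 1: initial constant prediction
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred             # step 2a: pseudo-residuals (MSE case)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += lr * tree.predict(X)     # step 2d: shrunken update
        trees.append(tree)
    return f0, trees

def gbt_predict(f0, trees, X, lr=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred
```

For MSE, step 2c comes for free: a squared-error regression tree already outputs the mean residual in each leaf, so no separate leaf-value optimization is needed.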
44
What are the hyperparameters of gradient boosted trees (GBTs)?
Here's a breakdown of the most important hyperparameters commonly tuned in Gradient Boosted Tree (GBT) models:

Core Hyperparameters:
* n_estimators: The number of boosting rounds (the number of trees in the ensemble). Larger values generally lead to better performance but increase overfitting risk.
* learning_rate: Shrinks the contribution of each individual tree. Smaller learning rates slow down learning, requiring more trees, but often improve generalization and prevent overfitting.
* max_depth: The maximum depth of each tree. Limits the complexity of each tree and can help prevent overfitting. Shallower trees are weaker learners.
* subsample: The proportion of samples randomly selected to train each tree. This introduces randomness and helps prevent overfitting (similar to random forest).
* min_samples_split: The minimum number of samples required to split a node in a tree. Larger values prevent overly complex trees.
* min_samples_leaf: The minimum number of samples required in a terminal node or leaf. Helps control tree complexity and avoid overfitting to noisy data.
* colsample_bytree, colsample_bynode, colsample_bylevel: Controls the proportion of features randomly selected for each tree, node, or level (XGBoost naming). Introduces further randomness and helps prevent overfitting.
* gamma: Minimum loss reduction required for a node split (XGBoost). Larger gamma favors more conservative models.
* max_features: Limits the maximum number of features considered at each split, similar to random forests.
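A sketch showing where these knobs appear in sklearn's GradientBoostingClassifier (the values are illustrative, not tuned; gamma and the colsample_* parameters are XGBoost-specific names, with max_features being the closest sklearn analogue for column subsampling):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=200,       # boosting rounds
    learning_rate=0.05,     # shrinkage: smaller -> needs more trees
    max_depth=3,            # shallow trees act as weak learners
    subsample=0.8,          # row subsampling (stochastic gradient boosting)
    min_samples_split=10,
    min_samples_leaf=5,
    max_features="sqrt",    # feature subsampling at each split
    random_state=0,
)
acc = cross_val_score(model, X, y, cv=3).mean()
```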
45
Why do you need activation functions? What are the pros and cons of Sigmoid, Tanh, and ReLU activation functions?
Activation functions are essential components of neural networks as they introduce non-linearity. This allows neural networks to model complex relationships between inputs and outputs, which are crucial for many tasks like image recognition or natural language processing. Common activation functions in neural networks are Sigmoid, Tanh, and ReLU as seen in the image. Here's a breakdown of the pros and cons of these activation functions: Sigmoid Pros: * Smooth output: Sigmoid's output ranges between 0 and 1, making it suitable for representing probabilities. * Easy to understand and implement Cons: * Vanishing gradients: For large negative or positive inputs, the gradient of the sigmoid function approaches 0. This can make it difficult for the network to learn during backpropagation. Tanh Pros: * Zero-centered output: Tanh's output ranges between -1 and 1, which can be helpful for some neural network architectures. * Smooth output: Similar to sigmoid, tanh's output is smooth and continuous. Cons: Vanishing gradients: Similar to sigmoid, tanh also suffers from vanishing gradients for large positive or negative inputs. * Computationally expensive: Compared to other activation functions, tanh is computationally expensive due to the involvement of exponential operations. ReLU (Rectified Linear Unit) Pros: * Fast computation: ReLU is computationally efficient because it only involves a simple threshold operation. * Avoids vanishing gradients: ReLU does not suffer from vanishing gradients for positive inputs. Cons: * Dying ReLU: ReLU neurons can die if they receive a large negative update during backpropagation, causing them to output zero permanently. * Non zero-centered: The output of ReLU is not zero centered. In choosing an activation function, it's important to consider the specific task and the network architecture. Sigmoid is a good choice for output layers requiring probability-like outputs (e.g., logistic regression). 
Tanh can also be useful in hidden layers, but it may not be the best choice for deep networks due to vanishing gradients. ReLU is a popular choice for hidden layers due to its computational efficiency, but can suffer from the "dying ReLU" problem.
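The saturation argument can be made concrete by comparing the derivatives (the definitions below are the standard ones, written out in NumPy):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z): return np.maximum(0.0, z)

# Derivatives, to illustrate saturation / vanishing gradients.
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)
def d_tanh(z): return 1.0 - np.tanh(z) ** 2
def d_relu(z): return np.where(z > 0, 1.0, 0.0)

# At a large pre-activation, sigmoid and tanh gradients are nearly zero
# (saturation), while ReLU's gradient is exactly 1 for any positive input.
grads_at_10 = (d_sigmoid(10.0), d_tanh(10.0), d_relu(10.0))
```

During backpropagation these derivatives are multiplied layer by layer, so near-zero values at saturated units are exactly what starves early layers of gradient signal.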
46
Why should you normalize data when training neural networks?
The benefits of normalization are that it helps neural network models converge more efficiently, prevents exploding/vanishing gradients, provides a mild regularization effect on the weights, and improves the interpretation of feature signals. Here are common normalization techniques. * Min-Max Scaling: Transforms data to typically be between 0 and 1 * Standardization (Z-Score): Subtracts the mean and divides by the standard deviation, resulting in zero mean and unit variance. Let's discuss the benefits of normalization in detail. 1. Faster Convergence and Stability * Gradient Descent Optimization: Neural networks learn by optimizing weights using gradient descent algorithms. When features are on different scales, the loss function has an elongated shape with different curvatures in different directions. This makes it difficult for gradient descent to find the optimal path, slowing down convergence and potentially leading to getting stuck in local minima. * Normalization Helps: Normalizing features to have similar ranges makes the loss function more symmetrical and smoother, leading to faster and more stable convergence in gradient descent. 2. Preventing Exploding/Vanishing Gradients * Deep Architectures: In deep neural networks, gradients can either explode (become very large) or vanish (become very small) as they are propagated back through the layers. This makes learning difficult, especially for early layers. * Input Scaling Matters: If the input features have large variances, the magnitudes of weights can explode to compensate, leading to the exploding gradient problem. Similarly, if the input features are very small, weights shrink to compensate, leading to the vanishing gradient problem. * Normalization Mitigates: Normalization helps keep features within a reasonable range, preventing weights from exploding or shrinking during backpropagation. This allows for better gradient flow and easier training. 3.
Regularization Effect: * Weight Decay: Normalization indirectly creates a form of regularization. During weight decay, large weights are penalized more. Normalization helps control the magnitude of weights, which complements weight decay. 4. Feature Interpretation * Relative Importance: When features are on different scales, it's hard to interpret the relative importance of each feature based on their raw weights. Normalization makes feature weights more comparable for interpretability.
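The two techniques named above can be sketched directly in NumPy (synthetic features on very different scales):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 10_000, 100)])

# Min-Max scaling: maps each feature to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

In practice the scaler must be fit on the training split only and then applied to validation/test data, to avoid leaking test statistics into training.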
47
What are the differences between Gradient Descent, Stochastic Gradient Descent, and ADAM optimizers?
Let's break down the key differences between Gradient Descent (GD), Stochastic Gradient Descent (SGD) and Adam optimizers, focusing on how they update weights and their pros and cons: 1. Gradient Descent (GD) * How it works: - Calculates the gradient of the loss function with respect to all parameters for the entire training dataset. - Updates weights in the opposite direction of the gradient: weights = weights - learning_rate * gradient * Pros: - Theoretically guaranteed to find the global minimum if the loss function is convex * Cons: - Computationally expensive for large datasets as it processes the entire dataset for each update - Can get stuck in local minima 2. Stochastic Gradient Descent (SGD) * How it works - Calculates the gradients for a single example or a small batch of examples (called a mini-batch) - Updates weights more frequently: weights = weights - learning_rate * gradient_of_minibatch * Pros: - Faster iteration due to smaller computations - Noisy updates help escape local minima * Cons: - Noisy updates can lead to oscillations around the optimal point instead of smooth convergence 3. Adam (Adaptive Moment Estimation) * How it works: - Combines ideas of momentum and adaptive learning rates (RMSprop) - Keeps track of exponentially decaying averages of past gradients (m) and squared gradients (v) - Adjusts the learning rate for each parameter based on these averages Here's the formula for Adam: Notation: t: Time step (current iteration) theta: Model parameters (weights and biases) g_t: Gradient of the loss function with respect to the parameters at time step t alpha: learning rate m_t: Exponentially decaying average of past gradients (momentum) v_t: Exponentially decaying average of past squared gradients (adaptive learning rate) beta1, beta2: Hyperparameters controlling the decay rates for the moving averages (usually set to 0.9 and 0.999 respectively) Adam Update Rule: 1. Calculate gradient: Compute the gradient g_t for a mini-batch of data. 2.
Update biased first moment estimate (momentum): m_t = beta1 * m_{t-1} + (1 - beta1) * g_t 3. Update biased second moment estimate (adaptive learning rate): v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 (element-wise squaring) 4. Compute bias-corrected first and second moment estimates: m_hat_t = m_t / (1 - beta1^t), v_hat_t = v_t / (1 - beta2^t) 5. Update parameters: theta_{t+1} = theta_t - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon) (epsilon is a small value added for numerical stability) Pros: - Often converges faster than GD or SGD, especially in early stages of training - Handles sparse gradients effectively - Relatively robust to the choice of learning rate Cons: - Can sometimes overshoot the optimal point, especially in later stages of training. - Requires more memory to keep track of moments. Which one to choose? SGD: Often a good starting point due to its simplicity and computational efficiency. Adam: Often the default go-to in many deep learning scenarios. It generally works very well out of the box. GD: Practical for small datasets or convex optimization problems where converging to the global minimum is crucial Additional Considerations 1. Learning Rate: All optimizers are sensitive to learning rate. Adam can be less sensitive, but tuning is still important. 2. Dataset Size: SGD and its variants shine with very large datasets. 3. Sparsity: Adam handles sparse data effectively (common in text-based problems).
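The five-step update rule translates almost line-for-line into code; here it is on a toy scalar problem (hyperparameter values are the conventional defaults, with a larger illustrative alpha):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the rule above."""
    m = beta1 * m + (1 - beta1) * g          # biased first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2     # biased second moment
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy use: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, alpha=0.05)
```

The bias-correction terms matter most in early steps, when the moving averages m and v are still dominated by their zero initialization.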
48
How do you hyperparameter tune dense neural networks?
Tuning dense neural networks involves the following hyperparameters, and these can be tuned using strategies such as grid search, random search, and Bayesian optimization. 1. Network Architecture: - Number of Layers: How many hidden layers are in the network - Neurons per Layer: How many neurons (processing units) in each hidden layer. 2. Activation Functions: Nonlinear functions (e.g., ReLU, Sigmoid, Tanh) applied to neuron outputs, enabling complex decision boundaries. 3. Optimizer: The algorithm used for updating weights (e.g. Adam, SGD, RMSprop) - Learning Rate: How much to adjust weights with each update step. 4. Regularization: Techniques to reduce overfitting - L1/L2 Regularization: Penalize large weights. - Dropout: Randomly drop neurons during training 5. Batch Size: Number of samples used per training iteration. 6. Number of Epochs: Number of times the network sees the entire training dataset.
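A small grid-search sketch over a few of these knobs, using sklearn's MLPClassifier on synthetic data (the grid values are illustrative, and real tuning would use a far larger budget and random or Bayesian search):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Illustrative grid over a few of the knobs listed above.
grid = {
    "hidden_layer_sizes": [(16,), (32, 16)],   # architecture
    "alpha": [1e-4, 1e-2],                     # L2 regularization strength
    "learning_rate_init": [1e-3, 1e-2],        # optimizer learning rate
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0), grid, cv=3)
search.fit(X, y)
best = search.best_params_
```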
49
How does gradient descent work?
Gradient descent is an algorithm that aims to minimize a cost function. Imagine you're lost on a hilly landscape and want to find the lowest valley. Gradient descent helps you find that valley by iteratively moving in the direction of the steepest descent. Mathematical Formulation 1. Cost Function: Let's denote our cost function as J(theta), where theta represents the parameters (weights) of our model. Our goal is to find the values of theta that minimize J(theta). 2. Gradient: The gradient of the cost function, denoted by ∇J(theta), is a vector that points in the direction of the steepest increase of the function. Importantly, its negative points in the direction of the steepest decrease. 3. Update Rule: The core of gradient descent is the following update: theta = theta - alpha * ∇J(theta) - alpha is the learning rate, a hyperparameter that controls the size of the step we take in each update. - By subtracting the gradient (scaled by the learning rate) from our current parameter values, we take a step in the direction that decreases the cost function the most. - We repeat this process iteratively, with each step bringing us closer to a (local) minimum of the cost function.
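The update rule on a one-dimensional toy cost function (chosen so the gradient is trivial to write down):

```python
# Gradient descent on J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
theta = 0.0    # starting point
alpha = 0.1    # learning rate
for _ in range(100):
    grad = 2.0 * (theta - 3.0)
    theta = theta - alpha * grad   # update rule: theta <- theta - alpha * grad
# theta converges toward the minimizer, theta = 3
```

Too large an alpha makes the iterates overshoot and diverge (here, any alpha above 1 would); too small an alpha makes convergence needlessly slow.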
50
How does backpropagation work?
Backpropagation is an approach to updating the neural network weights and biases with the aim of minimizing a cost function (e.g. mean squared error, cross-entropy loss). Backpropagation calculates the gradient of the cost function with respect to each weight/bias in the network. These gradients tell us how to adjust the weights to reduce the error. Here's the breakdown: 1. Forward Pass: Input data is fed through the network layer by layer, from input to output. Each layer applies its weights and biases, then passes the result through an activation function, producing the network's final prediction. 2. Error Calculation: The cost function (e.g., MSE, cross-entropy loss) measures how far the network's prediction is from the true target, producing a single scalar loss value. 3. Backward Propagation of Error: Starting from the output layer, the derivative of the cost function with respect to the network's output is computed. Then, this gradient is propagated backward through each layer using the chain rule — multiplying the partial derivatives of each layer's output with respect to its inputs (involving the derivatives of the activation functions and the weights at each layer). Through this process, we obtain ∂J/∂w and ∂J/∂b for every weight and bias in the network. Each gradient signals how much a tiny change in that parameter would affect the overall loss. 4. Weight Update: Parameters are updated using gradient descent or its variants: w = w − α · (∂J/∂w), b = b − α · (∂J/∂b), where α is the learning rate. Parameters are adjusted in the direction opposite their gradient to reduce the loss. Mathematical Example (Single Neuron) Let's consider a single neuron with a sigmoid activation function: z = w·x + b (weighted sum of inputs), a = σ(z) (output of the neuron after activation). If our cost function is J, then: ∂J/∂a comes from the specific cost function used; ∂a/∂z = σ(z)·(1 − σ(z)) is the derivative of the sigmoid function; ∂z/∂w = x is the derivative of the weighted sum with respect to the weight. Using the chain rule: ∂J/∂w = (∂J/∂a) · (∂a/∂z) · (∂z/∂w) (c.f. image)
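The single-neuron example, written out as a runnable loop (squared-error loss on one hypothetical training example; initial values and learning rate are illustrative):

```python
import math

# One sigmoid neuron trained on a single example with squared-error loss
# J = (a - y)^2, mirroring the chain-rule decomposition above.
w, b = 0.5, 0.0        # initial parameters
x, y = 1.5, 1.0        # hypothetical input and target
alpha = 0.5            # learning rate

for _ in range(200):
    # Forward pass
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))      # a = sigma(z)
    # Backward pass (chain rule)
    dJ_da = 2.0 * (a - y)               # from the cost function
    da_dz = a * (1.0 - a)               # sigmoid derivative
    dJ_dz = dJ_da * da_dz
    dJ_dw = dJ_dz * x                   # dz/dw = x
    dJ_db = dJ_dz                       # dz/db = 1
    # Gradient descent update
    w -= alpha * dJ_dw
    b -= alpha * dJ_db

a_final = 1.0 / (1.0 + math.exp(-(w * x + b)))
```

Each iteration nudges w and b so that the neuron's output moves toward the target of 1.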
51
Your neural network model is overfitting. What are the signs? How do you prevent it?
Signs of overfitting: To assess whether the neural network is overfitting, you can compare the training and validation errors. When the validation error is much higher than the training error, the neural network is most likely overfitting. You can look at the error plots across training epochs to assess overfitting. The sign to look for is when the validation curve sits higher and diverges away from the training error, as seen in the image. How to prevent overfitting: Here are common techniques used to combat overfitting: 1. Regularization - L1 and L2 Regularization: Penalty terms added to the cost function that discourage overly large weights, favoring simpler models. - Dropout: Randomly drop neurons (and their connections) during training, preventing the network from relying too heavily on specific neurons. 2. Data Augmentation: Artificially expand your dataset by applying random transformations (rotation, flipping, noise, etc.) to existing examples. This helps reduce overfitting to specific variations seen in the training data. 3. Early Stopping: Monitor validation set performance during training. Stop training before the validation error starts to increase, preventing the model from memorizing the training data too closely. 4. Reduce Model Complexity: - Fewer layers - Fewer neurons per layer - This limits the model's capacity to memorize the training data 5. More Data: When possible, the most reliable way to prevent overfitting is to collect more diverse and representative training data.
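The early-stopping idea can be sketched in a few lines of framework-agnostic Python; `val_losses` here is a hypothetical stand-in for the per-epoch validation losses a real training loop would produce:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Sketch of early-stopping logic: stop once validation loss has not
    improved for `patience` consecutive epochs; return the best epoch.
    `val_losses` stands in for a real training loop's per-epoch values."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break   # validation error stopped improving: stop training
    return best_epoch, best_loss

# Hypothetical curve: validation loss falls, then rises as overfitting begins.
losses = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.52, 0.60]
```

In practice one also checkpoints the model weights at the best epoch and restores them after stopping, as Keras's EarlyStopping callback does with restore_best_weights.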
52
What's the difference between encoder-decoder?
Encoder-decoder is a neural network structure often seen in recurrent neural networks and Transformers, as seen below. Encoder-decoders are core components of models used in machine translation, text summarization, and many other natural language processing tasks. Information flows from the encoder to the decoder, where the encoded representation serves as the foundation for generating a sequential output. Encoder - Processes an input sequence (text, audio, etc.) and compresses it into a fixed-length context vector. This vector aims to capture the essence or meaning of the input. - Operation: * Reads the input sequence one element at a time * Maintains an internal hidden state that's updated with each input element * The final hidden state becomes the context vector that summarizes the input Decoder - Decodes the context vector generated by the encoder to produce an output sequence, generating the output one element at a time. - Operation: * Takes the context vector as input * Its internal state is initialized with the information from the context vector * Generates the output sequence element by element, using previous outputs to guide the generation of the next element. Example: Machine Translation - Encoder: Processes a sentence in the source language (e.g. French), producing a context vector - Decoder: Takes the context vector and generates the translated sentence in the target language (e.g. English)
53
What's the difference between RNN and CNN?
Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) are two types of neural network architectures that serve different purposes and are structured differently to address distinct types of problems in machine learning. In general, RNNs are more suited for data where the sequence is vital (e.g. time series forecasting and machine translation), and CNNs are optimal for data where spatial relationships and patterns are important (e.g. computer vision). Let's do a deep dive on each architecture:

Recurrent Neural Networks (RNNs)
1. Purpose and Applications
- RNNs are designed to handle sequential data. They are particularly useful for tasks where the input is inherently sequential, such as natural language processing, time series prediction, and speech recognition.
- They can process inputs of variable length, making them ideal for applications like language translation and generating text.
2. Structure
- The key feature of RNNs is their internal memory, which captures information about what has been processed so far in a sequence. This allows them to exhibit temporal dynamic behavior.
- An RNN has loops within its architecture that allow information to persist. In theory, RNNs can retain information in this loop over long sequences, but in practice, they often struggle due to issues like vanishing or exploding gradients.
3. Challenges
- RNNs are hard to train effectively due to the long-range dependencies in sequences, which often lead to vanishing and exploding gradient problems. Techniques like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells have been developed to mitigate these issues.

Convolutional Neural Networks (CNNs)
1. Purpose and Applications
- CNNs are primarily used for processing data that has a grid-like topology, such as images. They are also used in video analysis, image classification, and areas where recognizing patterns from spatial data is crucial.
- They excel at tasks that require identifying and extracting spatial hierarchies in features, such as recognizing faces, objects, or scenes in images.
2. Structure
- CNNs use convolutional layers that apply convolution operations to the input. These layers use filters (or kernels) to capture spatial hierarchies and features like edges, textures, and shapes in parts of the input image.
- The architecture typically includes pooling layers that reduce the dimensions of the data, simplifying the amount of computation required while still preserving essential features.
3. Advantages
- CNNs are relatively efficient to train, and weight sharing plus pooling make them robust to translation (and, to a lesser extent, small changes in scale) of objects in an image, helping with different viewpoints or variations in appearance.
- They can automatically learn and generalize features from raw data, minimizing the need for manual feature extraction.

Key Differences
1. Data Handling: RNNs are better for sequential data, while CNNs excel with spatial data (like images).
2. Memory and Processing: RNNs can remember previous inputs due to their recurrent structure, which is useful for tasks that depend on historical inputs. CNNs, conversely, are better at perceiving patterns in a static input, where the location of a feature is key to classification.
3. Common Use Cases: RNNs are common in speech recognition, language modeling, and text generation. CNNs are prevalent in image and video recognition tasks.
54
How would you define the prediction point of a machine learning model?
In most machine learning tutorials, you are provided with a dataset with labels. In such cases, machine learning becomes a simple exercise requiring feature engineering, algorithm selection, and hyperparameter tuning. However, real-life projects are not simple. Often, you are not told when a model should predict. Your job as a data scientist or machine learning engineer is to define the prediction point of a model. In other words, at what point should your model predict? Should it produce a prediction at the onset of profile creation, or after behavioral data has been collected about the user? Your choice should depend on your modelling strategy.

Let's consider this scenario. Suppose you interview for a risk data scientist role. The interviewer asks you to define the prediction point of a bad actor on an eCommerce platform. Let's assume that the bad actor is a spammer on Facebook's Marketplace. So, you have the following datasets:

Profile:
1. Profile ID
2. Profile feature X
3. Profile feature Y
4. Profile feature Z

Posts:
1. Post ID
2. Profile ID
3. Post Content X
4. Post Content Y
5. Includes_External_Link_Indicator

A naive response would be that you predict based on posts that are flagged as spam, then extrapolate that the author is a spammer. This is problematic. Suppose for a user X, your model flags spam or not based on the following (cf. 1st image table). This user generated five posts, posts 1 and 3 being spam. The corresponding probability scores that the user is a spammer at those events are 0.8 and 0.6, respectively. Should you flag user X as a spammer? What if the user, as you collect more data in real time, yields the following behaviour (cf. 2nd image table)? Now, it's not too clear whether user X should be flagged as a spammer, right?

To simplify the problem, you need to choose a slice in time at which the user should be scored as a spammer. Suppose you choose the third post as the prediction point for users. That means that your training and test data will only contain the third posts across users. Your feature set could contain the following information:
1. User profile information
2. Third event post information
3. Aggregations of the past two events

Obviously, each feature set will contain a multitude of features. Now, how should you pick your prediction point? The choice should depend on various conditions involving the business objective, behavioral data, and volume of flags. Quite simply, if the problem is, let's say, new account origination (NAO), then predict at the time of sign-up before any actions. Essentially, your model data won't contain any behavioral aggregations; rather, it will strictly be based on profile information. If your user classification is based on transactions, then choose the event prediction point and use aggregations up to the prediction point for your predictions.

This framework can work across various problems, not just in risk and fraud.
Conversion: at the time of sign-up, predict the likelihood that a customer will purchase a good.
Retention: at the time of sign-up, predict whether the user will stay on the platform a year later.
Recommender system: based on the 10th purchase, predict the user's next set of purchases.
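The third-post prediction point can be sketched in a few lines; the field names and aggregations below are hypothetical, chosen only to mirror the example above:

```python
# Hypothetical sketch: fix the third post as the prediction point and build,
# for each user, one training row containing the third post's features plus
# aggregations over the two posts before it. Field names are illustrative.
def build_training_rows(posts_by_user, prediction_event=3):
    rows = []
    for user, posts in posts_by_user.items():
        if len(posts) < prediction_event:
            continue  # user never reached the prediction point
        history = posts[:prediction_event - 1]       # events before the point
        current = posts[prediction_event - 1]        # the prediction-point event
        rows.append({
            "user": user,
            "post_has_link": current["has_link"],
            "prior_posts": len(history),
            "prior_link_rate": sum(p["has_link"] for p in history) / len(history),
        })
    return rows

posts_by_user = {
    "u1": [{"has_link": 1}, {"has_link": 0}, {"has_link": 1}, {"has_link": 1}],
    "u2": [{"has_link": 0}, {"has_link": 0}],  # dropped: only two posts
}
rows = build_training_rows(posts_by_user)
```

Profile features would be joined onto each row by Profile ID in the same pass.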
55
An offline classification model scored a 0.90 AUC. But, the production model generated 0.75 AUC score. What does this mean? How would you address this?
Key Points:
1. Explain the possibility of overfitting.
2. Explain the possibility of behavioral change.
3. Explain the possibility of skipping unit testing.

Various reasons could explain the difference between offline and online model performance. The drop from 0.90 AUC offline to 0.75 online is severe, requiring investigation into one of the following three: overfitting on offline data, behavioral change, and absence of offline model testing.

Let's briefly discuss the cycle from offline model development to online production. Typically, raw offline data is downloaded and wrangled, then feature engineered into a dataset that is used to train an offline model. Then, you evaluate the model using either cross-validation or LOOCV (leave-one-out cross-validation). If the result looks good, you push the offline model to production and wire the feature engineering such that the same processing is applied on real-time data. Finally, let's assume that you evaluate your model online over the first three months since production.

The first possibility is overfitting on the offline data, meaning the offline model was not evaluated properly. Cross-validation, in many cases, is not the best way to evaluate a model for productionalization (see the interview question on cross-validation). The best approach is train/validation/test splits segmented on time periods that reflect how online testing is performed. For instance, suppose you have one year of training data, January 2019 through December 2019. Allocate the first eight months for training, the next two months for validation, and the last two months for testing, which you completely leave out until final evaluation. Use the validation set for your hyperparameter tuning. The test result will provide a better indication of how well the model generalizes online.

Next, behavioral change is another issue. Suppose there's a change in the platform because of a major feature release, glitches in the application, or a demographic shift among customers. Such a change can impact the model's efficacy.

The last point to consider is failing to conduct unit testing on features created both offline and online. Suppose numerical encoding is applied on high-cardinality data. For the categorical value "A", a numerical encoding of 56 is used. Is the encoding consistent offline and online? Any inconsistency in preprocessing between offline and online could lead to inconsistent results.
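The time-segmented train/validation/test split described above can be sketched as follows; the dates and record shapes are illustrative:

```python
# Hypothetical sketch of a time-based split for one year of data
# (Jan-Dec 2019): first 8 months train, next 2 validation, last 2 held out
# as test, mirroring how the model will be evaluated online.
from datetime import date

def time_split(records, train_end=date(2019, 9, 1), valid_end=date(2019, 11, 1)):
    train = [r for r in records if r["ts"] < train_end]
    valid = [r for r in records if train_end <= r["ts"] < valid_end]
    test = [r for r in records if r["ts"] >= valid_end]
    return train, valid, test

# One toy record per month of 2019.
records = [{"ts": date(2019, m, 15), "y": m % 2} for m in range(1, 13)]
train, valid, test = time_split(records)
```

Unlike a random split, nothing from the test months can leak into training, which is exactly the property that makes the offline estimate resemble online performance.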
56
How do you deploy a model on AWS?
There are several ways to deploy a machine learning model on AWS. The best choice depends on factors like your model framework, the scale of your application, and your specific requirements. Here's a breakdown of the most common methods:

1. AWS SageMaker
- Fully managed service: SageMaker is a comprehensive platform, simplifying the process of building, training, and deploying machine learning models.
- Steps:
a. Model Packaging: Prepare your model artifact (code, dependencies) in a format compatible with SageMaker.
b. Create a Model: Upload your model artifact to S3 and create a SageMaker model, specifying the artifact location and container image.
c. Create an Endpoint Configuration: Define the instance type(s) and the number of instances for your endpoint.
d. Deploy the Endpoint: Create a SageMaker endpoint using the model and configuration.
- Pros: Streamlined process; handles infrastructure, scaling, and monitoring.
- Cons: Some level of dependency on the SageMaker ecosystem.

2. Serverless Deployment
- AWS Lambda: Ideal for models that need to be invoked on demand or can handle small inference workloads.
- Steps:
a. Package your model as a Lambda function, along with dependencies.
b. Create a Lambda function and upload your packaged model.
c. API Gateway (optional): Create an API Gateway endpoint to trigger your Lambda function, allowing external requests.
- Pros: Cost effective (pay per execution), automatic scaling, easy setup.
- Cons: Limited memory and execution time, potential cold starts (latency for initial requests).

3. Containerized Deployment
- Flexibility: Docker containers provide a portable way to package your model and its environment.
- Options:
* AWS Elastic Container Service (ECS): Manage containerized applications at scale.
* AWS Elastic Kubernetes Service (EKS): Deploy and manage Kubernetes clusters on AWS.
* AWS Fargate: Serverless compute for containers; simplifies deployment.
- Pros: Control over the environment; suitable for complex deployments or integrating into existing microservice architectures.
- Cons: More infrastructure management overhead.

4. Batch Predictions
- AWS Batch: For predictions on large datasets where real-time inference isn't needed.
- Steps:
a. Containerize your model: Create a Docker image for your model code.
b. Define a Batch Job: Specify the container image, data location, and compute resources.
c. Submit the job: AWS Batch handles the provisioning and execution of the job.
- Pros: Handles large-scale predictions efficiently; cost-effective for non-real-time processes.
- Cons: Not suitable for real-time or low-latency requirements.

Additional Considerations:
1. Model Format: Ensure your model is saved in a format compatible with your chosen deployment method (e.g. TensorFlow SavedModel, ONNX).
2. Monitoring and Retraining: Implement systems to monitor the performance of your deployed model and retrain when necessary to address model drift.
57
What are the primary challenges in deploying and maintaining machine learning models in production?
There are three primary areas of challenges when deploying models to production:

1. Challenges from Development to Deployment
* Model-Environment Mismatch: Models built in lab settings often don't translate seamlessly to the real world. Issues include data distribution differences, scalability constraints, and latency requirements.
* Data Dependencies: Production data may differ in format, quality, or distribution (concept drift) from what was used during training. Robust data cleaning and preprocessing pipelines are essential.
* Computational Overhead: Large, complex models can be computationally expensive to run, leading to high costs and latency issues in production.
* Reproducibility: Ensuring experiments and the model development process are well-documented and reproducible can be difficult, especially in larger teams.

2. Challenges in Production
* Monitoring: Model performance can degrade over time due to:
a. Concept drift: Changes in the underlying real-world patterns that the model was trained on.
b. Data drift: Changes in the distribution of input data.
c. System issues: Upstream data problems, infrastructure failures.
* Feedback loops: Collecting reliable performance and usage data from production systems to retrain and improve the model is often difficult to implement effectively.
* Continuous Integration and Delivery (CI/CD): Machine learning models necessitate a streamlined process for updating models as new data is received or as performance degrades.

3. Operational Challenges
* Scalability: Handling sudden spikes in demand or scaling to handle large volumes of data can be a complex engineering challenge, especially for real-time inference.
* Security: ML models and their data are potential attack vectors. Secure deployment and monitoring are critical.
* Governance: Establishing clear processes for model updates, approvals, and ethical use becomes crucial, especially in larger organizations.

Addressing These Challenges
Many of these challenges are addressed through a robust MLOps framework. Key elements include:
1. Data and Model Versioning
2. Experiment Tracking
3. Automated CI/CD Pipelines for ML
4. Model Monitoring and Alerting
5. Tools for Model Serving and Infrastructure Management
58
How do you handle model failure in production?
When a machine learning model that performed well in development suddenly fails in production, consider these steps:

Troubleshooting Steps
1. Isolate the Issue
- Data Changes: Check for differences in the distribution, format, or quality of production data compared to training data. This is a very common culprit.
- Code Discrepancies: Ensure the code used in production exactly matches the development version. Look for data preprocessing errors, model loading mistakes, or configuration issues.
- System Issues: Investigate external factors like infrastructure problems, network errors, or resource constraints that might affect the model's performance.
2. Gather Information
- Metrics: Compare performance metrics (accuracy, precision, recall, F1-score, etc.) between the development environment and production.
- Error Logs: Analyze any error logs generated by the system. These often contain valuable clues about the root cause.
- Data Samples: Examine specific instances where the model fails in production and the input data associated with them.
3. Root Cause Analysis
- Concept Drift: Determine if the underlying relationships your model learned during training have changed in the real world. Data changes are often responsible.
- Overfitting: If the model performed exceptionally well in development but poorly in production, consider overfitting. It means the model memorized training data instead of learning generalizable patterns.
- Training-Serving Skew: Verify that your data preprocessing pipelines in production are identical to those used during training. Inconsistent preprocessing can lead to wildly different inputs for the model.
- Hidden Biases: Assess whether the data used for training was sufficiently representative. Biased training sets can lead to models that fail on specific segments of real-world data.
4. Action
- Data Adjustment: If concept drift or data quality are issues, you'll likely need to collect new data and retrain the model, potentially with more diverse examples.
- Code Fixes: Verify and correct any code-related discrepancies found in your analysis.
- Model Simplification: If overfitting is suspected, try these techniques:
a. Regularization (L1, L2)
b. Dropout
c. Early Stopping
- Infrastructure: Address any shortcomings in processing power or memory that might be affecting the model.

Tips on Creating a Robust Production Environment
- Robust Monitoring: Implement production monitoring systems to get real-time alerts when models show signs of performance degradation. This is key to catching issues early on.
- Continuous Learning Pipelines: Set up automated processes to retrain or recalibrate models as new production data becomes available.
- Pre-Deployment Testing: Have a thorough validation stage before models go live. This includes testing with data that mimics the expected production environment.
- Gradual Deployment: Consider techniques like canary deployment (new model on a small subset of traffic) or shadow deployment (run the new model alongside the old one but don't use its outputs) to mitigate risk.
59
What is the role of Kubernetes in Model Deployment?
What is Kubernetes?
At its core, Kubernetes is a powerful open-source system designed to automate the deployment, scaling, and management of containerized applications. It provides an abstraction layer over your underlying infrastructure (physical machines, virtual machines, or cloud instances), letting you focus on application logic rather than managing individual servers. Kubeflow, a common ML toolkit, runs on top of Kubernetes.

Key Concepts
Kubernetes operates with concepts like:
- Pods, the smallest deployable unit, often containing one container.
- Deployments, which manage scaling and updates for sets of Pods.
- Services, which provide networking abstractions for Pods, making them accessible.
- Nodes, the worker machines where your Pods run.

Why use Kubernetes for Model Deployment?
Scalability:
- Handles dynamic scaling of your ML models based on demand. If traffic spikes, Kubernetes can automatically spin up more replicas of your model.
- Efficiently uses resources by spreading your models across multiple nodes.
Resilience:
- Self-healing capabilities automatically restart failed containers or pods.
- Distributes model replicas across nodes, ensuring high availability.
Deployment Automation:
- Versioning and rollbacks of model deployments become simple.
- Allows for canary deployments (releasing a new model version to a subset of users) or blue-green deployments (zero-downtime updates) to minimize risk.
Portability:
- Packages models and dependencies in containers, making them runnable anywhere Kubernetes is supported (local machines, cloud, etc.).
Complex Workflows:
- Kubernetes can orchestrate more complex model serving workflows, involving model preprocessing, post-processing, A/B testing, and feedback loops.

How Kubernetes is used
1. Containerizing Your Model: Package your trained ML model, code, and dependencies into a Docker image.
2. Creating Kubernetes Resources: Write YAML configuration files to define Deployments, Services, and other necessary Kubernetes objects for your model.
3. Deploying to a Kubernetes Cluster: Use tools like kubectl to deploy these configurations to your Kubernetes cluster. Kubernetes handles the rest!
60
How does model orchestration work?
What is Model Orchestration?
Model orchestration is the process of automating and managing the entire lifecycle of machine learning models. It addresses the challenges of putting models into production, including:
- Workflow Coordination: Orchestrating complex pipelines involving data preparation, model training, evaluation, deployment, and continuous monitoring.
- Dependency Management: Ensuring all components (code, data, libraries, etc.) are compatible and up-to-date.
- Scalability: Handling the scaling requirements when models experience increased workloads.
- Monitoring and Retraining: Tracking model performance in production and triggering retraining when accuracy degrades.

Key Components of Model Orchestration
1. Orchestrator: This is the core software responsible for scheduling, sequencing, and managing the various tasks within a machine learning workflow. Popular orchestrators include:
a. Apache Airflow: A highly versatile tool for authoring workflows as DAGs (Directed Acyclic Graphs).
b. Kubeflow Pipelines: Part of the Kubeflow project, designed for ML workflows on Kubernetes.
c. Flyte: A cloud-native, type-safe orchestration platform focused on data and ML pipelines.
d. MLflow: Includes components for experiment tracking and model management, providing orchestration aspects.
2. Workflow Definition: Orchestration workflows are typically defined either through code (e.g. Python with Airflow) or configuration files (YAML). They specify:
a. Steps/Tasks: The individual components of the pipeline (data preprocessing, training, deployment, etc.).
b. Dependencies: How steps relate to each other and which tasks need to be completed before others can start.
c. Execution Logic: Conditional branching, error handling, retries, etc.
3. Task Execution:
- The orchestrator spins up containers or jobs to execute each step of the workflow.
- Tasks might involve running training scripts, deploying models as APIs, or triggering data quality checks.
4. Resource Management:
- The orchestrator interacts with infrastructure (on-premises or cloud) to allocate compute resources, memory, and storage as needed for various tasks.
- Integration with Kubernetes is common for dynamic resource scaling.
5. Monitoring and Logging:
- The orchestrator keeps track of task execution, logs errors and warnings, and collects performance metrics.
- Monitoring dashboards help detect issues like model drift or data quality problems.

Example Workflow
A simplified ML orchestration workflow might look like this:
1. Data Preprocessing: Clean and prepare new data.
2. Model Training: Train or retrain a model using the updated data.
3. Evaluation: Evaluate the model's performance on a validation set.
4. Deployment: If the model meets performance criteria, deploy it to a production environment.
5. Monitoring: Track model performance in production and trigger retraining if performance degrades.
61
How would you perform hyperparameter tuning of a computer vision model?
Use a train/holdout split with cross-validation. Run a parameter search such as grid, random, or Bayesian search. Typical hyperparameters to tune: learning rate, batch size, number of epochs, layer sizes and counts, activation functions, and optimizer choice.
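A minimal sketch of random search over an assumed search space; `evaluate` here is a placeholder for training the vision model and scoring it on a holdout (or cross-validation) split:

```python
# Hypothetical sketch of random search. The search space and the scoring
# function are illustrative stand-ins for a real training pipeline.
import random

space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
    "num_layers": [2, 4, 6],
}

def evaluate(params):
    # Placeholder score: a real version would train and validate the model.
    return -abs(params["learning_rate"] - 1e-3) - abs(params["num_layers"] - 4)

def random_search(space, evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(space, evaluate)
```

Grid search enumerates every combination instead of sampling; Bayesian search replaces `rng.choice` with a model of which region of the space looks promising.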
62
How would you build an image search model? A user uploads an image and the search retrieves similar images. Don’t worry about scaling and system design.
Use a pre-trained model like ResNet or VGG and extract vectors from the feature extraction layer so that all images have vector representations. For the target image's vector, run a nearest-neighbor (KNN) search to identify the top-K closest vectors.
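A minimal sketch of the retrieval step, assuming the embedding extraction has already happened; the three-dimensional vectors and file names are toy stand-ins for real CNN features:

```python
# Hypothetical sketch: rank indexed images by cosine similarity to the
# query embedding and return the top-K file names.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, index, k=2):
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

# Toy "embeddings" for three indexed images.
index = {
    "sofa.jpg": [0.9, 0.1, 0.0],
    "chair.jpg": [0.8, 0.3, 0.1],
    "beach.jpg": [0.0, 0.2, 0.9],
}
matches = top_k([1.0, 0.1, 0.0], index, k=2)
```

At scale, the exhaustive `sorted` call would be replaced by an approximate nearest-neighbor index, but the similarity logic is the same.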
63
Airbnb wants to list furniture in the host’s home given the images provided on the hosting page. How would you build this? Don’t worry about scaling/system design.
Gather a dataset of furniture images with labels/tags (e.g. sofa, bed, dining table); these should already be provided by hosts. Preprocess the data with resizing and normalization, and augment it with techniques like flips and crops. Use a CNN (or fine-tune a pre-trained model) to predict the furniture items found in an image.
64
How does the choice of activation function in a neural network affect its ability to model complex patterns? For instance, compare the effects of using sigmoid, ReLU, and tanh activation functions.
Having different activation functions can affect the network's convergence speed, training stability, and ability to capture complex patterns in the data.

Activation functions
Sigmoid
The range is between 0 and 1. Due to its formula, it suffers from the "vanishing gradient" problem, where the gradients become very small, which leads to slow learning of the network.
ReLU
If x is positive, it returns x; otherwise, it returns 0. It helps mitigate the vanishing gradient problem seen with sigmoid, but it suffers from the "dying ReLU" effect, in which certain neurons become inactive during training.
Tanh
The range is between -1 and +1. It can be used in cases where zero-centered outputs are preferred. It also mitigates the vanishing gradient problem to some extent.
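For concreteness, the three activations written out directly, with two gradients; the gradient values at a large input show why sigmoid saturates while ReLU does not:

```python
# The three activation functions and (for contrast) two of their gradients.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # range (0, 1)

def tanh(x):
    return math.tanh(x)  # range (-1, 1), zero-centered

def relu(x):
    return max(0.0, x)  # 0 for all negatives: the source of "dying ReLU"

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # saturates: tiny for large |x| (vanishing gradient)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # stays exactly 1 for any positive input
```

Evaluating `sigmoid_grad(10)` gives a value on the order of 1e-5, while `relu_grad(10)` is still 1.0 — the gradient signal sigmoid loses, ReLU preserves.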
65
Explain the problem of vanishing and exploding gradients in deep neural networks. How do these issues affect the training process, and what are some common strategies used to mitigate them?
Vanishing gradients
- Gradients can become extremely small as they propagate backwards during training, a problem aggravated by activation functions such as sigmoid and tanh.
How these issues affect the training process: gradients "disappear", the network virtually stops training, and it does not converge to an optimal solution.
Common strategies used to mitigate them:
1. Use a different set of activation functions (e.g. ReLU)
2. Batch normalization
3. Proper weight initialization
4. Residual connections

Exploding gradients
- Gradients can become very large during backpropagation, which can lead to divergence during training.
How these issues affect the training process: the model struggles to converge to a solution.
Common strategies used to mitigate them:
1. Proper weight initialization
2. Batch normalization
3. Use another set of activation functions with smaller gradients
4. Gradient clipping
5. Residual connections
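Gradient clipping (a mitigation for exploding gradients) can be sketched as clipping by global norm; the gradient values below are illustrative:

```python
# Hypothetical sketch of gradient clipping by global norm: if the L2 norm
# of the gradient vector exceeds `max_norm`, rescale it onto that norm,
# preserving its direction but bounding the update size.
import math

def clip_by_global_norm(grads, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads  # already within bounds, leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm was 5.0
```

The clipped vector points the same way as the original but has norm 1.0, so a single exploding batch cannot throw the parameters far off course.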
66
Discuss the role of regularization techniques like dropout and L1/L2 regularization in preventing overfitting in neural networks. How do these techniques alter the learning process?
Regularization prevents overfitting in neural nets by constraining what the network can learn during training.

Dropout
During training, certain neurons are randomly selected to be dropped out by setting their outputs to 0. This prevents the network from relying on specific neurons and promotes the learning of better, more robust features.
L1/L2 regularization
Applies an L1 or L2 penalty to the values of the weights, which penalizes large weights. This prevents the model from fitting the training data too closely.
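A minimal sketch of (inverted) dropout on a flat list of activations; the drop probability and inputs are illustrative:

```python
# Hypothetical sketch of inverted dropout: each activation is zeroed with
# probability p during training, and survivors are scaled by 1/(1-p) so the
# expected activation matches inference-time behavior (no dropout at test).
import random

def dropout(activations, p=0.5, rng=None, training=True):
    if not training or p == 0.0:
        return list(activations)  # inference: pass through unchanged
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)
```

Each surviving activation is doubled (scaled by 1/0.5), so across many batches each unit's expected output equals its undropped value.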
67
Compare and contrast the effects of different optimization algorithms (such as SGD, Momentum, RMSprop, and Adam) on the training dynamics of a neural network. How do these algorithms influence the convergence rate and stability of training?
SGD updates parameters along the negative gradient, scaled by a fixed learning rate. It's simple and low-overhead, but converges slowly — especially near saddle points or local minima — because it's sensitive to gradient noise. Stability depends heavily on learning rate scheduling; too high and it oscillates, too low and it stalls.

Momentum augments SGD by accumulating an exponentially decaying moving average of past gradients, adding "inertia" to updates. This accelerates convergence along consistent gradient directions and dampens oscillations across noisy or high-curvature directions. It helps escape shallow local minima and traverse flat regions faster than vanilla SGD, but introduces an additional hyperparameter (the momentum coefficient, typically ~0.9).

RMSprop adapts the learning rate per-parameter by dividing by a running average of recent squared gradients. This normalizes updates so that frequently-updated parameters get smaller steps and infrequent ones get larger steps. It converges faster than SGD on noisy or non-stationary problems (e.g., RNNs), and the gradient normalization helps stabilize training, though it doesn't fully eliminate vanishing/exploding gradient issues.

Adam combines RMSprop's per-parameter adaptive rates with Momentum's first-moment estimate. It maintains both first-moment (mean) and second-moment (variance) running averages, with bias correction for both. This gives it fast early convergence and strong performance on large-scale, high-dimensional, or sparse-gradient problems. However, the adaptive rates can sometimes lead to convergence to suboptimal solutions compared to well-tuned SGD.

Key tradeoffs: SGD generalizes best but requires careful tuning. Momentum improves SGD's convergence speed at minimal added complexity. RMSprop adds per-parameter adaptivity, helping on rugged loss landscapes. Adam converges fastest with minimal tuning but may find sharper (less generalizable) minima. In practice, Adam is the default starting choice, but SGD+Momentum with a good learning rate schedule often wins for final model quality.
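Under simplifying assumptions (a single scalar parameter, typical default hyperparameters), the four update rules can be written out directly:

```python
# One-parameter update rules for each optimizer, so the differences are
# visible side by side. Hyperparameter defaults are the commonly used ones.
import math

def sgd_step(w, g, lr=0.1):
    return w - lr * g

def momentum_step(w, g, v, lr=0.1, beta=0.9):
    v = beta * v + g                      # decaying average of past gradients
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g * g     # running average of squared gradients
    return w - lr * g / (math.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first moment (mean)
    v = b2 * v + (1 - b2) * g * g         # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)             # bias correction for step t >= 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

Reading the bodies top to bottom shows the lineage: Momentum adds the first-moment average to SGD, RMSprop adds the second-moment scaling, and Adam combines both with bias correction.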
68
Explain the purpose of batch normalization in neural network training. How does it help in accelerating training and improving performance?
Batch normalization centers and scales the inputs to a layer across the current mini-batch so that the mean becomes 0 and the standard deviation becomes 1, then applies a learnable scale and shift. Doing so ensures that a feature with a higher range of values, let's say -100 to 100, does not overpower error propagation compared to a feature with a smaller range, let's say -1 to 1. Furthermore, normalization stabilizes the input distributions feeding into activation functions, which can otherwise suffer from vanishing or exploding gradients; this allows higher learning rates and accelerates training.
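A minimal sketch of the batch-norm forward pass for a single feature across a mini-batch; gamma and beta are the learnable scale and shift, and the sample values echo the -100 to 100 example:

```python
# Hypothetical sketch of batch normalization for one feature across a
# mini-batch: standardize to zero mean / unit variance, then apply the
# learnable scale (gamma) and shift (beta).
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# A feature ranging over -100..100 is brought onto the same scale as any other.
out = batch_norm([-100.0, 0.0, 100.0])
```

At inference time, running estimates of the mean and variance collected during training are used in place of the per-batch statistics.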
69
In what scenarios is transfer learning particularly effective for neural networks? Discuss the advantages of using pre-trained models as a starting point for new tasks.
Out-of-the-box pre-trained models like GPT-3 and BERT (text) or ViT and CNNs pre-trained on ImageNet (images) are already trained on a large volume of data. This means that these models have learned the patterns of text (representing a word token as an embedding) and images (identifying edges and textures). This makes them an easy starting point for business-specific tasks like classifying furniture types in Airbnb home images, especially when task-specific labeled data is scarce.
70
How can you interpret the feature importance in a convolutional neural network (CNN)? Discuss methods like feature visualization.
1. Visualize filters applied at each layer
- Helps in understanding the types of patterns that each filter is sensitive to.
- It can reveal low-level features (e.g. edges, textures, colors) as well as higher-level semantic features.
2. Look at Class Activation Mapping (CAM)
- Generates heatmaps of the regions of an input image that contribute the most to a CNN's prediction.
3. Consider Gradient-weighted CAM (Grad-CAM)
- Extends CAM by using the gradients of the predicted class score w.r.t. the feature maps of the last convolutional layer.
- It generates more localized and accurate heatmaps than CAM.
71
What are the primary challenges in training Recurrent Neural Networks (RNNs), and how do architectures like LSTM and GRU address these challenges?
Challenges
- Vanishing / exploding gradients.
- Difficulty in capturing long-term dependencies in the data.
* As the sequence length increases, the network's ability to remember information from distant past steps diminishes (due to the vanishing gradient problem).
- Memory constraints
* RNNs have limited memory capacity, which depends on the length of the sequence. This can prevent the network from retaining information from earlier time steps for much longer.

Solutions
- LSTM (Long Short-Term Memory)
* Addresses the vanishing gradient problem. Introduces the "memory cell" and other gating mechanisms to address the issue with long-term dependencies (see above).
- GRU (Gated Recurrent Unit)
* Simpler architecture than LSTM, with fewer parameters.
* Controls the flow of information via a gating mechanism.
* Computationally more efficient than LSTM, but still effective in capturing long-term dependencies.
72
Discuss the concept of neural network pruning. How can reducing the size of a neural network model improve its performance, and what are the trade-offs involved?
Definition
- A technique to reduce the size of a NN model by removing unnecessary neurons, layers, or connections while trying to maintain or improve its performance.

Pros
- Reduces the computational cost of the model, improves efficiency, and provides a regularization effect (i.e. can help reduce overfitting).

Trade-offs
- Loss of performance: aggressive pruning can remove important parameters, decreasing model performance. This can be addressed by fine-tuning the model after pruning.
- Sensitivity to initialization and training: depending on the initialization of the model, different pruning techniques can lead to different outcomes.
- Increased pipeline complexity: pruning adds another stage to the training pipeline.
- Loss of interpretability: if certain layers or portions of the network are removed, it can be challenging to understand the behavior of the pruned model.
73
How does data augmentation impact the performance of a neural network in tasks like image classification? What are some effective data augmentation techniques and their limitations?
Pros
- Increases the robustness of the model.
- Helps the model generalize.
- Improves the performance of the model.

Cons
- Increases training time, since more training samples are used.
- Loss of information (not always): certain augmentation techniques may remove helpful information from the original data.
- Artifacts: augmentation can introduce unrealistic features into the data, which can decrease model performance.

Typical techniques
- Rotation
- Flipping (horizontal or vertical mirroring)
- Translation
- Scaling and cropping
- Noise injection
- Color jittering
74
What increases the training time of a neural network? Increasing the number of hidden layers or number of nodes in a hidden layer?
We can make this concrete by using the number of weights involved in feedforward and backpropagation as a proxy for "training time". In the feedforward pass, each weight is used once: h_ij = activation_function(w_ij * x_ij). In backpropagation, each weight is updated once: w_ij_new = w_ij_old - alpha * gradient_w_ij. Consider the image: the number of weights is larger for architecture A, which has more units per hidden layer rather than more depth. This is because the weight count between two layers is (units in) x (units out), so widening layers grows it multiplicatively, while adding a layer only adds one more such product. This can be tested on other architectures; the number of calculations involved is indeed higher for the wider network.
75
How do you find out which features are important given weights from a neural network?
Assuming all else is equal, meaning the feature inputs are all scaled to have the same mean and standard deviation, we should expect inputs that are more influential in predicting the output to have the largest weights. Consider a simple linear regression example (c.f. 1st image): w1 = 5 is larger than w2 = 0.1, which means x1 has more influence on predicting y than x2. We can take a similar approach to extracting important signals from a neural network: the more important variables should have larger weights linked to them (c.f. 2nd image).
76
Explain False Positive Rate in simple English and provide the formula for calculating.
The False Positive Rate (FPR) is the probability of incorrectly classifying something as positive when it is actually negative. In other words, it measures how often a test incorrectly identifies a negative case as positive. Think of a security system that detects intruders: * A false positive happens when the system incorrectly thinks a friendly visitor is an intruder. * The False Positive Rate tells us how often the system makes this mistake out of all the actual friendly visitors. Formula for False Positive Rate (FPR) 𝐹𝑃𝑅 = False Positives (FP) / (False Positives (FP) + True Negatives (TN) ) Breaking it Down: False Positives (FP) = Cases where the model incorrectly predicted positive (e.g., the system flagged a friendly visitor as an intruder). True Negatives (TN) = Cases where the model correctly predicted negative (e.g., the system correctly ignored a friendly visitor). Denominator (FP + TN) = Total actual negative cases. A low FPR means the system rarely makes false alarms, while a high FPR means it often makes incorrect positive predictions.
77
Explain True Positive Rate in simple English and provide the formula for calculating.
The True Positive Rate (TPR), also called Recall or Sensitivity, measures how well a model correctly identifies actual positive cases. It answers the question: "Out of all the real positive cases, how many did the model correctly detect?" Example: Imagine a medical test for a disease: * A true positive happens when the test correctly identifies a sick person as sick. * The True Positive Rate tells us how often the test correctly detects sick people out of all the people who are actually sick. Formula for True Positive Rate (TPR) 𝑇𝑃𝑅 = True Positives (TP) / (True Positives (TP) + False Negatives (FN) ) Breaking it Down: True Positives (TP) = Cases where the model correctly predicted positive (e.g., the test correctly detected a sick person). False Negatives (FN) = Cases where the model missed a positive case (e.g., the test incorrectly said a sick person is healthy). Denominator (TP + FN) = Total actual positive cases. A high TPR means the model is good at detecting real positives, while a low TPR means it often misses them.
78
Explain True Negative Rate in simple English and provide the formula for calculating.
The True Negative Rate (TNR), also known as Specificity, measures how well a model correctly identifies actual negative cases. It answers the question: "Out of all the real negative cases, how many did the model correctly classify as negative?" Formula for True Negative Rate (TNR) 𝑇𝑁𝑅 = True Negatives (TN) / (True Negatives (TN) + False Positives (FP) ) Breaking it Down: True Negatives (TN) = Cases where the model correctly predicted negative (e.g., a security system correctly ignored a friendly visitor). False Positives (FP) = Cases where the model incorrectly predicted positive (e.g., a security system mistakenly flagged a friendly visitor as an intruder). Denominator (TN + FP) = Total actual negative cases. A high TNR means the model is good at avoiding false alarms, while a low TNR means it frequently misclassifies negatives as positives.
79
Explain False Negative Rate in simple English and provide the formula for calculating.
The False Negative Rate (FNR) measures how often a model misses actual positive cases. It answers the question: "Out of all the real positive cases, how many did the model incorrectly classify as negative?" Example: Imagine a medical test for a disease: * A false negative happens when the test incorrectly says a sick person is healthy. * The False Negative Rate tells us how often the test fails to detect sick people out of all the people who actually have the disease. Formula for False Negative Rate (FNR) 𝐹𝑁𝑅 = False Negatives (FN) / (True Positives (TP) + False Negatives (FN) ) ​Breaking it Down: False Negatives (FN) = Cases where the model missed a positive case (e.g., the test incorrectly said a sick person is healthy). True Positives (TP) = Cases where the model correctly predicted positive (e.g., the test correctly detected a sick person). Denominator (TP + FN) = Total actual positive cases. A low FNR means the model rarely misses real positive cases, while a high FNR means it often fails to detect positives.
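A minimal NumPy sketch tying the four rates (TPR, FNR, FPR, TNR) together; the labels here are hypothetical, with 1 = positive and 0 = negative:

```python
import numpy as np

# Hypothetical ground-truth labels and predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly flagged positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # missed positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false alarms
tn = np.sum((y_true == 0) & (y_pred == 0))  # correctly ignored negatives

tpr = tp / (tp + fn)  # recall / sensitivity
fnr = fn / (tp + fn)  # miss rate (= 1 - TPR)
fpr = fp / (fp + tn)  # false alarm rate
tnr = tn / (fp + tn)  # specificity (= 1 - FPR)
```

Note that the four rates form two complementary pairs over the same denominators: TPR + FNR = 1 over actual positives, and FPR + TNR = 1 over actual negatives.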
80
What are the 9 guiding principles for system design?
1. Scalability 2. Latency 3. Availability 4. Reliability 5. Consistency 6. Fault Tolerant 7. Maintainability 8. Security 9. Cost Effective
81
What is Momentum, why does it help, and what can go wrong?
Momentum introduces a velocity term, v, that replaces the raw gradient in the weight update, accumulating an exponentially weighted average of past gradients.

```
v = beta * v + (1 - beta) * grad
w = w - lr * v
```

Why it helps: On loss surfaces shaped like narrow valleys (steep walls on the sides, gentle slope along the bottom), vanilla GD oscillates: large gradients on the steep walls cause it to ping-pong side to side while progress along the valley floor is slow. Momentum damps those oscillations (side-to-side gradients cancel out in the rolling average) and accelerates progress in consistent directions (along-the-valley gradients keep accumulating).

Failure modes: Setting beta too high (-> 1.0) causes the optimizer to overshoot minima and adapt slowly when the gradient direction changes - you're coasting on stale history. Too low (-> 0) degenerates back to vanilla GD. A subtler issue is cold start: because v is initialized at zero, early steps are artificially small - you're averaging true gradients with zeros.
82
Starting from vanilla gradient descent, what state variable does RMSprop add per parameter, how is it updated, and how does it change the weight update rule?
RMSprop adds one state variable per parameter, `v`, a running average of squared gradients, initialized to `0`. Each step:

```
v_w = beta * v_w + (1 - beta) * dw**2       # EMA of squared gradients
w -= lr / (np.sqrt(v_w) + epsilon) * dw     # adaptive update
```

Typical values: `beta=0.9`, `lr=0.001`, `epsilon=1e-8`. The effect: parameters with historically large gradients get a smaller effective learning rate; parameters with smaller gradients get a larger one - all automatically with no change to how gradients are computed.
83
What is the Variance Inflation Factor, what is its formula, how can you interpret VIF values, and what does VIF not tell you?
VIF is a measure of how much the variance of a regression coefficient is inflated due to multicollinearity with other predictors. It quantifies how much a predictor can be explained by other predictors in the model. The formula is: VIF_j = 1 / (1 - R^2_j), where R^2_j is the R^2 from regressing the predictor X_j on all other predictors in the model. You can interpret VIF values like this: VIF = 1: No correlation with other predictors VIF = 1-5: Moderate correlation (generally acceptable) VIF = 5-10: High correlation (concern) VIF > 10 : Severe multicollinearity (action needed) VIF doesn't identify which pair of variables is causing the problem, only that a given variable is correlated with the others as a group. Use a correlation matrix alongside VIF for pairwise diagnosis.
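A minimal NumPy sketch of the formula above: each predictor is regressed on the others (with an intercept) via least squares, and VIF_j = 1 / (1 - R²_j). The data below is hypothetical, constructed so x2 is nearly a copy of x1 while x3 is independent:

```python
import numpy as np

def vif(X):
    """VIF for each column of X (shape n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # regress column j on all other columns, plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # near-duplicate of x1
x3 = rng.normal(size=200)             # independent predictor
v = vif(np.column_stack([x1, x2, x3]))
# v[0] and v[1] are severe (>10); v[2] stays near 1
```

Notice the limitation from the answer in action: VIF flags x1 and x2 as problematic, but only the correlation matrix tells you they are problematic *with each other*.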
84
Adam combines Momentum and RMSprop — what two state variables does it maintain per parameter, how are they updated, what is bias correction and why is it needed, and what can go wrong?
Adam maintains two state variables per parameter:

m — first moment: EMA of gradients (like Momentum)
v — second moment: EMA of squared gradients (like RMSprop)

```
m = beta1 * m + (1 - beta1) * grad       # first moment
v = beta2 * v + (1 - beta2) * grad**2    # second moment
m_hat = m / (1 - beta1**t)               # bias-corrected
v_hat = v / (1 - beta2**t)               # bias-corrected
w -= lr * m_hat / (np.sqrt(v_hat) + epsilon)
```

Typical values: beta1=0.9, beta2=0.999, lr=0.001, epsilon=1e-8.

Bias correction: Both m and v are initialized at zero. In early steps, the EMAs are heavily biased toward zero — you're averaging true signal with a lot of zeros. Dividing by (1 - beta**t) rescales the estimates upward to compensate; the correction factor shrinks toward 1.0 as t grows and becomes irrelevant.

Failure modes:
- Generalization gap: Adam often converges faster than SGD+Momentum but to a sharper minimum, which can hurt test performance. SGD with momentum tends to find flatter minima that generalize better — this is an active research area (see AMSGrad, AdamW).
- Weight decay coupling: Naive L2 regularization in Adam doesn't behave like true weight decay because the gradient of the penalty gets scaled by v_hat just like any other gradient. AdamW fixes this by decoupling the decay step.
- Epsilon sensitivity: A larger epsilon damps the adaptive behavior (pushing Adam toward plain Momentum); too small and near-zero v_hat causes exploding updates in rarely-updated parameters.
85
When should you use stratified sampling over uniform sampling?
When your dataset has a significant class imbalance. Uniform sampling may under-represent rare classes by chance, while stratified sampling guarantees each class is represented at a specified proportion.
86
What is the key failure mode of naive stratified sampling with very rare classes?
When a class is very rare (e.g., 0.1% of the data), `round(proportion * n)` can round to 0, meaning the rare class gets no samples at all. Fix by enforcing a minimum sample count per class or specifying proportions manually.
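A minimal NumPy sketch of the failure and the fix (function name, labels, and seed are hypothetical): naive rounding allocates the 0.1% class zero samples, so a per-class floor is enforced after rounding.

```python
import numpy as np

def stratified_sample(y, n, min_per_class=1, seed=0):
    """Sample ~n indices from labels y, preserving class proportions
    while guaranteeing at least min_per_class draws from every class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    alloc = np.round(counts / len(y) * n).astype(int)  # naive allocation
    alloc = np.maximum(alloc, min_per_class)           # enforce the floor
    picks = []
    for c, k in zip(classes, alloc):
        members = np.flatnonzero(y == c)
        picks.append(rng.choice(members, size=min(k, len(members)), replace=False))
    return np.concatenate(picks)

# 0.1% rare class: round(0.001 * 100) = 0 without the floor
y = np.array([0] * 999 + [1])
idx = stratified_sample(y, n=100)   # rare class is guaranteed at least 1 slot
```

One design note: enforcing the floor can push the total slightly above n (here 101 rather than 100); a stricter sketch would subtract the overflow from the majority class.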
87
What problem does reservoir sampling solve, and what is the core invariant it maintains?
It allows uniform random sampling of `k` items from a stream too large to fit in memory. The invariant is that at every point in the stream, each item seen so far has an equal probability of being in the reservoir.
88
In reservoir sampling (Algorithm R), when item `i` arrives, what is the probability it enters the reservoir, and why?
`k/i` - when the 1-indexed item `i` arrives, a random integer `j` is drawn uniformly from the `i` values `[0, i-1]`, and the item enters if `j < k`, replacing `reservoir[j]`. This preserves the invariant that every item seen so far sits in the reservoir with probability `k/i`.
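A minimal Python sketch of Algorithm R (the stream and seed here are hypothetical):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform random sample of k items from an
    iterable of unknown length, using O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)   # fill phase: keep the first k items
        else:
            j = rng.randrange(i)     # j uniform over the i values [0, i-1]
            if j < k:                # happens with probability k/i
                reservoir[j] = item  # evict a uniformly chosen resident
    return reservoir

sample = reservoir_sample(range(1000), k=10)
```

Each new item enters with probability k/i, and existing residents survive the step with probability 1 - 1/i each, which is exactly what keeps every item's overall inclusion probability equal.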
89
How does Bayesian Optimization work, step by step?
Bayesian Optimization treats hyperparameter tuning as a problem of efficiently searching an unknown landscape. Imagine you're trying to find the lowest point in a dark, hilly terrain, but every step you take costs you 30 minutes. You want to be strategic about where you step next. The two key components: 1. Surrogate Model (usually a Gaussian Process): This is a statistical model that approximates the true objective function (e.g., validation loss as a function of hyperparameters). After each evaluation, it updates its beliefs about what the landscape looks like, maintaining both a predicted value and an uncertainty estimate at every point. Where you've evaluated, uncertainty is low. Where you haven't, uncertainty is high. 2. Acquisition Function: This is the decision rule that picks the next point to evaluate. It balances exploitation (sampling where the surrogate predicts good performance) with exploration (sampling where uncertainty is high, because a hidden optimum might be lurking there). Common acquisition functions include Expected Improvement (EI) and Upper Confidence Bound (UCB). The loop: Start by evaluating a few random hyperparameter configurations. Fit the surrogate model to all results so far. Use the acquisition function to choose the next most promising configuration. Evaluate that configuration (train the model, measure validation performance). Update the surrogate model with the new result. Repeat until your budget runs out. Why it beats Grid/Random Search: Grid and Random search treat every trial as independent — they don't learn from previous results. Bayesian Optimization uses every past evaluation to make a smarter decision about what to try next, so it typically finds strong configurations in far fewer iterations. This matters most when each evaluation is expensive (e.g., training a deep neural network for hours). 
Key tradeoff: The surrogate model itself becomes harder to fit accurately as the number of hyperparameters grows (roughly beyond 10–20 dimensions), which is why Bayesian Optimization is best suited for tuning a moderate number of continuous hyperparameters rather than massive discrete search spaces.
90
What is R-squared in regression modeling, and what is its main drawback?
R² measures the proportion of variance in the target variable that is explained by the model. A higher value (closer to 1) is better. Its main drawback is that it always increases when you add more features, even irrelevant ones, which can be misleading about the model's true quality.
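A minimal NumPy sketch of R² from its definition, 1 - SS_res / SS_tot (the toy target values are hypothetical):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                         # 1.0: zero residuals
baseline = r_squared(y, np.full(4, y.mean()))     # 0.0: predicting the mean
```

The baseline case shows why R² = 0 is the "no better than guessing the mean" floor; a model can even score below 0 if it predicts worse than the mean.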
91
What distinguishes variance from standard deviation, and in what contexts would you use each?
Variance is the average squared deviation from the mean (σ²); standard deviation is its square root (σ), expressed in the same units as the data. Use variance when: Doing mathematical derivations or proofs (it's algebraically cleaner — variances of independent variables add directly) Working with statistical models internally (e.g., computing loss functions, PCA, linear regression) Use standard deviation when: Communicating results to stakeholders (interpretable units match the data) Describing spread in feature distributions or model error (e.g., "predictions are off by ±2.3 kg") Key tradeoff: Variance penalizes outliers more heavily due to squaring, making it sensitive to extreme values — worth keeping in mind during feature analysis.
92
How should p-values be interpreted, and what are frequent misconceptions about them?
A p-value is the probability of observing results at least as extreme as your data, assuming the null hypothesis (H₀) is true. It measures how surprising your data is under H₀ — not how true any hypothesis is. Correct interpretation: A small p-value (e.g., < 0.05) means the data is unlikely under H₀, giving grounds to reject it. It is a continuous measure of evidence against H₀, not a binary pass/fail. p < 0.05 ≠ 95% chance the result is real. p > 0.05 ≠ no effect exists. Low p ≠ large or important effect (conflates sample size with magnitude). 0.05 is a convention, not a threshold of truth. ML relevance: In feature selection or A/B testing, relying solely on p-values without considering effect size, confidence intervals, and practical significance is a common and costly mistake.
93
What is Gini impurity, its formula, and NumPy implementation? Finally, what is the weighted Gini of a split in a decision tree?
Gini impurity measures the probability of misclassifying a randomly chosen sample if labeled according to the node's class distribution. A pure node scores 0; a maximally mixed binary node scores 0.5.

Formula: G = 1 - Σ pᵢ² where pᵢ is the proportion of samples belonging to class i.

```
def gini(y):
    # compute proportion of each class
    p = np.bincount(y) / len(y)
    # apply formula: 1 - sum(pi^2)
    return 1 - np.sum(p ** 2)
```

Sanity checks: gini([0,0,0,0]) → 0.0 (pure node); gini([0,0,1,1]) → 0.5 (maximally mixed).

The weighted Gini of a split is just: G_split = (n_left/n) * G_left + (n_right/n) * G_right
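The weighted-split formula can be sketched the same way (example labels are hypothetical; gini is restated so the snippet stands alone):

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - np.sum(p ** 2)

def weighted_gini(y_left, y_right):
    """Impurity of a candidate split: child Ginis weighted by child sizes."""
    n = len(y_left) + len(y_right)
    return len(y_left) / n * gini(y_left) + len(y_right) / n * gini(y_right)

pure = weighted_gini(np.array([0, 0]), np.array([1, 1]))    # 0.0: perfect split
mixed = weighted_gini(np.array([0, 1]), np.array([0, 1]))   # 0.5: useless split
```

A decision tree evaluates this quantity for every candidate threshold and picks the split that minimizes it.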
94
You have a NumPy array `[1.4, 10.5, 3.5, 2.1]`. How can you calculate the thresholds between consecutive values in the array?
Use np.unique(values) to get the values sorted ascending, then slice so each value lines up with its successor, add the two slices element-wise, and divide by 2:

```
thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
```
95
What are some tools for running Machine Learning algorithms in parallel?
Some of the tools, software or hardware, used to execute Machine Learning algorithms in parallel include GPUs, MapReduce, and Spark.
96
What is Spark and how is it useful for ML?
Apache Spark is an open-source, distributed computing framework designed for large-scale data processing. Spark provides an in-memory computation engine that makes it significantly faster than disk-based frameworks like Hadoop MapReduce. Why it matters for ML: Spark is useful because real-world ML often involves datasets too large for a single machine. Spark lets you distribute data processing and model training across a cluster. Its key ML-relevant components include: Spark SQL / DataFrames — for cleaning, joining, and transforming large datasets (the bulk of any ML pipeline). MLlib — Spark's built-in library with distributed implementations of common algorithms (linear regression, random forests, k-means, ALS for recommendations, etc.). Spark Streaming — enables near-real-time feature engineering and inference on streaming data. PySpark — a Python API that lets data scientists use familiar syntax while leveraging distributed compute under the hood. Key advantages to mention in an interview: In-memory processing — avoids repeated disk I/O, making iterative algorithms (like gradient descent) much faster than MapReduce. Lazy evaluation + DAG optimizer — Spark builds a directed acyclic graph of transformations and optimizes the execution plan before running anything, reducing unnecessary shuffles. Scalability — scales horizontally by adding nodes; you can go from gigabytes to petabytes without rewriting code. Unified pipeline — you can do data ingestion, feature engineering, model training, and evaluation all within one framework.
97
What are the different Machine Learning approaches?
1. Supervised Learning - where the output variable (the one you want to predict) is labeled in the training dataset. Includes Regression and Classification.
2. Unsupervised Learning - where the training dataset does not contain the output variable. The objective is to group similar data together instead of predicting any specific value. Includes Clustering, Dimensionality Reduction, and Anomaly Detection.
3. Semi-supervised Learning: This technique falls between Supervised and Unsupervised Learning because it has a small amount of labeled data with a relatively large amount of unlabeled data. You can find its applications in problems such as Web Content Classification and Speech Recognition, where it is very hard to get labeled data but you can easily get lots of unlabeled data.
4. Reinforcement Learning: RL focuses on finding a balance between Exploration (of unknown new territory) and Exploitation (of current knowledge). It monitors the response of actions taken through trial and error and measures the response against a reward. The goal is to take such actions for the new data so that the long-term reward is maximized.
98
What is the difference between Causation and Correlation?
Causation is a relationship between two variables such that one of them directly brings about the other. Correlation, on the other hand, is a statistical association between two variables (they tend to move together) that does not by itself imply one causes the other.
99
What is the difference between Online and Offline (Batch) learning? Highlight 5 key differences.
Online learning updates the model incrementally as each new data point (or small mini-batch) arrives. The model learns continuously and adapts to new patterns without needing access to the full historical dataset. Examples include stochastic gradient descent on a stream of data, or recommendation systems that update in real time as users interact. Offline (batch) learning trains the model on the entire dataset at once. You collect all the data, train the model, evaluate it, and deploy it. When new data becomes available, you retrain from scratch (or from a checkpoint) on the full updated dataset. Key differences to highlight in an interview: Data access — online sees one example at a time; batch sees everything at once. Adaptability — online adapts quickly to distributional shifts (concept drift); batch requires retraining to incorporate changes. Compute/memory — online is lightweight per update and doesn't need to store the full dataset in memory; batch can be expensive and requires the full dataset to be available. Stability — batch training tends to be more stable and reproducible; online learning can be sensitive to data ordering and noisy examples. Use cases — online is ideal for streaming data, non-stationary environments, or when data is too large to store (e.g., ad click prediction). Batch is preferred when you have a fixed dataset, need reproducibility, or the data distribution is stable.
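The online style can be sketched with the simplest possible example: linear regression fit one observation at a time via SGD, with each example discarded after its update. The stream below (a noiseless y = 2x + 1 relation) and the learning rate are hypothetical:

```python
import numpy as np

def online_sgd(stream, lr=0.1):
    """Fit y ≈ w*x + b incrementally under squared-error loss."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # error on this single example
        w -= lr * err * x       # one SGD step, then the example is gone
        b -= lr * err
    return w, b

rng = np.random.default_rng(0)
stream = [(x, 2 * x + 1) for x in rng.uniform(-1, 1, 10000)]
w, b = online_sgd(stream)       # recovers w ≈ 2, b ≈ 1
```

Note the contrast with batch learning: the full dataset never needs to be held in memory, and the model could keep updating indefinitely as new points arrive.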
100
Define Sampling. Why do we need it?
Sampling is a process of choosing a subset from a target population which would serve as its representative. We use the data from the sample to understand the pattern in the population as a whole. Sampling is necessary because often we can not gather or process the complete data in a reasonable time. There are many ways to perform sampling, some commonly used techniques are Random Sampling, Stratified Sampling, and Cluster Sampling.
101
Define Confidence Interval
A confidence interval is an interval estimate, calculated from a sample dataset, which is likely to include an unknown population parameter. Note that it does not mean you are "completely sure" the true value lies in the range: a 95% confidence interval means that if you repeated the sampling procedure many times, about 95% of the intervals constructed this way would contain the true parameter.
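A minimal NumPy sketch of a normal-approximation 95% CI for a mean (the sample here is hypothetical: 400 draws from N(10, 1)):

```python
import numpy as np

def mean_ci_95(x):
    """Normal-approximation 95% confidence interval for the population mean."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))   # standard error of the mean
    return m - 1.96 * se, m + 1.96 * se    # z = 1.96 for 95% coverage

rng = np.random.default_rng(1)
lo, hi = mean_ci_95(rng.normal(10, 1, 400))
```

With n = 400 and sd ≈ 1, the interval is roughly ±0.1 around the sample mean; quadrupling n would halve its width, since the standard error shrinks as 1/√n.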
102
What do you mean by i.i.d. assumption?
We often assume that the instances in the training dataset are independent and identically distributed (i.i.d.), i.e., they are mutually independent of each other and follow the same probability distribution. It means that the order in which the training instances are supplied should not affect your model and that the instances are not related to each other. If the instances do not follow an identical distribution, patterns learned from one part of the data may not carry over to the rest, making the data fairly difficult to interpret.
103
Why do we call it GLM (Generalized Linear Model) when it is clearly non-linear?
The Generalized Linear Model (GLM) is a generalization of ordinary linear regression in which the response variables have error distribution models other than a normal distribution. The "linear" component in GLM means that the predictor is a linear combination of the parameters, and it is related to the response variable via a link function.
104
Define Conditional Probability.
Conditional probability is the probability of an event A occurring given that another event B has already occurred (or is known to be true). It's written as P(A|B) and read as "the probability of A given B." The formula: P(A|B) = P(A ∩ B) / P(B), where P(B) > 0 This says: to find the probability of A given B, take the probability that both A and B happen, and divide by the probability of B. You're essentially narrowing the sample space to only the outcomes where B is true, then asking how often A also occurs within that subset. Intuitive example: Suppose you're drawing from a standard deck of 52 cards. The probability of drawing a king is 4/52. But if someone tells you the card is a face card (given B), you've narrowed the space to 12 cards, and now P(King | Face card) = 4/12 = 1/3. Why it matters for ML: 1. Bayes' theorem is built directly on conditional probability: P(A|B) = P(B|A) · P(A) / P(B). This is the foundation of Naive Bayes classifiers, Bayesian inference, and probabilistic graphical models. 2. Classification is fundamentally about estimating P(class | features) — a conditional probability. 3. Chain rule of probability decomposes joint distributions into a product of conditionals, which underpins language models (predicting the next word given all previous words). 4. Feature independence assumptions in models like Naive Bayes are statements about conditional probabilities.
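The card example can be verified by brute-force enumeration, which makes the "narrowed sample space" intuition concrete:

```python
from itertools import product

ranks = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
suits = ['spades', 'hearts', 'diamonds', 'clubs']
deck = list(product(ranks, suits))                    # 52 cards

face = [c for c in deck if c[0] in ('J', 'Q', 'K')]   # event B: 12 face cards
king_and_face = [c for c in face if c[0] == 'K']      # A ∩ B: the 4 kings

# P(King | Face) = |A ∩ B| / |B| = 4/12 = 1/3
p_king_given_face = len(king_and_face) / len(face)
```

Conditioning on B literally means iterating over `face` instead of `deck`: the denominator changes from 52 to 12.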
105
Are you familiar with Bayes Theorem? Can you tell me why it is useful?
Bayes' Theorem provides a way to reverse conditional probabilities. If you know P(B|A), it lets you compute P(A|B). The formula is: ```P(A|B) = P(B|A) · P(A) / P(B)``` Each term has a name that's worth knowing: P(A|B) — Posterior: what you want to know — the updated belief about A after observing B. P(B|A) — Likelihood: how probable the observed evidence B is if A were true. P(A) — Prior: your belief about A before seeing any evidence. P(B) — Evidence (marginal likelihood): the total probability of observing B under all possible hypotheses. Acts as a normalizing constant. The core intuition: Bayes' theorem is a principled framework for updating beliefs with evidence. You start with a prior belief, observe data, and arrive at a posterior belief. The more data you observe, the more the posterior is shaped by the likelihood rather than the prior. Why it's useful in ML: 1. Naive Bayes classifiers — directly apply Bayes' theorem to classify text, spam, sentiment, etc. by computing P(class | features) using the likelihood of each feature given the class. 2. Bayesian inference — instead of learning a single point estimate for model parameters, you maintain a full posterior distribution, which naturally quantifies uncertainty. 3. Bayesian optimization — used for hyperparameter tuning, where you build a probabilistic model of the objective function and update it as you evaluate new configurations. 4. Medical/fraud/anomaly detection — Bayes' theorem helps reason about rare events. A positive test result doesn't mean high probability of disease unless you account for the prior (base rate). This is the classic "base rate fallacy" that Bayes corrects for. 5. Probabilistic graphical models — Bayesian networks use Bayes' theorem to perform inference over complex joint distributions.
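The base-rate fallacy in point 4 is worth working through numerically. A sketch with hypothetical numbers (1% prevalence, 95% sensitivity, 90% specificity):

```python
# Hypothetical test characteristics
p_disease = 0.01                 # prior: 1% prevalence
p_pos_given_disease = 0.95       # likelihood: sensitivity
p_pos_given_healthy = 0.10       # false positive rate (1 - specificity)

# Evidence: total probability of testing positive (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# ≈ 0.088: under 9%, despite the test's 95% sensitivity
```

The low prior dominates: almost all positives come from the large healthy population, which is exactly the correction Bayes' theorem enforces.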
106
How can you get an unbiased estimate of the accuracy of the learned model?
Use Training-Validation-Test datasets. Also leverage Cross-Validation to reduce the variance in your estimate.
107
What is the definition of Bias (in bias-variance tradeoff, model bias)
This refers to the error introduced by approximating a complex real-world problem with a simplified model. A high-bias model makes strong assumptions about the data and tends to underfit — it misses relevant patterns. For example, fitting a linear model to data that has a quadratic relationship produces high bias. Formally, it's the difference between the expected prediction of the model and the true value.
108
What is the definition of Variance (in bias-variance tradeoff, model variance)
Variance in ML refers to the amount by which the model's predictions would change if it were trained on a different dataset drawn from the same distribution. It measures how sensitive the model is to the specific training data it saw.
109
What is a probabilistic graphical model?
A probabilistic graphical model is a powerful framework which represents the conditional dependency among random variables in a graph structure. It can be used in modeling a large number of random variables having complex interactions with each other.
110
Define non-negative matrix factorization. Give an example of its application.
Matrix factorization means factorizing a matrix into 2 or more matrices such that the product of these matrices approximates the actual matrix. This technique can greatly simplify complex matrix operations and can be used to find the latent features in given data. An example of this is in Recommendation Systems, where it could be used to find the similarities between two users. In non-negative matrix factorization, a matrix is factorized into 2 sub-matrices such that all 3 matrices have no negative elements.
111
What does it mean to fit a model? How do the hyperparameters relate?
Fitting a model is the process of learning the parameters of a model using the training dataset. Parameters help define the mathematical formulas behind the Machine Learning models. Hyperparameters are the "high-level" parameters that cannot be learned from the data. They define the properties of a model, such as the model complexity or learning rate.
112
When would you use standard Gradient Descent over SGD and vice-versa?
Gradient Descent theoretically minimizes the error function better than SGD, however SGD converges much faster once the dataset becomes large. Thus GD is preferable for small datasets, while SGD is preferable for larger ones. In practice, SGD is often used because it minimizes the error function well enough while being much faster and more memory efficient for large datasets.
113
What could be the reason for GD to converge slowly or not converge at all in various ML algorithms?
1. If the learning rate is too small, convergence can take a very long time. If it is too large, the updates may jump around the optimum value and never converge.
2. Convergence can be slow when the (Hessian) matrix is Symmetric Positive Definite (SPD) with very different eigenvalues. The eigenvalues determine the curvature of the function; in the SPD case they are all positive and generally different, which leads to non-circular contours. Because of this, converging to the optimal point takes many steps. In short, the more circular the contours, the faster the algorithm converges.
3. Sometimes, due to rounding errors, GD may not terminate at all. GD generally stops when the expected cost/error is either zero or very small. However, rounding errors might prevent the error from ever becoming exactly zero, in which case the algorithm keeps iterating.
4. If the function does not have a minimum, GD will continue to descend forever.
5. Some functions are not differentiable in certain regions, and the gradient cannot be calculated at those points.
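Point 1 can be demonstrated on the simplest possible objective, f(w) = w² with gradient 2w, where each update multiplies w by (1 - 2·lr); the two learning rates below are hypothetical:

```python
def gd(lr, steps=50, w0=1.0):
    """Gradient descent on f(w) = w^2 (gradient is 2w)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # w is scaled by (1 - 2*lr) each step
    return w

small = gd(lr=0.1)   # factor |1 - 0.2| = 0.8 < 1  -> converges to 0
big   = gd(lr=1.1)   # factor |1 - 2.2| = 1.2 > 1  -> oscillates and diverges
```

The stability condition here is |1 - 2·lr| < 1, i.e. lr < 1; for general quadratics the bound is set by the largest eigenvalue of the Hessian, which connects directly to the curvature discussion in point 2.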
114
How much data should you allocate for training, validation, and test datasets?
There's no correct answer, it truly depends. It is common to do an 80:20 train-test split, and to further split the training 80% into training and validation sets. With massive data for deep learning, that 80:20 split can be more skewed, to 90:10 or even 95:5.
115
What do you mean by paired t-test? Where would you use it?
A paired t-test is a statistical procedure which is used to determine whether the mean difference between two sets of observations is zero or not. It has 2 hypotheses, the null hypothesis and the alternative hypothesis. The null hypothesis (H_0) assumes that the true mean difference between the paired samples is zero. Conversely, the alternative hypothesis assumes that the true mean difference is not equal to zero. We use paired t-test to compare the means of the two samples in which the observations in one sample can be paired with the observations in the other sample.
116
Define F-Test. Where would you use it?
An F-test is any statistical hypothesis test where the test statistic follows an F-distribution under the null hypothesis. If you have 2 models that have been fitted to a dataset, you can use F-test to identify the model which best fits the sample population.
117
What is a chi-squared test?
A chi-squared test is any statistical hypothesis test where the test statistic follows a chi-squared distribution (a distribution of the sum of squared standard normal deviates) under the null hypothesis. It measures how well the observed distribution of data fits the expected distribution under the assumption that the variables are independent.
118
What is an F1 score?
The F1 score is a measure of a model's performance. It is the weighted average of the precision and recall of a model. The result ranges from 0 to 1 with 0 being the worst and 1 being the best model. F1 score is widely used in the fields of Information Retrieval and NLP. F1 Score = 2*Precision*Recall/(Precision + Recall)
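A minimal sketch of the formula, with hypothetical confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 8 true positives, 2 false positives, 2 false negatives
# precision = recall = 0.8, so F1 = 0.8
```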
119
What do you understand by Type I and Type II errors?
Type I error occurs if you reject the null hypothesis when it was true, also known as a False Positive. Type II error occurs if you accept the null hypothesis when it was false, also known as False Negative.
120
What is a Bayesian Classifier?
A Bayesian classifier is a probabilistic model which tries to minimize the probability of misclassification. From the training dataset, it calculates the probabilities of the features, given the class labels and uses this information in the test dataset to predict the class given (some of) the feature values by using the Bayes rule.
121
How can you use Naive Bayes classifier for categorical features? What if some features are numerical?
You can use any kind of predictor in a Naive Bayes classifier. All it needs is the conditional probability of a feature given the class, i.e., P(F|Class). For categorical features, you can estimate P(F|Class) using a distribution such as the multinomial or Bernoulli. For numerical features, you can estimate P(F|Class) using a continuous distribution such as the Gaussian (Normal). Since Naive Bayes assumes the conditional independence of features, it can use different types of features together: you calculate each feature's conditional probability and multiply them together to get the final prediction.
122
Why is Naive Bayes called "naive"?
Naive Bayes assumes all the features in a dataset are equally important and conditionally independent of each other. These assumptions are rarely true in real world scenarios which is why Naive Bayes is called "Naive".
123
Compare the time complexity of training Naive Bayes vs Logistic Regression.
Let n be the number of training examples and d the number of features. Naive Bayes trains in a single pass over the data, O(nd), since it only counts feature statistics per class. Logistic Regression has no closed-form solution, so it must be fit iteratively (e.g., with gradient descent) at O(nd) per iteration, making its total training time substantially higher in practice.
124
What is the difference between a generative approach and a discriminative approach? Give an example of each.
A generative model learns the joint probability distribution P(x, y) whereas a discriminative model learns the conditional probability distribution P(y|x), where y is the output class label and x is the input variable. Generative models learn the distribution of the individual classes whereas discriminative models learn the boundary between classes. Naive Bayes is a generative approach as it generates the joint probability distribution of the features and the output label using P(Y) and P(X|Y), whereas Logistic Regression is a discriminative approach because it tries to find a hyperplane which separates the classes.
125
Explain prior probability, likelihood and marginal likelihood in the context of Naive Bayes algorithm.
Prior probability is the proportion of each class of the (binary) dependent variable in the dataset. It is the closest guess you can make about a class without any further information. For example, say the dependent variable is binary, spam or not spam, with 75% spam and 25% not spam; you can then estimate a 75% chance that any new email is spam. Likelihood is the probability of an observation given a class. For example, the probability that the word "CASH" appears in a message, given that the message is spam, is a likelihood. Marginal likelihood is the probability that the word "CASH" is used in any message, regardless of class.
126
Define laplace estimate. What is m-estimate?
The Laplace estimate (add-one smoothing) prevents zero probabilities by adding 1 to every count: ```P(x) = (count(x) + 1) / (N + d)```. It's essential for Naive Bayes where a single zero probability would zero out the entire prediction. The m-estimate generalizes this as ```P(x) = (count(x) + m·p) / (N + m)```, where m controls smoothing strength and p is the prior. Laplace is the special case where ```m = d``` and ```p = 1/d```. The m-estimate offers more flexibility when you have domain knowledge or want to tune the bias-variance tradeoff in probability estimates.
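The two estimates can be sketched directly (a minimal illustration, not a full Naive Bayes implementation):

```python
def laplace_estimate(count, total, num_values):
    """Add-one smoothing: P(x) = (count(x) + 1) / (N + d)."""
    return (count + 1) / (total + num_values)

def m_estimate(count, total, m, prior):
    """General smoothing: P(x) = (count(x) + m*p) / (N + m)."""
    return (count + m * prior) / (total + m)

# A zero count no longer yields a zero probability:
# laplace_estimate(0, 10, 4) gives 1/14 instead of 0
```

Setting m = d and prior = 1/d in the m-estimate reproduces the Laplace estimate, as the card states.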
127
What is a confusion matrix? Explain it for a 2-class problem.
A confusion matrix is a table layout which describes the performance of a model on the test dataset for which the true values are known. For a binary or 2-class classification, where the label can take two values, 0 (false) and 1 (true), it is a 2x2 table whose rows are actual classes and whose columns are predicted classes. The four cells are True Positives (actual 1, predicted 1), False Negatives (actual 1, predicted 0), False Positives (actual 0, predicted 1), and True Negatives (actual 0, predicted 0).
128
Compare Logistic Regression with Decision Trees.
Decision Trees partition the feature space into smaller and smaller subspaces, whereas Logistic Regression fits a single hyper-surface to divide the feature space into two. When the classes are not well separated, Decision Trees are susceptible to overfitting, whereas Logistic Regression, being simpler and having less variance, is less prone to overfitting and generalizes better. So, for datasets with very high dimensionality, it is better to use Logistic Regression to avoid the Curse of Dimensionality.
129
How can you choose a classifier based on the size of training set?
If the training set is small, high bias/low variance models such as Naive Bayes tend to perform better because they are less likely to overfit. If the training set is large, low bias/high variance models such as Decision Trees can perform better because they can reflect more complex relationships.
130
What do you understand by the term "decision boundary"?
A decision boundary or decision surface is a hypersurface which divides the underlying feature space into two subspaces, one for each class. If the decision boundary is a hyperplane, then the classes are linearly separable.
131
What are some reasons where you would want to use a Decision Tree?
When you fit a Decision Tree to a training dataset, the top few nodes on which the tree is split are basically the most important features in the dataset, and thus you can use it for feature selection to select the most relevant features. Decision Trees are also insensitive to outliers, since splitting happens based on the proportion of samples within the split ranges and not on absolute values. Finally, their tree-like structure makes them very easy to understand and interpret. They do not need data to be normalized and work well even when features have nonlinear relationships with the target.
132
What are some of the disadvantages of using a Decision Tree algorithm?
1. Even a small change in input data can, at times, cause large changes in the tree, as it may drastically impact the information gain used by Decision Trees to select features. 2. Decision trees, moreover, examine only a single feature at a time, leading to rectangular classification boxes which may not correspond well with the actual distribution of records in the decision space. 3. Decision Trees are inadequate when it comes to regression and predicting continuous variables: a continuous variable can take an infinite number of values within an interval, which is very hard to capture in a tree with only a finite number of branches and leaves. 4. There is a possibility of duplication, with the same sub-tree appearing on different paths, leading to complex trees. 5. Every feature in the tree is forced to interact with every feature further up the tree, which is extremely inefficient if there are features with no or weak interactions.
133
Define entropy. Then, provide the Numpy implementation.
Entropy is a measure of uncertainty associated with a random variable, Y. It is the expected number of bits required to communicate the value of the variable. It is calculated as −Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of class i in a node. It ranges from 0 (pure node) to log₂(k) for k classes.
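The card asks for a NumPy implementation; a minimal version over an array of class labels:

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i), over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# A 50/50 split gives 1 bit; a pure node gives 0 bits
```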
134
What is meant by "information gain"?
Information gain is used to identify the best feature to split a given training dataset. It selects the split S that most reduces the conditional entropy of output Y for the training set D. In simple terms, the Information Gain is the change in the Entropy, H, from a prior state to a new state when split on a feature. Formally: IG(D, S) = H(D) − H(D|S), where H(D) is the entropy of the dataset before splitting and H(D|S) is the conditional entropy after splitting on feature S. When splitting for a decision tree, we use the weighted average of child entropies to calculate the information gain for the split (much like with Gini impurity): Formula: IG(D, S) = H(D) − Σ (|Dᵥ| / |D|) · H(Dᵥ) Where H(X) = −Σ p(xᵢ) · log₂(p(xᵢ)) is the entropy, Dᵥ are the subsets after splitting on feature S, and |Dᵥ|/|D| is the weight for each subset.
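A minimal NumPy sketch of the formula, assuming a categorical feature (the entropy helper is included for self-containment):

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, feature):
    """IG(D, S) = H(D) - sum_v (|D_v|/|D|) * H(D_v)."""
    y, feature = np.asarray(y), np.asarray(feature)
    weighted_child = sum(
        (feature == v).mean() * entropy(y[feature == v])
        for v in np.unique(feature)
    )
    return entropy(y) - weighted_child

# A split that perfectly separates the classes gains the full 1 bit;
# a split that leaves each child 50/50 gains nothing
```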
135
How can Information Gain be biased or less optimal?
Information Gain is biased towards the tests with many outcomes. For instance, consider a feature that uniquely identifies each training sequence. Splitting on this feature would result in many branches, each of which is "pure" (has instances of only one class), i.e., maximal information gain, and this affects the model's generalization accuracy. To address this limitation, the C4.5 algorithm uses a splitting criterion known as the Gain Ratio. Gain Ratio normalizes the Information Gain by dividing it by the entropy of the split being considered, thereby avoiding the unjustified favoritism of Information Gain: GainRatio(D, S) = IG(D, S) / SplitInfo(D, S), where SplitInfo(D, S) = −Σ (|Dᵥ| / |D|) · log₂(|Dᵥ| / |D|) is the entropy of the split itself.
136
What are 4 splitting rules used by different Decision Tree algorithms?
1. Information Gain 2. Gain Ratio 3. Gini Impurity 4. Multi-variate split - Multivariate decision trees can use splits that contain more than one attribute at each internal node.
137
Is using an ensemble like Random Forest always good?
Always using an ensemble may seem like a better approach than a single Decision Tree, but Random Forests have their own limitations. These include: 1. Ensembles generally do not perform well when the relationship between dependent and independent variables is highly linear. 2. Unlike Decision Trees, the classification made by Random Forests is difficult to interpret easily. 3. Random Forest ensembles are more computationally expensive than a single Decision Tree (though their trees can be trained and evaluated in parallel, unlike Gradient Boosted Machines, which are trained sequentially).
138
What is pruning (in decision trees) and why is it important?
Pruning is a technique which reduces the complexity of the final classifier by removing sub-trees whose existence does not impact the accuracy of the model. In pruning, you grow the complete tree and then iteratively prune back some nodes until further pruning is harmful. This is done by evaluating the impact of pruning each node on the tuning (validation) dataset accuracy and greedily removing the one that most improves the tuning dataset accuracy. One simple way of pruning a Decision Tree is to impose a minimum number of training examples that must reach a leaf. Pruning keeps the tree simple without affecting the overall accuracy. It helps solve the overfitting issue by reducing both the size and the complexity of the tree.
139
What are four advantages and four disadvantages of using K-Nearest Neighbors?
Advantages: 1. Simple to understand and implement. KNN requires no explicit training phase — it simply stores the training data and defers all computation to prediction time (this is what "lazy learner" means). 2. Flexible choice of distance metrics and features. You can adapt KNN to different problem types by choosing appropriate distance functions (e.g., Euclidean, Manhattan, cosine similarity). 3. Naturally handles multi-class classification. Unlike some algorithms that require special adaptations for more than two classes, KNN works seamlessly with any number of classes. 4. No assumptions about data distribution. KNN is non-parametric, meaning it makes no assumptions about the underlying shape of the data, which makes it versatile across many problem types. Disadvantages: 1. High memory usage. Because KNN stores the entire training dataset and uses it at prediction time, it can be very memory-intensive for large datasets. 2. Slow prediction time at scale. For each new prediction, KNN must compute distances to every training point, making inference slow when the training set is large. 3. Sensitive to irrelevant or poorly scaled features. If features aren't carefully selected or normalized, irrelevant dimensions can dominate the distance calculation and hurt accuracy. 4. Requires a large, representative training set. KNN needs sufficient data density to find meaningful nearest neighbors — with sparse or small datasets, predictions can be unreliable.
140
How do you choose the optimal k in k-NN?
Here are some solid methods for choosing optimal k in k-NN: 1. Cross-validation (most reliable): train with different k values and evaluate using k-fold cross-validation; pick the k that minimizes validation error. This is the gold standard approach. 2. The √n rule of thumb: start with k ≈ √n, where n is the number of training samples; it's a quick baseline, not a final answer. 3. Elbow method: plot error rate vs. k and look for the "elbow", the point where error stops decreasing meaningfully; beyond that point you're gaining little while increasing bias. 4. Odd k for binary classification: always prefer odd k values when you have 2 classes to avoid tie-breaking ambiguity. And here are some practical considerations: 1. Small k → low bias, high variance (overfitting). 2. Large k → high bias, low variance (underfitting). 3. Larger datasets generally tolerate larger k. 4. Use weighted k-NN (distance-weighted votes) to reduce sensitivity to the choice of k.
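The cross-validation approach can be sketched with a plain-NumPy k-NN and leave-one-out error; the two-cluster dataset is a hypothetical illustration (real code would typically use a library such as scikit-learn):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

def loo_error(X, y, k):
    """Leave-one-out cross-validation error rate for a given k."""
    idx = np.arange(len(y))
    wrong = sum(
        knn_predict(X[idx != i], y[idx != i], X[i], k) != y[i]
        for i in range(len(y))
    )
    return wrong / len(y)

# Hypothetical two-cluster dataset; pick the k with the lowest LOO error
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
best_k = min([1, 3, 5, 7], key=lambda k: loo_error(X, y, k))
```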
141
What is Hamming Distance?
Hamming Distance is the number of positions at which two equal-length sequences differ. Examples: 1011101 vs 1001001 → 2 positions differ → Hamming distance = 2 "karolin" vs "kathrin" → 3 positions differ → Hamming distance = 3
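A one-function sketch of the definition:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# hamming_distance("1011101", "1001001") -> 2
# hamming_distance("karolin", "kathrin") -> 3
```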
142
If you have a lot of noise in your dataset, how would you vary k for k-NN?
You should increase k to handle any noise. A large k value would average out or nullify noise or outliers in a given dataset.
143
What is a t-distribution?
A t-distribution is a probability distribution similar to the normal distribution but with heavier tails, used when working with small sample sizes or when the population standard deviation is unknown. One liner: "It's the normal distribution, but more uncertain - because you're estimating variance from data, not assuming you know it."
144
What are the two ways you can speed up the k-NN's computation (including both training and testing) time?
1. Edited nearest neighbors - Instead of retaining all the training instances, select a subset of them which can still provide accurate classifications. Use either forward selection or backward elimination to select the subset of the instances which can still represent other instances. 2. K-dimensional Tree - It is a smart data structure used to perform nearest neighbor and range searches. A k-d tree is similar to a decision tree except that each internal node stores one data instance (i.e., each node is a k-dimensional data point) and splits on the median value of the feature having the highest variance.
145
Define logistic regression.
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome that can take only a limited number of values, i.e. the response variable is categorical in nature. Logistic regression is a go-to method for classification problems when the response (output) variable is binary.
146
What are three ways you can evaluate a Logistic Regression model?
1. AUROC: You can use the AUROC curve along with a confusion matrix, plus recall, precision, accuracy, and F1 score. 2. AIC (Akaike Information Criterion): analogous to adjusted R^2 in linear regression. AIC is a measure of fit that penalizes the model for the number of model coefficients. We prefer a model with a minimum AIC value. 3. Deviance: Deviance represents the goodness of fit for a model; we prefer a model with a lower deviance value. Null deviance is the deviance of a model with only an intercept, while residual deviance is the deviance of the fitted model with its full weight vector.
147
What is a link function in Logistic Regression?
A link function provides the relationship between the expected value of the response variable and the linear predictor. Logistic Regression uses the logit as its link function: logit(p) = log(p/(1−p)) = wx, which inverts to P(y) = 1/(1+e^(−wx)).
148
What is the range of Logistic Regression?
(0, 1)
149
When is Logistic Regression multinomial?
Logistic Regression is multinomial when the number of classes to separate are more than two. A Multinomial Logistic Regression algorithm predicts the probabilities of each possible class as the outcome.
150
What is One vs All Logistic Regression?
In One Vs All, if there are n classes, then you have n different independent classification problems, one for each class. For the ith classification problem, you learn all the points which belong to class i, and all the other points are assumed to belong to a pseudo class "not i". For new test data, you run all n classifiers and predict the class whose classifier outputs the highest confidence score.
151
What can you do to speed up your logistic regression training without compromising a lot on the model's accuracy, if your training dataset is huge?
Reducing the number of iterations during gradient descent would reduce the training time, but it will hamper the accuracy as well. Instead, you can increase the learning rate to speed up convergence while still maintaining similar accuracy. Alternatively, you can use learning rate decay to keep it high for fast initial convergence, then reduce it to settle into a minimum. You can also consider Momentum, RMSProp, and Adam.
152
What do you understand by "maximal margin classifier"? Why is it beneficial?
(This is related to Support Vector Machines). A margin gives the distance of a data instance from the decision boundary. In the case of Support Vector Machines, the decision boundary is a hyperplane separating the two class labels. A "maximal margin classifier" draws the separating hyperplane so that its distance to the nearest instances of both classes is maximal, i.e., the hyperplane is at an equal distance from the closest points of each class. The maximal margin hyperplane is the optimal separating hyperplane and is less prone to overfitting.
153
How do you train a Support Vector Machine (SVM)? What about hard SVM and soft SVM?
Core Idea: An SVM finds the optimal hyperplane that separates classes by maximizing the margin — the distance between the hyperplane and the nearest data points from each class (called support vectors). Hard-Margin SVM: Assumption: Data is perfectly linearly separable — no misclassifications allowed. Soft-Margin SVM: Assumption: Data may not be perfectly separable — allows some misclassifications. Quick Comparison: Hard-SVM: 1. Must be linearly separable 2. No slack variables 3. No hyperparameter 4. Highly sensitive to outliers 5. Rare in real-world use Soft-SVM: 1. Works with overlapping classes 2. Slack variables ξᵢ ≥ 0 (xi, pronounced "ksai" or "zai") 3. The C hyperparameter is used for regularization; if C = infinity, you recover Hard-SVM. Small C tolerates more violations, giving a wider margin and better generalization; large C penalizes violations heavily, giving a narrower margin and less tolerance. 4. Outlier sensitivity controlled by C 5. Standard in real-world use. Soft-SVM is typically always used IRL because real data is noisy and rarely perfectly separable.
154
What is a kernel? Explain the Kernel trick.
A kernel is a function K(xi, xj) that computes the dot product between two data points in a higher-dimensional feature space, without explicitly transforming the data into that space: ```K(xi, xj) = ϕ(xi)⋅ϕ(xj)``` where ϕ is a mapping to a higher-dimensional space - but you never actually compute ϕ. The Problem Kernels Solve: Many real-world datasets are not linearly separable in their original space. The naive solution is to map data to a higher-dimensional space where it becomes linearly separable: ```x ∈ R^d ⟶ ϕ ⟶ ϕ(x) ∈ R^D (D ≫ d)``` But this can be quite expensive if D is huge (or infinite!) and become computationally intractable. The Kernel Trick: The key insight is that the SVM dual formulation only ever uses the data through dot products xi⋅xj. So you can substitute: ```xi⋅xj ⟶ K(xi, xj) = ϕ(xi)⋅ϕ(xj)``` You get the power of a high-dimensional transformation at the cost of a simple pairwise function evaluation. This is the kernel trick: you never compute ϕ(x) explicitly - you only ever evaluate K(xi, xj) directly in the original space.
155
What are four common Kernels, their formulas, and their use cases?
1. Linear: ```xi⋅xj``` - used for linearly separable data. 2. Polynomial: ```(xi⋅xj + c)^d``` - use when there is moderate non-linearity. 3. RBF/Gaussian: ```exp(−‖xi − xj‖^2 / (2σ^2))``` - general purpose, the most popular. 4. Sigmoid: ```tanh(α xi⋅xj + c)``` - used for neural-net-like boundaries. Note that the RBF kernel maps data to an infinite-dimensional space - yet computing it is just a single exponential evaluation!
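The RBF kernel can be sketched in NumPy; the sample points below are hypothetical. The resulting Gram matrix is symmetric with ones on the diagonal, since every point is at distance 0 from itself:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Hypothetical sample points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X)
```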
156
Recall the concrete example of why the kernel trick works.
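A classic concrete example (one common choice; the author's original example may have differed) is the degree-2 polynomial kernel in R^2. Take ```K(x, z) = (x⋅z)^2``` with x = (x1, x2) and z = (z1, z2). Expanding: ```(x1 z1 + x2 z2)^2 = x1^2 z1^2 + 2 x1 x2 z1 z2 + x2^2 z2^2 = ϕ(x)⋅ϕ(z)``` where ```ϕ(x) = (x1^2, √2 x1 x2, x2^2)```. So squaring a single 2-D dot product is exactly equivalent to mapping both points into a 3-D feature space and taking the dot product there - yet K never constructs ϕ(x). For a degree-d polynomial kernel on n features, the explicit feature space has combinatorially many dimensions, while evaluating K remains one dot product plus a power.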
157
When training a Support Vector Machine, what value are you optimizing for?
The SVM problem can have many possible hyperplanes which separate the positive and negative instances, but the goal is to choose the hyperplane which maximizes the margin between the classes. The reasoning behind this optimization is that a hyperplane which not only separates the training instances but is also as far away from them as possible generalizes the best and does not result in overfitting.
158
How does a kernel method scale with the number of instances? (e.g. a Gaussian rbf kernel)?
A kernel method generally constructs the kernel matrix of order R^(N x N), where N is the number of data instances. Hence, the complexity of a kernel method depends on the number of data instances, not on the number of features. A kernel method scales quadratically with N for constructing the Gram matrix, and cubically for operations such as matrix inversion.
159
List three ways to overcome scaling issues with SVMs.
1. Nystrom Method - Kernel matrix computation varies quadratically with N, the number of data instances, which becomes a bottleneck when N becomes very large. To alleviate this issue, the Nystrom approximation is used, which generates a low-rank kernel matrix approximation of rank d << N. 2. Random features with approximate nearest-neighbor queries - map the input data to a randomized low-dimensional feature space such that inner products of the transformed data approximately equal the kernel values in the original space. 3. Distributed/Parallel training algorithms and applying multiple SVM classifiers together.
160
What are the pros and cons of using Gaussian processes or general kernel methods approach to learning for SVMs?
Pros: General kernel methods can work well with non-linearly separable data, are non-parametric, and more accurate in general. Cons: They do not scale well with the number of data instances and require hyperparameter tuning.
161
Can you find the solutions in SVMs which are globally optimal?
Yes, since the learning task is framed as a convex optimization problem, which is bound to have one optimum solution only, and that is the global minimum. There is only a single global minimum in the case of SVM, as opposed to a multi-layer neural network, which has multiple local minima; the solution achieved there may or may not be a global minimum, depending upon the initial weights.
162
What is an Artificial Neural Network?
An Artificial Neural Network (ANN) is a computational model inspired by biological neural networks, used as a general-purpose function approximation tool. Typically, ANNs are organized in layers. The first layer consists of input neurons, which pass the input data on to hidden layers (where each neuron is called a hidden unit), which in turn pass their outputs on to the final output layer.
163
What are five advantages and four disadvantages of using an ANN?
Advantages: 1. It is a nonlinear model that is easy to use and understand as compared to statistical methods. 2. It has the ability to implicitly detect complex nonlinear relationships between the dependent (output) and independent (input) variables. 3. It can easily train on a large amount of data. 4. It can be easily run in a parallel architecture, thereby drastically reducing the computation time. 5. It is a non-parametric model (does not assume the data distribution to be based on any finite set of parameters such as mean or variance) which does not need a lot of statistics background. Disadvantages: 1. Because of its black-box nature, it is difficult to interpret how the output is generated from the input. 2. It cannot extrapolate the results. One reason for this shortcoming can be its non-parametric nature. 3. It can suffer from overfitting easily. Due to a large number of hidden units (neurons), ANNs can be very complex models which often leads to overfitting on the training dataset and poor performance on the test dataset. Regularization and early stopping can help generalize the model and reduce overfitting. 4. ANNs generally converge slowly. Can be sped up with Momentum, RMSProp, and Adam.
164
What is a "perceptron"?
A perceptron is an algorithm which learns a binary classifier by directly mapping the input vector "x" to the output response "y" with no hidden layers. ```y = 1 if w.x + b > 0 else y = 0``` where w is a vector of real-valued weights representing the slope and b is the bias, representing the horizontal shift of the output vs input curve from the origin.
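The prediction rule above, plus the classic perceptron learning rule, can be sketched as follows (the AND-gate data is an illustrative choice, since AND is linearly separable):

```python
import numpy as np

def perceptron_predict(w, b, x):
    """y = 1 if w.x + b > 0 else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def perceptron_train(X, y, lr=1.0, epochs=10):
    """Classic perceptron rule: for each misclassified point, move the
    weights toward positives and away from negatives."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - perceptron_predict(w, b, x_i)
            w += lr * error * x_i
            b += lr * error
    return w, b

# Hypothetical example: learn the (linearly separable) AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
```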
165
What is the role of hidden units in ANNs?
Hidden units transform the input space into a new space where the perceptrons suffice. They numerically represent new features constructed from the original features in the input layer. Each hidden layer (consisting of hidden units) transforms its input layer in a new feature space which is easier for the output layer to interpret. For instance, you have a raw image supplied as the input layer, and the first hidden layer transforms the raw pixel data into the edges in the image, the second hidden layer detects shapes from the edges, and the output layer performs object recognition on those shapes.
166
What is an activation function?
An activation function, also known as the transfer function, computes the output of a hidden or output unit from its weighted input sum. It can be the identity, the sigmoid function, tanh, ReLU, etc.
167
Does gradient descent converge to a global minimum in a single-layered network? What about a multi-layered network?
Since the error surface of a single-layered neural network is convex, gradient descent is bound to converge to a global minimum. On the contrary, the error surface of a multi-layered neural network is not convex, and hence gradient descent may or may not converge to a global minimum, depending upon the initial weights.
168
How should you initialize weights for sigmoid units?
The weights should be initialized with small values so that the activations are in the range where the derivative is large (learning is quicker), and random values to ensure symmetry breaking (i.e., if all weights are the same, the hidden units will all represent the same thing). Typical initial weights are in the range of [-0.01, 0.01].
169
How should you set the value of the learning rate?
You can set the learning rate either using hyperparameter tuning or through the "hit and trial" method, depending on the particular problem (the hit and trial method is a problem-solving technique involving repeated, educated guesses). If the learning rate is set too small, convergence takes very long; if it is too large, you get divergence.
170
Can backpropagation work well with multiple hidden layers?
With many layers, back propagation can struggle to work well as the increase in layers can lead to vanishing or exploding gradients. We can mitigate this with residual connections.
171
What is the loss function in an Artificial Neural Network?
A loss function is a function which maps the values of one or more variables onto a real number that represents the "cost" associated with those values. For backpropagation, the loss function calculates the difference between the actual output value and its expected output. The loss function is also sometimes referred to as the cost function or error function.
172
How does an Artificial Neural Network with three layers (one input layer, one hidden layer, and one output layer) compare to a Logistic Regression?
Logistic Regression, in general, can be thought of as a single layer Artificial Neural Network. It is mostly used in cases where the classes are more or less linearly separable, whereas an ANN can solve much more complex problems. One of the nice properties of Logistic Regression is that the Logistic cost function is convex, which means that you are guaranteed to find the global minimum. But, in the case of a multi-layer neural network, you lose this convexity and may end up at a local minimum, depending upon the initial weights.
173
What do you understand by Rectified Linear Units?
Rectified linear unit (ReLU) is an activation function, given by f(x) = max(0, x). Because of its linear form, it greatly speeds up the convergence of stochastic gradient descent. It makes the activation sparse and efficient as it yields 0 activation for negative inputs. But ReLU can be fragile during training; that is, ReLU units can irreversibly die during training since they can get knocked off the data manifold. With a large learning rate, a large gradient can "kill" a ReLU such that its input becomes negative for every example. This leaves the ReLU in the f(x) = 0 region, making the gradient 0 and leading to no changes in the weights. Neurons which enter this state stop responding to any change in the error and hence "die".
174
Can you explain the Tangent Activation Function? How is it better than the sigmoid function?
The tangent activation function, also known as the tanh function, is a hyperbolic activation function often used in Neural Networks. Its formula is: ```g_tanh(x) = (e^x - e^-x)/(e^x + e^-x)``` The output of the tanh function lies in the range (-1, 1). This provides an advantage over sigmoid, whose output lies in (0, 1): tanh is zero-centered, so its activations do not push all downstream gradients in the same direction, which typically speeds up convergence. Note that both tanh and sigmoid saturate for large-magnitude inputs, where their gradients approach zero; only inputs near zero keep tanh in its high-gradient regime.
175
Why is the softmax function used as an output layer in neural networks?
In a neural network, the output variable is usually modeled as a probability distribution where the output nodes (the different values that the output variable can take) are mutually exclusive of each other. The softmax function is a generalization of the logistic function. It squashes the k-dimensional output vector into a k-dimensional probability distribution where each entry is the probability of the output variable taking that value. Hence, each output node takes a value in the range (0, 1) and the sum of the values of all the entries is 1.
176
What are 5 good steps to take when training a Deep Neural Network?
1. Deep Neural Networks are mostly data-hungry, so the more data you have, the better predictions you may get from them. 2. Hidden units - Having more hidden units is still acceptable, but if you have fewer than the optimal number of hidden units, your model may suffer from underfitting. 3. Use back-propagation with Rectified Linear Units (ReLU activation functions). 4. Always initialize the weights with small random numbers to break the symmetry between different units. 5. You can try a gradually decreasing learning rate, which reduces after every epoch or every few hundred instances, in order to speed up convergence. (Honestly, there are a lot more, but these 5 are listed in the book.)
177
Name three regularization methods that can be applied to Artificial Neural Networks.
Regularization is an approach used to prevent overfitting of a model. Three ways to perform regularization in Artificial Neural Networks are: 1. Early Stopping - This is an upper bound on the number of iterations to run before the model begins to overfit. 2. Dropout - This is a technique where you randomly drop units (along with their connections) from the neural network during training. This prevents the units from co-adapting too much and helps reduce overfitting. 3. L1 or L2 penalty terms - L1 and L2 are regularization techniques which add a penalty term, weighted by a parameter lambda, to shrink the coefficients and discourage overfitting.
178
What are autoencoders?
Autoencoders are artificial neural networks which belong to Unsupervised Learning Algorithms and are used to learn the encoding of the given dataset, typically for the purpose of Dimensionality Reduction. They consist of 2 parts: 1. Encoding (converting the higher-dimensional input to much lower-dimensional hidden layer(s)) 2. Decoding (converting the hidden layer(s) to the output). Autoencoders try to learn an approximation to the input, and not actually predict any output. They are extremely useful as they find the low-dimensional representation of the given dataset and also remove any redundancy present in it.
179
Describe Convolutional Neural Networks (CNNs) and their 4 primary "building blocks".
CNNs are well suited for tasks such as image recognition, in which the input has spatial structure. They are based on 4 building blocks: 1. Convolution - The primary purpose of Convolution is to extract features from the input image. A small matrix, known as a filter or kernel, slides over the image and the dot product is computed. This dot product is called the Convolved Feature or Feature Map. By varying the filter, you can achieve different results such as Edge Detection, Blur, etc. 2. Rectified Linear Units - The purpose of ReLU is to introduce nonlinearity, since most real-world data is nonlinear. It is applied after every convolution step. 3. Pooling or Subsampling - Spatial Pooling (also called downsampling) reduces the dimensionality of each feature map. 4. Classification (Fully Connected Layer) - It is a traditional Multi-layer Perceptron. The term "fully connected" implies that every neuron in the previous layer is connected to every neuron in the next layer. The outputs from the convolution and pooling layers represent high-level features of the input image. The purpose of the Fully Connected Layer is to use these features for classifying the input image into various classes based on the training dataset.
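The sliding dot product in step 1 can be sketched in a few lines of NumPy (an illustrative `conv2d` of our own, not a library API; real CNN layers add channels, strides, and padding):

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" cross-correlation, as used in CNN convolution layers:
    # slide the kernel over the image and take dot products
    h, w = kernel.shape
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.array([[1., 2., 0.],
                  [3., 4., 1.],
                  [0., 1., 2.]])
edge = np.array([[1., -1.],
                 [1., -1.]])  # a simple vertical-edge filter
fmap = conv2d(image, edge)    # the resulting feature map is (2, 2)
```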
180
Which one is better - random weights or same weights assignment to the units in the hidden layer?
The weights should be initialized with random values to ensure symmetry breaking (i.e. if all weights are the same, the hidden units will all represent the same thing). Typical initial weights are in the range [-0.01, 0.01].
181
If the weights oscillate a lot over training iterations (often swinging between positive and negative values), what parameter do you need to tune to address this issue?
The Learning Rate. If the learning rate is too high, it will cause the result to jump over the optimal point resulting in the weights oscillating between positive and negative. If it is too low, it may take a very long time to converge.
182
Tell me about Recurrent Neural Networks (RNNs)
The idea behind RNNs is to make use of sequential information. They are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations. They have applications in various NLP tasks such as Speech Recognition, Image Captioning, and Language Modelling. Unlike traditional Neural Networks, RNNs have loops in them, allowing information to persist. The figure shows an RNN being unrolled into a full network, which simply means writing out the network for the complete sequence, where x_i is the input at time i and h_i is the corresponding output. The output at time i depends on the previous information. For instance, predicting the next word in a sentence would depend on the words seen so far.
183
What is regression analysis?
Regression analysis is a set of statistical processes that estimate the relationship between the independent and dependent variables. The most common approach is to estimate the conditional expectation of the dependent variable given the independent variable (based on the assumption that the independent variables are linearly independent).
184
How does Regression belong to the Supervised Learning approach?
Regression belongs to the Supervised Learning category because it learns the model from a labeled dataset to predict continuous or discrete variables.
185
What are three types of regression?
1. Linear Regression - It tries to fit a straight line to model the relationship between the dependent variable and the independent variable. 2. Logistic Regression - It finds the probability of success. It is used when the dependent variable is binary. 3. Polynomial Regression - It fits a curve between dependent and independent variables, where the dependent variable is a polynomial function of the independent variable.
186
Can you think of a scenario where a learning algorithm with low bias and high variance may be suitable?
Low bias and high variance can be used in K-Nearest Neighbors. They have low bias because they do not assume anything special about the data distribution and high variance because they can easily change their prediction in response to the composition of the training set.
187
What can you interpret from regression coefficient estimates?
First off, you get an Intercept coefficient, B_0, and a set of B_i coefficients. The regression equation can be written as: ``` Y = B_0 + Sum(i=1 to N) of (B_i * X_i) ``` The Intercept, B_0, can be interpreted as the predicted value of the response variable when all the predictor values are 0. The Coefficients for Continuous predictors - For each continuous predictor X_i, its corresponding coefficient B_i represents the difference in the response variable's predicted value for each one-unit difference in X_i, keeping all other X_j constant. The Coefficients for Categorical predictors - For each categorical predictor X_i, since its levels can be coded as 0, 1, 2, etc., a one-unit difference in X_i represents switching from one category to another, keeping all other X_j constant.
188
What are the downfalls of using too many or too few variables for performing regression?
Too many variables can cause overfitting. If you have too many variables in your regression model, then your model may suffer from a lack of degrees of freedom and have some variables correlated with each other. Having too few variables, on the other hand, will lead to underfitting, as you won't have enough predictors to learn from the training dataset.
189
What is linear regression? Why is it called linear?
In linear regression, the dependent variable y is the linear combination of the parameters. For instance, if x is the independent variable, and Beta_0 and Beta_1 are two parameters: ``` y = Beta_0 + Beta_1*x ``` Note that instead of x, you can have any function of x, such as x^2. In that case: ``` y = Beta_0 + Beta_1*x^2 ``` This is still a linear regression as y is still a linear combination of the parameters (Beta_0 and Beta_1).
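A quick NumPy sketch of this point: fitting y = Beta_0 + Beta_1*x^2 by least squares, which remains linear in the parameters even though the feature is quadratic in x (the data and coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
# True relationship is quadratic in x, but linear in the Betas
y = 1.5 + 0.8 * x**2 + rng.normal(0, 0.1, size=100)

# Design matrix: a column of ones (intercept) and a column of x^2
X = np.column_stack([np.ones_like(x), x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates of Beta_0 and Beta_1
```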
190
What is an embedding layer, and how does it relate mathematically to one-hot encoding followed by a matrix multiply?
An embedding layer is a trainable lookup table that maps integer indices to dense vectors. Mathematically, it's equivalent to multiplying a one-hot vector by a weight matrix W — but since that multiplication just selects the ith row of W (where i is the index of the 1 in the one-hot vector), an embedding layer skips the multiply entirely and directly indexes W[i]. Same result, much cheaper computation.
191
What two problems do embeddings solve compared to using one-hot encoded vectors as features?
First, dimensionality — one-hot vectors grow to the size of the vocabulary (e.g., 50,000 for words), which is wasteful since they're almost entirely zeros. Embeddings compress this to a chosen dense dimension (e.g., 256). Second, similarity — one-hot vectors imply zero similarity between all categories (every pair is equidistant), while embedding vectors are learned during training so that semantically similar items end up with nearby vectors in the embedding space.
192
Walk through the full pipeline of how a categorical feature becomes an embedding vector.
Three steps: (1) Map the category string to an integer index via a vocabulary mapping (e.g., "blue" → 1). (2) Use that integer to index into the embedding weight matrix W of shape (vocab_size, embed_dim). (3) Return W[index], which is a dense, trainable vector. During training, backpropagation updates only the rows of W that were looked up in each batch.
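The three steps, and the equivalence to a one-hot matrix multiply, can be sketched in NumPy (the vocabulary and dimensions are illustrative):

```python
import numpy as np

# (1) Vocabulary mapping: category string -> integer index
vocab = {"red": 0, "blue": 1, "green": 2}
rng = np.random.default_rng(42)
W = rng.normal(size=(len(vocab), 4))  # (vocab_size, embed_dim)

idx = vocab["blue"]

# One-hot vector followed by a matrix multiply...
one_hot = np.zeros(len(vocab))
one_hot[idx] = 1.0
via_matmul = one_hot @ W

# ...is identical to (2)-(3): directly indexing row idx of W
via_lookup = W[idx]
print(np.allclose(via_matmul, via_lookup))
```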
193
How can you check if a regression model fits data well?
You can use the following statistics to test the model's fitness: 1. R-squared - measures how much of the variation in your outcome variable is explained by your model's predictors. 2. F-Test - evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative hypothesis that at least one is not. It is used to identify the best model which fits the given dataset. 3. Root Mean Squared Error (RMSE) - The square root of the variance of the residuals. It measures the average deviation of the estimates from the observed value.
194
When/how can you use k-Nearest Neighbors for regression?
You can use K-NN in regression to estimate continuous variables. One common approach is to predict a weighted average of the k nearest neighbors, weighted by the inverse of their distance.
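A minimal 1-D NumPy sketch of inverse-distance-weighted K-NN regression (our own illustrative function, with a small epsilon to avoid division by zero):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=3):
    # Find the k training points closest to the query x
    dists = np.abs(X_train - x)
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights: closer neighbors count more
    w = 1.0 / (dists[nearest] + 1e-8)
    return np.sum(w * y_train[nearest]) / np.sum(w)

X_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
print(knn_regress(X_train, y_train, 1.4, k=2))
```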
195
In regression, do you always need the intercept term? When do you need it and when don't you?
The intercept term signifies the response variable's shift from the origin. It ensures that the model is unbiased, i.e., the residual mean is 0. If you omit the intercept term, then your model is forced to go through the origin and the slope would become steeper (and biased). Hence, you should not remove the intercept term unless you are completely sure that it is 0. For instance, if you are calculating the area of a rectangle, with height and width as the predictor variables, you can omit the intercept term since you know that the area should be 0 when both height and width are 0.
196
What is meant by "collinearity"?
Collinearity is a phenomenon in which two predictor variables are linearly related to each other. Let X1 and X2 be two variables; then: ```X1 = lambda_0 + lambda_1*X2``` where lambda_0 and lambda_1 are constants. X1 and X2 are perfectly collinear when this relationship holds exactly for every observation.
197
Explain multicollinearity.
Multicollinearity is a phenomenon in regression where one predictor (independent variable X_i) can be predicted as a linear combination of the other predictors with significant accuracy. Perfect multicollinearity means an exact linear relationship: ``` lambda_0 + lambda_1*X1 + lambda_2*X2 + ... + lambda_k*Xk = 0 ``` The issue with perfect multicollinearity is that it makes X.transpose @ X non-invertible, and the Ordinary Least Squares method needs to invert this matrix to find the optimal estimates. So, you would need to first remove the redundant feature and then perform OLS. Note that OLS is a method for estimating the unknown parameters in a linear regression model by minimizing the sum of the squares of the differences between the observed and predicted response variable.
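The non-invertibility is easy to see numerically; a small NumPy sketch with a deliberately redundant feature (the data is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 2.0 * x1 + 1.0  # x2 is an exact linear function of x1

# Design matrix with intercept column: rank-deficient by construction
X = np.column_stack([np.ones(50), x1, x2])

# X^T X is singular under perfect multicollinearity, so the OLS
# normal equations have no unique solution
rank = np.linalg.matrix_rank(X.T @ X)
print(rank)  # rank 2 for a 3x3 matrix, i.e. singular
```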
198
What are the five assumptions that standard linear regression models with standard estimation techniques make?
The standard linear regression model makes the following five assumptions: 1. A linear relationship between the parameters and response variable exists. 2. The residuals follow the normal distribution (A residual, e_i, is the difference between the predicted value and the true value of the corresponding dependent (response) variable, y_i, where i represents the specific example in the data.) 3. No perfect multicollinearity exists among the predictors. 4. The number of observations is greater than the number of predictors. 5. The mean of the residuals is zero.
199
What is Regularization?
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function that discourages model complexity. The core idea: Instead of just minimizing training loss, you minimize: Total Loss = Training Loss + λ · Penalty where λ (lambda) controls how strongly complexity is penalized. Common types: L1 (Lasso) — penalizes the sum of absolute weights (|w|). Drives some weights to exactly zero, producing sparse models and acting as built-in feature selection. L2 (Ridge) — penalizes the sum of squared weights (w²). Shrinks all weights toward zero but rarely to exactly zero. Most common in practice. Elastic Net — combines L1 + L2, balancing sparsity and shrinkage. Dropout (neural nets) — randomly deactivates neurons during training, forcing the network to learn redundant representations. Early stopping — halts training before the model overfits the training data. Why it works: The penalty discourages the model from assigning large weights to any single feature, forcing it to find simpler, more generalizable patterns rather than memorizing training noise. Key tradeoff: Higher λ → more regularization → lower variance but higher bias. Tuning λ is typically done via cross-validation.
200
When does Regularization become necessary in Machine Learning?
Regularization becomes important when the model begins to either overfit or underfit. Another scenario where regularization is useful is when you want to optimize two competing functions simultaneously. In that case, there is a trade-off between them, and a regularization/penalty term is used to optimize the more important function at the cost of the less important one.
201
Q: What is softmax?
Softmax is a function that converts a vector of raw scores (logits) into a probability distribution over multiple classes. The formula: σ(zᵢ) = e^zᵢ / Σ e^zⱼ For each score zᵢ, exponentiate it and divide by the sum of all exponentiated scores. Key properties: 1. All outputs are in the range (0, 1) 2. All outputs sum to exactly 1 → interpretable as probabilities 3. Amplifies differences — the largest logit gets disproportionately more probability mass (due to the exponential) 4. Generalizes the sigmoid function to multiple classes (sigmoid is just softmax for 2 classes) Where it's used: 1. Output layer of multi-class classifiers (e.g., image classification with 1000 categories) 2. Attention mechanisms in Transformers — softmax normalizes attention scores so they sum to 1 across tokens 3. Policy networks in RL — converts raw action scores into a probability distribution for sampling
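A minimal NumPy sketch of the formula, including the standard max-subtraction trick for numerical stability (which leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow in exp without changing
    # the output, since the shift cancels in the ratio
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p)        # the largest logit gets the most probability mass
print(p.sum())  # the entries sum to 1
```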
202
What do you understand by Ridge Regression? Why do you need it? How is it different from OLS Regression?
Ridge Regression is a linear regression technique that adds an L2 penalty to the OLS loss function to shrink coefficients and reduce overfitting. The objective: ``` Loss = Σ(yᵢ − ŷᵢ)² + λ Σwⱼ² ``` OLS minimizes residuals alone; Ridge minimizes residuals plus the sum of squared weights. Why you need it: Multicollinearity — when features are highly correlated, OLS coefficient estimates become unstable and high-variance. Ridge stabilizes them. Overfitting — in high-dimensional settings (many features, relatively few samples), OLS overfits. Ridge constrains the model. Ill-conditioned systems — OLS requires inverting XᵀX, which can be singular or near-singular. Ridge adds λI to make it always invertible: ``` (XᵀX + λI)⁻¹ Xᵀy ``` The bias-variance tradeoff: Ridge intentionally introduces bias (coefficients are shrunk, not exact) in exchange for lower variance — predictions generalize better to unseen data. Key interview point: Ridge never sets coefficients to exactly zero (unlike Lasso), so it keeps all features in the model. If sparsity/feature selection matters, Lasso or Elastic Net is preferred. λ is tuned via cross-validation. For Ridge vs OLS key differences, check the image.
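The closed-form solution above is a one-liner in NumPy; a small sketch with synthetic data showing that a larger λ shrinks the coefficient vector (data and λ values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

w_ols   = ridge_fit(X, y, lam=0.0)   # lam = 0 recovers OLS
w_ridge = ridge_fit(X, y, lam=10.0)  # larger lam shrinks the weights
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))
```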
203
What is Lasso Regression? How is it different from OLS?
Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a linear regression technique that adds an L1 penalty to the OLS loss function, shrinking some coefficients to exactly zero — effectively performing automatic feature selection. The objective: ``` Loss = Σ(yᵢ − ŷᵢ)² + λ Σ|wⱼ| ``` Why it's powerful — sparsity: Unlike Ridge (which shrinks weights toward zero), Lasso can shrink weights to zero. This means it eliminates irrelevant features entirely, producing a simpler, more interpretable model. Lasso vs. Ridge — when to use which: Use Lasso when you suspect only a few features truly matter and want a sparse model Use Ridge when most features are relevant and multicollinearity is the main concern Use Elastic Net when you want both sparsity and stability under multicollinearity Key interview point: Lasso has no closed-form solution because the absolute value function is non-differentiable at zero. It's solved using methods like coordinate descent or subgradient methods. λ is tuned via cross-validation. Check the image for a comparison between Lasso and OLS.
204
How does Ridge Regression differ from Lasso regression?
Both of them are regularization techniques, with the difference in their penalty functions. The penalty in Ridge regression is the sum of the squares of the coefficients whereas, for Lasso, it is the sum of the absolute values of the coefficients. Lasso regression is used to achieve sparse solutions by driving some of the coefficients to exactly zero. Ridge regression tends to smooth the solution: it keeps all the coefficients but shrinks the sum of their squares. A related point often raised here concerns losses rather than penalties: an L1 loss is more robust to outliers than an L2 loss, because squaring the error makes any outlier's term huge. L2 also produces a unique solution, whereas with L1 you can have multiple solutions.
205
Why does L1 produce zeros but L2 doesn't?
This comes down to geometry. The L1 constraint region is a diamond (sharp corners), and the loss function's elliptical contours are likely to touch it at a corner — where one or more weights are exactly zero. The L2 constraint is a sphere (no corners), so the contours touch it along a smooth edge, leaving all weights nonzero.
206
Why and where do you use Cluster Analysis?
Cluster analysis is the task of grouping (clustering) a set of objects in such a way that objects in the same cluster are much more similar to each other than those in other clusters. In some cases, Cluster analysis can be used for the initial analysis of the given dataset based on the different target attributes. For data lacking output labels, you can use a clustering technique to automatically find the class label by grouping input data instances into different clusters and then assigning a unique label to each cluster.
207
What are two examples of Cluster analysis methods?
Two of the most commonly used Clustering methods are: 1. Hierarchical Clustering - This produces a hierarchy of clusters, either by merging smaller clusters into larger ones or dividing larger clusters into smaller ones. The merging or splitting of clusters depends on the metric used for measuring the dissimilarity between sets of data instances. Some of the commonly used metrics are Euclidean distance, Manhattan distance, and Hamming distance. 2. K-Means Clustering - It assigns the data points to k clusters such that each data point belongs to the cluster with the closest mean. It is suitable when you have a large number of data points, and it uses far fewer iterations than Hierarchical Clustering.
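The K-Means assign/update loop can be sketched in NumPy (a minimal illustration on two synthetic blobs; for clarity the centers are seeded from chosen data points, whereas a real run would use random or k-means++ initialization):

```python
import numpy as np

def kmeans(X, k, init_idx, n_iter=20):
    centers = X[init_idx].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest mean
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # blob around (0, 0)
               rng.normal(5, 0.3, (20, 2))])  # blob around (5, 5)
labels, centers = kmeans(X, k=2, init_idx=[0, 20])
```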
208
Provide 3 differences between Partitioning method and Hierarchical method for clustering.
1. A partitional clustering is a division of the set of data objects into non-overlapping clusters such that each object is in exactly one cluster, whereas a hierarchical clustering is a set of nested clusters organized as a tree. 2. Hierarchical clustering does not require any input parameters, whereas partitional clustering algorithms typically require the number of clusters up front (K-Means does, though density-based methods like HDBSCAN do not). 3. Partitional clustering is generally faster than hierarchical clustering.
209
How do you evaluate the quality of clusters that are generated by a run of K-means?
One way to evaluate cluster quality is to resample the data (via bootstrap or by adding small noise) and compute the closeness of the resulting partitions, measured by Jaccard similarity. This approach allows you to estimate the frequency with which similar clusters can be recovered after resampling. Jaccard Similarity measures the overlap between two clusters by comparing shared members to total members: ``` J(A,B) = |A ∩ B| / |A ∪ B| ``` It ranges from 0 (no overlap) to 1 (identical). It is useful when comparing cluster assignments across runs or against ground-truth labels.
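The Jaccard formula is a few lines of plain Python (cluster memberships here are made-up point IDs for illustration):

```python
def jaccard(a, b):
    # |A intersect B| / |A union B| over the clusters' member sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# The same cluster recovered from two runs, differing by one member
run1 = {1, 2, 3, 4}
run2 = {2, 3, 4, 5}
print(jaccard(run1, run2))  # 3 shared members out of 5 total
```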
210
How would you assess the quality of a Clustering technique? (Hint, there are four common assessment approaches).
Cluster evaluation is a hard problem, and most of the time there is no perfect solution to it. Otherwise, it would be a classification problem where each cluster represents one class. Four common assessment approaches are: 1. Internal Evaluation, where the clustering result is assessed using only the clustered data itself, e.g. via measures of cluster cohesion and separation such as the silhouette score. 2. External Evaluation, where the result of the clustering is compared to an existing "ground truth". However, obtaining an external reference result is not straightforward in most cases. 3. Manual Evaluation by a human expert. 4. Indirect Evaluation by evaluating the utility of the clustering in its intended application.
211
What is Dimensionality Reduction and why do you need it?
As the name suggests, Dimensionality Reduction means finding a lower-dimensional representation of the dataset such that the original dataset is preserved as much as possible even after reducing the number of dimensions. Dimensionality Reduction reduces time and storage space required. It also addresses multi-collinearity which improves the performance of the ML model. Many high-dimensional datasets such as videos, human genes, etc are difficult to process as is. For such data, you need to remove the unnecessary and redundant features and keep only the most informative ones to better learn from them.
212
Are Dimensionality Reduction techniques supervised or unsupervised?
Generally, you use Dimensionality Reduction for Unsupervised Learning tasks, but it can also be used in Supervised Learning. One of the standard methods of Supervised Dimensionality Reduction is Linear Discriminant Analysis (LDA). It is designed to find low-dimensional projections that maximize class separation. Another approach is Partial Least Squares (PLS), which looks for the projection having the highest covariance with the group labels.
213
List five ways of reducing the dimensionality of a given dataset.
1. Principal Component Analysis 2. Backward Feature Elimination 3. Forward Feature Selection 4. Linear Discriminant Analysis 5. Generalized Discriminant Analysis
214
Is feature selection a Dimensionality Reduction technique?
Feature selection is a special case of Dimensionality Reduction in which the set of features made by feature selection must be a subset of the original feature set. In Dimensionality Reduction, it is not always the case that the new features are a subset of the original features (consider PCA which reduces the dimensionality by making new synthetic features from the linear combination of the original ones).
215
What is the difference between density-sparse and dimensionally-sparse data?
Density sparse data means that a high percentage of the data contains 0 or null values. Dimensionally sparse data is the one which has a large feature space, in which some of the features are redundant, correlated, etc.
216
Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
Reducing the number of features will definitely reduce the computational complexity of the model but it may not improve the performance of the SVM model, because SVM automatically uses regularization to avoid overfitting. So, performing dimensionality reduction before SVM modelling may not improve the performance of the SVM classifier.
217
Suppose you have a very sparse matrix with highly dimensional rows. Is projecting these rows on a random vector of relatively small dimensionality a valid dimensionality reduction technique?
Although it may not sound intuitive, random projection is a valid dimensionality reduction method. It is a computationally efficient way to reduce dimensionality by trading a controlled amount of error for smaller model sizes and faster processing times. Random projection is based on the idea that if the data points in a sparse feature space have a very high dimension, then you can project them into a lower-dimensional space in a way that approximately preserves the pairwise distances between the points.
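The distance-preservation claim is easy to check empirically; a small NumPy sketch using a Gaussian random projection matrix (dimensions here are illustrative, and the 1/sqrt(k) scaling keeps expected norms unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 2000, 400
X = rng.normal(size=(n, d))

# Project onto k random directions, scaled to preserve expected norms
R = rng.normal(size=(d, k)) / np.sqrt(k)
X_low = X @ R

# The pairwise distance between two points is approximately preserved
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(X_low[0] - X_low[1])
ratio = proj / orig
print(ratio)  # concentrates near 1 as k grows
```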
218
What is Independent Component Analysis? What is the difference between ICA and PCA?
Independent Component Analysis is a statistical technique in Unsupervised Learning which decomposes a multi-variate signal into independent non-Gaussian components. It defines a generative model in which the data variables are assumed to be linear or nonlinear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed to be non-Gaussian and mutually independent, and they are called the independent components of the observed data. These independent components are found by ICA. ICA has been used for Facial Recognition and Stock Prediction. PCA helps to find the low-rank representation of the dataset such that the first vector of the PCA is the one that best explains the variability of your data (the principal direction), the second vector is the second best explanation and is orthogonal to the first one, and so on. ICA finds a representation of the dataset as independent sub-elements. You can think of the data as a mixed signal, consisting of independent vectors.
219
What is Fisher Discriminant Analysis? Is it Supervised or Unsupervised? How is it different from PCA?
Fisher Discriminant Analysis is a Supervised Learning technique, which tries to find the components in such a way that the class separation is maximized while minimizing the within class variance. Both PCA and FDA techniques are used for feature reduction by finding the eigenvalues and eigenvectors to project the existing feature space into new dimensions. The major difference is that PCA falls under Unsupervised Learning and tries to find the components such that the variance in the complete dataset is maximized whereas FDA tries to maximize the separation between classes.
220
What are the differences between Factor Analysis and Principal Component Analysis?
PCA involves transforming the given data into a smaller set of components such that they are linearly uncorrelated with each other. Factor Analysis is a generalization of PCA which is based on maximum likelihood. PCA is used when you want to simply reduce your correlated observed variables to a smaller set of important, independent, orthogonal variables. Factor Analysis is used when you want to test a theoretical model of latent factors causing the observed variables.
221
How is Singular Value Decomposition (SVD) mathematically related to EVD for PCA?
PCA is usually performed using Eigenvalue Decomposition, but you can also use Singular Value Decomposition to perform PCA. The link: SVD on a centered X and EVD on the covariance matrix XᵀX/(n−1) are equivalent operations. The right singular vectors V of X are identical to the eigenvectors of the covariance matrix, and the singular values relate to its eigenvalues by λᵢ = σᵢ² / (n−1).
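This identity can be verified directly in NumPy on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)  # center the data first
n = X.shape[0]

# EVD route: eigenvalues of the covariance matrix, descending
cov = X.T @ X / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# SVD route: singular values of X relate via lam_i = s_i^2 / (n-1)
s = np.linalg.svd(X, compute_uv=False)
print(np.allclose(eigvals, s**2 / (n - 1)))
```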
222
Why do you need to center the data for PCA and what can happen if you do not do it?
Centering the data means bringing the mean to the origin by subtracting it from the data. It is required to ensure that the first principal component is indeed in the direction of maximum variance. Centered data (zero mean) is used to find a basis that minimizes the mean squared error. If you do not perform centering, then the first component might instead be misleading and correspond to the mean of the data. Centering is not required if you are performing PCA on a correlation matrix, since the data would already be centered after calculating the correlations.
223
Do you need to normalize the data for PCA? Why or why not?
PCA is about transforming the given data to the space which maximizes the variance. If the data is not normalized then PCA may select some features with the highest variance in the dataset, making them more important. For instance, if you use "grams" for a feature instead of "kgs", then its variance would increase and PCA might think that it has more impact, which may not be correct. Hence, it is very important to normalize the data for PCA.
224
What role does orthogonality play in PCA, and is post-hoc rotation (e.g., Varimax) necessary?
Orthogonality is fundamental to PCA — the principal components are constrained to be perpendicular to one another, which guarantees they capture uncorrelated, independent directions of variance. This is not optional; it's what makes PCA mathematically well-defined and ensures each component adds non-redundant information. Post-hoc rotation (e.g., Varimax, Oblimin), however, is not necessary and is an optional step borrowed from factor analysis. There's an important trade-off: Without rotation: Components are ordered by variance explained, which is ideal for dimensionality reduction. With rotation (Varimax): Variance is spread more evenly across components, making loadings easier to interpret — but you lose the clean variance-ordering guarantee. Key distinction: Orthogonality is a structural constraint built into PCA. Rotation is a post-hoc choice that trades predictive/compression power for interpretability.
225
Is PCA a linear model or not? Explain
PCA projects the original dataset onto a lower-dimensional linear subspace, called a hyperplane. All the mappings, rotations, and transformations performed are linear and can be expressed in terms of linear algebraic operations: a mean subtraction followed by a fixed matrix multiplication. Thus, PCA is a linear method for Dimensionality Reduction.
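This linearity is easy to verify (synthetic data assumed): scikit-learn's `transform` is exactly a mean shift followed by a matrix multiplication, reproducible by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# The transform is affine-linear: subtract the mean, multiply by a fixed matrix.
Z_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(Z, Z_manual))
```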
226
Have you heard of Kernel PCA or other non-linear Dimensionality Reduction techniques? Can you explain any one of them?
Kernel PCA extends standard PCA to handle non-linearly separable data by implicitly mapping the data into a high-dimensional feature space using the kernel trick, then performing PCA there.
The Core Idea: standard PCA finds linear directions of maximum variance. But if your data lies on a curved manifold (e.g., a Swiss roll), linear projections lose structure. Kernel PCA solves this without ever explicitly computing the high-dimensional mapping.
How it Works:
1. Choose a kernel, such as RBF
2. Compute the kernel matrix K
3. Center K in feature space
4. Eigendecompose K and project the data onto the top eigenvectors
The kernel trick means you compute dot products in the high-dimensional space without ever going there: the cost is O(n^2) in the number of samples, not in the (possibly infinite) feature-space dimension.
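The four steps above can be sketched in a few lines of numpy (synthetic data and an assumed `gamma=0.5` for the RBF kernel), with scikit-learn's `KernelPCA` as a sanity check; note the kernel matrix has shape n x n, which is where the O(n^2) sample cost comes from.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
gamma = 0.5  # assumed RBF bandwidth parameter

# 1.-2. Kernel matrix: implicit dot products in the RBF feature space.
K = rbf_kernel(X, gamma=gamma)

# 3. Center K in feature space: K_c = K - 1_n K - K 1_n + 1_n K 1_n.
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# 4. Eigendecompose and project onto the top-2 components.
eigvals, eigvecs = np.linalg.eigh(K_c)       # ascending order
idx = np.argsort(eigvals)[::-1][:2]
Z = eigvecs[:, idx] * np.sqrt(eigvals[idx])  # projections of the training points

# Sanity check against scikit-learn (components can differ by sign).
Z_sk = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit_transform(X)
match = bool(np.allclose(np.abs(Z), np.abs(Z_sk), atol=1e-6))
print(match, Z.shape)
```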
227
What Dimensionality Reduction techniques can be used for preprocessing your data?
Core Idea: preprocessing with Dimensionality Reduction is about removing noise, redundancy, and curse-of-dimensionality issues before feeding the data to your actual model, not just about visualization. Dimensionality Reduction can be broadly divided into Feature Extraction and Feature Selection, both of which are used for preprocessing; the resulting dataset is then used for learning. Here are three main categories of techniques:
1. Linear Methods
- PCA: remove correlated features, keep the top-k variance-explaining components. Fast, interpretable, a great default.
- SVD/Truncated SVD: same idea but works on sparse matrices (e.g., TF-IDF for NLP); used under the hood in PCA.
- LDA: supervised DR; maximizes class separability. Useful when labels are available at preprocessing time.
2. Feature Selection
- Variance Thresholding: drop near-zero-variance features outright.
- Correlation filtering: drop one of any highly correlated feature pair.
- L1 Regularization (Lasso): embeds selection into model training; drives irrelevant weights to zero.
3. Non-linear Methods
- Autoencoders: learn a compressed bottleneck representation; great for images/text with complex structure.
- UMAP: faster than t-SNE and better at preserving global structure; usable as preprocessing (unlike t-SNE).
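Several of these steps compose naturally into one scikit-learn pipeline; a minimal sketch on synthetic data (the constant column and the chosen component count are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = 3.0  # a constant (zero-variance) feature to be dropped

prep = Pipeline([
    ("drop_constant", VarianceThreshold(threshold=0.0)),  # feature selection
    ("scale", StandardScaler()),                          # put features on one scale
    ("pca", PCA(n_components=5)),                         # feature extraction
])
Z = prep.fit_transform(X)
print(Z.shape)  # (200, 5)
```

The same fitted pipeline can then be reused via `prep.transform` on validation or test data, so the reduction learned on the training set is applied consistently.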
228
What is the difference between Feature Selection and Feature Extraction?
Both of these techniques are used to avoid the Curse of Dimensionality, simplify models by removing redundant and irrelevant features, and reduce overfitting. But the difference lies in how they achieve it. Feature Selection means selecting a subset of the given features based on some criterion; Forward Selection and Backward Elimination are two ways to perform it. Feature Extraction means projecting the given feature space into a new feature space, as in SVD and PCA.
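A small side-by-side sketch on synthetic regression data (dataset sizes and the 3-feature target are assumptions): forward selection keeps 3 of the original columns, while PCA builds 3 new columns as linear mixes of all of them.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

# Feature Selection: forward selection keeps a subset of the ORIGINAL columns.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3, direction="forward")
X_sel = sfs.fit_transform(X, y)

# Feature Extraction: PCA constructs NEW columns from all 8 original ones.
X_ext = PCA(n_components=3).fit_transform(X)
print(X_sel.shape, X_ext.shape, sfs.get_support())
```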
229
What are four Feature Extraction techniques used for Dimensionality Reduction?
1. Independent Component Analysis (ICA)
2. Principal Component Analysis (PCA)
3. Kernel Based PCA
4. Singular Value Decomposition (SVD)
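All four have scikit-learn implementations with the same fit/transform interface; a quick sketch on synthetic data (shapes and component counts are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA, KernelPCA, PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

models = [
    FastICA(n_components=2, random_state=0),  # ICA: statistically independent sources
    PCA(n_components=2),                      # PCA: orthogonal max-variance directions
    KernelPCA(n_components=2, kernel="rbf"),  # Kernel PCA: non-linear extension
    TruncatedSVD(n_components=2),             # SVD: also works on sparse, uncentered data
]
shapes = {type(m).__name__: m.fit_transform(X).shape for m in models}
print(shapes)  # each technique maps 6 features down to 2
```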
230