What is the difference between supervised and unsupervised learning?
The primary difference between supervised and unsupervised learning lies in the presence or absence of labeled data and the goal of the learning process.
In supervised learning, the training data comes with labels, meaning each input example has a known, correct output. The goal is to learn a mapping function from inputs to outputs.
Use cases:
• Classification (e.g., spam detection, fraud detection)
• Regression (e.g., demand forecasting, stock prediction)
• Goal: Minimize error between predicted and actual output (e.g., mean squared error for regression).
Key thing: The “supervision” comes from the labels telling the model what the correct answer is.
In unsupervised learning, the data has no labels. The model tries to find structure or patterns in the input data without knowing the desired outcome.
Example: Customer segmentation. You only have customer behavior data, and you want the algorithm to discover groups of similar customers.
Use cases:
• Clustering (e.g., K-means, DBSCAN)
• Dimensionality reduction (e.g., PCA, t-SNE)
• Anomaly detection
• Goal: Find hidden structure, groupings, or features without any ground truth.
• Key thing: There’s no feedback signal telling the model if it’s right or wrong—it’s learning from the inherent structure in the data.
Are LLMs supervised or unsupervised?
LLMs are pretrained with supervised-learning machinery on a self-supervised objective, and later fine-tuned with actual human supervision. Here's what that means step by step:
1. Pretraining Phase: Self-Supervised Learning (a form of unsupervised learning)
• In this phase, the model is trained to predict the next word (or token) in a sentence.
• This is known as a self-supervised learning task because:
• The labels (next words) are derived automatically from the data itself.
• No human annotators are needed.
It uses a supervised learning algorithm (cross-entropy loss, gradient descent), but it's trained on unlabeled data where the "labels" come from the data itself. So it's neither purely supervised nor purely unsupervised; "self-supervised" is the standard term.
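A minimal sketch of that idea, using a toy token list: the "labels" for next-token prediction are just shifted copies of the same text, so no human annotation is involved.

```python
# Sketch: in self-supervised language modeling, each training pair is
# (context so far, next token). Inputs and targets come from the same text.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(context, "->", target)
```

A real pretraining pipeline does the same thing at scale, with a tokenizer and a neural network predicting a probability distribution over the next token.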
Can you explain the concept of overfitting and underfitting in machine learning models?
Overfitting and underfitting are two sides of the same coin—they describe a model’s failure to generalize well to unseen data, but for opposite reasons.
The Bias-Variance Tradeoff
Before diving into the definitions, here’s a helpful frame:
* Underfitting → High bias, low variance
* Overfitting → Low bias, high variance
Underfitting
* Happens when a model is too simple to capture the underlying patterns in the data.
* It performs poorly on both training and test data.
* Common causes:
* Model is too shallow (e.g., linear model for nonlinear data)
* Insufficient features or poor feature engineering
* Too much regularization
Overfitting
* Happens when a model is too complex and starts learning noise or random fluctuations in the training data.
* It performs very well on training data but poorly on test/validation data.
* Common causes:
* Deep or overly flexible model (e.g., a large decision tree or high-degree polynomial)
* Too little training data
* No regularization
High train error, high test error = underfitting; fix: add model complexity, more features, less regularization
Low train error, high test error = overfitting; fix: add regularization, use simpler model, more data, early stopping
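As a toy illustration of that diagnosis table (synthetic sine data, numpy only): a degree-1 polynomial underfits, a very high degree overfits, and a moderate degree usually lands in between.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Hold out every third point as a test set.
test = np.arange(len(x)) % 3 == 0
x_tr, y_tr, x_te, y_te = x[~test], y[~test], x[test], y[test]

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_tr, y_tr, degree)
    tr = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    te = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    return tr, te

# Degree 1 underfits: high error on both sets.
# Degree 15 has the lowest train error and typically a worse test error.
# Degree 3 is usually the sweet spot here.
for d in (1, 3, 15):
    print(d, errors(d))
```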
What is cross-validation? Why is it important?
Cross-validation is a technique used to assess the generalization performance of a machine learning model. It helps us estimate how well our model will perform on independent, unseen data, and is especially important when we have limited labeled data.
At its core, cross-validation means: “Train your model multiple times, each time on a different subset of the data, and test it on the parts you didn’t train on.” The idea is to rotate the roles of training and validation data, so you get a more reliable estimate of performance.
Most Common Type: k-Fold Cross-Validation
• Split the dataset into k equal parts (“folds”)
• For each fold:
• Train on k-1 folds
• Validate on the 1 remaining fold
• Average the results over the k runs
For time series or causal inference, I use TimeSeriesSplit (forward-chaining splits) instead: it maintains temporal order, so no future data leaks into the training folds.
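The k-fold procedure above can be sketched in a few lines of numpy. This is a minimal hand-rolled version (a simple least-squares line as the model); in practice you'd reach for `sklearn.model_selection.KFold` or `cross_val_score`.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n sample indices and split them into k roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_val_mse(x, y, k=5):
    """Train a 1-D least-squares line on k-1 folds, validate on the held-out fold."""
    folds = kfold_indices(len(x), k)
    scores = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        slope, intercept = np.polyfit(x[tr], y[tr], 1)
        pred = slope * x[val] + intercept
        scores.append(np.mean((pred - y[val]) ** 2))
    return float(np.mean(scores))  # average over the k runs

x = np.linspace(0, 10, 50)
y = 2 * x + 1
print(cross_val_mse(x, y))  # near zero for perfectly linear data
```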
What is the bias-variance tradeoff?
Total error = bias² + variance + irreducible noise. A more flexible model lowers bias but raises variance, so aim for the minimum of the sum.
The bias-variance tradeoff describes the balance between two types of error that impact a model’s ability to generalize to unseen data:
• Bias: Error due to incorrect assumptions in the model.
• Variance: Error due to sensitivity to small fluctuations in the training set.
Together with irreducible error (noise in the data), these form the total prediction error.
• As you make a model more complex (e.g., adding more features, layers, or trees), bias goes down, but variance goes up.
• A simple model has high bias, low variance.
• A complex model has low bias, high variance.
🎯 The sweet spot is a model that’s just complex enough to capture the signal but not the noise—low total error.
• Reduce bias by using more expressive models (e.g., from linear → polynomial → deep networks)
• Reduce variance by:
• Adding more training data
• Applying regularization (L1/L2, dropout)
• Using ensemble methods (bagging, boosting)
• Doing cross-validation to stabilize estimates
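The decomposition can be made concrete with a small simulation (synthetic sine data, numpy only): refit the same model on many noisy resamples and measure how far the average prediction is from the truth (bias²) versus how much individual fits scatter (variance).

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 20)
true_f = np.sin(2 * np.pi * x)

def bias2_and_variance(degree, n_sims=200, noise=0.3):
    """Estimate bias^2 and variance of a polynomial fit at the sample points."""
    preds = []
    for _ in range(n_sims):
        y = true_f + rng.normal(0, noise, size=x.shape)
        preds.append(np.polyval(np.polyfit(x, y, degree), x))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

b1, v1 = bias2_and_variance(1)  # simple model: high bias, low variance
b9, v9 = bias2_and_variance(9)  # flexible model: low bias, high variance
print(b1, v1)
print(b9, v9)
```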
How would you validate a model you created to generate a predictive analysis?
To validate a model for predictive analysis, I follow a structured framework that ensures we’re evaluating the model fairly, robustly, and in a way aligned with the business goal.
I dig into:
• Residual plots: to spot systematic errors
• Confusion matrix: for classification tasks
• Subgroup performance: Does the model underperform on certain slices (e.g., low-income users, rare categories)?
• Prediction drift: Does performance drop on newer data (i.e., is retraining needed)?
What is the role of the cost function in machine learning algorithms?
It quantifies the difference between the model’s predictions and the actual ground truth.
1. Defines the learning objective:
The cost function tells the model "how wrong" its predictions are, and provides a numerical signal to guide the optimization process.
2. Enables model improvement:
During training, the model uses gradient descent (or a variant) to iteratively adjust its parameters to minimize the cost.
3. Encodes assumptions and priorities:
Different tasks—and different business goals—require different cost functions. Choosing the right one is a key part of model design.
Regression (Predicting Continuous Values)
• Common cost function: Mean Squared Error (MSE)
• Penalizes larger errors more heavily (because of squaring)
• Smooth and differentiable—ideal for gradient-based optimization
Classification (Predicting Categories)
• Common cost function: Cross-Entropy Loss
• Measures how close the predicted probability distribution is to the true labels
• Encourages well-calibrated probabilities
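Both cost functions are a few lines of numpy. A minimal sketch, computed by hand on toy values:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: squaring penalizes large errors more heavily."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy between true labels and predicted probabilities.
    Clipping avoids log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

print(mse([3.0, 5.0], [2.5, 6.0]))  # (0.5^2 + 1.0^2) / 2 = 0.625
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```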
What is the curse of dimensionality? How do you avoid this?
The curse of dimensionality refers to the exponential increase in data sparsity, computational cost, and modeling difficulty as the number of input features (dimensions) grows. Intuitively: As the number of features increases, the data becomes “thinly spread” across a vast high-dimensional space—making it harder for models to learn meaningful patterns.
Why Is It a Problem?
1. Data sparsity:
• In high-dimensional space, even large datasets become sparse.
• Distance-based algorithms (e.g., KNN, clustering) break down because all points become roughly equidistant—making “closeness” meaningless.
2. Overfitting risk:
• More dimensions → more room for the model to memorize noise.
• Unless you have exponentially more data, your model is likely to overfit.
3. Combinatorial explosion:
• Feature space grows exponentially:
10 binary features → 2^10 = 1,024 combinations
100 binary features → 2^100 ≈ 1.27 × 10^30 combinations
• You need more data to “cover” the space meaningfully.
4. Slower training & inference:
• More dimensions = more parameters = higher compute and memory requirements
Imagine trying to cluster users based on 1000 behavioral features. Even if you have 1 million users, most points are so far apart in high-dimensional space that clusters become indistinguishable.
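The distance-concentration effect is easy to demonstrate with random points: as the dimension grows, the gap between the nearest and farthest point shrinks relative to the typical distance. A small numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_spread(dim, n=500):
    """Spread of distances from one reference point, relative to the mean
    distance. A small ratio means all points look roughly equidistant."""
    points = rng.random((n, dim))
    d = np.linalg.norm(points - points[0], axis=1)[1:]
    return (d.max() - d.min()) / d.mean()

for dim in (2, 10, 1000):
    print(dim, round(distance_spread(dim), 3))
```

In low dimensions the ratio is large (near vs. far is meaningful); in 1000 dimensions it collapses, which is why KNN and clustering degrade.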
How to Avoid or Mitigate It
1. Dimensionality Reduction
• PCA (Principal Component Analysis): Find the directions (principal components) that capture the most variance.
• t-SNE / UMAP: For visualizing and understanding structure in lower dimensions.
• Autoencoders: Learn compressed representations in neural networks.
2. Feature Selection
• Remove irrelevant, redundant, or noisy features using:
• Mutual information, correlation analysis
• Lasso regularization (L1 penalty shrinks irrelevant coefficients to 0)
• Tree-based feature importance (e.g., from random forest or XGBoost)
3. Regularization
• Techniques like L1/L2 regularization prevent overfitting by penalizing model complexity in high dimensions.
4. Collect more data
• If possible, increase the number of samples to better populate the high-dimensional space.
• This is hard in practice but effective.
5. Use models that handle sparsity well
• Tree-based methods (like XGBoost, LightGBM) and linear models often handle high-dimensional spaces better than distance-based methods.
What is "naive" about Naive Bayes?
The “naive” in Naive Bayes refers to a strong and unrealistic assumption the algorithm makes:
It assumes that all features are conditionally independent given the class label. This means that, once you know the class (e.g., spam or not spam), the model assumes each input feature (e.g., a word in an email) contributes to the prediction independently of the others.
Imagine you’re classifying an email as spam or not spam, and two features are:
• x_1 = presence of the word “free”
• x_2 = presence of the word “money”
These two words often appear together in spam emails, so they are not independent. But Naive Bayes treats them as if they are—this is the naive assumption.
Why Is It a Problem?
• In many real-world tasks (especially with text, images, or correlated signals), features are not independent. This can cause:
• Poor calibration (i.e., predicted probabilities aren’t trustworthy)
• Misclassification in some edge cases
Despite this, Naive Bayes often performs quite well in practice, especially in high-dimensional settings like text classification, where independence violations tend to average out.
When It Works Well
• In text classification, like spam filtering or sentiment analysis
• Despite word correlations, Naive Bayes gives strong baselines with almost no tuning
• In very small datasets, where model simplicity helps prevent overfitting
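The "free"/"money" example above can be worked end to end. This is a hand-rolled Bernoulli Naive Bayes on toy data (the email rows and labels are made up); `sklearn.naive_bayes.BernoulliNB` does the same with more polish.

```python
import numpy as np

# Toy spam data: columns = presence of "free", "money"; rows = emails.
X = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Per-class priors and per-feature probabilities with Laplace smoothing."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    probs = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
             for c in classes}
    return priors, probs

def predict(x, priors, probs):
    """The 'naive' step: multiply per-feature likelihoods as if independent."""
    scores = {}
    for c in priors:
        p = probs[c]
        scores[c] = priors[c] * np.prod(np.where(x == 1, p, 1 - p))
    return max(scores, key=scores.get)

priors, probs = fit_bernoulli_nb(X, y)
print(predict(np.array([1, 1]), priors, probs))  # both spammy words present -> 1
```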
What is semi-supervised learning? Give examples of when it’s useful.
Semi-supervised learning is a hybrid approach that uses a small amount of labeled data along with a large amount of unlabeled data to train a model. The idea is to leverage structure in the unlabeled data to improve learning, especially when labeled data is expensive or time-consuming to obtain. A classic example is medical imaging: expert-labeled scans are scarce and costly, but unlabeled scans are plentiful, so the unlabeled pool helps the model learn better representations.
What is self-supervised learning? How is it different from unsupervised learning?
Self-supervised learning (SSL) is a type of machine learning where the model learns from unlabeled data by creating its own supervisory signal—typically by solving a pretext task that doesn’t require human-annotated labels. It’s called “self-supervised” because the labels are automatically generated from the data itself.
Example: Predicting the Next Word (Language Modeling)
• The model learns to predict the missing word (“Paris”) using the context.
The input is the text; the label (the next word) is just part of the same text, so no human labeling is needed.
Self-supervised learning unlocks the value in massive unlabeled datasets by turning them into training material. It:
• Scales better than supervised learning (no human labeling)
• Produces general-purpose representations that can be fine-tuned for many tasks (e.g., BERT pretraining → fine-tuning on sentiment analysis)
• Powers most foundation models today
What is curriculum learning? When might it be beneficial?
Curriculum learning is a training strategy where a machine learning model is exposed to easier examples first, and then gradually harder examples over time—just like a human curriculum. The idea is that learning simpler patterns first helps the model build a strong foundation, making it easier to learn complex patterns later.
Why Might It Help?
1. Faster convergence:
• The model doesn’t get overwhelmed by hard examples early on, so it can learn the basic structure of the problem more efficiently.
2. Better generalization:
• By gradually increasing task difficulty, the model may avoid poor local minima and overfitting on noisy, hard-to-learn data early in training.
3. Stabilizes training:
• Particularly useful in reinforcement learning or generative models where learning is unstable.
NLP: Language Translation
• Start training on short, syntactically simple sentences.
• Gradually introduce longer, more complex sentences with rare words.
Vision: Object Recognition
• Start with clean, centered images of objects on plain backgrounds.
• Gradually increase difficulty by adding clutter, occlusion, or rotation.
Reinforcement Learning
• Teach a robot arm to reach for large, stationary targets before trying to grasp small, moving ones.
Machine Teaching or Data Augmentation
• Use a score (e.g., confidence, difficulty estimate) to order examples.
• Train on high-confidence or “easy” synthetic data before exposing the model to noisy, real-world data.
To implement:
1. A way to rank examples by difficulty (e.g., human-defined, model confidence, heuristic)
2. A curriculum schedule (e.g., gradually expanding data scope or complexity over epochs)
Often combined with self-paced learning, where the model dynamically selects what it’s ready to learn based on its own performance.
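The two ingredients above (a difficulty ranking plus a schedule) can be sketched in a few lines. The sentences and difficulty scores here are toy values for illustration:

```python
# Sketch of a curriculum schedule: sort examples by a difficulty score,
# then grow the training pool over "epochs" until the full set is exposed.
examples = [("short sentence", 0.1), ("medium clause, rare word", 0.5),
            ("long nested sentence with idioms", 0.9), ("two words", 0.05)]

by_difficulty = sorted(examples, key=lambda e: e[1])

def curriculum_pool(epoch, total_epochs=3):
    """Expose the easiest fraction first, the full set by the last epoch."""
    k = max(1, round(len(by_difficulty) * (epoch + 1) / total_epochs))
    return [text for text, _ in by_difficulty[:k]]

for epoch in range(3):
    print(epoch, curriculum_pool(epoch))
```

Self-paced variants replace the fixed schedule with one driven by the model's current loss on each example.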
When Might It Not Help?
• If all training examples are equally complex
• If your task benefits from exposure to full data diversity early on (e.g., large transformer models trained at scale)
• If difficulty estimation is unreliable
How do you handle missing or corrupted data in a dataset?
I approach missing or corrupted data systematically, using a combination of exploratory analysis, domain knowledge, and modeling considerations.
How would you handle an imbalanced dataset?
B. Undersampling the majority class
• Randomly remove majority class examples to balance the classes
Good when: dataset is very large, and model training is expensive
I sometimes combine SMOTE + undersampling to balance efficiency and representation.
B. Threshold tuning
• Instead of using a default 0.5 cutoff, adjust the prediction threshold to favor the minority class
• Useful when you care more about recall or precision than accuracy
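Threshold tuning is just a one-line change at prediction time. A minimal sketch with hypothetical predicted probabilities and labels for a rare positive class:

```python
import numpy as np

# Hypothetical model outputs: predicted probabilities and true labels.
p = np.array([0.15, 0.35, 0.45, 0.55, 0.8, 0.3, 0.6, 0.4])
y = np.array([0,    1,    1,    0,    1,   0,   1,   0  ])

def recall_at(threshold):
    """Recall of the positive class at a given probability cutoff."""
    pred = (p >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fn = np.sum((pred == 0) & (y == 1))
    return tp / (tp + fn)

print(recall_at(0.5))  # default cutoff misses low-probability positives
print(recall_at(0.3))  # lower cutoff trades precision for recall
```

In practice you'd sweep thresholds on a validation set and pick the one that best serves the business metric (e.g., a recall floor at acceptable precision).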
Can you explain the concept of “feature selection” in machine learning?
Feature selection is the process of choosing a subset of relevant features (input variables) from your dataset that are most useful for predicting the target variable. It’s about removing irrelevant, redundant, or noisy features to improve model performance, training speed, and interpretability.
Feature engineering - Creates new features or transforms existing ones
Feature selection - Chooses which features to keep or drop
How do you handle categorical variables in your dataset?
The approach depends on:
• The type of categorical variable (nominal vs. ordinal)
• The model type (tree-based vs. linear vs. neural)
• The cardinality (number of unique values)
• Whether the feature has semantic meaning or ordering
A. One-Hot Encoding (OHE)
• Creates a new binary column for each category
• Common for low-cardinality nominal features
✅ Best for: linear models, small feature spaces
❌ Downside: high-dimensional explosion with many unique categories
B. Label Encoding (Integer Encoding)
• Assigns an integer to each category (e.g., red → 0, blue → 1)
• Only safe for ordinal variables unless the model is tree-based
✅ Good for: tree-based models (e.g., XGBoost, LightGBM)
❌ Not suitable for linear models—it introduces false orderings
C. Ordinal Encoding
• Similar to label encoding but with explicit ordering
• Useful for features like education level, product ratings, etc.
D. Target Encoding / Mean Encoding
• Replace category with mean of the target variable within that group
• Powerful but risky—can cause data leakage
✅ Often used in Kaggle, tabular competitions
❌ Needs cross-validation-aware encoding to avoid leakage
E. Frequency or Count Encoding
• Replace category with frequency of that category in the dataset
• Helps compress high-cardinality features into informative signals
F. Embedding Layers (for Deep Learning)
• Use learned dense vectors for each category
• Standard in tabular deep models and language models
✅ Great for high-cardinality categorical features (e.g., zip codes, user IDs)
Linear models -> One-hot, ordinal
Tree-based models -> Label, frequency, target encoding
Neural networks -> Embeddings, one-hot
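Minimal sketches of three of these encodings on a toy color feature, in plain Python (libraries like pandas or category_encoders handle the bookkeeping in real pipelines):

```python
colors = ["red", "blue", "red", "green", "blue", "red"]

# One-hot: one binary column per category.
categories = sorted(set(colors))
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Label encoding: an integer per category (order is arbitrary for nominal data,
# which is why this is only safe for tree-based models).
label = {cat: i for i, cat in enumerate(categories)}
labels = [label[c] for c in colors]

# Frequency encoding: share of rows in each category.
freq = {cat: colors.count(cat) / len(colors) for cat in categories}
freqs = [freq[c] for c in colors]

print(one_hot[0], labels[0], freqs[0])
```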
How do filtering and wrapper methods work in feature selection?
Feature selection methods aim to choose a subset of relevant input features to improve model performance, reduce overfitting, and simplify interpretation.
Two foundational approaches are filter methods and wrapper methods.
Filter Methods
📌 How They Work:
• Compute a relevance score for each feature with respect to the target (e.g., correlation, mutual information)
• Rank features by their score
• Select the top-k features or those above a threshold
Pearson Correlation - Linear relationships (regression)
Chi-Squared Test - Categorical vs categorical (classification)
ANOVA F-test - Continuous features (classification)
Mutual Information - Nonlinear dependencies
Variance Threshold - Remove low-variance (uninformative) features
✅ Pros:
• Fast and scalable
• Simple to understand
• Not model-specific
❌ Cons:
• Ignores feature interactions
• May not align with what improves predictive performance
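A filter method in miniature (synthetic data, Pearson correlation as the relevance score): score each feature independently against the target, then keep the top k.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
# Target depends on features 0 and 2 only; features 1 and 3 are noise.
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(0, 0.5, size=n)

# Filter step: score each feature by |Pearson correlation| with the target.
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top_k = np.argsort(scores)[::-1][:2]
print(sorted(top_k.tolist()))
```

Note the limitation mentioned above: because each feature is scored alone, a filter like this cannot detect features that are only useful in combination.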
Wrapper Methods
📌 How They Work:
1. Define a search strategy (e.g., forward selection, backward elimination, or recursive elimination)
2. Train a model on different subsets of features
3. Score each subset based on performance (e.g., accuracy, F1)
4. Keep the subset with the best score
🔹 Forward Selection
• Start with no features
• Add one feature at a time that improves performance the most
• Stop when no improvement
🔹 Backward Elimination
• Start with all features
• Remove the least useful one iteratively
🔹 Recursive Feature Elimination (RFE)
• Use model coefficients or feature importances to recursively eliminate the least important feature
✅ Pros:
• Takes feature interactions into account
• Tailored to the specific model you’ll use in production
❌ Cons:
• Computationally expensive, especially with many features
• May overfit on small datasets if not done with cross-validation
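Forward selection can be sketched with a least-squares model and a single holdout split (synthetic data; a real run would use proper cross-validation, as cautioned above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 5))
y = 4 * X[:, 1] + 2 * X[:, 3] + rng.normal(0, 0.5, size=n)

def holdout_mse(cols):
    """Score a feature subset: least-squares fit on one half, MSE on the other."""
    half = n // 2
    A_tr, A_te = X[:half][:, cols], X[half:][:, cols]
    w, *_ = np.linalg.lstsq(A_tr, y[:half], rcond=None)
    return np.mean((A_te @ w - y[half:]) ** 2)

# Forward selection: greedily add the feature that most improves the score.
selected, best = [], np.inf
while True:
    candidates = [(holdout_mse(selected + [j]), j)
                  for j in range(X.shape[1]) if j not in selected]
    score, j = min(candidates)
    if score >= best:
        break
    best, selected = score, selected + [j]
print(selected)
```

The informative features (1 and 3) get picked up first; whether a noise feature sneaks in at the end depends on the split, which is exactly why cross-validated scoring matters.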
Describe a situation where you had to handle missing data. What techniques did you use?
One project that stands out was when I was working on a customer churn prediction model for a subscription-based service. The dataset contained customer activity logs, demographic data, and billing history.
The Problem - When exploring the data, I noticed:
• ~20% of users had missing values in the last_login_date
• ~10% were missing income_bracket and age
• A few other features had sporadic missing values or inconsistent formats (e.g., zipcode was missing or corrupted in some cases)
This raised three key challenges:
1. Determining if the missingness was informative
2. Choosing the right imputation strategy without distorting patterns
3. Ensuring downstream models could generalize well
✅ 1. Diagnosed Missingness Mechanism
I first ran an exploratory analysis:
• Checked missingness patterns with heatmaps
• Grouped missing rows by churn status to see if missingness was correlated with the target
• Found that users with missing last_login_date were disproportionately likely to churn—so missingness was informative
→ Added a boolean flag: is_last_login_missing = True/False as a separate feature
✅ 2. Imputation Strategy (Feature-wise)
Feature — Imputation Technique — Reason
last_login_date — Left as missing; used binary flag — Missingness itself carried predictive signal
income_bracket — Mode imputation + added "unknown" category — Categorical + potential bias in removing it
age — Median imputation by user segment (region) — More stable than mean; respected data structure
zipcode — Dropped (very sparse + unreliable) — Too many unique values + not helpful in modeling
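The flag-plus-impute pattern from the table looks like this in miniature (toy ages; pandas or sklearn's SimpleImputer would do the same in a real pipeline):

```python
import numpy as np

# Toy column with missing ages (np.nan marks missing).
age = np.array([34.0, np.nan, 29.0, 41.0, np.nan, 37.0])

# 1. Keep the signal: record which rows were missing as its own feature.
is_missing = np.isnan(age).astype(int)

# 2. Median imputation (more robust to outliers than the mean).
median = np.nanmedian(age)
age_imputed = np.where(np.isnan(age), median, age)

print(is_missing.tolist())
print(age_imputed.tolist())
```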
What is principal component analysis (PCA) and when is it used?
Principal Component Analysis (PCA) is an unsupervised linear transformation technique that projects high-dimensional data into a lower-dimensional space by finding the directions (called principal components) that capture the maximum variance in the data.
In simple terms: PCA finds the most “informative” axes in your data and helps you reduce the number of features while retaining as much signal as possible.
How Does It Work? (High-Level Intuition)
1. Center the data (subtract the mean)
2. Compute the covariance matrix
3. Compute eigenvectors and eigenvalues of the covariance matrix
4. Select the top-k eigenvectors (those with the largest eigenvalues) → these are your principal components
5. Project the data onto these components to get a lower-dimensional representation
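The five steps above map directly onto a few lines of numpy (synthetic correlated data; `sklearn.decomposition.PCA` wraps the same math):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: mostly varying along one direction.
base = rng.normal(size=(100, 1))
X = np.hstack([base, 0.5 * base + 0.1 * rng.normal(size=(100, 1))])

# 1. Center the data.
Xc = X - X.mean(axis=0)
# 2. Covariance matrix.
cov = np.cov(Xc, rowvar=False)
# 3. Eigen-decomposition (eigh, since the covariance matrix is symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort components by explained variance, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Project onto the top component.
Z = Xc @ eigvecs[:, :1]

explained = eigvals[0] / eigvals.sum()
print(round(float(explained), 3))  # first component captures almost all variance
```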
✅ 1. Dimensionality Reduction
• To reduce feature count while preserving important structure
• Useful when you have many features and limited samples (e.g., gene expression data)
✅ 2. Visualization
• Reduce data to 2D or 3D for plotting clusters or trends
• Very common in EDA or model interpretation
✅ 3. Noise Reduction
• PCA can help eliminate components that capture mostly noise (low variance)
• Improves signal-to-noise ratio in the data
✅ 4. Preprocessing for ML
• Helps avoid multicollinearity (highly correlated features)
• Some models (e.g., logistic regression) benefit from orthogonal features
✅ 5. Data Compression
• Reduces storage or compute requirements in large-scale pipelines
⚠️ When NOT to Use PCA
• If interpretability matters: PCA transforms features into abstract combinations, making them harder to explain
• If features have nonlinear structure: PCA is linear—doesn’t capture nonlinear relationships (use t-SNE, UMAP, or autoencoders instead)
• If the features have vastly different scales: PCA is sensitive to feature scaling, so always standardize your data first
What’s the difference between PCA and ICA?
🎯 What PCA Does
• Finds orthogonal axes (principal components) that capture the most variance in the data.
• Each component is a linear combination of original features, and they’re ordered by the amount of variance they explain.
• It relies only on second-order statistics (mean and covariance), so it is best suited to data that is approximately Gaussian.
✅ Best for:
• Reducing dimensionality while preserving most of the variance
• Data compression
• Noise reduction
🎯 What ICA Does
• ICA goes beyond uncorrelatedness and looks for statistical independence, which is a much stronger condition.
• It’s commonly used when the observed signals are mixtures of independent sources—this is known as blind source separation.
Classic example: The “cocktail party problem”—you have multiple microphones in a room with several people speaking. ICA can separate each speaker’s voice from the mixed signals.
✅ Best for:
• Signal separation
• Latent factor analysis
• EEG/MEG brain signal decomposition
🔍 Example: Image Decomposition
• PCA applied to face images gives eigenfaces—components that look like blurry averaged faces (maximally varying directions).
• ICA yields more interpretable parts-based components like eyes, nose, mouth, because it’s finding statistically independent sources.
How do you handle time-based features in a machine learning model?
What is feature hashing? When would you use it?
Feature hashing is a method of converting categorical features (often with many unique values) into a fixed-length numerical vector using a hash function. Instead of assigning a unique index to every possible category (like in one-hot encoding), you apply a hash function to the category name to determine its position in a fixed-size vector.
Feature hashing is especially useful when:
✅ 1. High-Cardinality Categorical Data
• Text tokens (e.g., n-grams)
• User IDs, item IDs
• URLs, IP addresses
✅ 2. Unknown or Streaming Categories
• You don’t know all possible categories up front (e.g., in online learning or live systems)
✅ 3. Memory Constraints
• One-hot encoding or embeddings would blow up in size
• You want a bounded memory footprint regardless of input scale
Risk: hash collisions. You trade interpretability and collision-free encoding for speed and scalability.
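The hashing trick in miniature, in plain Python (hashlib's md5 stands in for the fast stable hashes, e.g. MurmurHash, that real implementations like sklearn's FeatureHasher use):

```python
import hashlib

def hash_feature(value, n_buckets=8):
    """Deterministically map a category string to a bucket index."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % n_buckets

def hash_vector(categories, n_buckets=8):
    """Fixed-length count vector: no vocabulary stored, unseen categories OK."""
    vec = [0] * n_buckets
    for c in categories:
        vec[hash_feature(c, n_buckets)] += 1
    return vec

# Unseen categories need no vocabulary update; collisions are the tradeoff.
print(hash_vector(["user_123", "item_99", "user_123"]))
```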
How do you handle hierarchical categorical variables?
Hierarchical categorical variables are multi-level categorical features, where categories are nested within one another in a tree-like structure. Each level adds contextual information about the instance, and the hierarchy may carry semantic or statistical relationships between levels.
e.g. Electronics > Computers > Laptops
1. Encode Each Level Separately
Treat each level of the hierarchy (e.g., Category, Subcategory, Product) as its own feature.
Then encode each:
• One-hot encode or label encode for tree-based models
• Embeddings for neural networks
✅ Works well for tree-based models (e.g., XGBoost, LightGBM)
Pitfalls:
Flattening full paths with one-hot -> Too sparse, high dimensionality
Ignoring hierarchy -> Model misses important generalizations
Encoding only lowest level -> Leads to overfitting if that level is noisy or sparse
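The "encode each level separately" strategy can be sketched on the Electronics example above (toy paths; label encoding per level, as you might feed a tree-based model):

```python
# Split each hierarchical category path into one feature per level,
# then label-encode each level independently.
paths = [
    "Electronics > Computers > Laptops",
    "Electronics > Computers > Desktops",
    "Electronics > Phones > Smartphones",
]

levels = [p.split(" > ") for p in paths]
n_levels = len(levels[0])

encoded = []
for i in range(n_levels):
    values = [row[i] for row in levels]
    mapping = {v: k for k, v in enumerate(sorted(set(values)))}
    encoded.append([mapping[v] for v in values])

# encoded[0] = top-level category codes, encoded[1] = subcategory codes, ...
print(encoded)
```

Because each level is its own column, the model can generalize at the coarse level (all rows share "Electronics") even when the leaf level is sparse.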