Explain the evolution from CNNs to Transformers. Why did Transformers become dominant?
CNNs excel at spatial features but struggle with long-range dependencies. RNNs handle sequences but suffer from vanishing gradients and cannot parallelize training. LSTMs improved memory (gates) but are still sequential. Transformers dominate because they utilize Self-Attention to process entire sequences in parallel (speed) and capture global dependencies regardless of distance (context).
How does Bagging reduce variance?
Bagging (Bootstrap Aggregating) averages predictions from multiple high-variance (overfitting) models (like deep decision trees). By averaging uncorrelated errors, the aggregate model stabilizes the prediction, effectively smoothing out the noise and reducing the overall variance without increasing bias.
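The variance-reduction effect can be sketched with simulated, uncorrelated predictors (a toy illustration using Gaussian noise, not an actual Bagging implementation):

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 10.0

def noisy_predictor():
    # One high-variance model: unbiased prediction plus noise
    return TRUE_VALUE + random.gauss(0, 3)

def bagged_prediction(n_models):
    # Average n independent noisy models (idealized uncorrelated errors)
    return statistics.mean(noisy_predictor() for _ in range(n_models))

# Variance of a single model vs. an average of 50 models, over many trials
single = [noisy_predictor() for _ in range(2000)]
bagged = [bagged_prediction(50) for _ in range(2000)]
print(statistics.variance(single))  # ~9 (sigma^2)
print(statistics.variance(bagged))  # ~0.18 (sigma^2 / 50)
```

With perfectly uncorrelated errors, averaging n models divides the variance by n; correlated trees (hence the feature randomness in Random Forest) give a smaller but still substantial reduction.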
What is Bootstrapping in the context of Random Forest?
Bootstrapping is the technique of creating datasets by random sampling from the original data with replacement. This means the same data point can appear multiple times in a single sample, while others may be left out (Out-of-Bag samples).
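A quick sketch of bootstrap sampling, showing that the Out-of-Bag fraction converges to (1 − 1/n)ⁿ ≈ 1/e ≈ 36.8%:

```python
import random

random.seed(1)
n = 10_000
data = list(range(n))

# Draw a bootstrap sample: same size as the original, sampled WITH replacement
bootstrap = [random.choice(data) for _ in range(n)]

# Points never drawn are the "Out-of-Bag" (OOB) samples
oob = set(data) - set(bootstrap)
print(len(oob) / n)  # ≈ 0.368
```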
Differentiate between Boosting and Bagging.
Bagging (e.g., Random Forest) builds models in parallel independent of each other to reduce variance. Boosting (e.g., XGBoost) builds models sequentially, where each new model attempts to correct the errors of the previous one, primarily reducing bias.
XGBoost vs. Standard Gradient Boosting: What are the key differences?
XGBoost extends standard gradient boosting with built-in L1/L2 regularization on leaf weights, second-order (Hessian) gradient information for finding splits, native handling of missing values, and systems-level optimizations (parallelized split finding, cache-aware data structures). The result is typically faster training and less overfitting than plain gradient boosting.
Explain Decision Trees, Random Forest, and XGBoost internals in one sentence each.
Decision Tree: Splits data recursively based on features that maximize Information Gain (or minimize Gini Impurity).
Random Forest: An ensemble of decision trees trained in parallel on bootstrapped data, using feature randomness to decorrelate trees.
XGBoost: A gradient boosting framework that trains shallow trees sequentially to minimize the residual errors of the previous ensemble, utilizing regularization.
How does changing max_depth affect a tree-based model?
Increasing max_depth allows the model to learn more specific, complex patterns, which decreases bias but increases the risk of overfitting (high variance). Decreasing it acts as regularization (underfitting/high bias).
How does learning_rate impact a Boosting model?
The learning rate (shrinkage) scales the contribution of each new tree. A lower learning rate requires more trees (n_estimators) to reach the same performance but generally results in a more robust, generalized model.
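The trade-off can be seen in a toy boosting loop where each "tree" simply predicts the current residual and the learning rate scales its contribution (a sketch of the shrinkage mechanic, not a real gradient boosting implementation):

```python
# Toy boosting: the target is a single mean value, and each round's "tree"
# outputs the remaining residual, scaled by the learning rate.
targets = [3.0, -1.0, 4.0, 2.0]

def rounds_to_fit(learning_rate, tol=1e-6):
    target = sum(targets) / len(targets)
    prediction = 0.0
    rounds = 0
    while abs(target - prediction) > tol:
        # Each new "tree" corrects a learning_rate-sized fraction of the residual
        prediction += learning_rate * (target - prediction)
        rounds += 1
    return rounds

print(rounds_to_fit(0.5))   # 21 rounds
print(rounds_to_fit(0.05))  # 283 rounds — same fit, far more "trees"
```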
Precision vs. Recall: When to use which? (Give examples)
Precision is vital when False Positives are costly (e.g., Spam Filter: don’t delete important emails).
Recall is vital when False Negatives are dangerous (e.g., Fraud/Cancer Detection: don’t miss a fraudulent transaction or a tumor, even if you flag some safe ones).
Explain P-value in layman’s terms.
A P-value is the probability of seeing results at least as extreme as ours purely by random chance, assuming there is actually no real effect (the null hypothesis). A low P-value (usually < 0.05) means the result is very unlikely to be a fluke, giving us evidence of a real effect or relationship.
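The idea can be estimated by simulation. Suppose we observed 60 heads in 100 coin flips; the P-value is how often a fair coin (the "no effect" assumption) does at least that well by chance (a toy Monte Carlo estimate, not an exact test):

```python
import random

random.seed(42)

observed_heads = 60
trials = 20_000

# Simulate many experiments under the null hypothesis (a fair coin)
extreme = sum(
    1 for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(100)) >= observed_heads
)
p_value = extreme / trials
print(p_value)  # ≈ 0.028 — below 0.05, so unlikely to be pure chance
```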
How do you calculate Precision and Recall from a Confusion Matrix?
Precision = TP / (TP + FP) [Of all predicted positives, how many were actually positive?]
Recall = TP / (TP + FN) [Of all actual positives, how many did we find?]
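A direct translation of the two formulas, with hypothetical confusion-matrix counts:

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    return precision, recall

# e.g. 80 true positives, 20 false positives, 40 false negatives
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p)  # 0.8
print(r)  # 0.666...
```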
How do you identify the “Best Model” from an ROC Curve?
The best model is the one where the curve is closest to the top-left corner (High True Positive Rate, Low False Positive Rate). You quantify this using the AUC (Area Under Curve) score; the closer to 1.0, the better.
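AUC is just the area under the (FPR, TPR) curve, which can be computed with the trapezoidal rule (a minimal sketch on hand-picked ROC points):

```python
def auc(fpr, tpr):
    # Trapezoidal area under an ROC curve given as sorted (FPR, TPR) points
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

# A curve hugging the top-left corner scores near 1.0
print(auc([0.0, 0.1, 1.0], [0.0, 0.9, 1.0]))  # ≈ 0.9
# The diagonal (random guessing) scores exactly 0.5
print(auc([0.0, 1.0], [0.0, 1.0]))  # 0.5
```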
How can you reduce False Positives?
Raise the classification threshold so the model must be more confident before predicting positive, optimize for precision (or an F-beta score with beta < 1), adjust class weights to penalize False Positives more heavily, or engineer features that better separate the classes. Note the trade-off: fewer False Positives usually means more False Negatives (lower recall).
L1 vs. L2 Regularization: Key Differences.
L1 (Lasso) adds the absolute value of weights to the loss; it can shrink weights to zero, performing feature selection.
L2 (Ridge) adds the squared value of weights; it shrinks weights towards zero but rarely to zero, handling multicollinearity better.
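The contrast can be illustrated with one update step per penalty: a soft-threshold step (the proximal operator of the L1 penalty) versus a closed-form L2 shrinkage step. These are illustrative update rules, not a full regression solver:

```python
def l1_step(w, lam):
    # Soft-thresholding (Lasso): weights inside [-lam, lam] become exactly 0
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_step(w, lam):
    # Ridge shrinkage: scales weights toward 0 but never exactly to 0
    return w / (1 + lam)

weights = [3.0, 0.4, -0.2]
print([l1_step(w, 0.5) for w in weights])  # [2.5, 0.0, 0.0] — sparse
print([l2_step(w, 0.5) for w in weights])  # small but all nonzero
```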
Batch Normalization vs. Layer Normalization.
Batch Norm: Normalizes across the batch dimension (computes mean/var of a feature across all samples in a batch). Good for CNNs.
Layer Norm: Normalizes across the feature dimension (computes mean/var of all features for a single sample). Good for RNNs/Transformers as it doesn’t depend on batch size.
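The difference is just which axis the statistics are computed over. A minimal sketch on a 2×3 batch (means only, omitting variance and the learned scale/shift parameters):

```python
# Batch of 2 samples, 3 features each
batch = [[1.0, 2.0, 3.0],
         [3.0, 4.0, 5.0]]

def mean(xs):
    return sum(xs) / len(xs)

# Batch Norm: statistics per FEATURE, computed across the batch (column-wise)
batch_norm_means = [mean([row[j] for row in batch]) for j in range(3)]
print(batch_norm_means)  # [2.0, 3.0, 4.0] — one mean per feature

# Layer Norm: statistics per SAMPLE, computed across its features (row-wise)
layer_norm_means = [mean(row) for row in batch]
print(layer_norm_means)  # [2.0, 4.0] — one mean per sample, batch-size independent
```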
Explain Self-Attention in Transformers.
Self-Attention allows the model to look at other words in the input sequence to understand the context of the current word. It assigns “weights” to how relevant every other word is to the current word (e.g., linking “it” to “dog” in a previous sentence).
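A minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ/√d)·V, on tiny hand-picked matrices (no learned Q/K/V projections or multiple heads):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        # Relevance of every position to this query token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output = attention-weighted mix of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# 2 tokens, d=2: each output row is a convex combination of the value rows
X = [[1.0, 0.0], [0.0, 1.0]]
result = self_attention(X, X, X)
print(result)
```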
What are the main components of a RAG (Retrieval-Augmented Generation) system?
1) An indexing pipeline that chunks documents and embeds them with an embedding model; 2) A vector database that stores the embeddings; 3) A retriever that embeds the user query and fetches the most similar chunks; 4) A generator (the LLM) that answers the query using the retrieved chunks as grounded context in its prompt.
K-Means Clustering: How does it work?
1) Choose k and initialize k centroids (e.g., randomly); 2) Assign each point to its nearest centroid; 3) Recompute each centroid as the mean of its assigned points; 4) Repeat steps 2–3 until assignments stop changing. K-Means minimizes the within-cluster sum of squared distances (inertia), but the result depends on initialization and the choice of k.
Supervised vs. Unsupervised Learning.
Supervised: Training with labeled data (Input X -> Output Y). Used for Regression/Classification.
Unsupervised: Training with unlabeled data to find structure/patterns. Used for Clustering/Dimensionality Reduction.
How to prevent overfitting in a CNN?
Data augmentation (flips, crops, rotations), Dropout, Batch Normalization, weight decay (L2 regularization), early stopping on a validation set, reducing model capacity, and transfer learning from a pretrained network instead of training from scratch on a small dataset.
Case Study: How to predict customer behavior for a new product?
With no purchase history for the new product (a cold-start problem), start from proxy signals: customer behavior on similar existing products, segmentation and lookalike modeling on existing customers, and early-adopter or survey data. Once initial labels (purchases, clicks) arrive, frame it as a supervised problem, validate against a holdout, and retrain as data accumulates.
Why are RNNs more prone to vanishing gradients than CNNs?
RNNs use Weight Sharing, applying the same weight matrix at every time step. During backpropagation, this effectively raises the weights to the power of the sequence length ($W^t$), causing exponential decay or growth.
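The $W^t$ effect can be seen numerically with a scalar stand-in for the recurrent weight matrix:

```python
# Backprop through t time steps multiplies by the same recurrent weight each step,
# so the gradient scales like w**t: vanishing for |w| < 1, exploding for |w| > 1.
def gradient_scale(w, t):
    g = 1.0
    for _ in range(t):
        g *= w
    return g

print(gradient_scale(0.9, 100))  # ≈ 2.7e-5 — vanishes
print(gradient_scale(1.1, 100))  # ≈ 1.4e+4 — explodes
```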
How does an LSTM solve the vanishing gradient problem?
LSTMs use a Cell State (conveyor belt) and Gating mechanisms. Information is added or removed via addition/linear operations rather than repeated matrix multiplication, allowing gradients to flow more easily over long distances.