CBA DS Flashcards

(23 cards)

1
Q

Explain the evolution from CNNs to Transformers. Why did Transformers become dominant?

A

CNNs excel at spatial features but struggle with long-range dependencies. RNNs handle sequences but suffer from vanishing gradients and cannot parallelize training. LSTMs improved memory (gates) but are still sequential. Transformers dominate because they utilize Self-Attention to process entire sequences in parallel (speed) and capture global dependencies regardless of distance (context).

2
Q

How does Bagging reduce variance?

A

Bagging (Bootstrap Aggregating) averages predictions from multiple high-variance (overfitting) models (like deep decision trees). By averaging uncorrelated errors, the aggregate model stabilizes the prediction, effectively smoothing out the noise and reducing the overall variance without increasing bias.
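A minimal sketch of the variance-reduction effect, simulating uncorrelated high-variance "models" as noisy predictors (the noise level and ensemble size here are illustrative, not from any real dataset):

```python
import random
import statistics

random.seed(0)

def noisy_model_prediction(true_value, noise=2.0):
    """One high-variance 'model': the truth plus random noise."""
    return true_value + random.gauss(0, noise)

true_value = 10.0
# Variance of single-model predictions vs. an average ("bag") of 25 models.
singles = [noisy_model_prediction(true_value) for _ in range(1000)]
bagged = [statistics.mean(noisy_model_prediction(true_value) for _ in range(25))
          for _ in range(1000)]

print(statistics.variance(singles))  # ~noise^2
print(statistics.variance(bagged))   # ~noise^2 / 25 -- averaging shrinks variance
```

For truly uncorrelated errors, averaging n models divides the variance by n; correlated trees (hence Random Forest's feature randomness) give a smaller but still real reduction.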

3
Q

What is Bootstrapping in the context of Random Forest?

A

Bootstrapping is the technique of creating datasets by random sampling from the original data with replacement. This means the same data point can appear multiple times in a single sample, while others may be left out (Out-of-Bag samples).
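A small sketch of drawing one bootstrap sample (the toy dataset of ten points is illustrative):

```python
import random

random.seed(42)

data = list(range(10))  # original dataset: points 0..9
# A bootstrap sample: same size as the original, drawn WITH replacement.
bootstrap = [random.choice(data) for _ in range(len(data))]
out_of_bag = [x for x in data if x not in bootstrap]

print(bootstrap)    # some points can appear more than once...
print(out_of_bag)   # ...while others are never drawn (the OOB samples)
```

On average about 37% of the original points (roughly e^-1) end up out-of-bag, which Random Forest can use as a built-in validation set.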

4
Q

Differentiate between Boosting and Bagging.

A

Bagging (e.g., Random Forest) builds models in parallel independent of each other to reduce variance. Boosting (e.g., XGBoost) builds models sequentially, where each new model attempts to correct the errors of the previous one, primarily reducing bias.

5
Q

XGBoost vs. Standard Gradient Boosting: What are the key differences?

A
1. Regularization: XGBoost includes L1/L2 regularization to prevent overfitting (standard GBM does not).
2. Parallel Processing: XGBoost parallelizes tree construction (split finding).
3. Tree Pruning: It uses max_depth but also prunes trees backward using a gamma threshold.
4. Missing Values: It learns the best direction for missing values automatically.
6
Q

Explain Decision Trees, Random Forest, and XGBoost internals in one sentence each.

A

Decision Tree: Splits data recursively based on features that maximize Information Gain (or minimize Gini Impurity).
Random Forest: An ensemble of decision trees trained in parallel on bootstrapped data, using feature randomness to decorrelate trees.
XGBoost: A gradient boosting framework that trains shallow trees sequentially to minimize the residual errors of the previous ensemble, utilizing regularization.
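The split criterion mentioned above can be sketched in a few lines; Gini impurity for a node is 1 - Σ p_k², where p_k are the class proportions:

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in probs)

print(gini_impurity([0, 0, 0, 0]))  # 0.0 -- pure node, nothing left to split
print(gini_impurity([0, 0, 1, 1]))  # 0.5 -- maximally impure for two classes
```

A decision tree picks the split whose child nodes have the lowest weighted impurity.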

7
Q

How does changing max_depth affect a tree-based model?

A

Increasing max_depth allows the model to learn more specific, complex patterns, which decreases bias but increases the risk of overfitting (high variance). Decreasing it acts as regularization (underfitting/high bias).

8
Q

How does learning_rate impact a Boosting model?

A

The learning rate (shrinkage) scales the contribution of each new tree. A lower learning rate requires more trees (n_estimators) to reach the same performance but generally results in a more robust, generalized model.
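A toy sketch of the shrinkage trade-off, using constant "learners" that each fit the current residual (real boosting fits trees, but the lr-vs-rounds dynamic is the same):

```python
def rounds_to_fit(target, lr, tol=0.01):
    """Boost with constant learners: each round adds lr * (current residual)."""
    prediction, rounds = 0.0, 0
    while abs(target - prediction) > tol:
        prediction += lr * (target - prediction)  # fit the residual, scaled by lr
        rounds += 1
    return rounds

print(rounds_to_fit(1.0, lr=0.5))  # a handful of rounds
print(rounds_to_fit(1.0, lr=0.1))  # many more rounds for the same fit
```

The residual shrinks by a factor of (1 - lr) per round, which is why a lower learning rate needs a larger n_estimators.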

9
Q

Precision vs. Recall: When to use which? (Give examples)

A

Precision is vital when False Positives are costly (e.g., Spam Filter: don’t delete important emails).
Recall is vital when False Negatives are dangerous (e.g., Fraud/Cancer Detection: don’t miss a fraudulent transaction or a tumor, even if you flag some safe ones).

10
Q

Explain P-value in layman’s terms.

A

A P-value tells us how likely it is that we would see results at least as extreme as ours purely by random chance, assuming there is no real effect. A low P-value (usually < 0.05) means such results would be very unlikely to be a fluke, so we can be reasonably confident there is a real effect or relationship.
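The idea can be made concrete with a simulation-based p-value; the "9 heads in 10 flips" scenario is an illustrative example:

```python
import random

random.seed(0)

observed_heads = 9   # we saw 9 heads in 10 flips of a coin
n_flips, n_sims = 10, 100_000

# Simulate a *fair* coin (the "no real effect" assumption) many times and count
# how often chance alone produces a result at least as extreme as ours.
extreme = sum(
    sum(random.random() < 0.5 for _ in range(n_flips)) >= observed_heads
    for _ in range(n_sims)
)
p_value = extreme / n_sims
print(p_value)  # ~0.01: very unlikely to be a fluke, so the coin is probably biased
```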

11
Q

How do you calculate Precision and Recall from a Confusion Matrix?

A

Precision = TP / (TP + FP) [Of all predicted positives, how many were actually positive?]
Recall = TP / (TP + FN) [Of all actual positives, how many did we find?]
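The two formulas, directly as code (the counts are illustrative):

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted positives, fraction correct
    recall = tp / (tp + fn)     # of actual positives, fraction found
    return precision, recall

p, r = precision_recall(tp=8, fp=2, fn=4)
print(p)  # 0.8    -- 8 of 10 predicted positives were real
print(r)  # ~0.667 -- 8 of 12 actual positives were found
```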

12
Q

How do you identify the “Best Model” from an ROC Curve?

A

The best model is the one where the curve is closest to the top-left corner (High True Positive Rate, Low False Positive Rate). You quantify this using the AUC (Area Under Curve) score; the closer to 1.0, the better.
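A quick sketch of computing AUC by the trapezoid rule from sorted ROC points (the curves below are made-up examples):

```python
def auc(fpr, tpr):
    """Area under an ROC curve given sorted (FPR, TPR) points, via trapezoids."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

# A curve bulging toward the top-left corner scores close to 1.0.
print(auc([0.0, 0.2, 1.0], [0.0, 0.8, 1.0]))  # 0.8
print(auc([0.0, 1.0], [0.0, 1.0]))            # 0.5 -- the random-guess diagonal
```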

13
Q

How can you reduce False Positives?

A
1. Threshold Tuning: Increase the probability threshold for classifying a positive (e.g., >0.7 instead of >0.5).
2. Feature Engineering: Add features that distinguish hard negatives.
3. Regularization: Penalize the model to prevent overfitting to noise.
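Threshold tuning (option 1 above) can be sketched with made-up model scores:

```python
def count_false_positives(probs, labels, threshold):
    """Count negatives (label 0) that the model flags positive at this threshold."""
    return sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)

probs  = [0.9, 0.8, 0.6, 0.55, 0.4]   # hypothetical model scores
labels = [1,   1,   0,   0,    0]     # ground truth

print(count_false_positives(probs, labels, 0.5))  # 2 false positives
print(count_false_positives(probs, labels, 0.7))  # 0 -- raising the bar removes them
```

The cost is usually lower recall: some true positives with middling scores get dropped too.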
14
Q

L1 vs. L2 Regularization: Key Differences.

A

L1 (Lasso) adds the absolute value of weights to the loss; it can shrink weights to zero, performing feature selection.
L2 (Ridge) adds the squared value of weights; it shrinks weights towards zero but rarely to zero, handling multicollinearity better.
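A sketch of why L1 zeroes weights while L2 only shrinks them, using L1's soft-thresholding update and L2's multiplicative shrink (the weight and lambda values are illustrative):

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def soft_threshold(w, lam):
    """L1's update: weights smaller than lambda snap exactly to zero."""
    return (abs(w) - lam) * (1 if w > 0 else -1) if abs(w) > lam else 0.0

print(soft_threshold(0.05, 0.1))  # 0.0 -- L1 zeroes the weight (feature selection)
print(0.05 / (1 + 0.1))           # ~0.045 -- L2 shrinks it, but never to zero
```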

15
Q

Batch Normalization vs. Layer Normalization.

A

Batch Norm: Normalizes across the batch dimension (computes mean/var of a feature across all samples in a batch). Good for CNNs.
Layer Norm: Normalizes across the feature dimension (computes mean/var of all features for a single sample). Good for RNNs/Transformers as it doesn’t depend on batch size.
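The difference is just which axis the statistics are computed over; a NumPy sketch on a tiny made-up batch (omitting the learned scale/shift parameters and epsilon for clarity):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # shape (batch=2, features=3)

# Batch Norm: normalize each feature across the batch (axis=0).
bn = (x - x.mean(axis=0)) / x.std(axis=0)
# Layer Norm: normalize each sample across its features (axis=1).
ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

print(bn.mean(axis=0))  # ~[0, 0, 0]: each feature column is centered
print(ln.mean(axis=1))  # ~[0, 0]:    each sample row is centered
```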

16
Q

Explain Self-Attention in Transformers.

A

Self-Attention allows the model to look at other words in the input sequence to understand the context of the current word. It assigns “weights” to how relevant every other word is to the current word (e.g., linking the pronoun “it” back to “dog” earlier in the sentence).
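A minimal NumPy sketch of scaled dot-product self-attention; the random embeddings and projection matrices Wq/Wk/Wv stand in for learned parameters:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # relevance of word j to word i
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # 4 "words", 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # (4, 8): one context-mixed vector per word
print(weights.sum(axis=-1))  # each word's attention weights sum to 1
```

Because every word attends to every other word in one matrix product, the whole sequence is processed in parallel.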

17
Q

What are the main components of a RAG (Retrieval-Augmented Generation) system?

A
1. Retriever: Searches a knowledge base (vector database) to find relevant documents based on the user query.
2. Generator (LLM): Takes the retrieved documents + the user query as a prompt to generate a grounded answer.
3. Vector DB: Stores embeddings of the data for semantic search.
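The retrieval step can be sketched with cosine similarity over toy vectors; the document names and 3-dim "embeddings" below are entirely hypothetical stand-ins for a real embedding model and vector DB:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical pre-computed embeddings standing in for a vector database.
doc_vectors = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.2],
    "privacy notice": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # imagined embedding of "how do I get my money back?"

# Retriever: rank documents by semantic similarity; the best one would be
# pasted into the LLM prompt to ground the generated answer.
best_doc = max(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]))
print(best_doc)  # "refund policy"
```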
18
Q

K-Means Clustering: How does it work?

A
1. Initialize K centroids randomly.
2. Assign each data point to the nearest centroid.
3. Re-calculate centroids based on the mean of points in the cluster.
4. Repeat until centroids stop moving (convergence).
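The four steps above, as a minimal 1-D sketch (fixed starting centroids instead of random init, so the run is reproducible; no handling of empty clusters):

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Minimal 1-D K-Means: assign, re-center, repeat until convergence."""
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        # Step 4: stop once centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

print(kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1.0, 10.0]))  # [2.0, 11.0]
```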
19
Q

Supervised vs. Unsupervised Learning.

A

Supervised: Training with labeled data (Input X -> Output Y). Used for Regression/Classification.
Unsupervised: Training with unlabeled data to find structure/patterns. Used for Clustering/Dimensionality Reduction.

20
Q

How to prevent overfitting in a CNN?

A
1. Dropout: Randomly disable neurons during training.
2. Data Augmentation: Flip, rotate, or crop images to increase data diversity.
3. Pooling: Reduces spatial dimensions/parameters.
4. Early Stopping: Stop training when validation loss increases.
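Dropout (option 1 above) as a standalone sketch, using the common "inverted dropout" variant that rescales survivors at train time (the activation values are made up):

```python
import random

random.seed(1)

def dropout(activations, p=0.5):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

activations = [0.2, 0.5, 0.9, 1.3, 0.7, 0.4]
dropped = dropout(activations)
print(dropped)  # some units silenced, the rest scaled up to keep the expected sum
```

Because a different random subset is silenced each step, no neuron can rely on a specific co-adapted partner, which reduces overfitting.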
21
Q

Case Study: How to predict customer behavior for a new product?

A
1. EDA: Check distributions, outliers, correlations.
2. Feature Engineering: Create lag features, aggregated past spend, user demographics.
3. Model: Random Forest/XGBoost for interpretability and handling non-linear data.
4. Metric: Precision/Recall depending on marketing budget vs. coverage.
22
Q

Why are RNNs more prone to vanishing gradients than CNNs?

A

RNNs use Weight Sharing: the same weight matrix is applied at every time step. During backpropagation through time, the gradient is multiplied by this matrix once per step, effectively raising it to the power of the sequence length ($W^t$), which causes exponential decay (vanishing) or exponential growth (exploding).
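A scalar sketch of the $W^t$ effect, with illustrative values standing in for the recurrent weight matrix's dominant eigenvalue:

```python
# The same recurrent weight applied over t steps behaves like w**t:
w_small, w_large, t = 0.9, 1.1, 50

print(w_small ** t)  # ~0.005 -- the gradient signal vanishes
print(w_large ** t)  # ~117   -- the gradient signal explodes
```

Even weights barely off 1.0 decay or blow up exponentially over a 50-step sequence, which is exactly what LSTMs were designed to avoid.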

23
Q

How does an LSTM solve the vanishing gradient problem?

A

LSTMs use a Cell State (conveyor belt) and Gating mechanisms. Information is added or removed via addition/linear operations rather than repeated matrix multiplication, allowing gradients to flow more easily over long distances.