Explain the evolution from CNNs to Transformers. Why did Transformers become dominant?
CNNs excel at spatial features but struggle with long-range dependencies. RNNs handle sequences but suffer from vanishing gradients and cannot parallelize training. LSTMs improved memory (gates) but are still sequential. Transformers dominate because they utilize Self-Attention to process entire sequences in parallel (speed) and capture global dependencies regardless of distance (context).
How does Bagging reduce variance?
Bagging (Bootstrap Aggregating) averages predictions from multiple high-variance (overfitting) models (like deep decision trees). By averaging uncorrelated errors, the aggregate model stabilizes the prediction, effectively smoothing out the noise and reducing the overall variance without increasing bias.
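The variance-reduction effect can be sketched with simulated, uncorrelated predictors (a toy illustration using Gaussian noise, not an actual Bagging implementation):

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 10.0

def noisy_predictor():
    # One high-variance model: unbiased prediction plus noise
    return TRUE_VALUE + random.gauss(0, 3)

def bagged_prediction(n_models):
    # Average n independent noisy models (idealized uncorrelated errors)
    return statistics.mean(noisy_predictor() for _ in range(n_models))

# Variance of a single model vs. an average of 50 models, over many trials
single = [noisy_predictor() for _ in range(2000)]
bagged = [bagged_prediction(50) for _ in range(2000)]
print(statistics.variance(single))  # ~9 (sigma^2)
print(statistics.variance(bagged))  # ~0.18 (sigma^2 / 50)
```

With perfectly uncorrelated errors, averaging n models divides the variance by n; correlated trees (hence the feature randomness in Random Forest) give a smaller but still substantial reduction.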
What is Bootstrapping in the context of Random Forest?
Bootstrapping is the technique of creating datasets by random sampling from the original data with replacement. This means the same data point can appear multiple times in a single sample, while others may be left out (Out-of-Bag samples).
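A quick sketch of bootstrap sampling, showing that the Out-of-Bag fraction converges to (1 − 1/n)ⁿ ≈ 1/e ≈ 36.8%:

```python
import random

random.seed(1)
n = 10_000
data = list(range(n))

# Draw a bootstrap sample: same size as the original, sampled WITH replacement
bootstrap = [random.choice(data) for _ in range(n)]

# Points never drawn are the "Out-of-Bag" (OOB) samples
oob = set(data) - set(bootstrap)
print(len(oob) / n)  # ≈ 0.368
```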
Differentiate between Boosting and Bagging.
Bagging (e.g., Random Forest) builds models in parallel independent of each other to reduce variance. Boosting (e.g., XGBoost) builds models sequentially, where each new model attempts to correct the errors of the previous one, primarily reducing bias.
XGBoost vs. Standard Gradient Boosting: What are the key differences?
XGBoost extends standard gradient boosting with built-in L1/L2 regularization on leaf weights, second-order (Hessian) gradient information for finding splits, native handling of missing values, and systems-level optimizations (parallelized split finding, cache-aware data structures). The result is typically faster training and less overfitting than plain gradient boosting.
Explain Decision Trees, Random Forest, and XGBoost internals in one sentence each.
Decision Tree: Splits data recursively based on features that maximize Information Gain (or minimize Gini Impurity).
Random Forest: An ensemble of decision trees trained in parallel on bootstrapped data, using feature randomness to decorrelate trees.
XGBoost: A gradient boosting framework that trains shallow trees sequentially to minimize the residual errors of the previous ensemble, utilizing regularization.
How does changing max_depth affect a tree-based model?
Increasing max_depth allows the model to learn more specific, complex patterns, which decreases bias but increases the risk of overfitting (high variance). Decreasing it acts as regularization (underfitting/high bias).
How does learning_rate impact a Boosting model?
The learning rate (shrinkage) scales the contribution of each new tree. A lower learning rate requires more trees (n_estimators) to reach the same performance but generally results in a more robust, generalized model.
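The trade-off can be seen in a toy boosting loop where each "tree" simply predicts the current residual and the learning rate scales its contribution (a sketch of the shrinkage mechanic, not a real gradient boosting implementation):

```python
# Toy boosting: the target is a single mean value, and each round's "tree"
# outputs the remaining residual, scaled by the learning rate.
targets = [3.0, -1.0, 4.0, 2.0]

def rounds_to_fit(learning_rate, tol=1e-6):
    target = sum(targets) / len(targets)
    prediction = 0.0
    rounds = 0
    while abs(target - prediction) > tol:
        # Each new "tree" corrects a learning_rate-sized fraction of the residual
        prediction += learning_rate * (target - prediction)
        rounds += 1
    return rounds

print(rounds_to_fit(0.5))   # 21 rounds
print(rounds_to_fit(0.05))  # 283 rounds — same fit, far more "trees"
```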
Precision vs. Recall: When to use which? (Give examples)
Precision is vital when False Positives are costly (e.g., Spam Filter: don’t delete important emails).
Recall is vital when False Negatives are dangerous (e.g., Fraud/Cancer Detection: don’t miss a fraudulent transaction or a tumor, even if you flag some safe ones).
Explain P-value in layman’s terms.
A P-value is the probability of seeing results at least as extreme as ours purely by random chance, assuming there is actually no real effect (the null hypothesis). A low P-value (usually < 0.05) means the result is very unlikely to be a fluke, giving us evidence of a real effect or relationship.
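The idea can be estimated by simulation. Suppose we observed 60 heads in 100 coin flips; the P-value is how often a fair coin (the "no effect" assumption) does at least that well by chance (a toy Monte Carlo estimate, not an exact test):

```python
import random

random.seed(42)

observed_heads = 60
trials = 20_000

# Simulate many experiments under the null hypothesis (a fair coin)
extreme = sum(
    1 for _ in range(trials)
    if sum(random.random() < 0.5 for _ in range(100)) >= observed_heads
)
p_value = extreme / trials
print(p_value)  # ≈ 0.028 — below 0.05, so unlikely to be pure chance
```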
How do you calculate Precision and Recall from a Confusion Matrix?
Precision = TP / (TP + FP) [Of all predicted positives, how many were actually positive?]
Recall = TP / (TP + FN) [Of all actual positives, how many did we find?]
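A direct translation of the two formulas, with hypothetical confusion-matrix counts:

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    return precision, recall

# e.g. 80 true positives, 20 false positives, 40 false negatives
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p)  # 0.8
print(r)  # 0.666...
```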
How do you identify the “Best Model” from an ROC Curve?
The best model is the one where the curve is closest to the top-left corner (High True Positive Rate, Low False Positive Rate). You quantify this using the AUC (Area Under Curve) score; the closer to 1.0, the better.
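AUC is just the area under the (FPR, TPR) curve, which can be computed with the trapezoidal rule (a minimal sketch on hand-picked ROC points):

```python
def auc(fpr, tpr):
    # Trapezoidal area under an ROC curve given as sorted (FPR, TPR) points
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

# A curve hugging the top-left corner scores near 1.0
print(auc([0.0, 0.1, 1.0], [0.0, 0.9, 1.0]))  # ≈ 0.9
# The diagonal (random guessing) scores exactly 0.5
print(auc([0.0, 1.0], [0.0, 1.0]))  # 0.5
```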
How can you reduce False Positives?
Raise the classification threshold so the model must be more confident before predicting positive, optimize for precision (or an F-beta score with beta < 1), adjust class weights to penalize False Positives more heavily, or engineer features that better separate the classes. Note the trade-off: fewer False Positives usually means more False Negatives (lower recall).
L1 vs. L2 Regularization: Key Differences.
L1 (Lasso) adds the absolute value of weights to the loss; it can shrink weights to zero, performing feature selection.
L2 (Ridge) adds the squared value of weights; it shrinks weights towards zero but rarely to zero, handling multicollinearity better.
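The contrast can be illustrated with one update step per penalty: a soft-threshold step (the proximal operator of the L1 penalty) versus a closed-form L2 shrinkage step. These are illustrative update rules, not a full regression solver:

```python
def l1_step(w, lam):
    # Soft-thresholding (Lasso): weights inside [-lam, lam] become exactly 0
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_step(w, lam):
    # Ridge shrinkage: scales weights toward 0 but never exactly to 0
    return w / (1 + lam)

weights = [3.0, 0.4, -0.2]
print([l1_step(w, 0.5) for w in weights])  # [2.5, 0.0, 0.0] — sparse
print([l2_step(w, 0.5) for w in weights])  # small but all nonzero
```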
Batch Normalization vs. Layer Normalization.
Batch Norm: Normalizes across the batch dimension (computes mean/var of a feature across all samples in a batch). Good for CNNs.
Layer Norm: Normalizes across the feature dimension (computes mean/var of all features for a single sample). Good for RNNs/Transformers as it doesn’t depend on batch size.
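The difference is just which axis the statistics are computed over. A minimal sketch on a 2×3 batch (means only, omitting variance and the learned scale/shift parameters):

```python
# Batch of 2 samples, 3 features each
batch = [[1.0, 2.0, 3.0],
         [3.0, 4.0, 5.0]]

def mean(xs):
    return sum(xs) / len(xs)

# Batch Norm: statistics per FEATURE, computed across the batch (column-wise)
batch_norm_means = [mean([row[j] for row in batch]) for j in range(3)]
print(batch_norm_means)  # [2.0, 3.0, 4.0] — one mean per feature

# Layer Norm: statistics per SAMPLE, computed across its features (row-wise)
layer_norm_means = [mean(row) for row in batch]
print(layer_norm_means)  # [2.0, 4.0] — one mean per sample, batch-size independent
```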
Explain Self-Attention in Transformers.
Self-Attention allows the model to look at other words in the input sequence to understand the context of the current word. It assigns “weights” to how relevant every other word is to the current word (e.g., linking “it” to “dog” in a previous sentence).
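A minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ/√d)·V, on tiny hand-picked matrices (no learned Q/K/V projections or multiple heads):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        # Relevance of every position to this query token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output = attention-weighted mix of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# 2 tokens, d=2: each output row is a convex combination of the value rows
X = [[1.0, 0.0], [0.0, 1.0]]
result = self_attention(X, X, X)
print(result)
```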
What are the main components of a RAG (Retrieval-Augmented Generation) system?
1) An indexing pipeline that chunks documents and embeds them with an embedding model; 2) A vector database that stores the embeddings; 3) A retriever that embeds the user query and fetches the most similar chunks; 4) A generator (the LLM) that answers the query using the retrieved chunks as grounded context in its prompt.
K-Means Clustering: How does it work?
1) Choose k and initialize k centroids (e.g., randomly); 2) Assign each point to its nearest centroid; 3) Recompute each centroid as the mean of its assigned points; 4) Repeat steps 2–3 until assignments stop changing. K-Means minimizes the within-cluster sum of squared distances (inertia), but the result depends on initialization and the choice of k.
Supervised vs. Unsupervised Learning.
Supervised: Training with labeled data (Input X -> Output Y). Used for Regression/Classification.
Unsupervised: Training with unlabeled data to find structure/patterns. Used for Clustering/Dimensionality Reduction.
How to prevent overfitting in a CNN?
Data augmentation (flips, crops, rotations), Dropout, Batch Normalization, weight decay (L2 regularization), early stopping on a validation set, reducing model capacity, and transfer learning from a pretrained network instead of training from scratch on a small dataset.
Case Study: How to predict customer behavior for a new product?
With no purchase history for the new product (a cold-start problem), start from proxy signals: customer behavior on similar existing products, segmentation and lookalike modeling on existing customers, and early-adopter or survey data. Once initial labels (purchases, clicks) arrive, frame it as a supervised problem, validate against a holdout, and retrain as data accumulates.
Why are RNNs more prone to vanishing gradients than CNNs?
RNNs use Weight Sharing, applying the same weight matrix at every time step. During backpropagation, this effectively raises the weights to the power of the sequence length ($W^t$), causing exponential decay or growth.
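The $W^t$ effect can be seen numerically with a scalar stand-in for the recurrent weight matrix:

```python
# Backprop through t time steps multiplies by the same recurrent weight each step,
# so the gradient scales like w**t: vanishing for |w| < 1, exploding for |w| > 1.
def gradient_scale(w, t):
    g = 1.0
    for _ in range(t):
        g *= w
    return g

print(gradient_scale(0.9, 100))  # ≈ 2.7e-5 — vanishes
print(gradient_scale(1.1, 100))  # ≈ 1.4e+4 — explodes
```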
How does an LSTM solve the vanishing gradient problem?
LSTMs use a Cell State (conveyor belt) and Gating mechanisms. Information is added or removed via addition/linear operations rather than repeated matrix multiplication, allowing gradients to flow more easily over long distances.