What is the confusion matrix and why is it important?
A table summarizing TP, TN, FP, FN. It underpins all classification metrics (precision, recall, F1, specificity) and reveals model behavior beyond accuracy.
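A minimal sketch in plain Python (the function name `confusion_counts` is illustrative, not from any particular library) showing how the four cells are tallied for a binary problem:

```python
def confusion_counts(y_true, y_pred):
    # Tally the four cells of a binary confusion matrix (labels are 0/1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn
```

Every metric below (precision, recall, F1, specificity) is a ratio of these four counts.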
When is accuracy a misleading metric?
In imbalanced datasets (e.g., fraud, disease detection) where accuracy hides minority class failures.
What is the difference between precision and recall?
Precision: Of predicted positives, how many are correct? TP/(TP + FP). High precision means the model's positive predictions are highly reliable; it helps avoid false alarms (e.g., a spam filter).
Recall: Of actual positives, how many did we catch? TP/(TP + FN). High recall means the model catches most positive instances; it helps avoid missed detections (e.g., cancer/disease screening).
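The two formulas above, written directly from the confusion-matrix counts (a sketch; the function names are illustrative):

```python
def precision(tp, fp):
    # Of predicted positives, the fraction that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Of actual positives, the fraction we caught.
    return tp / (tp + fn)
```

Note the shared numerator TP: the two metrics differ only in which error type sits in the denominator.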
How do you choose between precision and recall?
Depends on business risk:
Precision priority → expensive FP (e.g., sending human reviewers)
Recall priority → expensive FN (e.g., medical diagnosis)
Why is F1-score the harmonic mean of precision and recall?
The harmonic mean is dominated by the smaller of the two values, so F1 is high only when both precision and recall are high; a model cannot hide poor recall behind perfect precision (or vice versa).
What does the ROC curve represent?
True Positive Rate vs. False Positive Rate trade-off at various decision thresholds.
What does AUC measure?
The probability that the classifier ranks a random positive higher than a random negative. Equivalent to ranking quality.
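The ranking interpretation can be computed directly by comparing every (positive, negative) score pair; a sketch in plain Python (the function name is illustrative):

```python
def auc_rank(pos_scores, neg_scores):
    # AUC = fraction of (positive, negative) pairs the classifier ranks
    # correctly; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

This pairwise form is O(n²) but makes the "probability of correct ranking" definition concrete; production libraries compute the same quantity from the ROC curve.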
When is PR curve preferred over ROC?
Highly imbalanced datasets → PR curve is more sensitive to performance on the positive class.
How are ROC/AUC used in LLM agent evaluation?
Tool selection classification
Retrieval relevance classifier
Routing tasks (which LLM or tool to pick)
Safety classifiers (detect harmful intent)
AUC helps evaluate ranking quality of these decision layers.
Why can’t logistic regression be solved with a closed-form solution?
The log-likelihood is concave but not quadratic → setting its gradient to zero gives transcendental equations with no algebraic solution → requires iterative optimization (e.g., gradient descent or Newton's method).
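A minimal sketch of that iterative approach for a single-feature model (gradient ascent on the log-likelihood; the function names and learning-rate choice are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, lr=0.1, steps=2000):
    # No closed form exists, so we iterate: each step moves (w, b) along the
    # gradient of the log-likelihood, sum over i of (y_i - sigmoid(w*x_i + b)).
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
        gb = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys))
        w += lr * gw
        b += lr * gb
    return w, b
```

Contrast with ordinary least squares, where the normal equations give the solution in one linear-algebra step.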
How do regularization terms affect logistic regression?
L1 → sparse features
L2 → stable coefficients, mitigates multicollinearity
Regularization improves generalization.
How to interpret logistic regression coefficients?
Exponentiating coefficients yields odds ratios; positive weights → increase log-odds of being class 1.
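A one-liner illustrating the odds-ratio interpretation (the function name is illustrative):

```python
import math

def odds_ratio(coef):
    # exp(beta): the multiplicative change in the odds of class 1
    # per one-unit increase in the feature.
    return math.exp(coef)
```

For example, a coefficient of 0.7 means each unit increase in that feature roughly doubles the odds of class 1, since e^0.7 ≈ 2.01.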
Why is logistic regression still important in LLM pipelines?
Lightweight safety and routing classifiers
Calibration layers for confidence
Reward model pre-steps
Linear probing on embeddings
Its interpretability makes it especially valuable in AI agent decision layers.
Why is Naive Bayes “naive”?
It assumes conditional independence between features given the class.
Why does Naive Bayes often perform surprisingly well?
Even with violated independence assumptions, the ranking of class probabilities often remains correct.
When is Naive Bayes especially effective?
Text classification
High-dimensional sparse features
Real-time or low-latency applications
What is Laplace smoothing and why is it used?
Adds pseudo-counts to avoid zero probabilities for unseen words/features.
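A sketch of add-alpha smoothing (alpha = 1 gives classic Laplace smoothing; the function name is illustrative):

```python
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    # Add alpha pseudo-counts to every vocabulary item so that an unseen
    # word gets a small nonzero probability instead of zeroing out the
    # whole product of likelihoods.
    return (count + alpha) / (total + alpha * vocab_size)
```

Without smoothing, a single unseen word in a test document would drive the entire Naive Bayes class likelihood to zero.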
What is the intuition behind the SVM margin?
SVM maximizes the smallest distance between the decision boundary and any training point → robustness to noise.
Why are kernel methods powerful?
They implicitly map data to high-dimensional spaces without explicitly computing those features (via kernel trick).
When are SVMs not a good choice?
Very large datasets → slow training
Very large feature spaces without kernel approximation
When probabilistic outputs are needed (unless calibrated, e.g., via Platt scaling)
What is the role of C in SVM?
Controls trade-off between maximizing margin and minimizing misclassification.
Large C → focus on correctness
Small C → focus on larger margin
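The trade-off is explicit in the soft-margin primal objective, sketched here in plain Python (the function name is illustrative): a margin term plus C-weighted hinge losses.

```python
def soft_margin_objective(w, b, data, C):
    # SVM primal objective: 0.5*||w||^2 + C * sum of hinge losses.
    # data is a list of (x, y) pairs with y in {-1, +1}.
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, y in data)
    return margin_term + C * hinge
```

As C grows, hinge violations dominate the objective, so the optimizer prioritizes classifying training points correctly; as C shrinks, the margin term dominates and the boundary widens.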
What is entropy in a classification tree?
A measure of uncertainty; lower entropy → purer nodes.
H(p) = −∑ᵢ pᵢ log pᵢ
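The formula in plain Python, using log base 2 so entropy is measured in bits (a common convention; the function name is illustrative):

```python
import math

def entropy(probs):
    # H(p) = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute nothing,
    # since the limit of p*log(p) as p -> 0 is 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A 50/50 node has maximal uncertainty (1 bit); a pure node has entropy 0.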
What is Gini impurity?
An alternative impurity measure that behaves very similarly to entropy in practice but is computationally cheaper (no logarithm).
G = ∑ᵢ pᵢ(1 − pᵢ)
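The same formula in plain Python (function name illustrative), for direct comparison with entropy above:

```python
def gini(probs):
    # G = sum(p_i * (1 - p_i)): the probability of misclassifying a sample
    # drawn and labeled at random from the node's class distribution.
    return sum(p * (1 - p) for p in probs)
```

Like entropy, Gini is 0 for a pure node and maximal at a 50/50 split (0.5 for two classes), which is why the two criteria usually pick the same splits.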
Why do decision trees overfit easily?
They split until leaves become pure, capturing noise. Pruning or limiting depth is needed.