What is the Bag of Words model, and why is it commonly used in text classification?
BoW represents a document as a multiset of its words, i.e., a vector of word frequencies, ignoring grammar and word order. It's used because it's simple, converts text to numerical form, and works well with many classifiers.
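A minimal sketch of the idea, using only the standard library (whitespace tokenization is an assumption; real tokenizers also handle punctuation and casing rules):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; the resulting Counter maps
    # word -> frequency, discarding grammar and word order entirely.
    return Counter(text.lower().split())

doc = "the cat sat on the mat"
print(bag_of_words(doc))  # e.g. Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```

Two documents with the same words in a different order produce identical vectors, which is exactly the information BoW throws away.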
Why might a rule-based classifier be preferred over a machine learning classifier in some scenarios?
Rule-based classifiers can be highly accurate if rules are crafted by experts, are interpretable, and don’t require labeled training data. They are often used in regulated or high-stakes environments (e.g., intelligence agencies).
How does kNN classify a new document?
kNN finds the k most similar documents in the training set (using a similarity measure such as cosine similarity) and assigns the majority class among them.
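The procedure can be sketched end to end with bag-of-words vectors and cosine similarity; the helper names and the tiny training set below are illustrative, not from any particular library:

```python
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine_sim(a, b):
    # Cosine similarity between two word-frequency dicts.
    dot = sum(c * b.get(w, 0) for w, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy training set: (bag-of-words vector, class label)
train = [
    (bow("goal scored in the final match"), "sports"),
    (bow("team wins the league match"), "sports"),
    (bow("stock prices fall on market news"), "finance"),
    (bow("investors watch market trends"), "finance"),
]

def knn_classify(text, k=3):
    # Rank training docs by similarity to the query, take the top k,
    # and return the majority class among those neighbors.
    ranked = sorted(train, key=lambda d: cosine_sim(bow(text), d[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify("match result and final score"))  # sports
```

Note that all the work happens at query time: nothing is "trained" in advance, which is why kNN is called a lazy learner.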
What is the “curse of dimensionality” in the context of text classification?
Text data often has thousands of features (words), which can lead to sparse vector representations, increased computational cost, and overfitting.
Explain the bias-variance tradeoff using kNN and Naive Bayes as examples.
kNN has low bias (flexible, adapts to data) but high variance (sensitive to noise). Naive Bayes has high bias (makes strong independence assumptions) but low variance (stable across datasets).
What is text classification?
Assigning documents to predefined categories based on content.
Name three methods of text classification.
Manual, rule-based, supervised learning.
What is the Bag of Words model?
A text representation that counts word occurrences, ignoring order.
What is a centroid in text classification?
The average vector of all documents in a class.
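A short sketch of computing a class centroid from word-frequency dicts (the averaging over per-word counts is the standard definition; the example documents are made up):

```python
from collections import Counter

def centroid(docs):
    # docs: list of word -> count dicts belonging to one class.
    # The centroid averages each word's count across the documents.
    total = Counter()
    for d in docs:
        total.update(d)
    n = len(docs)
    return {w: c / n for w, c in total.items()}

class_docs = [{"goal": 2, "match": 1}, {"match": 1, "team": 2}]
print(centroid(class_docs))  # {'goal': 1.0, 'match': 1.0, 'team': 1.0}
```

A nearest-centroid (Rocchio-style) classifier then assigns a new document to the class whose centroid it is most similar to.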
How does Naive Bayes work?
Uses word frequencies and Bayes’ theorem, assuming feature independence.
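A compact multinomial Naive Bayes sketch from scratch, assuming pre-tokenized documents and add-one (Laplace) smoothing; the class and method names are illustrative:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class names.
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter(labels)      # class priors (as counts)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        best, best_lp = None, -math.inf
        n = sum(self.class_counts.values())
        for c in self.class_counts:
            lp = math.log(self.class_counts[c] / n)  # log prior P(c)
            total = sum(self.word_counts[c].values())
            V = len(self.vocab)
            for w in tokens:
                # Naive independence assumption: multiply per-word likelihoods
                # (add in log space); Laplace smoothing avoids zero probabilities.
                lp += math.log((self.word_counts[c][w] + 1) / (total + V))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = NaiveBayes().fit([["cheap", "pills"], ["meeting", "agenda"]], ["spam", "ham"])
print(nb.predict(["cheap", "cheap", "pills"]))  # spam
```

Working in log space keeps the product of many small probabilities numerically stable.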
What is kNN in classification?
k-Nearest Neighbors classifies based on majority class of k closest training examples.
What is the similarity hypothesis?
Similar documents are close together in vector space.
What is overfitting?
When a model learns noise in training data and performs poorly on new data.
What is bias in ML?
Error from overly simplistic assumptions.
What is variance in ML?
Error from sensitivity to small changes in training data.
Why use feature selection?
To reduce noise and training time, and to improve generalization.
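One simple feature-selection heuristic is document-frequency thresholding; this sketch (function name and threshold are illustrative) keeps only words that appear in enough documents:

```python
from collections import Counter

def select_by_df(docs, min_df=2):
    # Keep only words occurring in at least min_df documents;
    # very rare words are often noise and inflate dimensionality.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each word once per document
    return {w for w, c in df.items() if c >= min_df}

docs = [["cheap", "pills"], ["cheap", "meeting"], ["agenda"]]
print(select_by_df(docs))  # {'cheap'}
```

More principled criteria such as mutual information or chi-squared scores rank features by how informative they are about the class label.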
What is lazy learning?
Learning that defers processing until classification (e.g., kNN).