Classification Flashcards

(17 cards)

1
Q

What is the Bag of Words model, and why is it commonly used in text classification?

A

BoW represents a document as a multiset of its words, i.e., as word frequencies, ignoring grammar and word order. It is widely used because it is simple, converts text into numerical vectors, and works well with many classifiers.
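
A minimal sketch of a BoW representation using only the Python standard library (the function name `bag_of_words` is illustrative, not from the source):

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts, ignoring grammar and order."""
    return Counter(text.lower().split())

doc = "the cat sat on the mat"
vector = bag_of_words(doc)
# 'the' appears twice; all order information is discarded
assert vector["the"] == 2
```

Real pipelines usually add tokenization, stop-word removal, and a shared vocabulary across all documents.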

2
Q

Why might a rule-based classifier be preferred over a machine learning classifier in some scenarios?

A

Rule-based classifiers can be highly accurate if rules are crafted by experts, are interpretable, and don’t require labeled training data. They are often used in regulated or high-stakes environments (e.g., intelligence agencies).

3
Q

How does kNN classify a new document?

A

kNN finds the k most similar documents in the training set (using a similarity measure such as cosine similarity) and assigns the majority class among them.
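
This procedure can be sketched in plain Python over word-count dictionaries (all names and the toy data are illustrative; cosine similarity is one common choice of measure):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, training, k=3):
    """training: list of (word_counts, label). Majority vote over the k nearest."""
    nearest = sorted(training,
                     key=lambda item: cosine_similarity(query, item[0]),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [({"cheap": 2, "pills": 1}, "spam"),
            ({"meeting": 1, "agenda": 1}, "ham"),
            ({"buy": 1, "cheap": 1}, "spam")]
assert knn_classify({"cheap": 1, "buy": 1}, training, k=3) == "spam"
```

Note that no model is built in advance: all work happens at query time, which is why kNN is called a lazy learner (see card 17).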

4
Q

What is the “curse of dimensionality” in the context of text classification?

A

Text data often has thousands of features (words), which can lead to sparse vector representations, increased computational cost, and overfitting.

5
Q

Explain the bias-variance tradeoff using kNN and Naive Bayes as examples.

A

kNN has low bias (flexible, adapts to data) but high variance (sensitive to noise). Naive Bayes has high bias (makes strong independence assumptions) but low variance (stable across datasets).

6
Q

What is text classification?

A

Assigning documents to predefined categories based on content.

7
Q

Name three methods of text classification.

A

Manual classification, rule-based classification, and supervised machine learning.

8
Q

What is the Bag of Words model?

A

A text representation that counts word occurrences, ignoring order.

9
Q

What is a centroid in text classification?

A

The average vector of all documents in a class.
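A sketch of computing a class centroid from word-count vectors (the function name and toy data are illustrative):

```python
from collections import Counter

def centroid(class_vectors):
    """Average the word-count vectors of all documents in one class."""
    total = Counter()
    for vec in class_vectors:
        total.update(vec)
    n = len(class_vectors)
    return {word: count / n for word, count in total.items()}

spam_docs = [{"cheap": 2, "pills": 1}, {"cheap": 1, "offer": 1}]
assert centroid(spam_docs) == {"cheap": 1.5, "pills": 0.5, "offer": 0.5}
```

A centroid-based classifier (e.g., Rocchio) then assigns a new document to the class with the nearest centroid.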

10
Q

How does Naive Bayes work?

A

Applies Bayes’ theorem to word frequencies, assuming words are conditionally independent given the class.
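
A minimal multinomial Naive Bayes sketch in the standard library, using add-one (Laplace) smoothing for unseen words; all names and the toy data are illustrative:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (word_list, label). Collect the counts Bayes' rule needs."""
    priors = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in priors}
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return priors, word_counts, vocab, len(docs)

def classify_nb(words, priors, word_counts, vocab, n_docs):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    best_label, best_score = None, float("-inf")
    for label, doc_count in priors.items():
        score = math.log(doc_count / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for w in words:  # add-one smoothing avoids zero probabilities
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [(["cheap", "pills", "cheap"], "spam"),
        (["meeting", "agenda"], "ham"),
        (["cheap", "offer"], "spam"),
        (["project", "meeting"], "ham")]
model = train_nb(docs)
assert classify_nb(["cheap", "pills"], *model) == "spam"
```

Because of the independence assumption, training is a single counting pass, which is why Naive Bayes is fast and low-variance (see card 5).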

11
Q

What is kNN in classification?

A

k-Nearest Neighbors classifies based on majority class of k closest training examples.

12
Q

What is the similarity hypothesis?

A

Documents in the same class are close together in vector space, so a new document can be classified by the documents nearest to it.

13
Q

What is overfitting?

A

When a model learns noise in training data and performs poorly on new data.

14
Q

What is bias in ML?

A

Error from overly simplistic assumptions.

15
Q

What is variance in ML?

A

Error from sensitivity to small changes in training data.

16
Q

Why use feature selection?

A

To reduce noise and training time, and to improve generalization.

17
Q

What is lazy learning?

A

Learning that defers processing until classification (e.g., kNN).