What is the Bag of Words model, and why is it commonly used in text classification?
BoW represents a document as a multiset of its words, i.e., a vector of word frequencies, ignoring grammar and word order. It's used because it's simple, converts text to numerical form, and works well with many classifiers.
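A minimal sketch of the idea, using only the standard library (whitespace tokenization is an assumption; real tokenizers also handle punctuation and casing rules):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; the resulting Counter maps
    # word -> frequency, discarding grammar and word order entirely.
    return Counter(text.lower().split())

doc = "the cat sat on the mat"
print(bag_of_words(doc))  # e.g. Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```

Two documents with the same words in a different order produce identical vectors, which is exactly the information BoW throws away.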
Why might a rule-based classifier be preferred over a machine learning classifier in some scenarios?
Rule-based classifiers can be highly accurate if rules are crafted by experts, are interpretable, and don’t require labeled training data. They are often used in regulated or high-stakes environments (e.g., intelligence agencies).
How does kNN classify a new document?
kNN finds the k most similar documents in the training set (using a similarity measure such as cosine similarity) and assigns the majority class among them.
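The procedure can be sketched end to end with bag-of-words vectors and cosine similarity; the helper names and the tiny training set below are illustrative, not from any particular library:

```python
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine_sim(a, b):
    # Cosine similarity between two word-frequency dicts.
    dot = sum(c * b.get(w, 0) for w, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy training set: (bag-of-words vector, class label)
train = [
    (bow("goal scored in the final match"), "sports"),
    (bow("team wins the league match"), "sports"),
    (bow("stock prices fall on market news"), "finance"),
    (bow("investors watch market trends"), "finance"),
]

def knn_classify(text, k=3):
    # Rank training docs by similarity to the query, take the top k,
    # and return the majority class among those neighbors.
    ranked = sorted(train, key=lambda d: cosine_sim(bow(text), d[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify("match result and final score"))  # sports
```

Note that all the work happens at query time: nothing is "trained" in advance, which is why kNN is called a lazy learner.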
What is the “curse of dimensionality” in the context of text classification?
Text data often has thousands of features (words), which can lead to sparse vector representations, increased computational cost, and overfitting.
Explain the bias-variance tradeoff using kNN and Naive Bayes as examples.
kNN has low bias (flexible, adapts to data) but high variance (sensitive to noise). Naive Bayes has high bias (makes strong independence assumptions) but low variance (stable across datasets).
What is text classification?
Assigning documents to predefined categories based on content.
Name three methods of text classification.
Manual, rule-based, supervised learning.
What is the Bag of Words model?
A text representation that counts word occurrences, ignoring order.
What is a centroid in text classification?
The average vector of all documents in a class.
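A short sketch of computing a class centroid from word-frequency dicts (the averaging over per-word counts is the standard definition; the example documents are made up):

```python
from collections import Counter

def centroid(docs):
    # docs: list of word -> count dicts belonging to one class.
    # The centroid averages each word's count across the documents.
    total = Counter()
    for d in docs:
        total.update(d)
    n = len(docs)
    return {w: c / n for w, c in total.items()}

class_docs = [{"goal": 2, "match": 1}, {"match": 1, "team": 2}]
print(centroid(class_docs))  # {'goal': 1.0, 'match': 1.0, 'team': 1.0}
```

A nearest-centroid (Rocchio-style) classifier then assigns a new document to the class whose centroid it is most similar to.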
How does Naive Bayes work?
Uses word frequencies and Bayes’ theorem, assuming feature independence.
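A compact multinomial Naive Bayes sketch from scratch, assuming pre-tokenized documents and add-one (Laplace) smoothing; the class and method names are illustrative:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class names.
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.class_counts = Counter(labels)      # class priors (as counts)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        best, best_lp = None, -math.inf
        n = sum(self.class_counts.values())
        for c in self.class_counts:
            lp = math.log(self.class_counts[c] / n)  # log prior P(c)
            total = sum(self.word_counts[c].values())
            V = len(self.vocab)
            for w in tokens:
                # Naive independence assumption: multiply per-word likelihoods
                # (add in log space); Laplace smoothing avoids zero probabilities.
                lp += math.log((self.word_counts[c][w] + 1) / (total + V))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = NaiveBayes().fit([["cheap", "pills"], ["meeting", "agenda"]], ["spam", "ham"])
print(nb.predict(["cheap", "cheap", "pills"]))  # spam
```

Working in log space keeps the product of many small probabilities numerically stable.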
What is kNN in classification?
k-Nearest Neighbors classifies based on majority class of k closest training examples.
What is the similarity hypothesis?
Similar documents are close together in vector space.
What is overfitting?
When a model learns noise in training data and performs poorly on new data.
What is bias in ML?
Error from overly simplistic assumptions.
What is variance in ML?
Error from sensitivity to small changes in training data.
Why use feature selection?
To reduce noise and training time, and to improve generalization.
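One simple feature-selection heuristic is document-frequency thresholding; this sketch (function name and threshold are illustrative) keeps only words that appear in enough documents:

```python
from collections import Counter

def select_by_df(docs, min_df=2):
    # Keep only words occurring in at least min_df documents;
    # very rare words are often noise and inflate dimensionality.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each word once per document
    return {w for w, c in df.items() if c >= min_df}

docs = [["cheap", "pills"], ["cheap", "meeting"], ["agenda"]]
print(select_by_df(docs))  # {'cheap'}
```

More principled criteria such as mutual information or chi-squared scores rank features by how informative they are about the class label.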
What is lazy learning?
Learning that defers processing until classification (e.g., kNN).