How does classification work?
It models the choice between two distinct alternatives. Ex: Is this fraud or not? Will someone buy this product?
Can we use linear regression for classification tasks?
Yes, if we encode the outcome as a binary 0/1 variable, but the fitted line can produce predictions outside [0, 1], which is one motivation for logistic regression.
What is the logistic function (logistic regression)?
p(X) = e^(beta0 + beta1*X) / (1 + e^(beta0 + beta1*X))
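A minimal sketch of the logistic function in Python (the function name `logistic` and the example coefficients are made up for illustration):

```python
import math

def logistic(x, beta0, beta1):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1 + math.exp(z))

# When beta0 + beta1*x = 0, the probability is exactly 0.5
print(logistic(0.0, 0.0, 1.0))  # 0.5
```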
What are “the odds”?
p(X)/(1-p(X))
(This equals e^(beta0 + beta1 * X))
Taking the log gives the log-odds (logit): log(p(X)/(1 - p(X))) = beta0 + beta1*X
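A quick numeric check of the odds identity (the coefficient values here are arbitrary, chosen only to make the arithmetic visible):

```python
import math

beta0, beta1, x = -1.0, 2.0, 0.75
z = beta0 + beta1 * x               # the linear score, 0.5 here
p = math.exp(z) / (1 + math.exp(z))

odds = p / (1 - p)                  # equals e^(beta0 + beta1*x)
log_odds = math.log(odds)           # equals beta0 + beta1*x itself
print(round(odds, 4), round(log_odds, 4))  # 1.6487 0.5
```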
To fit the logistic model, we need to do numerical optimization like…
MLE!
Steps to MLE:
1.) Write the likelihood function
2.) Take the log and run a numerical optimization on it, like gradient ascent: start with an initial guess, move the parameters in the direction that increases the log-likelihood, stop when the improvement is small enough
What is the log-likelihood function?
sum(i=1 to n) of (y_i * log(y_i_hat) + (1 - y_i) * log(1 - y_i_hat))
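The formula above can be sketched directly in Python (the name `log_likelihood` and the toy labels/probabilities are made up for illustration):

```python
import math

def log_likelihood(y, y_hat):
    """Log-likelihood of binary labels y given predicted probabilities y_hat."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, y_hat))

y = [1, 0, 1]
y_hat = [0.9, 0.2, 0.8]
print(log_likelihood(y, y_hat))
```

Better predictions give a higher (less negative) log-likelihood, which is what MLE maximizes.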
What are a couple standard optimization methods?
Gradient ascent
Newton-Raphson (IRLS)
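A minimal gradient-ascent sketch for fitting the logistic model (the function name `fit_logistic`, the learning rate, and the toy data are all invented for illustration; a real fit would use a convergence check rather than a fixed step count):

```python
import math

def fit_logistic(xs, ys, lr=0.05, steps=5000):
    """Fit beta0, beta1 by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0                      # initial guess
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p                # d(log-lik)/d(beta0)
            g1 += (y - p) * x          # d(log-lik)/d(beta1)
        b0 += lr * g0                  # ascent: move *up* the gradient
        b1 += lr * g1
    return b0, b1

# Overlapping classes so the MLE stays finite
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

Newton-Raphson (IRLS) would converge in far fewer iterations by also using second-derivative information.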
The K-NN Classifier - process:
1.) Fix K
2.) Find the K points nearest to your point
3.) Assign the point to the class in the majority among the neighbors
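The three steps above can be sketched for one-dimensional data (the name `knn_predict` and the toy training set are made up for illustration):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (x, label) pairs with numeric x.
    Classify query by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda pt: abs(pt[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [(1.0, "a"), (1.5, "a"), (2.0, "a"),
         (8.0, "b"), (8.5, "b"), (9.0, "b")]
print(knn_predict(train, 1.2))  # a
print(knn_predict(train, 8.7))  # b
```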
Naïve Bayes - how is it different?
We make the simplifying assumption that the features are conditionally independent given the class
Bayes can handle special kinds of data…
Text data! (many features, often more features than observations)(spam or ham)
What is the P>N problem?
If we have more features (P) than observations (N), models like logistic regression cannot be fit reliably (common with text/NLP and biomedical data)
In order to avoid log(0) in spam vs ham, we add a fudge factor called
Laplace smoothing
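A toy spam/ham sketch with add-one (Laplace) smoothing (the function name `train_nb` and the tiny corpus are invented for illustration):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns a predict function using
    class log-priors and add-one smoothed word log-likelihoods."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)

    def log_prob(word, label):
        # +1 in the numerator, +|vocab| in the denominator: never log(0)
        return math.log((word_counts[label][word] + 1) /
                        (sum(word_counts[label].values()) + len(vocab)))

    def predict(tokens):
        scores = {}
        for label in label_counts:
            scores[label] = (math.log(label_counts[label] / len(docs)) +
                             sum(log_prob(w, label) for w in tokens if w in vocab))
        return max(scores, key=scores.get)

    return predict

docs = [(["win", "cash", "now"], "spam"),
        (["cheap", "cash", "win"], "spam"),
        (["meeting", "at", "noon"], "ham"),
        (["lunch", "meeting", "today"], "ham")]
predict = train_nb(docs)
print(predict(["win", "cash"]))       # spam
print(predict(["meeting", "today"]))  # ham
```

Note how the independence assumption shows up as a plain sum of per-word log-probabilities.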
Pros and cons of Naive Bayes
Fast, works well with high dimensions, handles irrelevant features well
Independence assumption means we can’t learn about interactions between variables
In evaluation, what’s the difference between regression and classification?
Regression: Care about distance (MSE)
Classification : Care about correct label (2 errors: false positives, false negatives)
We use an ROC curve to…
Visualize the types of errors for all possible thresholds. The quality of the classifier is the area between the ROC curve and the X = Y line.
How do you calculate the ROC curve?
1.) Sort the instances by score (e.g., the predicted probability that the data point is a +), in descending order
2.) Apply a threshold at each unique score and record the TP, FP, TN, FN counts of the classifier at that threshold, plus the TP rate and FP rate
3.) Plot the ROC curve by connecting the dots in the TP/FP space
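The three steps above can be sketched as follows (the name `roc_points` and the toy scores/labels are made up for illustration; this simple version assumes the scores are unique, so it lowers the threshold one instance at a time rather than grouping tied scores):

```python
def roc_points(scores, labels):
    """Sweep the threshold down the sorted scores and collect (FPR, TPR) points.
    scores: predicted probability of the positive class; labels: 1 = positive."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)            # total positives
    N = len(labels) - P        # total negatives
    points = [(0.0, 0.0)]      # strictest threshold: nothing predicted positive
    tp = fp = 0
    for score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels))
```

Connecting these points in FP-rate/TP-rate space draws the ROC curve.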
How can we measure classifier quality from the ROC?
AUC! The area under the ROC curve - an ideal ROC curve will hug the top left corner of the space. An AUC of 0.5 (on the diagonal) is random. AUC less than this indicates worse than random performance!
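A minimal AUC sketch using the trapezoidal rule over the same threshold sweep (the function name `auc` and the toy data are invented for illustration):

```python
def auc(scores, labels):
    """Area under the ROC curve, accumulated trapezoid by trapezoid
    while sweeping the threshold down the sorted scores."""
    pairs = sorted(zip(scores, labels), reverse=True)
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    area = 0.0
    prev_fpr = prev_tpr = 0.0
    for score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / N, tp / P
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid slice
        prev_fpr, prev_tpr = fpr, tpr
    return area

print(auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0]))  # 5/6, better than random
```

A perfect ranking (every positive scored above every negative) gives 1.0; random scoring hovers around 0.5.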