k-NN process overview

Goal
Given a set of labeled items, automatically label a new item.
Idea
Consider the most similar other items (similarity is defined in terms of their attributes), look at their labels, and give the unassigned item the label that wins the majority vote. Ties are broken randomly.
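The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name knn_predict, the choice of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def knn_predict(train, new_item, k, rng=random):
    """Label new_item by majority vote among its k nearest labeled items.

    train: list of (attributes, label) pairs; attributes are numeric tuples.
    Euclidean distance is assumed here; any other metric could be swapped in.
    """
    # Sort the labeled items by distance to the new item and keep the k closest.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], new_item))[:k]
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    # Collect every label that shares the highest vote count;
    # if there is more than one, the tie is broken randomly.
    tied = [label for label, count in votes.items() if count == top]
    return rng.choice(tied)

# Hypothetical toy data: two clusters labeled "a" and "b".
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
prediction = knn_predict(train, (1.1, 1.0), k=3)
```

With k=3 the two nearby "a" items outvote the single "b" item, so the new item is labeled "a".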
To automate k-NN, what two decisions need to be made?
Euclidean distance

Cosine similarity

Jaccard distance

Hamming distance

Manhattan distance
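For reference, the five measures listed above can be written out using their standard textbook definitions. The sketch below is in plain Python; the function names are chosen for this example and are not taken from any particular library.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each attribute.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors (a similarity, not a distance).
    # Undefined if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_distance(s, t):
    # 1 minus (size of intersection / size of union) of two sets.
    # Undefined if both sets are empty.
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))
```

For example, euclidean((0, 0), (3, 4)) is 5.0, manhattan((0, 0), (3, 4)) is 7, and hamming("karolin", "kathrin") is 3.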

Regarding distance metrics: what if the attributes are a mixture of kinds of data?
Define your own custom-designed metric.
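One possible shape for such a custom metric is sketched below. It is only an illustration and is loosely modeled on the idea behind Gower distance: numeric attributes contribute a range-scaled absolute difference, and categorical attributes contribute 0 on a match and 1 on a mismatch. The function name mixed_distance, the numeric_ranges parameter, and the sample items are all hypothetical.

```python
def mixed_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity for items with mixed attribute types.

    numeric_ranges: dict mapping the index of each numeric attribute to its
    range (max - min) in the data. Every other attribute is treated as
    categorical.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            # Numeric: absolute difference scaled into [0, 1] by the range.
            total += abs(x - y) / numeric_ranges[i]
        else:
            # Categorical: 0 if the values match, 1 if they differ.
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical items: (age, color, subscribed), with ages spanning 50 years.
d = mixed_distance((30, "red", "yes"), (40, "blue", "yes"), numeric_ranges={0: 50})
```

Here the three attributes contribute 0.2, 1 and 0 respectively, so d is their average, about 0.4.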
synonymous terms

Evaluation metrics
Evaluation Metric : Accuracy
number of correct labels / (total number of labels)
Evaluation Metric : Precision
number of true positives / (number of true positives + number of false positives)
Evaluation Metric : Recall
number of true positives / (number of true positives + number of false negatives)
Evaluation Metric : F-score
Harmonic mean of precision and recall
(2 × precision × recall) / (precision + recall)
Evaluation Metric : Misclassification rate
1 - accuracy
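The formulas above can be checked on a small example. The sketch below assumes a binary problem in which one label is designated the positive class. The function name evaluate and the spam/ham data are made up for illustration, and the code assumes there is at least one actual positive and one predicted positive, since otherwise precision or recall would divide by zero.

```python
def evaluate(actual, predicted, positive):
    """Compute the evaluation metrics for one designated positive label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    correct = sum(a == p for a, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "misclassification_rate": 1 - accuracy,
    }

# Hypothetical labels for six items.
actual = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]
metrics = evaluate(actual, predicted, positive="spam")
```

In this example 4 of 6 labels are correct, with 2 true positives, 1 false positive and 1 false negative, so accuracy, precision, recall and F-score all come out to 2/3 and the misclassification rate to 1/3.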
Choosing k
Modeling assumptions in k-NN

Scaling
Standardize the data so that all variables are given a mean of zero and a standard deviation of one.
In R, this can be achieved using the scale() function.
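To make the computation concrete, here is a small Python sketch of standardizing a single variable. It is an illustration rather than a port of R's scale(); the function name standardize and the sample heights are assumptions. It uses the sample standard deviation (dividing by n - 1), which is also what scale() uses by default.

```python
import statistics

def standardize(column):
    """Rescale one variable to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in column]

# Hypothetical variable measured in centimeters.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(heights)
```

Standardizing matters for k-NN because distances are otherwise dominated by whichever variable happens to be measured on the largest numeric scale.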