What is classification
Classification is the task of predicting labels for unseen data based on patterns from training data
General approach to classification
Start from a training set with known class labels, build a classification model on it, and then apply that model to the test set, whose labels are unknown
Decision tree induction
The decision tree generation consists of two phases: tree construction and tree pruning.
Tree construction
* At the start, all the training examples are at the root
* Split examples based on selected attributes
* examples that satisfy the condition on the selected attribute move to that branch
Tree pruning
* Identify and remove branches that reflect noise or outliers
A decision tree can be used to classify an unseen sample: the sample's attribute values are tested against the tree you built from the training data to predict its class label.
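The classification step above can be sketched with a tiny hand-built tree; the attribute names, values, and labels here are invented for illustration:

```python
# A minimal sketch: a decision tree as nested dicts, where internal nodes
# test one attribute and leaves are class labels (all names are invented).
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"attribute": "windy",
                  "branches": {True: "no", False: "yes"}},
    },
}

def classify(node, sample):
    # A leaf is a plain label; an internal node routes the sample down
    # the branch matching its value for the tested attribute.
    if not isinstance(node, dict):
        return node
    value = sample[node["attribute"]]
    return classify(node["branches"][value], sample)

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

The unseen sample never needs a label: the tree's tests on its attributes produce the predicted label.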
What is splitting
splitting is how we divide the data at a node based on an attribute so that the resulting child nodes are “purer”
Decision tree
A decision tree is a tree-like structure with branches and leaf nodes. Branches represent tests or decisions on attributes, and the leaf nodes determine the class label (or prediction) for the data.
What is discretization.
discretization converts a continuous attribute into discrete bins, e.g. ages grouped as 20–29, 30–39, 40–49
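A minimal sketch of those age bins (the bin boundaries follow the note; the function name is invented):

```python
# Discretization sketch: map a continuous age to one of the bins above.
def discretize_age(age):
    if 20 <= age <= 29:
        return "20-29"
    if 30 <= age <= 39:
        return "30-39"
    if 40 <= age <= 49:
        return "40-49"
    return "other"  # ages outside the listed bins

print([discretize_age(a) for a in [23, 35, 48]])  # ['20-29', '30-39', '40-49']
```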
What is binarization
binarization splits an attribute into two groups with a single threshold, e.g. age < 10 vs. age ≥ 10
different purity measures
Information gain and gini index.
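Of the two, the Gini index is the quicker to sketch: Gini = 1 − Σ pᵢ², where pᵢ is the fraction of examples in class i (the label list below is invented):

```python
# Gini index sketch: 0 for a pure node, larger for more mixed nodes.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["yes", "yes", "no", "no"]))  # 0.5 (maximally mixed, two classes)
print(gini(["yes", "yes", "yes"]))       # 0.0 (pure node)
```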
Information gain
Information Gain (IG) measures how much uncertainty (entropy) is reduced if we split the dataset based on a particular attribute.
Entropy
Entropy measures the randomness (impurity) of the class labels at a node.
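Entropy and information gain can be sketched together, with entropy = −Σ pᵢ log₂ pᵢ and IG = entropy before the split minus the weighted entropy of the child nodes (the tiny dataset and attribute layout below are invented):

```python
import math

# Entropy of a list of class labels.
def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# Information gain of splitting rows on one attribute (by column index):
# parent entropy minus the weighted entropy of each child subset.
def information_gain(rows, labels, attribute_index):
    gain = entropy(labels)
    n = len(labels)
    for value in set(row[attribute_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels)
                  if row[attribute_index] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0: the split yields pure children
```

Here the split removes all uncertainty (both children are pure), so IG equals the parent entropy of 1 bit.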
Precision
Precision is the ratio of True Positives to all predicted Positives: TP / (TP + FP).
Recall
Recall tells us how many of the ACTUAL positives the model finds: TP / (TP + FN). Precision looks only at what the model predicted as positive; recall looks at everything that really is positive.
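Both metrics can be sketched from a list of actual vs. predicted labels (the example data is invented):

```python
# Precision = TP / (TP + FP); Recall = TP / (TP + FN).
def precision_recall(actual, predicted, positive="yes"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp / (tp + fp), tp / (tp + fn)

actual    = ["yes", "yes", "no", "no"]
predicted = ["yes", "no",  "yes", "no"]
print(precision_recall(actual, predicted))  # (0.5, 0.5)
```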
When is precision useful
Precision is more useful when false positives are costly, i.e. when we need the model's positive predictions to be reliably correct.
KNN classification
works by finding the "k" closest data points to the query and taking a majority vote among their labels. Not great with high dimensionality, because distances lose meaning as the number of dimensions grows (the curse of dimensionality)
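A minimal sketch of KNN with Euclidean distance and majority voting (the training points and labels are invented):

```python
import math
from collections import Counter

# KNN sketch: find the k nearest training points to the query and
# return the majority label among them.
def knn_predict(train, query, k=3):
    # train is a list of (point, label) pairs.
    by_dist = sorted(train, key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (1, 0)))  # a
```

In high dimensions every pairwise distance tends toward the same value, which is why the "closest" points stop being informative.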