supervised learning
unsupervised learning
- clustering (group similar examples) and frequent pattern mining (find dependencies between variables)
univariate distribution
what values occur for an attribute and how often; the data is a sample of a population
entropy H(A)
how informative is an attribute? (about the value of another attribute)
measurement about the amount of information/chaos
distribution of a binary attribute
H(A) = – plg(p) – (1–p)lg(1–p)
H(A) maximal (1) when p = 1/2 = 1/m (m = number of possible values)
distribution of a nominal attribute
H(A) = Σ–p_{i}*lg(p_{i})
H maximal when p = 1/m
H_max = lg(m)
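a minimal sketch of the entropy formula above (function name is illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """H(A) = sum of -p_i * lg(p_i) over the observed value distribution, in bits."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# uniform over m = 4 values -> H_max = lg(4) = 2 bits
print(entropy(["a", "b", "c", "d"]))  # 2.0
# a pure attribute carries no information -> 0 bits
print(entropy(["a", "a", "a"]))       # 0.0
```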
histograms
define cut points and count occurrences within bins
- equal height: select k cut points such that all bins contain n/k data points
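a sketch of equal-height cut-point selection (the index arithmetic is one of several reasonable conventions):

```python
def equal_height_cuts(data, k):
    """Pick k-1 cut points so each of the k bins holds roughly n/k data points."""
    s = sorted(data)
    n = len(s)
    # cut at every n/k-th sorted value
    return [s[(i * n) // k] for i in range(1, k)]

# 12 points, k = 4 -> each bin gets 3 points
print(equal_height_cuts(list(range(12)), 4))  # [3, 6, 9]
```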
information gain
gain(A) = H(p) – ΣH(p_{i})*n_{i}/n
p probability of positive in current set (before split)
n number of examples in current set (before split)
p_{i} probability of positive in branch i (after split)
n_{i} number of examples in branch i (after split)
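the gain formula above, as a sketch for a binary class (0/1 labels; names are illustrative):

```python
import math

def h(p):
    """Binary entropy H(p) = -p*lg(p) - (1-p)*lg(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(parent, branches):
    """gain(A) = H(p) - sum over branches i of H(p_i) * n_i / n."""
    n = len(parent)
    gain = h(sum(parent) / n)
    for b in branches:
        gain -= h(sum(b) / len(b)) * len(b) / n
    return gain

# a perfect split of a 50/50 set gains the full 1 bit
print(info_gain([1, 1, 0, 0], [[1, 1], [0, 0]]))  # 1.0
```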
problem highly-branching attributes
attributes with a large number of values are more likely to create pure subsets => information gain is biased towards choosing attributes with many values (may result in overfitting: not optimal for prediction)
=> solution: gain ratio
Gain ratio
modification of the information gain that reduces its bias on high-branching attributes => large when data is divided in few, even groups, small when each example belongs to a separate branch
GainRatio(S,A) = Gain(S,A)/IntrinsicInfo(S,A)
intrinsic information
the entropy of distribution of instances into branches: the more branches, the higher this entropy
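a sketch of intrinsic information and the resulting gain ratio, showing how many-branch splits are penalized:

```python
import math

def intrinsic_info(branch_sizes):
    """Entropy of the distribution of instances into branches."""
    n = sum(branch_sizes)
    return sum(-(s / n) * math.log2(s / n) for s in branch_sizes if s)

def gain_ratio(gain, branch_sizes):
    """GainRatio(S,A) = Gain(S,A) / IntrinsicInfo(S,A)."""
    return gain / intrinsic_info(branch_sizes)

# 8 examples: 8 singleton branches cost lg(8) = 3 bits,
# two even halves cost only 1 bit -> gain ratio prefers the latter
print(intrinsic_info([1] * 8))  # 3.0
print(intrinsic_info([4, 4]))   # 1.0
```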
entropy over numeric attributes
many possible split points: evaluate the information gain for every possible split point and choose the best one; its info gain becomes the information gain for the attribute (breakpoints between values of the same class cannot be optimal)
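the search over split points can be sketched as follows (binary 0/1 labels; midpoints between consecutive sorted values as candidates is an assumption about the convention used):

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_numeric_split(values, labels):
    """Evaluate info gain at every midpoint between distinct consecutive
    sorted values; return (best split point, its information gain)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent_h = h(sum(labels) / n)
    best = (None, -1.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values: no split possible between them
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs[:i]]
        right = [l for v, l in pairs[i:]]
        gain = parent_h - (len(left) / n) * h(sum(left) / len(left)) \
                        - (len(right) / n) * h(sum(right) / len(right))
        if gain > best[1]:
            best = (split, gain)
    return best

print(best_numeric_split([1, 2, 3, 4], [0, 0, 1, 1]))  # (2.5, 1.0)
```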
pruning
prevent overfitting to noise in the data
prepruning, postpruning
prepruning
stop growing the tree when there is no statistically significant dependence between an attribute and the class at a particular node - chi-squared test
problem: prepruning might stop the process too soon (early stopping), e.g. the XOR problem (no individual attribute exhibits any significant association to the class; it is only visible in the expanded tree)
postpruning
first, build full tree
then, prune: subtree replacement (bottom up, consider replacing a tree only after considering all its subtrees)
estimating error rates
avoid overfitting: don’t model details that will produce errors on future data => estimate such errors
covering algorithms
for each class in turn, identify a rule set that covers all examples in it
selecting a test (covering algorithms)
goal: maximize accuracy
- t total number of examples covered by the rule
- p positive examples of the class covered by the rule
- select test that maximizes p/t (finished when p/t = 1)
PRISM algorithm
(separate-and-conquer: all examples covered by a rule are separated out, remaining examples are conquered)
for each class, build rules by adding tests (maximizing p/t) until the rule is perfect; repeat until every example of the class is covered by some rule
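a toy sketch of this separate-and-conquer loop (dict-based examples with a "class" key are my representation, and it assumes consistent, noise-free data so every rule can be made perfect):

```python
def prism_rules(examples, target):
    """Simplified PRISM for one class: grow a rule (attribute=value tests)
    maximizing p/t until it is perfect, separate out the covered examples,
    and repeat until no example of the target class remains uncovered."""
    rules = []
    remaining = list(examples)
    while any(e["class"] == target for e in remaining):
        rule = {}
        covered = remaining
        while any(e["class"] != target for e in covered):
            # candidate tests: attribute=value pairs not already in the rule
            candidates = {(a, e[a]) for e in covered for a in e
                          if a != "class" and a not in rule}
            def ratio(test):
                a, v = test
                sub = [e for e in covered if e[a] == v]
                p = sum(e["class"] == target for e in sub)
                return (p / len(sub), p)  # maximize p/t, break ties by coverage p
            best_attr, best_val = max(candidates, key=ratio)
            rule[best_attr] = best_val
            covered = [e for e in covered if e[best_attr] == best_val]
        rules.append(rule)
        remaining = [e for e in remaining
                     if not all(e.get(a) == v for a, v in rule.items())]
    return rules

toy = [{"outlook": "sunny", "windy": "no", "class": "yes"},
       {"outlook": "sunny", "windy": "yes", "class": "no"},
       {"outlook": "rain", "windy": "no", "class": "no"}]
print(prism_rules(toy, "yes"))  # [{'outlook': 'sunny', 'windy': 'no'}]
```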
instance-based / rote / lazy learning
training set is searched for example that most closely resembles the new example, examples themselves represent the knowledge
- (k-)NN
distance function
1-NN discussion
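the lazy learner above as a sketch (Euclidean distance over numeric vectors is an assumption; the training examples themselves are the model):

```python
def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nn_predict(train, query, k=1):
    """(k-)NN: classify a query by majority vote among its k nearest
    training examples; train is a list of (vector, label) pairs."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
print(nn_predict(train, (1, 1)))        # 'a'
print(nn_predict(train, (5, 5), k=3))   # 'b'
```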
proper test and training procedure uses 3 sets
training set: train models
validation set: optimize algorithm parameters
test set: evaluate final model
trade-off: performance vs evaluation accuracy
more training data => better model
more test data => more accurate error estimate
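the three-set procedure can be sketched as a shuffled split (the 60/20/20 fractions are illustrative, not prescribed by the notes):

```python
import random

def three_way_split(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split into training / validation / test sets:
    train the model, tune parameters, and evaluate on disjoint data."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```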