Data mining definition
o A process that uses statistical, mathematical, AI, and machine-learning techniques
o Its purpose is to extract and identify useful information and subsequent knowledge from large databases
• Data mining tools find patterns in data and may even infer rules or models from them
• Other names:
o Knowledge extraction, pattern analysis, knowledge discovery, information harvesting…
Data Mining Process
Different groups have different versions; the most common standard processes are:
1) CRISP-DM Process: First 4 steps account for 85% of total project time; highly repetitive and experimental process
1. Develop Business Understanding
2. Then, develop Data Understanding
3. Prepare Data
4. Build model
5. Test and evaluate
6. Deploy
2) SEMMA Process
1. Sample: generate a representative sample of the data
2. Explore: visualize the data and make a basic description of it
3. Modify: select variables, transform the variable representations
4. Model: use a variety of statistical and machine learning models
5. Assess: Evaluate the accuracy and usefulness of the models
Data Preparation
1) Data consolidation: collecting, selecting, and integrating data
2) Data cleaning: imputing missing values, reducing noise, and eliminating inconsistencies in data
3) Data transformation: normalizing data, discretizing and aggregating data, constructing new attributes
4) Data reduction: reducing the number of variables and cases, balancing skewed data
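As a rough illustration of the cleaning and transformation steps above, here is a minimal sketch in plain Python. The records, the column names (age, income, income_band), the 50000 threshold, and the helper functions are all hypothetical; mean imputation and min-max scaling are just one common choice for "imputing missing values" and "normalizing data", not the only one.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 25, "income": 40000.0},
    {"age": None, "income": 52000.0},
    {"age": 35, "income": None},
    {"age": 45, "income": 88000.0},
]

def impute_mean(rows, column):
    """Data cleaning: replace missing values in `column` with the column mean."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = mean(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def discretize(rows, column, threshold, new_column):
    """Data transformation: construct a new two-level attribute from a numeric one."""
    return [
        {**r, new_column: "high" if r[column] >= threshold else "low"}
        for r in rows
    ]

def min_max_normalize(rows, column):
    """Data transformation: rescale `column` linearly to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    # If every value is identical there is nothing to scale; use 0.0.
    return [{**r, column: (r[column] - lo) / span if span else 0.0} for r in rows]

cleaned = impute_mean(impute_mean(records, "age"), "income")
banded = discretize(cleaned, "income", 50000.0, "income_band")
prepared = min_max_normalize(banded, "age")

for row in prepared:
    print(row)
```

Each helper returns new rows instead of modifying the input, so the intermediate results (cleaned, banded) stay available for inspection.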
What does Data Mining do? How does it work?
Types of patterns:
Applications of Data Mining
Data Mining Terminology
Data science/data mining = Statistics/operations research
Features/attributes = Independent variables; predictors; explanatory variables
Target variable/attribute/label = Dependent variable
Bias = Intercept in regression analysis
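To make the vocabulary mapping above concrete, the sketch below fits a one-feature least-squares line in plain Python. The data and the function name fit_line are made up for illustration; the point is only that what data mining calls the feature, target, and bias correspond to the independent variable, dependent variable, and intercept of the regression.

```python
# Toy data (hypothetical): one feature and one target, generated from y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]   # feature / attribute = independent variable (predictor)
y = [3.0, 5.0, 7.0, 9.0]   # target / label = dependent variable

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (weight, bias)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(xs, ys))
    sxx = sum((a - x_mean) ** 2 for a in xs)
    weight = sxy / sxx                 # slope of the fitted line
    bias = y_mean - weight * x_mean    # "bias" = intercept in regression terms
    return weight, bias

weight, bias = fit_line(x, y)
print(f"weight (slope) = {weight}, bias (intercept) = {bias}")
```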
Taxonomy of Data Mining Tasks
Data Mining Methods – Classification
Definition:
Estimation Methodologies for Classification
Simple split (or holdout or test sample estimation)
K-fold cross validation (rotation estimation)
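The two estimation methodologies named above can be sketched in plain Python as follows. The function names, the 30% test fraction, k = 5, and the fixed seed are assumptions chosen for the example; the notes do not prescribe particular values.

```python
import random

def simple_split(items, test_fraction=0.3, seed=0):
    """Simple split (holdout): shuffle once, then cut into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_splits(items, k=5, seed=0):
    """K-fold cross validation: yield k (train, test) pairs.

    The data is shuffled and dealt into k folds; each fold serves as the
    test set exactly once while the remaining k - 1 folds form the training set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))   # stand-in for 10 labeled examples

train, test = simple_split(data, test_fraction=0.3)
print("holdout:", len(train), "train /", len(test), "test")

folds = list(k_fold_splits(data, k=5))
for i, (fold_train, fold_test) in enumerate(folds, start=1):
    print(f"fold {i}: test on {sorted(fold_test)}")
```

In practice a model would be trained on each training set and scored on the matching test set, with the k scores averaged into one accuracy estimate.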
Other estimation methodologies:
Accuracy of Classification Models
Accuracy = (True Positive Count + True Negative Count) / (Total Number of Cases)
True Positive Rate = True Positive Count / (True Positive Count + False Negative Count)
True Negative Rate = True Negative Count / (True Negative Count + False Positive Count)
Precision = True Positive Count / (True Positive Count + False Positive Count)
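The four measures above can be computed directly from confusion-matrix counts. The sketch below uses made-up counts (tp=40, tn=45, fp=5, fn=10) and a hypothetical function name, and assumes every denominator is nonzero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four measures from confusion-matrix counts.

    Assumes each denominator is nonzero (at least one actual positive,
    one actual negative, and one predicted positive).
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),   # share of actual positives found
        "true_negative_rate": tn / (tn + fp),   # share of actual negatives found
        "precision": tp / (tp + fp),            # share of predicted positives that are right
    }

# Hypothetical counts for a binary classifier evaluated on 100 cases.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```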
Decision Trees
Entropy and Information Gain
Entropy = -p1 log(p1) - p2 log(p2) - …
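A minimal sketch of the entropy formula above, where p1, p2, … are the class proportions in a set. The notes leave the base of the logarithm unstated; base 2 (entropy measured in bits) is assumed here, and terms with p = 0 are skipped by the usual convention that 0 log 0 = 0.

```python
from math import log2

def entropy(probabilities):
    """Entropy of a class distribution, in bits (log base 2 is an assumption).

    Terms with p == 0 contribute nothing and are skipped.
    """
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # evenly mixed two-class set: maximum impurity
print(entropy([1.0]))        # pure set: no impurity
print(entropy([0.9, 0.1]))   # mostly one class: low impurity
```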
Information Gain
IG(parent, children) = entropy(parent) - p(c1) entropy(c1) - p(c2) entropy(c2) - …
To test the accuracy of the model
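The information gain formula above can be sketched as follows, taking p(c) to be the share of the parent's instances that fall into child c (an assumption; the notes do not define it) and using base-2 logarithms. The labels and the candidate split are invented for the example.

```python
from collections import Counter
from math import log2

def entropy_of_labels(labels):
    """Entropy (in bits) of the class proportions in a list of labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    """entropy(parent) minus the size-weighted entropy of each child set."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_of_labels(child) for child in children)
    return entropy_of_labels(parent) - weighted

# Hypothetical split: 10 instances divided into two children of 5 each.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

gain = information_gain(parent, children)
print(f"information gain = {gain:.4f}")
```

A decision tree learner would evaluate this quantity for each candidate attribute and split on the one with the highest gain.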
Summary: ID3 Decision Trees