BIS II - Data Mining Flashcards

Question 1

Q

Data mining definition

Answer

A

• A process that uses statistical, mathematical, AI, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases
• Data mining tools find patterns in data and may even infer rules or models from them
• Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting…

Question 2

Q

Data Mining Process

Answer

A

Different groups have different versions; most common standard processes are:

CRISP-DM (Cross-Industry Standard Process for Data Mining)
SEMMA (Sample, Explore, Modify, Model, and Assess)

1) CRISP-DM Process: the first 4 steps account for 85% of total project time; the process is highly repetitive and experimental
1. Develop Business Understanding
2. Then, develop Data Understanding
3. Prepare Data
4. Build model
5. Test and evaluate
6. Deploy

2) SEMMA Process
1. Sample: generate a representative sample of the data
2. Explore: visualize the data and make a basic description of it
3. Modify: select variables, transform the variable representations
4. Model: use a variety of statistical and machine learning models
5. Assess: Evaluate the accuracy and usefulness of the models

Question 3

Q

Data Preparation

Answer

A

Most critical task in DM
Steps involved in obtaining well-formed data from skewed data

1) Data consolidation: collecting, selecting, and integrating data
2) Data cleaning: imputing missing values, reducing noise, and eliminating inconsistencies in data
3) Data transformation: normalizing data, discretizing and aggregating data, constructing new attributes
4) Data reduction: reducing the number of variables and cases, balancing skewed data
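A minimal sketch of three of the preparation steps above (imputing missing values, normalizing, and discretizing), using only the Python standard library. The function names, the `incomes` attribute, and the threshold of 50 are hypothetical and chosen only for illustration; mean imputation and min-max scaling are one common choice among many, not the only way to perform these steps.

```python
from statistics import mean

def impute_missing(values):
    """Data cleaning: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, threshold):
    """Data transformation: turn a numeric attribute into two categories."""
    return ["high" if v >= threshold else "low" for v in values]

incomes = [30, None, 50, 70]              # hypothetical attribute with one missing value
cleaned = impute_missing(incomes)         # missing entry filled with the mean of 30, 50, 70
normalized = min_max_normalize(cleaned)   # smallest value maps to 0, largest to 1
categories = discretize(cleaned, 50)      # hypothetical cut-off at 50
```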

Question 4

Q

What does Data Mining do? How does it work?

Answer

A

DM extracts patterns from data
Pattern = mathematical (numeric and/or symbolic) relationship among data items

Types of patterns:

Association
Prediction
Cluster (segmentation)
Sequential (or time series) relationships

Question 5

Q

Applications of Data Mining

Answer

A

In Customer Relationship Management:
1. To maximize return on marketing campaigns
2. To improve customer retention
3. To maximize customer value
Banking/financial
1. To automate loan application process
2. Detect fraudulent transactions
3. To optimize cash reserves with forecasting
Other application areas: retail and logistics, manufacturing and maintenance, brokerage and securities trading, insurance, etc.

Question 6

Q

Data Mining Terminology

Answer

A

Data science/data mining = Statistics/operations research
Features/attributes = Independent variables; predictors; explanatory variables
Target variable/attribute/label = Dependent variable
Bias = Intercept in regression analysis

Question 7

Q

Taxonomy of Data Mining Tasks

Answer

A

Unsupervised learning: aims to identify associations, i.e. to group data into previously unknown classes
Supervised learning: the classes are known in advance
Different classification approaches differ regarding:
1. Search strategy
2. Efficiency with regard to resources
3. Input data requirements
4. Interpretability of results, generated rules/models

Question 8

Q

Data Mining Methods – Classification

Answer

A

Definition:

Supervised induction used to analyze historical data stored in databases
To automatically generate a model that can predict future behavior

Most frequently used DM method
Employs supervised learning on past data, then classifies new data
Output variable is categorical (nominal or ordinal) in nature
Classification techniques:
1. Decision tree analysis
2. Artificial neural networks
3. Logistic regression
4. Support vector machines
5. Etc.

Question 9

Q

Estimation Methodologies for Classification

Answer

A

Simple split (or holdout or test sample estimation)

Split the data into 2 mutually exclusive sets: training for model development (ca. 70%) and testing for model assessment/scoring (ca. 30%) to determine prediction accuracy
For ANN, the data is split into three subsets (training, ca. 60%; validation, ca. 20%; and testing, ca. 20%)
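The simple split can be sketched as follows. This is an illustrative sketch, not a prescribed implementation: `holdout_split`, the fixed `seed`, and the ten-record `data` list are assumptions made for the example. The records are shuffled first so that the split does not depend on the original ordering.

```python
import random

def holdout_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into two mutually exclusive sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))               # hypothetical record IDs
train, test = holdout_split(data)    # ca. 70% training, ca. 30% testing
```

A three-way split for ANN could reuse the same idea by splitting the training part once more into training and validation subsets.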

K-fold cross validation (rotation estimation)

Split the data into k mutually exclusive subsets
Use each subset as testing while using the rest of the subsets as training
Repeat the experimentation k times
Aggregate the test results for a true estimate of prediction accuracy
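The k-fold procedure above might look like this in code. It is a sketch under stated assumptions: the records are assumed to be shuffled already, `fit` and `predict` are caller-supplied functions, and the majority-class "model" exists only to make the example runnable. All names are hypothetical.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k mutually exclusive folds over n records."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(records, labels, k, fit, predict):
    """Train k times, test each time on the held-out fold, and average the accuracy."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = fit([records[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, records[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

# Hypothetical stand-in model: always predict the most frequent training label.
def fit_majority(train_records, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, record):
    return model

records = list(range(6))
labels = ["a", "a", "a", "b", "a", "a"]
score = cross_validate(records, labels, 3, fit_majority, predict_majority)
```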

Other estimation methodologies:

Leave-one-out, bootstrapping, jackknifing
Area under the ROC curve

Question 10

Q

Accuracy of Classification Models

Answer

A

In classification problems, the primary source for accuracy estimation is the confusion matrix

Accuracy = (True Positive Count + True Negative Count) / (Total Count of all cases)
True Positive Rate = True Positive Count / (True Positive Count + False Negative Count)
True Negative Rate = True Negative Count / (True Negative Count + False Positive Count)
Precision = True Positive Count / (True Positive Count + False Positive Count)
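As a worked example, the four measures can be computed directly from the cells of a confusion matrix. The counts below are made up for illustration, and the sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),   # also called recall or sensitivity
        "true_negative_rate": tn / (tn + fp),   # also called specificity
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```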

Question 11

Q

Decision Trees

Answer

A

The likelihood that a data subject shows a specific outcome of the target variable is determined from the attributes observed (basically, how strongly the occurrence of an attribute value correlates with the target variable)
Question: “Which of the attributes would be best to segment these people (in the example) into groups, in a way that will distinguish write-offs from non-write-offs?”
The most informative attributes are found with a formula/algorithm that evaluates how well each attribute splits a set of examples into segments with respect to a chosen target variable: the ID3 decision tree algorithm

Question 12

Q

Entropy and Information Gain

Answer

A

Information Gain = the most common splitting criterion
1. Based on a purity measure = Entropy
Entropy
1. Measure of disorder
2. Disorder corresponds to how mixed (impure) the segment is with respect to the values of the attribute of interest
3. A mixed-up segment with many instances of both values of the target variable (write-offs and non-write-offs) has high entropy

Entropy = -p1 log(p1) - p2 log(p2) - …

pi = the probability of value i within the set
pi = 1 when all members of the set have attribute value i
pi = 0 when no members of the set have attribute value i
There may be more than two attribute values (properties)
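A small sketch of the entropy calculation. The card does not state the base of the logarithm; base 2 (entropy in bits) is assumed here, which is the usual choice for decision trees. The labels "w" (write-off) and "n" (non-write-off) are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels (base-2 logarithm assumed)."""
    total = len(labels)
    # Values that never occur contribute nothing, so only observed values are summed.
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

mixed = entropy(["w", "w", "n", "n"])     # evenly mixed segment: maximal disorder
pure = entropy(["w", "w", "w", "w"])      # pure segment: no disorder
skewed = entropy(["w", "w", "w", "n"])    # mostly one value: in between
```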

Question 13

Q

Information Gain

Answer

A

IG(parent, children) = entropy(parent) - p(c1) entropy(c1) - p(c2) entropy(c2) - …

Measures how informative an attribute is with respect to the target; how much an attribute decreases (improves) entropy over the whole segmentation it creates
An attribute segments a set of instances into k subsets. Terminology:
1. Parent set: the original set of examples
2. Children sets: the k sets that result from splitting on the attribute values
The entropy of each child ci is weighted by p(ci), the proportion of instances belonging to that child
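The information gain formula can be sketched as follows, assuming base-2 entropy and representing each set simply as a list of target labels. The example labels "w" and "n" are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels (base-2 logarithm assumed)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child sets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["w", "w", "w", "w", "n", "n", "n", "n"]

# A split that separates the two target values perfectly.
perfect = information_gain(parent, [["w", "w", "w", "w"], ["n", "n", "n", "n"]])

# A split that leaves both children still mixed.
weak = information_gain(parent, [["w", "w", "w", "n"], ["w", "n", "n", "n"]])
```

The perfect split removes all of the parent's entropy, while the weak split removes only a small part of it, so an attribute producing the first split would be preferred.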

Question 14

Q

To test the accuracy of the model

Answer

A

Predict the values of the hold-out sample using the developed decision tree and compare them to the true values
Create a confusion matrix and compare the values for accuracy (hit rate), recall and precision of the prediction model
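A sketch of that comparison, assuming the predictions for the hold-out sample are already available as a list. The label values, the choice of "w" (write-off) as the positive class, and the sample data are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count the four confusion-matrix cells for one positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1
    return tp, tn, fp, fn

actual    = ["w", "w", "n", "n", "w", "n"]   # true values of the hold-out sample
predicted = ["w", "n", "n", "w", "w", "n"]   # hypothetical model output

tp, tn, fp, fn = confusion_counts(actual, predicted, positive="w")
accuracy = (tp + tn) / len(actual)   # hit rate
recall = tp / (tp + fn)
precision = tp / (tp + fp)
```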

Question 15

Q

Summary: ID3 Decision Trees

Answer

A

General algorithm for building an ID3 decision tree:
1. Create the root node and select the splitting attribute
2. Add a branch to the root node for each split candidate value and label it
3. Take the following iterative steps:
   1. Classify the data by applying the information gain measure
   2. If a stopping point is reached, create a leaf node and label it; otherwise, build another subtree

Disadvantages of ID3:
1. ID3 tends to prefer splits that result in a large number of partitions, each being small but pure
2. Overfitting, less generalization capability
3. Cannot handle numeric values or missing values

The C4.5 algorithm aims at curing these shortcomings
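The following is a compact, illustrative sketch of an ID3-style tree builder, not a reference implementation. It assumes categorical attributes, base-2 entropy, rows stored as dictionaries, and a stopping point of "segment is pure or no attributes are left". The attribute names `employed`, `balance`, and `write_off` and the six sample rows are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the parent minus the weighted entropy of the children."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - weighted

def id3(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree}} or a leaf label."""
    labels = [row[target] for row in rows]
    # Stopping point: pure segment or nothing left to split on -> majority leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):   # one branch per value
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, remaining, target)
    return tree

def classify(tree, row):
    """Follow branches until a leaf label is reached (values must be seen in training)."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree

rows = [  # hypothetical training data
    {"employed": "yes", "balance": "high", "write_off": "no"},
    {"employed": "yes", "balance": "low",  "write_off": "no"},
    {"employed": "yes", "balance": "low",  "write_off": "no"},
    {"employed": "no",  "balance": "high", "write_off": "no"},
    {"employed": "no",  "balance": "low",  "write_off": "yes"},
    {"employed": "no",  "balance": "low",  "write_off": "yes"},
]
tree = id3(rows, ["employed", "balance"], "write_off")
prediction = classify(tree, {"employed": "no", "balance": "low"})
```

On this data the sketch splits on `employed` first, because it has the higher information gain, and then on `balance` within the unemployed segment. Numeric attributes and missing values are deliberately not handled, which mirrors the ID3 limitations listed above.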

(15 cards)