What is datamining?
What is an algorithm?
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer
Some applications of Datamining
CRM
- profiles, cross-selling, bundling
Web Optimisation
- click paths, search
Traffic forecasts
Fraud detection
Product Placement
Data mining methods
Data Mining Methods
Components from …
traditional statistics and data analysis
- e.g. regression, clustering, factor analysis
artificial intelligence
- e.g. machine learning, artificial neural networks, evolutionary algorithms
pattern recognition
- e.g. artificial neurons as identifiers
data base theory and practice
- e.g. association analysis, OLAP macros
What is an attribute?
provides information on one specific characteristic of each instance
e.g. the column stating the patients’ name for each recorded visit
What is an instance?
One specific entry in the data set
e.g. the row describing one patients’ specific visit
What is a data set?
All the data that you plan to analyse
e.g. all patients who visited this hospital during the last 5 years
Denormalised: one long table
Evaluating Data Mining
To evaluate the outcome of data mining efforts, we usually split the data set into three parts:
Training set:
- Use an algorithm to generate a model from the data
Validation set:
- Validate and improve the model
Test set:
- Test the applicability and performance of the model
Supervised learning
For training, validation and test sets, the solution is known a priori
e.g. classification
Unsupervised learning
There is no structured solution (yet)
e.g. clustering, association rules
What is clustering and when is it applied?
Unsupervised segmentation of an instance set based on attributes
Is applied, when instance segments are not known a priori, but the possibility of segmentation is expected
Clustering
Conditions
Clustering
Unique
Every instance belongs to exactly one cluster
e.g. bank note identification
Clustering
Overlapping
One Instance can belong to several clusters
e.g. expert profiles
Clustering
Probabilistic
Each instance has a certain probability of belonging to each cluster
e.g. customer segments
What is hierarchical clustering?
Cluster on lower aggregation levels are part of clusters on higher aggregation levels
e.g. biological classification by family, genus, species
Hierarchical clustering
Agglomerative clustering
Initially, every instance represents an individual cluster
-> larger clusters emerge through agglomeration
Hierarchical clustering
Divisive clustering
Initially, the whole instance set is a single cluster
-> smaller clusters emerge through division
Offline vs. online clustering
Offline:
- a complete set of instances is provided before clustering starts
Online:
- instances are added iteratively, clustering can be started when at least three instances are available
Offline vs. online clustering
Online additional info
k-Means clustering Algorithm
Attributes
unique, flat, offline
k-Means clustering algorithm
Steps
k-Means Clustering algorithm
Improvements, variations, problem
Improvements:
Do not randomly choose initial centroids, but according to distance variance: k-means++
Variations:
Diverse distance measures depending on attribute types
Problem:
What is the best k?