Correlation vs. Regression
Correlation
• description of an undirected relationship between two or more variables
• measures how strong the relationship is
• the direction is unknown, does not exist, or is simply not of interest
• example: phones per household and infant deaths
Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• smoking and cancer
• weight and height
• model to describe the relationship
• model to predict one variable
The Coefficients
Classification / Regression Trees
Good Cluster
High-quality clusters have high intra-cluster similarity and low inter-cluster similarity.
Depends on the distance measure and the clustering method:
Good: small circles (compact clusters), long lines (well-separated clusters)
Bad: big circles, short lines
Similarity and distance: variable: binary
Matching coefficient:
Sij = (a + d)/(a + b + c + d)
Similarity and distance: variable: categorical
Jaccard coefficient:
Sij = a/(a + b + c); Jaccard distance = 1 - Sij
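Both coefficients can be computed from the 2x2 contingency counts a, b, c, d of two binary vectors; a small Python sketch:

```python
# Counts for two binary vectors x, y:
#   a = both 1, b = x=1/y=0, c = x=0/y=1, d = both 0
def pair_counts(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def matching_coeff(x, y):
    a, b, c, d = pair_counts(x, y)
    return (a + d) / (a + b + c + d)   # counts 0/0 matches too

def jaccard_sim(x, y):
    a, b, c, _ = pair_counts(x, y)
    return a / (a + b + c)             # ignores 0/0 matches

x = [1, 1, 0, 0, 1]
y = [1, 0, 0, 1, 1]
print(matching_coeff(x, y))  # (2 + 1) / 5 = 0.6
print(jaccard_sim(x, y))     # 2 / (2 + 1 + 1) = 0.5
```

The difference between the two is only whether joint absences (d) count as agreement.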
Types of clustering: hierarchical vs partitional
Hierarchical Clustering: a set of nested clusters organized as a hierarchical tree –> we will get a dendrogram and a cluster id by cutting the dendrogram
Partitional Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset –> we will only get a cluster id
Steps in hierarchical clustering
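A minimal agglomerative (bottom-up) sketch of those steps in Python, using single linkage and a precomputed toy distance matrix:

```python
# Agglomerative clustering sketch (single linkage):
# 1. start with every point as its own cluster
# 2. repeatedly merge the two closest clusters
# 3. record each merge and its height -> this sequence is the dendrogram
def single_link(c1, c2, dist):
    return min(dist[i][j] for i in c1 for j in c2)

def agglomerate(dist):
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], dist),
        )
        h = single_link(clusters[i], clusters[j], dist)
        merges.append((clusters[i], clusters[j], h))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# toy distance matrix for 4 points
d = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
for left, right, height in agglomerate(d):
    print(left, right, height)
```

In R the same idea is what `hclust` does on a `dist` object; cutting the recorded merge heights corresponds to `cutree`.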
Hard vs. soft (fuzzy) clustering
• Hard clustering algorithms:
- assign each pattern to a single cluster during operation and output
- hclust, diana, kmeans
• Fuzzy clustering algorithms:
- assign degrees of membership in several groups
- fanny
- fanny membership sub-object: soft clustering results
- fanny clustering sub-object: hard clustering results
K-Means
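The k-means idea, alternating between assigning points to the nearest centroid and recomputing centroids as group means, can be sketched in plain Python (toy 1-D data; real implementations such as R's `kmeans` use random or smarter initialisation):

```python
def kmeans(points, k, iters=10):
    # naive initialisation: first k points as centroids (for illustration)
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            groups[idx].append(p)
        # update step: centroid = mean of its assigned points
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids, groups = kmeans(pts, 2)
print(sorted(centroids))  # two centroids near 1.0 and 9.0
```

Being partitional, it returns only a cluster id per point, with no dendrogram.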
there are three ways to define the distance to a cluster (or within a cluster) - which ones?
average linkage, single linkage, complete linkage
average linkage:
- use the average of all pairwise distances between the two clusters
–> average linkage to merge closest rows
complete linkage:
- use the distance to the farthest point of the other cluster (the largest pairwise distance)
- problem: large clusters rarely take in new members
single linkage:
- use the smallest pairwise distance value
–> single linkage to merge closest rows
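The three linkage rules differ only in how they aggregate the pairwise distances between two clusters; a small Python sketch:

```python
# Distance between two clusters under the three linkage rules,
# given a point-to-point (Euclidean) distance.
def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single(c1, c2):    # smallest pairwise distance
    return min(euclid(p, q) for p in c1 for q in c2)

def complete(c1, c2):  # largest pairwise distance
    return max(euclid(p, q) for p in c1 for q in c2)

def average(c1, c2):   # mean over all pairs
    return sum(euclid(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
print(single(a, b), complete(a, b), average(a, b))  # 3.0 6.0 4.5
```

Because complete linkage takes the maximum, a big cluster tends to be far from everything, which is exactly the "rarely takes in new members" problem noted above.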
PCA (general)
• Mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables (PC’s).
• The first PC accounts for as much of the variability in the data as possible, each succeeding component accounts for as much of the remaining variability as possible.
• PCA is performed on a covariance or a correlation matrix of your data.
• Use the correlation matrix if the variances of your variables differ largely (scale=TRUE).
• Principal components (PCs) are linear combinations of the original variables, weighted by their contribution to explaining the variance; the PCs form orthogonal dimensions.
PCA’s usage
understanding PCA geometrically and via the covariance matrix/eigenvector
• Geometrically
- Rotation of space to maximize variance for fewer coordinates
• Covariance matrix and eigenvector
- Eigenvector with largest eigenvalue is first principal component (PC)
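The covariance-matrix/eigenvector view can be traced in plain Python for 2-D data, using the closed-form eigendecomposition of a symmetric 2x2 matrix (toy data, for illustration only):

```python
import math

# 2-D toy data, strongly correlated
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# eigenvalues of a symmetric 2x2 matrix (closed form)
tr, det = sxx + syy, sxx * syy - sxy ** 2
l1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)   # largest eigenvalue
l2 = tr / 2 - math.sqrt(tr ** 2 / 4 - det)

# eigenvector for l1 = direction of the first PC
v = (sxy, l1 - sxx)
norm = math.hypot(*v)
pc1 = (v[0] / norm, v[1] / norm)

print(round(l1 / (l1 + l2), 3))  # share of variance explained by PC1
```

The eigenvalue ratio is what a scree plot shows; in R, `prcomp(cbind(xs, ys))` computes the same decomposition for any dimension.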
simple linear regression
y = ax + b
(dependent variable = regression coefficient × independent variable + intercept)
multiple linear regression
• multiple regression (multiple predictor variables P, Q, R) but one outcome
multiple coefficients:
Y = a + bP + cQ + dR
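A sketch of fitting such a model by the normal equations (X^T X) beta = (X^T Y) in pure Python, here with two predictors P and Q and noise-free toy data so the true coefficients are recovered exactly:

```python
# Solve the linear system A x = b by Gauss-Jordan elimination
# (keeps the example dependency-free).
def solve(A, b):
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Fit Y = a + bP + cQ via the normal equations (X^T X) beta = X^T Y.
def fit(rows, y):
    X = [[1.0] + list(r) for r in rows]       # leading 1 -> intercept a
    p = len(X[0])
    XtX = [[sum(xi[i] * xi[j] for xi in X) for j in range(p)] for i in range(p)]
    Xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# data generated from Y = 1 + 2P + 3Q (no noise)
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * p + 3 * q for p, q in rows]
print([round(c, 6) for c in fit(rows, ys)])  # [1.0, 2.0, 3.0]
```

In R the same fit is `lm(Y ~ P + Q)`; with a third predictor R it extends directly to Y = a + bP + cQ + dR.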