what is data mining?
It's focused on better understanding the characteristics and patterns among variables in large databases, using a variety of analytical and statistical tools.
what is classification?
It's a data mining approach that analyses data to predict how to classify a new data element, e.g. spam filtering in email.
what is cluster analysis?
Also known as data segmentation, it is a collection of techniques that seek to group or segment a collection of observations into subsets whose members are highly similar to one another.
- Cluster analysis is a data reduction technique: it can take a large number of observations, such as survey responses, and reduce the information into smaller, homogeneous groups
what is data exploration and reduction?
It often involves identifying groups whose elements are similar in some way, and is often used to understand differences among people.
- This technique often breaks down large data sets into smaller samples that provide clearer insights
what is dirty data?
data sets that have missing values or errors are ‘dirty’ and need to be ‘cleaned’ prior to analysing them
how can dirty data be cleaned?
what is an outlier?
what is hierarchical clustering?
The data isn’t partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object.
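A minimal sketch of one direction of this series of partitions: starting from n single-object clusters and merging the two closest clusters at each step until one cluster holds everything. It assumes Euclidean distance and single linkage (distance between clusters = distance between their closest members), which the notes do not specify; the function name `agglomerate` and the sample points are hypothetical.

```python
import math


def agglomerate(points):
    """Return the list of partitions, from n singleton clusters down to one cluster."""
    clusters = [[p] for p in points]
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair and record the new partition.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        partitions.append([list(c) for c in clusters])
    return partitions


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
partitions = agglomerate(points)
for part in partitions:
    print(len(part), part)
```

Each printed line is one partition in the series; cutting the series at a chosen number of clusters gives the final grouping.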
what is k-means clustering?
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
- used in marketing and healthcare
what are the elements of effective segmentation?
what is the process of k-means clustering?
k-means Clustering
Simple algorithm:
1. Randomly assign each observation to one of the k clusters.
2. Iterate until the cluster assignments stop changing:
   i. For each of the k clusters, compute the centroid.
   ii. Reassign each observation to the closest centroid.
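The steps above can be sketched in plain Python. This is an illustrative version, not a production implementation: the function name `k_means`, the `seed` and `max_iter` parameters, the sample points, and the handling of empty clusters (re-seeding with a random observation) are all assumptions that are not in the notes.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of numbers) into k groups; returns one label per point."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Step 2i: compute the centroid (mean point) of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded with a random observation.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Step 2ii: reassign each observation to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels = k_means(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which observations share a label.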
what is the euclidean distance?
The most commonly used measure of distance between objects is Euclidean distance. This is an extension of the way in which the distance between two points on a plane is computed as the hypotenuse of a right triangle.
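As a small illustration, the hypotenuse idea extends to any number of variables by summing the squared differences and taking the square root. The function name `euclidean_distance` and the sample points are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two points on a plane: legs of 3 and 4 give a hypotenuse of 5.
print(euclidean_distance((0, 0), (3, 4)))        # 5.0
# The same formula with three variables.
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
```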
what is classification?
these methods seek to classify a categorical outcome into one of two or more categories based on various data attributes
before building a model how can we partition data?
Into a training data set and a validation data set.
what is a training data set?
It has known outcomes and is used to teach a data mining algorithm.
what is a validation data set?
This data set is often used to fine-tune models.
- Some data miners additionally use a test data set to assess performance
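A rough sketch of how such a partition might be done: shuffle the rows, then slice them into training, validation, and test sets. The 60/20/20 split, the `seed`, and the function name `partition` are assumptions chosen for illustration; the notes do not prescribe proportions.

```python
import random


def partition(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


rows = list(range(10))
train, valid, test = partition(rows)
print(len(train), len(valid), len(test))
```

Shuffling first avoids any ordering in the original data leaking into one partition.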
what are the three different data-mining approaches used for classification?
K-nearest neighbours, discriminant analysis and logistic regression.
what is the k-nearest neighbours (K-NN) algorithm?
how do you work out the K-nearest neighbour?
what is discriminant analysis?
A technique for classifying a set of observations into predefined classes.
what is logistic regression?
Logistic regression is a variation of ordinary regression in which the dependent variable is categorical. The independent variables may be continuous or categorical, as in the case of ordinary linear regression. However, whereas multiple linear regression seeks to predict the numerical value of the dependent variable Y based on the values of the independent variables, logistic regression seeks to predict the probability that the output variable will fall into a category based on the values of the independent (predictor) variables. This probability is used to classify an observation into a category.
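To make the last two sentences concrete, here is a minimal sketch of the prediction step for a two-category outcome. It assumes the coefficients have already been estimated; the intercept, coefficients, 0.5 cutoff, and function names are hypothetical values chosen for illustration, and fitting the model is not shown.

```python
import math


def predict_probability(x, intercept, coefs):
    """Logistic function applied to a linear combination of the predictors."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))


def classify(x, intercept, coefs, cutoff=0.5):
    """Assign category 1 if the predicted probability reaches the cutoff, else 0."""
    return 1 if predict_probability(x, intercept, coefs) >= cutoff else 0


# Hypothetical fitted model with two predictors.
intercept, coefs = -1.0, [0.5, 0.25]
print(predict_probability([3.0, 1.0], intercept, coefs))
print(classify([3.0, 1.0], intercept, coefs))
print(classify([0.0, 0.0], intercept, coefs))
```

Changing the cutoff trades off the two kinds of misclassification; 0.5 is only a common default.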
what is association rule mining?
Association rule mining, often called affinity analysis, seeks to uncover interesting associations and/or correlation relationships among large sets of data. Association rules identify attributes that occur frequently together in a given data set. A typical and widely used example of association rule mining is market basket analysis.
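A small market basket sketch, using two standard measures that the notes do not name: support (the share of transactions containing a set of items) and confidence (of the transactions containing the antecedent, the share that also contain the consequent). The baskets and function names are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # rule: bread -> milk
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.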