Week 5 Flashcards

Question 1

Q

what is data mining?

Answer

A

It focuses on better understanding the characteristics of, and patterns among, variables in large databases, using a variety of analytical and statistical tools.

Question 2

Q

what is classification?

Answer

A

An approach to data mining that analyses data to predict how to classify a new data element, e.g. spam filtering in email.

Question 3

Q

what is cluster analysis?

Answer

A

Also known as data segmentation: a collection of techniques that seek to group or segment a collection of observations into subsets whose members are highly similar to one another.
- cluster analysis is a data reduction technique: it can take a large number of observations, such as survey responses, and reduce the information into a smaller number of homogeneous groups

Question 4

Q

what is data exploration and reduction?

Answer

A

It often involves identifying groups whose elements are similar in some way, and is often used to understand differences among people.
- these techniques often break large data sets down into smaller subsets that provide clearer insights

Question 5

Q

what is dirty data?

Answer

A

Data sets that have missing values or errors are ‘dirty’ and need to be ‘cleaned’ before they are analysed.

Question 6

Q

how can dirty data be cleaned?

Answer

A

eliminate the records that have missing data
estimate reasonable values for missing observations, e.g. the mean
identify errors by looking at outliers
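
A minimal Python sketch of the first two cleaning strategies above. The survey records, the field names, and the use of `None` to mark a missing value are all assumptions made for illustration.

```python
from statistics import mean

def drop_missing(records):
    """Strategy 1: keep only the records that have no missing (None) values."""
    return [r for r in records if None not in r.values()]

def impute_mean(records, field):
    """Strategy 2: replace missing values in `field` with the mean of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

# Hypothetical survey records; None marks a missing value.
survey = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": 58000},
]
cleaned_drop = drop_missing(survey)        # two complete records remain
cleaned_fill = impute_mean(survey, "age")  # missing age filled with the mean of 30 and 40
```

Dropping records is simpler but loses data; filling in the mean keeps every record at the cost of an estimated value.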

Question 7

Q

what is an outlier?

Answer

A

An observation that differs markedly from the rest of the data
Can arise for a variety of reasons, e.g., incorrect recording of an observation
Can make a significant difference to the statistical analysis and results
Should not be blindly eliminated, as it might indicate a deficiency in the model

Question 8

Q

what is hierarchical clustering?

Answer

A

The data are not partitioned into a particular set of clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object.
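
A rough Python sketch of the "series of partitions" idea. It assumes one-dimensional data, runs in the merging direction (from n single-object clusters up to one cluster), and joins the two closest clusters at each step (single linkage). None of these choices come from the card itself.

```python
def agglomerate(values):
    """Return every partition produced while merging 1-D values into one cluster."""
    # Start with n clusters, each containing a single object.
    clusters = [[v] for v in sorted(values)]
    history = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        # On sorted 1-D data the closest pair of clusters is always adjacent,
        # so find the adjacent pair with the smallest gap between them.
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1][0] - clusters[j][-1])
        # Merge that pair into a single cluster.
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
        history.append([c[:] for c in clusters])
    return history

history = agglomerate([1, 2, 6, 7, 15])
for partition in history:
    print(partition)
```

Each printed line is one partition in the series; the analyst chooses the level that gives a useful number of clusters.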

Question 9

Q

what is k-means clustering?

Answer

A

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
- used in marketing and healthcare

Question 10

Q

what are the elements of effective segmentation?

Answer

A

Measurable: able to quantify the size of the segment
Substantial: large enough to warrant separate treatment
Differentiable: exclusive; each segment reacts differently
Actionable: it should be possible to develop actions (e.g., sales and marketing) for different segments

Question 11

Q

what is the process of k-means clustering?

Answer

A

k-means clustering
Simple algorithm:
1. Randomly assign each observation to one of the k clusters
2. Iterate until the cluster assignments stop changing:
   i. For each of the k clusters, compute the centroid
   ii. Reassign each observation to the closest centroid
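
The steps above can be sketched in plain Python as follows. The `seed` and `max_iter` parameters, the handling of an empty cluster, and the example points are assumptions added for illustration, not part of the card.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2(i): compute the centroid (mean point) of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: re-seed an empty cluster with a random observation.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
        # Step 2(ii): reassign each observation to the closest centroid
        # (squared Euclidean distance gives the same ordering as the distance itself).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # cluster assignments stopped changing
        labels = new_labels
    return labels, centroids

# Hypothetical data: two well-separated groups of points.
points = [(0.8, 0.8), (1.0, 1.0), (1.2, 1.2), (8.7, 8.7), (9.0, 9.0), (9.3, 9.3)]
labels, centroids = kmeans(points, k=2)
```

Because the starting assignment is random, which group gets label 0 and which gets label 1 can differ between seeds; only the grouping itself is meaningful.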

Question 12

Q

what is the euclidean distance?

Answer

A

The most commonly used measure of distance between objects is Euclidean distance. It extends the way the distance between two points on a plane is computed, as the hypotenuse of a right triangle, to any number of dimensions.
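
A short Python sketch of the calculation: square the difference in each dimension, add the squares, and take the square root. The function name and the example points are illustrative only.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# On a plane: the hypotenuse of a right triangle with sides 3 and 4.
d_plane = euclidean((0, 0), (3, 4))

# The same formula with four attributes per object.
d_4d = euclidean((1, 2, 3, 4), (1, 2, 3, 6))
```

Here `d_plane` is 5.0 and `d_4d` is 2.0.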

Question 13

Q

what is classification?

Answer

A

These methods seek to classify an observation into one of two or more categories of a categorical outcome, based on various data attributes.

Question 14

Q

before building a model how can we partition data?

Answer

A

Into a training data set and a validation data set.

Question 15

Q

what is a training data set?

Answer

A

Training data sets have known outcomes and are used to teach a data mining algorithm.

Question 16

Q

what is a validation data set?

Answer

A

This data set is often used to fine-tune models.
- some data miners additionally use a test data set to assess performance
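
One possible way to partition records into training, validation, and test sets, sketched in Python. The 60/20/20 split, the `seed` parameter, and the function name are assumptions chosen for illustration.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records, then slice them into training, validation and test sets."""
    shuffled = records[:]                  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever is left over
    return train, valid, test

# Ten hypothetical record IDs.
train, valid, test = partition(list(range(10)))
```

Shuffling before slicing keeps any ordering in the original file from ending up in only one of the sets.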

Question 17

Q

what are the three different data-mining approaches used for classification?

Answer

A

k-nearest neighbours, discriminant analysis and logistic regression

Question 18

Q

what is the k-nearest neighbours (K-NN) algorithm?

Answer

A

A classification scheme that attempts to find records in a database that are similar to the one we wish to classify. Similarity is based on the “closeness” of a record to the numerical predictors in the other records.
e.g. in the Credit Approval Decisions database, we have the predictors Homeowner, Credit Score, Years of Credit History, Revolving Balance, and Revolving Utilization. We seek to classify the decision to approve or reject the credit application.

Question 19

Q

how do you work out the K-nearest neighbour?

Answer

A

For a new data point you want to classify, find its K nearest neighbours
Distance is measured using the Euclidean distance
The category held by the majority of the neighbours is assigned to the new data point
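
The three steps above, sketched in Python. The applicant records (credit score, years of credit history) and their decisions are made up for illustration. In practice the predictors would usually be rescaled first so that one large-valued attribute does not dominate the distance; that step is left out here.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    """train: list of (features, category) pairs with known outcomes."""
    # Rank the known records by Euclidean distance to the new point; keep the k nearest.
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_point))[:k]
    # The majority category among those neighbours becomes the prediction.
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical applicants: (credit score, years of credit history) -> decision.
known = [
    ((720, 10), "approve"),
    ((700, 8), "approve"),
    ((690, 12), "approve"),
    ((580, 2), "reject"),
    ((600, 3), "reject"),
]
decision = knn_classify(known, (710, 9), k=3)
```

With these made-up records, the three nearest neighbours of (710, 9) are all approvals, so `decision` is "approve".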

Question 20

Q

what is discriminant analysis?

Answer

A

A technique for classifying a set of observations into predefined classes.

Question 21

Q

what is logistic regression?

Answer

A

Logistic regression is a variation of ordinary regression in which the dependent variable is categorical. The independent variables may be continuous or categorical, as in the case of ordinary linear regression. However, whereas multiple linear regression seeks to predict the numerical value of the dependent variable Y based on the values of the independent variables, logistic regression seeks to predict the probability that the output variable will fall into a category based on the values of the independent (predictor) variables. This probability is used to classify an observation into a category.
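
A small Python sketch of how a fitted logistic regression model turns predictor values into a probability and then into a category, for a two-category outcome. The coefficients are hand-picked for illustration, not estimated from data, and the 0.5 cutoff is an assumption.

```python
import math

def predict_probability(x, intercept, coefs):
    """Probability that the outcome falls in category 1, given predictor values x."""
    # Linear combination of the predictors, as in ordinary regression...
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    # ...passed through the logistic function so the result lies between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, intercept, coefs, cutoff=0.5):
    """Assign category 1 if the predicted probability reaches the cutoff, else 0."""
    return 1 if predict_probability(x, intercept, coefs) >= cutoff else 0

# Hypothetical, hand-picked coefficients (not fitted to any data).
intercept, coefs = -4.0, [0.5, 1.0]
p = predict_probability([2.0, 3.0], intercept, coefs)  # z = -4 + 1 + 3 = 0
label = classify([2.0, 3.0], intercept, coefs)
```

A linear combination of zero gives a probability of exactly 0.5, which sits on the cutoff and is classified as category 1 here.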

Question 22

Q

what is association rule mining?

Answer

A

Association rule mining, often called affinity analysis, seeks to uncover interesting associations and/or correlation relationships among large sets of data. Association rules identify attributes that occur frequently together in a given data set. A typical and widely used example of association rule mining is market basket analysis.
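
A minimal Python sketch of the market basket idea: count how often each pair of items appears in the same basket and keep the pairs that occur together frequently. The baskets and the `min_count` threshold are made up for illustration, and a full association rule algorithm does more than count pairs.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_count=2):
    """Return the item pairs that appear together in at least `min_count` baskets."""
    counts = Counter()
    for basket in baskets:
        # Count each unordered pair once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "bread"],
    ["butter", "milk"],
]
pairs = frequent_pairs(baskets)
```

With these baskets, bread and milk appear together twice, as do butter and milk, so those two pairs are returned.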

(22 cards)