data mining Flashcards

Question 1

Q

t/f: data mining has one definition

Answer

A

false, has many

Question 2

Q

data mining

Answer

A

non-trivial extraction of implicit, previously unknown, and potentially useful information from data
exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Question 3

Q

origins of data mining

Answer

A

ideas come from many disciplines including machine learning/AI, pattern recognition, statistics, and database systems

Question 4

Q

compared to data mining, traditional techniques may be unsuitable due to

Answer

A

enormity of data
high dimensionality of the data
heterogeneous, distributed nature of data

Question 5

Q

types of data mining algorithms

Answer

A

supervised algorithms (classification) and unsupervised algorithms (clustering)

Question 6

Q

supervised algorithms

Answer

A

learning by example
use training data which has correct answers (class label attribute)
create a model by running the algorithm on the training data
identify a class label for the incoming new data
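
The steps above can be sketched in a few lines. This is a minimal illustration, not a real classifier: the `train` and `predict` helpers, the single categorical attribute, and the weather data are all hypothetical.

```python
from collections import Counter, defaultdict

def train(training_data):
    """Build a model: for each attribute value, remember its most common class label."""
    counts = defaultdict(Counter)
    for attribute_value, class_label in training_data:
        counts[attribute_value][class_label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def predict(model, attribute_value, default=None):
    """Identify a class label for a new record (default if the value was never seen)."""
    return model.get(attribute_value, default)

# Hypothetical training data: (attribute value, correct class label)
training_data = [
    ("sunny", "play"), ("sunny", "play"),
    ("rainy", "stay"), ("rainy", "stay"), ("rainy", "play"),
]

model = train(training_data)
print(predict(model, "rainy"))
```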

Question 7

Q

unsupervised algorithms

Answer

A

do not use labeled training data
classes may not be known in advance

Question 8

Q

approaches for supervised (classification)

Answer

A

decision trees
regression
neural networks
support vector machines
K-Nearest Neighbor approach
Bayesian Classification

Question 9

Q

classification: description

Answer

A

given a collection of records, each record contains a set of attributes; one of the attributes is the class (the dependent variable).
find a model to predict the class attribute as a function of the values of the other attributes

Question 10

Q

classification goal

Answer

A

previously unseen records should be assigned to a class as accurately as possible

Question 11

Q

classification: a ____ is used to determine the accuracy of the model.

Answer

A

test set
- usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
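
A rough sketch of the training/test split described above. The 70/30 split, the toy threshold model, and the helper names are assumptions made for illustration only.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def build_model(training_set):
    """Toy model: threshold halfway between the mean attribute value of each class."""
    low = [x for x, label in training_set if label == "low"]
    high = [x for x, label in training_set if label == "high"]
    threshold = (sum(low) / len(low) + sum(high) / len(high)) / 2
    return lambda x: "high" if x >= threshold else "low"

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)

# Hypothetical data set: (attribute value, class label)
records = [(x, "high" if x >= 5 else "low") for x in range(10)]

training_set, test_set = train_test_split(records)
model = build_model(training_set)   # built from the training set only
acc = accuracy(model, test_set)     # validated on records the model has not seen
print(acc)
```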

Question 12

Q

classification approach: k-Nearest Neighbor

Answer

A

look at characteristics/attributes
“if it walks like a duck and quacks like a duck, then it’s probably a duck”

Question 13

Q

nearest neighbor requires 3 things

Answer

A

the set of stored records
distance metric to compute the distance between records
the value of k, the number of nearest neighbors to retrieve

Question 14

Q

to classify an unknown record

Answer

A

compute its distance to every training record
identify k nearest neighbors
use class labels of nearest neighbors to determine the class label of unknown record
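
The three steps above, sketched in Python. Two choices here are assumptions, not part of the card: Euclidean distance as the metric and a majority vote to combine the neighbors' labels. The data is made up.

```python
import math
from collections import Counter

def knn_classify(training_records, unknown, k=3):
    """Classify `unknown` from the labels of its k nearest training records."""
    # 1. compute the distance to every training record and sort by it
    by_distance = sorted(training_records, key=lambda rec: math.dist(rec[0], unknown))
    # 2. identify the k nearest neighbors
    nearest_labels = [label for _, label in by_distance[:k]]
    # 3. use their class labels (majority vote, an assumed rule)
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical stored records: (attributes, class label)
training_records = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify(training_records, (1.2, 1.4), k=3))
```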

Question 15

Q

choosing the value of k

Answer

A

if k is too small, model is sensitive to noise.
if k is too large, neighborhood may include too many points from other classes
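
A small made-up example of both failure modes, assuming Euclidean distance and majority voting. The query point sits inside class A, next to one mislabeled (noisy) record.

```python
import math
from collections import Counter

def knn_classify(training_records, unknown, k):
    """Majority vote over the k training records closest to `unknown`."""
    by_distance = sorted(training_records, key=lambda rec: math.dist(rec[0], unknown))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical data: four "A" records, one noisy "B" record inside the A region,
# and a larger, far-away group of "B" records.
records = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"), ((2.0, 2.0), "A"),
    ((1.5, 1.6), "B"),  # noise
    ((8.0, 8.0), "B"), ((8.0, 9.0), "B"), ((9.0, 8.0), "B"),
    ((9.0, 9.0), "B"), ((8.5, 8.5), "B"), ((9.0, 8.5), "B"),
]
query = (1.5, 1.5)

too_small = knn_classify(records, query, k=1)    # decided by the single noisy record
moderate = knn_classify(records, query, k=3)     # noise is outvoted by real neighbors
too_large = knn_classify(records, query, k=11)   # neighborhood swallows the other class
print(too_small, moderate, too_large)
```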

Question 16

Q

unsupervised algorithm: clustering

Answer

A

given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
- data points in one cluster are more similar to one another
- data points in separate clusters are less similar to one another
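
One way to check the two conditions above on a given clustering is to compare average distances within clusters and between clusters. This is only a sketch: the helper names and data are hypothetical, and Euclidean distance is assumed as the (dis)similarity measure.

```python
import math
from itertools import combinations, product

def mean_within_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    distances = [math.dist(p, q)
                 for cluster in clusters
                 for p, q in combinations(cluster, 2)]
    return sum(distances) / len(distances)

def mean_between_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    distances = [math.dist(p, q)
                 for c1, c2 in combinations(clusters, 2)
                 for p, q in product(c1, c2)]
    return sum(distances) / len(distances)

# Hypothetical clustering of six 2-D points into two clusters
clusters = [
    [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0)],
]

within = mean_within_distance(clusters)
between = mean_between_distance(clusters)
print(within < between)  # a good clustering keeps within-cluster distances smaller
```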

Question 17

Q

clustering similarity measures

Answer

A

euclidean distance (if attributes are continuous)
other problem-specific measures
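
Euclidean distance written out for continuous attributes. The second function is an assumed example of a problem-specific measure (a mismatch count for categorical attributes); the card does not name one.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mismatch_distance(a, b):
    """Assumed example for categorical attributes: number of attributes that differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
print(mismatch_distance(("red", "small"), ("red", "large")))
```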

Question 18

Q

examples of classification

Answer

A

direct marketing: targeting a set of consumers likely to buy new product (use data for similar product introduced before)
fraud detection: label past transactions as fair or fraud to form class attribute and train the model

Question 19

Q

examples of clustering

Answer

A

market segmentation: collect different attributes of customers, find clusters of similar customers (measure cluster quality by comparing the buying patterns of customers in the same cluster with those of customers in different clusters)
document clustering: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of different terms, and use it to cluster.
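
A sketch of a term-frequency similarity for the document clustering example. Cosine similarity and whitespace tokenization are assumptions; the card only asks for some measure based on term frequencies. The documents are made up.

```python
import math
from collections import Counter

def term_frequencies(document):
    """Count how often each (lowercased, whitespace-separated) term occurs."""
    return Counter(document.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
doc1 = term_frequencies("data mining finds patterns in data")
doc2 = term_frequencies("mining data for useful patterns")
doc3 = term_frequencies("the cat sat on the mat")

sim_related = cosine_similarity(doc1, doc2)
sim_unrelated = cosine_similarity(doc1, doc3)
print(sim_related > sim_unrelated)
```

A clustering algorithm would then group documents whose pairwise similarity is high.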

(19 cards)