data mining Flashcards

(19 cards)

1
Q

t/f: data mining has one definition

A

false, has many

2
Q

data mining

A
  • non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  • exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
3
Q

origins of data mining

A

ideas come from many disciplines including machine learning/AI, pattern recognition, statistics, and database systems

4
Q

compared to data mining, traditional techniques may be unsuitable due to

A
  • enormity of data
  • high dimensionality of the data
  • heterogeneous, distributed nature of data
5
Q

types of data mining algorithms

A

supervised algorithms (classification) and unsupervised algorithms (clustering)

6
Q

supervised algorithms

A
  • learning by example
  • use training data which has correct answers (class label attribute)
  • create a model by running the algorithm on the training data
  • identify a class label for the incoming new data
7
Q

unsupervised algorithms

A
  • do not use labeled training data (no class label attribute)
  • classes may not be known in advance
8
Q

approaches for supervised (classification)

A
  • decision trees
  • regression (e.g., logistic regression)
  • neural networks
  • support vector machines
  • K-Nearest Neighbor approach
  • Bayesian Classification
9
Q

classification: description

A
  • given a collection of records, each record contains a set of attributes; one of the attributes is the dependent variable (the class)
  • find a model to predict the class attribute as a function of the values of the other attributes
10
Q

classification goal

A

previously unseen records should be assigned to a class as accurately as possible

11
Q

classification: a ____ is used to determine the accuracy of the model.

A

test set
- usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
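
The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the 70/30 split, the toy records, and the helper names (`train_test_split`, `fit_threshold`, `accuracy`) are all assumptions made for the example.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_set):
    """Hypothetical toy model: threshold halfway between the two class means."""
    highs = [attrs[0] for attrs, label in train_set if label == "high"]
    lows = [attrs[0] for attrs, label in train_set if label == "low"]
    threshold = (sum(highs) / len(highs) + sum(lows) / len(lows)) / 2
    return lambda attrs: "high" if attrs[0] >= threshold else "low"

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true label."""
    correct = sum(1 for attrs, label in test_set if model(attrs) == label)
    return correct / len(test_set)

# Toy data set: (attributes, class label); the label is "high" when x >= 50.
records = [((x,), "high" if x >= 50 else "low") for x in range(100)]

train, test = train_test_split(records)   # training set builds the model
model = fit_threshold(train)
print(accuracy(model, test))               # test set measures its accuracy
```

The model never sees the test records while it is being built, so the reported accuracy reflects how it handles previously unseen records.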

12
Q

classification approach: k-Nearest Neighbor

A

look at characteristics/attributes
“if it walks like a duck and quacks like a duck, then it’s probably a duck”

13
Q

nearest neighbor requires 3 things

A
  • the set of stored records
  • distance metric to compute the distance between records
  • the value of k, the number of nearest neighbors to retrieve
14
Q

to classify an unknown record

A
  • compute distance to other training records
  • identify k nearest neighbors
  • use class labels of nearest neighbors to determine the class label of unknown record
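
The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: Euclidean distance is used as the distance metric, the class label is chosen by majority vote (the card only says the neighbors' labels are "used"), and the toy records and the function name `knn_classify` are made up for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length attribute tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_records, unknown, k=3, distance=euclidean):
    """Classify `unknown` from the labels of its k nearest training records."""
    # 1. compute the distance from the unknown record to every training record
    distances = [(distance(attrs, unknown), label)
                 for attrs, label in training_records]
    # 2. identify the k nearest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. use their class labels (assumption: simple majority vote)
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy stored records: (attributes, class label)
training = [
    ((1.0, 1.0), "duck"),
    ((1.2, 0.8), "duck"),
    ((0.9, 1.1), "duck"),
    ((5.0, 5.0), "goose"),
    ((5.2, 4.9), "goose"),
    ((4.8, 5.1), "goose"),
]

print(knn_classify(training, (1.1, 1.0), k=3))  # duck
```

Other voting schemes, such as weighting each neighbor by the inverse of its distance, fit the same three-step outline.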
15
Q

choosing the value of k

A

if k is too small, model is sensitive to noise.
if k is too large, neighborhood may include too many points from other classes
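
A small made-up example can show both failure modes. The data, the mislabeled record, and the majority-vote rule are assumptions chosen for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(training_records, unknown, k):
    """Majority vote among the k training records closest to `unknown`."""
    nearest = sorted(training_records,
                     key=lambda rec: math.dist(rec[0], unknown))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [
    ((1.0,), "A"), ((1.2,), "A"), ((1.4,), "A"),
    ((1.1,), "B"),  # a noisy (mislabeled) record sitting inside class A
    ((5.0,), "B"), ((5.2,), "B"), ((5.4,), "B"), ((5.6,), "B"),
]
query = (1.12,)  # clearly in the class A region

for k in (1, 3, 8):
    print(k, knn_classify(training, query, k))
# 1 B   (too small: the single noisy neighbor decides)
# 3 A   (the two genuine class A neighbors outvote the noise)
# 8 B   (too large: distant class B records dominate the vote)
```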

16
Q

unsupervised algorithm: clustering

A

given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
- data points in one cluster are more similar to one another
- data points in separate clusters are less similar to one another

17
Q

clustering similarity measures

A
  • euclidean distance (if attributes are continuous)
  • other problem-specific measures
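
As a rough sketch, Euclidean distance on continuous attributes can be computed as below, and the same function can check the clustering criterion (points in one cluster are closer to each other than to points in other clusters). The two toy clusters and the helper names are assumptions made for the example.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_intra_cluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    pairs = [(p, q) for cluster in clusters
             for p, q in combinations(cluster, 2)]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

def mean_inter_cluster_distance(clusters):
    """Average distance over all pairs of points from different clusters."""
    pairs = [(p, q) for c1, c2 in combinations(clusters, 2)
             for p in c1 for q in c2]
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)

clusters = [
    [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9)],
    [(8.0, 8.0), (8.3, 7.9), (7.7, 8.4)],
]

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(mean_intra_cluster_distance(clusters)
      < mean_inter_cluster_distance(clusters))  # True
```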
18
Q

examples of classification

A
  • direct marketing: targeting a set of consumers likely to buy new product (use data for similar product introduced before)
  • fraud detection: label past transactions as fair or fraud to form class attribute and train the model
19
Q

examples of clustering

A
  • market segmentation: collect different attributes of customers, find clusters of similar customers (measure cluster quality by observing buying patterns of customers in the same cluster vs. those from different clusters)
  • document clustering: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of different terms, and use it to cluster.
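
The document-clustering recipe can be sketched as term counting followed by a frequency-based similarity. Cosine similarity is an assumption here (the card does not name a specific measure), as are the whitespace tokenizer and the three toy documents.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (assumed measure)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[term] * tf_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = {
    "d1": "stock market prices fall as market fears grow",
    "d2": "market prices rise and stock traders cheer",
    "d3": "team wins the football match in extra time",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

# d1 and d2 share terms (stock, market, prices); d1 and d3 share none.
print(cosine_similarity(tf["d1"], tf["d2"])
      > cosine_similarity(tf["d1"], tf["d3"]))  # True
```

A clustering algorithm would then group documents whose pairwise similarity is high, so d1 and d2 would be expected to land in the same cluster.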