Data Mining - Chapter 2 Flashcards by Joost Kok

What is classification?

Examining data and deciding into which class or category they fall.
-> Trying to predict a class.

What is prediction?

Trying to predict the value of a numerical variable.

-> The term can be used for both continuous and categorical data.

What are association rules?

Rules designed to find general association patterns between items in a large database. Generates rules general to an entire population.

What is collaborative filtering?

Making rules for an individual user, as opposed to the general public, based on that user's history as well as the history of others.

What is data reduction?

The process of consolidating a large number of records into a smaller set.

-> Often done by clustering.
-> Note: the argument that data mining algorithms often perform better when the number of variables is limited applies to dimension reduction, which reduces variables rather than records.

What is dimension reduction?

Reducing the number of variables (instead of the number of rows).

What is data visualization?

Data exploration through creating charts and dashboards.

What are supervised learning algorithms?

Algorithms that predict numerical values or classifications and that are trained using training, validation and testing data.
For the training data, the value of the outcome of interest is already known. Therefore, you can see how well the algorithm performs, tune it with validation data and compare it against other algorithms.

What are unsupervised learning algorithms?

Algorithms that use no outcome variable to predict or classify.
Examples: association rules, dimension reduction methods and clustering techniques.

What are the 10 steps of data mining?

1. Develop an understanding of the purpose of the data mining project.
2. Obtain the dataset to be used in the analysis.
3. Explore, clean, and preprocess the data.
4. Reduce the data dimension, if necessary.
5. Determine the data mining task.
6. Partition the data.
7. Choose the data mining techniques to be used.
8. Use algorithms to perform the task.
9. Interpret the results of the algorithms.
10. Deploy the model.

What is SEMMA?

A data mining methodology developed by the company SAS. It encompasses the previous 10 steps.

1. Sample: Take a sample and partition it into training/testing sets.
2. Explore: Examine the dataset statistically and graphically.
3. Modify: Transform variables and impute missing values.
4. Model: Fit predictive models.
5. Assess: Compare models using a validation dataset.

What is a slice of data?

A slice returns an object usually containing a portion of a sequence, such as a subset of rows and columns from a data frame.
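
As a minimal sketch of the idea, the snippet below slices a plain Python list; the list contents and variable names are made up for illustration. The same slice notation carries over to rows and columns of a data frame.

```python
# Hypothetical sequence used only for illustration.
values = [10, 20, 30, 40, 50, 60]

first_three = values[:3]        # items at positions 0, 1 and 2
every_other = values[::2]       # items at positions 0, 2 and 4
middle = slice(2, 5)            # a reusable slice object (start 2, stop 5)
middle_items = values[middle]   # items at positions 2, 3 and 4

print(first_three, every_other, middle_items)
```

Each slice returns a new list holding a portion of the original sequence; the original list is left unchanged.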

Which two techniques does pandas use to access rows in a data frame?

loc: More general; allows accessing rows (and columns) using labels.
iloc: Less general; only allows accessing them by integer position.
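
The difference can be sketched on a small data frame. The column names, row labels and values below are assumptions made for this example, not part of the flashcards.

```python
import pandas as pd

# Hypothetical data frame with string row labels.
df = pd.DataFrame(
    {"price": [250, 310, 190], "rooms": [3, 4, 2]},
    index=["a", "b", "c"],
)

# Label-based selection: the end label "b" is included.
by_label = df.loc["a":"b", ["price"]]

# Position-based selection: the end position 2 is excluded.
by_position = df.iloc[0:2, 0:1]

print(by_label.equals(by_position))
```

Both expressions select the same two rows and one column here; note that a label slice includes its end point while a position slice does not.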

What is oversampling?

Putting heavier weights in your sampling procedure to overweight the rare class relative to the majority class. Otherwise your model might not be able to identify that records belong to the rare class.
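
One simple way to do this, shown here only as a sketch, is to resample the rare class with replacement until it is as large as the majority class. The record layout, class names and 95/5 split are assumptions for the example; other weighting schemes are equally valid.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical records: 5 rare-class and 95 majority-class cases.
records = [{"id": i, "label": "rare" if i < 5 else "common"} for i in range(100)]

rare = [r for r in records if r["label"] == "rare"]
common = [r for r in records if r["label"] == "common"]

# Draw rare records with replacement until both classes are equally represented.
oversampled_rare = random.choices(rare, k=len(common))
training_sample = common + oversampled_rare
random.shuffle(training_sample)

rare_share = sum(r["label"] == "rare" for r in training_sample) / len(training_sample)
print(rare_share)
```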

Which types of variables are there?

Numerical (continuous, integer & date)
Text
Categorical (numerical/text)
- Nominal
- Ordinal

Which data mining technique cannot deal with continuous variables?

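Naive Bayes (in its standard form it works with categorical predictors, so continuous variables must first be binned into categories).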

Which data mining techniques cannot deal with categorical variables?

Trick question - Almost every technique can deal with them.

Note: Ordered variables can sometimes be coded numerically and treated as continuous variables.

How can you use nominal categorical variables?

They often cannot be used directly in data mining techniques. You can decompose them into dummy variables to be able to use them.
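
A minimal sketch of that decomposition with pandas follows. The `color` and `size` columns are hypothetical, and dropping the first category (to avoid a redundant dummy) is a choice made for this example.

```python
import pandas as pd

# Hypothetical data with one nominal variable ("color").
df = pd.DataFrame({"color": ["red", "green", "blue", "green"], "size": [1, 2, 3, 4]})

# One dummy column per category; drop_first=True omits the first
# category ("blue"), which is implied when all other dummies are 0.
dummies = pd.get_dummies(df, columns=["color"], drop_first=True)

print(list(dummies.columns))
```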

What is meant by explore, clean and preprocess the data?

Verifying that the data are in reasonable condition.

This includes deciding how to handle missing data and checking whether there are outliers.

What is meant by determining the data mining task?

Translating the general first question into a data mining specific question. Do you need to do classification, prediction, clustering etc.?

What is meant by partitioning the data?

Dividing it up into training, validation and testing sets in case you have a supervised task.
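
A random three-way split can be sketched as follows. The `partition` helper and the 50/30/20 proportions are assumptions for illustration; the flashcards do not prescribe particular fractions.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle a copy of the records and cut it into three disjoint sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains
    return train, valid, test

# Hypothetical dataset of 100 record ids.
train, valid, test = partition(list(range(100)))
print(len(train), len(valid), len(test))
```

Shuffling before cutting keeps each set a random sample, and every record ends up in exactly one set.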

What are the disadvantages of having too many variables in your model?

Your model becomes more complex and you need more records to assess relationships between variables.
There will be more data quality and availability issues.
More data cleaning and preprocessing is required.
There is a higher risk of overfitting.

Rule of thumb: 10 records per predictor variable

What is normalization of data?

Bringing all variables to the same scale. Sometimes needed to run your algorithm well.

-> Subtract the mean and divide by the standard deviation.
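
As a small worked sketch of this calculation, the snippet below normalizes a made-up list of values. Using the sample standard deviation is an assumption of the example; the population version works the same way.

```python
from statistics import mean, stdev

# Hypothetical values of one variable.
incomes = [30.0, 45.0, 60.0, 75.0, 90.0]

mu = mean(incomes)
sigma = stdev(incomes)  # sample standard deviation

# Subtract the mean and divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in incomes]

print(z_scores)
```

After normalization the variable has mean 0 and standard deviation 1, so variables measured in different units become comparable.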

What is overfitting?

The goal of a model is to make good predictions about any additional data on which you run your algorithm.

If you have a function that represents your sample too perfectly, it does not capture the ‘general’ relation between variables, just the relations that happen to hold in the sample. Therefore, it will not be able to predict future values well. This is overfitting.

-> Can be seen in a graph when the fitted function follows the actual data points too closely.

How can you prevent overfitting?

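Partition your data: train your model on one dataset and subsequently test it on a different dataset to analyze its performance. Note that this detects overfitting rather than strictly preventing it; poor performance on the second dataset is the signal to simplify the model.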
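
The comparison can be sketched by fitting a simple and a flexible model on one partition and measuring both on another. The synthetic data, the noise level and the polynomial degrees (1 and 6) are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear relation plus random noise.
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + rng.normal(0.0, 0.2, size=x.size)

# Even-indexed points for training, odd-indexed points for validation.
x_train, y_train = x[::2], y[::2]
x_valid, y_valid = x[1::2], y[1::2]

def rmse(coeffs, xs, ys):
    """Root mean squared error of a fitted polynomial on the given points."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, xs) - ys) ** 2)))

errors = {}
for degree in (1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (rmse(coeffs, x_train, y_train), rmse(coeffs, x_valid, y_valid))
    print(degree, errors[degree])
```

The more flexible polynomial always fits the training points at least as closely as the straight line. If its validation error is clearly worse than its training error, that gap is the sign of overfitting.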

What is CRISP-DM?

A methodology similar to SEMMA. It stands for CRoss Industry Standard Process for Data Mining.

What are the six steps of CRISP-DM?

1. Business understanding
2. Data understanding
3. Data preparation
(These first three steps take up 85% of the project time.)
4. Model building
5. Testing and evaluation
6. Deployment
(The last three steps are similar to SEMMA.)

What is underfitting of a model?

The model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).

(28 cards)