What is classification?
Examining data and deciding which class or category each record falls into. --> Trying to predict a class.
What is prediction?
Trying to predict the value of a numerical variable.
–> The term can be used for both continuous and categorical data.
What are association rules?
Rules designed to find general association patterns between items in a large database. The resulting rules apply to an entire population.
What is collaborative filtering?
Making rules for an individual user, as opposed to the general public, based on that user's history as well as the history of others.
What is data reduction?
The process of consolidating a large number of records into a smaller set.
What is dimension reduction?
Reducing the number of variables (instead of the number of rows).
What is data visualization?
Data exploration through creating charts and dashboards.
What are supervised learning algorithms?
Algorithms that predict numerical values or classifications and that are trained using training, validation and testing data.
For the training data, the value of the outcome of interest is already known. Therefore, you can see how well the algorithm performs, tune it with the validation data and compare it against other algorithms.
What are unsupervised learning algorithms?
Algorithms that use no outcome variable to predict or classify.
Examples: association rules, dimension reduction methods and clustering techniques.
What are the 10 steps of data mining?
What is SEMMA?
A data mining methodology developed by the company SAS. The name stands for Sample, Explore, Modify, Model, Assess. It encompasses the previous 10 steps.
What is a slice of data?
A slice returns an object usually containing a portion of a sequence, such as a subset of rows and columns from a data frame.
Which two techniques does pandas use to access rows in a data frame?
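These notes leave the answer blank; the two accessors usually meant are `loc` (label-based) and `iloc` (integer-position-based). A minimal sketch of both, which also illustrates the slicing from the previous card. The data frame, its index labels and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical data frame; labels and column names are illustrative only.
df = pd.DataFrame(
    {"age": [25, 32, 47, 51], "income": [30, 45, 80, 62]},
    index=["a", "b", "c", "d"],
)

# iloc selects by integer position; the end of the slice is excluded.
by_position = df.iloc[0:2]      # rows at positions 0 and 1

# loc selects by label; the end of the slice is included.
by_label = df.loc["a":"c"]      # rows labelled a, b and c

# Both accept a row selector and a column selector.
subset = df.loc["b":"d", ["income"]]
```

Note the difference in slice behaviour: `iloc` follows the usual Python convention of excluding the end position, while `loc` includes the end label.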
What is oversampling?
Putting heavier weights in your sampling procedure to overweight the rare class relative to the majority class. Otherwise your model might not be able to identify that records belong to the rare class.
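One simple way to overweight the rare class is to sample its records with replacement until the classes are balanced. This is only a sketch of that one approach, not the only way to oversample; the data and the column name `target` are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical data: 95 majority records (0) and 5 rare records (1).
df = pd.DataFrame({"x": range(100), "target": [0] * 95 + [1] * 5})

majority = df[df["target"] == 0]
rare = df[df["target"] == 1]

# Draw rare records with replacement until both classes have the same size.
rare_upsampled = rare.sample(n=len(majority), replace=True, random_state=1)

# Combine the two classes and shuffle the rows.
balanced = pd.concat([majority, rare_upsampled]).sample(frac=1, random_state=1)
counts = balanced["target"].value_counts()
```

Oversampling of this kind is typically applied to the training partition only, so that validation and test results still reflect the original class proportions.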
Which types of variables are there?
Which data mining technique cannot deal with continuous variables?
Naive Bayes (in its standard form, which requires categorical predictors)
Which data mining techniques cannot deal with categorical variables?
Trick question - Almost every technique can deal with them.
Note: Ordered variables can sometimes be coded numerically so that they can be treated as continuous variables.
How can you use nominal categorical variables?
They often cannot be used directly in data mining techniques. You can decompose them into dummy variables to be able to use them.
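A minimal sketch of creating dummy variables with `pd.get_dummies`; the column `color` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical nominal variable.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One 0/1 column per category.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

# drop_first=True omits one category, which is then implied when all
# remaining dummy columns are 0. Some techniques (e.g. linear regression)
# need this to avoid redundant columns.
dummies_reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
```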
What is meant by explore, clean and preprocess the data?
Verifying that the data are in reasonable condition.
For example: deciding how to handle missing data and checking whether there are outliers.
What is meant by determining the data mining task?
Translating the general first question into a data mining specific question. Do you need to do classification, prediction, clustering etc.?
What is meant by partitioning the data?
Dividing it up into training, validation and testing sets in case you have a supervised task.
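A minimal sketch of a random three-way partition using only pandas. The 50/30/20 split and the toy data are assumptions chosen for illustration, not proportions prescribed by these notes:

```python
import pandas as pd

# Hypothetical data set with 100 records.
df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# 50% of the records go to the training set.
train_set = df.sample(frac=0.5, random_state=1)
rest = df.drop(train_set.index)

# 60% of the remainder (30% of the total) goes to the validation set,
# and what is left (20% of the total) becomes the test set.
valid_set = rest.sample(frac=0.6, random_state=1)
test_set = rest.drop(valid_set.index)
```

Fixing `random_state` makes the partition reproducible, which helps when comparing algorithms on the same split.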
What are the disadvantages of having too many variables in your model?
Rule of thumb: have at least 10 records per predictor variable.
What is normalization of data?
Bringing all variables to the same scale. Sometimes needed to run your algorithm well.
-> Subtract the mean and divide by the standard deviation.
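A minimal sketch of this z-score normalization in pandas; the columns and values are made up for illustration:

```python
import pandas as pd

# Hypothetical variables on very different scales.
df = pd.DataFrame(
    {"age": [20, 30, 40, 50], "income": [20000, 40000, 60000, 80000]}
)

# Subtract each column's mean and divide by its standard deviation.
# pandas uses the sample standard deviation (ddof=1) by default.
normalized = (df - df.mean()) / df.std()
```

After this step every column has mean 0 and standard deviation 1. With partitioned data, the mean and standard deviation are usually computed on the training set and then reused for the validation and test sets.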
What is overfitting?
The goal of a model is to make good predictions on any additional data over which you run your algorithm.
If a function fits your sample too perfectly, it captures the relations specific to that sample rather than the ‘general’ relation between the variables. It will therefore not predict future values well. This is overfitting.
-> In a graph, this shows up as a fitted curve that passes too close to the actual data points.
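A rough sketch of how overfitting can be made visible by comparing training and validation error. The data (a noisy linear relation), the polynomial degrees and the helper names `make_data` and `mse` are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a linear relation plus random noise.
    x = rng.uniform(0, 1, n)
    y = 2 * x + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(12)
x_valid, y_valid = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A simple model and a much more flexible one, both fit on the training data.
simple = np.polyfit(x_train, y_train, deg=1)
flexible = np.polyfit(x_train, y_train, deg=6)

for name, coeffs in [("degree 1", simple), ("degree 6", flexible)]:
    print(name,
          "train MSE:", mse(coeffs, x_train, y_train),
          "validation MSE:", mse(coeffs, x_valid, y_valid))
```

The flexible model cannot do worse than the simple one on the training data, because it contains the straight line as a special case. On the validation data it will typically do worse, although the exact numbers depend on the random draw; that gap between training and validation error is the usual sign of overfitting.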