What is Data Mining?
A term used to describe discovering knowledge from large amounts of data.
How are companies using data to understand their customers?
They analyze the vast amounts of data that they collect. Data mining supports the management of mission-critical tasks with a high level of accuracy and timeliness.
What are some reasons businesses have turned to Data Mining? (7)
What is Genomic Data?
It combines genetics with statistical data analysis and computer science.
What are Four Example Uses of Data Mining?
Detect and reduce fraudulent activities
Identify customer buying patterns and reclaim profitable customers
Identify trading rules from historical data
Aid in increased profitability using market-basket analysis
What are the Seven (1-3) Characteristics and Objectives of Data Mining?
What are the Seven (4-7) Characteristics and Objectives of Data Mining?
What are the Six (6) Multiple Disciplines associated with Data Mining?
What are the Four (4) Major Types of Patterns Data Mining Seeks to Identify?
Association - Find the commonly co-occurring groupings
Predictions - Tell the nature of future occurrences of certain events based on what has happened in the past.
Clusters - Identify natural groupings of things based on their known characteristics.
Sequential Relationships - Discover time-ordered events
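As a rough illustration of the Association pattern type, the sketch below counts how often pairs of items appear together in the same basket. The transactions and the helper name `pair_support` are made up for this example; they do not come from any particular tool or dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "milk"},
]

def pair_support(baskets):
    """Return, for each item pair, the fraction of baskets containing both items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a single canonical ordering
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)
most_common_pair = max(support, key=support.get)
```

Here `support` maps each pair to its share of baskets, and `most_common_pair` is the grouping that co-occurs most often.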
What is the Main Difference between Data Mining and Statistics?
Statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement.
What are the Fourteen (1-5) Industry Focuses where Data Mining can be Applied?
What are the Fourteen (6-10) Industry Focuses where Data Mining can be Applied?
What are the Fourteen (11-14) Industry Focuses where Data Mining can be Applied?
What does CRISP-DM Stand For?
Cross-Industry Standard Process for Data Mining
What are the Six (1-3) Steps associated with CRISP-DM?
What are the Six (4-6) Steps associated with CRISP-DM?
What does SEMMA stand for? What are the Five (5) Steps to SEMMA?
Sample - Generate a representative sample of the data
Explore - Visualization and basic representation of the data
Modify - Select variables; transform variable representations
Model - Use statistical and machine learning models
Assess - Evaluate the accuracy and usefulness of the models
What does KDD mean?
KDD stands for knowledge discovery in databases: the process of using data mining methods to find useful information and patterns in data.
What are the Five (5) Elements to KDD?
Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation
What are the Main Differences between Data Mining Methods?
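Classification learns the function between the characteristics of things (the input variables) and their class membership (the output variable) through a supervised learning process, in which both inputs and outputs are presented to the algorithm. Clustering learns the membership of things through an unsupervised learning process, in which only the input variables are presented to the algorithm.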
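The contrast can be sketched on one-dimensional data. The supervised part below is a nearest-centroid classifier and the unsupervised part is a two-cluster k-means; both were chosen only because they are short, and the data and function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def fit_centroids(points, labels):
    """Supervised: uses the known labels to compute one centroid per class."""
    classes = sorted(set(labels))
    return {c: mean([p for p, l in zip(points, labels) if l == c]) for c in classes}

def classify(centroids, point):
    """Assign a new point to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(point - centroids[c]))

def two_means(points, iterations=10):
    """Unsupervised: 1-D k-means with k=2; no labels are supplied."""
    centers = [min(points), max(points)]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # keep the old center if a group ends up empty
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = fit_centroids(points, labels)   # inputs and outputs presented
predicted = classify(centroids, 4.0)
centers, groups = two_means(points)         # inputs only
```

`fit_centroids` cannot run without `labels`, whereas `two_means` recovers the two groupings from the input values alone.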
What is a Simple Split?
It is the process of splitting the data into two mutually exclusive subsets called the training set and the test set.
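A minimal sketch of a simple split, assuming a 70/30 division and a fixed random seed; neither value is prescribed by the definition above, and `simple_split` is a hypothetical helper name.

```python
import random

def simple_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # the two slices never overlap, so the subsets are mutually exclusive
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(10))
```

Every record lands in exactly one of the two subsets.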
What is the K-fold Cross-Validation?
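Also called rotation estimation. The complete dataset is randomly split into k mutually exclusive subsets of approximately equal size. The model is then trained and tested k times, each time trained on all but one subset and tested on the remaining one.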
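The rotation can be sketched as a generator of index lists. This is an illustrative helper (`k_fold_indices` is a made-up name), not the routine of any particular library.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) once for each of k folds over n records."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # striding by k gives k disjoint folds whose sizes differ by at most one
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n=10, k=3))
```

Each record is used for testing exactly once and for training k - 1 times. Calling the helper with k equal to n puts a single record in every test fold.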
What are the Four (4) other Classification Assessment Methodologies?
Leave-one-out - Similar to K-Fold, but each subset holds a single data point: as many models are developed as there are data points, and every data point is used for testing exactly once.
Bootstrapping - A fixed number of instances from the original data is sampled (with replacement) for training, and the rest is used for testing.
Jackknifing - Similar to Leave-One-Out; accuracy is calculated by leaving one sample out at each iteration of the estimation process.
Area Under the ROC Curve - A graphical assessment in which the true positive rate is plotted on the y-axis and the false positive rate on the x-axis.
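Two of these methods are short enough to sketch. `bootstrap_split` draws a training sample with replacement and tests on whatever was never drawn. `roc_auc` computes the area under the ROC curve through its equivalent pairwise form: the share of (positive, negative) pairs in which the positive instance gets the higher score, with ties counted as half. Both function names, the sample size, and the scores are assumptions made for illustration.

```python
import random

def bootstrap_split(n, seed=0):
    """Draw n training indices with replacement; unsampled indices form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n) if i not in chosen]
    return train_idx, test_idx

def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

train_idx, test_idx = bootstrap_split(20)

# Hypothetical classifier scores for four instances.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An area of 1.0 means every positive outranks every negative, while 0.5 corresponds to random ordering.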
What are the Seven (7) Classification Techniques?