Chapter 4 - Predictive Analytics I: Data Mining Flashcards by Justin Austin

What is Data Mining?

A term used to describe discovering knowledge from large amounts of data.

How are companies dealing with data as it relates to understanding their customer?

They analyze the vast amounts of data they collect. Data mining supports the management of mission-critical tasks with a high level of accuracy and timeliness.

What are some reasons businesses have turned to Data Mining? (7)

More intense competition at the global scale
Untapped value hidden in large data sources
Consolidation and integration of database records
Consolidation of databases into a single location
Exponential increase in data processing & storage technologies
Significant reduction in cost of hardware and software
Movement toward demassification of business practices

What is Genomic Data?

It combines genetics with statistical data analysis and computer science.

What are Four Example Uses of Data Mining?

Detect and reduce fraudulent activities
Identify customer buying patterns and reclaim profitable customers
Identify trading rules from historical data
Increase profitability using market-basket analysis

What are the Seven (1-3) Characteristics and Objectives of Data Mining?

Data are cleansed and consolidated into a data warehouse
Data Mining environment is usually a client/server architecture or a Web-based IS architecture.
Sophisticated new tools help to extract information buried in corporate files or archival public records. The miner also explores the usefulness of soft data (unstructured text).

What are the Seven (4-7) Characteristics and Objectives of Data Mining?

The miner is often the end user who obtains answers quickly
Striking it rich involves finding unexpected results and requires users to think creatively throughout the process
Data mining tools are readily combined with spreadsheets and other software development tools.
Due to the large amounts of data, it is sometimes necessary to use parallel processing for data mining.

What are Six (6) Other Names associated with Data Mining?

Knowledge Extraction
Pattern Analysis
Data Archaeology
Information Harvesting
Pattern Searching
Data Dredging

What are the Four (4) Major Types of Patterns Data Mining Seeks to Identify?

Associations - Find commonly co-occurring groupings of things, such as items that are bought together.
Predictions - Tell the nature of future occurrences of certain events based on what has happened in the past.
Clusters - Identify natural groupings of things based on their known characteristics.
Sequential Relationships - Discover time-ordered events

What is the Main Difference between Data Mining and Statistics?

Statistics starts with a well-defined proposition and well-defined hypothesis whereas data mining starts with a loosely-defined discovery statement.

What are the Fourteen (1-5) Industry Focuses where Data Mining can be Applied?

Customer Relationship Management (CRM)
Banking
Retailing and Logistics
Manufacturing and Production
Brokerage and Securities Trading

What are the Fourteen (6-10) Industry Focuses where Data Mining can be Applied?

Insurance
Computer Hardware and Software
Government and Defense
Travel Industry (Airlines; Hotels; Rental Car Companies, etc.)
Healthcare

What are the Fourteen (11-14) Industry Focuses where Data Mining can be Applied?

Medicine
Entertainment Industry
Homeland Security and Law Enforcement
Sports

What does CRISP-DM Stand For?

Cross-Industry Standard Process for Data Mining

What are the Six (1-3) Steps associated with CRISP-DM?

Business Understanding - The key element of any data mining study is to know what the study is for.
Data Understanding - Identify the relevant data from many available databases.
Data Preparation - Take data and prepare it for analysis by data mining methods.

What are the Six (4-6) Steps associated with CRISP-DM?

Model Building - Modeling techniques are selected and applied to an already prepared dataset to address the specific business need.
Testing and Evaluation - Models are assessed for whether, and to what extent, they meet the business objectives.
Deployment - Organize and present the knowledge gained from the exploration so that end users can understand and use it.

What does SEMMA stand for? What are the Five (5) Steps to SEMMA?

Sample - Generate a representative sample of the data
Explore - Visualization and basic description of the data
Modify - Select variables; transform variable representations
Model - Use statistical and machine learning models
Assess - Evaluate the accuracy and usefulness of the models

What does KDD mean?

KDD stands for knowledge discovery in databases: the process of using data mining methods to find useful information and patterns in data.

What are the Five (5) Elements to KDD?

Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation

What are the Main Differences between Data Mining Methods?

Classification learns the function that maps the characteristics of things (the input variables) to their class membership (the output variable) through a supervised learning process, in which both inputs and known outputs are presented to the algorithm. Clustering learns group membership through an unsupervised learning process, in which only the input variables are presented to the algorithm.

What is a Simple Split?

It is the process of splitting the data into two mutually exclusive subsets called the training set and the test set.
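
A simple split can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name simple_split, the 30% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random

def simple_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and cut it into training and test sets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The two slices never overlap, so the subsets are mutually exclusive.
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(10))
print(len(train), len(test))
```

Every record lands in exactly one of the two subsets, which is what "mutually exclusive" requires.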

What is the K-fold Cross-Validation?

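Also called rotation estimation. The complete dataset is randomly split into k mutually exclusive subsets (folds) of approximately equal size. The model is then trained and tested k times, each time trained on all but one fold and tested on the remaining fold.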
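
The fold construction can be sketched as below. This is an illustrative sketch that only produces index lists; the name k_fold_indices and the seed are assumptions, and model training is left out.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and n_records")
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # A stride of k gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))
```

Each record is used for testing exactly once and for training in the other k - 1 rounds.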

What are the Four (4) other Classification Assessment Methodologies?

Leave-one-out - Similar to k-fold cross-validation with each fold holding a single data point: every data point is used for testing once, on as many models as there are data points.
Bootstrapping - A fixed number of instances are sampled (with replacement) from the original data for training, and the rest are used for testing.
Jackknifing - Similar to leave-one-out; accuracy is calculated by leaving one sample out at each iteration of the estimation process.
Area Under the ROC Curve - A graphical assessment in which the true positive rate is plotted on the y-axis and the false positive rate on the x-axis. An area of 1 indicates a perfect classifier; 0.5 indicates one no better than random chance.
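
The area under the ROC curve can be computed without drawing the curve. The sketch below relies on a standard equivalence: the area equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The function name roc_auc and the sample scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Compare every positive score against every negative score.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four cases (1 = positive, 0 = negative).
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))
```

The pairwise comparison is quadratic in the number of cases, which is fine for a small illustration but not for large datasets.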

What are the Seven (7) Classification Techniques?

Decision Tree Analysis
Statistical Analysis
Neural Networks
Case-Based Reasoning
Bayesian Classifiers
Genetic Algorithms
Rough Sets

Why are Ensemble Models for Predictive Analytics Effective?

Combining forecasts can improve accuracy and robustness of information outcomes, while reducing uncertainty and bias associated with individual models.
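
One simple way to combine classification models is majority voting. The sketch below is only an illustration of that idea: the three threshold "models" are hypothetical stand-ins for trained classifiers, and majority_vote is a name chosen for the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class labels predicted for one record by simple majority."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical base models that disagree about where "high" begins.
models = [
    lambda x: "high" if x > 3 else "low",
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 7 else "low",
]

print(majority_vote([model(6) for model in models]))
```

Using an odd number of models avoids tied votes in a two-class problem.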

What is a Decision Tree?

It recursively divides a training set until each division consists entirely or primarily of examples of one class. Each non-leaf node contains a split-point, which is a test on one or more attributes and determines how the data are to be split further.

What is the Gini Index?

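It has been used in economics to measure the diversity of a population. In decision trees, the same concept measures the purity of the classes that result from branching on a particular attribute.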
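
As a split criterion in decision trees, the Gini index is commonly computed as one minus the sum of the squared class proportions. The sketch below assumes class labels held in plain lists; the function names are illustrative.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(left, right):
    """Size-weighted Gini index of the two branches produced by a candidate split."""
    n = len(left) + len(right)
    return len(left) / n * gini_index(left) + len(right) / n * gini_index(right)

print(gini_index(["yes", "yes", "no", "no"]))
print(gini_of_split(["yes", "yes"], ["no", "no"]))
```

A lower weighted value means purer branches, so a tree builder using this criterion would prefer the split with the smallest result.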

What is Information Gain?

It is the splitting mechanism used in ID3, which is perhaps the most widely known decision tree algorithm.

What is Entropy?

It measures the extent of uncertainty or randomness in a data set.
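
Entropy, and the information gain that builds on it, can be sketched as follows. This is a minimal illustration using the usual Shannon formula in bits; the function names and the list-of-labels representation are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum over classes of p * log2(1 / p)."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy of the parent minus the size-weighted entropy of its branches."""
    n = len(parent)
    remainder = sum(len(branch) / n * entropy(branch) for branch in branches)
    return entropy(parent) - remainder

parent = ["yes", "yes", "no", "no"]
print(entropy(parent))
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))
```

A split that leaves every branch pure removes all uncertainty, so its gain equals the entropy of the parent.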

What are Cluster Analysis Results typically Used For?

Cluster analysis is used to classify items, events, or concepts into common groupings. It is commonly applied in biology, medicine, genetics, social network analysis, anthropology, archaeology, astronomy, character recognition, and even MIS development.

What is the Apriori algorithm?

It is the most commonly used algorithm to discover association rules. Given a set of itemsets, the algorithm attempts to find subsets that are common to at least a minimum number of the itemsets.
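
The level-by-level search can be sketched in Python. This is a simplified illustration of the Apriori idea (grow candidates one item at a time and discard any candidate with an infrequent subset), not a tuned implementation; the basket data and the name apriori are made up for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical market baskets.
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "diapers", "bread"},
]
for itemset, n in apriori(baskets, min_support=2).items():
    print(sorted(itemset), n)
```

The prune step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are frequent.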

(31 cards)