– Uses statistical treatment (mathematical methods) to analyze data. Helps identify main characteristics, patterns, and distribution before deeper analysis.
Exploratory Data Analysis
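A minimal sketch of this kind of first look, using only Python's standard `statistics` module. The `sales` values are made up for illustration, and the "more than 2 standard deviations" rule is just one simple assumption for spotting unusual values:

```python
import statistics

# Hypothetical sample data, for illustration only.
sales = [12, 15, 14, 13, 15, 90, 14, 16]

# Main characteristics: size, center, spread, and range.
summary = {
    "count": len(sales),
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),
    "min": min(sales),
    "max": max(sales),
}

# Flag values more than 2 sample standard deviations away from the mean.
outliers = [x for x in sales if abs(x - summary["mean"]) > 2 * summary["stdev"]]

for name, value in summary.items():
    print(f"{name}: {value}")
print("possible outliers:", outliers)
```

A large gap between the mean and the median, as in this sample, is a hint that the distribution is skewed by a few extreme values.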
– Reshaping cleaned data and reprocessing missing values to make the dataset ready for modeling or assessment.
Data Preparation for Modelling/Assessment
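One possible way to sketch this step, assuming hypothetical records where `None` marks a missing value. Median imputation and the helper name `impute_median` are illustrative choices, not the only way to handle missing data:

```python
import statistics

# Hypothetical cleaned records; None marks a missing value.
rows = [
    {"age": 25, "income": 3000},
    {"age": None, "income": 4200},
    {"age": 31, "income": None},
    {"age": 40, "income": 5100},
]

def impute_median(records, column):
    """Return copies of the records with missing values in `column` set to the column median."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = statistics.median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]

# Reprocess missing values column by column.
prepared = impute_median(impute_median(rows, "age"), "income")

# Reshape into a feature matrix (list of lists) that a model could consume.
features = [[r["age"], r["income"]] for r in prepared]
print(features)
```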
– Testing different models and strategies to solve the business problem and achieve objectives.
Modeling
– Setting up a validation scheme that monitors the data product's performance while it is running, to ensure reliable results.
Implementation
– The practice of analyzing large datasets to discover useful patterns and hidden relationships. Uses machine learning, statistics, and AI for tasks like marketing, fraud detection, and scientific discovery. Also known as KDD (Knowledge Discovery in Databases).
Data Mining
– A process with the following stages: Business Understanding, Gathering of Data, Data Preparation, Conceptualization, and Evaluation of Model.
Traditional Data Mining Life Cycle
– A five-stage process designed to guide data mining projects for developing predictive models:
SEMMA Methodology
– selecting the dataset for modeling
Sample
– understanding the data by discovering both expected and unexpected relationships, including abnormalities, often with visualization
Explore
– selecting, creating, and transforming variables in preparation for modeling
Modify
– building the strategy to solve the problem
Model
– evaluating the modeling results to test reliability and usefulness
Assess
What are the five stages of the SEMMA methodology in data mining?
Sample, Explore, Modify, Model, Assess
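The five stages can be sketched end to end on a toy problem. Everything here is an assumption for illustration: the synthetic data (`y = 2x + 1`), the 70/30 split, centering as the "Modify" step, and a hand-written one-variable least-squares fit as the "Model" step:

```python
import random
import statistics

# Hypothetical data that follows y = 2*x + 1 exactly, to keep the sketch simple.
data = [(x, 2 * x + 1) for x in range(100)]

# Sample: select a subset for modeling and hold out the rest.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
train, holdout = shuffled[:70], shuffled[70:]

# Explore: look at basic statistics of the sampled variables.
xs = [x for x, _ in train]
ys = [y for _, y in train]
x_mean = statistics.mean(xs)
y_mean = statistics.mean(ys)
print("mean x:", x_mean, "mean y:", y_mean)

# Modify: transform the input variable (here, center it on its mean).
xs_centered = [x - x_mean for x in xs]

# Model: fit a least-squares line on the centered input.
slope = (
    sum(xc * (y - y_mean) for xc, y in zip(xs_centered, ys))
    / sum(xc * xc for xc in xs_centered)
)

def predict(x):
    """Predict y for a raw (uncentered) x using the fitted line."""
    return y_mean + slope * (x - x_mean)

# Assess: measure mean absolute error on the held-out rows.
mae = statistics.mean([abs(predict(x) - y) for x, y in holdout])
print("slope:", slope, "holdout MAE:", mae)
```

Because the toy data is perfectly linear, the holdout error is essentially zero; with real data the Assess stage is where an unreliable model would show up.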
– Refers to a collection of data that is extremely large in volume, grows exponentially over time, and is too complex for traditional data management tools to store or process efficiently. It is still data, but much larger and harder to handle.
Big Data
Characteristics of Big Data
Volume, Velocity, Variety, Veracity
– the size of data is massive
Volume
– the speed of data generation and processing
Velocity
– data comes in multiple forms (Structured, Unstructured, Semi-structured)
Variety
What form of data in Big Data is available in spreadsheets and databases?
Structured
What form of data in Big Data includes text, images, audio, and video?
Unstructured
What form of data in Big Data is a combination of structured and unstructured?
Semi-structured
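A small sketch of how the three forms differ in practice. The sample strings are made up, and using CSV for structured data and JSON for semi-structured data is a common convention assumed here, not something the cards above specify:

```python
import csv
import io
import json

# Structured: fixed columns, as in a spreadsheet or database table.
csv_text = "id,name,amount\n1,Ana,250\n2,Ben,100\n"
structured = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: fields are labeled, but records may differ in shape.
json_text = '[{"id": 1, "tags": ["vip"]}, {"id": 2, "note": "new customer"}]'
semi_structured = json.loads(json_text)

# Unstructured: free text with no predefined fields.
unstructured = "Customer called to ask about a late delivery."

print(structured[0]["name"])        # every row has the same columns
print(sorted(semi_structured[1]))   # keys vary from record to record
print(len(unstructured.split()))    # only generic operations, such as counting words
```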
– the trustworthiness of the data: its completeness and quality
Veracity
– identifying the problem and determining the main cause
Business Problem Definition
– analyzing what other companies have done in similar cases
Research