What are the 4 Vs of Big Data?
Volume
Velocity
Variety
Veracity
What paradigm do big data scientists use?
Retrospective data mining with multiple hypotheses
Looking for patterns without a particular hunch
Types of Data?
Structured Data
Unstructured Data
Typical Data Structures
Typical Datasets
- CSV
- eXtensible Markup Language (XML)
- JavaScript Object Notation (JSON)
- Nested JSON in CSV
- SQL
- Excel
Other Data Formats
- .txt files for text
- RGB data for images
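As an illustration of the "nested JSON in CSV" case, here is a minimal sketch using only the Python standard library. The sample data, the column names (`id`, `name`, `profile`) and the helper `load_rows` are hypothetical and exist only for this example.

```python
import csv
import io
import json

# Hypothetical sample: a CSV whose "profile" column holds a JSON object.
# Inside a quoted CSV field, a double quote is escaped by doubling it.
RAW_CSV = '''id,name,profile
1,Ada,"{""city"": ""London"", ""tags"": [""maths"", ""engines""]}"
2,Grace,"{""city"": ""New York"", ""tags"": [""compilers""]}"
'''


def load_rows(text):
    """Parse CSV text and decode the JSON stored in the 'profile' column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["id"] = int(row["id"])                   # CSV values arrive as strings
        row["profile"] = json.loads(row["profile"])  # nested JSON -> dict
        rows.append(row)
    return rows


rows = load_rows(RAW_CSV)
print(rows[0]["profile"]["city"])
```

The flat columns (`id`, `name`) are structured data in the usual tabular sense, while the `profile` column needs a second parsing step before its fields can be queried.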
What is Quantitative Data?
Numbers, such as -
What is Qualitative Data?
Text, such as -
What is Quantitative Analysis?
Quantitative Statistical Analysis:
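One simple form of quantitative statistical analysis is a descriptive summary of a numeric column. The sketch below uses Python's built-in `statistics` module. The sample values and the helper `summarise` are hypothetical, and the notes may have had other methods in mind.

```python
import statistics

# Hypothetical sample of numeric observations (for example, daily order counts).
values = [12, 15, 11, 19, 15, 22, 15]


def summarise(data):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
    }


summary = summarise(values)
for name, value in summary.items():
    print(name, value)
```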
What is Qualitative Analysis?
Thematic Analysis
Advanced methods:
- Text analytics: word embeddings
- Review mining
Gartner Analytic Continuum
- Descriptive Analytics: Hindsight
- Diagnostic Analytics: Insight
- Predictive Analytics: Foresight
- Prescriptive Analytics
Difficulty and value increase from descriptive to prescriptive
Typical Data Analytics Process
Data gathering/wrangling/linking -> data cleansing -> exploratory data analysis [EDA] -> supervised machine learning
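The stages above can be sketched end to end on a toy dataset. This is a minimal illustration, not a production pipeline: the records, the function names (`cleanse`, `explore`, `fit_line`) and the choice of least-squares regression as the supervised step are all assumptions made for the example, and the gathering/linking stage is replaced by a hard-coded list.

```python
import statistics

# Hypothetical raw records: (hours_studied, exam_score); None marks a missing value.
raw = [(1, 52), (2, 55), (None, 60), (3, 61), (4, None), (5, 70), (6, 74)]


def cleanse(records):
    """Data cleansing: drop records with any missing field."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def explore(records):
    """EDA: basic summaries of each column."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    return {
        "n": len(records),
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
    }


def fit_line(records):
    """Supervised learning: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


clean = cleanse(raw)                 # data cleansing
stats = explore(clean)               # exploratory data analysis
slope, intercept = fit_line(clean)   # supervised machine learning


def predict(x):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


print(stats)
print(predict(4))
```

Dropping incomplete records is only one cleansing strategy; imputing missing values is a common alternative.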
EDA
Supervised Machine Learning
Machine Learning
Supervised ML
Unsupervised ML
Semi-supervised ML
- Some labelled data
The 5 Tribes of ML & the No free lunch theorem
Symbolists | Structure Inference | Production Rule System & Inverse Deduction
Connectionists | Estimating Parameters | Backpropagation & Deep Learning
Bayesians | Weighing Evidence | HMM & Graphical Models
Evolutionaries | Structure Learning | Genetic Algorithms & Evolutionary Programming
Analogisers | Mapping to Novelty | kNN and SVM
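To make the Analogisers row concrete, here is a minimal k-nearest-neighbours (kNN) classifier using only the standard library. The training points, labels and the function name `knn_predict` are hypothetical, and ties between labels are resolved by whichever label `Counter` encountered first, which is a simplification.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort labelled points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: two well-separated clusters in 2D.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

A new point is labelled by analogy with the most similar examples already seen, which is the sense of "mapping to novelty" in the row above.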
The Neat and Scruffy Data Scientist
Neat: they care about the details and the ML methods
Scruffy: they care about the results and are somewhat ignorant of details and the methods