Characterization (3Vs)
Variety: Different forms of data
Volume: Petabytes of data
Velocity: Real-time data
Big Data Analysis Pipeline
Data Lake requirements
Advantages of cloud
Disadvantages of cloud
Three-Tier Server
Presentation ➔ Logic ➔ Data
Design Cloud
Fallacies of cloud
Cloud characteristics
Google File System
Split files into chunks stored across chunk servers, replicate each chunk, master node holds metadata and controls access
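A minimal sketch of the chunk/replica/master idea above, not the actual GFS implementation. The names Master, write_file and read_file are hypothetical, chunk servers are modelled as plain in-memory dicts, and the tiny chunk size is only for illustration (real GFS used 64 MB chunks with three replicas by default).

```python
import itertools

CHUNK_SIZE = 4   # bytes per chunk in this toy; real GFS used 64 MB
REPLICAS = 3     # copies kept of every chunk


class Master:
    """Holds metadata only: which chunk servers store each chunk."""

    def __init__(self, servers):
        self.servers = servers          # server name -> {chunk id: bytes}
        self.locations = {}             # (filename, index) -> [server names]
        self._next = itertools.cycle(range(len(servers)))

    def place(self, filename, index):
        # Pick REPLICAS distinct servers, rotating the starting server.
        names = list(self.servers)
        start = next(self._next)
        chosen = [names[(start + i) % len(names)] for i in range(REPLICAS)]
        self.locations[(filename, index)] = chosen
        return chosen


def write_file(master, filename, data):
    """Split data into chunks and store each chunk on several servers."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for index, chunk in enumerate(chunks):
        for name in master.place(filename, index):
            master.servers[name][(filename, index)] = chunk
    return len(chunks)


def read_file(master, filename, n_chunks):
    """Ask the master where each chunk lives, then read from one replica."""
    parts = []
    for index in range(n_chunks):
        name = master.locations[(filename, index)][0]
        parts.append(master.servers[name][(filename, index)])
    return b"".join(parts)
```

Usage: with servers = {"cs0": {}, "cs1": {}, "cs2": {}, "cs3": {}} and master = Master(servers), calling write_file(master, "log", data) followed by read_file(master, "log", n) returns the original bytes, and losing any single server still leaves two copies of every chunk.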
Map Reduce
ACID
CAP
BASE
Types of NoSQL storage
Steps of machine learning
Data ➔ Preprocessing ➔ Feature extraction ➔ Learning ➔ Testing ➔ Analysis
Decision tree
- At each node, split on the attribute with the highest information gain
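A small sketch of the attribute choice at a single node. It reads the selection criterion above as entropy-based information gain (as in ID3), which is an assumption; other criteria such as Gini impurity are also common. The function names and the dict-per-row data layout are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the highest information gain for this node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

Usage: given rows as dicts such as {"outlook": "sunny", "windy": False} and a parallel list of class labels, best_attribute(rows, labels, ["outlook", "windy"]) returns the attribute to split on. A full tree would repeat this on each resulting subset.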
K-means clustering
- Pick k random start points (centroids), assign each data point to its nearest centroid, recalculate each centroid as the mean of its assigned points, iterate until assignments stop changing
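The loop above can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: points are tuples of numbers, distance is squared Euclidean, start points are sampled from the data, and the function name kmeans with its iterations and seed parameters is hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Return (centroids, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # random start points
    assignment = None
    for _ in range(iterations):
        # Assign every point to the nearest centroid.
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        if new_assignment == assignment:    # nothing moved: converged
            break
        assignment = new_assignment
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                     # keep old centroid if cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment
```

Usage: kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2) groups the first three points together and the last three together. The result depends on the start points in general, so running with several seeds and keeping the best clustering is a common precaution.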