Desired Properties of a Big Data System
1 Robustness and fault tolerance
2 Low latency
3 Minimal Maintenance
4 Ad hoc queries
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem
- Instead, you have to use a variety of tools and techniques to build a complete Big Data system
The main idea of the Lambda Architecture is to…
build Big Data systems as a series of layers
Each layer satisfies a subset of the desired properties and builds upon the functionality provided by the layers above it
Storing data in raw format has many advantages:
Data should be stored in raw format, should be
2. Kept forever
Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers
The operations you can do with a distributed filesystem are often…
more limited than you can do with a regular filesystem
For instance, you may not be able to write to the middle of a file or even modify a file at all after creation
Google needed a good distributed file system, why?
Redundant storage of massive amounts of data on cheap and unreliable computers
Why did Google not use an existing file system?
Google File System - Assumptions
Bigtable is…
a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
The Split-Apply-Combine Approach
Split
Break up a big problem into manageable pieces
The Split-Apply-Combine Approach
Apply
Operate on each piece independently
The Split-Apply-Combine Approach
Combine
Put all the pieces back together
The MapReduce model consists of two main stages
Map
input data is split into discrete chunks to be processed
The MapReduce model consists of two main stages
Reduce
output of the map phase is aggregated to produce the desired result
MapReduce
The simple nature of the programming model lends itself to…
efficient and large-scale implementations across thousands of cheap nodes (computers).
Key benefits of MapReduce
Key benefits of MapReduce
Simplicity
Developers can write applications in their language of choice, such as Java, C++ or Python
Key benefits of MapReduce
Scalability
MapReduce can very large amounts of data, stored in HDFS on one cluster
Key benefits of MapReduce
Speed
Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes
Limitations of MapReduce
2. Low level framework (hard to use)
New tools have been developed to simplify the use of MapReduce
- Apache Pig (script language)