We decide to build a simple web analytics application to better understand the behavior of our users.
What kind of system should we put in place to fulfill the requirements?
We start with a traditional relational schema for the pageviews
Analytics Server -> Database
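As a rough illustration of such a schema, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table name, column names, and the helpers `record_pageview` and `get_pageviews` are assumptions made for this example, not part of the original design.

```python
import sqlite3

# Hypothetical schema: one row per (user, URL) pair holding a running counter.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pageviews (
        user_id   INTEGER NOT NULL,
        url       TEXT    NOT NULL,
        pageviews INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (user_id, url)
    )
    """
)


def record_pageview(conn, user_id, url):
    """Increment the counter for (user_id, url), creating the row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO pageviews (user_id, url, pageviews) VALUES (?, ?, 0)",
        (user_id, url),
    )
    conn.execute(
        "UPDATE pageviews SET pageviews = pageviews + 1 WHERE user_id = ? AND url = ?",
        (user_id, url),
    )
    conn.commit()


def get_pageviews(conn, user_id, url):
    """Return the stored counter, or 0 if the pair has never been seen."""
    row = conn.execute(
        "SELECT pageviews FROM pageviews WHERE user_id = ? AND url = ?",
        (user_id, url),
    ).fetchone()
    return row[0] if row else 0


# Every pageview reported by the analytics server becomes a database write.
record_pageview(conn, 1, "/home")
record_pageview(conn, 1, "/home")
record_pageview(conn, 2, "/home")
```

Note that every single pageview triggers a write to the one database, which is where the trouble starts once traffic grows.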
Which problems can emerge from this approach?
Scaling problems
Our startup is a huge success and traffic is growing rapidly
Our main application is fine: it is hosted on Amazon Web Services,
which can handle the traffic
However, our analytics application is struggling to keep up with the traffic
We look at the logs and see that the bottleneck is the database: requests arrive faster than it can process them
How to deal with scaling problems?
The best approach is to use multiple database servers and spread the table across all of them, so that each server holds a subset of the data. This technique is known as horizontal partitioning, or sharding.
Hash function
a function that decides which database should keep information about a user
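A minimal sketch of such a hash function follows. It assumes user IDs are strings and that each database server (a "shard") can be stood in for by a plain dictionary. `NUM_SHARDS`, `shard_for`, `record_pageview`, and `get_pageviews` are hypothetical names chosen for illustration. A cryptographic digest is used only because it is stable across processes, unlike Python's built-in `hash()` for strings.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of database servers


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index in the range [0, num_shards)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each dict stands in for one database server holding a subset of the table.
shards = [{} for _ in range(NUM_SHARDS)]


def record_pageview(user_id: str, url: str) -> None:
    """Route the write to the shard that owns this user."""
    shard = shards[shard_for(user_id)]
    key = (user_id, url)
    shard[key] = shard.get(key, 0) + 1


def get_pageviews(user_id: str, url: str) -> int:
    """Reads use the same hash function, so they reach the same shard."""
    return shards[shard_for(user_id)].get((user_id, url), 0)


record_pageview("alice", "/home")
record_pageview("alice", "/home")
record_pageview("bob", "/pricing")
```

Because reads and writes for a given user always go through the same function, all of that user's data lives on a single server.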
As the application becomes more popular, we keep deploying more database servers
Each new server changes how data maps to servers, so existing data must be redistributed; every time we add one more database this process becomes more and more painful
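To get a feel for why adding a server is costly, the sketch below assumes a simple modulo-based hash function (one possible choice, not necessarily the one used in practice) and measures what fraction of users would land on a different server after going from n to n+1 servers. The names `shard_for` and `fraction_moved` and the synthetic user IDs are assumptions for this example.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Modulo-based placement: stable hash of the user ID, reduced mod num_shards."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(user_ids, old_n: int, new_n: int) -> float:
    """Fraction of users whose shard changes when going from old_n to new_n shards."""
    moved = sum(1 for u in user_ids if shard_for(u, old_n) != shard_for(u, new_n))
    return moved / len(user_ids)


# Synthetic user IDs, purely for illustration.
user_ids = [f"user-{i}" for i in range(10_000)]

for n in (4, 8, 16):
    print(f"{n} -> {n + 1} servers: {fraction_moved(user_ids, n, n + 1):.2%} of users move")
```

With a uniform hash, one would expect roughly n/(n+1) of the users to change servers under this scheme, so most of the data has to be copied on every expansion, and the total amount to copy grows along with the data set.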
Fault-tolerance issues
With many database servers, it becomes common for the hard drive in one of them to fail
Our system is not resilient to hardware errors
Data corruption issues
At some point we deploy code with a bug: instead of incrementing each pageview count by one, our code increments it by two. We notice the mistake only 24 hours later.
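One way to see why this kind of mistake is hard to undo: if the database stores only a running counter, two different histories can leave exactly the same stored value. The sketch below is a toy illustration with made-up numbers, and `apply_views` is a hypothetical helper.

```python
def apply_views(counter: int, num_views: int, increment: int) -> int:
    """Apply num_views updates to a counter, each adding `increment`."""
    return counter + num_views * increment


# History A: 10 views counted correctly, then 5 views counted by the buggy code (+2 each).
stored_a = apply_views(apply_views(0, 10, 1), 5, 2)
true_a = 10 + 5

# History B: 20 views, all counted correctly.
stored_b = apply_views(0, 20, 1)
true_b = 20

# Both histories leave the same value in the database, yet the true counts differ.
print(stored_a, stored_b, true_a, true_b)
```

From the stored counter alone we cannot tell which history occurred, so the correct value cannot be recovered unless the individual events were also kept somewhere.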
Our system is not resilient to human errors
The desired properties of Big Data systems are related both to complexity and scalability
Complexity
generally used to characterize something with many parts where those parts interact with each other in multiple ways
Scalability
ability to maintain performance in the face of increasing data or load by adding resources to the system
A Big Data system must …
Desired properties of a Big Data system
Systems need to behave correctly despite:
These challenges make it difficult to reason about what a system is doing
- Part of making a Big Data system robust is avoiding these complexities so that you can easily reason about the system
Latency is the time between a request and a response
The vast majority of applications require reads to be satisfied with very low latency, typically between a few milliseconds and a few hundred milliseconds
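As a small illustration of the definition, latency can be measured by timing a call from the moment the request is issued until the response comes back. In the sketch below, `measure_latency_ms` and `fake_read` are hypothetical names, and the 10 ms sleep merely simulates a read against some data store.

```python
import time


def measure_latency_ms(request_fn, *args, **kwargs):
    """Return (response, elapsed milliseconds) for a single request."""
    start = time.perf_counter()
    response = request_fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return response, elapsed_ms


def fake_read(url: str) -> dict:
    """Stand-in for a read query; the sleep simulates about 10 ms of work."""
    time.sleep(0.01)
    return {"url": url, "pageviews": 42}


response, latency_ms = measure_latency_ms(fake_read, "/home")
print(f"read took {latency_ms:.1f} ms")
```

In practice one would collect many such measurements and look at their distribution rather than a single sample.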
Latency requirements vary a great deal between applications:
Maintenance is the work required to keep a system running smoothly
An important part of minimizing maintenance is choosing components that have as little implementation complexity as possible
Being able to do ad hoc queries on your data is extremely important