What is Big Data
Today, companies often have the ability to store any data they generate, but don’t know what to do with it.
more data = more processing
key drivers of Big Data
This trend started when
3 reasons for the exponential data growth
Instrumentation
Interconnection
Intelligence
3 Big Data Characteristics
What is the Blind Zone?
We create more data than we can process.
Blind Zone = what WE don’t know: the data we create but never analyse.

Can data warehouses handle Big Data and why?
How much of the data an organisation creates is cleansed, transformed and loaded into the Data Warehouse?
Only 20% of the data that could be used.
The remaining 80% of data is raw, unstructured or semi-structured.
3 categories of data based on its form in the primary source
Relational databases only work well with structured data.
How to handle unstructured data?
NoSQL databases (“Not only” SQL databases)
NoSQL Databases Attributes
NoSQL database falls into several technology architecture categories:
Relational Databases Attributes
define data velocity
How fast data is generated, flows, is stored, retrieved and analysed.
key characteristics of stream analytics + 2 use cases
Basically, what is required to make Big Data valuable?
Need to be able to process a massive volume of disparate types of data and analyse it to produce insight in a time frame driven by the business need.
Are DW trusted? Need?
Businesses need trust.
Need
Characteristics of Hadoop

HDFS - Hadoop Distributed File System
A distributed, scalable, and portable file system written in Java for the Hadoop framework.
Its data is divided into blocks, and copies of these blocks are stored on multiple servers across the Hadoop cluster.
Think of a file that contains the phone numbers for everyone in the United States; the people with a last name starting with A might be stored on server 1, B on server 2, and so on. Hadoop pieces together the phonebook across its cluster.
The example below shows such replication on both the same rack and other racks (double protection).
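The phonebook example above can be sketched in a few lines of Python. This is an illustration only: the block size, server and rack names, and the replication factor of 3 are assumptions for the example, not real HDFS settings, and the real HDFS placement policy is more involved.

```python
# Toy sketch of HDFS-style block splitting and replication.
BLOCK_SIZE = 4      # bytes per block (tiny, for illustration)
REPLICATION = 3     # copies of each block

# Hypothetical cluster layout: rack -> servers on that rack.
CLUSTER = {
    "rack1": ["server1", "server2"],
    "rack2": ["server3", "server4"],
}

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Divide a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id, cluster, replication=REPLICATION):
    """Place two replicas on one rack and one on another rack --
    the 'same rack and other racks' double protection."""
    racks = list(cluster)
    first_rack = racks[block_id % len(racks)]        # rotate for balance
    other_rack = racks[(block_id + 1) % len(racks)]
    servers = cluster[first_rack][:2] + cluster[other_rack][:1]
    return servers[:replication]

phonebook = b"Adams 555-0001 Baker 555-0002"
blocks = split_into_blocks(phonebook)
for i, block in enumerate(blocks):
    print(f"block {i}: {block!r} -> {place_replicas(i, CLUSTER)}")
```

Hadoop can then reassemble the phonebook by reading any surviving replica of each block, in order.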

3 benefits of Hadoop’s file redundancy
Redundancy achieves availability even as components fail.
Redundancy increases scalability.
Redundancy makes data local.
MapReduce programming model
Components of a HDFS cluster
An HDFS cluster has two types of nodes (i.e. servers) operating in a master-worker pattern: one namenode, as the master node, and a number of worker datanodes.
namenode - manages the filesystem namespace
namenode essential for the filesystem to function
Datanodes hold the filesystem data in the form of blocks.
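The namenode/datanode split can be made concrete with a small Python sketch. All file paths, block IDs, and node names here are invented for illustration: the point is that the namenode holds only metadata (which blocks make up a file, and where each replica lives), while the datanodes hold the actual block bytes.

```python
# Toy model of the HDFS metadata/data split (names are hypothetical).
namenode = {
    # filesystem namespace: filename -> ordered list of block IDs
    "namespace": {"/phonebook.txt": ["blk_1", "blk_2"]},
    # block ID -> datanodes holding a replica of that block
    "block_map": {"blk_1": ["datanode1", "datanode2"],
                  "blk_2": ["datanode2", "datanode3"]},
}

# Datanodes store the block contents themselves.
datanodes = {
    "datanode1": {"blk_1": b"Adams 555-0001 "},
    "datanode2": {"blk_1": b"Adams 555-0001 ", "blk_2": b"Baker 555-0002"},
    "datanode3": {"blk_2": b"Baker 555-0002"},
}

def read_file(path):
    """Client read: ask the namenode for block locations, then fetch
    each block from the first datanode that holds a replica."""
    data = b""
    for blk in namenode["namespace"][path]:
        node = namenode["block_map"][blk][0]
        data += datanodes[node][blk]
    return data

print(read_file("/phonebook.txt"))  # b'Adams 555-0001 Baker 555-0002'
```

This also shows why the namenode is essential: without its namespace and block map, the raw blocks on the datanodes cannot be reassembled into files.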
MapReduce Framework follows master-worker architecture.
The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker’s load and available resources.
Each job is broken down into Map tasks, based on the number of data blocks that require processing, and Reduce tasks. The JobTracker assigns tasks to TaskTrackers based on locality and load balancing.
It achieves locality by matching a TaskTracker to Map tasks that process data local to it, preferably on the same node or, failing that, on the same rack.
It load-balances by ensuring that all available TaskTrackers are assigned tasks. TaskTrackers regularly update the JobTracker with their status through heartbeat messages.
master node contains
worker nodes contain
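The JobTracker’s locality-aware assignment described above can be sketched as a simple three-step preference: node-local, then rack-local, then any tracker with a free slot. The tracker names, rack layout, and slot counts below are assumptions for illustration, not Hadoop defaults.

```python
# Toy sketch of locality-aware Map-task scheduling.
TRACKERS = {
    "tt1": {"rack": "rack1", "free_slots": 1},
    "tt2": {"rack": "rack1", "free_slots": 2},
    "tt3": {"rack": "rack2", "free_slots": 2},
}

# Map task -> trackers whose node holds a replica of its input block.
TASKS = {
    "map0": ["tt1", "tt3"],
    "map1": ["tt3"],
    "map2": ["tt1"],
}

def assign_tasks(tasks, trackers):
    assignment = {}
    for task, replica_nodes in tasks.items():
        # 1. Node-local: a tracker that hosts a replica of the block.
        candidates = [t for t in replica_nodes
                      if trackers[t]["free_slots"] > 0]
        if not candidates:
            # 2. Rack-local: a tracker sharing a rack with a replica.
            racks = {trackers[t]["rack"] for t in replica_nodes}
            candidates = [t for t, info in trackers.items()
                          if info["rack"] in racks and info["free_slots"] > 0]
        if not candidates:
            # 3. Anywhere with a free slot (load balancing).
            candidates = [t for t, info in trackers.items()
                          if info["free_slots"] > 0]
        chosen = candidates[0]
        trackers[chosen]["free_slots"] -= 1
        assignment[task] = chosen
    return assignment

assignment = assign_tasks(TASKS, TRACKERS)
print(assignment)  # map0, map1 run node-local; map2 falls back to rack-local
```

Here map2’s data lives on tt1, but tt1’s slot is already taken, so the scheduler falls back to tt2 on the same rack.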

Performing a Hadoop ‘Job’
outline + example
3 phases of MapReduce
MapReduce is a programming model that is used as part of a framework, such as Hadoop, based on key-value pairs.
It forces a file to undergo three stages:
1. Map: the task is distributed among the computers in the cluster and processes the inputs; produce key-value pairs
2. Shuffle: collects and sorts the key-value pairs by key (the keys being chosen by the user) and distributes them to different machines for the reduce phase. Every record for a given key then goes to the same reducer.
3. Reduce: takes the grouped results from the shuffle phase and combines them to produce the desired result.
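The three stages above can be sketched as plain Python for the classic word-count example. Real Hadoop distributes each phase across the cluster; here everything runs in one process purely to show the data flow.

```python
# In-process simulation of the Map -> Shuffle -> Reduce stages.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word in the input."""
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word.lower(), 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: group values by key, so every record for a given key
    ends up at the same reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into the final result."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # 'the' -> 3, 'fox' -> 2, the rest -> 1
```

The same key-value structure carries over to any MapReduce job: only the map and reduce functions change.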
Programming in Hadoop
PIG
HIVE
Summary of HadoopDB Architecture
An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Apache Spark
Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD): a read-only multiset of data items distributed over a cluster of machines, maintained in a fault-tolerant way.
It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark’s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.
The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to a MapReduce implementation (as was common in Apache Hadoop stacks). Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.
Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety, including Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.
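The two RDD ideas above, lazy transformations recorded as lineage and an in-memory working set reusable across iterations, can be illustrated with a toy class. This is a conceptual sketch, not the Spark API: the class name and methods are invented, and a real RDD is partitioned across machines.

```python
# Toy illustration of RDD-style lazy lineage and caching.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # lineage: how to (re)build this dataset
        self._cache = None        # optional in-memory working set

    @classmethod
    def from_list(cls, items):
        data = tuple(items)       # read-only source data
        return cls(lambda: data)

    def map(self, fn):
        # Lazy: record the transformation in the lineage, don't run it yet.
        # If a partition were lost, it could be recomputed from here.
        return ToyRDD(lambda: tuple(fn(x) for x in self._compute()))

    def cache(self):
        self._cache = self._compute()   # materialise once, keep in memory
        return self

    def collect(self):
        return list(self._cache if self._cache is not None else self._compute())

nums = ToyRDD.from_list([1, 2, 3]).map(lambda x: x * x).cache()
print(nums.collect())  # [1, 4, 9]

# An iterative algorithm can now reuse the cached working set on every
# pass instead of re-reading (and re-mapping) from disk each time --
# the gap RDDs were designed to close over plain MapReduce.
for _ in range(3):
    total = sum(nums.collect())
print(total)  # 14
```

In real Spark the equivalent calls would be `sc.parallelize`, `rdd.map`, `rdd.cache`, and `rdd.collect`, with the dataset split into partitions across the cluster.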