What constraints do we let go of when entering the “NoSQL universe”?
(Expecting 4)
In NoSQL, “data denormalization” design covers which 2 concepts?
Heterogeneous data and nested data - together referred to broadly as “Denormalised Data”
Define “Heterogeneous Data”
Heterogeneous data does not satisfy domain integrity (it may not even have a schema), nor relational integrity.
Define “Nested Data”
Nested data is not in first normal form (violating atomic integrity) - for example, tables nested inside tables.
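The idea can be sketched in Python (the record and field names are made up):

```python
# A hypothetical customer record: the "orders" field holds a table
# inside a table, so the data is not in first normal form.
record = {
    "name": "Alice",
    "orders": [                      # a nested table: violates atomic integrity
        {"item": "book", "qty": 2},
        {"item": "pen",  "qty": 10},
    ],
}

# A relational (1NF) design would flatten this into two tables
# joined on a customer key:
customers = [{"id": 1, "name": "Alice"}]
orders = [
    {"customer_id": 1, "item": "book", "qty": 2},
    {"customer_id": 1, "item": "pen",  "qty": 10},
]
```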
What are the two main paradigms to store data?
“Traditional” Databases
e.g. PostgreSQL
E(xtract)
T(ransform)
L(oad)
Data Lakes
Read directly from a file system (in situ)
Stored “as is”
More convenient if you only want to read the data (“read-intensive”/OLAP)
e.g. using pandas in Python
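A minimal sketch of an in-situ read with pandas, using an in-memory buffer in place of a real file so the example is self-contained:

```python
import pandas as pd
from io import StringIO

# Data stored "as is" in CSV form; here simulated with an in-memory
# buffer standing in for a file on a file system.
csv_data = StringIO("id,value\n1,10\n2,20\n")

# Read the data directly - no Extract-Transform-Load step beforehand.
df = pd.read_csv(csv_data)
print(df["value"].sum())  # → 30
```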
How is data stored locally?
How does local storage scale?
1,000 to 1,000,000 files are fine on a laptop, but 1,000,000,000 will break it
NAS = network-attached storage
LAN = local-area network
WAN = wide-area network
What are ways we can make storage scale?
Scaling Up vs Scaling Out
Scaling Up - buy a bigger machine: more memory, more or faster CPU cores, a larger disk
Scaling Out - buy more, similar machines and share the work across them
With scaling out, the price increases linearly with capacity
A better option is to optimise your code, which should always be done first
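A back-of-the-envelope sketch of why scaling out becomes necessary (the dataset size is a made-up assumption; 30 TB is the upper end of the per-server local storage figures in these notes):

```python
# Back-of-the-envelope sketch (the dataset size is a hypothetical
# assumption; 30 TB is a typical upper bound for per-server storage).
dataset_tb = 500          # a dataset that no single disk can hold
disk_per_server_tb = 30

# Scaling up hits a wall: one machine cannot hold 500 TB.
# Scaling out: spread the data across similar machines.
servers_needed = -(-dataset_tb // disk_per_server_tb)  # ceiling division
print(servers_needed)  # → 17
```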
Data Centers (in numbers)
1,000-100,000 machines in a data center
1-200 cores per server
100,000 machines seems to be a hard limit, due to electricity consumption and cooling
Servers (in numbers)
How many cores?
How much local storage?
How much RAM?
Server = Node = Machine
1-64 cores per server
1-30 TB local storage per server
16 GB - 24 TB of RAM per server
Laptops typically have up to 24 cores
Networks (in numbers)
Network Bandwidth?
The network bandwidth goes from 1 to 200 Gbit/s (HPC allows for higher)
Bandwidth is the highest within the same cluster
Network speeds are quoted in bits per second, as opposed to the bytes typically used for storage sizes
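Converting the bandwidth figure above from bits to bytes, a quick sketch:

```python
# Network bandwidth is quoted in bits per second; storage in bytes.
# Converting the 200 Gbit/s figure above to gigabytes per second:
gbit_per_s = 200
gbyte_per_s = gbit_per_s / 8  # 8 bits per byte
print(gbyte_per_s)  # → 25.0
```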
How do we measure distance in data centers?
Rack Units
Rack servers are between 1 and 4 RUs tall
A cluster is just a room filled with racks put next to each other.
Edit: shouldn't this be the number of hops between nodes?
Describe Amazon S3
Simple Storage Service
Objects and Buckets - IDs (Can PUT, GET, DELETE)
An object can be at most 5 TB
It is only possible to upload an object in a single chunk if it is smaller than 5 GB; larger objects must be uploaded in multiple parts
By default users get 100 buckets
5 TB is roughly the size that typically fits on a single disk
Object just means file
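A sketch of the upload limits above as a decision helper (a hypothetical function, not the AWS SDK):

```python
# Hypothetical helper (not the AWS SDK): decide how an object can be
# uploaded to S3 given the size limits above.
MAX_OBJECT = 5 * 1024**4      # 5 TB: maximum object size
MAX_SINGLE_PUT = 5 * 1024**3  # 5 GB: maximum single-chunk upload

def upload_strategy(size_bytes: int) -> str:
    if size_bytes > MAX_OBJECT:
        return "rejected: objects are capped at 5 TB"
    if size_bytes > MAX_SINGLE_PUT:
        return "multipart upload required"
    return "single PUT is fine"

print(upload_strategy(100 * 1024**2))  # 100 MB → "single PUT is fine"
print(upload_strategy(50 * 1024**3))   # 50 GB → "multipart upload required"
```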
S3 Service Level Agreements (SLAs)
Durability - S3 loses less than 1 object in 100 billion per year (99.999999999%, “eleven nines”)
Availability - S3 will be available > 99.99% of the year (~1 h/year of downtime)
99% = < 4 days of downtime per year
99.9% = < 10 hrs
99.999% = about six minutes
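The downtime figures above can be recomputed from the availability percentages:

```python
# Allowed downtime per year for each availability level.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.3%}: {downtime_h:.2f} h/year")
# 99% ≈ 87.6 h (< 4 days), 99.9% ≈ 8.8 h, 99.99% ≈ 0.88 h (~53 min),
# 99.999% ≈ 0.09 h (~5 min)
```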
What is the CAP theorem?
Yet another impossibility triangle
Can’t have all three of Consistency, Availability, and Partition tolerance during a network partition
A system will be either AP or CP
CP - not available until the network is reconnected
AP - not consistent until reconnected (eventual consistency)
Being unavailable does not mean being partition-intolerant
RestAPI
REpresentational State Transfer
Queries are sent over HTTP(S) - a REST API supports integration with many host languages.
Generally successful response status codes are 200-299 and client error response status codes are 400-499
Requests have Method, URI, [Header], [Body]. Responses have Status Code, [Header], [Body]
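A sketch of that request/response anatomy in Python (host and path are the example values from this section):

```python
# The anatomy of an HTTP request: Method, URI, [Header], [Body].
method = "GET"
uri = "/api/collection/foobar?id=foobar"
headers = {"Host": "www.example.com", "Accept": "application/json"}
body = ""  # GET requests usually carry no body

request = f"{method} {uri} HTTP/1.1\r\n"
request += "".join(f"{k}: {v}\r\n" for k, v in headers.items())
request += "\r\n" + body

# A response leads with a status code: 2xx = success, 4xx = client error.
response_status = 200
print(request.splitlines()[0])
print(200 <= response_status <= 299)  # → True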
Deconstruct this URI/IRI
http://www.example.com/api/collection/foobar?id=foobar#head
* Scheme: http
* Host (authority): www.example.com
* Path: /api/collection/foobar
* Query: id=foobar
* Fragment: head
Describe the main HTTP methods
GET - retrieve a resource (safe, no side effects)
PUT - create or replace the resource at the given URI (idempotent)
POST - submit data to be processed, e.g. create a sub-resource (not idempotent)
DELETE - remove a resource (idempotent)
Which of the following most generally designates a relation that is transitive, reflexive, and antisymmetric?
* A total order
* A preorder
* An equivalence relation
* A (non-strict) partial order
Lecture Question - Antisymmetric means: if a ≠ b and a → b, then b ↛ a
A (non-strict) partial order
Partial Order is a DAG
Preorder is only Transitive and Reflexive (an antisymmetric Preorder is a Partial Order, a symmetric Preorder is an equivalence relation)
Total Order is a partial order in which any two elements are comparable
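A small Python check, using divisibility on a made-up set, that a relation can be a partial order (reflexive, transitive, antisymmetric) without being total:

```python
# Divisibility on {1, 2, 3, 6} is a (non-strict) partial order,
# but not a total order: 2 and 3 are incomparable.
S = [1, 2, 3, 6]
rel = lambda a, b: b % a == 0  # "a divides b"

reflexive = all(rel(a, a) for a in S)
transitive = all(not (rel(a, b) and rel(b, c)) or rel(a, c)
                 for a in S for b in S for c in S)
antisymmetric = all(not (rel(a, b) and rel(b, a)) or a == b
                    for a in S for b in S)
total = all(rel(a, b) or rel(b, a) for a in S for b in S)

print(reflexive, transitive, antisymmetric, total)  # → True True True False
```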
Azure Blob Storage
Object IDs are given by Account + Container + Blob
Object API - Block/Append/Page
Blocks are for data - e.g. datasets
Append is for logging
Page is for random read/write access - e.g. virtual machine disks
Limits
* 190.7 TB block
* 195 GB append
* 8 TB page
Blob = Binary Large OBject
What is a storage stamp?
(Azure)
10-20 racks × 18 storage nodes per rack (~30 PB)
Kept below 70-80% of storage capacity