The Web and Caching
Protocol: Hypertext Transfer Protocol (HTTP)
Server is stateless
HTTP Requests
Request line
Optional headers
HTTP Response
Status line
Early HTTP
v0.9/1.0
One request/response per TCP Connection
Advantage: simple to implement
Disadvantage: a new TCP connection for every request adds overhead and slows transfers
Solution to increase efficiency: persistent connections
Persistent Connections
Multiple HTTP requests/responses are multiplexed on a single TCP connection
Pipelining
Client sends the next request as soon as it encounters a referenced object
So fetching all referenced objects can begin after as little as one RTT
Persistent connections with pipelining was default in HTTP 1.1
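A minimal sketch of pipelining over one persistent connection (host, port, and paths are placeholder values): all requests are written back-to-back in a single batch, and the responses arrive in order on the same TCP connection.

```python
import socket

def build_pipelined_requests(host, paths):
    """Concatenate HTTP/1.1 GET requests so they can be sent back-to-back."""
    return b"".join(
        f"GET {p} HTTP/1.1\r\nHost: {host}\r\nConnection: keep-alive\r\n\r\n".encode()
        for p in paths
    )

def fetch_pipelined(host, paths, port=80):
    """One TCP connection; all requests written at once, responses read in order."""
    with socket.create_connection((host, port)) as s:
        s.sendall(build_pipelined_requests(host, paths))  # one write, no per-request RTT
        s.settimeout(2)
        chunks = []
        try:
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass
    return b"".join(chunks)
```

Note that HTTP/1.1 servers must return pipelined responses in request order; in practice many deployments avoid pipelining, which is one motivation for multiplexing in HTTP/2.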
Caching
Improves performance
Can occur in multiple places
Caches expire to ensure clients see the most recent version of a page
Ways to redirect clients to caches:
TCP throughput is inversely proportional to RTT, so placing caches closer to clients (lower RTT) improves throughput
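This is the standard TCP throughput approximation (Mathis et al.): throughput is bounded by roughly MSS / (RTT · √p) for loss rate p. A quick illustration with made-up numbers:

```python
from math import sqrt

def tcp_throughput_bound(mss_bytes, rtt_s, loss_rate):
    """Mathis approximation: throughput <= MSS / (RTT * sqrt(p)), in bytes/sec."""
    return mss_bytes / (rtt_s * sqrt(loss_rate))

# Halving the RTT (e.g., by serving from a nearby cache) doubles the bound:
far  = tcp_throughput_bound(1460, 0.100, 1e-4)  # 100 ms RTT to origin
near = tcp_throughput_bound(1460, 0.050, 1e-4)  #  50 ms RTT to a cache
```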
Benefits of caching
Reduced transit costs for ISP
Improved performance for clients
Web Content Distribution Networks (CDN) defined
Overlay network of web caches designed to deliver content to a client from optimal location
-Optimal can be geographically closest or something else.
CDNs often have geographically disparate servers
Purpose to place caches as close to users as possible
Owned by content providers (Google), networks (AT&T), or operated independently (Akamai)
Non-network CDNs typically place servers in other autonomous systems or ISPs
The number of cache nodes varies.
How server selection works in a CDN
Criteria for choosing a server (e.g., least-loaded server, lowest latency, geographic proximity)
Content Routing: How clients get redirected to the right server in CDN
How to direct clients
1-Routing systems (e.g., anycast): give all replicas the same IP address, then let routing carry each client to the closest replica
–simple, but operators have very little control since they are at the whims of Internet routing (simple but coarse)
2-Application-based (e.g., HTTP redirect): requires the client to first contact the origin server to get the redirect, increasing latency (simple but adds delay)
3-Naming system (e.g., DNS): the client looks up a domain name and the response contains the IP address of a nearby cache. Gives significant flexibility in directing different clients to different server replicas (fine-grained control and fast)
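A toy sketch of the DNS approach (replica IPs and coordinates are invented for illustration): the authoritative DNS server answers with the replica closest to the querying client.

```python
# Hypothetical replica set: IP -> (latitude, longitude)
REPLICAS = {
    "198.51.100.10": (37.77, -122.42),  # US-west
    "203.0.113.20":  (52.52,   13.40),  # Europe
}

def resolve(client_coord):
    """Return the replica IP closest to the client (squared Euclidean distance
    on lat/lon -- a toy proximity metric; real CDNs use network measurements)."""
    def dist(ip):
        lat, lon = REPLICAS[ip]
        return (lat - client_coord[0]) ** 2 + (lon - client_coord[1]) ** 2
    return min(REPLICAS, key=dist)
```

In practice the CDN's DNS server sees the client's resolver address (or the EDNS client subnet), not the client itself, and proximity is judged by latency and load rather than raw geography.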
CDNs and ISPs
CDNs and ISPs have symbiotic relationship when it comes to peering relationships
CDNs like peering directly with the ISPs where their customers are located because:
ISPs like peering with CDNs (or host caches locally) because:
BitTorrent
Peer-to-Peer CDN used for file sharing and distribution of large files
Fetch content from peers rather than origin
Break the original file into many pieces and replicate different pieces on different peers as soon as possible; each peer assembles the file by picking up different pieces and getting the remaining pieces from other peers, eventually assembling the entire file
BitTorrent Publishing
Leechers: clients with incomplete copies of the file.
Trackers: allow peers to find each other; a tracker returns a random list of peers that leechers can use to swap parts of the file
Freeloading (freeriding): a client leaves the network as soon as it finishes downloading the file
Solution to freeriding
BitTorrent addresses freeriding with choking
Choking (tit-for-tat): temporary refusal to upload chunks to another peer
-if you can’t download from a peer, don’t upload to it
(mitigates the freerider problem)
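A rough sketch of the tit-for-tat unchoke decision (peer names and rates are made up): a peer uploads only to the neighbors that have recently given it the best download rates.

```python
def choose_unchoked(upload_rates, k=4):
    """Return the k peers with the highest recent upload rate to us.
    These get our regular unchoke slots; everyone else stays choked."""
    return sorted(upload_rates, key=upload_rates.get, reverse=True)[:k]

# Hypothetical per-peer download rates observed over the last interval (KB/s):
rates = {"peer_a": 120, "peer_b": 5, "peer_c": 300, "peer_d": 0, "peer_e": 80}
unchoked = choose_unchoked(rates, k=3)  # peers we will upload to
```

Real BitTorrent additionally reserves one "optimistic unchoke" slot for a randomly chosen peer, so newcomers with nothing to trade yet can bootstrap.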
BitTorrent: getting chunks to swap
If all clients received the same chunks, no one would have a complete copy and clients wouldn’t swap
Rarest piece first: the client determines which pieces are rarest among its peers and downloads those first
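A minimal sketch of rarest-first selection (peer bitfields are made up): count how many peers hold each piece we still need, and request the least-replicated one.

```python
from collections import Counter

def rarest_first(needed, peer_bitfields):
    """needed: set of piece indices we lack.
    peer_bitfields: peer -> set of piece indices that peer has.
    Returns the needed piece held by the fewest peers (None if unavailable)."""
    counts = Counter()
    for pieces in peer_bitfields.values():
        for p in pieces & needed:
            counts[p] += 1
    return min(counts, key=counts.get) if counts else None

peers = {"p1": {0, 1, 2}, "p2": {1, 2}, "p3": {2}}
nxt = rarest_first({0, 1, 2}, peers)  # piece 0 is held by only one peer
```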
Distributed Hash Tables
Enable a form of content overlay called a structured overlay
Chord
Distributed hash table
Scalable, distributed lookup service (maps keys to values)
Scalable
Provable correctness
Reasonably good performance
Main motivation is scalable location of data in a large distributed system
Key problem is lookup: the hash table isn’t located in one place but is distributed across the network (hence Distributed Hash Table, DHT)
Consistent Hashing
keys and nodes map to same ID space
Create a metric space (like a ring). Nodes and keys map into the same ID space: a consistent hash function assigns every node and key an identifier in this space. Node IDs are the hash of the node’s IP address; key IDs are the hash of the key.
In Chord, a key is stored at its successor, the node with the next-highest ID.
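A minimal sketch of this mapping (toy 16-bit ID space for readability; Chord itself uses SHA-1’s full 160 bits):

```python
import hashlib

M = 2 ** 16  # toy ID space; Chord uses 2^160 with SHA-1

def ident(name):
    """Consistent hash: map a node address or key name into the ID space."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

def successor(key_id, node_ids):
    """First node ID at or after key_id on the ring, wrapping around."""
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return min(node_ids)  # wrap past the top of the ring

# Made-up node addresses; each node's ID is the hash of its IP
nodes = [ident(f"10.0.0.{i}") for i in range(1, 6)]
home = successor(ident("some-file.txt"), nodes)  # node responsible for the key
```

Because only the keys between a node and its predecessor move when a node joins or leaves, consistent hashing disturbs O(K/N) keys per membership change.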
Consistent hashing provides:
Implementing consistent hashing
Option 1: every node knows the location of every other node (fast lookup, but lots of state per node)
Option 2: each node only knows the location of its immediate successor (minimal state, but lookups take O(N) hops)
Option 3: finger tables: every node knows the location of about log N other nodes, and the distance of the nodes it knows increases exponentially
-finger i points to the successor of n + 2^i
-to look up an ID, find its predecessor and ask for that ID’s successor
finger tables include a mapping to the predecessor too; the query moves around the ring looking for the node whose successor’s ID is bigger than the ID of the data
-lookups require O(log n) hops
-O(log n) messages per lookup
-finger table size is O(log n) state per node
-when a node joins the network: initialize the new node’s fingers, update the fingers of existing nodes, and transfer keys from the successor to the new node
-when a node leaves: nodes keep track of their successor’s fingers, so a predecessor can reach the nodes corresponding to the failed/departed node’s fingers
Finger tables
(from wiki)
To avoid the linear search above, Chord implements a faster search method by requiring each node to keep a finger table containing up to m entries; recall that m is the number of bits in the hash key.
The i-th entry of node n will contain:
successor((n + 2^(i-1)) mod 2^m)
The first entry of finger table is actually the node’s immediate successor (and therefore an extra successor field is not needed). Every time a node wants to look up a key k, it will pass the query to the closest successor or predecessor (depending on the finger table) of k in its finger table (the “largest” one on the circle whose ID is smaller than k), until a node finds out the key is stored in its immediate successor.
With such a finger table, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
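Putting the pieces together, a toy lookup over finger tables might look like this (m = 6, so IDs live in [0, 64); the node IDs are chosen for illustration, and all nodes are known up front so the fingers can be built directly, but the lookup itself only follows fingers as described above):

```python
m = 6
SPACE = 2 ** m

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b], with wrap-around."""
    return a < x <= b if a < b else (x > a or x <= b)

def successor(i, nodes):
    """First node at or after position i on the ring."""
    cands = [n for n in sorted(nodes) if n >= i % SPACE]
    return cands[0] if cands else min(nodes)

def build_fingers(n, nodes):
    """finger[i] = successor(n + 2^i); finger distances grow exponentially."""
    return [successor(n + 2 ** i, nodes) for i in range(m)]

def lookup(start, key, fingers):
    """Route toward key by repeatedly jumping to the closest preceding finger."""
    n, hops = start, 0
    while not in_interval(key, n, fingers[n][0]):  # key past n's successor?
        nxt = n
        for f in reversed(fingers[n]):             # try the furthest finger first
            if in_interval(f, n, key):             # finger precedes the key
                nxt = f
                break
        if nxt == n:                               # no usable finger; stop
            break
        n, hops = nxt, hops + 1
    return fingers[n][0], hops                     # node storing key, hop count

nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
fingers = {n: build_fingers(n, nodes) for n in nodes}
owner, hops = lookup(8, 54, fingers)  # resolves to node 56 in a few hops
```

Each jump at least halves the remaining ring distance to the key, which is where the O(log N) hop bound comes from.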