Unit V CS1603/Distributed Systems
Napster was shut down as a result of legal proceedings instituted against the operators of
the Napster service by the owners of the copyright in some of the material (i.e., digitally encoded
music) that was made available on it. Anonymity for the receivers and the providers of shared data
and other resources is a concern for the designers of peer-to-peer systems. If files are also
encrypted before they are placed on servers, the owners of the servers can plausibly deny any
knowledge of the contents.
Limitations
Napster used a (replicated) unified index of all available music files. Unless the access path
to the data objects is distributed, object discovery and addressing are likely to become a bottleneck.
Application dependencies
Napster took advantage of the special characteristics of the application for which it was
designed:
⚫ Music files are never updated, avoiding any need to make sure all the replicas of files remain consistent after updates.
⚫ No guarantees are required concerning the availability of individual files – if a music file is temporarily unavailable, it can be downloaded later.
Note that above the P2P overlay is the application layer overlay, where communication
between peers is point-to-point (representing a logical all-to-all connectivity) once a connection is
established.
c. Local indexing requires each peer to index only its local data objects; remote objects must be searched for. This form of indexing is typically used in unstructured overlays in conjunction with flooding search or random walk search. Gnutella uses local indexing.
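Below is a minimal sketch, in Python, of such a flooding search; the Peer class, its local_index and neighbors attributes, and the TTL value are illustrative assumptions rather than Gnutella's actual protocol.

# A peer indexes only its own objects (local indexing) and knows its
# overlay neighbors; a query is flooded to all neighbors until the
# time-to-live (TTL) expires.
class Peer:
    def __init__(self, local_index):
        self.local_index = set(local_index)   # objects held locally
        self.neighbors = []                   # overlay neighbors

def flood_search(peer, query, ttl=4, seen=None):
    """Return the set of peers holding `query`, flooding up to `ttl` hops."""
    seen = set() if seen is None else seen
    if peer in seen or ttl < 0:
        return set()
    seen.add(peer)
    hits = {peer} if query in peer.local_index else set()
    for nbr in peer.neighbors:                # forward the query to every neighbor
        hits |= flood_search(nbr, query, ttl - 1, seen)
    return hits

# Example: object "y.mp3" held two hops away from peer a is found.
a, b, c = Peer(["x.mp3"]), Peer([]), Peer(["y.mp3"])
a.neighbors, b.neighbors = [b], [c]
print(c in flood_search(a, "y.mp3"))          # True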
Other Classification
Semantic index mechanism: A semantic index is human readable, for example, a document name, a keyword, or a database key. It supports keyword searches, range searches, and approximate searches, whereas these searches are not supported by semantic-free index mechanisms.
Semantic-free index mechanism: A semantic-free index is not human readable and typically
corresponds to the index obtained by a hash mechanism, e.g., the DHT schemes.
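As a small illustration (assumed, not from the text), a semantic-free index can be obtained by hashing a human-readable name into a DHT's identifier space; once reduced to an identifier, the name no longer supports keyword or range searches.

import hashlib

# SHA-1 yields a 160-bit digest, matching the 2^160 identifier space
# used by several DHT schemes.
def semantic_free_index(name):
    """Map a human-readable object name to a 160-bit DHT identifier."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

print(semantic_free_index("beethoven_symphony_5.mp3"))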
CHORD
For a query on key key at node i, if key lies between i and its successor, the key would reside at the successor, and the successor's address is returned. If key lies beyond the successor, then node i searches through the m entries in its finger table to identify the node j that most immediately precedes key among all the entries in the finger table. As j is the closest known node that precedes key, j is most likely to have the most information on locating key, i.e., on locating the immediate successor node to which key has been mapped.
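The following is a minimal sketch, in Python, of this finger-table lookup; the Node class, the identifier size M, and the in_interval helper are illustrative assumptions rather than part of the Chord specification above.

M = 7                              # identifiers are drawn from a 2^M ring

def in_interval(x, a, b):
    """True if x lies in the circular interval (a, b] on the 2^M ring."""
    if a < b:
        return a < x <= b
    return x > a or x <= b         # the interval wraps past 0

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self      # next node on the ring
        self.finger = [self] * M   # finger[k] = successor of (id + 2^k)

    def closest_preceding_finger(self, key):
        # Scan the finger table from farthest to nearest for the node j
        # that most immediately precedes key.
        for k in reversed(range(M)):
            f = self.finger[k]
            if f.id != key and in_interval(f.id, self.id, key):
                return f
        return self

    def find_successor(self, key):
        # If key lies between this node and its successor, the key resides
        # at the successor; otherwise forward towards the closest preceding
        # finger, which has the most information about key.
        if in_interval(key, self.id, self.successor.id):
            return self.successor
        j = self.closest_preceding_finger(key)
        if j is self:
            return self.successor
        return j.find_successor(key)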
CONTENT ADDRESSABLE NETWORKS (CAN)
The entire space is partitioned dynamically among all the nodes present, so that each node i is assigned a disjoint region r(i) of the space. As nodes arrive, depart, or fail, both the set of participating nodes and the assignment of regions to nodes change.
The three core components of a CAN design are the following:
A. Setting up the CAN virtual coordinate space, and partitioning it among the nodes as they join
the CAN.
B. Routing in the virtual coordinate space to locate the node that is assigned the region containing the point p to which a key is mapped.
C. Maintaining the CAN due to node departures and failures.
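As an illustration of component A, here is a minimal sketch, under assumed names (Zone, split_zone), of how a joining node might acquire a region: it picks a random point P, the join request is routed to the node whose zone contains P, and that zone is split in half, with one half handed to the joiner. Real CAN splits dimensions in a fixed order; the longest dimension is split here only for simplicity.

import random

class Zone:
    def __init__(self, lo, hi):
        self.lo = list(lo)         # lower corner, e.g. [0.0, 0.0]
        self.hi = list(hi)         # upper corner, e.g. [1.0, 1.0]

    def contains(self, p):
        return all(l <= x < h for x, l, h in zip(p, self.lo, self.hi))

def split_zone(zone):
    """Split zone in half; return (half kept by the old owner,
    half handed over to the joining node)."""
    d = max(range(len(zone.lo)), key=lambda i: zone.hi[i] - zone.lo[i])
    mid = (zone.lo[d] + zone.hi[d]) / 2.0
    kept = Zone(zone.lo, zone.hi[:d] + [mid] + zone.hi[d + 1:])
    given = Zone(zone.lo[:d] + [mid] + zone.lo[d + 1:], zone.hi)
    return kept, given

# Example: the joiner targets a random point P in the unit square; the
# owner of the zone containing P gives up half of that zone.
P = [random.random(), random.random()]
kept, given = split_zone(Zone([0.0, 0.0], [1.0, 1.0]))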
B. CAN routing
Each node stores the IP address and coordinate zone of adjoining, or neighboring, nodes
⚫ This data makes up the node’s routing table
⚫ Greedy algorithm
A uniform hash function is used to map the key to a point P.
if P is within the zone of the current node,
    return (key, value)
else
    forward the query to the neighbor with coordinates closest to P
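A minimal Python sketch of this greedy routing follows; the Node class (with a zone, a local store, and a neighbor list) and the hash_to_point helper are illustrative assumptions, not part of CAN's specification.

import hashlib

def hash_to_point(key, dims=2):
    """Uniformly map a key to a point in the unit d-dimensional space."""
    digest = hashlib.sha1(key.encode()).digest()
    return [digest[i] / 255.0 for i in range(dims)]

def distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi          # corners of this node's zone
        self.neighbors = []                # adjoining nodes
        self.store = {}                    # locally held (key, value) pairs

    def contains(self, p):
        return all(l <= x < h for x, l, h in zip(p, self.lo, self.hi))

    def centre(self):
        return [(l + h) / 2.0 for l, h in zip(self.lo, self.hi)]

def lookup(start, key):
    """Route greedily from start towards the node whose zone contains
    the point P = hash(key), then return the stored value (if any)."""
    P = hash_to_point(key)
    node = start
    while not node.contains(P):
        # Forward to the neighbor whose zone centre is closest to P; a
        # real CAN also has fallbacks for when greedy progress stalls.
        node = min(node.neighbors, key=lambda n: distance(n.centre(), P))
    return node.store.get(key)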
C. CAN maintenance
When a node voluntarily departs from CAN, it hands over its region and the associated
database of (key, value) tuples to one of its neighbors. The neighbor is chosen as follows:
If the node’s region can be merged with that of one of its neighbors to form a valid
convex region, then such a neighbor is chosen.
Otherwise the node’s region is handed over to the neighbor whose region has the
smallest volume or load – the regions are not merged and the neighbor handles both
zones temporarily until a periodic background region reassignment process runs to
integrate the regions and prevent further fragmentation.
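A minimal sketch of this neighbor choice, under an assumed zone representation of (lo, hi) corner lists, is given below; the can_merge test and the (neighbor, zone) pairing are illustrative assumptions.

def volume(zone):
    lo, hi = zone
    v = 1.0
    for l, h in zip(lo, hi):
        v *= h - l
    return v

def can_merge(a, b):
    """True if zones a and b form a single valid (hyper-rectangular)
    region when merged: they agree in every dimension except one, and
    are adjacent along that one."""
    (lo_a, hi_a), (lo_b, hi_b) = a, b
    diff = [d for d in range(len(lo_a))
            if lo_a[d] != lo_b[d] or hi_a[d] != hi_b[d]]
    if len(diff) != 1:
        return False
    d = diff[0]
    return hi_a[d] == lo_b[d] or hi_b[d] == lo_a[d]

def choose_takeover(my_zone, neighbors):
    """neighbors is a list of (neighbor_id, zone) pairs."""
    # Prefer a neighbor whose zone merges with ours into a valid region.
    for nbr, zone in neighbors:
        if can_merge(my_zone, zone):
            return nbr
    # Otherwise hand both zones to the smallest-volume neighbor; periodic
    # background reassignment later merges the fragments.
    return min(neighbors, key=lambda nz: volume(nz[1]))[0]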
CAN optimizations
1. Multiple Dimensions
⚫ Increase number of dimensions
⚫ Reduce average path length
⚫ Reduce path latency
⚫ Increases routing table size due to greater number of neighbors
2. Multiple Realities
⚫ Increase number of Realities
⚫ Multiple coordinate spaces exist at the same time, each space is called a reality
⚫ Each node is assigned a different zone in each reality
⚫ Shorter paths, higher fault-tolerance
⚫ With three realities, a (key, value) pair mapping to point P at (x, y, z) may be stored at three different nodes
3. Multiple Hash Functions
⚫ Multiple hash functions increase data availability and reduce query latency
⚫ Improve data availability by mapping a single key to k points in the coordinate space using k different hash functions (see the sketch after this list)
⚫ A (key, value) pair becomes unavailable only when all k replica nodes crash
4. Overload Coordinate Zones
⚫ Overload the coordinate zones by assigning more than one node to share the same zone
⚫ Reduces the average path length and improves fault-tolerance
⚫ No additional neighbors
5. Delay Latency (RTT Ratio)
⚫ Take the round-trip time (RTT) to neighbors into account when routing
⚫ Each node measures the RTT to each of its neighbors
⚫ Lower-latency paths are favored when forwarding a query
6. Topologically sensitive overlay
The CAN overlay described so far has no correlation to the physical proximity or to the IP
addresses of domains. Logical neighbors in the overlay may be geographically far apart, and
logically distant nodes may be physical neighbors. By constructing an overlay that accounts for
physical proximity in determining logical neighbors, the average query latency can be significantly
reduced.
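The following is a minimal sketch of the multiple hash functions optimization (item 3 above); deriving the k hash functions by salting a single base hash is an assumption for illustration, not prescribed by CAN.

import hashlib

def hash_to_point(key, salt, dims=2):
    """Map key to a point in the unit d-dimensional space; each value of
    salt behaves like an independent hash function."""
    digest = hashlib.sha1(f"{salt}:{key}".encode()).digest()
    return [digest[i] / 255.0 for i in range(dims)]

def replica_points(key, k=3, dims=2):
    """The k points at which the (key, value) pair would be stored."""
    return [hash_to_point(key, salt, dims) for salt in range(k)]

# A query can be sent towards all k points in parallel and the first
# reply used, reducing latency; the pair is lost only if all k replica
# nodes fail.
print(replica_points("song.mp3"))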
CAN complexity
The time overhead for a new joiner is O(d) for updating the new neighbors in the CAN, and O((d/4) · n^(1/d)) for routing to the appropriate location in the coordinate space. This is also the overhead in terms of the number of messages.
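As an illustrative calculation (not from the text): with n = 2^20 nodes and d = 10 dimensions, the average routing path is roughly (10/4) · (2^20)^(1/10) = 2.5 · 4 = 10 hops, while each node maintains only 2d = 20 neighbors.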