Peer-to-Peer
15-441
P2P System
Why P2P?
• Harness lots of spare capacity
  – 1 Big Fast Server: 1 Gbit/s, $10k/month++
  – 2,000 cable modems: 1 Gbit/s, $??
  – 1M end-hosts: Uh, wow.
• Build self-managing systems / Deal with huge scale
  – Same techniques attractive for both companies / servers / p2p
• Leverage the resources of client machines (peers)
  – Computation, storage, bandwidth
  – E.g., Akamai's 14,000 nodes
  – Google's 100,000+ nodes
Outline
• p2p file sharing techniques
  – Downloading: Whole-file vs. chunks
  – Searching
    • Centralized index (Napster, etc.)
    • Flooding (Gnutella, etc.)
    • Smarter flooding (KaZaA, …)
    • Routing (Freenet, etc.)
• Uses of p2p - what works well, what doesn't?
  – servers vs. arbitrary nodes
  – Hard state (backups!) vs soft-state (caches)
• Challenges
  – Fairness, freeloading, security, …

P2P file-sharing
• Quickly grown in popularity
  – Dozens or hundreds of file sharing applications
  – 35 million American adults use P2P networks -- 29% of all Internet users in US!
  – Audio/Video transfer now dominates traffic on the Internet
Searching
• Needles vs. Haystacks
  – Searching for top 40, or an obscure punk track from 1981 that nobody's heard of?
• Search expressiveness
  – Whole word? Regular expressions? File names? Attributes? Whole-text search?
    • (e.g., p2p gnutella or p2p google?)

Framework
• Common Primitives:
  – Join: how do I begin participating?
  – Publish: how do I advertise my file?
  – Search: how do I find a file?
  – Fetch: how do I retrieve a file?
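These four primitives recur throughout the lecture. As a minimal sketch (Python; the class and method names are ours, not from any real client), they can be written as an abstract interface that each system below fills in very differently:

from abc import ABC, abstractmethod

class P2PNode(ABC):
    """Hypothetical interface for the four common primitives."""

    @abstractmethod
    def join(self, bootstrap_addr):
        """Begin participating: contact a central server or a few known peers."""

    @abstractmethod
    def publish(self, filename, data):
        """Advertise (and possibly place) a file in the system."""

    @abstractmethod
    def search(self, query):
        """Return peers believed to hold matching files."""

    @abstractmethod
    def fetch(self, filename, peer_addr):
        """Retrieve the file, usually directly from the peer."""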
Napster: Overview
• Centralized Database:
  – Join: on startup, client contacts central server
  – Publish: reports list of files to central server
  – Search: query the server => return someone that stores the requested file
  – Fetch: get the file directly from peer

Napster: Publish
[Figure: peer 123.2.21.23 announces "I have X, Y, and Z!" and the central server records insert(X, 123.2.21.23), ...]
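The central database is essentially a dictionary mapping file names to the peers that advertised them. A toy sketch (Python, no networking; names are hypothetical) of the server-side bookkeeping:

from collections import defaultdict

class NapsterStyleServer:
    """Toy central index: filename -> set of peer addresses."""

    def __init__(self):
        self.index = defaultdict(set)

    def publish(self, peer_addr, filenames):
        # A joining peer reports everything it is sharing.
        for name in filenames:
            self.index[name].add(peer_addr)

    def search(self, filename):
        # Return peers that claim to store the file; the client
        # then fetches directly from one of them.
        return list(self.index.get(filename, []))

# Usage sketch, mirroring the figure above
server = NapsterStyleServer()
server.publish("123.2.21.23", ["X", "Y", "Z"])
print(server.search("X"))   # ['123.2.21.23']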
Next Topic...
• Centralized Database
  – Napster
• Query Flooding
  – Gnutella
• Intelligent Query Flooding
  – KaZaA
• Swarming
  – BitTorrent
• Unstructured Overlay Routing
  – Freenet
• Structured Overlay Routing
  – Distributed Hash Tables

Gnutella: History
• In 2000, J. Frankel and T. Pepper from Nullsoft released Gnutella
• Soon many other clients: Bearshare, Morpheus, LimeWire, etc.
• In 2001, many protocol enhancements including "ultrapeers"
Gnutella: Discussion
• Pros:
  – Fully de-centralized
  – Search cost distributed
  – Processing @ each node permits powerful search semantics
• Cons:
  – Search scope is O(N)
  – Search time is O(???)
  – Nodes leave often, network unstable
• TTL-limited search works well for haystacks (see the flooding sketch below).
  – For scalability, does NOT search every node. May …

KaZaA: History
• In 2001, KaZaA created by Dutch company Kazaa BV
• Single network called FastTrack used by other clients as well: Morpheus, giFT, etc.
• Eventually protocol changed so other clients could no longer talk to it
• Most popular file sharing network today with >10 million users (number varies)
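Referring back to the Gnutella discussion above, a rough sketch of TTL-limited flooding (Python; an in-memory graph stands in for real Gnutella messages) shows both why it finds "haystack" content quickly and why worst-case search cost is O(N):

class Node:
    """Toy Gnutella peer: an id, an address, a set of shared files,
    and a list of directly connected neighbor Nodes."""
    def __init__(self, node_id, addr, files, neighbors=None):
        self.id = node_id
        self.addr = addr
        self.files = set(files)
        self.neighbors = neighbors or []

def flood_search(node, query, ttl, visited=None):
    """Forward the query to every neighbor, decrementing TTL each hop.
    Returns the addresses of peers whose shared files match the query."""
    if visited is None:
        visited = set()
    if node.id in visited:
        return []
    visited.add(node.id)

    hits = [node.addr] if query in node.files else []
    if ttl > 0:
        for neighbor in node.neighbors:
            hits += flood_search(neighbor, query, ttl - 1, visited)
    return hits

# Usage sketch: a tiny 4-node overlay
d = Node(4, "10.0.0.4", ["songA"])
c = Node(3, "10.0.0.3", ["songB"], [d])
b = Node(2, "10.0.0.2", [], [c])
a = Node(1, "10.0.0.1", [], [b, c])
print(flood_search(a, "songA", ttl=3))   # ['10.0.0.4']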
KaZaA: File Insert
[Figure: peer 123.2.21.23 publishes insert(X, 123.2.21.23), ... to its supernode, which adds the entry to its index]

KaZaA: File Search
[Figure: search(A) is forwarded among supernodes; query replies return peer addresses such as 123.2.22.50 and 123.2.0.18]
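A toy sketch of the supernode's role in these two figures (Python; names are hypothetical): weak peers publish their file lists to the supernode, which then answers searches from its local index instead of bothering every low-bandwidth sub-node:

class Supernode:
    """Toy KaZaA-style supernode: keeps its sub-nodes' file lists locally."""

    def __init__(self):
        self.index = {}                       # filename -> set of peer addresses

    def register(self, peer_addr, filenames):
        # A sub-node publishes its file list when it attaches to us.
        for name in filenames:
            self.index.setdefault(name, set()).add(peer_addr)

    def query(self, filename):
        # Answer locally; the real protocol also forwards unanswered
        # queries to neighboring supernodes.
        return sorted(self.index.get(filename, set()))

# Usage sketch, mirroring the figures above
sn = Supernode()
sn.register("123.2.21.23", ["X"])
sn.register("123.2.22.50", ["A"])
print(sn.query("A"))    # ['123.2.22.50']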
Stability and Superpeers
• Why superpeers?
  – Query consolidation
    • Many connected nodes may have only a few files
    • Propagating a query to a sub-node would take more b/w than answering it yourself
  – Caching effect
    • Requires network stability
• Superpeer selection is time-based
  – How long you've been on is a good predictor of how long you'll be around.

BitTorrent: History
• In 2002, B. Cohen debuted BitTorrent
• Key Motivation:
  – Popularity exhibits temporal locality (Flash Crowds)
  – E.g., Slashdot effect, CNN on 9/11, new movie/game release
• Focused on Efficient Fetching, not Searching:
  – Distribute the same file to all peers
  – Single publisher, multiple downloaders
• Has some "real" publishers:
  – Blizzard Entertainment using it to distribute the beta of their new game
BitTorrent: Fetch

BitTorrent: Sharing Strategy
• Employ "Tit-for-tat" sharing strategy
  – A is downloading from some other people
    • A will let the fastest N of those download from him
  – Be optimistic: occasionally let freeloaders download
    • Otherwise no one would ever start!
    • Also allows you to discover better peers to download from when they reciprocate
• Goal: Pareto Efficiency
  – Game Theory: "No change can make anyone better off without making others worse off"
  – Does it work? (don't know!)
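A minimal sketch of the tit-for-tat choking decision (Python; the slot count and rates are made up, and real clients recompute this every few seconds over rolling rate estimates):

import random

def choose_unchoked(peers, upload_slots=4):
    """Tit-for-tat: unchoke the peers that have recently uploaded to us
    the fastest, plus one random 'optimistic unchoke' so newcomers and
    freeloaders get a chance to prove themselves."""
    by_rate = sorted(peers, key=lambda p: p["rate_from_them"], reverse=True)
    unchoked = by_rate[:upload_slots]

    others = [p for p in peers if p not in unchoked]
    if others:
        unchoked.append(random.choice(others))   # optimistic unchoke
    return unchoked

# Usage sketch
peers = [{"addr": f"peer{i}", "rate_from_them": r}
         for i, r in enumerate([50, 10, 0, 80, 5, 120])]
for p in choose_unchoked(peers):
    print(p["addr"])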
• Cons (of BitTorrent):
  – Pareto Efficiency relatively weak condition
  – Central tracker server needed to bootstrap swarm
  – (Tracker is a design choice, not a requirement. Could easily combine with other approaches.)

Next Topic...
• Swarming
  – BitTorrent
• Unstructured Overlay Routing
  – Freenet
• Structured Overlay Routing
  – Distributed Hash Tables (DHT)
Distributed Hash Tables
• Academic answer to p2p
• Goals
  – Guaranteed lookup success
  – Provable bounds on search time
  – Provable scalability
• Makes some things harder
  – Fuzzy queries / full-text search / etc.
• Read-write, not read-only
• Hot Topic in networking since introduction in ~2000/2001

DHT: Chord Summary
• Routing table size?
  – Log N fingers
• Routing time?
  – Each hop is expected to halve the distance to the desired id => expect O(log N) hops.
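A hedged sketch of finger-table routing (Python; a toy 2^M identifier space, fully populated tables, no node failures): each hop moves to the farthest finger that does not pass the key, which is why the remaining distance roughly halves per hop.

M = 6                                  # toy id space: identifiers in [0, 2**M)

def in_interval(x, start, end):
    """True if x lies in the circular interval (start, end]."""
    if start < end:
        return start < x <= end
    return x > start or x <= end       # interval wraps around 0

def lookup(start_node, key, nodes):
    """Follow fingers toward `key`. `nodes` maps a node id to its successor
    ('succ') and its finger ids ('fingers', in order of increasing offset).
    Returns (node responsible for key, hop count)."""
    node, hops = start_node, 0
    while not in_interval(key, node, nodes[node]["succ"]):
        nxt = nodes[node]["succ"]          # fallback: at least step to successor
        for f in nodes[node]["fingers"]:
            if in_interval(f, node, key):
                nxt = f                    # keep the closest finger preceding key
        node, hops = nxt, hops + 1
    return nodes[node]["succ"], hops

# Usage sketch: 8 nodes evenly spaced on a 64-id ring
ring = [0, 8, 16, 24, 32, 40, 48, 56]

def successor(x):
    return min((n for n in ring if n >= x % 2**M), default=ring[0])

nodes = {n: {"succ": successor(n + 1),
             "fingers": [successor(n + 2**i) for i in range(M)]}
         for n in ring}

print(lookup(0, 57, nodes))    # (0, 3): reaches the owner in ~log2(8) hops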
A Peer-to-peer Google?
• Complex intersection queries ("the" + "who")
  – Billions of hits for each term alone
• Sophisticated ranking
  – Must compare many results before returning a subset to user
• Very, very hard for a DHT / p2p system
  – Need high inter-node bandwidth
  – (This is exactly what Google does - massive clusters)

Writable, persistent p2p
• Do you trust your data to 100,000 monkeys?
• Node availability hurts
  – Ex: Store 5 copies of data on different nodes
  – When someone goes away, you must replicate the data they held
  – Hard drives are *huge*, but cable modem upload bandwidth is tiny - perhaps 10 Gbytes/day
  – Takes many days to upload contents of 200GB hard drive. Very expensive leave/replication situation!
P2P: Summary
• Many different styles; remember pros and cons of each
  – centralized, flooding, swarming, unstructured and structured routing
• Lessons learned:
  – Single points of failure are very bad
  – Flooding messages to everyone is bad
  – Underlying network topology is important
  – Not all nodes are equal
  – Need incentives to discourage freeloading
  – Privacy and security are important
  – Structure can provide theoretical bounds and guarantees

Extra Slides
KaZaA: Usage Patterns
• KaZaA is more than one workload!
  – Many files < 10MB (e.g., Audio Files)
  – Many files > 100MB (e.g., Movies)
(from Gummadi et al., SOSP 2003)

KaZaA: Usage Patterns (2)
• KaZaA is not Zipf!
  – FileSharing: "Request-once"
  – Web: "Request-repeatedly"
(from Gummadi et al., SOSP 2003)
Freenet: Overview
• Routed Queries:
  – Join: on startup, client contacts a few other nodes it knows about; gets a unique node id
  – Publish: route file contents toward the file id. File is stored at node with id closest to file id
  – Search: route query for file id toward the closest node id
  – Fetch: when query reaches a node containing file id, it returns the file to the sender

Freenet: Routing Tables
• id – file identifier (e.g., hash of file)
• next_hop – another node that stores the file id
• file – file identified by id being stored on the local node
• Each routing table is a list of (id, next_hop, file) entries
• Forwarding of query for file id
  – If file id stored locally, then stop
    • Forward data back to upstream requestor
  – If not, search for the "closest" id in the table, and forward the message to the corresponding next_hop
  – If data is not found, failure is reported back
    • Requestor then tries next closest match in routing table
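A hedged sketch of this forwarding rule (Python; attribute names are ours, and real Freenet also enforces hops-to-live limits and anonymizes the reply path):

class FreenetNode:
    """Toy Freenet node with hypothetical attribute names."""
    def __init__(self, node_id):
        self.id = node_id
        self.store = {}      # file id -> file contents held locally
        self.table = {}      # file id -> neighbor FreenetNode (next_hop)

def route_query(node, file_id, visited=None):
    """Try the closest known id first, backtracking to the next closest
    on failure. Caches the file on the reverse path; returns None on failure."""
    if visited is None:
        visited = set()
    if node.id in visited:
        return None
    visited.add(node.id)

    if file_id in node.store:                       # stored locally: stop
        return node.store[file_id]

    # Try routing-table entries in order of how close their id is to file_id.
    for entry_id, next_hop in sorted(node.table.items(),
                                     key=lambda kv: abs(kv[0] - file_id)):
        data = route_query(next_hop, file_id, visited)
        if data is not None:
            node.store[file_id] = data              # cache on reverse path
            return data
    return None                                     # failure reported upstream

# Usage sketch
a, b = FreenetNode(10), FreenetNode(90)
b.store[87] = b"contents of file 87"
a.table[88] = b                        # a knows b is good for ids near 88
print(route_query(a, 87))              # b'contents of file 87', now cached at a too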
Freenet: Anonymity & Security
• Anonymity
  – Randomly modify source of packet as it traverses the network
  – Can use "mix-nets" or onion-routing
• Security & Censorship resistance
  – No constraints on how to choose ids for files => easy to have two files collide, creating "denial of service" (censorship)
  – Solution: have an id type that requires a private key signature that is verified when updating the file (see the signed-id sketch below)
  – Cache file on the reverse path of queries/publications => attempt to "replace" file with bogus data will just cause the file to be replicated more!

Freenet: Discussion
• Pros:
  – Intelligent routing makes queries relatively short
  – Search scope small (only nodes along search path involved); no flooding
  – Anonymity properties may give you "plausible deniability"
• Cons:
  – Still no provable guarantees!
  – Anonymity features make it hard to measure, debug
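Returning to the signed-id defence from the Anonymity & Security slide, a minimal sketch of the idea (Python; verify_signature is a stand-in for a real public-key signature check, not a specific Freenet API):

import hashlib

def make_signed_id(public_key_bytes):
    # The file id is derived from the owner's public key, so only the
    # holder of the matching private key can authorize updates to it.
    return hashlib.sha256(public_key_bytes).hexdigest()

def accept_update(file_id, new_contents, signature, public_key_bytes,
                  verify_signature):
    """Return True only if the update is allowed for this id."""
    if make_signed_id(public_key_bytes) != file_id:
        return False                     # id does not belong to this key
    return verify_signature(public_key_bytes, new_contents, signature)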