Distributed File Systems
Function                            Description
open()                              Set mode: read, write & lock, change ACL, event list, lock-delay, create
close()                             Close an open handle
GetContentsAndStat()                Read file contents & metadata
SetContents(), SetACL()             Write file contents or ACL
Delete()                            Delete the file or directory
Acquire(), TryAcquire(), Release()  Lock operations
GetSequencer()                      Get a sequence # for a lock
SetSequencer()                      Associate a sequencer with a file handle
CheckSequencer()                    Check whether a sequencer is still valid
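The lock-plus-sequencer pattern in this table is worth seeing end to end. Chubby itself is Google-internal, so the toy class below is purely illustrative: a client acquires the lock, attaches the sequencer to its requests, and a server rejects any request whose sequencer is no longer valid.

```python
import itertools

class FakeChubbyCell:
    """Toy stand-in for a Chubby cell, illustrating the
    Acquire/GetSequencer/CheckSequencer pattern only."""
    def __init__(self):
        self._gen = itertools.count(1)
        self._holder = {}                     # lock path -> current generation

    def acquire(self, path):
        self._holder[path] = next(self._gen)  # new generation per acquisition

    def get_sequencer(self, path):
        return (path, self._holder[path])

    def release(self, path):
        self._holder.pop(path, None)

    def check_sequencer(self, sequencer):
        path, num = sequencer
        return self._holder.get(path) == num

cell = FakeChubbyCell()
cell.acquire("/ls/cell/my-lock")
seq = cell.get_sequencer("/ls/cell/my-lock")
assert cell.check_sequencer(seq)        # server validates a tagged request
cell.release("/ls/cell/my-lock")
assert not cell.check_sequencer(seq)    # stale sequencer is now rejected
```

This is why sequencers exist: a client that loses its lock (e.g., after a pause) cannot have its stale requests honored by servers.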
CHUBBY: LOCKS
At the client
Data is cached in memory by Chubby clients
Cache is maintained by a Chubby lease, which can be invalidated
All clients write through to the Chubby master
At the master
Writes are propagated via Paxos consensus to all Chubby replicas
Data updated in total order – replicas remain synchronized
The master replies to a client after the write reaches a majority of replicas
Cache invalidations
Master keeps a list of what each client may be caching
Invalidations are sent by the master and acknowledged by the client
The file is then cacheable again
Chubby database is backed up to GFS every few hours
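A minimal sketch of that invalidation bookkeeping, with toy classes standing in for the master and clients (none of this is Chubby's real code): the master remembers who may be caching each file and invalidates those copies before a write completes.

```python
from collections import defaultdict

class ToyClient:
    """Toy client-side cache (illustrative only)."""
    def __init__(self):
        self.cache = {}

    def invalidate(self, path):
        self.cache.pop(path, None)

class ToyMaster:
    """Toy bookkeeping for Chubby-style invalidations."""
    def __init__(self):
        self.data = {}
        self.cachers = defaultdict(set)   # file -> clients that may cache it

    def read(self, client, path):
        value = self.data.get(path)
        client.cache[path] = value        # client now caches the file...
        self.cachers[path].add(client)    # ...and the master remembers that
        return value

    def write(self, client, path, value):
        # Invalidate every cached copy (acks assumed synchronous here)
        # before the write is visible; the file is then cacheable again.
        for c in self.cachers.pop(path, set()):
            c.invalidate(path)
        self.data[path] = value
```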
APACHE ZOOKEEPER
Similar to Chubby
Different naming (files are called "znodes")
Writes via the leader
Reads from any replica
Only 4 events/watches: node created, node deleted, data changed, children changed
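As a concrete illustration, here is a minimal watch example using the third-party kazoo Python client, assuming a ZooKeeper server is reachable at 127.0.0.1:2181. The four event types above surface as `event.type` in the callback; note ZooKeeper watches are one-shot and must be re-registered.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/config")          # create the znode if it doesn't exist

def on_change(event):
    # event.type is one of: CREATED, DELETED, CHANGED, CHILD
    print("watch fired:", event.type, event.path)

data, stat = zk.get("/config", watch=on_change)   # one-shot watch
zk.set("/config", b"v2")           # fires the CHANGED watch above
zk.stop()
```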
NEED OF PARALLEL FILE SYSTEMS
Client-server file systems
Central servers: point of congestion, single point of failure
Alleviate somewhat with replication and client caching
E.g., Coda; tokens (aka leases, oplocks)
Limited replication can lead to congestion
File access:
Most files are appended, not overwritten
Random writes within a file are almost never done
Once created, files are mostly read; often sequentially
Workload is mostly:
Reads: large streaming reads, small random reads – these dominate
Large appends
Hundreds of processes may append to a file concurrently
Replicate data
Expect some servers to be down at any time
Store copies of each data block on multiple servers (see the placement sketch below)
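A toy placement policy illustrating the idea (real GFS placement also weighs rack topology, disk utilization, and load; all names here are made up):

```python
import random

def place_replicas(block_id, servers, r=3):
    """Pick r distinct servers for one block, deterministically per block."""
    rng = random.Random(block_id)
    return rng.sample(servers, r)

servers = [f"chunkserver-{i}" for i in range(8)]
placement = {b: place_replicas(b, servers) for b in range(4)}
# With 3 copies per block, any single-server failure leaves 2 live replicas.
```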
FILE SYSTEM INTERFACE
Familiar operations (create, delete, open, close, read, write) plus snapshot and atomic record append
GFS cluster
Multiple chunkservers
Data storage: fixed-size chunks
Chunks replicated on several systems
One master
Stores file system metadata (names, attributes)
Maps files to chunks
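The master's metadata boils down to two small in-memory maps. This toy sketch uses invented names and values; the chunk-to-location map is rebuilt from chunkserver reports rather than persisted (see the operation-log notes below).

```python
# Toy versions of the two metadata maps the master keeps in memory.
namespace = {
    "/logs/web.0": ["chunk-001", "chunk-002"],   # file -> 64 MB chunk handles
}
chunk_locations = {                               # rebuilt from chunkserver
    "chunk-001": ["cs-3", "cs-7", "cs-9"],        # reports, not persisted
    "chunk-002": ["cs-1", "cs-3", "cs-5"],
}

def lookup(path, chunk_index):
    """What the master returns for a read: chunk handle + replica list."""
    handle = namespace[path][chunk_index]
    return handle, chunk_locations[handle]
```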
GFS MASTER & CHUNKSERVERS
[Figure: GFS cluster]
GFS FILES
CORE PART OF GOOGLE CLUSTER ENVIRONMENT
Core services: GFS + cluster scheduling system
Typically 100s to 1000s of active jobs
200+ clusters, many with 1000s of machines
Pools of 1000s of clients
4+ PB filesystems, 40 GB/s read/write loads
CHUNKS AND CHUNKSERVERS
Large chunk size makes it feasible to keep a TCP connection open to a chunkserver for an extended time
Master stores <64 bytes of metadata for each 64 MB chunk
GFS MASTER
Manages
Chunk leases (locks)
Garbage collection (freeing unused chunks)
Chunk migration (copying/moving chunks)
Fault tolerance
Operation log replicated on multiple machines
A new master can be started if the master fails
OPERATION LOGS
Similar to a journal: all operations are logged
Periodic checkpoints (stored in a B-tree) to avoid playing back the entire log
Master does not store chunk locations persistently
These are queried from all the chunkservers: avoids consistency problems
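A toy sketch of log-plus-checkpoint recovery (invented names, not GFS code): mutations are appended to the log before being applied, and a new master restores the latest checkpoint and replays only the log tail instead of the whole history.

```python
class ToyGFSMaster:
    """Sketch of journaling with periodic checkpoints."""
    CHECKPOINT_EVERY = 1000

    def __init__(self):
        self.state = {}
        self.log = []                 # replicated to remote machines in GFS
        self.checkpoint = ({}, 0)     # (state snapshot, log index covered)

    def apply(self, op):
        key, value = op
        self.log.append(op)           # log first, then mutate state
        self.state[key] = value
        if len(self.log) % self.CHECKPOINT_EVERY == 0:
            self.checkpoint = (dict(self.state), len(self.log))

    def recover(self):
        """Restore the checkpoint, then replay only the log tail."""
        snapshot, upto = self.checkpoint
        state = dict(snapshot)
        for key, value in self.log[upto:]:
            state[key] = value
        return state
```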
READ
[Figure: GFS read path: the client asks the master which chunk holds its offset, then reads the data directly from a chunkserver replica]
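A hypothetical read-path sketch matching that figure, reusing a lookup like the one sketched earlier (all names invented): the client maps a byte offset to a chunk index, asks the master for the chunk's replicas, and fetches from one of them.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB chunks

def gfs_read(lookup, chunkservers, path, offset):
    """Toy read path: one master lookup, then a direct replica read."""
    handle, replicas = lookup(path, offset // CHUNK_SIZE)
    replica = chunkservers[replicas[0]]     # any live replica will do
    return replica.read(handle, offset % CHUNK_SIZE)
```

Real clients also cache the (handle, replica list) answer, so most reads bypass the master entirely.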
HDFS DESIGN GOALS & ASSUMPTIONS
Modeled on GFS, but with no support for concurrent appends
Written in Java
Master/Slave architecture
Single NameNode
Master server responsible for the namespace & access control
Multiple DataNodes
Responsible for managing storage attached to its node
A file is split into one or more blocks
Typical block size = 128 MB (vs. 64 MB for GFS)
Blocks are stored in a set of DataNodes
[Figure: GFS vs. HDFS architecture comparison]
NAMENODE (= GFS MASTER)
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 1
One server: web server, app server, MySQL database, sync server
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 2
Server ran out of disk space: moved file data to the Amazon S3 service (key-value store)
Servers became overloaded: moved the MySQL DB to another machine
Clients periodically polled server for changes
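A toy sketch of that version-2-style polling (all names invented): the client repeatedly asks the server what changed since the last version it saw. The resulting load at scale is what motivates the notification server in version 3.

```python
class FakeServer:
    """Stand-in for the Dropbox server (hypothetical API)."""
    def __init__(self):
        self.changes = []             # ordered list of changed paths

    def changes_since(self, version):
        return self.changes[version:], len(self.changes)

def poll_once(server, last_version):
    """One polling round: fetch and sync everything changed since last_version."""
    changes, new_version = server.changes_since(last_version)
    for path in changes:
        print("sync", path)           # stand-in for the local sync step
    return new_version

server = FakeServer()
server.changes.append("/notes.txt")
v = poll_once(server, 0)              # the real client runs this on a timer
```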
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 3
Move from polling to notifications: add notification server
Split web server into two:
Amazon-hosted server hosts file content and accepts uploads (stored as blocks)
Locally-hosted server manages metadata
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 4
Add more metaservers and blockservers
Blockservers do not access DB directly; they send RPCs to metaservers
Add a memory cache (memcache) in front of the database to avoid having to scale the database
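A minimal cache-aside sketch of that pattern, with plain dicts standing in for memcache and the metadata DB (all names invented): reads try the cache first, and writes invalidate so readers never see stale metadata.

```python
db = {"file:42": {"name": "notes.txt", "rev": 7}}   # stand-in for MySQL
cache = {}                                          # stand-in for memcache

def get_metadata(key):
    if key in cache:                 # 1. try the cache first
        return cache[key]
    value = db[key]                  # 2. miss: hit the database
    cache[key] = value               # 3. fill the cache for next time
    return value

def update_metadata(key, value):
    db[key] = value
    cache.pop(key, None)             # invalidate the cached copy
```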
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 5
10s of millions of clients, each of which must hold a connection to receive notifications
Add 2-level hierarchy to notification servers: ~1 million connections/server
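A toy sketch of the 2-level fan-out (invented names): the root tier only tracks which edge server holds each user's connection and routes notifications there, so no single process needs tens of millions of sockets.

```python
class EdgeServer:
    """Holds client connections (~1M per server in Dropbox's case)."""
    def __init__(self):
        self.clients = {}                       # user_id -> callback

    def notify(self, user_id):
        cb = self.clients.get(user_id)
        if cb:
            cb()

class RootServer:
    """Routes notifications to the edge holding the user's connection."""
    def __init__(self, edges):
        self.edges = edges
        self.home = {}                          # user_id -> edge index

    def register(self, user_id, callback):
        edge = hash(user_id) % len(self.edges)  # pick this user's edge server
        self.edges[edge].clients[user_id] = callback
        self.home[user_id] = edge

    def notify(self, user_id):
        self.edges[self.home[user_id]].notify(user_id)
```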
QUESTIONS?
NOW, BY E-MAIL, …