
BASICS OF DISTRIBUTED SYSTEMS

DISTRIBUTED FILE SYSTEMS


GOOGLE CHUBBY
(≈ APACHE ZOOKEEPER)
CHUBBY

 Distributed lock service + simple fault-tolerant file system


 Interfaces
 File access
 Event notification
 File locking
 Chubby is used to:
 Manage coarse-grained, long-term locks (hours or days, not < sec)
 get/release/check lock – identified with a name
 Store small amounts of data associated with a name
 E.g., system configuration info, identification of primary coordinators
 Elect masters
 Design priority: availability rather than performance
CHUBBY DEPLOYMENT
 Client library + a Chubby cell (5 replica servers)
CHUBBY: DESIGN DECISIONS

 Lock services for distributed systems


 Read-write small files
 File accesses are controlled with sessions
 Files are locally cacheable
 Locks are coarse-grained
 Strict consistency
CHUBBY MASTER

 Chubby has at most one master


 All requests from the client go to the master

 All other nodes (replicas) must agree on who the master is


 Paxos consensus protocol used to elect a master
 Master gets a lease time
 Re-run master selection after lease time expires to extend the lease or if the master fails
 When a Chubby node receives a proposal for a new master
 It will accept it only if the old master’s lease expired
SIMPLE USER-LEVEL API FOR CHUBBY

 User-level RPC interface


 Not implemented under VFS
 Programs must access Chubby via an API

 Look up Chubby nodes via DNS


 Ask any Chubby node for the master node
 File system interface (names, content, and locks)
CHUBBY: FILE SYSTEM INTERFACE

 Each file has


 Name
 Data
 Access control list
 Lock
 No modification or access times
 No seek or partial reads/writes; no symbolic links; no moves
CHUBBY: API

Function                              Description
open()                                Set mode (read, write & lock), change ACL, event list, lock-delay, create
close()                               Close the file handle
GetContentsAndStat()                  Read file contents & metadata
SetContents(), SetACL()               Write file contents or ACL
Delete()                              Delete the file
Acquire(), TryAcquire(), Release()    Lock operations
GetSequencer()                        Get a sequence # for a lock
SetSequencer()                        Associate a sequencer with a file handle
CheckSequencer()                      Check whether a sequencer is valid
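For concreteness, a sketch of how this API might be used from application code. Chubby's client library is Google-internal, so the cell object, method spellings, and path below are hypothetical stand-ins that simply mirror the calls in the table above.

# Hypothetical sketch only: `cell`, its methods, and the path are illustrative,
# not a real Chubby client API.
def advertise_primary(cell, my_address: bytes):
    handle = cell.open("/ls/cell/my-service/primary",
                       mode=("read", "write", "lock"))
    if handle.TryAcquire():                      # grab the exclusive (writer) lock
        handle.SetContents(my_address)           # small data: who the primary is
        sequencer = handle.GetSequencer()        # proof of lock ownership for later RPCs
        return ("primary", sequencer)
    data, stat = handle.GetContentsAndStat()     # otherwise just learn who the primary is
    handle.close()
    return ("replica", data)
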
CHUBBY: LOCKS

 Every file & directory can act as a reader-writer lock


 Either one client can hold an exclusive (writer) lock
 Or multiple clients can hold reader locks

 Locks are advisory


 If a client releases a lock, the lock is immediately available
 If a client fails, the lock will be unavailable for a lock-delay period (typically 1 minute)
USING LOCKS FOR LEADER ELECTION

 Using Chubby locks makes leader election easy


 No need for user servers to participate in a consensus protocol … the programmer doesn't need to figure out Paxos (or Raft)
 Chubby provides the fault tolerance
 Participant tries to acquire a lock
 If it gets it, then it’s the master for whatever service it's providing!

 Example: electing a master & using it to write to a file server


 Participant gets a lock, becomes master (for its service, not Chubby)
 Gets a lock sequence count
 In each RPC to a server, send the sequence count to the server
 During request processing, a server will reject old (delayed) packets
if (sequence_count < current_sequence_count)
    reject request   /* it must be from a delayed packet */
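A minimal in-process sketch of the check above: the lock's sequence count travels with every request, and the server rejects anything older than the newest count it has seen. Class and method names here are illustrative, not a Chubby API.

# Runnable toy: the file server tracks the highest sequence count seen so far.
class FileServer:
    def __init__(self):
        self.current_sequence_count = 0

    def handle(self, seq, payload):
        if seq < self.current_sequence_count:
            return "REJECT"                      # must be from a delayed/deposed master
        self.current_sequence_count = seq
        return f"OK, applied {payload!r}"

server = FileServer()
print(server.handle(seq=1, payload="write A"))   # OK (master holding sequencer 1)
print(server.handle(seq=2, payload="write B"))   # OK (new master, sequencer 2)
print(server.handle(seq=1, payload="write C"))   # REJECT (stale request)
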
SESSIONS

 Client requests for session


 Master grants it
 Client sends “KeepAlive” messages
 If no “KeepAlive” message is received by the master, the session is closed
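A toy sketch of the session bookkeeping on the master: each session carries a lease that KeepAlive messages extend, and sessions whose lease expires are closed. Timings and names are illustrative; real Chubby leases and jeopardy handling are more involved.

import time

class Master:
    LEASE_SECONDS = 12

    def __init__(self):
        self.sessions = {}                               # session_id -> lease expiry time

    def open_session(self, session_id):
        self.sessions[session_id] = time.time() + self.LEASE_SECONDS

    def keep_alive(self, session_id):
        if session_id in self.sessions:                  # client refreshes its lease
            self.sessions[session_id] = time.time() + self.LEASE_SECONDS

    def expire_sessions(self):                           # run periodically on the master
        now = time.time()
        for sid, expiry in list(self.sessions.items()):
            if expiry < now:
                del self.sessions[sid]                   # no KeepAlive seen: session closed

master = Master()
master.open_session("client-42")
master.keep_alive("client-42")
master.expire_sessions()
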
EVENTS

 Clients may subscribe to events:


 File content modifications
 Child node added/removed/modified
 Chubby master failed over
 File handle & its lock became invalid
 Lock acquired
 Conflicting lock request from another client
CHUBBY CLIENT CACHE

 Master writes to client cache


 Client works with cache
 If changes were made, the master invalidates the cache
 If so, the client communicates with the master
CHUBBY CLIENT CACHING & MASTER REPLICATION

 At the client
 Data cached in memory by chubby clients
 Cache is maintained by a Chubby lease, which can be invalidated
 All clients write through to the Chubby master
 At the master
 Writes are propagated via Paxos consensus to all Chubby replicas
 Data updated in total order – replicas remain synchronized
 The master replies to a client after the writes reach a majority of replicas
 Cache invalidations
 Master keeps a list of what each client may be caching
 Invalidations sent by master and are acknowledged by client
 File is then cacheable again
 Chubby database is backed up to GFS every few hours
APACHE ZOOKEEPER
APACHE ZOOKEEPER

 Similar to Chubby
 Different naming
 Writes via Leader
 Read from any replica

 Only 4 events/watches
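As a concrete example of watches, here is a data watch with the kazoo Python client, assuming a ZooKeeper server reachable at 127.0.0.1:2181 and the kazoo package installed; the znode path and values are made up.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/config/primary")                # create the znode if it is missing
zk.set("/config/primary", b"10.0.0.7:8080")

@zk.DataWatch("/config/primary")
def on_change(data, stat):
    # kazoo's DataWatch re-registers itself, so this fires on every data change.
    print("primary is now:", data)

zk.set("/config/primary", b"10.0.0.8:8080")      # triggers the watch

zk.stop()
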
NEED OF PARALLEL FILE SYSTEMS

 Client-server file systems
 Central servers
 Point of congestion, single point of failure
 Alleviate somewhat with replication and client caching
 E.g., Coda; tokens (aka leases, oplocks)
 Limited replication can lead to congestion

 File data is still centralized


 A file server stores all data from a file – not split across servers
 Even if replication is in place, a client downloads all data for a file from one server
 File sizes are limited to the capacity available on a server
 What if you need a 1,000 TB file?
GOOGLE FILE SYSTEM
GFS
(≈ APACHE HADOOP DISTRIBUTED FILE SYSTEM)
GFS

 Scalable distributed file system


 Designed for large data-intensive applications
 Fault-tolerant; runs on commodity hardware
 Delivers high performance to a large number of clients
DESIGN ASSUMPTIONS

 Assumptions for conventional file systems don’t work


 E.g., “most files are small”, “lots have short lifetimes”

 Component failures are the norm, not an exception


 File system = thousands of storage machines
 Some % not working at any given time

 Files are huge. Multi-TB files are the norm


 It doesn’t make sense to work with billions of KB-sized files
 I/O operations and block size choices are also affected
DESIGN ASSUMPTIONS

 File access:
 Most files are appended, not overwritten
 Random writes within a file are almost never done
 Once created, files are mostly read; often sequentially
 Workload is mostly:
 Reads: large streaming reads, small random reads – these dominate
 Large appends
 Hundreds of processes may append to a file concurrently

 GFS will store a modest number of files for its scale


 approx. a few million
 Co-designing the GFS API together with the applications benefits the overall system
 Apps can handle a relaxed consistency model
WHAT IS A PARALLEL FILE SYSTEM?

 Conventional file systems


 Store data & metadata on the same storage device
 Example:
 Linux directories are just files that contain lists of names & inodes
 inodes are data structures placed in well-defined areas of the disk that contain information about the file
 Lists of block numbers containing file data are allocated from the same set of data blocks used for file data

 Parallel file systems:


 File data can span multiple servers
 Metadata can be on separate servers
 Metadata = information about the file
 Includes name, access permissions, timestamps, file size, & locations of data blocks
 Data = actual file contents
BASIC DESIGN IDEA

 Use separate servers to store metadata


 Metadata includes lists of (server, block_number) sets that identify which blocks on which servers hold file data
 We need more bandwidth for data access than metadata access
 Metadata is small; file data can be huge

 Use large logical blocks


 Most "normal" file systems are optimized for small files
 Block sizes are typically 4 KB
 Expect huge files, so use huge blocks … >1,000x larger
 The list of blocks that makes up a file becomes easier to manage

 Replicate data
 Expect some servers to be down
 Store copies of data blocks on multiple servers
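A sketch of this metadata layout, with illustrative names and a 64 MB logical block: the metadata server keeps, per file, an ordered list of blocks, each mapped to the (server, block_number) replicas that hold it.

from dataclasses import dataclass, field

BLOCK_SIZE = 64 * 1024 * 1024            # large logical blocks (e.g., 64 MB)

@dataclass
class FileMetadata:
    name: str
    size: int = 0
    # one entry per logical block; each entry lists the replicas holding that block
    blocks: list = field(default_factory=list)   # [[("serverA", 17), ("serverB", 4), ...], ...]

meta = FileMetadata(name="/logs/clickstream")
meta.blocks.append([("server-12", 901), ("server-07", 334), ("server-31", 58)])
meta.size += BLOCK_SIZE

def block_for_offset(meta: FileMetadata, offset: int):
    """Map a byte offset to the replicas that hold it."""
    return meta.blocks[offset // BLOCK_SIZE]

print(block_for_offset(meta, 10_000_000))        # -> the three replicas of block 0
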
FILE SYSTEM INTERFACE

 GFS does not have a standard OS-level API


 No POSIX system call level API – no kernel/VFS implementation
 User-level API for accessing files
 GFS servers are implemented in user space using native Linux FS
 Files organized hierarchically in directories
 Operations
 Basic operations
 Create, delete, open, close, read, write
 Additional operations
 Snapshot: create a copy of a file or directory tree at low cost
 Append: allow multiple clients to append atomically without locking
GFS MASTER & CHUNKSERVERS

 GFS cluster
 Multiple chunkservers
 Data storage: fixed-size chunks
 Chunks replicated on several systems

 One master
 Stores file system metadata (names, attributes)
 Maps files to chunks
GFS MASTER & CHUNKSERVERS

 GFS cluster
GFS FILES
CORE PART OF GOOGLE CLUSTER ENVIRONMENT
 Google Cluster Environment
 Core services: GFS + cluster scheduling system
 Typically 100s to 1000s of active jobs
 200+ clusters, many with 1000s of machines
 Pools of 1000s of clients
 4+ PB filesystems, 40 GB/s read/write loads
CHUNKS AND CHUNKSERVERS

 Chunk size = 64 MB (default)


 Chunkserver stores a 32-bit checksum with each chunk
 In memory & logged to disk: allows it to detect data corruption

 Chunk Handle: identifies a chunk


 Globally unique 64-bit number
 Assigned by the master when the chunk is created
 Chunkservers store chunks on local disks as Linux files
 Each chunk is replicated on multiple chunkservers
 Three replicas (different levels can be specified)
 Popular files may need more replicas to avoid hotspots
MASTER

 Maintains all file system metadata


 Namespace
 Access control info
 Filename to chunks mappings
 Current locations of chunks

 Manages
 Chunk leases (locks)
 Garbage collection (freeing unused chunks)
 Chunk migration (copying/moving chunks)

 Fault tolerance
 Operation log replicated on multiple machines
 New master can be started if the master fails

 Periodically communicates with all chunkservers


 Via heartbeat messages to get state and send commands
CLIENT INTERACTION MODEL

 GFS client code linked into each app


 No OS-level API – you have to use a library
 Interacts with master for metadata-related operations
 Interacts directly with chunkservers for file data
 All reads & writes go directly to chunkservers
 Master is not a point of congestion

 Neither clients nor chunkservers cache data


 Except for the caching by the OS system buffer cache
 Clients cache metadata
 E.g., location of a file’s chunks
ONE MASTER = SIMPLIFIED DESIGN

 All metadata stored in master’s memory


 Super-fast access
 Namespaces and name-to-chunk_list maps
 Stored in memory
 Also persist in an operation log on the disk
 Replicated onto remote machines for backup

 Operation log
 Similar to a journal
 All operations are logged
 Periodic checkpoints (stored in a B-tree) to avoid playing back entire log
 Master does not store chunk locations persistently
 This is queried from all the chunkservers: avoids consistency problems
OPERATION LOGS

 All operations logged


 If the master crashes, after restarting it replays the operation log

 An additional shadow Master can be used as a backup
WHY LARGE CHUNKS?

 Default chunk size = 64MB


 (compare to Linux ext4 block sizes: typically 4 KB and up to 1 MB)
 Reduces need for frequent communication with master to get chunk location info – one query can give info on location of lots of bytes of data
 Clients can easily cache info to refer to all data of large files
 Cached data has timeouts to reduce possibility of reading stale data

 Large chunk size makes it feasible to keep a TCP connection open to a chunkserver for an extended time
 Master stores <64 bytes of metadata for each 64MB chunk
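A back-of-the-envelope check of these numbers for the hypothetical 1,000 TB file mentioned earlier:

CHUNK_SIZE = 64 * 2**20                  # 64 MB
METADATA_PER_CHUNK = 64                  # < 64 bytes of master metadata per chunk

file_size = 1000 * 2**40                 # 1,000 TB
chunks = file_size // CHUNK_SIZE         # ≈ 16.4 million chunks
master_memory = chunks * METADATA_PER_CHUNK

print(chunks, "chunks,", master_memory / 2**30, "GB of master metadata")
# -> roughly 16.4M chunks and about 1 GB of in-memory metadata on the master
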
READ

 Contact the master


 Get the file’s metadata: the list of chunk handles
 Get the location of each of the chunk handles
 Multiple replicated chunkservers per chunk

 Contact any available chunkserver for chunk data
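A small in-process sketch of this read path; the Master/Chunkserver classes and their methods are illustrative stand-ins, not the real GFS RPC interface.

CHUNK_SIZE = 64 * 2**20                             # 64 MB chunks

class Master:
    def __init__(self):
        # filename -> ordered chunk handles; handle -> replica chunkservers
        self.file_chunks = {"/logs/a": [101, 102]}
        self.locations = {101: ["cs1", "cs2", "cs3"], 102: ["cs2", "cs4", "cs5"]}
    def lookup(self, filename, chunk_index):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.locations[handle]       # client caches this answer

class Chunkserver:
    def __init__(self, chunks):
        self.chunks = chunks                         # handle -> bytes on local disk
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def gfs_read(master, chunkservers, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE               # client computes the chunk index
    handle, replicas = master.lookup(filename, chunk_index)
    server = next(r for r in replicas if r in chunkservers)    # any available replica
    return chunkservers[server].read(handle, offset % CHUNK_SIZE, length)

servers = {"cs1": Chunkserver({101: b"the first chunk's data ..."})}
print(gfs_read(Master(), servers, "/logs/a", offset=0, length=9))   # b'the first'
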


WRITE

 Less frequent than reading


 Master grants a chunk lease to one of the replicas
 This replica will be the primary replica chunkserver
 Primary can request lease extensions, if needed
 Master increases the chunk version number and informs replicas
WRITE: PHASE 1 – SEND DATA

Deliver data but don’t write to the file


 A client is given a list of replicas
 Identifying the primary and secondaries
 Client writes to the closest replica chunkserver that has not received the data
 Replica forwards the data to another replica chunkserver
 That chunkserver forwards to another replica chunkserver …
 Chunkservers store this data in a cache – it’s not part of the file yet
Goal: Maximize bandwidth via pipelining and minimize latency by forwarding data as soon as it is received
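A sketch of the chained data push (class names and the chain layout are illustrative): each chunkserver buffers the bytes in a write cache and immediately forwards them to the next replica.

class Chunkserver:
    def __init__(self, name):
        self.name = name
        self.write_cache = {}                        # (handle, data_id) -> bytes, uncommitted

    def push(self, handle, data_id, data, chain):
        self.write_cache[(handle, data_id)] = data   # cached only, not yet part of the file
        if chain:                                    # forward to the next replica at once
            chain[0].push(handle, data_id, data, chain[1:])

replicas = [Chunkserver("closest"), Chunkserver("next"), Chunkserver("farthest")]
# The client sends the data once, to the closest replica; it then flows down the chain.
replicas[0].push(handle=101, data_id="d-7", data=b"record to append", chain=replicas[1:])
print([list(r.write_cache) for r in replicas])       # all three now hold the data
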
WRITE: PHASE 2 – WRITE DATA

Add it to the file (commit)


 Client waits for replicas to acknowledge receiving the data
 Sends a write request to the primary, identifying the data that was sent
 The primary is responsible for serialization of writes
 Assigns consecutive serial numbers to all writes that it received
 Applies writes in serial-number order and forwards write requests in order to secondaries
 Once all acknowledgements have been received, the primary acknowledges the client
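A sketch of the commit step, with illustrative Primary/Secondary classes: the primary assigns serial numbers, applies writes in that order, and forwards the same order to the secondaries before acknowledging the client.

class Secondary:
    def __init__(self):
        self.applied = []
    def apply(self, serial, handle, data_id):
        self.applied.append((serial, handle, data_id))   # same order as on the primary
        return True

class Primary:
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0
        self.applied = []                                # (serial, handle, data_id), in order

    def write(self, handle, data_id):
        serial = self.next_serial                        # serialize concurrent writes
        self.next_serial += 1
        self.applied.append((serial, handle, data_id))
        acks = [s.apply(serial, handle, data_id) for s in self.secondaries]
        return "OK" if all(acks) else "RETRY"            # client retries on partial failure

primary = Primary([Secondary(), Secondary()])
print(primary.write(handle=101, data_id="d-7"))          # -> OK
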
WRITE

 Data Flow (phase 1) is different from Control Flow (phase 2)
 Data Flow (upload):
 Client to chunkserver to chunkserver to chunkserver…
 Order does not matter
 Control Flow (write):
 Client to primary; primary to all secondaries
 Locking used; order maintained
 Chunk version numbers are used to detect if any replica has stale data (was not updated because it was down)
NAMESPACE

 No per-directory data structure like in most file systems
 E.g., no directory file that contains the names of all files in the directory

 No aliases (hard or symbolic links)


 Namespace is a single lookup table
 Maps pathnames to metadata
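A sketch of such a flat namespace, using an ordinary dictionary: full pathnames map directly to metadata, and "listing a directory" reduces to a prefix scan. Paths and fields are illustrative.

namespace = {
    "/home/logs/a": {"chunks": [101, 102], "acl": "rw"},
    "/home/logs/b": {"chunks": [103], "acl": "r"},
    "/home/tmp/x":  {"chunks": [104], "acl": "rw"},
}

def list_directory(prefix):
    prefix = prefix.rstrip("/") + "/"
    return [path for path in namespace if path.startswith(prefix)]

print(list_directory("/home/logs"))      # ['/home/logs/a', '/home/logs/b']
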
HADOOP
HDFS
HDFS: HADOOP DISTRIBUTED FILE SYSTEM

 Primary storage system for Hadoop applications


 Hadoop: a software library/framework that allows for the distributed processing of large data sets across clusters of computers
 Hadoop includes:
 MapReduce™: software framework for distributed processing of large data sets on compute clusters.
 Avro™: A data serialization system.
 Cassandra™: A scalable multi-master database with no single points of failure.
 Chukwa™: A data collection system for managing large distributed systems.
 HBase™: A scalable, distributed database that supports structured data storage for large tables.
 Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
 Mahout™: A scalable machine learning and data mining library.
 Pig™: A high-level data-flow language and execution framework for parallel computation.
 ZooKeeper™: A high-performance coordination service for distributed applications.
HDFS DESIGN GOALS & ASSUMPTIONS

 HDFS is an open source (Apache) implementation inspired by GFS design


 Similar goals and same basic design as GFS
 Run on commodity hardware
 Highly fault tolerant
 High throughput – Designed for large data sets
 OK to relax some POSIX requirements
 Large scale deployments
 Instance of HDFS may comprise 1000s of servers
 Each server stores part of the file system’s data

 But
 No support for concurrent appends
HDFS DESIGN GOALS & ASSUMPTIONS

 Write-once, read-many-times file access model


 A file’s contents will not change
 Simplifies data coherency
 Suitable for web crawlers and MapReduce analytics applications
HDFS ARCHITECTURE

 Written in Java
 Master/Slave architecture

 Single NameNode
 Master server responsible for the namespace & access control
 Multiple DataNodes
 Responsible for managing storage attached to its node
 A file is split into one or more blocks
 Typical block size = 128 MB (vs. 64 MB for GFS)
 Blocks are stored in a set of DataNodes
GFS vs. HDFS architecture (comparison diagram)
NAMENODE (= GFS MASTER)

 Executes metadata operations


 open, close, rename
 Maps file blocks to DataNodes
 Maintains HDFS namespace
 Transaction log (EditLog) records every change that occurs to file system metadata
 Entire file system namespace + file-block mappings is stored in memory
 … and stored in a file (FsImage) for persistence
 NameNode receives a periodic Heartbeat and Blockreport from each DataNode
 Heartbeat = “I am alive” message
 Blockreport = list of all blocks managed by a DataNode
 Keep track of which DataNodes own which blocks & their replication count
DATANODE (= GFS CHUNKSERVER)

 Responsible for serving read/write requests


 Blocks are replicated for fault tolerance
 App can specify # replicas at creation time
 Can be changed later

 Blocks are stored in the local file system at the DataNode


RACK-AWARE READS & REPLICA SELECTION

 Client sends request to NameNode


 Receives list of blocks and replica DataNodes per block

 Client tries to read from the closest replica


 Prefer same rack
 Else same data center
 Location awareness is configured by the admin
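A sketch of the replica-selection policy (the topology fields below are illustrative; in HDFS the admin supplies the rack/data-center mapping):

def pick_replica(replicas, client_rack, client_dc):
    def distance(replica):
        if replica["rack"] == client_rack:
            return 0           # same rack: cheapest
        if replica["dc"] == client_dc:
            return 1           # same data center
        return 2               # remote
    return min(replicas, key=distance)

replicas = [
    {"node": "dn-17", "rack": "r2", "dc": "dc1"},
    {"node": "dn-04", "rack": "r1", "dc": "dc1"},
    {"node": "dn-88", "rack": "r9", "dc": "dc2"},
]
print(pick_replica(replicas, client_rack="r1", client_dc="dc1"))   # -> dn-04
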
WRITES
 Client caches file data into a temp file
 When temp file ≥ one HDFS block size
 Client contacts NameNode
 NameNode inserts file name into file system hierarchy & allocates a data block
 Responds to client with the destination data block
 Client writes to the block at the corresponding DataNode
 When a file is closed, remaining data is transferred to a DataNode
 NameNode is informed that the file is closed
 NameNode commits file creation operation into a persistent store (log)
 Data writes are chained: pipelined
 Client writes to the first (closest) DataNode
 That DataNode writes the data stream to the second DataNode
 And so on…
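A sketch of the staged, pipelined write path described above, with tiny in-process NameNode/DataNode stand-ins rather than the real HDFS protocol.

BLOCK_SIZE = 128 * 2**20                             # 128 MB HDFS block

class DataNode:
    def __init__(self, name):
        self.name, self.blocks = name, {}
    def write_block(self, block_id, data, rest):
        self.blocks[block_id] = data                 # store locally, then chain onward
        if rest:
            rest[0].write_block(block_id, data, rest[1:])

class NameNode:
    def __init__(self):
        self.next_block, self.edit_log = 0, []
    def allocate_block(self, path):
        self.next_block += 1                         # pick a block id + a DataNode pipeline
        return self.next_block, [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
    def complete(self, path):
        self.edit_log.append(("close", path))        # persist the file-creation record

class WriteStager:
    def __init__(self, namenode, path):
        self.namenode, self.path, self.buffer = namenode, path, bytearray()
    def write(self, data):
        self.buffer.extend(data)
        while len(self.buffer) >= BLOCK_SIZE:        # a full block is ready: flush it
            self._flush(bytes(self.buffer[:BLOCK_SIZE]))
            del self.buffer[:BLOCK_SIZE]
    def close(self):
        if self.buffer:
            self._flush(bytes(self.buffer))          # ship the remaining tail on close
        self.namenode.complete(self.path)
    def _flush(self, block_data):
        block_id, pipeline = self.namenode.allocate_block(self.path)
        pipeline[0].write_block(block_id, block_data, pipeline[1:])   # chained write

stager = WriteStager(NameNode(), "/data/events.log")
stager.write(b"event-1\n")
stager.close()
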
INTERNET-BASED FILE SYNC & SHARING:
DROPBOX
FILE SYNCHRONIZATION

 Client runs on desktop


 Uploads any changes made within a dropbox folder
 Huge scale
 100+ million users syncing 1 billion files per day
 Design
 Small client that doesn’t take a lot of resources
 Expect possibility of low bandwidth to user
 Scalable back-end architecture
 99%+ of code written in Python
 infrastructure (storage, monitoring) software migrated to Go in 2013
 a few components running on storage boxes migrated to Rust for memory efficiency
WHAT’S DIFFERENT ABOUT DROPBOX?
 Most web-based apps have high read to write ratios
 E.g., twitter, facebook, reddit, … 100:1, 1000:1, or higher
 But with Dropbox…
 Everyone’s computer has a complete copy of their Dropbox
 Traffic happens only when changes occur
 File upload : file download ratio roughly 1:1
 Huge number of uploads compared to traditional services

 Must abide by most ACID requirements … sort of


 Atomic: don’t share partially-modified files
 Consistent:
 Operations have to be in order and reliable
 Cannot delete a file in a shared folder yet have others still see it
 Durable: Files cannot disappear
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 1

 One server: web server, app server, mySQL database, sync server
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 2
 Server ran out of disk space: moved data to Amazon S3 service (key-value store)
 Servers became overloaded: moved mySQL DB to another machine
 Clients periodically polled server for changes
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 3
 Move from polling to notifications: add notification server
 Split web server into two:
 Amazon-hosted server hosts file content and accepts uploads (stored as blocks)
 Locally-hosted server manages metadata
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 4
 Add more metaservers and blockservers
 Blockservers do not access DB directly; they send RPCs to metaservers
 Add a memory cache (memcache) in front of the database to reduce load and avoid scaling the database
DROPBOX: ARCHITECTURE EVOLUTION: VERSION 5
 10s of millions of clients – Clients have to connect before getting notifications
 Add 2-level hierarchy to notification servers: ~1 million connections/server
QUESTIONS?
NOW, BY E-MAIL, …
