Google file system (GFS) Google File System, a scalable distributed file system for large distributed data-intensive applications. Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as other distributed file systems such as performance, scalability, reliability, and availability. GFS provides a familiar file system interface. Files are organized hierarchically in directories and identified by pathnames. Support the usual operations to create, delete, open, close, GFS Small as well as multi-GB files are common. Each file typically contains many application objects such as web documents. GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable. GFS has snapshot and record append operations. GFS (snapshot and record append) The snapshot operation makes a copy of a file or a directory. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results. GFS consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations Common Goals of GFS and most Distributed File Systems Performance Reliability Scalability Availability Other GFS Concepts Component failures are the norm rather than the exception. File System consists of hundreds or even thousands of storage machines built from inexpensive commodity parts.
Files are Huge. Multi-GB Files are common.
Each file typically contains many application objects such as web documents. Append, Append, Append. Most files are mutated by appending new data rather than overwriting Other GFS Concepts Why assume hardware failure is the norm? It is cheaper to assume common failure on poor hardware and account for it, rather than invest in expensive hardware and still experience occasional failure. The amount of layers in a distributed system (network, disk, memory, physical connections, power, OS, application) mean failure on any could contribute to data corruption. GFS Interface GFS – familiar file system interface Files organized hierarchically in directories, path names Create, delete, open, close, read, write operations Snapshot and record append (allows multiple clients to append simultaneously - atomic) GFS Architecture GFS Architecture GFS Architecture Chunk GFS Architecture A GFS cluster consists of a single master and multiple chunk-servers and is accessed by multiple clients, Each of these is typically a commodity Linux machine.
It is easy to run both a chunk-server and a client on the
same machine.
As long as machine resources permit, it is possible to
run flaky application code is acceptable. GFS Architecture Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk assigned by the master at the time of chunk creation. Chunk-servers store chunks on local disks as Linux files, each chunk is replicated on multiple chunk-servers. The master maintains all file system metadata. This includes the namespace, access control GFS is fault tolerance? Consistency Consistency Write control and data flow Replica placement Replica placement Garbage Collection