
Google File System

Google File System (GFS or GoogleFS, not to be confused with the GFS Linux file system) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. The latest version of the Google File System, codenamed Colossus, was released in 2010.

GFS is tailored to Google's core data storage and usage needs (primarily the search engine), which generate enormous amounts of data that must be retained. It grew out of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days of Google, while the company was still based at Stanford. Files are divided into fixed-size chunks of 64 megabytes, similar to clusters or sectors in regular file systems; chunks are only extremely rarely overwritten or shrunk, since files are usually appended to or read. GFS is also designed and optimized to run on Google's computing clusters, dense nodes built from cheap "commodity" computers, which means precautions must be taken against the high failure rate of individual nodes and the resulting data loss. Other design decisions favor high data throughput, even at the cost of latency.
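
To make the fixed chunk size concrete, the short Java sketch below (class and method names are invented for illustration; GFS has no public client library) shows how a 64 MB chunk size maps a byte offset within a file to a chunk index and an offset inside that chunk.

// Illustrative only: how a fixed 64 MB chunk size maps a byte offset in a file
// to a chunk index and an offset within that chunk. Names are hypothetical,
// not part of any real GFS client library.
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB, as in GFS

    // Which chunk holds this byte of the file?
    static long chunkIndex(long fileOffset) {
        return fileOffset / CHUNK_SIZE;
    }

    // Where inside that chunk does the byte live?
    static long offsetInChunk(long fileOffset) {
        return fileOffset % CHUNK_SIZE;
    }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024; // byte 200 MB into a file
        System.out.println("chunk index = " + chunkIndex(offset));        // 3
        System.out.println("offset in chunk = " + offsetInChunk(offset)); // 8 MB
    }
}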
A GFS cluster consists of multiple nodes, divided into two types: one Master node and multiple Chunkservers. Each file is divided into fixed-size chunks, which the chunkservers store. Each chunk is assigned a globally unique 64-bit label by the master node at creation time, and logical mappings of files to their constituent chunks are maintained. Each chunk is replicated several times throughout the network: three times by default, though this is configurable. Files in high demand may have a higher replication factor, while files for which the application client uses strict storage optimizations may be replicated fewer than three times, in order to cope with quick garbage-cleaning policies.
The Master server does not usually store the actual chunks, but rather all the metadata associated with them, such as the tables mapping the 64-bit labels to chunk locations and to the files they make up (the file-to-chunk mapping), the locations of the copies of each chunk, which processes are reading or writing a particular chunk, and whether a chunk is being "snapshotted" for re-replication (usually at the instigation of the Master server when, due to node failures, the number of copies of a chunk has fallen below the configured number). All this metadata is kept current by the Master server periodically receiving updates from each chunkserver ("heartbeat messages").
Permissions for modifications are handled by a system of time-limited, expiring "leases": the Master server grants a process permission to modify a chunk for a finite period of time, during which no other process will be granted that permission by the Master. The modifying chunkserver, which is always the primary chunk holder, then propagates the changes to the chunkservers holding the backup copies. The changes are not committed until all chunkservers acknowledge them, thus guaranteeing the completion and atomicity of the operation.
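
The lease mechanism can be sketched as a small table on the master that grants at most one unexpired lease per chunk. The lease duration and all names below are assumptions for illustration, not GFS internals (real GFS leases can also be extended through heartbeats).

import java.util.*;

// Hypothetical sketch of time-limited lease bookkeeping on the master:
// at most one unexpired lease (one primary) per chunk at a time.
public class LeaseTable {
    private static final long LEASE_MS = 60_000; // assumed lease duration

    private static final class Lease {
        final String primaryChunkserver;
        final long expiresAtMillis;
        Lease(String primary, long expiresAt) {
            this.primaryChunkserver = primary;
            this.expiresAtMillis = expiresAt;
        }
    }

    private final Map<Long, Lease> leases = new HashMap<>(); // chunk handle -> lease

    // Grant a lease on a chunk if no unexpired lease exists; return the primary.
    synchronized Optional<String> grantLease(long chunkHandle, String candidatePrimary) {
        long now = System.currentTimeMillis();
        Lease current = leases.get(chunkHandle);
        if (current != null && current.expiresAtMillis > now) {
            // An unexpired lease already exists; only its holder keeps permission.
            return current.primaryChunkserver.equals(candidatePrimary)
                    ? Optional.of(candidatePrimary)
                    : Optional.empty();
        }
        leases.put(chunkHandle, new Lease(candidatePrimary, now + LEASE_MS));
        return Optional.of(candidatePrimary);
    }
}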
Programs access the chunks by first querying the Master server for the locations of the desired chunks; if the chunks are not being operated on (i.e. no outstanding leases exist), the Master replies with the locations, and the program then contacts the appropriate chunkserver and receives the data from it directly.
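
The read path described above corresponds roughly to the following hypothetical client-side sketch, with invented Master and Chunkserver interfaces standing in for the proprietary RPC protocol.

import java.util.List;

// Hypothetical sketch of the GFS read path: ask the master for replica
// locations of the chunk covering a file offset, then fetch the data
// directly from one of the chunkservers. These interfaces are invented
// for illustration; GFS has no public client API.
public class GfsReadFlow {
    interface Master {
        // Returns chunkserver addresses holding the chunk at this index of the file.
        List<String> locateChunk(String path, long chunkIndex);
    }
    interface Chunkserver {
        byte[] readChunk(long chunkIndex, long offsetInChunk, int length);
    }
    interface ChunkserverConnector {
        Chunkserver connect(String address);
    }

    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    static byte[] read(Master master, ChunkserverConnector connector,
                       String path, long fileOffset, int length) {
        long chunkIndex = fileOffset / CHUNK_SIZE;
        long offsetInChunk = fileOffset % CHUNK_SIZE;
        // Step 1: small metadata lookup on the master.
        List<String> replicas = master.locateChunk(path, chunkIndex);
        // Step 2: bulk data transfer directly from a chunkserver, not the master.
        Chunkserver server = connector.connect(replicas.get(0));
        return server.readChunk(chunkIndex, offsetInChunk, length);
    }
}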
Unlike most other file systems, GFS is not implemented in the kernel of an operating system, but
is instead provided as a userspace library.

Hadoop - HDFS

The Hadoop Distributed File System (HDFS) follows a distributed file system design and runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault-tolerant and designed to use low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are spread across multiple machines and stored redundantly, so the system can recover from possible data loss in case of failure. HDFS also makes the data available to applications for parallel processing.

Features of HDFS
● It is suitable for distributed storage and processing.
● Hadoop provides a command interface to interact with HDFS; a minimal programmatic example using the Java API is sketched after this list.
● The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
● Streaming access to file system data.
● HDFS provides file permissions and authentication.
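
As a sketch of interacting with HDFS programmatically rather than through the command shell, the Java example below writes and reads a small file through the standard org.apache.hadoop.fs.FileSystem API. The namenode URI and file path are assumptions and should be adjusted to the cluster at hand.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Minimal sketch of talking to HDFS through the Java FileSystem API.
public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed namenode URI

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write a small file (the namenode picks datanodes for each block).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back, streaming from the datanodes that hold the blocks.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}

The equivalent interaction from the command interface would use the hdfs dfs shell (for example, hdfs dfs -cat /tmp/hello.txt to read the file back).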
HDFS Architecture

[Figure: architecture of the Hadoop File System]

HDFS follows a master-slave architecture and has the following elements.

Namenode

The namenode runs on commodity hardware with the GNU/Linux operating system and the namenode software; it is software that can run on ordinary commodity machines. The system hosting the namenode acts as the master server and performs the following tasks −

● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as renaming, closing, and opening files and directories (a sketch using the Java API follows this list).
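
As a sketch of the namespace operations listed above, the following uses the standard Hadoop FileSystem API (the paths are assumptions). All of these calls are resolved against namenode metadata; no block data is transferred.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of namespace operations handled by the namenode (metadata only).
public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads fs.defaultFS from core-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.mkdirs(new Path("/data/incoming"));                            // create a directory
            fs.rename(new Path("/data/incoming"), new Path("/data/staged"));  // rename it

            // Listing a directory is also answered from namenode metadata.
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.println(status.getPath() + " " + status.getLen());
            }

            fs.delete(new Path("/data/staged"), true); // recursive delete
        }
    }
}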
Datanode

A datanode is commodity hardware running the GNU/Linux operating system and the datanode software. There is a datanode for every node (commodity hardware/system) in the cluster. These nodes manage the data storage of their system.

● Datanodes perform read-write operations on the file system, as per client requests.
● They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block

Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks; in other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB (128 MB in Hadoop 2.x and later), and it can be increased as needed by changing the HDFS configuration.
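
As a sketch of adjusting the block size, the example below sets dfs.blocksize in the configuration and also passes a per-file block size when creating a file. The 128 MB value and the paths are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: choosing a block size and checking how many blocks a file
// of a given length would occupy.
public class BlockSizing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long blockSize = 128L * 1024 * 1024;      // assumed: raise block size to 128 MB
        conf.setLong("dfs.blocksize", blockSize); // default for files created by this client

        try (FileSystem fs = FileSystem.get(conf)) {
            // A block size can also be chosen per file at creation time.
            fs.create(new Path("/tmp/big.dat"), true, 4096, (short) 3, blockSize).close();

            long fileLength = 1_000L * 1024 * 1024;                 // a hypothetical 1000 MB file
            long blocks = (fileLength + blockSize - 1) / blockSize; // ceiling division
            System.out.println("blocks needed: " + blocks);         // 8 with 128 MB blocks
        }
    }
}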

Goals of HDFS

Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, component failures are frequent. HDFS therefore needs mechanisms for quick, automatic fault detection and recovery.

Huge datasets − HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets.

Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
