05 - Introduction To HDFS

HDFS is designed for batch processing of large datasets across commodity hardware. It breaks files into blocks and replicates them across DataNodes for fault tolerance. The NameNode manages metadata and block placement, while DataNodes store the blocks. The Secondary NameNode offloads checkpointing from the NameNode but is not a failover node. Clients access data through the NameNode and DataNodes.

Uploaded by

Jose Evanan

Introduction to the Hadoop Distributed File System (HDFS)


Course Road Map

Module 1: Big Data Management System
Module 2: Data Acquisition and Storage
Module 3: Data Access and Processing
Module 4: Data Unification and Analysis
Module 5: Using and Managing Oracle Big Data Appliance

Lesson 5: Introduction to the Hadoop Distributed File System (HDFS)
Lesson 6: Acquire Data using CLI, Fuse-DFS, and Flume
Lesson 7: Acquire and Access Data Using Oracle NoSQL Database
Lesson 8: Primary Administrative Tasks for Oracle NoSQL Database

5-2
Objectives

After completing this lesson, you should be able to:


• Describe the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5-3
Agenda

• Understand the architectural components of HDFS


• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5-4
HDFS: Characteristics
HDFS is designed for batch processing rather than interactive use by users. It uses a scale-out model based on inexpensive commodity servers with internal disks, rather than RAID, to achieve large-scale storage.

• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

5-5
HDFS Deployments:
High Availability (HA) and Non-HA
• Non-HA Deployment:
– Uses the NameNode/Secondary NameNode architecture
– The Secondary NameNode is not a failover for the
NameNode.
– The NameNode was the Single Point of Failure (SPOF) of
the cluster before Hadoop 2.0 and CDH 4.0.
• HA Deployment:
– Active NameNode
– Standby NameNode
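On an HA deployment, the role of each NameNode daemon can be queried from the command line. A minimal sketch, assuming an HA-enabled cluster; nn1 and nn2 are hypothetical NameNode IDs (check dfs.ha.namenodes.* in hdfs-site.xml for the real ones):

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Prints "active" or "standby" for each NameNode daemon.
# nn1/nn2 are hypothetical IDs; the command fails on a non-HA cluster.
hdfs haadmin -getServiceState nn1 || true
hdfs haadmin -getServiceState nn2 || true
```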

5-7
HDFS Key Definitions

Cluster: A group of servers (nodes) on a network that are configured to work together. A server is either a master node or a slave (worker) node.
Hadoop: A batch processing infrastructure that stores files and distributes work across a group of servers (nodes).
Hadoop Cluster: A collection of racks containing master and slave nodes.
Blocks: HDFS breaks a data file into blocks, or "chunks," and stores the blocks on different slave DataNodes in the Hadoop cluster.
Replication Factor: The number of copies HDFS keeps of each block (three by default), stored on different DataNodes/racks in the Hadoop cluster.
NameNode (NN): A service (daemon) that maintains a directory of all files in HDFS and tracks where data is stored in the cluster.
Secondary NameNode: Performs internal NameNode transaction log checkpointing.
DataNode (DN): Stores the blocks ("chunks") of data for a set of files.
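The block size and replication factor in the table are cluster configuration values that can be read back with getconf; a sketch, assuming the hdfs client is on the PATH:

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Default block size in bytes (134217728 = 128 MB on recent distributions)
hdfs getconf -confKey dfs.blocksize
# Default number of replicas per block (3 by default)
hdfs getconf -confKey dfs.replication
```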

5-8
NameNode (NN)
Manages the file system namespace (metadata) and controls access to files by client applications

[Diagram] NameNode metadata for the file movieplex1.log, which is split into blocks (chunks) A, B, and C:
  Blocks: A, B, C
  DataNodes: 1, 2, 3
  Replication Factor: 3
  A: DN 1, DN 2, DN 3
  B: DN 1, DN 2, DN 3
  C: DN 1, DN 2, DN 3
5-9
Functions of the NameNode

• Acts as the repository for all HDFS metadata
• Maintains the file system namespace
• Executes the directives for opening, closing, and renaming files and directories
• Stores the HDFS state in an image file (fsimage)
• Stores file system modifications in an edit log file (edits)
• On startup, merges the fsimage and edits files, and then empties edits
• Places replicas of blocks on multiple racks for fault tolerance
• Records the number of replicas (replication factor) of a file specified by an application
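As the last point notes, the replication factor is a per-file property that a client can change. A sketch using setrep; the path is hypothetical:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Set the replication factor of a file to 2; -w waits until re-replication completes.
# The path below is hypothetical -- substitute a file that exists in your cluster.
hadoop fs -setrep -w 2 /user/oracle/curriculum/lab_05_01.txt || true
```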

5 - 10
Secondary NameNode (Non-HA)
The Secondary NameNode maintains a checkpointed copy of the NameNode metadata; it is not a hot backup or failover node.

[Diagram] The NameNode and Secondary NameNode hold the same metadata for movieplex1.log: Blocks A, B, C; DataNodes 1, 2, 3; Replication Factor 3; each block replicated on DN 1, DN 2, and DN 3.
5 - 11
DataNodes (DN)
DataNode is responsible for storing the actual data in HDFS.

[Diagram] movieplex1.log is 350 MB and the block size is 128 MB, so the client chunks the file into three blocks: A (128 MB), B (128 MB), and C (94 MB). The NameNode (master) records the metadata (Blocks: A, B, C; DataNodes: 1, 2, 3; Replication Factor: 3; each block on DN 1, DN 2, DN 3), and the blocks themselves are stored on the slave DataNodes.

5 - 12
Functions of DataNodes

DataNodes perform the following functions:

• Serving read and write requests from file system clients
• Performing block creation, deletion, and replication based on instructions from the NameNode
• Providing simultaneous send/receive operations to other DataNodes during replication ("replication pipelining")

[Diagram] A DataNode on a slave node, storing blocks A, B, and C.

5 - 13
NameNode and Secondary NameNodes

[Diagram] The NameNode and Secondary NameNode (masters) both hold the metadata for movieplex1.log (Blocks: A, B, C; DataNodes: 1, 2, 3; Replication Factor: 3). The client chunks the 350 MB file (block size 128 MB) into blocks A (128 MB), B (128 MB), and C (94 MB), which are distributed across the slaves:

  DataNode 1 (slave): A, C, B
  DataNode 2 (slave): B, A, C
  DataNode 3 (slave): C, B, A

5 - 14
Storing and Accessing Data Files in HDFS
[Diagram] The client writes movieplex1.log to HDFS. The NameNode and Secondary NameNode (masters) record the metadata (Blocks: A, B, C; DataNodes: 1, 2, 3; each block replicated on DN 1, DN 2, DN 3). The blocks are copied to the slave DataNodes through a replication pipeline, and acknowledgment messages from the pipeline are sent back to the client once the blocks are copied:

  DataNode 1 (slave): A, C, B
  DataNode 2 (slave): B, A, C
  DataNode 3 (slave): C, B, A

5 - 15
HDFS Architecture: HA

Component Description
NameNode Responsible for all client operations in the cluster
(Active) Daemon
NameNode Acts as a slave or "hot" backup to the Active NameNode,
(Standby) Daemon maintaining enough information to provide a fast failover if
necessary
DataNode Daemon This is where the data is stored (HDFS) and processed
(MapReduce). This is a slave node.

HA is available in Hadoop 2.0 and later (CDH 4.0 and later). The Active and Standby NameNode daemons each run on a master node; DataNode daemons run on slave nodes.

5 - 16
Data Replication Rack-Awareness in HDFS
[Diagram] With a replication factor of 3, the replicas of blocks A, B, and C are spread across Racks 1, 2, and 3 so that the failure of any single rack does not lose all copies of a block.
5 - 17
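The rack assignments the NameNode is using for its DataNodes can be inspected directly; a sketch, assuming HDFS administrator rights:

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Print which rack each live DataNode has been mapped to (requires admin rights).
hdfs dfsadmin -printTopology || true
```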
Accessing HDFS

5 - 18
Agenda

• Understand the architectural components of HDFS


• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5 - 19
HDFS Commands

5 - 20
The File System Namespace:
The HDFS FS (File System) Shell Interface
• HDFS supports a traditional hierarchical file organization.
• You can use the FS shell command-line interface to
interact with the data in HDFS. The syntax of this
command set is similar to other shells (e.g., bash, csh)
– You can create, remove, rename, and move directories/files.
• You can invoke the FS shell as follows:
hadoop fs <args>

• The general command-line syntax is as follows:


hadoop command [genericOptions] [commandOptions]
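For example, the FS shell lists its commands and per-command usage; a sketch, assuming the hadoop client is on the PATH:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
hadoop fs -help       # list all FS shell commands with descriptions
hadoop fs -help ls    # usage for a single command
```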

5 - 21
FS Shell Commands

5 - 22
Basic File System Operations: Examples
hadoop fs -ls

• For a file, it returns stat on the file with the following format:
– permissions number_of_replicas userid groupid filesize modification_date modification_time filename
• For a directory, it returns a list of its direct children, as in UNIX. A directory is listed as:
– permissions userid groupid modification_date modification_time dirname
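A sketch of the command itself (/user/oracle is a hypothetical home directory; with no path argument, -ls lists the current user's home directory):

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# File lines show: permissions replicas owner group size date time name
# Directory lines show: permissions owner group date time name
hadoop fs -ls /user/oracle || true   # hypothetical path
```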

5 - 23
Basic File System Operations: Examples
Create an HDFS directory named curriculum by using the mkdir command:

Copy lab_05_01.txt from the local file system to the curriculum HDFS
directory by using the copyFromLocal command:
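A sketch of the two commands, assuming lab_05_01.txt exists in the current local directory:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Create the directory in the user's HDFS home (e.g., /user/oracle/curriculum)
hadoop fs -mkdir curriculum
# Copy a local file into the new HDFS directory
hadoop fs -copyFromLocal lab_05_01.txt curriculum
```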

5 - 24
Basic File System Operations: Examples

Delete the curriculum HDFS directory by using the rm command. Use the -r option
to delete the directory and any content under it recursively:
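A sketch, assuming the curriculum directory from the earlier example exists:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# -r removes the directory and everything under it recursively
hadoop fs -rm -r curriculum || true   # ignore the error if it does not exist
```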

Display the contents of the part-r-00000 HDFS file by using the cat command:
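A sketch (part-r-00000 is the conventional name of a MapReduce reducer output file; the path is hypothetical):

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Print the file's contents to stdout
hadoop fs -cat wordcount/part-r-00000 || true   # hypothetical path
```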

5 - 25
Using the hdfs fsck Command: Example
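A sketch of a typical invocation (the path is hypothetical; fsck may require HDFS superuser rights):

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Check the health of everything under the path: files, block IDs, and replica locations
hdfs fsck /user/oracle -files -blocks -locations || true   # hypothetical path
```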

5 - 26
HDFS Features and Benefits

HDFS provides the following features and benefits:


• A Rebalancer to evenly distribute data across the
DataNodes
• A file system checking utility (fsck) to perform health
checks on the file system
• Procedures for upgrade and rollback
• A secondary NameNode to enable recovery and keep the
edits log file size within a limit
• A Backup Node to keep an in-memory copy of the
NameNode contents

5 - 27
Summary

In this lesson, you should have learned how to:


• Describe the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5 - 28
