05 - Introduction To HDFS
Objectives
Agenda
HDFS: Characteristics
• HDFS is designed for batch processing rather than interactive use.
• It uses a scale-out model based on inexpensive commodity servers with internal disks,
rather than RAID, to achieve large-scale storage.
• Highly fault-tolerant
• High throughput
HDFS Deployments:
High Availability (HA) and Non-HA
• Non-HA Deployment:
– Uses the NameNode/Secondary NameNode architecture
– The Secondary NameNode is not a failover for the
NameNode.
– The NameNode was a single point of failure (SPOF) for
the cluster before Hadoop 2.0 and CDH 4.0, which introduced HA.
• HA Deployment:
– Active NameNode
– Standby NameNode
HDFS Key Definitions
Term                  Description
Cluster               A group of servers (nodes) on a network that are configured to
                      work together. A server is either a master node or a slave
                      (worker) node.
Hadoop                A batch processing infrastructure that stores and distributes
                      files, and distributes work, across a group of servers (nodes).
Hadoop Cluster        A collection of racks containing master and slave nodes.
Blocks                HDFS breaks a data file down into blocks ("chunks") and
                      stores the blocks on different slave DataNodes in the
                      Hadoop cluster.
Replication Factor    By default, HDFS makes three copies of each data block and
                      stores them on different DataNodes/racks in the Hadoop cluster.
NameNode (NN)         A service (daemon) that maintains a directory of all files in
                      HDFS and tracks where data is stored in the HDFS cluster.
Secondary NameNode    Performs internal NameNode transaction log checkpointing.
DataNode (DN)         Stores the blocks ("chunks") of data for a set of files.
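The Blocks and Replication Factor definitions can be illustrated with a little arithmetic. The following is a minimal sketch, not HDFS code: the 128 MB block size is an assumption (the slides do not state one), and the function names are hypothetical.

```python
BLOCK_SIZE_MB = 128      # assumption: block size is not given in the slides
REPLICATION_FACTOR = 3   # three copies, as in the definition above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks that a file would occupy."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Total cluster storage consumed once every block is replicated."""
    return file_size_mb * replication


print(split_into_blocks(300))  # [128, 128, 44]
print(raw_storage_mb(300))     # 900
```

Under these assumptions, a 300 MB file becomes three blocks and consumes 900 MB of raw disk across the cluster.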
NameNode (NN)
Manages the file system namespace (metadata) and controls access to files by client applications
[Figure: NameNode metadata for the file movieplex1.log, which is split into blocks (chunks) A, B, and C.
File: movieplex1.log; Blocks: A, B, C; Data Nodes: 1, 2, 3; Replication Factor: 3.
Block locations: A: DN 1, DN 2, DN 3; B: DN 1, DN 2, DN 3; C: DN 1, DN 2, DN 3.]
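The metadata shown for movieplex1.log amounts to two lookups: file to blocks, and block to DataNodes. The sketch below models that with plain dictionaries. It is only an illustration of the idea. The variable and function names are hypothetical, and the real NameNode data structures are not described in these slides.

```python
# Hypothetical in-memory model of the NameNode metadata for one file.
namespace = {
    "movieplex1.log": ["A", "B", "C"],   # file -> ordered list of blocks
}
block_locations = {                      # block -> DataNodes holding a replica
    "A": ["DN1", "DN2", "DN3"],
    "B": ["DN1", "DN2", "DN3"],
    "C": ["DN1", "DN2", "DN3"],
}


def locate(filename):
    """Return (block, datanodes) pairs needed to read the file in order."""
    return [(block, block_locations[block]) for block in namespace[filename]]


for block, nodes in locate("movieplex1.log"):
    print(block, "->", ", ".join(nodes))
```

The NameNode hands out only this location information; the block contents themselves live on the DataNodes.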
Functions of the NameNode
Secondary NameNode (Non-HA)
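• Keeps a checkpointed copy of the NameNode's metadata by checkpointing the
transaction log; it is not a failover for the NameNode.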
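The checkpointing role listed for the Secondary NameNode in the key definitions can be pictured as replaying a log of namespace changes onto a saved image. The following is a conceptual sketch only: the operation names, paths, and tuple layout are hypothetical and do not reflect the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Apply logged namespace changes to a copy of the image and return it."""
    merged = dict(fsimage)               # leave the previous image untouched
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged


# Hypothetical starting image and transaction log.
fsimage = {"/logs/old.log": ["X"]}
edit_log = [
    ("create", "/logs/movieplex1.log", ["A", "B", "C"]),
    ("delete", "/logs/old.log", None),
]

new_image = checkpoint(fsimage, edit_log)
print(new_image)  # {'/logs/movieplex1.log': ['A', 'B', 'C']}
```

After a checkpoint, the log can start over from the new image, which keeps the log short and NameNode restarts faster.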
DataNodes (DN)
A DataNode is responsible for storing the actual data in HDFS.
[Figure: blocks A, B, and C distributed across several DataNodes, two blocks shown per node]
Functions of DataNodes
[Figure: a DataNode (slave node) holding blocks A, C, and B]
NameNode and Secondary NameNodes
[Figure: three DataNodes, each holding a replica of blocks A, B, and C]
Storing and Accessing Data Files in HDFS
[Figure: the file movieplex1.log is divided into blocks A, B, and C. The NameNode and the Secondary NameNode
(both master nodes) each show the same metadata: File: movieplex1.log; Blocks: A, B, C; Data Nodes: 1, 2, 3;
A: DN1, DN2, DN3; B: DN1, DN2, DN3; C: DN1, DN2, DN3. Each of the three DataNodes holds a replica of
blocks A, B, and C.]
Ack messages from the pipeline are sent back to the client as the blocks are copied.
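The write path in this figure, where blocks are copied along a pipeline of DataNodes and acknowledgments travel back to the client, can be simulated in a few lines. This is a simplified sketch: the function name and the in-memory `storage` dictionary are hypothetical, and real transfers stream packets over the network rather than appending to lists.

```python
def write_block(block, pipeline, storage):
    """Copy one block along the DataNode pipeline and return the acks.

    Data flows downstream (DN1 -> DN2 -> DN3); acks flow back upstream,
    so they are returned in reverse pipeline order.
    """
    for node in pipeline:
        storage.setdefault(node, []).append(block)
    return [(node, block) for node in reversed(pipeline)]


storage = {}
pipeline = ["DN1", "DN2", "DN3"]
for block in ["A", "B", "C"]:
    acks = write_block(block, pipeline, storage)
    print(block, "acked by", [node for node, _ in acks])

print(storage)
```

Once every block has been acknowledged, each DataNode in the sketch holds A, B, and C, matching the layout in the figure.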
HDFS Architecture: HA
Component                   Description
NameNode (Active) Daemon    Responsible for all client operations in the cluster
NameNode (Standby) Daemon   Acts as a slave or "hot" backup to the Active NameNode,
                            maintaining enough information to provide a fast failover if
                            necessary
DataNode Daemon             This is where the data is stored (HDFS) and processed
                            (MapReduce). This is a slave node.
Data Replication Rack-Awareness in HDFS
[Figure: blocks A, B, and C, each replicated on DataNodes spread across more than one rack]
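Rack-aware replication can be sketched as a placement function. The version below follows the commonly described default policy for three replicas: one on the writer's node, and two on different nodes of a single remote rack. It is a deterministic simplification, because the real policy picks nodes at random, and the rack names, node names, and function name are all hypothetical.

```python
def place_replicas(writer, topology):
    """Pick three DataNodes for one block.

    topology maps a rack name to the list of DataNodes in that rack.
    """
    # First replica: the node that is writing the block.
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Second and third replicas: two different nodes on one other rack.
    remote_rack = next(
        rack for rack in topology
        if rack != local_rack and len(topology[rack]) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer, second, third]


topology = {
    "rack1": ["DN1", "DN2"],
    "rack2": ["DN3", "DN4"],
}
print(place_replicas("DN1", topology))  # ['DN1', 'DN3', 'DN4']
```

Spreading replicas over two racks means the block survives the loss of an entire rack, while keeping two replicas on one rack limits cross-rack traffic during the write.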
Accessing HDFS
HDFS Commands
The File System Namespace:
The HDFS FS (File System) Shell Interface
• HDFS supports a traditional hierarchical file organization.
• You can use the FS shell command-line interface to
interact with the data in HDFS. The syntax of this
command set is similar to that of other shells (e.g., bash, csh).
– You can create, remove, rename, and move directories and files.
• You can invoke the FS shell as follows:
hadoop fs <args>
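When the FS shell is driven from a script, the `hadoop fs <args>` form maps naturally onto an argument list. The helper below is a minimal sketch under stated assumptions: `hadoop_fs` is a hypothetical name, the paths are placeholders, and by default it only builds the command line. It runs the command only if `dry_run=False` and a `hadoop` executable is found on the PATH.

```python
import shlex
import shutil
import subprocess


def hadoop_fs(*args, dry_run=True):
    """Build `hadoop fs <args>` and optionally run it; return the command line."""
    cmd = ["hadoop", "fs", *args]
    if not dry_run and shutil.which("hadoop"):
        subprocess.run(cmd, check=True)   # raises if the command fails
    return shlex.join(cmd)


# Placeholder paths, for illustration only.
print(hadoop_fs("-ls", "/"))
print(hadoop_fs("-mv", "/tmp/a.txt", "/tmp/b.txt"))
```

Passing the arguments as a list, rather than one formatted string, avoids quoting problems with paths that contain spaces.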
FS Shell Commands
Basic File System Operations: Examples
hadoop fs -ls
• For a file, returns the file's status in the following format:
– permissions number_of_replicas userid groupid
filesize modification_date modification_time
filename
• For a directory, returns a list of its direct children, as in
UNIX. A directory is listed as:
– permissions userid groupid modification_date
modification_time dirname
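The two listing layouts can be told apart by field count, which makes the output easy to post-process. The parser below is a sketch that follows the field order given above. The sample lines, user names, and function name are hypothetical; actual output can vary between Hadoop versions, and names containing spaces are not handled.

```python
def parse_ls_line(line):
    """Parse one listing line laid out in the field order described above."""
    fields = line.split()
    if len(fields) == 8:   # file entry: includes replica count and size
        perms, replicas, user, group, size, date, time, name = fields
        return {"type": "file", "permissions": perms, "replicas": int(replicas),
                "user": user, "group": group, "size": int(size),
                "modified": f"{date} {time}", "name": name}
    if len(fields) == 6:   # directory entry: no replica count or size
        perms, user, group, date, time, name = fields
        return {"type": "dir", "permissions": perms, "user": user,
                "group": group, "modified": f"{date} {time}", "name": name}
    raise ValueError(f"unrecognized listing line: {line!r}")


# Hypothetical sample lines, for illustration only.
file_entry = parse_ls_line(
    "-rw-r--r-- 3 oracle hadoop 1024 2014-06-01 10:15 curriculum/lab_05_01.txt")
dir_entry = parse_ls_line("drwxr-xr-x oracle hadoop 2014-06-01 10:14 curriculum")
print(file_entry["replicas"], file_entry["size"], dir_entry["type"])
```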
Create an HDFS directory named curriculum by using the mkdir command:
Copy lab_05_01.txt from the local file system to the curriculum HDFS
directory by using the copyFromLocal command:
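The two operations above correspond to the `-mkdir` and `-copyFromLocal` options of `hadoop fs`. The sketch below only assembles and prints the command lines. It assumes that the relative HDFS path `curriculum` resolves under the user's HDFS home directory and that `lab_05_01.txt` is in the current local directory; neither detail is stated in the slides.

```python
import shlex

# Assumptions: "curriculum" is relative to the user's HDFS home directory,
# and lab_05_01.txt sits in the current local working directory.
commands = [
    ["hadoop", "fs", "-mkdir", "curriculum"],
    ["hadoop", "fs", "-copyFromLocal", "lab_05_01.txt", "curriculum"],
]

for cmd in commands:
    print(shlex.join(cmd))
```

On a machine with a configured Hadoop client, each list could be passed to `subprocess.run(cmd, check=True)` to run it.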
Delete the curriculum HDFS directory by using the rm command. Use the -r option
to delete the directory and any content under it recursively:
Display the contents of the part-r-00000 HDFS file by using the cat command:
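The delete and display operations above use the `-rm -r` and `-cat` options. The sketch below wraps each in a small function that returns the argument list. The function names are hypothetical, and because the slides do not give the directory that holds `part-r-00000`, the file name is passed through as-is.

```python
import shlex


def rm_recursive(path):
    """Argument list that deletes an HDFS directory and everything under it."""
    return ["hadoop", "fs", "-rm", "-r", path]


def cat(path):
    """Argument list that prints the contents of an HDFS file."""
    return ["hadoop", "fs", "-cat", path]


print(shlex.join(rm_recursive("curriculum")))
print(shlex.join(cat("part-r-00000")))   # the file's directory is not given in the slides
```

Because `-rm -r` removes the directory and all of its contents, it is worth printing the command and checking the path before running it.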
Using the hdfs fsck Command: Example
HDFS Features and Benefits
Summary