05 - Introduction To HDFS

HDFS is designed for batch processing of large datasets across commodity hardware. It breaks files into blocks and replicates them across DataNodes for fault tolerance. The NameNode manages metadata and block placement, while DataNodes store the blocks. The Secondary NameNode offloads checkpointing from the NameNode but is not a failover node. Clients access data through the NameNode and DataNodes.

Uploaded by

Jose Evanan

Introduction to the Hadoop Distributed File System (HDFS)


Course Road Map

Module 1: Big Data Management System
Module 2: Data Acquisition and Storage
Module 3: Data Access and Processing
Module 4: Data Unification and Analysis
Module 5: Using and Managing Oracle Big Data Appliance

Lesson 5: Introduction to the Hadoop Distributed File System (HDFS)
Lesson 6: Acquire Data using CLI, Fuse-DFS, and Flume
Lesson 7: Acquire and Access Data Using Oracle NoSQL Database
Lesson 8: Primary Administrative Tasks for Oracle NoSQL Database

5-2
Objectives

After completing this lesson, you should be able to:


• Describe the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5-3
Agenda

• Understand the architectural components of HDFS


• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5-4
HDFS: Characteristics
HDFS is designed for batch processing rather than interactive use by users. It uses a scale-out model based on inexpensive commodity servers with internal disks, rather than RAID, to achieve large-scale storage.

• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data
• Can be built out of commodity hardware

5-5
HDFS Deployments:
High Availability (HA) and Non-HA
• Non-HA Deployment:
– Uses the NameNode/Secondary NameNode architecture
– The Secondary NameNode is not a failover for the
NameNode.
– The NameNode was the Single Point of Failure (SPOF) of
the cluster before Hadoop 2.0 and CDH 4.0.
• HA Deployment:
– Active NameNode
– Standby NameNode
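On an HA deployment, the role of each NameNode daemon can be queried from the command line. A minimal sketch, assuming an HA-enabled cluster; nn1 and nn2 are hypothetical NameNode IDs (check dfs.ha.namenodes.* in hdfs-site.xml for the real ones):

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Prints "active" or "standby" for each NameNode daemon.
# nn1/nn2 are hypothetical IDs; the command fails on a non-HA cluster.
hdfs haadmin -getServiceState nn1 || true
hdfs haadmin -getServiceState nn2 || true
```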

5-7
HDFS Key Definitions

Cluster: A group of servers (nodes) on a network that are configured to work together. A server is either a master node or a slave (worker) node.
Hadoop: A batch processing infrastructure that stores files and distributes work across a group of servers (nodes).
Hadoop Cluster: A collection of racks containing master and slave nodes.
Blocks: HDFS breaks a data file into blocks, or "chunks," and stores the blocks on different slave DataNodes in the Hadoop cluster.
Replication Factor: The number of copies HDFS keeps of each block (three by default), stored on different DataNodes/racks in the Hadoop cluster.
NameNode (NN): A service (daemon) that maintains a directory of all files in HDFS and tracks where data is stored in the cluster.
Secondary NameNode: Performs internal NameNode transaction log checkpointing.
DataNode (DN): Stores the blocks ("chunks") of data for a set of files.
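The block size and replication factor in the table are cluster configuration values that can be read back with getconf; a sketch, assuming the hdfs client is on the PATH:

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Default block size in bytes (134217728 = 128 MB on recent distributions)
hdfs getconf -confKey dfs.blocksize
# Default number of replicas per block (3 by default)
hdfs getconf -confKey dfs.replication
```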

5-8
NameNode (NN)
Manages the file system namespace (metadata) and controls access to files by client applications

[Diagram] NameNode metadata for the file movieplex1.log, which is split into blocks (chunks) A, B, and C:
  Blocks: A, B, C
  DataNodes: 1, 2, 3
  Replication Factor: 3
  A: DN 1, DN 2, DN 3
  B: DN 1, DN 2, DN 3
  C: DN 1, DN 2, DN 3
5-9
Functions of the NameNode

• Acts as the repository for all HDFS metadata
• Maintains the file system namespace
• Executes the directives for opening, closing, and renaming files and directories
• Stores the HDFS state in an image file (fsimage)
• Stores file system modifications in an edit log file (edits)
• On startup, merges the fsimage and edits files, and then empties edits
• Places replicas of blocks on multiple racks for fault tolerance
• Records the number of replicas (replication factor) of a file specified by an application
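As the last point notes, the replication factor is a per-file property that a client can change. A sketch using setrep; the path is hypothetical:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Set the replication factor of a file to 2; -w waits until re-replication completes.
# The path below is hypothetical -- substitute a file that exists in your cluster.
hadoop fs -setrep -w 2 /user/oracle/curriculum/lab_05_01.txt || true
```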

5 - 10
Secondary NameNode (Non-HA)
The Secondary NameNode maintains a checkpointed copy of the NameNode metadata; it is not a hot backup or failover node.

[Diagram] The NameNode and Secondary NameNode hold the same metadata for movieplex1.log: Blocks A, B, C; DataNodes 1, 2, 3; Replication Factor 3; each block replicated on DN 1, DN 2, and DN 3.
5 - 11
DataNodes (DN)
DataNode is responsible for storing the actual data in HDFS.

[Diagram] movieplex1.log is 350 MB and the block size is 128 MB, so the client chunks the file into three blocks: A (128 MB), B (128 MB), and C (94 MB). The NameNode (master) records the metadata (Blocks: A, B, C; DataNodes: 1, 2, 3; Replication Factor: 3; each block on DN 1, DN 2, DN 3), and the blocks themselves are stored on the slave DataNodes.

5 - 12
Functions of DataNodes

DataNodes perform the following functions:

• Serving read and write requests from file system clients
• Performing block creation, deletion, and replication based on instructions from the NameNode
• Providing simultaneous send/receive operations to other DataNodes during replication ("replication pipelining")

[Diagram] A DataNode on a slave node, storing blocks A, B, and C.

5 - 13
NameNode and Secondary NameNodes

[Diagram] The NameNode and Secondary NameNode (masters) both hold the metadata for movieplex1.log (Blocks: A, B, C; DataNodes: 1, 2, 3; Replication Factor: 3). The client chunks the 350 MB file (block size 128 MB) into blocks A (128 MB), B (128 MB), and C (94 MB), which are distributed across the slaves:

  DataNode 1 (slave): A, C, B
  DataNode 2 (slave): B, A, C
  DataNode 3 (slave): C, B, A

5 - 14
Storing and Accessing Data Files in HDFS
[Diagram] The client writes movieplex1.log to HDFS. The NameNode and Secondary NameNode (masters) record the metadata (Blocks: A, B, C; DataNodes: 1, 2, 3; each block replicated on DN 1, DN 2, DN 3). The blocks are copied to the slave DataNodes through a replication pipeline, and acknowledgment messages from the pipeline are sent back to the client once the blocks are copied:

  DataNode 1 (slave): A, C, B
  DataNode 2 (slave): B, A, C
  DataNode 3 (slave): C, B, A

5 - 15
HDFS Architecture: HA

Component Description
NameNode Responsible for all client operations in the cluster
(Active) Daemon
NameNode Acts as a slave or "hot" backup to the Active NameNode,
(Standby) Daemon maintaining enough information to provide a fast failover if
necessary
DataNode Daemon This is where the data is stored (HDFS) and processed
(MapReduce). This is a slave node.

HA is available in Hadoop 2.0 and later (CDH 4.0 and later). The Active and Standby NameNode daemons each run on a master node; DataNode daemons run on slave nodes.

5 - 16
Data Replication Rack-Awareness in HDFS
[Diagram] With a replication factor of 3, the replicas of blocks A, B, and C are spread across Racks 1, 2, and 3 so that the failure of any single rack does not lose all copies of a block.
5 - 17
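The rack assignments the NameNode is using for its DataNodes can be inspected directly; a sketch, assuming HDFS administrator rights:

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Print which rack each live DataNode has been mapped to (requires admin rights).
hdfs dfsadmin -printTopology || true
```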
Accessing HDFS

5 - 18
Agenda

• Understand the architectural components of HDFS


• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5 - 19
HDFS Commands

5 - 20
The File System Namespace:
The HDFS FS (File System) Shell Interface
• HDFS supports a traditional hierarchical file organization.
• You can use the FS shell command-line interface to
interact with the data in HDFS. The syntax of this
command set is similar to other shells (e.g., bash, csh)
– You can create, remove, rename, and move directories/files.
• You can invoke the FS shell as follows:
hadoop fs <args>

• The general command-line syntax is as follows:


hadoop command [genericOptions] [commandOptions]
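For example, the FS shell lists its commands and per-command usage; a sketch, assuming the hadoop client is on the PATH:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
hadoop fs -help       # list all FS shell commands with descriptions
hadoop fs -help ls    # usage for a single command
```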

5 - 21
FS Shell Commands

5 - 22
Basic File System Operations: Examples
hadoop fs -ls

• For a file, it returns stat on the file with the following format:
– permissions number_of_replicas userid groupid filesize modification_date modification_time filename
• For a directory, it returns a list of its direct children, as in UNIX. A directory is listed as:
– permissions userid groupid modification_date modification_time dirname
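A sketch of the command itself (/user/oracle is a hypothetical home directory; with no path argument, -ls lists the current user's home directory):

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# File lines show: permissions replicas owner group size date time name
# Directory lines show: permissions owner group date time name
hadoop fs -ls /user/oracle || true   # hypothetical path
```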

5 - 23
Basic File System Operations: Examples
Create an HDFS directory named curriculum by using the mkdir command:

Copy lab_05_01.txt from the local file system to the curriculum HDFS
directory by using the copyFromLocal command:
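A sketch of the two commands, assuming lab_05_01.txt exists in the current local directory:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Create the directory in the user's HDFS home (e.g., /user/oracle/curriculum)
hadoop fs -mkdir curriculum
# Copy a local file into the new HDFS directory
hadoop fs -copyFromLocal lab_05_01.txt curriculum
```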

5 - 24
Basic File System Operations: Examples

Delete the curriculum HDFS directory by using the rm command. Use the -r option
to delete the directory and any content under it recursively:
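A sketch, assuming the curriculum directory from the earlier example exists:

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# -r removes the directory and everything under it recursively
hadoop fs -rm -r curriculum || true   # ignore the error if it does not exist
```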

Display the contents of the part-r-00000 HDFS file by using the cat command:
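A sketch (part-r-00000 is the conventional name of a MapReduce reducer output file; the path is hypothetical):

```shell
command -v hadoop >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Print the file's contents to stdout
hadoop fs -cat wordcount/part-r-00000 || true   # hypothetical path
```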

5 - 25
Using the hdfs fsck Command: Example
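A sketch of a typical invocation (the path is hypothetical; fsck may require HDFS superuser rights):

```shell
command -v hdfs >/dev/null 2>&1 || exit 0   # skip if no Hadoop client is installed
# Check the health of everything under the path: files, block IDs, and replica locations
hdfs fsck /user/oracle -files -blocks -locations || true   # hypothetical path
```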

5 - 26
HDFS Features and Benefits

HDFS provides the following features and benefits:


• A Rebalancer to evenly distribute data across the
DataNodes
• A file system checking utility (fsck) to perform health
checks on the file system
• Procedures for upgrade and rollback
• A secondary NameNode to enable recovery and keep the
edits log file size within a limit
• A Backup Node to keep an in-memory copy of the
NameNode contents

5 - 27
Summary

In this lesson, you should have learned how to:


• Describe the architectural components of HDFS
• Use the FS shell command-line interface (CLI) to interact
with data stored in HDFS

5 - 28
