Hadoop Week 2

The document provides information about connecting with Edureka's Hadoop training and support services, the topics covered in their 8-week Hadoop course, and an overview of some key Hadoop concepts and components. It includes contact details for 24x7 support via Skype, email and phone. It also lists the main topics covered each week of the course, including introductions to HDFS, MapReduce, Pig, Hive, HBase, Zookeeper and Sqoop.


Connect with us

• 24x7 Support on Skype, Email & Phone
• Skype ID – edureka.hadoop
• Email – [email protected]
• Call us – +91 88808 62004
• Venkat – [email protected]

Course Topics

• Week 1 – Introduction to HDFS
• Week 2 – Setting Up Hadoop Cluster
• Week 3 – Map-Reduce Basics, Types and Formats
• Week 4 – PIG
• Week 5 – HIVE
• Week 6 – HBASE
• Week 7 – ZOOKEEPER
• Week 8 – SQOOP

Recap of Week 1

• What is Big Data
• What is Hadoop
• Hadoop Eco-System Components
• Why DFS
• Features of HDFS
• Areas where HDFS is not a good fit
• Block Abstraction in HDFS
• HDFS Components:
  – NameNodes
  – DataNodes

Main Components of HDFS:

• NameNode:
  – master of the system
  – maintains and manages the blocks which are present on the DataNodes
  – single point of failure for the Hadoop cluster
  – manages block replication

• DataNodes:
  – slaves which are deployed on each machine and provide the actual storage
  – responsible for serving read and write requests from clients
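
A quick, hedged way to see both roles from a terminal on a running Hadoop 1.x cluster (assuming the hadoop command is on the PATH) is to ask the NameNode for its view of the blocks and the DataNodes that hold them:

# List every file, its blocks, and the DataNodes holding each replica
hadoop fsck / -files -blocks -locations

# Cluster-wide summary of live/dead DataNodes, capacity and remaining space
hadoop dfsadmin -report
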
Secondary NameNode

It is important to make the NameNode resilient to failure, and one technique for doing this is to run a Secondary NameNode.

The Secondary NameNode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the NameNode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the NameNode failing.

This node is also called a Checkpoint Node, as it manages the edits log and the check-pointing of NameNode metadata (once per hour, or when the edits log reaches 64 MB in size). Please note that it does not provide NameNode failover.

The Secondary NameNode copies the FSImage and the transaction log from the NameNode to a temporary directory, merges them into a new FSImage in that temporary location, and uploads the new FSImage to the NameNode.
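
As a hedged sketch (assuming a Hadoop 1.x installation, where the secondarynamenode command accepts these flags, and where fs.checkpoint.period and fs.checkpoint.size control the hourly/64 MB trigger), the checkpoint can be inspected or forced from the shell:

# Size of the edits log that the next checkpoint would merge
hadoop secondarynamenode -geteditsize

# Force a checkpoint now instead of waiting for the periodic/size trigger
hadoop secondarynamenode -checkpoint force
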
NameNode Metadata

Metadata in Memory
1. The entire metadata is held in main memory
2. No demand paging of FS metadata

Types of Metadata
1. List of files
2. List of blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g. access time, replication factor

A Transaction Log
1. Records file creations, file deletions, etc.
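
On a running cluster, a hedged way to snapshot these in-memory structures (Hadoop 1.x) is dfsadmin -metasave, which dumps block and DataNode information to a file under the Hadoop log directory:

# Dump the NameNode's block and DataNode metadata to a file in the log directory
hadoop dfsadmin -metasave metadata-snapshot.txt
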
JobTracker and TaskTracker:
HDFS Architecture
Job Tracker
Job Tracker Contd.
Job Tracker Contd.
HDFS Client Creates a New File
Rack Awareness
Anatomy of a File Write:
Anatomy of a File Read:
Terminal Commands
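
The original slides walk through these at the console; as a representative (not exhaustive) sketch, the everyday HDFS shell commands look like the following (the /user/edureka paths and file names are illustrative):

# Basic HDFS file-system operations
hadoop fs -ls /                                    # list the root of HDFS
hadoop fs -mkdir /user/edureka/input               # create a directory
hadoop fs -put localfile.txt /user/edureka/input   # copy a local file into HDFS
hadoop fs -cat /user/edureka/input/localfile.txt   # print a file stored in HDFS
hadoop fs -get /user/edureka/input/localfile.txt ./copy.txt   # copy back to local disk
hadoop fs -rmr /user/edureka/input                 # remove a directory recursively (Hadoop 1.x)
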
Web UI URLs

• NameNode status: http://localhost:50070/dfshealth.jsp
• JobTracker status: http://localhost:50030/jobtracker.jsp
• TaskTracker status: http://localhost:50060/tasktracker.jsp
• DataBlock Scanner Report: http://localhost:50075/blockScannerReport
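
A quick shell check that the daemons behind these pages are up (a sketch, assuming curl is installed and the cluster is running locally on the default ports):

# Expect HTTP 200 from each daemon's web UI
curl -s -o /dev/null -w "NameNode:    %{http_code}\n" http://localhost:50070/dfshealth.jsp
curl -s -o /dev/null -w "JobTracker:  %{http_code}\n" http://localhost:50030/jobtracker.jsp
curl -s -o /dev/null -w "TaskTracker: %{http_code}\n" http://localhost:50060/tasktracker.jsp
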
Sample Examples List
Running the Teragen Example
Checking the Output
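
The slides show these steps as console screenshots; a minimal hedged sketch of the same flow (the examples jar name and the output path vary by installation and are assumptions here):

# Generate 1,000,000 rows of 100-byte records (about 100 MB) into HDFS
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 1000000 /user/edureka/teragen-out

# Check the output: list the generated part files and report the total size
hadoop fs -ls /user/edureka/teragen-out
hadoop fs -dus /user/edureka/teragen-out
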
Deployment Modes

• Standalone or Local Mode
  – No daemons running
  – Everything runs in a single JVM
  – Good for development and debugging

• Pseudo-distributed Mode
  – All daemons run on a single machine: a cluster simulation on one machine
  – Good for a test environment

• Fully Distributed Mode
  – Hadoop running on multiple machines in a cluster
  – Production environment
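
For the pseudo-distributed case, a hedged sketch of bringing the daemons up and verifying them (assumes Hadoop 1.x with its bin/ scripts and the JDK's jps tool on the PATH):

# Start the HDFS and MapReduce daemons on this machine
start-all.sh

# Verify: expect NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker
jps

# Stop everything when finished
stop-all.sh
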
Folder View of Hadoop
Hadoop Configuration Files

Configuration Filename – Description

hadoop-env.sh – Environment variables that are used in the scripts that run Hadoop
core-site.xml – Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml – Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes
mapred-site.xml – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
masters – A list of machines (one per line) that each run a secondary namenode
slaves – A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties – Properties for controlling how metrics are published in Hadoop
log4j.properties – Properties for system log files, the namenode audit log and the task log for the tasktracker child process

DD for each component

Core – core-site.xml
HDFS – hdfs-site.xml
MapReduce – mapred-site.xml

core-site.xml and hdfs-site.xml

hdfs-site.xml:

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

Defining HDFS details in hdfs-site.xml

dfs.data.dir (value: /disk1/hdfs/data,/disk2/hdfs/data)
  A list of directories where the datanode stores blocks. Each block is stored in only one of these directories. Default: ${hadoop.tmp.dir}/dfs/data

fs.checkpoint.dir (value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary)
  A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list. Default: ${hadoop.tmp.dir}/dfs/namesecondary

mapred-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>

Defining mapred-site.xml

mapred.job.tracker (value: localhost:8021)
  The hostname and port that the jobtracker's RPC server runs on. If set to the default value of "local", the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case; in fact, you will get an error if you try to start it in this mode).

mapred.local.dir (value: ${hadoop.tmp.dir}/mapred/local)
  A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

mapred.system.dir (value: ${hadoop.tmp.dir}/mapred/system)
  The directory, relative to fs.default.name, where shared files are stored during a job run.

mapred.tasktracker.map.tasks.maximum (value: 2)
  The number of map tasks that may be run on a tasktracker at any one time.

mapred.tasktracker.reduce.tasks.maximum (value: 2)
  The number of reduce tasks that may be run on a tasktracker at any one time.

Critical Properties

• fs.default.name
• hadoop.tmp.dir
• mapred.job.tracker

Slaves and masters

Two files are used by the startup and shutdown commands:

slaves
• contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers

masters
• contains a list of hosts, one per line, that are to host secondary NameNode servers
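
A hedged example of creating these two files from the shell (the hostnames are purely illustrative):

# conf/masters – the host that will run the secondary NameNode
echo "secondary.example.com" > conf/masters

# conf/slaves – each listed host runs a DataNode and a TaskTracker
cat > conf/slaves <<'EOF'
slave1.example.com
slave2.example.com
slave3.example.com
EOF
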
Per-process runtime environment

hadoop-env.sh file:

• Sets the per-process runtime environment, including the JVM settings used by each Hadoop server.
• This file also offers a way to provide custom parameters for each of the servers.
• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
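
Since hadoop-env.sh is itself a shell script, a hedged sketch of typical entries (the values shown are illustrative, not recommendations):

# conf/hadoop-env.sh – sourced by the Hadoop start-up scripts
export JAVA_HOME=/usr/lib/jvm/java-6-sun                       # JVM used by all daemons
export HADOOP_HEAPSIZE=1000                                    # heap size, in MB, for each daemon
export HADOOP_LOG_DIR=/var/log/hadoop                          # where daemon logs are written
export HADOOP_NAMENODE_OPTS="-Xmx2g ${HADOOP_NAMENODE_OPTS}"   # extra JVM options for one daemon
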
Reporting

hadoop-metrics.properties
• This file controls the reporting of metrics
• The default is not to report

Network Requirements

Hadoop Core:
• uses SSH to launch the server processes on the slave nodes
• requires a password-less SSH connection between the master and all slave and secondary machines
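
A hedged sketch of setting this up (the hduser account and the slave hostname are assumptions; for a pseudo-distributed setup only the localhost steps are needed):

# Generate a key pair with no passphrase, if one does not already exist
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Pseudo-distributed: authorise the key for this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Fully distributed: copy the public key to every slave and secondary machine
ssh-copy-id hduser@slave1.example.com

# Verify: this should log in without prompting for a password
ssh localhost exit
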
NameNode Recovery

1. Shut down the Secondary NameNode.
2. Copy the contents of the secondary's fs.checkpoint.dir to the namenode's dfs.name.dir.
3. Copy the contents of the secondary's fs.checkpoint.edits to the namenode's dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the Secondary NameNode.
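
A hedged shell sketch of steps 2–4 (the directory paths are illustrative; use the values actually configured for your cluster, and note that fs.checkpoint.edits normally defaults to the same directory as fs.checkpoint.dir):

# Illustrative paths only
CHECKPOINT_DIR=/disk1/hdfs/namesecondary   # fs.checkpoint.dir from the secondary
NAME_DIR=/disk1/hdfs/name                  # dfs.name.dir on the namenode

# Steps 2-3: copy the checkpointed image (and edits) into the namenode's directory
cp -r ${CHECKPOINT_DIR}/current/. ${NAME_DIR}/current/

# Step 4: bring the NameNode back up, then restart the secondary
hadoop-daemon.sh start namenode
hadoop-daemon.sh start secondarynamenode
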
Clarifications

Q & A?
Thank You
See You in Class Next Week
