
MODULE-1

• HDFS Basics
• Running example programs and benchmarks
• Hadoop MapReduce Framework
• MapReduce Programming
OUTLINE
 Compiling and Running the Hadoop WordCount Example
 Using the Streaming Interface
 Using the Pipes Interface
 Compiling and Running the Hadoop Grep Chaining Example
 Debugging MapReduce
MAPREDUCE PROGRAMMING
 WordCount.java
 To compile and run the program from the command line, perform the following steps:
1. Make a local wordcount_classes directory
$ mkdir wordcount_classes

2. Compile the WordCount.java program, using the `hadoop classpath` command to include all the available Hadoop class paths. Note that the quotes are backquotes (command substitution), not single quotes.
$ javac -cp `hadoop classpath` -d wordcount_classes WordCount.java

3. The jar file can be created using the following command (note the trailing dot, which tells jar to include everything under wordcount_classes):
$ jar -cvf wordcount.jar -C wordcount_classes/ .
4. To run the example, create an input directory in HDFS and place a text file in the new directory.
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input

5. Run the WordCount application using the following command:
$ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
 On success, the war-and-peace-output directory will contain the job output files: typically a _SUCCESS marker and one part-r-NNNNN file per reducer (here, part-r-00000).
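For instance, they can be listed with the standard HDFS listing command:
$ hdfs dfs -ls war-and-peace-output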

 The complete list of word counts can be copied from HDFS to the local file system as follows:
$ hdfs dfs -get war-and-peace-output/part-r-00000

 The output directory and all its contents can be removed with the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
USING THE STREAMING INTERFACE
 The Apache Hadoop streaming interface enables any program to use the MapReduce engine.
 The streaming interface works with any program that can read from stdin and write to stdout.
 In streaming mode, only the mapper and the reducer are created by the user.
 In this approach the mapper and reducer can be easily tested from the command line; a sketch of both scripts follows.
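A minimal pair of word-count streaming scripts might look like the following (a sketch only; the file names mapper.py and reducer.py match the run command used later, but the exact scripts are not part of the original slides):

mapper.py:
#!/usr/bin/env python
# Read lines from stdin and emit one <word TAB 1> pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

reducer.py:
#!/usr/bin/env python
# Sum the counts for each word. Hadoop delivers mapper output sorted
# by key, so all pairs for a given word arrive consecutively.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Because both scripts simply read stdin and write stdout, they can be tested locally without Hadoop:
$ chmod +x mapper.py reducer.py
$ cat war-and-peace.txt | ./mapper.py | sort | ./reducer.py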
 To run this application using Hadoop, create a
directory and move the war-and-peace.txt input
file into HDFS
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input

 Before running, the output directory must be removed:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

 Locate the hadoop-streaming.jar file in the Hadoop installation. The following command line uses mapper.py and reducer.py to do a word count on the input file.
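A typical streaming invocation looks like the following (a sketch; the hadoop-streaming jar path shown here is an assumption and varies by distribution and version):

$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -input war-and-peace-input \
  -output war-and-peace-output \
  -mapper mapper.py \
  -reducer reducer.py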
USING THE PIPES INTERFACE
 Pipes is a library that allows C++ source code to be used for the mapper and reducer.
 Both key and value inputs to Pipes programs are provided as strings.
 The program must define an instance of the mapper and an instance of the reducer.
 A program uses Pipes by extending the Pipes Mapper and Reducer classes.
 The executable must be placed in HDFS.
 The program can be compiled with the following line (assuming the Hadoop Pipes headers and libraries are available):
$ g++ wordcount.cpp -o wordcount

 Create the war-and-peace-input directory in HDFS and copy the input file:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input

 The executable must be placed in HDFS, and the output directory must be removed before running:
$ hdfs dfs -put wordcount bin
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

 Run the program as shown below. On success, the output is placed in the war-and-peace-output directory.
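A typical run command looks like the following (a sketch using the standard Pipes options; the two -D properties tell Pipes to use the Java record reader and writer):

$ mapred pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input war-and-peace-input \
  -output war-and-peace-output \
  -program bin/wordcount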
COMPILING AND RUNNING THE HADOOP
GREP CHAINING EXAMPLE
 The Hadoop Grep.java example extracts matching
strings from text files and counts how many times
they occurred.

 It doesn't display the complete matching line, only the matching string, so to display lines matching "foo", use .*foo.* as the regular expression.
 The program runs two MapReduce jobs in sequence.
 The first job counts how many times a matching string occurred in the input, and the second job sorts the matching strings by their frequency and stores the output in a single output file.
 Each mapper of the first job takes a line as input and matches it against the user-provided regular expression. It extracts all matching strings and emits <matching string, 1> pairs.
 The RegexMapper class is used to perform this task.
 Each reducer sums the frequencies of each matching string. The output is sequence files containing the matching string and its count.
 The reduce phase is optimized by running a combiner that sums the frequencies of strings from the local map output. As a result, it reduces the amount of data that needs to be shipped to a reduce task.
 The LongSumReducer class is used to perform this task.
 The second job takes the output of the first job as its input, sorts the matching strings by their frequencies, and stores the output in a single file.
 The mapper is an inverse map that reverses (swaps) its input <key, value> pair to a <value, key> pair.
 The reducer is an IdentityReducer. The number of reducers is one, so the output is stored in one file, sorted by count in descending order.
 The output is a text file that contains a count and a string per line.
 Compile and run Grep.java
1. Create a directory
$ mkdir Grep_classes

2. Compile Grep.java using the following line (again, note the backquotes):
$ javac -cp `hadoop classpath` -d Grep_classes Grep.java

3. Create a Java archive using the following command (note the trailing dot):
$ jar -cvf Grep.jar -C Grep_classes .

4. Create a directory in HDFS and move war-and-peace.txt into HDFS.
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
5. Ensure that the output directory is removed before the run:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

6. The following command will run the program:
$ hadoop jar Grep.jar org.apache.hadoop.examples.Grep war-and-peace-input war-and-peace-output Kutuzov

7. The result can be found in the output file:
$ hdfs dfs -cat war-and-peace-output/part-r-00000
530 Kutuzov
DEBUGGING MAPREDUCE
 The best advice for debugging parallel MapReduce applications is: don't.
 Debugging on a distributed system is hard and should be avoided at all costs.
 The best approach is to make sure the application runs on a simpler system with a small data set (see the example after this list).
 Errors are much easier to locate and track there. Unit testing is important.
 If the application runs successfully on a single system with a subset of the data, it can then be run in parallel.
 Another approach is to use the application logs to inspect its progress.
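 A quick way to build such a small test set is to slice the real input (an illustrative command, not part of the original steps):
$ head -1000 war-and-peace.txt > war-and-peace-small.txt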
Listing, Killing, and Job Status
 Jobs can be managed using the mapred job command.
 The options are -list, -kill, and -status; see the examples below.
 The yarn application command can be used to control all applications running on the cluster.
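For instance (the job and application IDs are placeholders):
$ mapred job -list
$ mapred job -status <job-id>
$ mapred job -kill <job-id>
$ yarn application -list
$ yarn application -kill <application-id>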

Hadoop Log Management
 The MapReduce logs provide a listing for both mappers and reducers.
 The log output consists of 3 files: stdout, stderr, and syslog.
 There are 2 modes for log storage.
 The first method is log aggregation, in which the logs are aggregated in HDFS.
 The result is displayed using the yarn logs command.
 In the second method, if log aggregation is disabled, the logs are kept locally on the cluster nodes where the mappers and reducers ran.

 The location of the unaggregated local logs is given by the yarn.nodemanager.log-dirs property in yarn-site.xml.
 Log aggregation is highly recommended.

Enabling YARN Log Aggregation
 To manually enable log aggregation, follow these steps:
1. Create the following directory in HDFS:
$ hdfs dfs -mkdir -p /yarn/logs
$ hdfs dfs -chown -R yarn:hadoop /yarn/logs
$ hdfs dfs -chmod -R g+rw /yarn/logs

2. Add the following properties to yarn-site.xml and restart all YARN services.
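The two relevant properties enable aggregation and point it at the directory created above (a sketch; these are the standard YARN property names):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/yarn/logs</value>
</property>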
Command-Line Log Viewing
 MapReduce logs can also be viewed from the command line.
 The yarn logs command enables the logs to be viewed:
$ yarn logs
 The main options to yarn logs are -applicationId <application ID>, -appOwner <application owner>, -containerId <container ID>, and -nodeAddress <node address>.


Example
 Run the pi example program:
$ hadoop jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 100000

 Find the application ID using:
$ yarn application -list -appStates FINISHED

 Run the following command to produce a dump of all the logs for the application:
$ yarn logs -applicationId application_143266…_001 > AppOut

 The AppOut file can be inspected. The list of containers is found in the file, and a specific container can be examined, as shown below.
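For example, an individual container's logs can be dumped with a command like the following (a sketch; the container ID and node address are hypothetical placeholders that would be taken from the AppOut listing, and in Hadoop 2 the -containerId option must be paired with -nodeAddress):
$ yarn logs -applicationId application_143266…_001 \
  -containerId container_143266…_001_01_000001 \
  -nodeAddress node1:45454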
