Hadoop Week 3

The document provides information about connecting with the Edureka Hadoop training team through various communication channels like Skype, email, and phone. It also outlines the topics that will be covered in each of the 8 weeks of the Hadoop training course, including introductory topics in week 1, setting up Hadoop clusters in week 2, MapReduce basics in week 3, and tools like Hive, HBase, Zookeeper and Sqoop in subsequent weeks.


Connect with us

• 24x7 Support on Skype, Email & Phone
• Skype ID – edureka.hadoop
• Email – [email protected]
• Call us – +91 88808 62004
• Venkat – [email protected]
Course Topics

• Week 1 – Introduction to HDFS
• Week 2 – Setting Up Hadoop Cluster
• Week 3 – Map-Reduce Basics, types and formats
• Week 4 – PIG
• Week 5 – HIVE
• Week 6 – HBASE
• Week 7 – ZOOKEEPER
• Week 8 – SQOOP
Recap of Week 2

HDFS Components
HDFS Architecture
Anatomy of File Write and File Read
Job Tracker working
Hadoop Command
Web UI links
Listing of the contents of the examples jar file
Sample Examples List
Running the Teragen Example
Checking the Output
Checking the Output
Deployment Modes
• Standalone or Local Mode
– No daemons running
– Everything runs in a single JVM
– Good for development
• Pseudo-distributed Mode
– All daemons run on a single machine: a cluster simulation on one machine
– Good for a test environment
• Fully Distributed Mode
– Hadoop runs on multiple machines in a cluster
– Production environment
Folder View of Hadoop
Hadoop Configuration Files
Configuration Filename – Description
hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop
core-site.xml – Configuration settings for Hadoop Core, such as I/O settings, that are common to HDFS and MapReduce
hdfs-site.xml – Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes
mapred-site.xml – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
masters – A list of machines (one per line) that each run a secondary namenode
slaves – A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties – Properties for controlling how metrics are published in Hadoop
log4j.properties – Properties for system log files, the namenode audit log and the task log for the tasktracker child process
JDK Location
Hadoop-env.sh
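The JDK location is set via JAVA_HOME in conf/hadoop-env.sh. A minimal sketch, assuming an OpenJDK 6 install (the path below is an assumption; point it at the JDK on your machine):

# conf/hadoop-env.sh -- assumed JDK location, adjust to your installation
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk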
DD for each component
• Core – core-site.xml
• HDFS – hdfs-site.xml
• MapReduce – mapred-site.xml
core-site.xml and hdfs-site.xml
hdfs-site.xml:

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

core-site.xml:

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>
Defining HDFS details in hdfs-site.xml
Property – Value – Description
dfs.name.dir – /disk1/hdfs/name,/remote/hdfs/name – The list of directories where the namenode stores its metadata. Default: ${hadoop.tmp.dir}/dfs/name
dfs.data.dir – /disk1/hdfs/data,/disk2/hdfs/data – A list of directories where the datanode stores blocks. Default: ${hadoop.tmp.dir}/dfs/data
fs.checkpoint.dir – /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary – A list of directories where the secondary namenode stores checkpoints. Default: ${hadoop.tmp.dir}/dfs/namesecondary
mapred-site.xml

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Defining mapred-site.xml
Property – Value – Description
mapred.job.tracker – localhost:8021 – The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process on demand when you run a MapReduce job.
mapred.local.dir – ${hadoop.tmp.dir}/mapred/local – A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.
mapred.system.dir – ${hadoop.tmp.dir}/mapred/system – The directory, relative to fs.default.name, where shared files are stored during a job run.
mapred.tasktracker.map.tasks.maximum – 2 – The number of map tasks that may be run on a tasktracker at any one time.
mapred.tasktracker.reduce.tasks.maximum – 2 – The number of reduce tasks that may be run on a tasktracker at any one time.
Critical Properties
• fs.default.name
• hadoop.tmp.dir
• mapred.job.tracker
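A minimal sketch of where these critical properties typically live in a pseudo-distributed setup (the hadoop.tmp.dir value is an assumption; fs.default.name and mapred.job.tracker reuse the values from the earlier slides):

core-site.xml:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>  <!-- assumed base directory for Hadoop's default storage -->
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>

mapred-site.xml:
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>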
Slaves and masters
Two files are used by the startup and shutdown commands:

• slaves – contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers
• masters – contains a list of hosts, one per line, that are to host secondary NameNode servers
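For illustration, a possible conf/slaves and conf/masters for a small cluster (the hostnames node1, node2 and node3 are hypothetical):

conf/slaves:
node1
node2
node3

conf/masters:
node1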
Per-process runtime environment

hadoop-env.sh file:
• Configures the JVM environment (e.g. JAVA_HOME) for each Hadoop server process.
• This file also offers a way to provide custom parameters for each of the servers.
• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
Reporting

hadoop-metrics.properties
• This file controls the reporting
• The default is not to report
Network Requirements

Hadoop Core:
• Uses Secure Shell (SSH) to launch the server processes on the slave nodes
• Requires a password-less SSH connection between the master and all slave and secondary machines
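A minimal sketch of setting up password-less SSH for the Hadoop user (assumes OpenSSH; on a multi-node cluster the public key must also be appended to ~/.ssh/authorized_keys on every slave and secondary machine):

# generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# authorize the key locally (copy it to the other machines for a real cluster)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# verify: this should log in without prompting for a password
ssh localhost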
Namenode Recovery

1. Shut down the secondary NameNode
2. Copy the contents of the secondary's fs.checkpoint.dir to the NameNode's dfs.name.dir
3. Copy the contents of the secondary's fs.checkpoint.edits directory to the NameNode's dfs.name.edits.dir
4. When the copy completes, start the NameNode and restart the secondary NameNode
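A rough shell sketch of these steps, assuming a Hadoop 1.x layout and the example directories from the hdfs-site.xml slide (/disk2/hdfs/namesecondary as the checkpoint dir, /disk1/hdfs/name as the rebuilt name dir); treat it as illustrative only:

# 1. stop the secondary namenode
bin/hadoop-daemon.sh stop secondarynamenode
# 2./3. copy the checkpoint data into the (empty) namenode directories
cp -r /disk2/hdfs/namesecondary/* /disk1/hdfs/name/
# 4. start the namenode, then restart the secondary namenode
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start secondarynamenode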
Hadoop Daemon Port Details
Sample Input File
Ubuntu File System to HDFS
Files in Name Node before Copy
Cluster Summary after File Copy
Command for confirming the file copy
NameNode Status Check
File System View
MR Command Output
Map Reduce Command Output
View the Output
View the Output
MR Flow
Map Reduce Process
Mapper
• Reads data from the input split as per the input format
• Denoted as Mapper<k1,v1, k2,v2>
• k1,v1 are the key-value pair of the input data
• k2,v2 are the key-value pair of the output data
• Mapper API
• public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable>
• <LongWritable, Text> key-value pair input to the mapper
• <Text, IntWritable> key-value pair output of the mapper
• Override the map() method
• public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
Reducer
• Processes data from the mapper output
• Denoted as Reducer<k3,list<v3>, k4,v4>
• k3, list<v3> are the key and the list of values for that key as input data
• k4,v4 are the key-value pair of the output data
• Reducer API
• public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
• <Text, list of IntWritable> key and list of values as input to the reducer
• <Text, IntWritable> key-value pair output of the reducer
• Override the reduce() method
• public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
Mapper
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // one input record = one line of text
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // emit (word, 1) for every token in the line
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, new IntWritable(1));
        }
    }
}
Reducer
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum the counts emitted for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Driver
• Configures and runs the mapper and reducer programs on a dataset
– Configuration conf = new Configuration();
– Job job = new Job(conf, "Word Counter");
• Sets the input and output formats and locations
– FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
– FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
• Sets the mapper and reducer programs
– job.setMapperClass(MyMapper.class);
– job.setReducerClass(MyReducer.class);
Driver

• Specifies the mapper output types
– job.setMapOutputKeyClass(Text.class);
– job.setMapOutputValueClass(IntWritable.class);
• Specifies the reducer output types
– job.setOutputKeyClass(Text.class);
– job.setOutputValueClass(IntWritable.class);
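Putting the driver fragments together, a minimal word-count driver sketch (the class name WordCount is illustrative, and it assumes MyMapper and MyReducer from the earlier slides are nested inside it):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // MyMapper and MyReducer from the previous slides go here as static nested classes

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = new Job(conf, "Word Counter");
        job.setJarByClass(WordCount.class);

        // mapper and reducer
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // final (reducer) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // input and output locations from the command line
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}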
Input splits
Input Files

TextInputFormat is the default InputFormat implementation.

Input Formats
• TextInputFormat – text files; each line in the file is a record
• KeyValueTextInputFormat – text files
• SequenceFileInputFormat – sequence files
• NLineInputFormat
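The input format is chosen in the driver. A small fragment, assuming the Job object from the driver slides (TextInputFormat is what you get when nothing is set):

// the default, shown explicitly for illustration
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
// e.g. to read Hadoop sequence files instead:
// job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class);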
Mapping Phase

• MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS.
• These files are evenly distributed across all our nodes.
• Running a MapReduce program involves running mapping
tasks on many or all of the nodes in our cluster. Each of
these mapping tasks is equivalent: no mappers have
particular "identities" associated with them. Therefore, any
mapper can process any input file.
• Each mapper loads the set of files local to that machine and
processes them.
OutputFormat
Important OutputFormat classes.
Reduce Phase

• When the mapping phase has completed, the intermediate (key, value)
pairs must be exchanged between machines to send all values with the
same key to a single reducer.
• The reduce tasks are spread across the same nodes in the cluster as the
mappers.
• This is the only communication step in MapReduce.
• Individual map tasks do not exchange information with one another, nor
are they aware of one another's existence. Similarly, different reduce tasks
do not communicate with one another.
• The user never explicitly marshals information from one machine to
another; all data transfer is handled by the Hadoop MapReduce platform
itself, guided implicitly by the different keys associated with values. This is
a fundamental element of Hadoop MapReduce's reliability.
Record Reader
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
Shuffle & Partition
• Shuffling: The process of moving map outputs to the reducers is known as shuffling. A different subset of the intermediate key space is assigned to each reduce node; these subsets (known as "partitions") are the inputs to the reduce tasks. Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is their origin. Therefore, the map nodes must all agree on where to send the different pieces of the intermediate data.
• The Partitioner class determines which partition a given (key, value)
pair will go to. The default partitioner computes a hash value for
the key and assigns the partition based on this result.
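For illustration, a custom Partitioner sketch that mimics the default hash-based behaviour (the class name WordPartitioner is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask off the sign bit so the result is always a valid partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be wired into the job with job.setPartitionerClass(WordPartitioner.class); when nothing is set, Hadoop's HashPartitioner does essentially the same thing.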
Combiner
• Called the Combiner, this pass runs after the Mapper and before the Reducer. Usage of the Combiner is optional. If this pass is suitable for your job, instances of the Combiner class are run on every node that has run map tasks. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.
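Because word-count sums are associative, the reducer from the earlier slides can double as the combiner. A one-line driver sketch (assuming the same Job object and MyReducer class):

// run MyReducer as a per-node "mini-reduce" over the map outputs
job.setCombinerClass(MyReducer.class);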
Clarifications

Q & A..?
Thank You
See You in Class Next Week
