Hadoop Week 3
Skype ID – edureka.hadoop
Email – [email protected]
Venkat – [email protected]
Course Topics
Week 1 – Introduction to HDFS
Week 2 – Setting Up Hadoop Cluster
Week 3 – Map-Reduce Basics, Types and Formats
Week 4 – PIG
Week 5 – HIVE
Week 6 – HBASE
Week 7 – ZOOKEEPER
Week 8 – SQOOP
Recap of Week 2
HDFS Components
HDFS Architecture
Anatomy of File Write and File Read
Job Tracker working
Hadoop Command
Web UI links
Listing of the contents of the examples jar file
Sample Examples List
Running the Teragen Example
Checking the Output
Deployment Modes
• Standalone or Local Mode
– No daemons running
– Everything runs in a single JVM
– Good for development and debugging
• Pseudo-distributed Mode
– All daemons run on a single machine, simulating a cluster on one machine
– Good for a test environment
• Fully distributed Mode
– Hadoop runs on multiple machines in a cluster
– Used for the production environment
Folder View of Hadoop
Hadoop Configuration Files
Configuration filename – Description
• hadoop-env.sh – Environment variables that are used in the scripts to run Hadoop
• core-site.xml – Configuration settings for Hadoop Core, such as I/O settings, that are common to HDFS and MapReduce
• hdfs-site.xml – Configuration settings for the HDFS daemons: the namenode, the secondary namenode and the datanodes
• mapred-site.xml – Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
• masters – A list of machines (one per line) that each run a secondary namenode
• slaves – A list of machines (one per line) that each run a datanode and a tasktracker
• hadoop-metrics.properties – Properties for controlling how metrics are published in Hadoop
• log4j.properties – Properties for system log files, the namenode audit log and the task log for the tasktracker child process
JDK Location
• The JDK location (JAVA_HOME) is set in hadoop-env.sh
Configuration file for each component
• Core – core-site.xml
• HDFS – hdfs-site.xml
• MapReduce – mapred-site.xml
core-site.xml and hdfs-site.xml
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020/</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Defining HDFS details in hdfs-site.xml
• Property: dfs.name.dir
• Value: /disk1/hdfs/name,/remote/hdfs/name
• Default: ${hadoop.tmp.dir}/dfs/name
• Description: The list of directories where the namenode stores its metadata.
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Defining mapred-site.xml
• Property: mapred.job.tracker
• Value: localhost:8021
• Description: The hostname and port that the jobtracker's RPC server runs on. If set to the default value of "local", the jobtracker is run in-process on demand when you run a MapReduce job.
Key configuration properties (read programmatically in the sketch below):
• fs.default.name
• hadoop.tmp.dir
• mapred.job.tracker
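These properties can also be read from client code through the Configuration API. A minimal sketch, assuming the standard configuration files are on the classpath (the class name ShowConf is illustrative, not part of Hadoop):

import org.apache.hadoop.conf.Configuration;

public class ShowConf {
    public static void main(String[] args) {
        // new Configuration() picks up core-default.xml and core-site.xml from the classpath
        Configuration conf = new Configuration();
        // hdfs-site.xml and mapred-site.xml are added here explicitly as classpath resources
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");

        System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
        System.out.println("hadoop.tmp.dir     = " + conf.get("hadoop.tmp.dir"));
        System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
    }
}

In standalone mode fs.default.name and mapred.job.tracker keep their defaults (file:/// and local); in pseudo-distributed mode they point at localhost, as in the XML snippets above.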
Slaves and masters
Two files are used by the startup and shutdown commands:
slaves
masters
hadoop-env.sh file:
• hadoop-env.sh holds the environment settings for the Hadoop daemon JVMs (for example, JAVA_HOME and heap size).
• This file also offers a way to provide custom parameters for each of the servers.
• hadoop-env.sh is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
Reporting
• hadoop-metrics.properties controls how metrics are published for Hadoop Core.
Recovering the NameNode from the secondary NameNode checkpoint:
2. Copy the contents of the secondary NameNode's fs.checkpoint.dir to the NameNode's dfs.name.dir.
3. Copy the contents of the secondary NameNode's fs.checkpoint.edits to the NameNode's dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the secondary NameNode.
Hadoop Daemon Port Details
Sample Input File
Ubuntu File System to HDFS
Files in Name Node before Copy
Cluster Summary after File Copy
Command for confirming the file copy
NameNode Status Check
File System View
MR Command Output
Map Reduce Command Output
View the Output
MR Flow
Map Reduce Process
Mapper
• Reads data from the input split as per the input format
• Denoted as Mapper<K1, V1, K2, V2>
• K1, V1 are the key-value pair of the input data
• K2, V2 are the key-value pair of the output data
• Mapper API
– public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable>
– <LongWritable, Text> is the key-value pair input to the mapper
– <Text, IntWritable> is the key-value pair output of the mapper
• Override the map() method
– public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
Reducer
• Processes data from the mapper output
• Denoted as Reducer<K3, List<V3>, K4, V4>
• K3, List<V3> are the key and the list of values for that key as input data
• K4, V4 are the key-value pair of the output data
• Reducer API
– public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
– <Text, List<IntWritable>> is the key and list of values input to the reducer
– <Text, IntWritable> is the key-value pair output of the reducer
• Override the reduce() method
– public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
Mapper
// Requires: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    // Called once per input record: key is the byte offset, value is one line of text
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, new IntWritable(1));   // emit (word, 1)
        }
    }
}
Reducer
// Requires: java.io.IOException, org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Reducer
public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Called once per key with all of the values collected for that key
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();                        // add up the 1s emitted by the mapper
        }
        context.write(key, new IntWritable(sum));      // emit (word, total count)
    }
}
• When the mapping phase has completed, the intermediate (key, value)
pairs must be exchanged between machines to send all values with the
same key to a single reducer.
• The reduce tasks are spread across the same nodes in the cluster as the
mappers.
• This is the only communication step in MapReduce.
• Individual map tasks do not exchange information with one another, nor
are they aware of one another's existence. Similarly, different reduce tasks
do not communicate with one another.
• The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. A minimal driver sketch tying the earlier mapper and reducer into a single job follows below.
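To show how these pieces fit together, here is a minimal driver sketch for the word-count mapper and reducer from the earlier slides (the class name WordCount, the jar name and the paths are assumptions, not from the slides). Once the job is configured, the Hadoop framework performs the shuffle between the map and reduce phases on its own.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // MyMapper and MyReducer from the earlier slides are assumed to be
    // nested static classes of (or imported into) this class.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");                    // Hadoop 1.x style job setup
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);                        // types of the reducer output
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path; must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A run would then look something like: hadoop jar wordcount.jar WordCount /input /output (jar and path names assumed).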
Record Reader
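The RecordReader is what turns the byte-oriented input split into the key-value pairs that the mapper's map() method receives, one call to nextKeyValue() per record. As an illustrative sketch only (this class is not from the slides), a small new-API RecordReader that wraps Hadoop's built-in LineRecordReader and upper-cases each line; a custom InputFormat would return it from createRecordReader().

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative only: delegates split handling to LineRecordReader and
// transforms each value before handing it to the mapper.
public class UpperCaseLineRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private final Text value = new Text();

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);        // position the reader at the start of its split
    }

    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!delegate.nextKeyValue()) {
            return false;                           // end of the split
        }
        value.set(delegate.getCurrentValue().toString().toUpperCase());
        return true;
    }

    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();            // byte offset of the line, used as the mapper's input key
    }

    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    public void close() throws IOException {
        delegate.close();
    }
}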
Q & A..?
Thank You
See You in Class Next Week