BDA Record (1)
LAB RECORD
4/4 B.TECH (Computer Science and Engineering)
I Semester
Certificate
This is to certify that the experiments recorded in this book are the bonafide work
of …………………………………. student of …………..………………. carried
out in the subject ……………………. at Bapatla Engineering College, Bapatla
during the year 2020 – 2021. Number of experiments recorded is …………….
Lecturer-in-charge
Date :
Head of the Department
Department of Computer Science and Engineering
3) MapReduce - WordCount
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
Step-5:-
Run the following commands in the terminal:
sudo mkdir -p /usr/local/hadoopdata/hdfs/namenode
sudo mkdir -p /usr/local/hadoopdata/hdfs/datanode
sudo mkdir -p /app/hadoop/tmp
sudo chown hadoop1:hadoop1 /app/hadoop/tmp (instead of hadoop1:hadoop1, give your own username and group,
e.g. username:groupname)
sudo chmod 750 /app/hadoop/tmp
Step-6:-
Modify these files in /usr/local/hadoop/etc/hadoop:
1) core-site.xml
2) hadoop-env.sh
3) mapred-site.xml
4) hdfs-site.xml
hdfs-site.xml:-
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoopdata/hdfs/datanode</value>
</property>
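The contents of the other files listed in Step-6 are not reproduced here. As a rough reference, a minimal single-node configuration commonly uses properties like the ones below (check the values against your own installation; in hadoop-env.sh only JAVA_HOME normally needs to be set to your JDK path):
core-site.xml:-
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
mapred-site.xml:-
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>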
Step-7:-
➢ source ~/.bashrc
➢ hadoop namenode -format
➢ start-all.sh (this will start all the Hadoop daemons)
➢ jps (to check that all the daemons are running; six processes in total)
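If everything has started correctly, the jps listing typically shows these six processes (the process IDs will differ on each machine):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps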
NOTE: If the datanode is not started:
First stop all the services (stop-all.sh).
STEP 1: Go to the /app/hadoop/tmp/ location and delete all the folders inside it.
STEP 2: In the terminal run:
sudo chmod -R 755 /app/hadoop
Then start all the services again using the commands mentioned above.
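Putting the recovery steps together, the terminal sequence looks roughly like this (assuming the /app/hadoop paths used in this manual):
stop-all.sh
sudo rm -rf /app/hadoop/tmp/*
sudo chmod -R 755 /app/hadoop
start-all.sh
jps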
If you get all the six daemons, then your Hadoop setup is ready.
Hadoop Commands:
1) Print the Hadoop version
Syntax: hadoop version
ex: hadoop version
2) To create a directory
Syntax: hadoop fs -mkdir [-p] <path>
-p: create parent directories along the path
Ex: hadoop fs -mkdir -p /hadoop/bec
3) List the contents in human readable format
Syntax: hadoop fs -ls [-R] [-h] <path>
Ex: hadoop fs -ls /hadoop
4) Upload a file
Syntax: hadoop fs -put <local src> <dest>
Ex: hadoop fs -put /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/hadoop.txt
5) Download a file
Syntax: hadoop fs -get <src> <local dest>
Ex: hadoop fs -get /hadoop/bec/jes.txt /home/ubuntu/doc.txt
6) View the content of the file
Syntax: hadoop fs -cat <path (filename)>
Ex: hadoop fs -cat /hadoop/bec/hadoop.txt
7) Copy the file from src to destination within hdfs
Syntax: hadoop fs -cp [-f] <hdfssrc> <hdfsdest>
-f: to overwrite the file if already exists
Ex: hadoop fs -cp /hadoop/fds.txt /hadoop/bec/
8) Move the file from src to destination within the Hdfs
Syntax: hadoop fs -mv <hdfssrc> <hdfsdest>
Ex: hadoop fs -mv /hadoop/bec/hadoop.txt /hadoop/
9) Copy from local to hdfs:
Syntax: hadoop fs -copyFromLocal [-f] <local src> <dest>
-f: to overwrite the file if already exists
Ex: hadoop fs -copyFromLocal /home/ubuntu/Desktop/hadoop.txt /hadoop/bec/hadoop.txt
10) Copy to local from hdfs
Syntax: hadoop fs -copyToLocal [-f] <hdfssrc> <dest>
-f: to overwrite the file if already exists
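For example, mirroring the earlier examples (the paths here are illustrative):
Ex: hadoop fs -copyToLocal /hadoop/bec/hadoop.txt /home/ubuntu/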
3) After opening the workspace, click on “File” in the menu bar, click on “New” and then select “Java
Project”
4)Provide a name for the project (say, wordcount) and leave other settings as default and then click on
“Finish”.
Step-2:-
1) Creating a package under the project.
➢ Right click on the project name (say, wordcount), click on “New” and then select “Package”.
2) Provide a name for the package (say, Demo), leave other settings as default and click on “Finish”.
3) Right click on the project name, click on “New” and then select “Class”.
4) Provide a name for the class (say, WCount), leave other settings as default and click on “Finish”.
Step-3:-
WCount.java:
package Demo;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WCount {

    // Mapper: splits each input line into words and emits (word, 1)
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                value.set(tokenizer.nextToken());
                output.collect(value, new IntWritable(1));
            }
        }
    }

    // Reducer: adds up the counts emitted for each word and writes (word, total)
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job; args[0] = input file (HDFS), args[1] = output directory (HDFS)
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Note: We may get errors at the import statements as we have not included the required jars in the project’s
configuration. Follow the steps below to do so.
Step-4:-
1) We need to include some external jar files to our project.
➢ There are two jar files which can be used for this program. They can be downloaded from the
following links given below:
➢ https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/3.3.0
➢ https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-mapreduce-client-core/3.3.0
➢ These can be included in our project by configuring the build path and adding them as external libraries to
the classpath.
2) Right click on the project and then click on “Build Path”
3) Select “Configure Build Path” and go to the “Libraries” tab
4) Click on “Add External JARs”
5) Select the downloaded jar files and then click on “Open”
6) See that the jars are added to the classpath and click on “Apply and Close”
Note: See that the errors are solved in the java class after adding the jars in classpath.
Step-5:-
1) Now, we need to convert our project into a java jar file. Right click on the project name and then select
“Export”.
2) In Export window, select “Java” and then select “JAR File” from dropdown menu.
3) In file specification window, select the path and provide a name (say, wcount.jar) for the jar file which is
to be exported.
Note: See that the jar file is saved in the given path.
Step-6:-
1) Create a text document with some text which will be considered as an input for our program.
2) Start the hadoop daemons and move the text file to any hdfs directory and check if it is moved or not
using the below commands.
➢ start-all.sh
➢ jps
➢ hadoop fs -put /home/hayath/Desktop/hayath.txt /hadoop/
➢ hadoop fs -cat /hadoop/hayath.txt
3) Now we can execute the jar file with the below command by specifying the input_text_file and an
output_dir to save the output.
Syntax: hadoop jar <JAR_FILE_PATH(In local file system)> Package_Name.Class_Name
<Input_Text_File_Path(In HDFS)> <Output_Dir_Path(In HDFS)>
Ex: hadoop jar /home/hayath/Desktop/wcount.jar Demo.WCount /hadoop/hayath.txt /hadoop/wcount_output
(the jar path, input file and output directory above are illustrative; note that the output directory must not already exist in HDFS)
4) Check that the specified output folder is created in hdfs. If found, open the directory and look out for the
“part-00000” file and download it. This file gives the output of the program which specifies the count of
each word in the given input file’s text.
Type the below commands to view the output file, or simply browse the HDFS file system and download the
“part-00000” file to view it on your system.
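For instance, assuming the output directory name used in the earlier example:
➢ hadoop fs -ls /hadoop/wcount_output
➢ hadoop fs -cat /hadoop/wcount_output/part-00000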
➢ pig -version : prints the version of Pig that is installed
➢ Pig Modes:
1. Local Mode: pig -x local
2. Mapreduce Mode: pig -x mapreduce (or) pig
Pig Operators:
Processing Operators:
Loading and Storing Data:
For example, let us consider stud.txt, which contains the following content:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,Siddhartha,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhubaneswar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,Trivandrum
008,Bharathi,Nambiayar,24,9848022333,Chennai
Load operator:
Syntax:
Relation_name = LOAD 'Input file path' USING function as schema;
Example:
student = load '/home/ubuntu/Desktop/stud' using PigStorage(',') as (id:int, fname:chararray,
lname:chararray, age, contact, city);
Output:
Store operator:
Syntax:
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example:
store student into '/home/ubuntu/Desktop/det' using PigStorage(',');
Output:
Diagnostic Operators:
Dump operator:
Syntax:
Dump Relation_Name;
Example:
dump student;
Output:
Describe Operator:
Syntax:
describe Relation_Name;
Example:
describe student;
Output:
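Since age, contact and city were left untyped in the load statement, Pig treats them as bytearray, so describe prints a schema roughly like:
student: {id: int,fname: chararray,lname: chararray,age: bytearray,contact: bytearray,city: bytearray}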
Filtering Data:
Syntax:
Relation2_name = FILTER Relation1_name BY (condition);
Example1:
filter_data = FILTER student BY city == 'Chennai';
dump filter_data;
Output:
Example2:
filter_age = FILTER student BY age == '21';
dump filter_age;
Output:
Foreach Operator:
Syntax:
Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example:
Grunt> foreach_data = foreach student generate id, fname, city;
Grunt> dump foreach_data;
Output:
Group All:
Syntax:
Relation_name2 = GROUP Relation_name1 All;
Example:
group_all = GROUP student All;
dump group_all;
Output:
Join Operator:
Let us consider the two files customers.txt and orders.txt:
customers.txt:
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt:
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we load these two files as follows:
customers = LOAD '/home/ubuntu/Desktop/customers' USING PigStorage(',') as (id:int, name:chararray,
age:int, address:chararray, salary:int);
orders = LOAD '/home/ubuntu/Desktop/orders' USING PigStorage(',')as (oid:int, date:chararray,
customer_id:int, amount:int);
InnerJoin:
Syntax:
Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example:
inner_join = JOIN customers BY id, orders BY customer_id;
dump inner_join;
Outer Join:
Left Outer Join:
Syntax:
Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example:
outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
dump outer_left;
Output:
Union Operator:
stud.txt:
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
stud1.txt:
007,Komal,Nayak,9848022334,trivendram.
008,Bharathi,Nambiayar,9848022333,Chennai.
Syntax:
Relation_name3 = UNION Relation_name1, Relation_name2;
Example:
student1 = LOAD '/home/ubuntu/Desktop/stud' USING PigStorage (',') as (id:int,firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
student2 = LOAD '/home/ubuntu/Desktop/stud1' USING PigStorage (',') as (id:int,firstname:chararray,
lastname:chararray,phone:chararray, city:chararray);
student3 = UNION student1,student2;
dump student3;
Output:
Split Operator:
Syntax:
SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example:
student_details = LOAD '/home/ubuntu/Desktop/stu' USING PigStorage (',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
SPLIT student_details into student_details1 if age<23, student_details2 if (22<age and age<25);
dump student_details1;
dump student_details2;
Executing Pig Scripts:
1. Using exec:
Syntax: exec file_path;
Example: exec /home/ubuntu/Desktop/hii.pig;
2. Using run:
Syntax: run file_path;
Example: run /home/ubuntu/Desktop/hii.pig;
Output:
PIG Scripts:
WordCount:
wordcount.pig
lines = load '/home/ubuntu/Desktop/wordcount.txt' as (line: chararray);
words = FOREACH lines GENERATE FLATTEN (TOKENIZE (line)) as word;
grouped = group words by word;
wordcount = FOREACH grouped GENERATE group, COUNT (words);
dump wordcount;
Go to terminal
Grunt> exec /home/ubuntu/Desktop/wordcount.pig
Output:
Maxtemp:
maxtemp.pig
maxtmp = load '/home/ubuntu/Desktop/Maxtemp.txt' using PigStorage(',') as (year:int, tmp:int,
city:chararray);
maxtmp_year = group maxtmp by year;
max_tmp_yr = FOREACH maxtmp_year GENERATE group, MAX (maxtmp.tmp);
dump max_tmp_yr;
Goto terminal
Grunt> exec /home/ubuntu/Desktop/maxtemp.pig
Input:
Maxtemp.txt
1992,23,HYDERABAD
1996,28,GOA
1992,53,KOLKATTA
1996,53,MUMBAI
2013,25,BAPATLA
2018,45,GUNTUR
2013,42,ONGOLE
Output:
Card Count:
cardcount.pig
cards = load '/home/ubuntu/Desktop/Cardcount.txt' USING PigStorage (',') as (color: chararray, symbol:
chararray, num: int);
colors = group cards by color;
cardcount = foreach colors generate group, COUNT(cards.num);
dump cardcount;
Goto terminal
Grunt> exec /home/ubuntu/Desktop/cardcount.pig
Input:
Cardcount.txt
red,club,1
red,diamond,5
red,sprade,6
blue,sprade,7
blue,diamond,6
black,Sprade,9
black,Sprade,4
black,diamond,3
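Based on the input above, the colors group into three red, two blue and three black cards, so the dump prints tuples along the lines of (the order may vary):
(red,3)
(blue,2)
(black,3)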
Pig UDFs:
Registering UDFs
--register_java_udf.pig
register 'your_path_to_piggybank/piggybank.jar';
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
Registering Python UDFs: (The Python script must be in your current directory)
--register_python_udf.pig
register 'production.py' using jython as bballudfs;
players = load 'baseball' as (name:chararray, team:chararray, pos:bag{t:(p:chararray)}, bat:map[]);
Writing UDFs
Java UDFs:
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UPPER extends EvalFunc<String>
{
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try{
String str = (String)input.get(0);
return str.toUpperCase();
}catch(Exception e){
throw new IOException("Caught exception processing input row ", e);
}
}
}
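To use this UDF from Pig, the class is compiled, packed into a jar, registered, and then called by its fully qualified name. A minimal sketch, assuming the jar is named myudfs.jar and reusing the stud data from earlier:
register 'myudfs.jar';
student = load '/home/ubuntu/Desktop/stud' using PigStorage(',') as (id:int, fname:chararray, lname:chararray, age, contact, city);
upper_names = foreach student generate myudfs.UPPER(fname);
dump upper_names;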