3-CSE BIG DATA Analytics (19A05602P) (R-19) LAB Manual
Computer Science & Engineering (Jawaharlal Nehru Technological University, Anantapur)


SVR ENGINEERING COLLEGE


AYYALURUMETTA (V), NANDYAL, KURNOOL DT.
ANDHRA PRADESH – 518502

2021 – 2022

LAB MANUAL
of
BIG DATA ANALYTICS (19A05602P)
(R-19 REGULATION)

Prepared by
Mr. Oruganti Sampath
Asst. Professor

For
B.Tech. III Year/ II Sem. (CSE)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


LAB MANUAL CONTENTS


BIG DATA ANALYTICS (19A05602P)
1. Institute Vision & Mission, Department Vision & Mission
2. PO, PEO & PSO Statements.
3. List of Experiments
4. CO-PO Attainment
5. Experiment Code and Outputs

1. Institute Vision & Mission, Department Vision & Mission

Institute Vision:
To produce Competent Engineering Graduates & Managers with a strong base of
Technical & Managerial Knowledge and the Complementary Skills needed to be
Successful Professional Engineers & Managers.
Institute Mission:
To fulfill the vision by imparting Quality Technical & Management Education to the
Aspiring Students, by creating Effective Teaching/Learning Environment and providing
State – of the – Art Infrastructure and Resources.

Department Vision:
To produce Industry ready Software Engineers to meet the challenges of
21st Century.

Department Mission:
 Impart core knowledge and necessary skills in Computer Science and Engineering
through innovative teaching and learning methodology.

 Inculcate critical thinking, ethics, lifelong learning and creativity needed for
industry and society.

 Cultivate the students with all-round competencies, for career, higher education
and self-employability.


2. PO, PEO & PSO Statements

PROGRAMME OUTCOMES (POs)

PO-1: Engineering knowledge - Apply the knowledge of mathematics, science, engineering
fundamentals of Computer Science & Engineering to solve complex real-life engineering problems
related to CSE.
PO-2: Problem analysis - Identify, formulate, review research literature, and analyze complex
engineering problems related to CSE and reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO-3: Design/development of solutions - Design solutions for complex engineering problems related
to CSE and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, cultural, societal and environmental considerations.
PO-4: Conduct investigations of complex problems - Use research-based knowledge and research
methods, including design of experiments, analysis and interpretation of data and synthesis of the
information to provide valid conclusions.
PO-5: Modern tool usage - Select/Create and apply appropriate techniques, resources and modern
engineering and IT tools and technologies for rapidly changing computing needs, including prediction
and modeling to complex engineering activities, with an understanding of the limitations.
PO-6: The engineer and society - Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
CSE professional engineering practice.
PO-7: Environment and Sustainability - Understand the impact of the CSE professional engineering
solutions in societal and environmental contexts and demonstrate the knowledge of, and need for
sustainable development.
PO-8: Ethics - Apply ethical principles and commit to professional ethics and responsibilities and
norms of the relevant engineering practices.
PO-9: Individual and team work - Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO-10: Communication - Communicate effectively on complex engineering activities with the
engineering community and with the society-at-large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, give and receive clear
instructions.


PO-11: Project management and finance - Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
PO-12: Life-long learning - Recognize the need for and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

Program Educational Objectives (PEOs):

PEO 1: Graduates will be prepared for analyzing, designing, developing and testing software
solutions and products with creativity and sustainability.
PEO 2: Graduates will be skilled in the use of modern tools for critical problem solving and for analyzing
industrial and societal requirements.
PEO 3: Graduates will be prepared with managerial and leadership skills for their careers and for starting up
their own firms.

Program Specific Outcomes (PSOs):

PSO 1: Develop creative solutions by adapting emerging technologies/tools for real-time applications.
PSO 2: Apply the acquired knowledge to develop software solutions and innovative mobile apps for
various automation applications.

2.1 Subject Time Table

Faculty: O. SAMPATH    Class: II-CSE-AI

Periods (DAY/TIME): 9:30AM-10:20AM, 10:20AM-11:10AM, 11:30AM-12:20PM, 12:20PM-01:10PM, LUNCH, 02:00PM-02:50PM, 02:50PM-03:40PM, 03:40PM-04:30PM

MON: BDA
TUE: BDA
WED: BDA
THR: BDA-LAB
FRI: BDA
SAT: BDA


JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR


B.Tech (CSE) – III-II L T P C
0 0 3 1.5
(19A05602P) BIG DATA ANALYTICS LABORATORY
Course Objectives:
This course is designed to:
 Get familiar with Hadoop distributions, configuring Hadoop and performing file management tasks
 Experiment with MapReduce in Hadoop frameworks
 Implement MapReduce programs in a variety of applications
 Explore MapReduce support for debugging
 Understand different approaches for building Hadoop MapReduce programs for real-time
applications
Experiments:
1. Install Apache Hadoop
2. Develop a MapReduce program to calculate the frequency of a given word in a given file.
3. Develop a MapReduce program to find the maximum temperature in each year.
4. Develop a MapReduce program to find the grades of students.
5. Develop a MapReduce program to implement Matrix Multiplication.
6. Develop a MapReduce program to find the maximum electrical consumption in each year, given the
electrical consumption for each month in each year.
7. Develop a MapReduce program to analyze a weather data set and print whether the day is sunny or
cool.
8. Develop a MapReduce program to find the number of products sold in each country by
considering sales data containing fields like
Transaction_Date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created,
Last_Login, Latitude, Longitude

9. Develop a MapReduce program to find the tags associated with each movie by analyzing
movie lens data.

10. XYZ.com is an online music website where users listen to various tracks. The data gets
collected in log files and looks as shown below.
UserId | TrackId | Shared | Radio | Skip
111115 | 222 | 0 | 1 | 0
111113 | 225 | 1 | 0 | 0
111117 | 223 | 0 | 1 | 1
111115 | 225 | 1 | 0 | 0

Write a MapReduce program to get the following


�쁸Number of unique listeners
�쁸Number of times the track was shared with others
�쁸Number of times the track was listened to on the radio
�쁸Number of times the track was listened to in total
�쁸Number of times the track was skipped on the radio

11. Develop a MapReduce program to find the frequency of books published each year and find
in which year the maximum number of books were published, using the following data:
Title, Author, Published Year, Author Country, Language, No. of Pages

Downloaded by HOD-DIT PSG-PTC ([email protected])


lOMoARcPSD|13700378

12. Develop a MapReduce program to analyze Titanic ship data and to find the average age of
the people (both male and female) who died in the tragedy. Also find how many persons survived in
each class.
The Titanic data contains the following columns:
Column 1: PassengerId      Column 2: Survived (survived = 0 & died = 1)
Column 3: Pclass           Column 4: Name
Column 5: Sex              Column 6: Age
Column 7: SibSp            Column 8: Parch
Column 9: Ticket           Column 10: Fare
Column 11: Cabin           Column 12: Embarked

13. Develop a MapReduce program to analyze the Uber data set to find the days on which each
dispatching base has more trips, using the following dataset.
The Uber dataset consists of four columns:
dispatching_base_number, date, active_vehicles, trips

14. Develop a program to calculate the maximum recorded temperature year-wise for the
weather dataset in Pig Latin.

15. Write queries to sort and aggregate the data in a table using HiveQL.

16. Develop a Java application to find the maximum temperature using Spark.
Text Books:
1. Tom White, "Hadoop: The Definitive Guide", Fourth Edition, O'Reilly Media, 2015.
Reference Books:
1. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007; Pete Warden, Big Data
Glossary, O'Reilly, 2011.
2. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
3. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big
Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Publishing,
2012.
4. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge
University Press, 2012.

Course Outcomes:
Upon completion of the course, the students should be able to:
1. Configure Hadoop and perform File Management Tasks (L2)
2. Apply MapReduce programs to real time issues like word count, weather dataset and
sales of a company (L3)
3. Critically analyze huge data set using Hadoop distributed file systems and MapReduce
(L5)
4. Apply different data processing tools like Pig, Hive and Spark. (L6)


EXPERIMENT SOURCE CODE AND OUTPUTS

PRACTICAL-I : Install Apache Hadoop

1. Installation of Hadoop:
Hadoop software can be installed in three modes of operation:
• Stand-Alone Mode: Hadoop is distributed software and is designed to run on a cluster of commodity
machines. However, we can install it on a single node in stand-alone mode. In this mode,
the Hadoop software runs as a single monolithic Java process. This mode is extremely useful for
debugging purposes. You can first test-run your MapReduce application in this mode on small
data before actually executing it on a cluster with big data.

• Pseudo-Distributed Mode: In this mode also, Hadoop software is installed on a single node.
The various daemons of Hadoop run on the same machine as separate Java processes. Hence
all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker and
TaskTracker, run on a single machine.

• Fully Distributed Mode: In Fully Distributed Mode, the daemons NameNode, JobTracker,
SecondaryNameNode (Optional and can be run on a separate node) run on the Master Node.
The daemons DataNode and TaskTracker run on the Slave Node.

 Download the latest Hadoop distribution: visit the URL below and choose one of the mirror
sites. You can also copy the download link and use it directly.

 URL: https://hadoop.apache.org/release/3.1.3.html
 This is the third stable release of the Apache Hadoop 3.1 line. It contains 246 bug fixes, improvements
and enhancements since 3.1.2.
 Users are encouraged to read the overview of major changes since 3.1.2. For details of the bug fixes,
improvements, and other enhancements since the previous 3.1.2 release, please check the release
notes and changelog.

 Download the tar.gz file and extract the files to the Windows C:\ drive.

 This finishes the Hadoop setup in stand-alone mode.

 Now configure the .XML files accordingly.

1. Edit the file /home/Hadoop_dev/hadoop2/etc/hadoop/core-site.xml as below:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Note: This change sets the namenode IP and port.


2. Edit the file /home/Hadoop_dev/hadoop2/etc/hadoop/hdfs-site.xml as below:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Note: This change sets the default replication count for blocks used by HDFS.

 Edit the configuration file /home/hadoop_dev/hadoop2/etc/hadoop/hadoop-env.sh and set
JAVA_HOME in that file.
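For example, hadoop-env.sh might contain a line such as export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 (the JDK path shown here is only illustrative; use the actual path of the JDK installed on your system).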


Practical-2:
AIM: Develop a MapReduce program to calculate the frequency of a given word in a
given file
Pre-requisite
o Java Installation - Check whether Java is installed using the following
command:
java -version
o Hadoop Installation - Check whether Hadoop is installed using the following
command:
hadoop version
Steps to execute MapReduce word count example
o Create a text file in your local machine and write some text into it.
$ nano data.txt

o Check the text written in the data.txt file.
$ cat data.txt

In this example, we find out the frequency of each word that exists in this text file.
o Create a directory in HDFS where the text file will be kept.
$ hdfs dfs -mkdir /test


o Upload the data.txt file to HDFS in the specified directory.


$ hdfs dfs -put /home/codegyani/data.txt /test

o Write the MapReduce program.

The entire MapReduce program can be fundamentally divided into three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code

Driver Code:
Configuration conf= new Configuration();
Job job = new Job(conf,"My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));


File: WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;

public class WordCount{

public static class Map extends Mapper<LongWritable,Text,Text,IntWritable> {
public void map(LongWritable key, Text value,Context context)
throws IOException,InterruptedException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
value.set(tokenizer.nextToken());
context.write(value, new IntWritable(1));
}
}
}
public static class Reduce extends Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,InterruptedException {
int sum=0;
for(IntWritable x: values)
{
sum+=x.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf= new Configuration();
Job job = new Job(conf,"My Word Count Program");
job.setJarByClass(WordCount.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path outputPath = new Path(args[1]);


//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//deleting the output path automatically from hdfs so that we don't have to delete it explicitly
outputPath.getFileSystem(conf).delete(outputPath, true);
//exiting the job only if the flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

 Create the jar file of this program and name it countworddemo.jar.

Run the jar file:

Run the MapReduce code:
The command for running a MapReduce job is:
hadoop jar countworddemo.jar WordCount /test /r_output

 The output is stored in /r_output/part-r-00000

Execution:
o Now execute the command to see the output.
hdfs dfs -cat /r_output/part-r-00000
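As an illustration (assuming, hypothetically, that data.txt contained the single line "hello world hello"), the part file would hold tab-separated word counts such as:

hello	2
world	1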


Practical-3:
Aim: Develop a MapReduce program to find the maximum temperature in each year.

1. The system receives temperatures of various cities (Austin, Boston, etc.) of the USA, captured at regular intervals of
time on each day, in an input file.
2. The system will process the input data file and generate a report with the maximum and minimum temperatures of
each day, along with the time.
3. It generates a separate output report for each city.
Ex: Austin-r-00000
Boston-r-00000
Newjersy-r-00000
Baltimore-r-00000
California-r-00000
Newyork-r-00000

Expected output: In each output file a record should look like this: 25-Jan-2014 Time: 12:34:542 MinTemp: -22.3
Time: 05:12:345 MaxTemp: 35.7
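The mapper below assumes tab-separated input records whose first field is a city-prefixed date (the reducer keys its output file on the CA/NY/NJ/AUS/BOS/BAL prefix), followed by alternating time and temperature readings. A hypothetical input line (illustrative only; the exact layout depends on the dataset used) might look like:

CA_25-Jan-2014	05:12:345	35.7	12:34:542	-22.3	18:00:120	10.4   (fields separated by tabs)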

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime
{
public static String calOutputName ="California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";

public static class WhetherForcastMapper extends Mapper<Object, Text, Text, Text>
{
public void map(Object keyOffset, Text dayReport, Context con)
throws IOException, InterruptedException
{
StringTokenizer strTokens = new StringTokenizer(dayReport.toString(), "\t");


int counter = 0;
Float currnetTemp = null;
Float minTemp = Float.MAX_VALUE;
Float maxTemp = Float.MIN_VALUE;
String date = null;
String currentTime = null;
String minTempANDTime = null;
String maxTempANDTime = null;
while (strTokens.hasMoreElements())
{
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}

temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);

}
}
public static class WhetherForcastReducer extends
Reducer<Text, Text, Text, Text>
{


MultipleOutputs<Text, Text> mos;

public void setup(Context context)


{
mos = new MultipleOutputs<Text, Text>(context);
}
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int counter = 0;
String reducerInputStr[] = null;
String f1Time = "";
String f2Time = "";
String f1 = "", f2 = "";
Text result = new Text();
for (Text value : values) {

if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}
else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}
counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2))
{
result = new Text("Time: " + f2Time + " MinTemp: " + f2 + "\t"
+ "Time: " + f1Time + " MaxTemp: " + f1);
} else {
result = new Text("Time: " + f1Time + " MinTemp: " + f1 + "\t"
+ "Time: " + f2Time + " MaxTemp: " + f2);
}
String fileName = "";
if (key.toString().substring(0, 2).equals("CA"))
{
fileName = CalculateMaxAndMinTemeratureWithTime.calOutputName;
} else if (key.toString().substring(0, 2).equals("NY"))
{
fileName = CalculateMaxAndMinTemeratureWithTime.nyOutputName;
} else if (key.toString().substring(0, 2).equals("NJ"))
{
fileName = CalculateMaxAndMinTemeratureWithTime.njOutputName;
} else if (key.toString().substring(0, 3).equals("AUS"))
{


fileName = CalculateMaxAndMinTemeratureWithTime.ausOutputName;
} else if (key.toString().substring(0, 3).equals("BOS"))
{
fileName = CalculateMaxAndMinTemeratureWithTime.bosOutputName;
} else if (key.toString().substring(0,3).equals("BAL"))
{
fileName = CalculateMaxAndMinTemeratureWithTime.balOutputName;
}
String strArr[] = key.toString().split("_");
key.set(strArr[1]); //Key is date value
mos.write(fileName, key, result);
}
@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Weather Statistics of USA");
job.setJarByClass(CalculateMaxAndMinTemeratureWithTime.class);
job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class,Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class,Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class,Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class,Text.class);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Path pathInput = new Path("hdfs://192.168.213.133:54310/weatherInputData/input_temp.txt");
Path pathOutputDir = new Path("hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3");
FileInputFormat.addInputPath(job, pathInput);


FileOutputFormat.setOutputPath(job,pathOutputDir);
try {
System.exit(job.waitForCompletion(true) ? 0 :1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

Copy an input file from the local file system to HDFS:

/usr/local/hadoop2.6.1/bin$ ./hadoop fs -put /home/zytham/input_temp.txt /weatherInputData/

Now execute the above sample program.

Run -> Run As -> Run on Hadoop. Wait for a moment and check whether the output directory is in place on HDFS. Execute the
following command to verify the same:

/usr/local/hadoop2.6.1/bin$ ./hadoop fs -ls /user/hduser1/testfs/output_mapred3


PRACTICAL-5:
AIM: Develop a MapReduce program to implement Matrix Multiplication.

import java.io.IOException;
import java.util.*;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoStepMatrixMultiplication


{
public static class Map extends Mapper<LongWritable, Text, Text, Text>
{
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A"))
{
outputKey.set(indicesAndValue[2]);
outputValue.set("A," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
} else
{
outputKey.set(indicesAndValue[1]);
outputValue.set("B," + indicesAndValue[2] + "," +indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
public static class Reduce extends Reducer<Text, Text, Text,Text>
{
public void reduce(Text key, Iterable<Text> values,Context context)
throws IOException, InterruptedException
{
String[] value;
ArrayList<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer, Float>>();
ArrayList<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer, Float>>();
for (Text val : values)


{
value = val.toString().split(",");
if (value[0].equals("A")) {
listA.add(new SimpleEntry<Integer,Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
} else {
listB.add(new SimpleEntry<Integer,Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
}
}
String i;
float a_ij;
String k;
float b_jk;
Text outputValue = new Text();
for (Entry<Integer, Float> a : listA)
{
i = Integer.toString(a.getKey());
a_ij = a.getValue();
for (Entry<Integer, Float> b : listB)
{
k = Integer.toString(b.getKey());
b_jk = b.getValue();
outputValue.set(i + "," + k + "," + Float.toString(a_ij*b_jk));
context.write(null, outputValue);
}
}
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = new Job(conf, "MatrixMatrixMultiplicationTwoSteps");
job.setJarByClass(TwoStepMatrixMultiplication.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("hdfs://127.0.0.1:9000/matrixin"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://127.0.0.1:9000/matrixout"));
job.waitForCompletion(true);
}
}
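The mapper above expects each input line in the form matrixName,rowIndex,columnIndex,value, with lines of matrix A keyed by their column index and lines of matrix B by their row index so that matching pairs meet at the same reducer. A hypothetical input for a 2x2 example (illustrative only) could look like:

A,0,0,1.0
A,0,1,2.0
A,1,0,3.0
A,1,1,4.0
B,0,0,5.0
B,0,1,6.0
B,1,0,7.0
B,1,1,8.0

Each reducer then emits one partial product per matching pair as i,k,a_ij*b_jk; as the class name TwoStepMatrixMultiplication suggests, a second job that groups by (i,k) and sums these partial products is still needed to obtain the final matrix entries.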


PRACTICAL-6:

AIM: File Management tasks in Hadoop


1. Create a directory in HDFS at given path(s).
Usage:
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2
2. List the contents of a directory.
Usage:
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode
3. Upload and download a file in HDFS.
Upload:
hadoop fs -put:
Copy a single src file, or multiple src files, from the local file system to the Hadoop data file system.
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
Download:
hadoop fs -get:
Copies/downloads files to the local file system.
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/
4. See the contents of a file.
Same as the unix cat command.
Usage:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt
5. Copy a file from source to destination.
This command allows multiple sources as well, in which case the destination must be a directory.
Usage:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
6. Copy a file from/to the local file system to HDFS.
copyFromLocal
Usage:
hadoop fs -copyFromLocal <localsrc> URI


Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Similar to the put command, except that the source is restricted to a local file reference.
copyToLocal
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to the get command, except that the destination is restricted to a local file reference.
7. Move a file from source to destination.
Note: Moving files across filesystems is not permitted.
Usage:
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
8. Remove a file or directory in HDFS.
Removes files specified as the argument. Deletes a directory only when it is empty.
Usage:
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt
Recursive version of delete.
Usage:
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/
9. Display the last few lines of a file.
Similar to the tail command in Unix.
Usage:
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt
10. Display the aggregate length of a file.
Usage:
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt


PRACTICAL-7

AIM: Pig Latin scripts to sort, group, join, project, and filter your data.
ORDER BY
Sorts a relation based on one or more fields.
Note: ORDER BY is NOT stable; if multiple records have the same ORDER
BY key, the order in which these records are returned is not defined and is
not guaranteed to be the same from one run to the next.

In Pig, relations are unordered (see Relations, Bags, Tuples, Fields):

• If you order relation A to produce relation X (X = ORDER A BY * DESC;), relations A and X still contain the same data.
• If you retrieve relation X (DUMP X;), the data is guaranteed to be in the order you specified (descending).

Syntax:
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];

alias - The name of a relation.
* - The designator for a tuple.
field_alias - A field in the relation. The field must be a simple type.
ASC - Sort in ascending order.
DESC - Sort in descending order.
PARALLEL n - Increase the parallelism of a job by specifying the number of reduce tasks, n.

A = LOAD 'mydata' AS (x: int, y: map[]);

B = ORDER A BY x; -- this is allowed because x is a simple type
B = ORDER A BY y; -- this is not allowed because y is a complex type
B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an expression

Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)


In this example relation A is sorted by the third field, a3, in descending
order. Note that the order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)

RANK
Returns each tuple with the rank within a relation.

When specifying no field to sort on, the RANK operator simply prepends a
sequential value to each tuple.

Otherwise, the RANK operator uses each field (or set of fields) to sort the
relation. The rank of a tuple is one plus the number of different rank
values preceding it. If two or more tuples tie on the sorting field values,
they will receive the same rank.

Suppose we have relation A.


A = load 'data' AS (f1:chararray,f2:int,f3:chararray);
DUMP A;
(David,1,N)
(Tete,2,N)
(Ranjit,3,M)
(Ranjit,3,P)
(David,4,Q)
(David,4,Q)
(Jillian,8,Q)
(JaePak,7,Q)
(Michael,8,T)
(Jillian,8,Q)
(Jose,10,V)

In this example, the RANK operator does not change the order of the
relation and simply prepends to each tuple a sequential value.
B = rank A;

Syntax:
alias = RANK alias [ BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [DENSE] ];
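For the relation A shown above, DUMP B should therefore produce output along these lines (a sketch; the prepended rank simply counts the tuples in their existing order):

(1,David,1,N)
(2,Tete,2,N)
(3,Ranjit,3,M)
(4,Ranjit,3,P)
(5,David,4,Q)
(6,David,4,Q)
(7,Jillian,8,Q)
(8,JaePak,7,Q)
(9,Michael,8,T)
(10,Jillian,8,Q)
(11,Jose,10,V)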


PRACTICAL-8
AIM: Hive Databases, Tables, Views, Functions and Indexes
Databases in Hive

hive> CREATE DATABASE financials;


Hive will throw an error if financials already exists. We can suppress these warnings with this variation:

hive> CREATE DATABASE IF NOT EXISTS financials;


At any time, we can see the databases that already exist as follows:

hive> SHOW DATABASES;

default
financials

hive> CREATE DATABASE human_resources;


hive> SHOW DATABASES;
hive> SHOW DATABASES LIKE 'h.*';
human_resources

We can override this default location for the new directory as shown in this example:
hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';

We can add a descriptive comment to the database, which will be shown by the DESCRIBE DATABASE
<database> command.

hive> CREATE DATABASE financials

> COMMENT 'Holds all financial tables';

hive> DESCRIBE DATABASE financials;


financials Holds all financial tables
hdfs://master-server/user/hive/warehouse/financials.db

hive> CREATE DATABASE financials


> WITH DBPROPERTIES ('creator' = 'Mark Moneybags', 'date'= '2012-01-02');

hive> DESCRIBE DATABASE financials;


financials hdfs://master-server/user/hive/warehouse/financials.db

hive> DESCRIBE DATABASE EXTENDED financials;


financials hdfs://master-server/user/hive/warehouse/financials.db

{date=2012-01-02, creator=Mark Moneybags}

The USE command sets a database as your working database, analogous to changing working directories in a
filesystem:

hive> USE financials;

Set the following property to print the current database as


part of the prompt (Hive v0.8.0 and later):


hive> set hive.cli.print.current.db=true;
hive (financials)> USE default;
hive (default)> set hive.cli.print.current.db=false;

Finally, you can drop a database:


hive> DROP DATABASE IF EXISTS financials;
hive> DROP DATABASE IF EXISTS financials CASCADE;

We can set key-value pairs in the DBPROPERTIES associated with a database using the ALTER DATABASE
command. No other metadata about the database can be changed, including its name and directory location:

hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by'= 'Joe Dba');

There is no way to delete or "unset" a DBPROPERTY.


Creating Tables

The CREATE TABLE statement follows SQL conventions, but Hive's version offers significant extensions to
support a wide range of flexibility in where the data files for tables are stored, the formats used, etc.

CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';


PRACTICAL-9:
Aim: Develop a program to calculate the maximum recorded temperature year-wise for the
weather dataset in Pig Latin.

-- max_temp.pig: Finds the maximum temperature by year


records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;

Start up Grunt in local mode, then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);

We write a relation with one tuple per line, where tuples are represented as comma-separated items
in parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)

grunt> DUMP records;


(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

We can also see the structure of a relation (the relation's schema) using the DESCRIBE operator on
the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}

Output:
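The output screenshot is not reproduced here; based on the sample records shown above, the DUMP max_temp statement should produce one tuple per year with that year's maximum temperature:

(1949,111)
(1950,22)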


PRACTICAL-10: Develop a Java application to find the maximum temperature using Spark.
(The sample implementation below uses the Python Spark API; an equivalent Java sketch is given after it.)
import re
import sys

from pyspark import SparkContext

#function to extract the data from the line


#based on position and filter out the invalid records
def extractData(line):
val = line.strip()
(year, temp, q) = (val[15:19], val[87:92], val[92:93])
if (temp != "+9999" and re.match("[01459]", q)):
return [(year, temp)]
else:
return []

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

#Create Spark Context with the master details and the application name
sc = SparkContext("spark://bigdata-vm:7077", "max_temperature")

#Create an RDD from the input data in HDFS


weatherData = sc.textFile(logFile)

#Transform the data to extract/filter and then find the max temperature
max_temperature_per_year = weatherData.flatMap(extractData).reduceByKey(lambda a, b: a if int(a) > int(b) else b)

#Save the RDD back into HDFS
max_temperature_per_year.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output")
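Since the aim asks for a Java application, a minimal Java equivalent of the PySpark job above might look as follows. This is only a sketch, assuming Spark's Java API and the same NCDC-style fixed-width records; the class name, HDFS paths and field positions are illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class MaxTemperatureSpark {
    public static void main(String[] args) {
        // Master URL is supplied via spark-submit (e.g. --master local[*])
        SparkConf conf = new SparkConf().setAppName("max_temperature");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the weather records from HDFS (path is illustrative)
        JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/user/bigdatavm/input");

        // Keep only valid records and map each one to (year, temperature)
        JavaPairRDD<String, Integer> yearTemp = lines
                .filter(line -> {
                    String temp = line.substring(87, 92);
                    String quality = line.substring(92, 93);
                    return !temp.equals("+9999") && quality.matches("[01459]");
                })
                .mapToPair(line -> new Tuple2<>(line.substring(15, 19),
                        Integer.parseInt(line.substring(87, 92).replace("+", ""))));

        // Reduce by year, keeping the maximum temperature per year
        JavaPairRDD<String, Integer> maxTempPerYear = yearTemp.reduceByKey(Math::max);

        // Save the result back to HDFS (path is illustrative)
        maxTempPerYear.saveAsTextFile("hdfs://localhost:9000/user/bigdatavm/output");

        sc.stop();
    }
}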
