2021 – 2022
LAB MANUAL
of
BIG DATA ANALYTICS (19A05602P)
(R-19 REGULATION)
Prepared by
Mr. Oruganti Sampath
Asst. Professor
For
B.Tech. III Year/ II Sem. (CSE)
Institute Vision:
To produce Competent Engineering Graduates & Managers with a strong base of
Technical & Managerial Knowledge and the Complementary Skills needed to be
Successful Professional Engineers & Managers.
Institute Mission:
To fulfill the vision by imparting Quality Technical & Management Education to the
Aspiring Students, by creating an Effective Teaching/Learning Environment and providing
State-of-the-Art Infrastructure and Resources.
Department Vision:
To produce Industry ready Software Engineers to meet the challenges of
21st Century.
Department Mission:
Impart core knowledge and necessary skills in Computer Science and Engineering
through innovative teaching and learning methodology.
Inculcate critical thinking, ethics, lifelong learning and creativity needed for
industry and society.
Cultivate the students with all-round competencies, for career, higher education
and self-employability.
PO-11: Project management and finance - Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
PO-12: Life-long learning - Recognize the need for and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
PEO 1: Graduates will be prepared for analyzing, designing, developing and testing software
solutions and products with creativity and sustainability.
PEO 2: Graduates will be skilled in the use of modern tools for critical problem solving and analyzing
industrial and societal requirements.
PEO 3: Graduates will be prepared with managerial and leadership skills for their careers and for
starting up their own firms.
PSO 1: Develop creative solutions by adapting emerging technologies / tools for real-time applications.
PSO 2: Apply the acquired knowledge to develop software solutions and innovative mobile apps for
various automation applications.
Timetable (O. Sampath, II-CSE-AI): MON - BDA, TUE - BDA, WED - BDA, THU - BDA Lab, FRI - BDA, SAT - BDA.
9. Develop a MapReduce program to find the tags associated with each movie by analyzing
movie lens data.
10. XYZ.com is an online music website where users listen to various tracks; the data that gets
collected is given below.
The data comes in log files and looks as shown below.
UserId | TrackId | Shared | Radio | Skip
111115 | 222 | 0 | 1 | 0
111113 | 225 | 1 | 0 | 0
111117 | 223 | 0 | 1 | 1
111115 | 225 | 1 | 0 | 0
11. Develop a MapReduce program to find the frequency of books published each year and find
in which year the maximum number of books were published, using the following data.
Title | Author | Published year | Author country | Language | No of pages
12. Develop a MapReduce program to analyze Titanic ship data and to find the average age of
the people (both male and female) who died in the tragedy, and how many persons survived in
each class.
The Titanic data has the following columns:
Column 1: PassengerId    Column 2: Survived (survived = 0 & died = 1)
Column 3: Pclass    Column 4: Name
Column 5: Sex    Column 6: Age
Column 7: SibSp    Column 8: Parch
Column 9: Ticket    Column 10: Fare
Column 11: Cabin    Column 12: Embarked
13. Develop a MapReduce program to analyze the Uber data set to find the days on which each
dispatching base has more trips, using the following dataset.
The Uber dataset consists of four columns:
dispatching_base_number | date | active_vehicles | trips
14. Develop a program to calculate the maximum recorded temperature year-wise for the
weather dataset in Pig Latin.
15. Write queries to sort and aggregate the data in a table using HiveQL.
16. Develop a Java application to find the maximum temperature using Spark.
Text Books:
1. Tom White, "Hadoop: The Definitive Guide", Fourth Edition, O'Reilly Media, 2015.
Reference Books:
1. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
2. Pete Warden, Big Data Glossary, O'Reilly, 2011.
3. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
4. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big
Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Publishing,
2012.
5. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge
University Press, 2012.
Course Outcomes:
Upon completion of the course, the students should be able to:
1. Configure Hadoop and perform File Management Tasks (L2)
2. Apply MapReduce programs to real-time problems like word count, the weather dataset and
the sales of a company (L3)
3. Critically analyze huge data sets using the Hadoop distributed file system and MapReduce
(L5)
4. Apply different data processing tools like Pig, Hive and Spark (L6)
1. Installation of Hadoop:
Hadoop software can be installed in three modes of operation:
• Stand-Alone Mode: Hadoop is distributed software and is designed to run on a cluster of
commodity machines. However, we can install it on a single node in stand-alone mode. In this mode,
Hadoop runs as a single monolithic Java process. This mode is extremely useful for
debugging purposes. You can first test-run your MapReduce application in this mode on small
data, before actually executing it on a cluster with big data.
• Pseudo-Distributed Mode: In this mode also, Hadoop is installed on a single node.
The various Hadoop daemons run on the same machine as separate Java processes. Hence
all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker and
TaskTracker (replaced by the YARN ResourceManager and NodeManager in Hadoop 2.x/3.x),
run on a single machine; they can be listed with the jps command, as sketched after this list.
• Fully Distributed Mode: In Fully Distributed Mode, the daemons NameNode, JobTracker,
SecondaryNameNode (Optional and can be run on a separate node) run on the Master Node.
The daemons DataNode and TaskTracker run on the Slave Node.
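Once the daemons are started in pseudo-distributed mode, they can be listed with the jps command. A sketch of typical output for a Hadoop 2.x/3.x single-node setup (process ids will differ):
$ jps
2481 NameNode
2603 DataNode
2789 SecondaryNameNode
2912 ResourceManager
3034 NodeManager
3156 Jps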
Download the latest Hadoop distribution: visit the Apache Hadoop releases page and choose one of the
mirror sites, or copy the download link and use the release page directly:
URL: https://hadoop.apache.org/release/3.1.3.html
This is the third stable release of the Apache Hadoop 3.1 line. It contains 246 bug fixes, improvements
and enhancements since 3.1.2.
Users are encouraged to read the overview of major changes since 3.1.2. For details of the bug fixes,
improvements, and other enhancements since the previous 3.1.2 release, please check the release
notes and changelog.
Download the tar.gz file and extract the files to the Windows C:\ drive.
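After downloading, the archive can be extracted and the installation verified from a terminal. A minimal sketch, assuming a Linux-style shell and an assumed install path (on Windows, extract the archive with any archiver and set the corresponding environment variables instead):
$ tar -xzf hadoop-3.1.3.tar.gz
$ export HADOOP_HOME=/opt/hadoop-3.1.3
$ export PATH=$PATH:$HADOOP_HOME/bin
$ hadoop version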
Practical-2:
AIM: Develop a MapReduce program to calculate the frequency of a given word in a
given file
Pre-requisite
o Java Installation - Check whether Java is installed or not using the following
command.
java -version
o Hadoop Installation - Check whether Hadoop is installed or not using the following
command.
hadoop version
Steps to execute MapReduce word count example
o Create a text file on your local machine and write some text into it, for example with nano:
nano data.txt
In this example, we find the frequency of each word that exists in this text file.
o Create a directory in HDFS where the text file will be kept.
$ hdfs dfs -mkdir /test
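The text file created above can then be copied into this HDFS directory (a minimal sketch using the standard put command):
$ hdfs dfs -put data.txt /test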
The entire MapReduce program can be fundamentally divided into three parts:
• Mapper Phase Code
• Reducer Phase Code
• Driver Code
Driver Code:
Configuration conf = new Configuration();
// Job.getInstance() replaces the deprecated Job constructor
Job job = Job.getInstance(conf, "My Word Count Program");
job.setJarByClass(WC_Count.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
//Configuring the input/output path from the filesystem into the job
Path outputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, outputPath);
// Submit the job and wait for it to finish
System.exit(job.waitForCompletion(true) ? 0 : 1);
File: WC_Count.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.Path;
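Only the imports of WC_Count.java are reproduced above. A minimal sketch of the Mapper and Reducer phase code, consistent with the driver code (which registers classes named Map and Reduce), would be:
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
// Split each input line into tokens and emit (word, 1) for every token
StringTokenizer tokenizer = new StringTokenizer(value.toString());
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException
{
// Sum all the 1s emitted for this word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
Once compiled and packaged into a jar (the jar name here is an assumption), the job can be run against the file uploaded earlier:
hadoop jar WC_Count.jar WC_Count /test/data.txt /r_output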
Execution:
o Now execute the command to see the output.
hdfs dfs -cat /r_output/part-00000
Practical-3:
Aim: Develop a MapReduce program to find the maximum temperature in each year.
1. The system receives temperatures of various cities (Austin, Boston, etc.) of the USA, captured at regular intervals of
time on each day, in an input file.
2. The system processes the input data file and generates a report with the maximum and minimum temperatures of
each day along with the time.
3. It generates a separate output report for each city, e.g.:
Austin-r-00000
Boston-r-00000
Newjersy-r-00000
Baltimore-r-00000
California-r-00000
Newyork-r-00000
Expected output: in each output file, a record should look like this: 25-Jan-2014 Time: 12:34:542 MinTemp: -22.3
Time: 05:12:345 MaxTemp: 35.7
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime
{
public static String calOutputName ="California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";
public static class WhetherForcastMapper extends Mapper<Object, Text, Text, Text>
{
// The map() signature and tokenizer setup were missing from the fragment; reconstructed here.
public void map(Object keyOffset, Text dayReport, Context con)
throws IOException, InterruptedException
{
StringTokenizer strTokens = new StringTokenizer(dayReport.toString(), "\t");
int counter = 0;
Float currnetTemp = null;
Float minTemp = Float.MAX_VALUE;
Float maxTemp = -Float.MAX_VALUE; // Float.MIN_VALUE is the smallest positive float, so use -MAX_VALUE as the initial maximum
String date = null;
String currentTime = null;
String minTempANDTime = null;
String maxTempANDTime = null;
while (strTokens.hasMoreElements())
{
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}
temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);
}
}
public static class WhetherForcastReducer extends
Reducer<Text, Text, Text, Text>
{
// The MultipleOutputs handle, setup() and the reduce() signature were missing from the fragment; reconstructed here.
MultipleOutputs<Text, Text> mos;
public void setup(Context context) {
mos = new MultipleOutputs<Text, Text>(context);
}
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException
{
int counter = 0;
String[] reducerInputStr = null;
String f1 = "", f1Time = "", f2 = "", f2Time = "";
Text result = null;
String fileName = "";
// Each key receives two values of the form "temperature AND time"
for (Text value : values)
{
if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}
else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}
counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2))
{
result = new Text("Time: " + f2Time + " MinTemp: " + f2 + "\t"
+ "Time: " + f1Time + " MaxTemp: " + f1);
} else {
result = new Text("Time: " + f1Time + " MinTemp: " + f1 + "\t"
+ "Time: " + f2Time + " MaxTemp: " + f2);
}
// Pick the named output file from the city prefix of the key (the CA/NY/NJ branches were missing and are reconstructed)
if (key.toString().substring(0, 2).equals("CA")) {
fileName = CalculateMaxAndMinTemeratureWithTime.calOutputName;
} else if (key.toString().substring(0, 2).equals("NY")) {
fileName = CalculateMaxAndMinTemeratureWithTime.nyOutputName;
} else if (key.toString().substring(0, 2).equals("NJ")) {
fileName = CalculateMaxAndMinTemeratureWithTime.njOutputName;
} else if (key.toString().substring(0, 3).equals("AUS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.ausOutputName;
} else if (key.toString().substring(0, 3).equals("BOS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.bosOutputName;
} else if (key.toString().substring(0, 3).equals("BAL")) {
fileName = CalculateMaxAndMinTemeratureWithTime.balOutputName;
}
String strArr[] = key.toString().split("_");
key.set(strArr[1]); //Key is date value
mos.write(fileName, key, result);
}
@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Wheather Statistics of USA");
job.setJarByClass(CalculateMaxAndMinTemeratureWithTime.class);
job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class,Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class,Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class,Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class,Text.class);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Path pathInput = new Path("hdfs://192.168.213.133:54310/weatherInputData/input_temp.txt");
Path pathOutputDir = new Path("hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3");
FileInputFormat.addInputPath(job, pathInput);
FileOutputFormat.setOutputPath(job,pathOutputDir);
try {
System.exit(job.waitForCompletion(true) ? 0 :1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Run the job from Eclipse (Run As -> Run on Hadoop). Wait for a moment and check whether the output directory is in place on HDFS. Execute the
following command to verify the same.
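Assuming the output path hard-coded in the driver above, the verification command would be along these lines:
hdfs dfs -ls hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3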
PRACTICAL-5:
AIM: Develop a MapReduce program to implement Matrix Multiplication.
import java.io.IOException;
import java.util.*;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public static class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, Text>
{
// The class header, reduce() signature and list declarations were missing from the fragment;
// reconstructed here (the class and parameter names are assumptions).
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException
{
List<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer, Float>>();
List<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer, Float>>();
String[] value;
for (Text val : values)
{
value = val.toString().split(",");
if (value[0].equals("A")) {
listA.add(new SimpleEntry<Integer,Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
} else {
listB.add(new SimpleEntry<Integer,Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
}
}
String i;
float a_ij;
String k;
float b_jk;
Text outputValue = new Text();
for (Entry<Integer, Float> a : listA)
{
i = Integer.toString(a.getKey());
a_ij = a.getValue();
for (Entry<Integer, Float> b : listB)
{
k = Integer.toString(b.getKey());
b_jk = b.getValue();
outputValue.set(i + "," + k + "," + Float.toString(a_ij*b_jk));
context.write(null, outputValue);
}
}
}
}
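The mapper and driver for this job are not reproduced above. A minimal mapper sketch, assuming input lines of the form A,i,j,value for matrix A and B,j,k,value for matrix B, which keys each element by the shared index j so that the reducer above receives the matching A and B entries together (the class name and record layout are assumptions):
public static class MatrixMultiplicationMapper extends Mapper<LongWritable, Text, Text, Text>
{
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
// Assumed record layout: matrixName,rowIndex,columnIndex,value
String[] parts = value.toString().split(",");
Text outKey = new Text();
Text outValue = new Text();
if (parts[0].equals("A")) {
// Element a_ij: key by the column index j, carry the row index i and the value
outKey.set(parts[2]);
outValue.set("A," + parts[1] + "," + parts[3]);
} else {
// Element b_jk: key by the row index j, carry the column index k and the value
outKey.set(parts[1]);
outValue.set("B," + parts[2] + "," + parts[3]);
}
context.write(outKey, outValue);
}
}
Note that the reducer above emits (i, k, partial product) triples; a further aggregation step is still required to sum the partial products for each output cell.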
PRACTICAL-6:
AIM: HDFS file management tasks.
copyFromLocal
Usage:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Similar to the put command, except that the source is restricted to
a local file reference.
copyToLocal
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is
restricted to a local file reference.
7. Move a file from source to destination.
Note: Moving files across filesystems is not permitted.
Usage:
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
8. Remove a file or directory in HDFS.
Removes the files specified as arguments. Deletes a directory only
when it is empty.
Usage:
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt
Recursive version of delete:
Usage:
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/
9. Display last few lines of a file.
Similar to tail command in Unix.
Usage :
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt
10. Display the aggregate length of a file.
Usage :
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt
PRACTICAL-7
AIM: Pig Latin scripts to sort, group, join, project, and filter your data.
ORDER BY
Sorts a relation based on one or more fields.
Note: ORDER BY is NOT stable; if multiple records have the same ORDER
BY key, the order in which these records are returned is not defined and is
not guaranteed to be the same from one run to the next.
Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
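An ORDER BY over relation A could then be written as follows (a minimal sketch; the sort field is chosen arbitrarily):
X = ORDER A BY a3 DESC;
DUMP X;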
RANK
Returns each tuple with the rank within a relation.
When specifying no field to sort on, the RANK operator simply prepends a
sequential value to each tuple.
Otherwise, the RANK operator uses each field (or set of fields) to sort the
relation. The rank of a tuple is one plus the number of different rank
values preceding it. If two or more tuples tie on the sorting field values,
they will receive the same rank.
In this example, the RANK operator does not change the order of the
relation and simply prepends to each tuple a sequential value.
B = rank A;
alias = RANK alias [ BY { * [ASC|DESC] | field_alias [ASC|DESC] [,
field_alias [ASC|DESC] …] } [DENSE] ];
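The other operations named in the aim (group, join, project and filter) are not illustrated above. Minimal Pig Latin sketches over relation A follow; the second relation B2 and its data file are assumptions:
-- FILTER: keep only the tuples whose third field is greater than 3
F = FILTER A BY a3 > 3;
-- GROUP: collect the tuples of A by the value of the first field
G = GROUP A BY a1;
-- FOREACH/GENERATE (projection): keep only the first two fields
P = FOREACH A GENERATE a1, a2;
-- JOIN: join A with an assumed second relation B2 on matching key fields
B2 = LOAD 'data2' AS (b1:int, b2:chararray);
J = JOIN A BY a1, B2 BY b1;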
PRACTICAL-8
AIM: Hive Databases, Tables, Views, Functions and Indexes
Databases in Hive
We can override this default location for the new directory as shown in this example:
hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';
We can add a descriptive comment to the database, which will be shown by the DESCRIBE DATABASE
<database> command.
The USE command sets a database as your working database, analogous to changing working directories in a
filesystem:
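For example (a minimal sketch):
hive> USE financials;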
We can set key-value pairs in the DBPROPERTIES associated with a database using the ALTER DATABASE
command. No other metadata about the database can be changed, including its name and directory location:
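A minimal sketch of such a property change (the property name and value are assumptions):
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'CSE Lab');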
The CREATE TABLE statement follows SQL conventions, but Hive's version offers significant extensions to
support a wide range of flexibility in where the data files for tables are stored, the formats used, etc.
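A minimal CREATE TABLE sketch in this style (the table and column names are assumptions):
hive> CREATE TABLE IF NOT EXISTS financials.employees (
    >   name STRING COMMENT 'Employee name',
    >   salary FLOAT COMMENT 'Employee salary')
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;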
PRACTICAL-9:
Develop a program to calculate the maximum recorded temperature year-wise for the
weather dataset in Pig Latin
Aim: Calculate the maximum recorded temperature by year for the weather dataset in Pig
Latin
Start up Grunt in local mode, then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
Dumping the relation (grunt> DUMP records;) writes one tuple per line, where tuples are represented as comma-separated items
in parentheses:
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
We can also see the structure of a relation (the relation's schema) using the DESCRIBE operator on
the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
Output:
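The remaining steps of the script and the resulting output are not reproduced in the manual. A minimal continuation consistent with the loaded sample data, following the standard group-and-aggregate approach, would be:
grunt> grouped_records = GROUP records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group, MAX(records.temperature);
grunt> DUMP max_temp;
(1949,111)
(1950,22)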
PRACTICAL-10: Develop a Java application to find the maximum temperature using Spark.
import re
import sys
from pyspark import SparkContext  # needed for the SparkContext created below

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

#Create Spark Context with the master details and the application name
sc = SparkContext("spark://bigdata-vm:7077", "max_temperature")

#Read the weather data from HDFS
weatherData = sc.textFile(logFile)

#Transform the data to extract/filter and then find the max temperature per year
max_temperature_per_year = weatherData.flatMap(extractData).reduceByKey(
    lambda a, b: a if int(a) > int(b) else b)
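The extractData function referenced above is not shown in the manual. A minimal sketch, to be defined before the transformation above, assuming each record carries the year and the temperature as its first two whitespace-separated fields:

def extractData(line):
    # Assumed record layout: year and temperature as the first two whitespace-separated fields
    fields = re.split(r"\s+", line.strip())
    if len(fields) >= 2:
        return [(fields[0], fields[1])]
    return []

The per-year maxima can then be written back to HDFS with max_temperature_per_year.saveAsTextFile() or inspected with collect().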