

A Laboratory Manual for

BIG DATA ANALYTICS


(3170722)

B.E. Semester 7
(CSE)

Directorate of Technical Education, Gandhinagar, Gujarat

Government Engineering College, Patan

Certificate

This is to certify that Mr./Ms. _______________________________ Enrollment No. _______________ of B.E. Semester ____ Computer Science Engineering of this Institute (GTU Code: 022) has satisfactorily completed the Practical / Tutorial work for the subject BIG DATA ANALYTICS (3170722) for the academic year 2024-25.

Place:
Date:

Name and Sign of Faculty member

Head of the Department



Preface

The main purpose of any laboratory/practical/field work is to enhance the required skills as well as to create the ability amongst students to solve real-time problems by developing relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for engineering degree programs in which sufficient weightage is given to practical work. It underlines the importance of enhancing skills amongst students and pays attention to utilizing every second of the time allotted for practicals amongst students, instructors and faculty members to achieve the relevant outcomes by performing experiments rather than merely conducting study-type experiments. For the effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical is keenly designed to serve as a tool to develop and enhance the relevant industry-required competency in every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board content delivery method in the classroom. Accordingly, this lab manual is designed to focus on industry-defined relevant outcomes, rather than the old practice of conducting practicals merely to prove concepts and theory.

By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea before performing the experiment. This in turn enhances the predetermined outcomes amongst students. Each experiment in this manual begins with the competency, industry relevant skills, course outcomes and practical outcomes (objectives). The students will also learn the safety measures and necessary precautions to be taken while performing the practical.

This manual also provides guidelines to faculty members to facilitate student-centric lab activities for each experiment by arranging and managing the necessary resources, so that the students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also gives an idea of how students will be assessed, by providing rubrics.

Big Data Analytics is a course which deals with the problems associated with processing very large datasets and the platforms used to solve them. It provides a platform for students to work with the Hadoop ecosystem (HDFS, MapReduce and YARN), Hive, Spark and NoSQL databases such as MongoDB. Students also learn how to store, process and analyze large datasets using open-source frameworks and explore data mining over data streams.

Utmost care has been taken while preparing this lab manual; however, there is always a chance for improvement. Therefore, we welcome constructive suggestions for improvement and the removal of errors, if any.

Practical – Course Outcome matrix

Course Outcomes (COs):


CO-1: Identify the problems associated with Big Data processing and its application
areas.
CO-2: Identify and use Big data processing framework like Hadoop.
CO-3: Analyze data by applying selected techniques using NoSQL and SPARK.
CO-4: Explore various Data mining streams.
CO-5: Study open-source frameworks used to efficiently store and process large
datasets.
Sr. No. | Objective(s) of Experiment | CO1 CO2 CO3 CO4 CO5
1. Make a single node cluster in Hadoop. (√)
2. Run Word count program in Hadoop with 250 MB size of Data Set. (√)
3. Understand the Logs generated by MapReduce program. (√)
4. Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs. (√)
5. Develop Map Reduce Application to Sort a given file and do aggregation on some parameters. (√)
6. Download any two Big Data Sets from authenticated website. (√)
7. Explore Spark and Implement a Word count application using Spark. (√)
8. Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive. (√)
9. Implementation of Matrix algorithms in Spark Sql programming. (√)
10. Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis. (√)
11. Explore NoSQL database like MongoDB and perform basic CRUD operation. (√)
12. Case study based on the concept of Big Data Analytics. Prepare presentation in the group of 4. Submit PPTs. (√)

Industry Relevant Skills

The following industry relevant competencies are expected to be developed in the student by
undertaking the practical work of this laboratory.
1. Apply knowledge of Big Data Analytics to solve real world problems
2. Big data analysts detect and analyze actionable data, such as hidden trends and
patterns. By fusing these findings with their in-depth knowledge of the market in
which their organizations operate, they can help leaders formulate informed strategic
business decisions.

Guidelines for Faculty members


1. Teachers should provide guidelines and a demonstration of the practical to the students, covering all its features.
2. Teachers shall explain the basic concepts/theory related to the experiment to the students before starting each practical.
3. Involve all the students in the performance of each experiment.
4. Teachers are expected to share the skills and competencies to be developed in the students and to ensure that the respective skills and competencies are developed after completion of the experimentation.
5. Teachers should give students the opportunity for hands-on experience after the demonstration.
6. Teachers may provide additional knowledge and skills to the students that are not covered in the manual but are expected from the students by the concerned industry.
7. Give practical assignments and assess the performance of students based on the tasks assigned, to check whether they are as per the instructions or not.
8. Teachers are expected to refer to the complete curriculum of the course and follow the guidelines for implementation.

Instructions for Students


1. Students are expected to carefully listen to all the theory classes delivered by the faculty members and understand the COs, content of the course, teaching and examination scheme, skill set to be developed, etc.
2. Students shall organize the work in groups and keep a record of all observations.
3. Students shall develop the maintenance skills expected by industry.
4. Students shall attempt to develop related hands-on skills and build confidence.
5. Students shall develop the habit of evolving more ideas, innovations, skills, etc. beyond those included in the scope of the manual.
6. Students shall refer to technical magazines and data books.
7. Students should develop the habit of submitting the experiment work as per the schedule and should be well prepared for the same.

Index
(Progressive Assessment Sheet)

Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Teacher with date | Remarks
1 | Make a single node cluster in Hadoop.
2 | Run Word count program in Hadoop with 250 MB size of Data Set.
3 | Understand the Logs generated by MapReduce program.
4 | Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.
5 | Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.
6 | Download any two Big Data Sets from authenticated website.
7 | Explore Spark and Implement a Word count application using Spark.
8 | Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.
9 | Implementation of Matrix algorithms in Spark Sql programming.
10 | Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis.
11 | Explore NoSQL database like MongoDB and perform basic CRUD operation.
12 | Case study based on the concept of Big Data Analytics. Prepare presentation in the group of 4. Submit PPTs.
Total
Experiment No: 01

Make a single node cluster in Hadoop


Date:

Competency and Practical Skills: Core Java and SQL

Relevant CO: 2
Pre-requisites: Basic understanding of Core Java and SQL.

Objectives:

(a) To understand about the Hadoop environment


(b) To learn to install a single node Hadoop cluster

Theory:

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop is one of the top platforms for business data processing and analysis, and its significant benefits include:

1. Scalable: Businesses can process and get actionable insights from petabytes of data.
2. Flexible: It provides access to multiple data sources and data types.
3. Agile: Parallel processing and minimal data movement allow substantial amounts of data to be processed with speed.
4. Adaptable: It supports a variety of coding languages, including Python, Java, and C++.

Hadoop is a collection of multiple tools and frameworks to manage, store, process and analyze big data effectively. HDFS acts as a distributed file system that stores large datasets across commodity hardware. YARN is the Hadoop resource manager that handles a cluster of nodes, allocating CPU, memory, and other resources depending on the application requirements.

Procedure:

Step 1: Verify the Java installed

javac -version

Step 2: Extract Hadoop at C:\Hadoop

Step 3: Setting up the HADOOP_HOME variable

Use the Windows environment variable settings to set the Hadoop path (HADOOP_HOME).

Step 4: Set JAVA_HOME variable

Use the Windows environment variable settings to set the Java path (JAVA_HOME).


Step 5: Set Hadoop and Java bin directory path

Step 6: Hadoop configuration

For Hadoop configuration we need to modify the five files listed below and then create two folders, datanode and namenode:
1. core-site.xml
2. mapred-site.xml
3. hdfs-site.xml
4. yarn-site.xml
5. hadoop-env.cmd
6. Create two folders: datanode and namenode

Step 6.1: Core-site.xml configuration

<configuration>
<property>
<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 6.2: Mapred-site.xml configuration

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 6.3: Hdfs-site.xml configuration

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>

Step 6.4: Yarn-site.xml configuration

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>


</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 6.5: Hadoop-env.cmd configuration

Set "JAVA_HOME=C:\Java" (On C:\java this is path to file jdk.18.0)

Step 6.6: Create datanode and namenode folders

1. Create folder "data" under "C:\Hadoop-2.8.0"


2. Create folder "datanode" under "C:\Hadoop-2.8.0\data"
3. Create folder "namenode" under "C:\Hadoop-2.8.0\data"


Step 7: Format the namenode folder

Open a command window (cmd) and type the command "hdfs namenode -format".

Step 8: Testing the setup

Open a command window (cmd) and type the command "start-all.cmd".

Step 8.1: Testing the setup:

Ensure that the NameNode, DataNode, and ResourceManager are running.

Step 9: Open: http://localhost:8088


Step 10:

Open: http://localhost:50070
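As a quick sanity check after start-all.cmd, the two web UIs can also be probed from a short Python script. This is only a convenience sketch: it assumes the default ports used above (8088 for the ResourceManager and 50070 for the NameNode in Hadoop 2.x) and that Python 3 is installed on the machine.

import urllib.request

# Default web UI ports for the single node setup described above.
for name, url in [("ResourceManager", "http://localhost:8088"),
                  ("NameNode", "http://localhost:50070")]:
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            print(name, "is reachable, HTTP status:", response.status)
    except Exception as error:
        print(name, "is NOT reachable:", error)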

Conclusion:

Hadoop is a widely used Big Data technology for storing, processing, and analyzing large datasets. After performing this practical, you should understand how Big Data evolved, the challenges it brought with it, and how to set up a single node Hadoop cluster.

Quiz:

1. Explain big data and list its characteristics.


2. What are the basic differences between relational database and HDFS?
3. Explain Hadoop. List the core components of Hadoop
4. Explain the Storage Unit In Hadoop
5. Compare the main differences between HDFS (Hadoop Distributed File System ) and
Network Attached Storage(NAS) ?

Suggested Reference:

 https://www.simplilearn.com/tutorials/hadoop-tutorial
 https://olympus.mygreatlearning.com/courses/10977/pages/hadoop-framework-stepping-
into-hadoop?module_item_id=449010
 https://olympus.mygreatlearning.com/courses/10977/pages/hdfs-what-and-
why?module_item_id=449011
 https://hadoop.apache.org/


• https://intellipaat.com/blog/tutorial/hadoop-tutorial/hdfs-operations/
• http://hadooptutorial.info/formula-to-calculate-hdfs-nodes-storage/

References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 02

Run Word count program in Hadoop with 250 MB size of Data Set
Date:

Competency and Practical Skills: Core Java, Hadoop and SQL

Relevant CO: 2
Pre-requisites:
• Java Installation - Check whether Java is installed using the javac -version command.
• Hadoop Installation - Check whether Hadoop is installed using the hadoop version command.

Objectives:

(a) To understand about the Hadoop environment


(b) To understand about the MapReduce

Theory:

MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The second is the reduce job, which takes the output of a map as its input and combines those data tuples into a smaller set of tuples.

MapReduce programming offers several benefits to help you gain valuable insights from your big
data:

1. Scalability. Businesses can process petabytes of data stored in the Hadoop Distributed File
System (HDFS).
2. Flexibility. Hadoop enables easier access to multiple sources of data and multiple types of
data.
3. Speed. With parallel processing and minimal data movement, Hadoop offers fast
processing of massive amounts of data.
4. Simple. Developers can write code in a choice of languages, including Java, C++ and
Python.


Usage of MapReduce

 It can be used in various application like document clustering, distributed sorting, and web
link-graph reversal.
 It can be used for distributed pattern-based searching.
 We can also use MapReduce in machine learning.
 It was used by Google to regenerate Google's index of the World Wide Web.
 It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environment.

Procedure:

Steps to execute MapReduce word count example

 Create a text file in your local machine and write some text into it.
$ nano data.txt

Check the text written in the data.txt file.


$ cat data.txt


We will find out the frequency of each word in this text file.

• Create a directory in HDFS where the text file will be kept.
$ hdfs dfs -mkdir /test

• Upload the data.txt file to HDFS in that directory.
$ hdfs dfs -put /home/codegyani/data.txt /test

Write the MapReduce program using Eclipse.

File: WC_Mapper.java

package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

File: WC_Reducer.java

package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

File: WC_Runner.java

package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
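The same word count logic can also be written in Python and run through Hadoop Streaming instead of compiling the Java classes above. The two scripts below are only an illustrative alternative sketch; the streaming jar location and script names are assumptions and differ between installations.

File: wc_mapper.py

#!/usr/bin/env python3
# Mapper: emit "word<TAB>1" for every word on every line read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

File: wc_reducer.py

#!/usr/bin/env python3
# Reducer: input arrives sorted by key, so counts for the same word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A typical invocation (paths assumed) passes both scripts to the hadoop-streaming jar with -mapper wc_mapper.py -reducer wc_reducer.py -input /test/data.txt -output /test/output_py.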

Conclusion:

Quiz:

1. What is MapReduce?

2. What is Shuffling and Sorting in MapReduce?

3. Explain what is distributed Cache in MapReduce Framework?

Suggested Reference:
 https://www.udemy.com/topic/mapreduce/
 https://www.simplilearn.com/tutorials/hadoop-tutorial/mapreduce
 https://www.javatpoint.com/mapreduce
 https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617
 https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 03

Understand the Logs generated by MapReduce program

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill

Relevant CO: 2

Objective:

1. To understand the logs generated by a MapReduce program

Theory:

Log files are an essential troubleshooting tool during testing and production and contain important runtime information about the general health of workload daemons and system services. IBM Spectrum Symphony, for example, uses the log4j logging framework for MapReduce logging, and its MapReduce framework writes its log files under the $PMR_HOME/logs/ directory.


Procedure:

Logs & file location

• The default directory for Hadoop log files is $HADOOP_HOME/logs.

A MapReduce program generates several parts of the log during execution:

1. Task progress related logs:

• This part of the log shows process-related information, such as the number of processes, the number of splits, and the progress of the map & reduce tasks.

2. File system related logs:

• This part contains file-related information, for example the number of bytes read & written and the number of read operations.

3. Job related logs:

• This part contains information related to the job that is being run, such as the number of launched map & reduce tasks, the total time spent by map & reduce tasks, and the total memory taken by map & reduce tasks.

4. Map-Reduce related logs:

• This part contains information related to the Map-Reduce tasks, such as shuffled maps, failed maps, peak virtual and physical memory for map & reduce tasks, CPU time spent, and total elapsed time.

5. Logs related to errors:

• These parts indicate logs related to errors that occurred during processing.

• Information regarding the job on the Cluster Manager:
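To make these counter sections easier to compare across runs, the console output of a job can be redirected to a file and skimmed with a few lines of Python. This is only a rough sketch: it assumes the log was saved locally as job_output.log and that counters appear in the usual "Counter name=value" layout printed by the Hadoop job client.

# Extract "name=value" counter lines (e.g. "Map input records=1024") from a saved job log.
import re

counters = {}
with open("job_output.log") as log_file:  # assumed local copy of the job's console output
    for line in log_file:
        match = re.search(r"([A-Za-z][A-Za-z ()\-]+)=(\d+)\s*$", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))

for name in ("Map input records", "Reduce output records", "Launched map tasks"):
    print(name, "->", counters.get(name, "not found"))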

Conclusion:

Quiz:

1. What are the challenges in processing log files?


2. Explain the log file processing architecture.
3. What are the log file types?

Suggested Reference:
 http://hadooptutorial.info/log-analysis-hadoop/
 https://www.edureka.co/blog/mapreduce-tutorial/
 https://www.researchgate.net/publication/281768670_Driving_Big_Data_with_Hadoop_T
echnologies/figures?lo=1
 https://stackoverflow.com/questions/23689653/how-to-process-a-log-file-using-
mapreduce


References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 04

Run two different Data sets/Different size of Datasets on Hadoop and Compare
the Logs

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill

Relevant CO: 2

Objective:

1. To experiment with datasets of varying sizes on Hadoop and compare the logs.

Theory:
The MapReduce Join operation is used to combine two large datasets. However, this process involves writing a lot of code to perform the actual join. Joining two datasets begins by comparing the size of each dataset; if one dataset is smaller than the other, the smaller dataset is distributed to every data node in the cluster.
Types of Join

Depending upon the place where the actual join is performed, joins in Hadoop are classified into-

1. Map-side join – When the join is performed by the mapper, it is called a map-side join. In this type, the join is performed before the data is actually consumed by the map function. It is mandatory that the input to each map is in the form of a partition and is in sorted order. Also, there must be an equal number of partitions, and the data must be sorted by the join key.

2. Reduce-side join – When the join is performed by the reducer, it is called a reduce-side join. There is no necessity in this join to have the dataset in a structured form (or partitioned).

Procedure:

First Dataset: Wikipedia Article dataset


 Program: Word Count
 Dataset Size: 900MB
 Logs from Command Line:

• Logs from Cluster Manager:

• From the logs we can see that the dataset of around 920 MB takes about 2 minutes to run the word count job.

Second Dataset: Movie Reviews dataset


 Program: Word Count
 Dataset Size: 3.65GB
 Logs from Command Line:


• Logs from Cluster Manager:

• From the logs we can see that the 3.65 GB dataset takes around 5.5 minutes to run the word count job.

Conclusion:
Running the same MapReduce job on two datasets of different sizes shows that the time taken to run the job increases as the dataset gets bigger.

Quiz:

1. What Is Hadoop? Components of Hadoop and How Does It Work


2. Who Uses Hadoop?
3. How Hadoop Improves on Traditional Databases

Suggested Reference:

 https://www.integrate.io/blog/process-small-data-with-hadoop/
 https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/
 https://www.simplilearn.com/tutorials/hadoop-tutorial/hive
 https://www.simplilearn.com/tutorials/hadoop-tutorial/what-is-hadoop


References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 05

Develop Map Reduce Application to Sort a given file and do aggregation on


some parameters

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill

Relevant CO: 2

Objective:

1. To develop a MapReduce application for sorting and to understand aggregation functions

Equipment/Instruments: Ensure that Hadoop is installed, configured and is running. More


details:

 Single Node Setup for first-time users.


 Cluster Setup for large, distributed clusters.

Theory:

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System are running on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where data is
already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and an MRAppMaster per application.

MapReduce should be thought of as a ubiquitous sorting tool, since by design it sorts all the map output records (using the map output keys) so that all the records that reach a single reducer are sorted; this sorting happens during the shuffle phase of the job.


Procedure:

Given that MapReduce already performs sorting between the map and reduce phases, sorting files can be accomplished with an identity function (one where the inputs to the map and reduce phases are emitted directly). This is in fact what the sort example bundled with Hadoop does. You can look at how the example works by examining the org.apache.hadoop.examples.Sort class. To use this example code to sort text files in Hadoop, you would use it as follows:

shell$ export HADOOP_HOME=/usr/lib/hadoop

shell$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples.jar sort \
  -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -outFormat org.apache.hadoop.mapred.TextOutputFormat \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  /hdfs/path/to/input \
  /hdfs/path/to/output

The same sort can also be accomplished with the hadoop-utils Sort as follows:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
  com.alexholmes.hadooputils.sort.Sort \
  /hdfs/path/to/input \
  /hdfs/path/to/output

To bring sorting in MapReduce closer to the Linux sort, the --key and --field-separator options can be used to specify one or more columns that should be used for sorting, as well as a custom separator (whitespace is the default). For example, imagine you had a file in HDFS called /input/300names.txt which contained first and last names:

shell$ hadoop fs -cat 300names.txt | head -n 5

roy franklin
mario gardner
willis romero
max wilkerson
latoya larson

To sort on the last name you would run:

shell$ $HADOOP_HOME/bin/hadoop jar hadoop-utils-<version>-jar-with-dependencies.jar \
  com.alexholmes.hadooputils.sort.Sort \
  --key 2 \
  /input/300names.txt \
  /hdfs/path/to/output

The syntax of --key is pos1[,pos2], where the first position (pos1) is required and the second position (pos2) is optional; if it is omitted, then pos1 through the rest of the line is used for sorting.
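The same idea of letting the MapReduce shuffle do the sorting can be sketched with a tiny Hadoop Streaming mapper in Python that emits the chosen column (here the second field, the last name) as the key, with an identity reducer such as /bin/cat. The script below only illustrates the technique; it is not part of the hadoop-utils package used above.

File: sort_key_mapper.py

#!/usr/bin/env python3
# Emit "<last name><TAB><original line>" so the framework's shuffle sorts by last name.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    fields = line.split()
    if len(fields) >= 2:
        print(fields[1] + "\t" + line)  # fields[1] is the second column, like --key 2 above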

Conclusion:

Quiz:
1. What are the configuration parameters required to be specified in MapReduce?
2. What do you mean by a combiner?
3. What is meant by JobTracker?
4. What do you mean by InputFormat? What are the types of InputFormat in MapReduce?

Suggested Reference:

 https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
 https://www.geeksforgeeks.org/hadoop-reducer-in-map-reduce/
 https://medium.com/edureka/mapreduce-tutorial-3d9535ddbe7c
 https://www.ibm.com/docs/en/spectrum-symphony/7.3.1?topic=sampleapp-using-hash-


based-grouping-data-aggregation
 https://www.talend.com/resources/what-is-mapreduce/

References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 06

Download any two Big Data Sets from authenticated website

Date:

Competency and Practical Skills: Data Modeling, Analytical skill

Relevant CO: 2

Objective:

1. To explore authenticated datasets from websites for future use.

Theory

The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs.

Put simply, big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can't manage them. But these massive volumes of data can be used to address business problems you wouldn't have been able to tackle before.

The three Vs of big data

Volume: The amount of data matters. With big data, you'll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.

Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk. Some internet-enabled smart products operate in real time or near real time and will require real-time evaluation and action.

Variety: Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semistructured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.


The value (and truth) of big data

Two more Vs have emerged over the past few years: value and veracity. Data has intrinsic value. But it's of no use until that value is discovered. Equally important: how truthful is your data, and how much can you rely on it?

Today, big data has become capital. Think of some of the world's biggest tech companies. A large part of the value they offer comes from their data, which they're constantly analyzing to produce more efficiency and develop new products.

Recent technological breakthroughs have exponentially reduced the cost of data storage and compute, making it easier and less expensive to store more data than ever before. With an increased volume of big data now cheaper and more accessible, you can make more accurate and precise business decisions.

Finding value in big data isn't only about analyzing it (which is a whole other benefit). It's an entire discovery process that requires insightful analysts, business users, and executives who ask the right questions, recognize patterns, make informed assumptions, and predict behavior.

Procedure:

1. Yelp Dataset

This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
Website: https://www.yelp.com/dataset


2. Kaggle
Website: https://www.kaggle.com/datasets

Kaggle offers a no-setup, customizable, Jupyter Notebooks environment. Access GPUs at no cost
to you and a huge repository of community published data & code.

Inside Kaggle you'll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time.

3. UCI Machine Learning Repository

The repository currently maintains 622 data sets as a service to the machine learning community. You may view all data sets through the searchable interface. For a general overview of the Repository, please visit the about page. For information about citing data sets in publications, please read the citation policy. If you wish to donate a data set, please consult the donation policy. For any other questions, feel free to contact the Repository librarians.

The UCI Machine Learning Repository is a collection of databases, domain theories, and data
generators that are used by the machine learning community for the empirical analysis of machine
learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow
graduate students at UC Irvine. Since that time, it has been widely used by students, educators,
and researchers all over the world as a primary source of machine learning data sets. As an
indication of the impact of the archive, it has been cited over 1000 times, making it one of the top
100 most cited "papers" in all of computer science. The current version of the web site was
designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration
with Rexa.info at the University of Massachusetts Amherst. Funding support from the National
Science Foundation is gratefully acknowledged.

Website: http://archive.ics.uci.edu/ml/index.php
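Once a dataset has been chosen from one of these repositories, the download itself can be scripted. The sketch below uses only Python's standard library and a well-known UCI example file (the Iris data set); the exact URL is an assumption and should be verified on the UCI site before use.

import urllib.request

# Assumed direct link to a small, well-known UCI dataset; check it on the UCI website first.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
local_file = "iris.data"

urllib.request.urlretrieve(url, local_file)

with open(local_file) as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        print(line.strip())  # preview the first few records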

Conclusion:

Quiz:

1. What is Big Data, and where does it come from? How does it work?
2. What are the 5 V‘s in Big Data?
3. How can big data analytics benefit business?
4. What are some of the challenges that come with a big data project?
5. What are the key steps in deploying a big data platform?

Suggested Reference:

 https://archive.ics.uci.edu/ml/index.php
 https://www.kaggle.com/
 https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
 https://www.oracle.com/in/big-data/what-is-big-data/
 https://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html


References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 07

Explore Spark and Implement Word count application using Spark

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Spark

Relevant CO: 3

Objective:

1. To install Spark and perform small practical

Theory:

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing that increases the processing speed of an
application.

Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workload in a
respective system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.

Features of Apache Spark

Apache Spark has the following features:

• Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.
• Advanced analytics − Spark not only supports 'Map' and 'Reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.

Spark Built on Hadoop

The following diagram shows three ways of how Spark can be built with Hadoop components.

There are three ways of Spark deployment as explained below.

 Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS(Hadoop Distributed File System) and space is allocated for HDFS, explicitly. Here,
Spark and MapReduce will run side by side to cover all spark jobs on cluster.
 Hadoop Yarn − Hadoop Yarn deployment means, simply, spark runs on Yarn without any
pre-installation or root access required. It helps to integrate Spark into Hadoop ecosystem
or Hadoop stack. It allows other components to run on top of stack.
 Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch spark job in
addition to standalone deployment. With SIMR, user can start Spark and uses its shell
without any administrative access.


Components of Spark

The following illustration depicts the different components of Spark.

Apache Spark Core

Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data.

MLlib (Machine Learning Library)

MLlib is a distributed machine learning framework on top of Spark that takes advantage of the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).


Procedure:

Word Count Program:


package com.journaldev.sparkdemo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class WordCounter {

    private static void wordCount(String fileName) {

        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        JavaRDD<String> inputFile = sparkContext.textFile(fileName);
        JavaRDD<String> wordsFromFile = inputFile.flatMap(content -> Arrays.asList(content.split(" ")));

        JavaPairRDD<String, Integer> countData =
                wordsFromFile.mapToPair(t -> new Tuple2<>(t, 1)).reduceByKey((x, y) -> x + y);

        countData.saveAsTextFile("CountData");
    }

    public static void main(String[] args) {

        if (args.length == 0) {
            System.out.println("No files provided.");
            System.exit(0);
        }

        wordCount(args[0]);
    }
}
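The same word count can be written much more compactly in Python with PySpark. The following is a minimal sketch, not a replacement for the Java version above; it assumes PySpark is installed and that the input file and output directory are passed on the command line.

File: wordcounter.py

import sys
from pyspark import SparkConf, SparkContext

# Usage (assumed): spark-submit wordcounter.py <input_file> <output_dir>
conf = SparkConf().setMaster("local").setAppName("PySpark Word Counter")
sc = SparkContext(conf=conf)

lines = sc.textFile(sys.argv[1])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))

counts.saveAsTextFile(sys.argv[2])
sc.stop()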

Conclusion:

Quiz:

1. What is Apache Spark?


2. Explain the key features of Spark.
3. What does a Spark Engine do?
4. Define Actions in Spark.
5. What are the languages supported by Apache Spark and which is the most popular one?


Suggested Reference:

 https://spark.apache.org/
 https://www.infoworld.com/article/3236869/what-is-apache-spark-the-big-data-platform-
that-crushed-hadoop.html
 https://aws.amazon.com/big-data/what-is-spark/
 https://en.wikipedia.org/wiki/Apache_Spark
 https://www.databricks.com/spark/about
 https://www.tutorialspoint.com/apache_spark/apache_spark_introduction.htm

References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 08

Creating the HDFS tables and loading them in Hive and learn joining of tables
in Hive
Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Hadoop/MapReduce

Relevant CO: 5

Objective:

1. To install Hive and perform join queries.

Theory:

HDFS is a distributed file system that handles large data sets running on commodity hardware. It
is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS
is one of the major components of Apache Hadoop, the others being MapReduce and YARN.
HDFS should not be confused with or replaced by Apache HBase, which is a column-oriented
non-relational database management system that sits on top of HDFS and can better support real-
time data needs with its in-memory processing engine.

The goals of HDFS:

Fast recovery from hardware failures

Because one HDFS instance may consist of thousands of servers, failure of at least one server is
inevitable. HDFS has been built to detect faults and automatically recover quickly.

Access to streaming data

HDFS is intended more for batch processing versus interactive use, so the emphasis in the design
is for high data throughput rates, which accommodate streaming access to data sets.

Accommodation of large data sets

HDFS accommodates applications that have data sets typically gigabytes to terabytes in size.
HDFS provides high aggregate data bandwidth and can scale to hundreds of nodes in a single
cluster.


Portability

To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to
be compatible with a variety of underlying operating systems.

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar
to most other existing file systems; one can create and remove files, move a file from one
directory to another, or rename a file. HDFS does not yet implement user quotas. HDFS does not
support hard links or soft links. However, the HDFS architecture does not preclude implementing
these features.

The NameNode maintains the file system namespace. Any change to the file system namespace or
its properties is recorded by the NameNode. An application can specify the number of replicas of
a file that should be maintained by HDFS. The number of copies of a file is called the replication
factor of that file. This information is stored by the NameNode.


Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size. The
blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file. The replication
factor can be specified at file creation time and can be changed later. Files in HDFS are write-
once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.

Nodes: Master and slave nodes typically form the HDFS cluster.

1. NameNode (master node):
   o Manages all the slave nodes and assigns work to them.
   o It executes file system namespace operations like opening, closing and renaming files and directories.
   o It should be deployed on reliable, high-configuration hardware, not on commodity hardware.

2. DataNode (slave node):
   o Actual worker nodes, which do the actual work like reading, writing and processing.
   o They also perform creation, deletion and replication upon instruction from the master.
   o They can be deployed on commodity hardware.

HDFS daemons: Daemons are the processes running in background.

 Namenodes:
o Run on the master node.
o Store metadata (data about data) like file path, the number of blocks, block Ids. etc.
o Require high amount of RAM.
o Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a
persistent copy of it is kept on disk.
 DataNodes:
o Run on slave nodes.
o Require high memory as data is actually stored here.

Procedure:

Create a folder on HDFS under the /user/cloudera HDFS path:

javachain~hadoop]$ hadoop fs -mkdir javachain


Move the text file from the local file system into the newly created folder called javachain:

javachain~hadoop]$ hadoop fs -put ~/Desktop/student.txt javachain/

Create Empty table STUDENT in HIVE

hive> create table student
    > (std_id int,
    > std_name string,
    > std_grade string,
    > std_addres string)
    > partitioned by (country string)
    > row format delimited
    > fields terminated by ','
    > ;
OK
Time taken: 0.349 seconds

Load Data from HDFS path into HIVE TABLE.

hive> load data inpath 'javachain/student.txt' into table student partition(country='usa');


Loading data to table default.student partition (country=usa)
chgrp: changing ownership of
'hdfs://quickstart.cloudera:8020/user/hive/warehouse/student/country=usa/student.txt': User does
not belong to hive
Partition default.student{country=usa} stats: [numFiles=1, numRows=0, totalSize=120,
rawDataSize=0]
OK
Time taken: 1.048 seconds

Select the values in the Hive table.

hive> select * from student;


OK
101 'JAVACHAIN' 3RD 'USA usa
102 'ANTO' 10TH 'ENGLAND' usa
103 'PRABU' 2ND 'INDIA' usa
104 'KUMAR' 4TH 'USA' usa
105 'jack' 2ND 'INDIA' usa
Time taken: 0.553 seconds, Fetched: 5 row(s)
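The same Hive table can also be queried from Spark once Hive support is enabled in the session. The snippet below is a minimal sketch; it assumes Spark 2.x or later with Hive support and a reachable Hive metastore, which may need extra configuration on a Cloudera setup.

from pyspark.sql import SparkSession

# Assumes Spark was built with Hive support and can reach the Hive metastore.
spark = (SparkSession.builder
         .appName("HiveStudentQuery")
         .enableHiveSupport()
         .getOrCreate())

# Same data as the Hive shell example above, issued through Spark SQL.
spark.sql("SELECT * FROM student WHERE country = 'usa'").show()
spark.stop()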

JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. A plain JOIN in HiveQL behaves like an INNER JOIN in SQL: only rows that satisfy the join condition appear in the result. A JOIN condition is usually expressed using the primary keys and foreign keys of the tables.
The following query executes a JOIN on the CUSTOMERS and ORDERS tables and retrieves the records:

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT


FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

LEFT OUTER JOIN


The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no
matches in the right table. This means, if the ON clause matches 0 (zero) records in the right table,
the JOIN still returns a row in the result, but with NULL in each column from the right table.
A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

RIGHT OUTER JOIN


The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no
matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN still
returns a row in the result, but with NULL in each column from the left table.

A RIGHT JOIN returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and ORDER
tables.
notranslate"> hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:

FULL OUTER JOIN


The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer tables
that fulfil the JOIN condition. The joined table contains either all the records from both the tables,
or fills in NULL values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between CUSTOMER and ORDER
tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:


Conclusion:

Quiz:

1. Differentiate between Pig and Hive.


2. What are the components used in Hive Query Processor?
3. What are collection data types in Hive?
4. What is a Hive variable? What for we use it?
5. What do you mean by schema on read?

Suggested Reference:

 https://www.geeksforgeeks.org/introduction-to-hadoop-distributed-file-systemhdfs/
 https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
 https://www.ibm.com/topics/hdfs
 https://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm
 https://www.alluxio.io/learn/hdfs/

References used by the students:

Rubric wise marks obtained:

Rubrics (marks 1 to 5; criteria are cumulative):
1 - Complete implementation as asked
2 - Complete implementation as asked; Correct Result
3 - Complete implementation as asked; Correct Result; Conclusions
4 - Complete implementation as asked; Correct Result; Conclusions; References
5 - Complete implementation as asked; Correct Result; Conclusions; References; Correct answer to all questions
Total:


Experiment No: 09

Implementation of Matrix algorithms in Spark Sql programming

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Hadoop/MapReduce, Spark sql

Relevant CO: 3

Objective:

1. To implement matrix algorithm in Spark.

Theory:

Spark SQL

Many data scientists, analysts, and general business intelligence users rely on interactive SQL
queries for exploring data. Spark SQL is a Spark module for structured data processing. It
provides a programming abstraction called DataFrames and can also act as a distributed SQL
query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing
deployments and data. It also provides powerful integration with the rest of the Spark ecosystem
(e.g., integrating SQL query processing with machine learning).

What is Apache Spark SQL?

Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark's distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics, all within a single application. Concretely, Spark SQL allows developers to:

 Import relational data from Parquet files and Hive tables


 Run SQL queries over imported data and existing RDDs
 Easily write RDDs out to Hive tables or Parquet files

Spark SQL also includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance, without having to worry about using a different engine for historical data.

Why is Spark SQL used?

Spark SQL originated as Shark, an effort to run Apache Hive on top of Spark, and is now integrated with the Spark stack. Apache Hive had certain limitations, as mentioned below. Spark SQL was built to overcome these drawbacks and replace Apache Hive.

Is Spark SQL faster than Hive?

Spark SQL is faster than Hive when it comes to processing speed. Below I have listed down a few
limitations of Hive over Spark SQL.

Limitations With Hive:

 Hive launches MapReduce jobs internally for executing ad-hoc queries, and MapReduce lags in performance when it comes to analyzing medium-sized datasets (10 to 200 GB).
 Hive has no resume capability: if processing dies in the middle of a workflow, you cannot resume from where it got stuck.
 Hive cannot drop encrypted databases in cascade when trash is enabled, which leads to an execution error. To overcome this, users have to use the purge option to skip trash instead of a plain drop.

Procedure:

Matrix_multiply.py

from pyspark import SparkConf, SparkContext
import sys, operator


def add_tuples(a, b):
    # Element-wise sum of two equal-length rows
    return list(sum(p) for p in zip(a, b))


def permutation(row):
    # Flattened outer product of the row with itself
    rowPermutation = []
    for element in row:
        for e in range(len(row)):
            rowPermutation.append(float(element) * float(row[e]))
    return rowPermutation


def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    # Each input line is one row of the matrix with space-separated values
    row = sc.textFile(input).map(lambda line: line.split(' ')).cache()
    ncol = len(row.take(1)[0])

    # Sum the per-row outer products to obtain A^T * A as a flat list
    intermediateResult = row.map(permutation).reduce(add_tuples)

    # Reshape the flat list into an ncol x ncol matrix and write it out
    result = [intermediateResult[x:x + ncol] for x in range(0, len(intermediateResult), ncol)]
    outputFile = open(output, 'w')
    for row in result:
        for element in row:
            outputFile.write(str(element) + ' ')
        outputFile.write('\n')
    outputFile.close()

    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)


if __name__ == "__main__":
    main()
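
The script above can be submitted with spark-submit; the input directory and output file name below are only placeholders:

$ spark-submit matrix_multiply.py /user/hadoop/matrix_input output.txt

Each line of the input is treated as one space-separated row of the matrix A, and the program writes the product A^T * A (the transpose of A multiplied by A) to the output file.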

matrix_multiply_sparse.py

from pyspark import SparkConf, SparkContext
import sys, operator
from scipy.sparse import csr_matrix


def createCSRMatrix(input):
    # Each token of the input line is "columnIndex:value"; build a 1 x 100 sparse row vector
    row = []
    col = []
    data = []
    for values in input:
        value = values.split(':')
        row.append(0)
        col.append(int(value[0]))
        data.append(float(value[1]))
    return csr_matrix((data, (row, col)), shape=(1, 100))


def multiplyMatrix(csrMatrix):
    # Outer product of the sparse row with itself
    csrTranspose = csrMatrix.transpose(copy=True)
    return csrTranspose * csrMatrix


def formatOutput(indexValuePairs):
    return ' '.join(map(lambda pair: str(pair[0]) + ':' + str(pair[1]), indexValuePairs))


def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Sparse Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'

    # Sum the per-row outer products to obtain A^T * A as a sparse matrix
    sparseMatrix = sc.textFile(input).map(lambda line: line.split(' ')) \
        .map(createCSRMatrix).map(multiplyMatrix).reduce(operator.add)

    # Walk the CSR structure row by row and write "column:value" pairs
    outputFile = open(output, 'w')
    for row in range(len(sparseMatrix.indptr) - 1):
        col = sparseMatrix.indices[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        data = sparseMatrix.data[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        indexValuePairs = zip(col, data)
        formattedOutput = formatOutput(indexValuePairs)
        outputFile.write(formattedOutput + '\n')
    outputFile.close()


if __name__ == "__main__":
    main()
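
Since the experiment targets Spark SQL, the same transpose-times-matrix product can also be expressed through DataFrames and a SQL join. The sketch below is an illustrative alternative, not part of the original procedure; the inline sample matrix and view name are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MatrixMultiplySQL").getOrCreate()

# Represent matrix A as (row, column, value) triples; a tiny inline sample matrix
a = spark.createDataFrame(
    [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)],
    ["i", "j", "v"])
a.createOrReplaceTempView("A")

# (A^T * A)[j1, j2] = sum over rows i of A[i, j1] * A[i, j2]
result = spark.sql("""
    SELECT x.j AS row, y.j AS col, SUM(x.v * y.v) AS value
    FROM A x JOIN A y ON x.i = y.i
    GROUP BY x.j, y.j
""")
result.show()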

Conclusion:

Quiz:

1. Can we run Apache Spark without Hadoop?


2. What are the benefits of Spark over MapReduce?
3. What are the drawbacks of Apache Spark?
4. Define Spark-SQL.
5. What is Map() operation in Apache Spark?


Suggested Reference:

 https://www.edureka.co/blog/spark-sql-tutorial/
 https://intellipaat.com/blog/what-is-spark-sql/
 https://sparkbyexamples.com/spark/spark-sql-explained/
 https://spark.apache.org/sql/
 https://www.databricks.com/glossary/what-is-spark-sql

References used by the students:

Rubric wise marks obtained:

Rubrics      1      2      3      4      5      Total
Marks

Criteria for each rubric column:
1 - Complete implementation as asked
2 - Complete implementation as asked, Correct result
3 - Complete implementation as asked, Correct result, Conclusions
4 - Complete implementation as asked, Correct result, Conclusions, References
5 - Complete implementation as asked, Correct result, Conclusions, References, Correct answer to all questions


Experiment No: 10

Create a Data Pipeline Based on Messaging Using PySpark and Hive - Covid-19 Analysis

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skills, Hadoop/MapReduce, Spark SQL

Relevant CO: 3

Objective:

1. To analyze data using Spark and Hive with a sample dataset.

Theory:

Building a data pipeline for Covid-19 data analysis using Big Data technologies and Tableau
• The purpose is to collect real-time streaming data from the COVID-19 open API every 5 minutes into the ecosystem using NiFi, then process it and store it in a data lake on AWS.
• Data processing includes parsing the data from complex JSON format to CSV format and then publishing it to Kafka for persistent delivery of messages into PySpark for further processing (a sketch of this stage is given after this list).
• The processed data is then fed into an output Kafka topic, which is in turn consumed by NiFi and stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the process is orchestrated using Airflow to run at every time interval. Finally, KPIs are visualized in Tableau.
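
A hedged sketch of the PySpark stage of such a pipeline is given below, using Structured Streaming to consume the input Kafka topic and parse the JSON payload. The topic name, schema fields, broker address, and HDFS paths are assumptions; for brevity the parsed stream is written straight to HDFS here, whereas the architecture above publishes it to an output Kafka topic that NiFi lands in HDFS. The Kafka source also requires the spark-sql-kafka connector package on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("Covid19Pipeline").getOrCreate()

# Assumed schema of the incoming Covid-19 records
schema = StructType([
    StructField("state", StringType()),
    StructField("confirmed", IntegerType()),
    StructField("recovered", IntegerType()),
    StructField("deaths", IntegerType()),
])

# Read the raw messages that NiFi publishes to the input Kafka topic
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid19_raw")
       .load())

# Kafka values are bytes; cast to string and parse the JSON payload into columns
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))

# Land the processed records in HDFS; a Hive external table can point at this path
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/covid19/processed")
         .option("checkpointLocation", "hdfs:///checkpoints/covid19")
         .start())

query.awaitTermination()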


Data Architecture

Procedure:

The setup

We will use Flume to fetch the tweets and enqueue them on Kafka, and Flume again to dequeue the data; hence Flume acts both as a Kafka producer and consumer, while Kafka is used as a channel to hold the data. This approach is also informally known as "flafka". We will use the Flume agent provided by Cloudera to fetch the tweets from the Twitter API. This data is stored on Kafka as a channel and consumed using a Flume agent with a Spark sink. Spark Streaming reads the polling stream from the custom sink created by Flume. The Spark Streaming app parses the data as Flume events, separating the headers from the tweets in JSON format. Once Spark has parsed the Flume events, the data is stored on HDFS, typically in a Hive warehouse. We can then create an external table in Hive using a Hive SerDe to analyze this data in Hive. The data flow can be seen as follows:


Docker

All of the services mentioned above will run in Docker container instances. We will run three Docker instances; more details on that later. If you don't have Docker available on your machine, please go through the Installation section; otherwise just skip to launching the required Docker instances.

Installing Docker

If you do not have Docker, you first need to install it on your system. You will find detailed instructions on installing Docker at https://docs.docker.com/engine/installation/

Once docker is installed properly you can verify it by running a command as follows:

$ docker run hello-world

Unable to find image 'hello-world:latest' locally

Launching the required docker container instances

We will be launching three docker instances, namely kafka, flume and spark. Please note that the names Kafka, Spark and Flume are all separate docker instances of "cloudera/quickstart" – https://github.com/caioquirino/docker-cloudera-quickstart. The Kafka container instance, as suggested by its name, will be running an instance of the Kafka distributed message queue server along with an instance of the Zookeeper service. We can use this container instance to create a topic, start producers and start consumers – which will be explained later. The Flume and Spark container instances will be used to run our Flume agent and Spark streaming application respectively.
The following figure shows the running containers:


You can launch the docker instance for kafka as follows:
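
The screenshot of the command is not reproduced here; a plausible form of it, assuming the cloudera/quickstart image referenced above and placeholder values for the two environment variables, would be:

$ docker run -d --name kafka -e KAFKA=<kafka-host:port> -e ZOOKEEPER=<zookeeper-host:port> cloudera/quickstart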

Notice that we are running the container by giving it the name (--name) kafka and providing two environment variables, namely KAFKA and ZOOKEEPER. The name of this container will be used later to link it with the flume container instance.

Tools used:

1. NiFi - nifi-1.10.0
2. Hadoop - hadoop_2.7.3
3. Hive - apache-hive-2.1.0
4. Spark - spark-2.4.5
5. Zookeeper - zookeeper-2.3.5
6. Kafka - kafka_2.11-2.4.0
7. Airflow - airflow-1.8.1
8. Tableau

Conclusion:

Quiz:

1. What is PySpark?
2. What are the characteristics of PySpark?
3. What do you mean by 'joins' in PySpark DataFrame?

4. What are the different ways to handle row duplication in a PySpark DataFrame?
5. What are RDDs in PySpark?


Suggested Reference:
 https://www.projectpro.io/project-use-case/pyspark-etl-project-to-build-a-data-pipeline-
using-hive-and-cassandra
 https://www.linkedin.com/pulse/creating-data-pipeline-using-flume-kafka-spark-hive-
mouzzam-hussain/
 https://towardsdatascience.com/create-your-first-etl-pipeline-in-apache-spark-and-python-
ec3d12e2c169
 https://hevodata.com/learn/spark-data-pipeline/

References used by the students:

Rubric wise marks obtained:

Rubrics      1      2      3      4      5      Total
Marks

Criteria for each rubric column:
1 - Complete implementation as asked
2 - Complete implementation as asked, Correct result
3 - Complete implementation as asked, Correct result, Conclusions
4 - Complete implementation as asked, Correct result, Conclusions, References
5 - Complete implementation as asked, Correct result, Conclusions, References, Correct answer to all questions


Experiment No: 11

Explore NoSQL database like MongoDB and perform basic CRUD operation

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skills, Hadoop/MapReduce, MongoDB

Relevant CO: 3

Objective:

1. To perform basic operations of MongoDB and learn its installation.

Theory:

MongoDB
MongoDB is an open-source document-oriented database that is designed to store large volumes of data and allows you to work with that data very efficiently. It is categorized as a NoSQL (Not Only SQL) database because the storage and retrieval of data in MongoDB are not in the form of tables.

The MongoDB database is developed and managed by MongoDB Inc. under the SSPL (Server Side Public License) and was initially released in February 2009. It provides official driver support for popular languages such as C, C++, C#/.NET, Go, Java, Node.js, Perl, PHP, Python, Ruby, Scala, and Swift (including drivers such as Motor for Python and Mongoid for Ruby), so you can create an application using any of these languages. Many companies, such as Facebook, Nokia, eBay, Adobe, and Google, use MongoDB to store their large amounts of data.

Features of MongoDB –

 Schema-less Database: This is a great feature provided by MongoDB. A schema-less database means one collection can hold different types of documents. In other words, in the MongoDB database a single collection can hold multiple documents, and these documents may consist of different numbers of fields, content, and size. It is not necessary that one document be similar to another, as in relational databases. This gives MongoDB great flexibility.
 Document Oriented: In MongoDB, all data is stored in documents instead of tables as in an RDBMS. In these documents, data is stored in fields (key-value pairs) instead of rows and columns, which makes the data much more flexible than in an RDBMS. Each document contains its own unique object id.
 Indexing: In a MongoDB database, every field in the documents can be indexed with primary and secondary indices, which makes searching data in the pool of data easier and faster. If the data is not indexed, the database must scan each document against the specified query, which takes a lot of time and is not efficient.
 Scalability: MongoDB provides horizontal scalability with the help of sharding. Sharding means distributing data across multiple servers: a large amount of data is partitioned into chunks using the shard key, and these chunks are evenly distributed across shards that reside on many physical servers. New machines can also be added to a running database.
 Replication: MongoDB provides high availability and redundancy with the help of replication. It creates multiple copies of the data and sends these copies to different servers, so that if one server fails the data can be retrieved from another.
 Aggregation: It allows performing operations on grouped data to get a single or computed result, similar to the SQL GROUP BY clause. Three kinds of aggregation are provided: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods (a short PyMongo sketch of an aggregation pipeline follows this list).
 High Performance: MongoDB offers high performance and data persistence compared to other databases because of features such as scalability, indexing, and replication.
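
As an illustration of the aggregation pipeline mentioned in the list above, here is a short sketch using the PyMongo driver; the database, collection, and field names are assumptions.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["school"]

# Count students per course year using an aggregation pipeline
pipeline = [
    {"$group": {"_id": "$year", "students": {"$sum": 1}}},
    {"$sort": {"students": -1}},
]
for doc in db.student.aggregate(pipeline):
    print(doc)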

Advantages of MongoDB:

 It is a schema-less NoSQL database; you do not need to design the schema of the database when working with MongoDB.
 It does not rely on complex join operations.
 It provides great flexibility for the fields in the documents.
 It can hold heterogeneous data.
 It provides high performance, availability, and scalability.
 It supports geospatial queries efficiently.
 It is a document-oriented database and the data is stored in BSON documents.
 It supports multi-document ACID transactions (starting from MongoDB 4.0).
 It is not susceptible to SQL injection.
 It is easily integrated with Big Data Hadoop.


Disadvantages of MongoDB:

 It uses a high amount of memory for data storage.
 You are not allowed to store more than 16 MB of data in a single document.
 Nesting of data in BSON is also limited: you are not allowed to nest data more than 100 levels deep.

MongoDB CRUD operations

We can use MongoDB for various purposes such as building an application (web or mobile), analyzing data, or administering a MongoDB database. In all these cases we need to interact with the MongoDB server to perform operations like entering new data into the application, updating data, deleting data, and reading data. MongoDB provides a set of basic but essential operations that help you interact with the MongoDB server easily; these operations are known as CRUD operations.

Procedure:

Create Operations –

The create or insert operations are used to insert or add new documents in the collection. If a collection does not exist, a new collection is created in the database. You can perform create operations using the following methods provided by MongoDB:

Method Description
db.collection.insertOne() It is used to insert a single document in the collection.
db.collection.insertMany() It is used to insert multiple documents in the collection.
db.createCollection() It is used to create an empty collection.


Example 1: In this example, we are inserting details of a single student in the form of document
in the student collection using db.collection.insertOne() method.

Example 2: In this example, we are inserting details of the multiple students in the form of
documents in the student collection using db.collection.insertMany() method.
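
The shell screenshots are not reproduced here; an equivalent sketch using the PyMongo driver (the student fields and values are assumptions) would be:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["school"]

# insertOne equivalent: add a single student document
db.student.insert_one({"name": "Sumit", "age": 20, "course": "BE CSE", "year": 3})

# insertMany equivalent: add several student documents at once
db.student.insert_many([
    {"name": "Riya", "age": 21, "course": "BE CSE", "year": 3},
    {"name": "Aman", "age": 22, "course": "BE IT", "year": 4},
])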


Read Operations –

The read operations are used to retrieve documents from the collection; in other words, read operations are used to query a collection for documents. You can perform read operations using the following method provided by MongoDB:

Method Description
db.collection.find() It is used to retrieve documents from the collection.

Example: In this example, we are retrieving the details of students from the student collection using the db.collection.find() method.
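
An equivalent PyMongo sketch of the read operation (the filter in the second query is an assumption):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["school"]

# find() equivalent: retrieve all student documents
for doc in db.student.find():
    print(doc)

# A filtered query, e.g. students of a particular year
for doc in db.student.find({"year": 3}):
    print(doc)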

Update Operations –

The update operations are used to update or modify existing documents in the collection. You can perform update operations using the following methods provided by MongoDB:

Method                        Description
db.collection.updateOne()     Updates a single document in the collection that satisfies the given criteria.
db.collection.updateMany()    Updates multiple documents in the collection that satisfy the given criteria.
db.collection.replaceOne()    Replaces a single document in the collection that satisfies the given criteria.


Example 1: In this example, we are updating the age of Sumit in the student collection using the db.collection.updateOne() method.

Example 2: In this example, we are updating the year of course in all the documents in the student collection using the db.collection.updateMany() method.
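
An equivalent PyMongo sketch of the two update examples (the field values are assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["school"]

# updateOne equivalent: change the age of Sumit (Example 1)
db.student.update_one({"name": "Sumit"}, {"$set": {"age": 21}})

# updateMany equivalent: update the year of course in all documents (Example 2)
db.student.update_many({}, {"$set": {"year": 4}})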


Delete Operations –

The delete operations are used to delete or remove documents from a collection. You can perform delete operations using the following methods provided by MongoDB:

Method                        Description
db.collection.deleteOne()     Deletes a single document from the collection that satisfies the given criteria.
db.collection.deleteMany()    Deletes multiple documents from the collection that satisfy the given criteria.

Example 1: In this example, we are deleting a document from the student collection using
db.collection.deleteOne() method.


Example 2: In this example, we are deleting all the documents from the student collection using
db.collection.deleteMany() method.
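
An equivalent PyMongo sketch of the two delete examples (the filter in the first call is an assumption):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["school"]

# deleteOne equivalent: remove a single matching document (Example 1)
db.student.delete_one({"name": "Sumit"})

# deleteMany equivalent: remove all documents from the student collection (Example 2)
db.student.delete_many({})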

Conclusion:

Quiz:

1. What makes MongoDB the best?


2. What is the syntax of the skip() method?
3. When do we use a namespace in MongoDB?
4. What are the data types of MongoDB?
5. Explain the situation when an index does not fit into RAM.
6. Does MongoDB support foreign key constraints?
7. Explain the structure of ObjectID in MongoDB.


Suggested Reference:

 https://www.techtarget.com/searchdatamanagement/definition/MongoDB
 https://www.tutorialspoint.com/mongodb/index.htm
 https://www.w3schools.com/mongodb/
 https://en.wikipedia.org/wiki/MongoDB
 https://www.mongodb.com/

References used by the students:

Rubric wise marks obtained:

Rubrics      1      2      3      4      5      Total
Marks

Criteria for each rubric column:
1 - Complete implementation as asked
2 - Complete implementation as asked, Correct result
3 - Complete implementation as asked, Correct result, Conclusions
4 - Complete implementation as asked, Correct result, Conclusions, References
5 - Complete implementation as asked, Correct result, Conclusions, References, Correct answer to all questions


Experiment No: 12

Case study based on the concept of Big Data Analytics. Prepare presentation in
the group of 4. Submit PPT.

Date:

Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Hadoop/MapReduce

Relevant CO: 1

Objective:

1. To explore a new technology and work collaboratively in a group.

Theory:


Conclusion:

Suggested Reference:
 https://www.simplilearn.com/what-is-big-data-analytics-article
 https://www.coursera.org/articles/big-data-analytics
 https://www.ibm.com/analytics/big-data-analytics
 https://www.techtarget.com/searchbusinessanalytics/definition/big-data-analytics
 https://www.tableau.com/learn/articles/big-data-analytics


References used by the students:

Rubric wise marks obtained:

Rubrics      1      2      3      4      5      Total
Marks

Criteria for each rubric column:
1 - Complete implementation as asked
2 - Complete implementation as asked, Analysis of the issues
3 - Complete implementation as asked, Analysis of the issues, Suggestions on appropriate solutions
4 - Complete implementation as asked, Analysis of the issues, Suggestions on appropriate solutions, Conclusions
5 - Complete implementation as asked, Analysis of the issues, Suggestions on appropriate solutions, Conclusions, Presentation skills
