BDA - Lab Manual
B.E. Semester 7
(CSE)
Certificate
Place:
Date:
Preface
The main aim of any laboratory, practical, or field work is to enhance the required skills and to develop students' ability to solve real-time problems by building relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for engineering degree programmes in which sufficient weightage is given to practical work. This reflects the importance of developing skills amongst students, and it encourages students, instructors, and faculty members to utilise every second of the time allotted for practicals to achieve the relevant outcomes by actually performing the experiments, rather than treating them as merely study-type exercises. For effective implementation of a competency-focused, outcome-based curriculum, every practical must be carefully designed to serve as a tool for developing and enhancing the industry-relevant competencies required of every student. These psychomotor skills are very difficult to develop through the traditional chalk-and-board method of content delivery in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than on the old practice of conducting practicals merely to prove concepts and theory.
By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea beforehand; this in turn enhances the attainment of the pre-determined outcomes. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes, and practical outcomes (objectives). Students will also learn the safety measures and necessary precautions to be taken while performing the practical.
This manual also provides guidelines to faculty members for facilitating student-centric lab activities in each experiment by arranging and managing the necessary resources, so that students follow the procedures with the required safety and precautions to achieve the outcomes. It also gives an idea of how students will be assessed, by providing rubrics.
Big Data Analytics is a course which deals with the storage, processing, and analysis of large-scale data. It provides a platform for students to work with technologies such as Hadoop, MapReduce, Spark, Hive, and NoSQL databases, and to apply them to real-world data sets. Students also learn to build data pipelines and to draw actionable insights from big data.
Utmost care has been taken while preparing this lab manual; however, there is always scope for improvement. We therefore welcome constructive suggestions for improvement and for the removal of errors, if any.
Big Data Analytics (3170722)
The following industry relevant competencies are expected to be developed in the student by
undertaking the practical work of this laboratory.
1. Apply knowledge of Big Data Analytics to solve real world problems
2. Big data analysts detect and analyze actionable data, such as hidden trends and
patterns. By fusing these findings with their in-depth knowledge of the market in
which their organizations operate, they can help leaders formulate informed strategic
business decisions.
Index
(Progressive Assessment
Sheet)
Relevant CO: 2
Pre-requisites: Basic understanding of Core Java and SQL.
Objectives:
Theory:
Hadoop is an open-source framework that allows you to store and process big data in a distributed
environment across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage.
Hadoop is one of the top platforms for business data processing and analysis, and the following are its significant benefits:
1. Scalable: Businesses can process and get actionable insights from petabytes of data.
2. Flexible: To get access to multiple data sources and data types.
3. Agile: Parallel processing and minimal movement of data allow substantial amounts of data to be processed with speed.
4. Adaptable: To support a variety of coding languages, including Python, Java, and C++.
Hadoop is a collection of multiple tools and frameworks to manage, store, process, and analyze big data effectively. HDFS acts as a distributed file system to store large datasets across commodity hardware. YARN is the Hadoop resource manager, which handles a cluster of nodes and allocates CPU, memory, and other resources according to application requirements.
220223131020
Procedure:
javac -version
For Hadoop configuration we need to modify the six items listed below:
1. core-site.xml
2. mapred-site.xml
3. hdfs-site.xml
4. yarn-site.xml
5. hadoop-env.cmd
6. Create two folders, datanode and namenode
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
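The remaining steps were shown as screenshots in the original manual and are not reproduced here. On a Windows installation laid out as above (C:\hadoop-2.8.0), the usual sequence to initialise and start the cluster is roughly the following; the exact commands are an assumption based on the standard Hadoop-on-Windows setup:

hdfs namenode -format       (one-time formatting of the NameNode metadata directory)
start-dfs.cmd               (starts the NameNode and DataNode daemons)
start-yarn.cmd              (starts the ResourceManager and NodeManager daemons)
jps                         (lists the running Java processes to confirm that the daemons are up)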
Step 10:
Open the NameNode web UI at http://localhost:50070
Conclusion:
Hadoop is a widely used Big Data technology for storing, processing, and analyzing large datasets. In this practical we installed and configured Hadoop and verified the setup through the NameNode web UI at http://localhost:50070.
Quiz:
Suggested Reference:
https://www.simplilearn.com/tutorials/hadoop-tutorial
https://olympus.mygreatlearning.com/courses/10977/pages/hadoop-framework-stepping-into-hadoop?module_item_id=449010
https://olympus.mygreatlearning.com/courses/10977/pages/hdfs-what-and-why?module_item_id=449011
https://hadoop.apache.org/
1. https://intellipaat.com/blog/tutorial/hadoop-tutorial/hdfs-operations/
2. http://hadooptutorial.info/formula-to-calculate-hdfs-nodes-storage/
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 02
Run Word count program in Hadoop with 250 MB size of Data Set
Date:
Relevant CO: 2
Pre-requisites:
Java Installation - Check whether Java is installed using the command shown below.
Hadoop Installation - Check whether Hadoop is installed using the command shown below.
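The commands themselves are not reproduced in the original; assuming Java and Hadoop are on the PATH, the usual checks are:

javac -version       (prints the installed Java compiler version)
hadoop version       (prints the installed Hadoop version)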
Objectives:
Theory:
MapReduce programming offers several benefits to help you gain valuable insights from your big
data:
1. Scalability. Businesses can process petabytes of data stored in the Hadoop Distributed File
System (HDFS).
2. Flexibility. Hadoop enables easier access to multiple sources of data and multiple types of
data.
3. Speed. With parallel processing and minimal data movement, Hadoop offers fast
processing of massive amounts of data.
4. Simple. Developers can write code in a choice of languages, including Java, C++ and
Python.
Usage of MapReduce
It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
It can be used for distributed pattern-based searching.
We can also use MapReduce in machine learning.
It was used by Google to regenerate Google's index of the World Wide Web.
It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environment.
Procedure:
Create a text file in your local machine and write some text into it.
$ nano data.txt
We will find out the frequency of each word that exists in this text file.
File: WC_Mapper.java
package com.javatpoint;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Tokenizes each input line and emits (word, 1) for every token.
    // (The tail of the original listing was cut off at a page break; the loop below completes it.)
    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
File: WC_Reducer.java
package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    // Sums the counts emitted by the mapper for each word and emits (word, total).
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
File: WC_Runner.java
package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0] = input path in HDFS, args[1] = output directory (must not already exist)
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
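The compile-and-run steps were shown as screenshots in the original manual; a typical sequence is sketched below. The class names come from the listings above, while the jar name and HDFS paths are illustrative assumptions:

javac -classpath "$(hadoop classpath)" -d classes WC_Mapper.java WC_Reducer.java WC_Runner.java
jar cf wordcount.jar -C classes .
hdfs dfs -mkdir -p /input
hdfs dfs -put data.txt /input
hadoop jar wordcount.jar com.javatpoint.WC_Runner /input/data.txt /output
hdfs dfs -cat /output/part-00000      (prints each word with its frequency)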
Conclusion:
Quiz:
1. What is MapReduce?
Suggested Reference:
https://www.udemy.com/topic/mapreduce/
https://www.simplilearn.com/tutorials/hadoop-tutorial/mapreduce
https://www.javatpoint.com/mapreduce
https://www.udacity.com/course/intro-to-hadoop-and-mapreduce--ud617
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 03
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill
Relevant CO: 2
Objective:
Theory:
Log files are an essential troubleshooting tool during testing and production and contain important
runtime information about the general health of workload daemons and system services. IBM®
Spectrum Symphony uses the log4j logging framework for MapReduce logging.
Log files and location
The MapReduce framework in IBM Spectrum Symphony provides the following log files, which
are located under the $PMR_HOME/logs/ directory. The following table describes the log files
and lists their location.
Procedure:
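The original procedure was given as screenshots and is not reproduced here. As an illustration only, a minimal mapper for counting log levels (INFO/WARN/ERROR) is sketched below, using the same org.apache.hadoop.mapred API as the word count program; the package and class names are hypothetical, and the existing WC_Reducer can be reused to sum the counts.

package com.example;                       // hypothetical package name

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (log level, 1) for every INFO/WARN/ERROR token found in a log line.
public class LogLevelMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text level = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (token.equals("INFO") || token.equals("WARN") || token.equals("ERROR")) {
                level.set(token);
                output.collect(level, ONE);
            }
        }
    }
}

The job can then be configured exactly like WC_Runner, substituting LogLevelMapper for WC_Mapper and pointing the input path at the log file stored in HDFS.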
Conclusion:
Quiz:
Suggested Reference:
http://hadooptutorial.info/log-analysis-hadoop/
https://www.edureka.co/blog/mapreduce-tutorial/
https://www.researchgate.net/publication/281768670_Driving_Big_Data_with_Hadoop_Technologies/figures?lo=1
https://stackoverflow.com/questions/23689653/how-to-process-a-log-file-using-mapreduce
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 04
Run two different Data sets/Different size of Datasets on Hadoop and Compare
the Logs
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill
Relevant CO: 2
Objective:
1. To experiment with datasets of varying sizes and compare the logs.
Theory:
The MapReduce join operation is used to combine two large datasets. However, this process involves writing a lot of code to perform the actual join operation. Joining two datasets begins by comparing the size of each dataset; if one dataset is smaller than the other, the smaller dataset is distributed to every data node in the cluster.
Types of Join
Depending upon the place where the actual join is performed, joins in Hadoop are classified into:
1. Map-side join – When the join is performed by the mapper, it is called a map-side join. In this type, the join is performed before the data is actually consumed by the map function. It is mandatory that the input to each map is in the form of a partition and is in sorted order; there must also be an equal number of partitions, sorted by the join key.
2. Reduce-side join – When the join is performed by the reducer, it is called a reduce-side join. In this join there is no necessity for the dataset to be in a structured (or partitioned) form.
Procedure:
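The run-by-run screenshots from the original are not reproduced here. One way to repeat the comparison is sketched below; the jar, class and path names follow the word count experiment and are illustrative:

hdfs dfs -put dataset_small.txt /input_small
hdfs dfs -put dataset_large.txt /input_large
hadoop jar wordcount.jar com.javatpoint.WC_Runner /input_small /output_small
hadoop jar wordcount.jar com.javatpoint.WC_Runner /input_large /output_large
yarn logs -applicationId <application_id>     (fetch the logs of each run and compare the elapsed times and counters)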
From the logs we can see that the 920 MB dataset takes around 2 minutes to run the word count job.
From the logs we can see that the 3.65 GB dataset takes around 5.5 minutes to run the word count job.
Conclusion:
Running the same MapReduce job on two datasets of different sizes shows that the time taken to run the job increases as the dataset gets bigger.
Quiz:
Suggested Reference:
https://www.integrate.io/blog/process-small-data-with-hadoop/
https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/
https://www.simplilearn.com/tutorials/hadoop-tutorial/hive
https://www.simplilearn.com/tutorials/hadoop-tutorial/what-is-hadoop
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 05
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill
Relevant CO: 2
Objective:
Theory:
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System are running on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where data is
already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and an MRAppMaster per application.
MapReduce can be thought of as a ubiquitous sorting tool, since by design it sorts all the map output records (using the map output keys), so that all the records that reach a single reducer are sorted. The diagram below shows the internals of how the shuffle phase works in MapReduce.
Procedure:
Given that MapReduce already performs sorting between the map and reduce phases, sorting files can be accomplished with an identity function (one where the inputs to the map and reduce phases are emitted directly). This is in fact what the sort example bundled with Hadoop does. You can look at how the example code works by examining the org.apache.hadoop.examples.Sort class. To use this example code to sort text files in Hadoop, you would use it as follows:
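The command itself is not reproduced in the original; with the examples jar that ships with Hadoop, a run of the bundled sort over plain text would look roughly like this (the jar path and input/output locations are illustrative):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar sort \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outFormat org.apache.hadoop.mapred.TextOutputFormat \
    -outKey org.apache.hadoop.io.Text \
    -outValue org.apache.hadoop.io.Text \
    /input/300names.txt /output-sorted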
Hadoop example sort can be accomplished with the hadoop-utils sort as follows:
To bring sorting in MapReduce closer to the Linux sort, the --key and --field-separator options can be used to specify one or more columns that should be used for sorting, as well as a custom separator (whitespace is the default). For example, imagine you had a file in HDFS called /input/300names.txt which contained first and last names.
The syntax of --key is pos1[,pos2], where the first position (pos1) is required and the second position (pos2) is optional; if it's omitted, then pos1 through the rest of the line is used for sorting.
Conclusion:
Quiz:
1. What are the configuration parameters required to be specified in MapReduce?
2. What do you mean by a combiner?
3. What is meant by JobTracker?
4. What do you mean by InputFormat? What are the types of InputFormat in MapReduce?
Suggested Reference:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
https://www.geeksforgeeks.org/hadoop-reducer-in-map-reduce/
https://medium.com/edureka/mapreduce-tutorial-3d9535ddbe7c
https://www.ibm.com/docs/en/spectrum-symphony/7.3.1?topic=sampleapp-using-hash-based-grouping-data-aggregation
https://www.talend.com/resources/what-is-mapreduce/
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 06
Date:
Relevant CO: 2
Objective:
Theory
The definition of big data is data that contains greater variety, arriving in increasing volumes and
with more velocity. This is also known as the three Vs.
Put simply, big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can't manage them. But these massive volumes of data can be used to address business problems you wouldn't have been able to tackle before.
Two more Vs have emerged over the past few years: value and veracity. Data has intrinsic value. But it's of no use until that value is discovered. Equally important: How truthful is your data, and how much can you rely on it?
Today, big data has become capital. Think of some of the world's biggest tech companies. A large part of the value they offer comes from their data, which they're constantly analyzing to produce more efficiency and develop new products.
Recent technological breakthroughs have exponentially reduced the cost of data storage and compute, making it easier and less expensive to store more data than ever before. With an increased volume of big data now cheaper and more accessible, you can make more accurate and precise business decisions.
Finding value in big data isn't only about analyzing it (which is a whole other benefit). It's an entire discovery process that requires insightful analysts, business users, and executives who ask the right questions, recognize patterns, make informed assumptions, and predict behavior.
Procedure:
1. Yelp Dataset
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together
for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on
Yelp's data and share their discoveries. In the most recent dataset you'll find information about
businesses across 8 metropolitan areas in the USA and Canada.
Website: https://www.yelp.com/dataset
2. Kaggle
Website: https://www.kaggle.com/datasets
Kaggle offers a no-setup, customizable, Jupyter Notebooks environment. Access GPUs at no cost
to you and a huge repository of community published data & code.
Inside Kaggle you'll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time.
3. UCI Machine Learning Repository
The repository currently maintains 622 data sets as a service to the machine learning community. You may view all data sets through its searchable interface. For a general overview of the Repository, please visit its about page. For information about citing data sets in publications, please read its citation policy. If you wish to donate a data set, please consult its donation policy. For any other questions, feel free to contact the Repository librarians.
The UCI Machine Learning Repository is a collection of databases, domain theories, and data
generators that are used by the machine learning community for the empirical analysis of machine
learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow
graduate students at UC Irvine. Since that time, it has been widely used by students, educators,
and researchers all over the world as a primary source of machine learning data sets. As an
indication of the impact of the archive, it has been cited over 1000 times, making it one of the top
100 most cited "papers" in all of computer science. The current version of the web site was
designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration
with Rexa.info at the University of Massachusetts Amherst. Funding support from the National
Science Foundation is gratefully acknowledged.
Website: http://archive.ics.uci.edu/ml/index.php
Conclusion:
Quiz:
1. What is Big Data, and where does it come from? How does it work?
2. What are the 5 V's in Big Data?
3. How can big data analytics benefit business?
4. What are some of the challenges that come with a big data project?
5. What are the key steps in deploying a big data platform?
Suggested Reference:
https://archive.ics.uci.edu/ml/index.php
https://www.kaggle.com/
https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
https://www.oracle.com/in/big-data/what-is-big-data/
https://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 07
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Spark
Relevant CO: 3
Objective:
Theory:
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It
is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The main
feature of Spark is its in-memory cluster computing that increases the processing speed of an
application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark comes up with 80 high-
level operators for interactive querying.
Advanced Analytics − Spark not only supports 'Map' and 'Reduce'. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
The following diagram shows three ways of how Spark can be built with Hadoop components.
Standalone − Spark Standalone deployment means Spark occupies the place on top of
HDFS(Hadoop Distributed File System) and space is allocated for HDFS, explicitly. Here,
Spark and MapReduce will run side by side to cover all spark jobs on cluster.
Hadoop Yarn − Hadoop Yarn deployment means, simply, spark runs on Yarn without any
pre-installation or root access required. It helps to integrate Spark into Hadoop ecosystem
or Hadoop stack. It allows other components to run on top of stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch spark job in
addition to standalone deployment. With SIMR, user can start Spark and uses its shell
without any administrative access.
Components of Spark
Spark Core
Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics.
It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations
on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, thanks to the distributed, memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
Procedure:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

// Word count using the Spark Java API; the class name is illustrative.
// (The middle of the original listing was missing; the flatMap/mapToPair/reduceByKey steps reconstruct it.)
public class WordCounter {
    private static void wordCount(String fileName) {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("WordCounter");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        JavaRDD<String> inputFile = sparkContext.textFile(fileName);
        // Split each line into words, map every word to (word, 1) and sum the counts per word.
        JavaRDD<String> words = inputFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairRDD<String, Integer> countData = words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey((x, y) -> x + y);
        countData.saveAsTextFile("CountData");
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No files provided.");
            System.exit(0);
        }
        wordCount(args[0]);
    }
}
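To run the program, package it into a jar and submit it to Spark; a typical invocation (the jar name and input path are illustrative) is:

spark-submit --class WordCounter wordcounter.jar /input/data.txt
(the word counts are written to the CountData output directory, as specified in the code)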
Conclusion:
Quiz:
Suggested Reference:
https://spark.apache.org/
https://www.infoworld.com/article/3236869/what-is-apache-spark-the-big-data-platform-that-crushed-hadoop.html
https://aws.amazon.com/big-data/what-is-spark/
https://en.wikipedia.org/wiki/Apache_Spark
https://www.databricks.com/spark/about
https://www.tutorialspoint.com/apache_spark/apache_spark_introduction.htm
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 08
Creating the HDFS tables and loading them in Hive and learn joining of tables
in Hive
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Hadoop/MapReduce
Relevant CO: 5
Objective:
Theory:
HDFS is a distributed file system that handles large data sets running on commodity hardware. It
is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS
is one of the major components of Apache Hadoop, the others being MapReduce and YARN.
HDFS should not be confused with or replaced by Apache HBase, which is a column-oriented
non-relational database management system that sits on top of HDFS and can better support real-
time data needs with its in-memory processing engine.
Because one HDFS instance may consist of thousands of servers, failure of at least one server is
inevitable. HDFS has been built to detect faults and automatically recover quickly.
HDFS is intended more for batch processing versus interactive use, so the emphasis in the design
is for high data throughput rates, which accommodate streaming access to data sets.
HDFS accommodates applications that have data sets typically gigabytes to terabytes in size.
HDFS provides high aggregate data bandwidth and can scale to hundreds of nodes in a single
cluster.
Portability
To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to
be compatible with a variety of underlying operating systems.
HDFS supports a traditional hierarchical file organization. A user or an application can create
directories and store files inside these directories. The file system namespace hierarchy is similar
to most other existing file systems; one can create and remove files, move a file from one
directory to another, or rename a file. HDFS does not yet implement user quotas. HDFS does not
support hard links or soft links. However, the HDFS architecture does not preclude implementing
these features.
The NameNode maintains the file system namespace. Any change to the file system namespace or
its properties is recorded by the NameNode. An application can specify the number of replicas of
a file that should be maintained by HDFS. The number of copies of a file is called the replication
factor of that file. This information is stored by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores
each file as a sequence of blocks; all blocks in a file except the last block are the same size. The
blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file. The replication
factor can be specified at file creation time and can be changed later. Files in HDFS are write-
once and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a
Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.
1. NameNode(MasterNode):
o Manages all the slave nodes and assign work to them.
o It executes filesystem namespace operations like opening, closing, renaming files
and directories.
o It should be deployed on reliable, high-configuration hardware, not on commodity hardware.
2. DataNode(SlaveNode):
o Actual worker nodes, who do the actual work like reading, writing, processing etc.
o They also perform creation, deletion, and replication upon instruction from the
master.
o They can be deployed on commodity hardware.
Namenodes:
o Run on the master node.
o Store metadata (data about data) like file path, the number of blocks, block Ids. etc.
o Require high amount of RAM.
o Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a
persistent copy of it is kept on disk.
DataNodes:
o Run on slave nodes.
o Require high memory as data is actually stored here.
Procedure:
Move the text file from local file system into newly created folder called javachain
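The HDFS commands were shown as screenshots in the original; assuming the text file is called data.txt, the step amounts to something like:

hdfs dfs -mkdir javachain              (creates the folder under the user's HDFS home directory)
hdfs dfs -put data.txt javachain/      (copies the local text file into the new HDFS folder)
hdfs dfs -ls javachain                 (verifies the upload)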
JOIN
JOIN clause is used to combine and retrieve records from multiple tables. A plain JOIN in Hive behaves like an INNER JOIN in SQL. A JOIN condition is raised using the primary keys and foreign keys of the tables.
The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:
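The query itself is not reproduced in the original; a representative join (the column names are taken from the RIGHT OUTER JOIN example shown later in this section) is:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
      FROM CUSTOMERS c JOIN ORDERS o
      ON (c.ID = o.CUSTOMER_ID);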
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and ORDER
tables.
notranslate"> hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
Conclusion:
Quiz:
Suggested Reference:
https://www.geeksforgeeks.org/introduction-to-hadoop-distributed-file-systemhdfs/
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://www.ibm.com/topics/hdfs
https://www.tutorialspoint.com/hadoop/hadoop_hdfs_overview.htm
https://www.alluxio.io/learn/hdfs/
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 09
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Hadoop/MapReduce, Spark sql
Relevant CO: 3
Objective:
Theory:
Spark SQL
Many data scientists, analysts, and general business intelligence users rely on interactive SQL
queries for exploring data. Spark SQL is a Spark module for structured data processing. It
provides a programming abstraction called DataFrames and can also act as a distributed SQL
query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing
deployments and data. It also provides powerful integration with the rest of the Spark ecosystem
(e.g., integrating SQL query processing with machine learning).
Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored both in RDDs (Spark's distributed datasets) and in external sources. Spark SQL conveniently blurs the lines between RDDs and relational tables. Unifying these powerful abstractions makes it easy for developers to intermix SQL commands querying external data with complex analytics, all within a single application. Concretely, Spark SQL allows developers to:
Spark SQL also includes a cost-based optimizer, columnar storage, and code generation to make
queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the
Spark engine, which provides full mid-query fault tolerance, without having to worry about using
a different engine for historical data.
Spark SQL originated as an effort to run Apache Hive on top of Spark and is now integrated with the Spark stack. Apache Hive had certain limitations, as mentioned below; Spark SQL was built to overcome these drawbacks and replace Apache Hive.
Spark SQL is faster than Hive when it comes to processing speed. Below I have listed down a few
limitations of Hive over Spark SQL.
Hive launches MapReduce jobs internally for executing ad-hoc queries. MapReduce lags in performance when it comes to the analysis of medium-sized datasets (10 to 200 GB).
Hive has no resume capability. This means that if the processing dies in the middle of a
workflow, you cannot resume from where it got stuck.
Hive cannot drop encrypted databases in cascade when the trash is enabled and leads to an
execution error. To overcome this, users have to use the Purge option to skip trash instead
of drop.
Procedure:
matrix_multiply.py

import sys
from pyspark import SparkConf, SparkContext

def add_tuples(a, b):
    return list(sum(p) for p in zip(a, b))

def permutation(row):
    rowPermutation = []
    # ... (body elided in the original listing) ...
    return rowPermutation

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    # ... (RDD transformations elided in the original listing) ...
    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)
matrix_multiply_sparse.py

import sys
import operator
from pyspark import SparkConf, SparkContext
from scipy.sparse import csr_matrix

def createCSRMatrix(input):
    row = []
    col = []
    data = []
    # ... (parsing of the input row into row/col/data lists elided in the original listing) ...
    return csr_matrix((data, (row, col)), shape=(1, 100))

def multiplyMatrix(csrMatrix):
    csrTransponse = csrMatrix.transpose(copy=True)
    return (csrTransponse * csrMatrix)

def formatOutput(indexValuePairs):
    return ' '.join(map(lambda pair: str(pair[0]) + ':' + str(pair[1]), indexValuePairs))

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Sparse Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    sparseMatrix = sc.textFile(input).map(lambda row: row.split(' ')) \
                     .map(createCSRMatrix).map(multiplyMatrix).reduce(operator.add)
    outputFile = open(output, 'w')
    # ... (writing of the result and SparkContext shutdown elided in the original listing) ...
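Both scripts take an input path and an output path as command-line arguments; assuming a working Spark installation, a typical run (the paths are illustrative) is:

spark-submit matrix_multiply.py /input/matrix.txt /output/result
spark-submit matrix_multiply_sparse.py /input/matrix.txt result_sparse.txt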
Conclusion:
Quiz:
Suggested Reference:
https://www.edureka.co/blog/spark-sql-tutorial/
https://intellipaat.com/blog/what-is-spark-sql/
https://sparkbyexamples.com/spark/spark-sql-explained/
https://spark.apache.org/sql/
https://www.databricks.com/glossary/what-is-spark-sql
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 10
Create A Data Pipeline Based On Messaging Using PySpark And Hive -Covid-
19 Analysis
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill, Hadoop/MapReduce, Spark SQL
Relevant CO: 3
Objective:
Theory:
Building data pipeline for Covid-19 data analysis using BigData technologies and Tableau
• The purpose is to collect the real time streaming data from COVID19 open API for every
5 minutes into the ecosystem using NiFi and to process it and store it in the data lake on
AWS.
• Data processing includes parsing the data from complex JSON format to csv format then
publishing to Kafka for persistent delivery of messages into PySpark for further
processing.
• The processed data is then fed into an output Kafka topic, which is in turn consumed by NiFi and stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the process is orchestrated using Airflow to run at every time interval. Finally, KPIs are visualized in Tableau.
Data Architecture
Procedure:
The setup
We will use Flume to fetch the tweets and enqueue them on Kafka, and Flume again to dequeue the data; hence Flume will act as both a Kafka producer and a consumer, while Kafka is used as a channel to hold the data. This approach is also informally known as "flafka". We will use the Flume agent provided by Cloudera to fetch the tweets from the Twitter API. This data is stored on Kafka as a channel and consumed using a Flume agent with a Spark sink. Spark Streaming will read the polling stream from the custom sink created by Flume. The Spark streaming app will parse the data as Flume events, separating the headers from the tweets in JSON format. Once Spark has parsed the Flume events, the data is stored on HDFS, presumably in a Hive warehouse. We can then create an external table in Hive using a Hive SerDe to analyze this data in Hive. The data flow can be seen as follows:
Docker
All of the services mentioned above will be running on docker instances, also known as docker container instances. We will run three docker instances; more details on that later. If you don't have Docker available on your machine, please go through the Installation section; otherwise just skip to launching the required docker instances.
Installing Docker
If you do not have Docker, first of all you need to install it on your system. You will find detailed instructions on installing Docker at https://docs.docker.com/engine/installation/
Once docker is installed properly you can verify it by running a command as follows:
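The verification command was shown as a screenshot in the original; the usual checks are:

docker --version      (prints the installed Docker version)
docker ps             (lists the container instances that are currently running)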
We will be launching three docker instances namely kafka, flume and spark. Please note that the
names Kafka, Spark and Flume are all separate docker instances of "cloudera/quickstart" –
https://github.com/caioquirino/docker-cloudera-quickstart. The Kafka container instance, as
suggested by its name, will be running an instance of the Kafka distributed message queue server
along with an instance of the Zookeeper service. We can use the instance of this container to
create a topic, start producers and start consumers – which will be explained later. The Flume and
Spark container instances will be used to run our Flume agent and Spark streaming application
respectively.
The following figure shows the running containers:
Notice that we are running the container by giving it the name (--name) kafka and providing two environment variables, namely KAFKA and ZOOKEEPER. The name of this container will be used later to link it with the Flume container instance.
Tools used:
1. Nifi -nifi-1.10.0
2. Hadoop -hadoop_2.7.3
3. Hive-apache-hive-2.1.0
4. Spark-spark-2.4.5
5. Zookeeper-zookeeper-2.3.5
6. Kafka-kafka_2.11-2.4.0
7. Airflow-airflow-1.8.1
8. Tableau
Conclusion:
Quiz:
1. What is PySpark?
2. What are the characteristics of PySpark?
3. What do you mean by 'joins' in PySpark DataFrame?
4. What are the different ways to handle row duplication in a PySpark DataFrame?
5. What are RDDs in PySpark?
Suggested Reference:
https://www.projectpro.io/project-use-case/pyspark-etl-project-to-build-a-data-pipeline-using-hive-and-cassandra
https://www.linkedin.com/pulse/creating-data-pipeline-using-flume-kafka-spark-hive-mouzzam-hussain/
https://towardsdatascience.com/create-your-first-etl-pipeline-in-apache-spark-and-python-ec3d12e2c169
https://hevodata.com/learn/spark-data-pipeline/
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 11
Explore NoSQL database like MongoDB and perform basic CRUD operation
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical skill, Hadoop/MapReduce, MongoDB
Relevant CO: 3
Objective:
Theory:
MongoDB
MongoDB is an open-source, document-oriented database that is designed to store large volumes of data and also allows you to work with that data very efficiently. It is categorized as a NoSQL (Not only SQL) database because the storage and retrieval of data in MongoDB are not in the form of tables.
The MongoDB database is developed and managed by MongoDB Inc. under the SSPL (Server Side Public License) and was initially released in February 2009. It also provides official driver support for all the popular languages like C, C++, C#/.NET, Go, Java, Node.js, Perl, PHP, Python, Motor, Ruby, Scala, Swift, and Mongoid, so that you can create an application using any of these languages. Nowadays there are many companies, such as Facebook, Nokia, eBay, Adobe, and Google, that use MongoDB to store their large amounts of data.
Features of MongoDB –
Document-oriented: In MongoDB, all the data is stored in documents using fields (key-value pairs) instead of rows and columns, which makes the data much more flexible in comparison to RDBMS. Each document contains its unique object id.
Indexing: In the MongoDB database, every field in the documents can be indexed with primary and secondary indices, which makes it easier and faster to get or search data from the pool of data. If the data is not indexed, then the database must search each document against the specified query, which takes a lot of time and is not efficient.
Scalability: MongoDB provides horizontal scalability with the help of sharding. Sharding
means to distribute data on multiple servers, here a large amount of data is partitioned into
data chunks using the shard key, and these data chunks are evenly distributed across shards
that reside across many physical servers. It will also add new machines to a running
database.
Replication: MongoDB provides high availability and redundancy with the help of
replication, it creates multiple copies of the data and sends these copies to a different server
so that if one server fails, then the data is retrieved from another server.
Aggregation: It allows you to perform operations on grouped data and get a single or computed result. It is similar to the SQL GROUP BY clause. It provides three different ways to perform aggregation: the aggregation pipeline, the map-reduce function, and single-purpose aggregation methods.
High Performance: The performance of MongoDB is very high, with good data persistence compared to other databases, due to features like scalability, indexing, and replication.
Advantages of MongoDB :
It is a schema-less NoSQL database. You do not need to design the schema of the database when you are working with MongoDB.
It does not support join operations.
It provides great flexibility to the fields in the documents.
It contains heterogeneous data.
It provides high performance, availability, and scalability.
It supports Geospatial efficiently.
It is a document-oriented database and the data is stored in BSON documents.
It also supports multi-document ACID transactions (starting from MongoDB 4.0).
It is not susceptible to SQL injection.
It is easily integrated with Big Data Hadoop
Disadvantages of MongoDB:
We can use MongoDB for various things, like building an application (web or mobile), analyzing data, or administering a MongoDB database. In all these cases we need to interact with the MongoDB server to perform certain operations, such as entering new data into the application, updating data, deleting data, and reading data. MongoDB provides a set of basic but essential operations that help you to interact easily with the MongoDB server, and these operations are known as CRUD operations.
Procedure:
Create Operations –
The create or insert operations are used to insert or add new documents to the collection. If a collection does not exist, then a new collection will be created in the database. You can perform create operations using the following methods provided by MongoDB:
Method Description
db.collection.insertOne() It is used to insert a single document in the collection.
db.collection.insertMany() It is used to insert multiple documents in the collection.
db.createCollection() It is used to create an empty collection.
Example 1: In this example, we are inserting details of a single student in the form of document
in the student collection using db.collection.insertOne() method.
Example 2: In this example, we are inserting details of the multiple students in the form of
documents in the student collection using db.collection.insertMany() method.
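The shell screenshots are not reproduced here; representative commands for the student collection (the field names and values are illustrative) are:

db.student.insertOne({name: "Sumit", age: 20, year: 2})
db.student.insertMany([
    {name: "Amit", age: 21, year: 2},
    {name: "Priya", age: 20, year: 2}
])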
Read Operations –
The Read operations are used to retrieve documents from the collection, or in other words, read
operations are used to query a collection for a document. You can perform read operation using
the following method provided by the MongoDB:
Method Description
db.collection.find() It is used to retrieve documents from the collection.
Example : In this example, we are retrieving the details of students from the student collection
using db.collection.find()method.
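A sketch of the corresponding shell commands, assuming the student collection populated above:

db.student.find()                      (returns every document in the collection)
db.student.find({name: "Sumit"})       (returns only the documents that match the filter)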
Update Operations –
The update operations are used to update or modify the existing document in the collection. You
can perform update operations using the following methods provided by the MongoDB:
Method Description
db.collection.updateOne() It is used to update a single document in the collection that satisfies the given criteria.
db.collection.updateMany() It is used to update multiple documents in the collection that satisfy the given criteria.
db.collection.replaceOne() It is used to replace a single document in the collection that satisfies the given criteria.
Example 1: In this example, we are updating the age of Sumit in the student collection using the db.collection.updateOne() method.
Example 2: In this example, we are updating the year of course in all the documents in the
student collection using db.collection.updateMany() method.
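Representative shell commands for these two examples (the new values are illustrative):

db.student.updateOne({name: "Sumit"}, {$set: {age: 21}})      (updates the age of Sumit)
db.student.updateMany({}, {$set: {year: 3}})                  (updates the year of course in every document)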
Delete Operations –
The delete operations are used to delete or remove documents from a collection. You can perform delete operations using the following methods provided by MongoDB:
Method Description
db.collection.deleteOne() It is used to delete a single document from the collection that satisfies the given criteria.
db.collection.deleteMany() It is used to delete multiple documents from the collection that satisfy the given criteria.
Example 1: In this example, we are deleting a document from the student collection using
db.collection.deleteOne() method.
Example 2: In this example, we are deleting all the documents from the student collection using
db.collection.deleteMany() method.
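Representative shell commands for these two examples:

db.student.deleteOne({name: "Sumit"})      (deletes a single matching document)
db.student.deleteMany({})                  (deletes all documents in the student collection)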
Conclusion:
Quiz:
Suggested Reference:
https://www.techtarget.com/searchdatamanagement/definition/MongoDB
https://www.tutorialspoint.com/mongodb/index.htm
https://www.w3schools.com/mongodb/
https://en.wikipedia.org/wiki/MongoDB
https://www.mongodb.com/
Rubrics (1 to 5, Total): Complete implementation as asked; References; Correct answer to all questions
Marks:
Experiment No: 12
Case study based on the concept of Big Data Analytics. Prepare a presentation in a group of 4. Submit the PPT.
Date:
Competency and Practical Skills: Java Programming, Hadoop, Data Modeling, Analytical
skill, Hadoop/MapReduce
Relevant CO: 1
Objective:
Theory:
Conclusion:
Suggested Reference:
https://www.simplilearn.com/what-is-big-data-analytics-article
https://www.coursera.org/articles/big-data-analytics
https://www.ibm.com/analytics/big-data-analytics
https://www.techtarget.com/searchbusinessanalytics/definition/big-data-analytics
https://www.tableau.com/learn/articles/big-data-analytics
Rubrics (1 to 5, Total): Complete implementation as asked; Conclusions; Presentation Skills
Marks: