
Big Data & Analytics Lab Work
Submitted To
Dr. Minakshi Sharma
Submitted By
Arbaz Khan
(32013105)
Computer Engineering
2020-21

Department of Computer Engineering


National Institute of Technology Kurukshetra,
Haryana – 136119
Table of Contents
S.No.  Title                                                                Date

1.  WEKA tool – Implement the following concepts using the WEKA tool        14.1.2021
    and visualize the results (use any basic dataset such as Iris, Census,
    or Titanic from the repositories mentioned above):
    Project 1: Linear and Logistic Regression
    Project 2: Decision Trees
    Project 3: K-Means Clustering
    Project 4: Naïve Bayes Classifier

2.  Hadoop                                                                   22.1.2021
    (i)  To set up and install Hadoop in all three modes.
    (ii) Implement the following file management tasks in Hadoop:
         adding files and directories, retrieving files, deleting files.

3.  MapReduce                                                                24.1.2021
    (i)  To run basic Hello World and Word Count MapReduce programs to
         understand the MapReduce paradigm.

4.  PIG                                                                      01.2.2021
    (i)  To install and run Apache Pig in Windows so as to work with Hadoop.
    (ii) Exploring various shell commands in PIG.

5.  HIVE                                                                     04.2.2021
    (i)  To install and run HIVE in Windows.
    (ii) To explore Hive with its basic commands: create, alter, and drop
         databases, tables, views, functions and indexes.

6.  HBase                                                                    28.02.2021
    (i)  To install and run HBase in Windows.
    (ii) Exploring various commands in HBase.

7.  Spark                                                                    08.03.2021
    (i)  To install and run Spark in Windows.
    (ii) Exploring various commands in Spark.

8.  Flume                                                                    10.03.2021
    (i)  To install and run Flume in Windows.
    (ii) Exploring various commands in Flume.

9.  Sqoop                                                                    12.04.2021
    (i)  To install and run Sqoop in Windows.
    (ii) Exploring various commands in Sqoop.
CHAPTER 1: INTRODUCTION TO WEKA
What is WEKA?

WEKA, formally the Waikato Environment for Knowledge Analysis, is a computer program that was developed at the University of Waikato in New Zealand for the purpose of identifying information in raw data gathered from different domains.
WEKA supports many standard data mining tasks such as data pre-processing, classification, clustering, regression, visualization and feature selection. The basic premise of the application is to provide a program that can be trained to perform machine learning tasks and derive useful information in the form of trends and patterns.
WEKA is an open-source application that is freely available under the GNU General Public License. Originally written in C, the WEKA application has been completely rewritten in Java and is compatible with almost every computing platform. It is user friendly, with a graphical interface that allows for quick set-up and operation.
WEKA operates on the assumption that the user data is available as a flat file or relation; this means that each data object is described by a fixed number of attributes, usually of a specific type, normally nominal or numeric. WEKA gives novice users a simple-to-use tool, with visual interfaces, for identifying hidden information in databases and file systems.
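For reference, a minimal sketch of such a flat file in WEKA's ARFF format (the relation, attribute names and values here are illustrative, not taken from any particular dataset):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes

Each @attribute line declares one attribute and its type (nominal or numeric), and each row under @data describes one data object.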

KDD Process:
Installation of WEKA:

WEKA can be downloaded from different sites; one of them is
https://waikato.github.io/wekawiki/downloading_weka/
There are different options to launch WEKA depending on the operating system.

Depending on the version, click on the corresponding download option. When you click on the download option, the WEKA setup file gets downloaded. Run the setup and follow the steps below.
Step 1:

Click on Next button


Step2:

Click on I Agree option

Step 3:

Click on Next Option


Step 4:

Click on Next option

Step 5:

Click on Finish button


CHAPTER 1.2: LAUNCHING WEKA EXPLORER

1.2.1 Starting with Weka


1.2.2 Pre-processing
1.2.3 Loading the Data
Once the program has been installed on the user's machine, it is opened from the start menu; the exact location depends on the user's operating system.

There are four options available on this initial screen.

1. Explorer - the graphical interface used to conduct experimentation on raw data.
2. Simple CLI - provides users without a graphical interface the ability to execute commands from a terminal window.
3. Experimenter - this option allows users to conduct different experimental variations on data sets and perform statistical manipulation.
4. Knowledge Flow - basically the same functionality as Explorer, with drag-and-drop functionality. The advantage of this option is that it supports incremental learning from previous results.

After selecting the Explorer option the program starts and provides the user with a
separate graphical interface.
Figure shows the opening screen with the available options. At first there is only the
option to select the Pre-process tab in the top left corner. This is due to the necessity
to present the data set to the application so it can be manipulated. After the data has
been pre-processed the other tabs become active for use.
There are six tabs:

1. Pre-process- used to choose the data file to be used by the application


2. Classify- used to test and train different learning schemes on the pre-processed
data file under experimentation
3. Cluster- used to apply different tools that identify clusters within the data file

4. Association- used to apply different rules to the data file that identify
association within the data
5. Select attributes-used to apply different rules to reveal changes based on
selected attributes inclusion or exclusion from the experiment
6. Visualize - used to see what the various manipulations produced on the data set in a 2D format, as scatter plot and bar graph output.
Pre-processing:
In order to experiment with the application, the data set needs to be presented to
WEKA in a format that the program understands. There are rules for the type of data
that WEKA will accept. There are three options for presenting data into the program.
 Open File- allows for the user to select files residing on the local machine or
recorded medium.
 Open URL- provides a mechanism to locate a file or data source from a
different location specified by the user.
 Open Database- allows the user to retrieve files or data from a database
source provided by the user.
1. Available Memory: displays, in the log and in the "Status" box, the amount of memory available to WEKA in bytes.
2. Run garbage collector: forces the Java garbage collector to search for memory that is no longer used, free this memory up, and make it available for new tasks.

Loading data:

Once the data is loaded, WEKA recognizes the attributes, which are shown in the 'Attributes' window.
The left panel of the 'Preprocess' window shows the list of recognized attributes:
No.: a number that identifies the order of the attributes as they appear in the data file.
Selection tick boxes: allow you to select the attributes for the working relation.

Name: the name of an attribute as it was declared in the data file.


Name is the name of an attribute.
Type is most commonly nominal or Numeric.
Missing is the number (percentage) of instances in the data for which this attribute is
unspecified.
Distinct is the number of different values that the data contains for this attribute.
Unique is the number (percentage) of instances in the data having a value for this
attribute that no other instances have.

Once the data is loaded into WEKA, changes can be made to the attributes by clicking the Edit button shown above.
To make the changes, double-click on the attribute value and update the details as required. The different operations that can be performed through Edit are as follows:
1. Delete the attribute
2. Replace the attribute value
3. Set all values
4. Set missing values etc.

Click on visualize all


Attribute selection:

Setting Filters
Pre-processing tools in WEKA are called "filters". WEKA contains filters for discretization, normalization, resampling, attribute selection, transformation and combination of attributes. Some techniques, such as association rule mining, can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes.
Using filters, you can, for example, replace numeric values with nominal ones.
CHAPTER 1.3: CLASSIFIERS
1.3.1 Building classifiers
1.3.2 Setting Test Options

Building “Classifiers”:
Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning schemes available in WEKA include decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, and Bayes nets. "Meta"-classifiers include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.
Once you have your data set loaded, all the tabs are available to you. Click on the “Classify” tab.
The “Classify” window comes up on the screen. Now you can start analysing the data using the provided algorithms.
Setting Test Options:
Before you run the classification algorithm, you need to set test options. Set test options
in the
“Test options box”. The test options that available are:
1. Use training set: Evaluates the classifier on how well it predicts the class of the
instances it was trained on
2. Supplied test set: Evaluates the classifier on how well it predicts the class of a
set of instances loaded from a file. Clicking on the “Set…” button brings up a
dialog allowing you to choose the file to test on.
3. Cross-validation. Evaluates the classifier by cross-validation, using the number
of folds that are entered in the “Folds” text field.

4. Percentage split. Evaluates the classifier on how well it predicts a certain


percentage of the data, which is held out for testing. The amount of data held out
depends on the value entered in the “%” field.

In the “Classifier evaluation options” make sure that the following options are checked:
1. Output model. The output is the classification model on the full training set, so
that it can be viewed, visualized, etc.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output.

3. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output.
4. Store predictions for visualization. The classifier’s predictions are
remembered so that they can be visualized.

5. Set “Random seed for Xval / % Split” to 1. This specifies the random seed
used when randomizing the data before it is divided up for evaluation purposes.

Once the options have been specified, you can run the classification algorithm. Click
on “Start” button to start the learning process. You can stop learning process at any
time by clicking on the “Stop” button When training set is complete, the “Classifier”
output area on the right panel of “Classify” window is filled with text describing the
results of training and testing. A new entry appears in the “Result list” box on the left
panel of “Classify” window.
[Screenshots: the “Classify” tab showing the test options (use training set, supplied test set, 10-fold cross-validation, percentage split), the detailed per-class accuracy output and confusion matrix, and the tree and margin-curve visualizers for the J48, RandomTree, RandomForest, HoeffdingTree and LMT classifiers trained on the xAPI-Edu-Data dataset.]
CHAPTER 1.4: CLUSTERING
1.4.1 Clustering Data
1.4.2 Choosing Clustering Scheme
1.4.3 Setting Test Options
1.4.4 Visualization of results

Clustering Data: WEKA contains "clusterers" for finding groups of similar instances in a dataset. The clustering schemes available in WEKA are k-Means, EM, Cobweb, X-means, and FarthestFirst. Clusters can be visualized and compared to "true" clusters (if given). Evaluation is based on log likelihood if the clustering scheme produces a probability distribution.

Choosing the Clustering Scheme:
In the 'Preprocess' window click on the 'Open file…' button and select the iris.arff file; the corresponding window will pop up. Then click the 'Cluster' tab at the top of the WEKA Explorer window, and in the Cluster box click on the Choose button.
In the pull-down menu select WEKA Clusterers, and select the cluster scheme Hierarchical Clustering. Some implementations of k-means only allow numerical values for attributes; therefore, we do not need to use a filter.
Once the clustering algorithm is chosen, right-click on the algorithm; the weka.gui.GenericObjectEditor dialog comes up on the screen. Set the value in the numClusters box to 5 (instead of the default 2) because we have five clusters in our .arff file. Leave the value of seed as it is. The seed value is used in generating a random number, which is used for making the initial assignment of instances to clusters.

Setting Test Options:


Before you run the clustering algorithm, you need to choose the Cluster mode. Click on the "Classes to clusters evaluation" radio button in the Cluster mode box and select the class attribute in the pull-down box below.
Once the options have been specified, you can run the clustering algorithm. Click on
the Start button to execute the algorithm
Visualization of Results

Another way of representing the results of clustering is through visualization. Right-click on the entry in the Result list and select 'Visualize cluster assignments' in the pull-down window.
Clustering Result

Representation in Dendrogram
CHAPTER 1.5: ASSOCIATIONS

1.5.1 Finding Associations


1.5.2 Setting Test Options

WEKA contains an implementation of the Apriori algorithm for learning association rules. This is the only currently available scheme for learning associations in WEKA. It works only with discrete data and will identify statistical dependencies between groups of attributes. Apriori can compute all rules that have a given minimum support and exceed a given confidence.

Right-click on the Associator box and click on 'Show properties'; the GenericObjectEditor appears on your screen. In the dialog box, change the value in minMetric to 0.4 for confidence = 40%. Make sure that the default number of rules is set to 100. The upper bound for minimum support, upperBoundMinSupport, should be set to 1.0 (100%) and lowerBoundMinSupport to 0.1. Apriori in WEKA starts with the upper bound support and incrementally decreases support (by delta increments, which by default is 0.05 or 5%). The algorithm halts when either the specified number of rules is generated or the lower bound for minimum support is reached. The significanceLevel testing option is only applicable in the case of confidence and is -1.0 by default (not used).
Once the options have been specified, you can run the Apriori algorithm. Click on the Start button to execute the algorithm.
CHAPTER 1.6: ATTRIBUTE SELECTION

1.6.1 Introduction
1.6.2 Selecting Options

Introduction:
Attribute selection searches through all possible combinations of attributes in the data
and finds which subset of attributes works best for prediction. Attribute selection
methods contain two parts: a search method such as best-first, forward selection,
random, exhaustive, genetic algorithm, ranking, and an evaluation method such as
correlation-based, wrapper, information gain, chi-squared. Attribute selection
mechanism is very flexible - WEKA allows (almost) arbitrary combinations of the
two methods.
To begin an attribute selection, click Select attributes tab.

Selecting Options

To search through all possible combinations of attributes in the data and find which
subset of attributes works best for prediction, make sure that you set up attribute
evaluator to CfsSubsetEval and a search method to BestFirst. The evaluator will
determine what method to use to assign a worth to each subset of attributes. The
search method will determine what style of search to perform.
The options that you can set for selection in the Attribute Selection Mode box are:
1. Use full training set. The worth of the attribute subset is determined using the
full set of training data.
2. Cross-validation. The worth of the attribute subset is determined by a process
of cross validation.
The Fold and Seed fields set the number of folds to use and the random seed used
when shuffling the data.
Specify which attribute to treat as the class in the drop-down box below the test
options. Once all the test options are set, you can start the attribute selection process
by clicking on Start button.
CHAPTER 1.7: DATA VISUALIZATION

1.7.1 Introduction
1.7.2 Changing the view
1.7.3 Selecting instances

Introduction:
WEKA visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D visualizations (Xgobi-style). WEKA has a Jitter option to deal with nominal attributes and to detect "hidden" data points.

Select a square that corresponds to the attributes you would like to visualize. For example, let's choose petalwidth for the X-axis and sepallength for the Y-axis. Click anywhere inside the square.
A visualization window appears on the screen.
Changing the View
In the visualization window, beneath the X-axis selector, there is a drop-down list, Colour, for choosing the colour scheme. This allows you to choose the colour of points based on the attribute selected. Below the plot area there is a legend that describes what values the colours correspond to. In this example, red represents 'no', while blue represents 'yes'. For better visibility you should change the colour of the label 'yes'. Left-click on 'yes' in the Class colour box and select a lighter colour from the colour palette. To the right of the plot area there is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of that attribute. You can choose which axes are used in the main graph by clicking on these strips (left-click changes the X-axis, right-click changes the Y-axis). The software sets the X-axis to the petalwidth attribute and the Y-axis to sepallength. The instances are spread out in the plot area and concentration points are not visible. Keep sliding Jitter, a random displacement given to all points in the plot, to the right until you can spot concentration points.

Selecting Instances:
Sometimes it is helpful to select a subset of the data using visualization tool. A
special case is the User Classifier, which lets you to build your own classifier by
interactively selecting instances. Below the Y– axis there is a drop-down list that
allows you to choose a selection method. A group of points on the graph can be
selected in four ways
Select Instance Click on an individual data point. It brings up a window listing
attributes of the point.
If more than one point will appear at the same location, more than one set of attributes
will be shown.
1. Rectangle: You can create a rectangle by dragging it around the points.
2. Polygon: You can select several points by building a free-form
polygon. Left-click on the graph to add vertices to the polygon and
right-click to complete it.

3. Polyline: To distinguish the points on one side from the once on


another, you can build a polyline. Left-click on the graph to add
vertices to the polyline and right- click to finish.
Experiment – 02 Hadoop
Hadoop Installation on Windows

You can install Hadoop on your own system, which is a feasible way to learn Hadoop.
We will be installing a single-node pseudo-distributed Hadoop cluster on Windows 10.

Prerequisite: To install Hadoop, you should have Java version 1.8 in your system. Check your
java version through this command on command prompt:

java -version

If Java is not installed in your system, then:

Go to this link:
https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downl...
Accept the license and download the file according to your operating system. Keep the Java folder directly under the local disk directory (C:\Java\jdk1.8.0_281) rather than in Program Files (C:\Program Files\Java\jdk1.8.0_281), as the latter can cause errors afterwards.

After downloading java version 1.8, download hadoop version 3.0 and extract it to a folder.

Setup System Environment Variables


Open control panel to edit the system environment variable

Go to environment variable in system properties


Create a new user variable. Put the Variable name as HADOOP_HOME and Variable value as
the path of the bin folder where you extracted hadoop.

Likewise, create a new user variable with variable name as JAVA_HOME and variable value
as the path of the bin folder in the Java directory.
Now we need to set Hadoop bin directory and Java bin directory path in system variable path.
Edit Path in system variable

Click on New and add the bin directory path of Hadoop and Java in it.

Configurations:
Now we need to edit some files located in the hadoop directory of the etc folder where we
installed hadoop. The files that need to be edited have been highlighted.
1. Core site configuration
Now, we should configure the name node URL adding the following XML code into the
<configuration></configuration> element within “core-site.xml”:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9820</value>
</property>
2. Map Reduce site configuration
Now, we should add the following XML code into the <configuration></configuration>
element within “mapred-site.xml”: <property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>MapReduce framework name</description>
</property>

3. Create a folder ‘data’ in the hadoop directory

Create a folder with the name ‘datanode’ and a folder ‘namenode’ in this data
directory

4. HDFS site configuration


As we know, Hadoop is built using a master-slave paradigm. Before altering the HDFS configuration file, we should create a directory to store all master node (name node) data and another one to store the slave node (data node) data. In this example, we created the following directories:
 E:\hadoop-env\hadoop\data\dfs\namenode
 E:\hadoop-env\hadoop\data\dfs\datanode
Now, let’s open “hdfs-site.xml” file located in “%HADOOP_HOME%\etc\hadoop” directory,
and we should add the following properties within the <configuration></configuration>
element:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///E:/hadoop-env/hadoop-3.0.0/data/dfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///E:/hadoop-env/hadoop-3.2.1/data/dfs/datanode</value>
</property>

5. Yarn site configuration


Now, we should add the following XML code into the <configuration></configuration>
element within “yarn-site.xml”:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Yarn Node Manager Aux Service</description>
</property>

6. Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of the Java folder where your JDK 1.8 is installed.
Hadoop needs Windows-specific files which do not come with the default download of Hadoop. To include those files, replace the bin folder in the Hadoop directory with the bin folder provided in this GitHub link:
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as a zip file, extract it and copy the bin folder from it. If you want to keep the old bin folder, rename it to something like bin_old and paste the copied bin folder in that directory.
Check whether Hadoop is successfully installed by running this command on cmd:
hadoop version

Since it doesn't throw an error and successfully shows the Hadoop version, Hadoop is installed correctly on the system.

Format the NameNode

Formatting the NameNode is done only once, when Hadoop is installed, and not every time the Hadoop filesystem is run; otherwise it will delete all the data inside HDFS. Run this command:
hdfs namenode -format

Access the Hadoop UI from a Browser:

1. Use your preferred browser and navigate to your localhost URL or IP. The default port number 9870 gives you access to the Hadoop NameNode UI:
http://localhost:9870
The NameNode user interface provides a comprehensive overview of the entire cluster.
2. The default port 9864 is used to access individual DataNodes directly from your browser:
http://localhost:9864
3. The YARN Resource Manager is accessible on port 8088:
http://localhost:8088
The Resource Manager is an invaluable tool that allows you to monitor all running processes in your Hadoop cluster.
Hadoop HDFS Commands
HDFS is the primary or major component of the Hadoop ecosystem which is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files. To use the HDFS commands, first you need
to start the Hadoop services using the following command:

sbin/start-dfs.sh
Hadoop Version: Prints the Hadoop Version

mkdir Command: To create a directory

Put Command: To copy files/folders from local file system to hdfs store.

ls Command: This command is used to list all the files. It will print all the directories
present in HDFS.

copyFromLocal Command: To copy files/folders from local file system to hdfs store. It is
similar to put command.

mv Command: This command is used to move files within hdfs.

cp Command: This command is used to copy files within hdfs.

rm Command: Removes the file or empty directory identified by <path>

rmr Command: This command deletes a file from HDFS recursively. It is a very useful command when you want to delete a non-empty directory.
HDFS to Local:
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.

cat Command: Display the contents of the file.

help Command: HDFS Command that displays help for given command or all commands if
none is specified.

help ls Command: If we want help regarding any particular command, then we can use this
command.
du Command: HDFS Command to check the file size.

touchz Command: HDFS Command to create a file in HDFS with file size 0 bytes.

stat Command: It will give the last modified time of directory or path. In short it will give
stats of the directory or file.
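As a quick reference, the commands described above take roughly the following form; the directory /user/demo and the file names are illustrative placeholders (hdfs dfs -get works like -copyToLocal, and -rm -r is the current form of the older rmr command):

hadoop version
hdfs dfs -mkdir /user/demo
hdfs dfs -put sample.txt /user/demo
hdfs dfs -ls /user/demo
hdfs dfs -copyFromLocal sample.txt /user/demo/copy.txt
hdfs dfs -mv /user/demo/copy.txt /user/demo/moved.txt
hdfs dfs -cp /user/demo/moved.txt /user/demo/copy2.txt
hdfs dfs -copyToLocal /user/demo/sample.txt D:\data
hdfs dfs -cat /user/demo/sample.txt
hdfs dfs -du /user/demo
hdfs dfs -touchz /user/demo/empty.txt
hdfs dfs -stat /user/demo
hdfs dfs -help ls
hdfs dfs -rm /user/demo/copy2.txt
hdfs dfs -rm -r /user/demo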
Experiment – 03 MapReduce
How to run a Hello World program in Hadoop (Windows)

Write a basic Java Hello World program using any text editor and save the file as HelloWorld.java.

Compile this Java file using the command prompt. After that, a .class file will be created.

You can create a manifest file in any text editor, and you can give your manifest file any name. Here the file name is m1.txt, and it contains the main class.

Now, to convert this .class file into a .jar, write the following command:
jar cfm means "create a jar, specify the output jar file name, specify the manifest file name." This is followed by the name you wish to give to your jar file, the name of your manifest file, and the list of .class files that you want included in the jar (*.class means all class files in the current directory).
Now, in order to run the jar file using Hadoop, run your Hadoop cluster and write the command to run the jar file in Hadoop, specifying the jar file location in the local file system and the name of the class ('HelloWorld' in this example) which is to be called.
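Put together, the sequence of commands looks roughly like this (the jar name HelloWorld.jar is illustrative; m1.txt and the class name HelloWorld follow the example above):

javac HelloWorld.java
jar cfm HelloWorld.jar m1.txt *.class
hadoop jar HelloWorld.jar HelloWorld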
Run WordCount Program in Hadoop (Windows)

Step 1: Write the java MapReduce program for wordcount.

Step 2: Compile the .java file and convert the output files into a .jar (or you can simply download the .jar from the following link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar )

Step 3: Make sure that your Hadoop is working fine by running the following commands.
● Open cmd as administrator.
● Change directory to /hadoop/sbin and start the cluster:
● start-all.cmd OR start-yarn.cmd & start-dfs.cmd
(Check whether your cluster shuts down by itself. If so, there might be some issue with the Hadoop installation.)
Step 4: Create a directory in hdfs, where we will save our input file.

Step 5: Create a .txt file having some textual content to apply word count with the name
“file.txt”

Step 6: Copy the input file to the hadoop file system in the directory which we just created.

This will copy the file.txt file from the local file system to the hadoop file system in the
specified directory i.e. /input/.

Step 7: Verify input file file.txt is copied properly or not.

Step 8: You can also verify the content of the file by specifying the path of the file in the
hadoop file system.

Step 9: Run MapReduceClient.jar file.

hadoop jar D:\MapReduceClient.jar /inputdir /outputdir


Step 10: Check the content of the output.
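A sketch of the commands for steps 3 to 10, assuming /input and /output as the HDFS input and output directories (the directory names are illustrative, and the MapReduce output file is typically named part-r-00000):

start-all.cmd
hdfs dfs -mkdir /input
hdfs dfs -put D:\file.txt /input
hdfs dfs -ls /input
hdfs dfs -cat /input/file.txt
hadoop jar D:\MapReduceClient.jar /input /output
hdfs dfs -cat /output/part-r-00000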
Experiment – 04 Apache PIG

Installing Apache Pig on Windows 10

1. Prerequisites
1.1. Hadoop Cluster Installation

2. Downloading Apache Pig


To download Apache Pig, you should go to the following link:
https://downloads.apache.org/pig/

If you are looking for the latest version, navigate to the "latest" directory.
1. Download the pig-version.tar.gz file.
2. Now, after downloading, extract Pig anywhere. In this case I have extracted Pig into the Hadoop home directory, i.e. C:\hadoop\pig.
3. Set the environment variables in system settings.

3.1 Search for View advanced system settings


3.2 Click on Environment Variables

3.3 First add a new User Variable i.e. PIG_HOME as Pig extracted folder path
[Screenshot: the Environment Variables dialog showing the new user variable PIG_HOME set to C:\hadoop\pig, alongside the existing HADOOP_HOME (C:\hadoop\bin) and JAVA_HOME (C:\Java\bin) variables.]


3.4 Now, we should edit the Path user variable to add the following paths:

3.5 Add the new path, i.e. C:\hadoop\pig\bin

3.6 Now click OK and save it all.
4. Navigate to the PIG_HOME\bin folder, open pig.cmd and set HADOOP_BIN_PATH to C:\hadoop\libexec as shown in the screenshot below.

5. Starting Apache Pig


After setting environment variables, let's try to run Apache Pig.

Note: Hadoop Services must be running


Open a command prompt as administrator, and execute the following command:

pig -version
The simplest way to write Pig Latin statements is using the Grunt shell, which is an interactive tool where we write a statement and get the desired output. There are two modes in which to invoke the Grunt shell:
1. Local: All scripts are executed on a single machine without requiring Hadoop. (command: pig -x local)

2. MapReduce: Scripts are executed on a Hadoop cluster. (command: pig -x mapreduce)


Apache Pig - Diagnostic Operators

The load statement will simply load the data into the specified relation in Apache Pig. To verify
the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin
provides four different types of diagnostic operators −
1. Dump operator
2. Describe operator
3. Explanation operator
4. Illustration operator

1. Dump Operator: This command is used to display the results on screen. It usually helps
in debugging.
grunt> Dump Relation_Name;
2. Describe Operator: The describe operator is used to view the schema of a relation.
Syntax:
The syntax of the describe operator is as follows –
grunt> Describe Relation_name;

3. Explain Operator: The explain operator is used to display the logical, physical, and
MapReduce execution plans of a relation.
grunt> explain Relation_name;
4. Illustrate Operator: The illustrate operator gives you the step-by-step execution of a
sequence of statements.
grunt> illustrate Relation_name;
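All four operators act on a relation created with the load statement. A minimal sketch, assuming a comma-separated HDFS file student.txt with id, firstname and city fields (the file path and schema are illustrative):

grunt> student = LOAD '/pig_data/student.txt' USING PigStorage(',') AS (id:int, firstname:chararray, city:chararray);
grunt> Dump student;
grunt> Describe student;
grunt> Explain student;
grunt> Illustrate student;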
Grouping & Joining

1. Group Operator: The GROUP operator is used to group the data in one or more relations.
It collects the data having the same key.

grunt> group_data = GROUP Relation_name by firstname;

Verification:
Verify the relation group_data using the DUMP operator as shown below:
grunt> Dump group_data;
You can see the schema of the table after grouping the data using the describe command as
shown below:

grunt> describe group_data;

In the same way, you can get the sample illustration of the schema using
the illustrate command as shown below:

grunt> illustrate group_data;

Grouping by Multiple Columns


grunt> group_multiple = GROUP student by (firstname, city);

2. Join Operator
The JOIN operator is used to combine records from two or more relations. While performing
a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these
keys match, the two particular tuples are matched, else the records are dropped. Joins can be
of the following types −
 Self-join
 Inner-join
 Outer-join − left join, right join, and full join
Self - join
Self-join is used to join a table with itself as if the table were two relations, temporarily
renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times,
under different aliases (names). Therefore, let us load the contents of the file customer.txt as
two tables as shown below.

Syntax
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;

Verification
Verify the relation customers3 using the DUMP operator as shown below.

Output
Combining

Union Operator:

The UNION operator of Pig Latin is used to merge the content of two relations. To perform
UNION operation on two relations, their columns and domains must be identical.

Syntax:
Given below is the syntax of the UNION operator:
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Assume that we have two files namely student.txt and student1.txt in the directory of HDFS
as shown below.

student.txt

student1.txt

And we have loaded these two files into Pig with the relations student and student1 as shown
below.

Let us now merge the contents of these two relations using the UNION operator as shown
below.

grunt> student = UNION student, student1;


Verification
Verify the relation student using the DUMP operator as shown below.

Output
Filtering

1. Distinct Operator
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
grunt> Relation_name2 = DISTINCT Relation_name1;

Assume that we have a file named student.txt in the HDFS directory as shown below and we
have loaded this file into Pig with the relation name student as shown below.

Let us now remove the redundant (duplicate) tuples from the relation named student using
the DISTINCT operator, and store it as another relation named distinct_data as shown below.

Verification

Output
2. Foreach Operator: The FOREACH operator is used to generate specified data
transformations based on the column data.

Syntax:
Given below is the syntax of FOREACH operator.

grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

student.txt

we have loaded this file into Pig with the relation name.

Let us now get the id, firstname, and city values of each student from the relation student and
store it into another relation named foreach_data using the foreach operator as shown below.

Verify the relation foreach_data using the DUMP operator as shown below.

Output
Sorting

Order By:
The ORDER BY operator is used to display the contents of a relation in a sorted order based
on one or more fields.

Syntax:
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

Let us now sort the relation in an ascending order based on the firstname of the student and
store it into another relation named order_by_data using the ORDER BY operator as shown
below. Also verify the relation order_by_data using the DUMP operator.

Output
Eval Functions
CONCAT(): The CONCAT() function of Pig Latin is used to concatenate two or more
expressions of the same type.

Syntax:
grunt> CONCAT (expression, expression, [...expression]);

student.txt

we have loaded this file into Pig with the relation name student as shown below.

In the above schema, you can observe that the name of the student is represented using two
chararray values namely firstname and lastname. Let us concatenate these two values using
the CONCAT() function.

Verify the relation student_name_concat using the DUMP operator as shown below.

Output
Word Count in Pig Latin
1. Here I have taken some sample text in which we will find different words
and no. of times they appear in our text.

Save this text as wc.txt anywhere on your local drive. In my case I have saved
this as wc.txt on D:\ location.

2. Now we need to put this txt file on our HDFS. To do that I have created a new folder named pig_wc using this command:
hdfs dfs -mkdir /pig_wc

3. Now put the above created .txt file onto HDFS using this command:
hadoop fs -put D:/wc.txt /pig_wc

4. Now start hadoop and then start pig


5. Load the data from the wc.txt file into Pig using this command:

Relation_name = LOAD 'file_path' as (variable)

grunt> wc_data = LOAD '/pig_wc/wc.txt' as (line);

6. Now use this command

grunt> words = foreach wc_data generate flatten(TOKENIZE(line)) as word;


TOKENIZE splits the line into a field for each word. Flatten will take the
collection of records returned by TOKENIZE and produce a separate record
for each one, calling the single field in the record word

7. Now group them together by each word


grunt> grpd = group words by word;

8. Now count all the words


grunt> cntd = foreach grpd generate group, COUNT(words);
9. Now just dump the data using this command
grunt> dump cntd;
Experiment – 05 Apache HIVE

Installing Apache Hive on Windows 10


1. Prerequisites

1.1. Installing Hadoop


To install Apache Hive, you must have a Hadoop Cluster installed and running.

1.2. Apache Derby


In addition, Apache Hive requires a relational database to create its Metastore (where all
metadata will be stored).

1.3. Cygwin
Since some Hive 3.1.2 tools (such as the schema tool) aren't compatible with Windows, we will need the Cygwin tool to run some Linux commands.

2. Downloading Apache Hive binaries


In order to download the Apache Hive binaries, you should go to the following website:
https://downloads.apache.org/hive/hive-3.1.2/. Then, download the apache-hive-3.1.2-bin.tar.gz file.
3. Setting environment variables
After extracting Derby and Hive archives, we should go to Control Panel > System and Security
> System. Then Click on “Advanced system settings”.

In the advanced system settings dialog, click on “Environment variables” button.


Now we should add the following user variables:
 HIVE_HOME: “C:\hadoop\hive\”
 DERBY_HOME: “C:\hadoop\db-derby\”
 HIVE_LIB: “%HIVE_HOME%\lib”
 HIVE_BIN: “%HIVE_HOME%\bin”
 HADOOP_USER_CLASSPATH_FIRST: “true”

Besides, we should add the following system variable:


 HADOOP_USER_CLASSPATH_FIRST: “true”

Now, we should edit the Path user variable to add the following paths:
 %HIVE_BIN%
 %DERBY_HOME%\bin

4. Configuring Hive

4.1. Copy Derby libraries


Now, we should go to the Derby libraries directory (C:\hadoop\db-derby\lib) and copy all *.jar
files.

Then, we should paste them within the Hive libraries directory (C:\hadoop\hive\lib).
4.2. Configuring hive-site.xml
Now, we should go to the Apache Hive configuration directory (C:\hadoop\hive\conf) create a
new file “hive-site.xml”. We should paste the following XML code within this file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration><property> <name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property><property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<description>Enable user impersonation for HiveServer2</description>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
<description> Client authentication types. NONE: no authentication check LDAP: LDAP/AD
based authentication KERBEROS: Kerberos/GSSAPI authentication CUSTOM: Custom
authentication provider (Use with property hive.server2.custom.authentication.class)
</description>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>True</value>
</property>
</configuration>

5. Starting Services

5.1. Hadoop Services


To start Apache Hive, open the command prompt utility as administrator. Then, start the Hadoop
services using start-dfs and start-yarn commands.

5.2. Derby Network Server


Then, we should start the Derby network server on the localhost using the following command:
C:\hadoop\db-derby\bin\StartNetworkServer -h 0.0.0.0

6. Starting Apache Hive

Now, let's open a command prompt, go to the Hive binaries directory (C:\hadoop\hive\bin) and execute the following command:
hive

then Apache Hive will start successfully.


HIVE Commands

 Start Hive

 Create and Show database

 Create Table

 Describe Table
 Insert values into table

 Select operation on table

 Select Operation by Specifying Condition

 Alter Table Command for Renaming Table

 Select Operation from new table


 Show tables from Database

 Alter Table Command to add Column in Table

 Describe Command to Show added Column
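A sketch of the commands listed above in HiveQL; the database, table and column names are illustrative:

CREATE DATABASE IF NOT EXISTS college;
SHOW DATABASES;
USE college;
CREATE TABLE student (id INT, name STRING, city STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DESCRIBE student;
INSERT INTO TABLE student VALUES (1, 'Rahul', 'Delhi');
SELECT * FROM student;
SELECT * FROM student WHERE city = 'Delhi';
ALTER TABLE student RENAME TO stu_rec;
SELECT * FROM stu_rec;
SHOW TABLES;
ALTER TABLE stu_rec ADD COLUMNS (marks INT);
DESCRIBE stu_rec;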


PARTITIONING
Insert Command:

The insert command is used to load data into a Hive table. Inserts can be done into a table or a partition.
 INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
 INSERT INTO is used to append data to the existing data in a table.

Loading data into the created table Stu_Rec:

Creation of a partition table:

'Partitioned by' is used to divide the table into partitions.

For partitioning, we have to set this property:

Loading data into the partition table:
BUCKETING
Creating a bucketed table as shown below:

Displaying the 4 buckets created above:
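A sketch of the partitioning and bucketing statements described above, assuming the Stu_Rec table from the previous section with id, name and city columns (the table names and property settings are illustrative):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE stu_part (id INT, name STRING)
  PARTITIONED BY (city STRING);

INSERT OVERWRITE TABLE stu_part PARTITION (city)
  SELECT id, name, city FROM stu_rec;

SET hive.enforce.bucketing = true;

CREATE TABLE stu_bucket (id INT, name STRING, city STRING)
  CLUSTERED BY (id) INTO 4 BUCKETS;

INSERT OVERWRITE TABLE stu_bucket SELECT id, name, city FROM stu_rec;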


Experiment – 06 Apache HBase

Setup HBase in Windows 10 (Standalone Mode)

Pre-Requisite:
We are going to make a standalone setup of HBase in our machine which requires:
Java JDK 1.8
HBase - Apache HBase

HBase Installation Steps:


Step 1: Unzip the downloaded Hbase and place it in some common path, say C:/hbase-2.3.4

Step 2: Create the folders shown below inside the root folder, for the HBase data and ZooKeeper data:
C:/hbase-2.3.4/hbase
C:/hbase-2.3.4/zookeeper

Step 3:
Open C:/hbase-2.3.4/bin/hbase.cmd. Search for below given lines and
remove %HEAP_SETTINGS% from that line.

set java_arguments=%HEAP_SETTINGS% %HBASE_OPTS% -classpath


"%CLASSPATH%" %CLASS% %hbase-command-arguments%

Step 4:
Open C:/hbase-2.3.4/conf/hbase-env.cmd. Add the below lines to the file:
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true
set HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false"
set HBASE_MASTER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10101"
set HBASE_REGIONSERVER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10102"
set HBASE_THRIFT_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10103"
set HBASE_ZOOKEEPER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10104"
set HBASE_REGIONSERVERS=%HBASE_HOME%\conf\regionservers
set HBASE_LOG_DIR=%HBASE_HOME%\logs
set HBASE_IDENT_STRING=%USERNAME%
set HBASE_MANAGES_ZK=true

Step 5:
Open C:/hbase-2.3.4/conf/hbase-site.xml. Add the below lines inside <configuration> tag.

<property>
<name>hbase.rootdir</name>
<value>file:///C:/Documents/hbase-2.2.5/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/C:/Documents/hbase-2.2.5/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>

Step 6:
Setup the Environment variable for HBASE_HOME and add bin to the path variable.

Now we are all set to run HBase. To start HBase, execute the commands below from the bin folder:
Open a command prompt and cd to HBase's bin directory.
Run start-hbase.cmd
Test the installation using the HBase shell.
HBASE Commands
General commands
Run the hbase shell.

1. Basic HBase Commands.

status - Provides the status of HBase, for example, the number of servers.

version - Provides the version of HBase being used.

whoami - Provides information about the user.


2. HBase Data Definition Commands.

create command: Creates a table.

list command: Lists all the tables in HBase.

disable command: Disables a table in HBase.

Now, if we try to scan our disabled table, it will throw an error.

Let us see.
is_disabled command: Checks if the table is disabled or not and returns true if YES.

enable command: Enables the disabled table in HBase.

is_enabled command: Checks if the table is enabled or not and returns true if
YES.

drop command: Drops a table from HBase. But in order to drop a table, we need
to disable it first.
We can see if our table has been deleted or not by using list command.

describe command: Provides the description of the table.
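A sketch of an hbase shell session covering the commands above; the table name 'student' and column family 'info' are illustrative:

status
version
whoami
create 'student', 'info'
list
scan 'student'
disable 'student'
is_disabled 'student'
enable 'student'
is_enabled 'student'
describe 'student'
disable 'student'
drop 'student'
list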


Experiment – 07 Apache Spark

Install Apache Spark on Windows 10

1. Prerequisites
1.1. Install Java 8
1.2.Install Python

2. Download Apache Spark


2.1. Open a browser and navigate to https://spark.apache.org/downloads.html.
2.2. Under the Download Apache Spark heading, there are two drop-down menus. Use the
current non-preview version.
In our case, in Choose a Spark release drop-down menu select 3.1.1 (Mar 02 2021).
In the second drop-down Choose a package type, leave the selection Pre-built for Apache
Hadoop 3.2 and above.
2.3. Click the spark-3.1.1-bin-hadoop3.2.tgz link.
3. Install Apache Spark
Installing Apache Spark involves extracting the downloaded file to the desired location.

4. Add winutils.exe File


If you have not downloaded Hadoop earlier, then you need to download the winutils.exe file for the underlying Hadoop version of the Spark installation you downloaded. For this:
4.1. Navigate to this URL https://github.com/cdarlint/winutils and, inside the bin folder, locate winutils.exe and click it.
4.2. Find the Download button on the right side to download the file.
4.3. Copy the winutils.exe file to C:\hadoop\bin.

5. Configure Environment Variables


Configuring environment variables in Windows adds the Spark and Hadoop locations to your
system PATH. It allows you to run the Spark shell directly from a command prompt window.
5.1. Click Start and type environment.
5.2. Select the result labelled Edit the system environment variables.
5.3.A System Properties dialog box appears. In the lower-right corner, click Environment
Variables and then click New in the next window.
5.4. For Variable Name type SPARK_HOME.
5.5.For Variable Value type C:\Spark\spark-3.1.1-bin-hadoop3.2 and click OK. If you
changed the folder path, use that one instead.
5.6. In the top box, click the Path entry, then click Edit.
5.7.Enter the path to the Spark folder C:\Spark\spark-3.1.1-bin-hadoop3.2\bin. We
recommend using %SPARK_HOME%\bin to avoid possible issues with the path.
5.8. Repeat this process for Hadoop also.
6. Launch Spark
6.1. Open a new command-prompt window using the right-click and Run as administrator:
6.2. To start Spark, enter the following command:
spark-shell
Finally, the Spark logo appears, and the prompt displays the Scala shell.

6.4. Open a web browser and navigate to http://localhost:4040/.


6.5. You should see an Apache Spark shell Web UI. The example below shows
the Executors page.

7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window
SPARK Commands

1. Start the Spark Shell

The basic data structure of Spark is called an RDD (Resilient Distributed Dataset), which represents an immutable collection of objects for distributed computing of records. All the datasets of an RDD are partitioned logically across multiple nodes of a cluster.
An RDD can be created by reading data from a file system, by parallelizing an existing collection, or by transforming an existing RDD.
a) To create a new RDD we use the following command:

Here sc is called the object of SparkContext.

b) An RDD can be created through a parallelized collection as follows:

c) To create one from an existing RDD:
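The three ways (a), (b) and (c) can be sketched in the spark-shell as follows; the file path and values are illustrative, and sc is the SparkContext object provided by the shell:

scala> val data = sc.textFile("D:/sample.txt")
scala> val Newdata = sc.parallelize(Seq(1, 2, 3, 4, 5))
scala> val doubled = Newdata.map(x => x * 2)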


There are two types of Spark RDD Operations which can be performed on the created datasets:
1. Actions
2. Transformations

1. Actions: It is used to perform certain required operations on the existing datasets. Following
are a few of the commands which can be used to perform the below actions on the created
datasets:
a) count() function to count the number of elements in RDD:

b) collect() function to display all the elements of the array:

c) first() function used to display the first element of the dataset:

d) take(n) function displays the first n elements of the array:

e) takeSample(withReplacement, num, [seed]) function displays a random array of "num" elements, where the seed is for the random-number generator.

f) saveAsTextFile(path) function saves the dataset in the specified path of hdfs location

g) partitions.length function can be used to find the number of partitions in the RDD
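A sketch of these actions applied to the illustrative RDD Newdata created earlier (the HDFS output path is a placeholder):

scala> Newdata.count()
scala> Newdata.collect()
scala> Newdata.first()
scala> Newdata.take(3)
scala> Newdata.takeSample(false, 2)
scala> Newdata.saveAsTextFile("/spark_output")
scala> Newdata.partitions.length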

2. RDD Transformations
Transformation is used to form a new RDD from the existing ones. Since the inputs of the RDD
are immutable, the result formed upon transformation can be one or more RDD as output.
There are two types of transformations:
2.1.Narrow Transformations
2.2.Wide Transformations
2.1. Narrow Transformations – Each parent RDD is divided into various partitions and among these only one partition will be used by the child RDD.
 map() and filter() are the two basic kinds of narrow transformations; they are evaluated when an action is called.
 map(func) function operates on each of the elements in the dataset “value” iteratively to produce the output RDD.

In this example, we are adding the value 5 to each of the elements of the dataset “Newdata” and
displaying the transformed output with the help of collect function.

 filter(func) function is basically used to filter out the elements satisfying a particular
condition specified using the function.
In this example, we are trying to retrieve all the elements except number 4 of the dataset
“Newdata” and fetching the output via the collect function.

2.2. Wide Transformations – A single parent RDD partition is shared upon its various
multiple child RDD partitions.
 groupByKey and reduceByKey are examples of wide transformations.
 groupbyKey function groups the dataset values into key-value pairs according to the key
values from another RDD. This process involves shuffling to take place when the group
by function collects the data associated with a particular key and stores them in a single
key-value pair.

Example: In this example, we are assigning the integers 6,7 to the string value “key” and
integers 9,2 assigned to “val” which are displayed in the same key-value pair format in the
output.
 reduceByKey function also combines the key-value pairs from different RDD’s. It
combines the keys and their respective values into a single element after performing the
mentioned transformation.

Example: In this example, the common keys of the array “letters” are first parallelized by the function and each letter is mapped with a count of 20. reduceByKey then adds the values having similar keys and saves the result in the variable Reduce. The output is then displayed using the collect function.
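A sketch of the transformations described above; the values follow the examples in the text, and the variable names are illustrative:

scala> Newdata.map(x => x + 5).collect()
scala> Newdata.filter(x => x != 4).collect()
scala> val pairs = sc.parallelize(Seq(("key", 6), ("key", 7), ("val", 9), ("val", 2)))
scala> pairs.groupByKey().collect()
scala> val letters = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 20))
scala> val Reduce = letters.reduceByKey(_ + _)
scala> Reduce.collect()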
Word Count using SPARK
Steps to execute Spark word count:

1. Create a text file on your local machine and write some text into it.

2. Create a directory in HDFS where the text file will be kept.

3. Upload the Input.txt file to HDFS in that specific directory.


4. Now, follow the below command to open the spark in Scala mode.

5. Let's create an RDD by using the following command.

6. Now, we can read the generated result by using the following command

7. Here, we split the existing data in the form of individual words by using the following
command

Now, we can read the generated result by using the following command.
8. Now, perform the map operation.

9. Now, perform the reduce operation

Here, we got the desired output.
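A sketch of steps 4 to 9 in the spark-shell, assuming the input file was uploaded to /spark_wc/Input.txt in HDFS (the path is illustrative):

scala> val data = sc.textFile("/spark_wc/Input.txt")
scala> data.collect()
scala> val splitdata = data.flatMap(line => line.split(" "))
scala> splitdata.collect()
scala> val mapdata = splitdata.map(word => (word, 1))
scala> val reducedata = mapdata.reduceByKey(_ + _)
scala> reducedata.collect()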


Experiment – 08 Apache Flume

Install Apache Flume on Windows 10

First of all, download the latest version of Apache Flume software from the website
https://flume.apache.org/.

Step 1
Open the website. Click on the download link on the left-hand side of the home page. It will
take you to the download page of Apache Flume.

Step 2
In the Download page, you can see the links for binary and source files of Apache Flume. Click
on the link apache-flume-1.9.0-bin.tar.gz
You will be redirected to a list of mirrors where you can start your download by clicking any
of these mirrors.

Step 3
Extract the zip file and move it to the C:\apache-flume-1.9.0-bin directory.

Step 4
Set Path and Classpath for Flume:
FLUME_HOME=C:\apache-flume-1.9.0-bin
FLUME_CONF=%FLUME_HOME%\conf
CLASSPATH=%FLUME_HOME%\lib\*
PATH=C:\apache-flume-1.9.0-bin\bin

Step 5
Now, go to the Flume folder; in it you will find the folder named conf. Open it, then open the file log4j.properties with a text editor and make the following changes:
flume.root.logger=DEBUG,console
#flume.root.logger=INFO,LOGFILE
Step 6
Copy the file flume-conf.properties.template, flume-env.ps1.template and rename them
to flume-conf.properties, flume-env.ps1 respectively.

To verify the installation, open a command prompt and run:

flume-ng help

To check the version of Flume, run the following command:

flume-ng version
Apache Flume - Fetching Twitter Data

We will create a Twitter application and get the tweets from it using the experimental Twitter source provided by Apache Flume. We will use the memory channel to buffer these tweets and an HDFS sink to push them into HDFS.

1. Creating a Twitter Application


In order to get the tweets from Twitter, we need to create a Twitter application. Follow the steps given below to create one.

Step 1
To create a Twitter application, click on the following link: https://apps.twitter.com/. Sign in to your Twitter account. You will have a Twitter Application Management window where you can create, delete, and manage Twitter Apps.

Apply for a developer account by filling in the basic information in the forms. Once you get the developer account, you may proceed to create an application on the developer portal.
Step 2
Click on the Create App button.

You will be redirected to a window where you will get an application form in which you have
to fill in your details in order to create the App.

Generate Consumer Keys and Access Token on this page. Remember them for future use.

2. Starting HDFS
Since we are storing the data in HDFS, we need to install / verify Hadoop. Start Hadoop and
create a folder in it to store Flume data. Follow the steps given below before configuring Flume.
Run the following command in the terminal:

start-all
Create a Directory in HDFS:

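A sketch of the command; the directory name matches the HDFS path used in the configuration file below:

hdfs dfs -mkdir -p /Hadoop/twitter_data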
3. Configuring Flume
We have to configure the source, the channel, and the sink using the configuration file in the
conf folder.
Setting the classpath:
Set the classpath variable to the lib folder of Flume in the flume-env file as shown below.
export CLASSPATH=C:\flume\apache-flume-1.9.0-bin\lib\*
Example – Configuration File
Given below is an example of the configuration file. Copy this content and save as twitter.conf
in the conf folder of Flume.

# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = XXXXXX
TwitterAgent.sources.Twitter.consumerSecret = XXXXXX
TwitterAgent.sources.Twitter.accessToken = XXXXXX
TwitterAgent.sources.Twitter.accessTokenSecret = XXXXXX
TwitterAgent.sources.Twitter.keywords = tutorials,bigdata

# Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/Hadoop/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
4. Execution
Browse to the Flume home directory and execute the application as shown below. Run the following command in the terminal from the Flume directory to stream the Twitter data on Windows:
flume-ng agent --conf conf --conf-file ./conf/twitter.conf --name TwitterAgent -property "flume.root.logger=DEBUG,console"
If everything goes fine, the streaming of tweets into HDFS will start. Given below is the
snapshot of the command prompt window while fetching tweets.

Verify the tweets on the HDFS file system.
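The ingested data can also be checked from the command line; a sketch is given below, where the directory is the one configured above and FlumeData is assumed to be the default file prefix of the HDFS sink:

hdfs dfs -ls /Hadoop/twitter_data
hdfs dfs -cat /Hadoop/twitter_data/FlumeData.*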
Experiment – 09 Apache Sqoop

Install Apache Sqoop on Windows 10


 Download Sqoop 1.4.7 from the given link:
http://www.apache.org/dyn/closer.lua/sqoop/1.4.7

Extract the file


Setup environment variables
Set up the SQOOP_HOME environment variable and also add its bin subfolder to the PATH variable (see the example below).
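For example (the extraction path C:\sqoop-1.4.7 is an assumption; use the folder into which you extracted Sqoop):

SQOOP_HOME=C:\sqoop-1.4.7
PATH=%PATH%;%SQOOP_HOME%\bin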
NOTE:
Download the MySQL Connector (mysql-connector-java-8.0.21.jar) from
https://downloads.mysql.com/archives/c-j/
Copy the downloaded mysql-connector jar file and paste it into the sqoop/lib folder.

Print out the version using the following command:

sqoop version

Now, the installation is successful.


Working with Sqoop
Step 1: Open the MySQL database.

Step 2: Open cmd with administrator privileges.
To check the list of databases available in the database environment:

C:\WINDOWS\system32>sqoop list-databases --connect jdbc:mysql://localhost/ --password arbaz --username root

The result here is: mysql, information_schema, performance_schema, sys, sakila, world, employeedb.

To check the number of tables that exist in a particular database:

C:\WINDOWS\system32>sqoop list-tables --connect jdbc:mysql://localhost/world --password arbaz --username root
Import: To import a particular table from the database into HDFS, use the following command.

C:\WINDOWS\system32>sqoop import --connect jdbc:mysql://localhost/world --password arbaz --username root --table city

I am using the world database and importing the table city, which contains 4079 records.
Check on the HDFS server: http://localhost:9870/explorer.html
 Here we can see that the table has been imported using 4 mappers.
 So the data will be divided equally into 4 parts.

After downloading the file, this is how it looks:


 To define the number of mappers:

c:\> sqoop import --connect jdbc:mysql://localhost/world --password arbaz --username root --table city --m 3 --where "id>100" --target-dir /city0

 The import is successful. Now check in the HDFS system:
 http://localhost:9870/explorer.html#/

File information for the downloaded part file part-m-00000, as shown in the HDFS file browser: Block 0, Block ID 1073741913, Generation Stamp 1089, size 49452 bytes.
The head of the file shows the imported city-table rows as comma-separated values, for example rows for Argentinian cities such as Godoy Cruz, Posadas, Formosa and La Rioja.
Export: To export, first create an empty table with the same schema (the same columns and definitions) as the imported table.
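A minimal MySQL sketch of this step; the table name city1 matches the export command below and is created with the same structure as the imported city table:

USE world;
-- empty table with the same columns and types as city
CREATE TABLE city1 LIKE city;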
Now it is time to export your table from HDFS back to MySQL. Use the given command:

C:\Windows\system32>sqoop export --connect jdbc:mysql://localhost/world --password arbaz --username root --table city1 --export-dir /city0
- - - - - - - - - - - - - - - - - - - -THANK YOU- - - - - - - - - - - - - - - - - - - - - -
