32013105-BDA Lab Manual
Work
Submitted To
Dr. Minakshi Sharma
Submitted By
Arbaz Khan
(32013105)
Computer Engineering
2020-21
2. Hadoop 22.1.2021
(i) To setup and install Hadoop in all three modes.
(ii) Implement the following file management tasks
in Hadoop: Adding Files and Directories,
Retrieving Files, Deleting Files
3. Map Reduce 24.1.2021
(i) To run a basic Hello World and Word Count Map
Reduce program to understand Map Reduce
Paradigm.
4. PIG 01.2.2021
(i) To install and run Apache Pig in Windows so as
to work with Hadoop.
(ii) Exploring various shell commands in PIG.
5. HIVE 04.2.2021
(i) To install and run HIVE in Windows.
(ii) To explore Hive with its basic commands: create,
alter, and drop databases, tables, views, functions
and indexes.
6. HBase 28.02.2021
(i) To install and run HBase in Windows.
(ii) Exploring various commands in HBase.
7. Spark 08.03.2021
(i) To install and run Spark in Windows.
(ii) Exploring various commands in Spark.
8. Flume 10.03.2021
(i) To install and run Flume in Windows.
(ii) Exploring various commands in Flume.
9. Sqoop 12.04.2021
(i) To install and run Sqoop in Windows.
(ii) Exploring various commands in Sqoop.
CHAPTER 1: INTRODUCTION TO WEKA
What is WEKA?
KDD Process:
Installation of WEKA:
Depending on the version, click on the download option; the WEKA setup then gets downloaded. Run the setup and follow the installation wizard through its steps (the screenshots for Steps 1-5 are omitted here). Once installation finishes, launch WEKA and select the Explorer option.
After selecting the Explorer option the program starts and provides the user with a
separate graphical interface.
The figure shows the opening screen with the available options. At first, only the Preprocess tab in the top left corner can be selected; this is because a data set must first be supplied to the application before it can be manipulated. After the data has been pre-processed, the other tabs become active for use.
There are six tabs:
1. Preprocess- used to choose the data file and apply filters to prepare it
2. Classify- used to apply classification and regression algorithms to the data
3. Cluster- used to group the data into clusters of similar instances
4. Association- used to apply different rules to the data file that identify
association within the data
5. Select attributes- used to apply different rules to reveal changes based on
selected attributes inclusion or exclusion from the experiment
6. Visualize- used to see what the various manipulation produced on the data set
in a 2D format, in scatter plot and bar graph output.
Pre-processing:
In order to experiment with the application, the data set needs to be presented to
WEKA in a format that the program understands. There are rules for the type of data
that WEKA will accept. There are three options for presenting data into the program.
Open File- allows for the user to select files residing on the local machine or
recorded medium.
Open URL- provides a mechanism to locate a file or data source from a
different location specified by the user.
Open Database- allows the user to retrieve files or data from a database
source provided by the user.
1. Available Memory, which displays in the log and in the "Status" box the amount of
memory available to WEKA in bytes.
2. Run garbage collector, which forces the Java garbage collector to search for memory that
is no longer used, free this memory up, and make it available for new tasks.
Loading data:
Once the data is loaded, WEKA recognizes the attributes, which are shown in the 'Attributes' window.
The left panel of the 'Preprocess' window shows the list of recognized attributes:
No: a number that identifies the order of the attributes as they appear in the data file.
Selection tick boxes: allow you to select the attributes for the working relation.
Once the data is loaded into WEKA, changes can be made to the attributes by clicking
the Edit button shown above.
To make a change, double-click on the attribute value and update the details as
required. The operations that can be performed through Edit are as follows:
1. Delete the attribute
2. Replace the attribute value
3. Set all values
4. Set missing values etc.
Setting Filters
Pre-processing tools in WEKA are called "filters". WEKA contains filters for
discretization, normalization, resampling, attribute selection, transformation and
combination of attributes. Some techniques, such as association rule mining, can only
be performed on categorical data. This requires performing discretization on numeric
or continuous attributes.
Using filters, you can convert numeric values into nominal (discrete) values.
CHAPTER 1.3: CLASSIFIERS
1.3.1 Building classifiers
1.3.2 Setting Test Options
Building “Classifiers”:
Classifiers in WEKA are the models for predicting nominal or numeric quantities. The
learning schemes available in WEKA include decision trees and lists, instance-based
classifiers, support vector machines, multi-layer perceptron, logistic regression, and
Bayes nets. Meta-classifiers include bagging, boosting, stacking, error-correcting
output codes, and locally weighted learning.
Once you have your data set loaded, all the tabs are available to you. Click on the
“Classify” tab.
The "Classify" window comes up on the screen, and you can now start analysing the data using
the provided algorithms. In this exercise you will analyse the loaded data set.
Setting Test Options:
Before you run the classification algorithm, you need to set the test options. Set the test options
in the "Test options" box. The available test options are:
1. Use training set: Evaluates the classifier on how well it predicts the class of the
instances it was trained on
2. Supplied test set: Evaluates the classifier on how well it predicts the class of a
set of instances loaded from a file. Clicking on the “Set…” button brings up a
dialog allowing you to choose the file to test on.
3. Cross-validation. Evaluates the classifier by cross-validation, using the number
of folds that are entered in the “Folds” text field.
In the "Classifier evaluation options", make sure that the following options are
checked:
1. Output model. The output is the classification model on the full training set, so
that it can be viewed, visualized, etc.
2. Output per-class stats. The precision/recall and true/false statistics are output for
each class.
5. Set “Random seed for Xval / % Split” to 1. This specifies the random seed
used when randomizing the data before it is divided up for evaluation purposes.
Once the options have been specified, you can run the classification algorithm. Click
on the "Start" button to start the learning process. You can stop the learning process at any
time by clicking on the "Stop" button. When training is complete, the "Classifier output"
area on the right panel of the "Classify" window is filled with text describing the
results of training and testing. A new entry appears in the "Result list" box on the left
panel of the "Classify" window.
(Screenshots: the Classify tab showing the test options — Use training set, Supplied test set, Cross-validation with 10 folds, Percentage split — the per-class accuracy table (TP Rate, FP Rate, Precision, Recall, F-Measure, MCC, ROC Area, PRC Area), the WEKA Classifier Tree Visualizer for the trees.J48 model built on the xAPI-Edu-Data data set, and Result list entries for LMT, trees.RandomTree, trees.RandomForest and trees.HoeffdingTree.)
CHAPTER 1.4: CLUSTERING
1.4.1 Clustering Data
1.4.2 Choosing Clustering Scheme
1.4.3 Setting Test Options
1.4.4 Visualization of results
Clustering Data: WEKA contains "clusterers" for finding groups of similar instances
in a dataset. The clustering schemes available in WEKA are k-Means, EM, Cobweb,
X-means, and FarthestFirst. Clusters can be visualized and compared to "true"
clusters (if given). Evaluation is based on log likelihood if the clustering scheme
produces a probability distribution.
Representation in Dendrogram
CHAPTER 1.6: ATTRIBUTE SELECTION
1.6.1 Introduction
1.6.2 Selecting Options
Introduction:
Attribute selection searches through all possible combinations of attributes in the data
and finds which subset of attributes works best for prediction. Attribute selection
methods contain two parts: a search method such as best-first, forward selection,
random, exhaustive, genetic algorithm, ranking, and an evaluation method such as
correlation-based, wrapper, information gain, chi-squared. Attribute selection
mechanism is very flexible - WEKA allows (almost) arbitrary combinations of the
two methods.
To begin an attribute selection, click Select attributes tab.
Selecting Options
To search through all possible combinations of attributes in the data and find which
subset of attributes works best for prediction, make sure that you set up attribute
evaluator to CfsSubsetEval and a search method to BestFirst. The evaluator will
determine what method to use to assign a worth to each subset of attributes. The
search method will determine what style of search to perform.
The options that you can set for selection in the Attribute Selection Mode box are:
1. Use full training set. The worth of the attribute subset is determined using the
full set of training data.
2. Cross-validation. The worth of the attribute subset is determined by a process
of cross validation.
The Fold and Seed fields set the number of folds to use and the random seed used
when shuffling the data.
Specify which attribute to treat as the class in the drop-down box below the test
options. Once all the test options are set, you can start the attribute selection process
by clicking on Start button.
CHAPTER 1.7: DATA VISUALIZATION
1.7.1 Introduction
1.7.2 Changing the view
1.7.3 Selecting instances
Introduction:
WEKA visualization allows you to visualize a 2-D plot of the current working
relation. Visualization is very useful in practice; it helps to determine the difficulty of the
learning problem.
WEKA can visualize single attributes (1-d) and pairs of attributes (2-d), and rotate 3-d
visualizations (Xgobi-style). WEKA has a Jitter option to deal with nominal attributes and to detect
"hidden" data points.
Select a square that corresponds to the attributes you would like to visualize. For
example, let's choose petalwidth for the X-axis and sepallength for the Y-axis. Click
anywhere inside the square.
A Visualizing window appears on the screen
Changing the View
In the visualization window, beneath the X-axis selector there is a drop-down list,
Colour, for choosing the color scheme. This allows you to choose the color of points
based on the attribute selected. Below the plot area there is a legend that describes
what values the colors correspond to. In this example, red represents 'no', while blue
represents 'yes'. For better visibility you should change the color of the label 'yes': left-
click on 'yes' in the Class colour box and select a lighter color from the color palette. To
the right of the plot area there is a series of horizontal strips. Each strip represents an
attribute, and the dots within it show the distribution of values of the attribute. You can
choose which axes are used in the main graph by clicking on these strips (left-click
changes the X-axis, right-click changes the Y-axis). The software sets the X-axis to the petalwidth
attribute and the Y-axis to sepallength. The instances are spread out in the plot area and
concentration points are not visible. Keep sliding Jitter, a random displacement given
to all points in the plot, to the right until you can spot concentration points.
Selecting Instances:
Sometimes it is helpful to select a subset of the data using the visualization tool. A
special case is the User Classifier, which lets you build your own classifier by
interactively selecting instances. Below the Y-axis there is a drop-down list that
allows you to choose a selection method. A group of points on the graph can be
selected in four ways:
1. Select Instance: Click on an individual data point. It brings up a window listing
the attributes of the point. If more than one point appears at the same location, more
than one set of attributes is shown.
2. Rectangle: You can create a rectangle by dragging it around the points.
3. Polygon: You can select several points by building a free-form
polygon. Left-click on the graph to add vertices to the polygon and
right-click to complete it.
4. Polyline: You can build a free-form polyline that divides the plot into two
regions. Left-click to add vertices and right-click to finish.
You can install Hadoop on your own system as well, which is a feasible way to learn Hadoop.
We will be installing a single-node pseudo-distributed Hadoop cluster on Windows 10.
Prerequisite: To install Hadoop, you should have Java version 1.8 in your system. Check your
java version through this command on command prompt:
java -version
After downloading java version 1.8, download hadoop version 3.0 and extract it to a folder.
Create a new user variable with the variable name HADOOP_HOME and, as its value, the path of the
Hadoop bin folder. Likewise, create a new user variable with the variable name JAVA_HOME and, as its
value, the path of the bin folder in the Java directory.
Now we need to set Hadoop bin directory and Java bin directory path in system variable path.
Edit Path in system variable
Click on New and add the bin directory path of Hadoop and Java in it.
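For reference, the same user variables can also be created from an elevated Command Prompt with setx; the paths below are illustrative assumptions and must match the values described above (the bin directories are still added to Path through the Environment Variables dialog):
setx HADOOP_HOME "C:\hadoop\bin"
setx JAVA_HOME "C:\Java\bin"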
Configurations:
Now we need to edit some files located in the hadoop directory of the etc folder where we
installed hadoop. The files that need to be edited have been highlighted.
1. Core site configuration
Now, we should configure the name node URL adding the following XML code into the
<configuration></configuration> element within “core-site.xml”:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9820</value>
</property>
2. Map Reduce site configuration
Now, we should add the following XML code into the <configuration></configuration>
element within “mapred-site.xml”: <property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>MapReduce framework name</description>
</property>
Create a 'data' folder inside the Hadoop directory, and inside it create a folder named 'datanode' and a
folder named 'namenode' (these paths are referenced from hdfs-site.xml, as sketched below).
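The corresponding hdfs-site.xml entries could look like the following sketch, assuming Hadoop is extracted to C:\hadoop and the folders above were created as C:\hadoop\data\namenode and C:\hadoop\data\datanode (adjust the paths to your system); they go inside the <configuration></configuration> element of hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop\data\datanode</value>
</property>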
6. Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of the java folder
where your jdk 1.8 is installed.
Hadoop needs Windows-specific binaries which do not come with the default download of
Hadoop. To include those files, replace the bin folder in the Hadoop directory with the bin folder
provided in this GitHub link.
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as a zip file, extract it and copy the bin folder in it. If you want to keep the old bin
folder, rename it to bin_old and paste the copied bin folder into that directory.
Check whether hadoop is successfully installed by running this command on cmd:
hadoop version
Since it doesn't throw an error and successfully shows the Hadoop version, Hadoop is
successfully installed on the system.
To start the Hadoop daemons, run the following from the sbin directory (or simply start-all.cmd):
start-dfs.cmd
start-yarn.cmd
Hadoop Version: Prints the Hadoop Version
Put Command: To copy files/folders from local file system to hdfs store.
ls Command: This command is used to list all the files. It will print all the directories
present in HDFS.
copyFromLocal Command: To copy files/folders from local file system to hdfs store. It is
similar to put command.
rmr Command: This command deletes a file or directory from HDFS recursively. It is a very useful
command when you want to delete a non-empty directory.
HDFS to Local:
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
help Command: HDFS Command that displays help for given command or all commands if
none is specified.
help ls Command: If we want help regarding any particular command, then we can use this
command.
du Command: HDFS Command to check the file size.
touchz Command: HDFS Command to create a file in HDFS with file size 0 bytes.
stat Command: It will give the last modified time of directory or path. In short it will give
stats of the directory or file.
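Illustrative forms of the commands described above (all paths and file names here are assumptions):
hadoop version
hdfs dfs -put C:\sample.txt /user/data/
hdfs dfs -ls /user/data
hdfs dfs -copyFromLocal C:\sample.txt /user/data/
hdfs dfs -rmr /user/data/olddir        (deprecated form; hdfs dfs -rm -r /user/data/olddir also works)
hdfs dfs -copyToLocal /user/data/sample.txt C:\local\
hdfs dfs -get /user/data/sample.txt C:\local\
hdfs dfs -help
hdfs dfs -help ls
hdfs dfs -du /user/data/sample.txt
hdfs dfs -touchz /user/data/empty.txt
hdfs dfs -stat /user/data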
Experiment – 03 MapReduce
How to run hello world program in Hadoop (Windows)
Write a basic Java hello world program using any text editor and save the file as HelloWorld.java.
Compile this Java file using the command prompt; after that, the .class file will be created.
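A minimal sketch of such a program (it simply prints a message) and the compilation command:
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}
javac HelloWorld.java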
You can create a manifest file in any text editor, and you can give your manifest file any name. Here the file
name is m1.txt
Now, to convert this .class file into a .jar, write the following command:
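A sketch of the command, assuming the manifest m1.txt contains the single line "Main-Class: HelloWorld" followed by a newline:
jar cfm HelloWorld.jar m1.txt *.class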
cfm means "create a jar, specify the output jar file name, specify the manifest file name." This
is followed by the name you wish to give to your jar file, the name of your manifest file, and
the list of .class files that you want included in the jar. *.class means all class files in the current
directory.
Now, in order to run the jar file using Hadoop, start your Hadoop cluster and write the
command to run the jar file in Hadoop, specifying the jar file location in the local file system
and the name of the class ('HelloWorld' in this example) which is to be called.
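For example, assuming the jar was saved as C:\HelloWorld.jar:
hadoop jar C:\HelloWorld.jar HelloWorld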
Run WordCount Program in Hadoop (Windows)
Step 2: Compile the .java file and convert the output files into a .jar (or you can simply download
the .jar from the following link https://github.com/MuhammadBilalYar/HADOOP-
INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar )
Step 3: Make sure that your hadoop is working fine by running the following commands.
● Open cmd as administrator.
● Change directory to /hadoop/sbin and start cluster
● start-all.cmd OR start-yarn.cmd & start-dfs.cmd
(Check whether your cluster is shutting down by itself; if so, there might be some issue with
the Hadoop installation.)
Step 4: Create a directory in hdfs, where we will save our input file.
Step 5: Create a .txt file having some textual content to apply word count with the name
“file.txt”
Step 6: Copy the input file to the hadoop file system in the directory which we just created.
This will copy the file.txt file from the local file system to the hadoop file system in the
specified directory i.e. /input/.
Step 8: You can also verify the content of the file by specifying the path of the file in the
hadoop file system.
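A consolidated sketch of the commands for Steps 4-8, including running the jar; the directory names, the local file path, the jar location and the driver name 'wordcount' are assumptions based on the linked repository:
hdfs dfs -mkdir /input
hdfs dfs -put C:\file.txt /input/
hdfs dfs -ls /input
hdfs dfs -cat /input/file.txt
hadoop jar C:\MapReduceClient.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000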
1. Prerequisites
1.1. Hadoop Cluster Installation
If you are looking for the latest version, navigate to “latest” directory.
1. Download the pig-version.tar.gz
2. Now after downloading, extract pig anywhere. In this case I have extracted pig in
Hadoop home directory i.e. C:\hadoop\pig
3. Set environment variable in system settings
3.3 First add a new User Variable i.e. PIG_HOME as Pig extracted folder path
(Screenshot: the Environment Variables dialog while adding the new user variable, with existing user variables such as HADOOP_HOME = C:\hadoop\bin and JAVA_HOME = C:\Java\bin, the Path variable, and the system variables list.)
pig -version
The simplest way to write Pig Latin statements is using the Grunt shell, which is an interactive tool
where we write a statement and get the desired output. There are two modes to invoke the Grunt
shell:
1. Local: All scripts are executed on a single machine without requiring Hadoop.
(command: pig -x local)
2. MapReduce: Scripts are executed on the Hadoop cluster, reading and writing data from HDFS.
(command: pig -x mapreduce)
The load statement will simply load the data into the specified relation in Apache Pig. To verify
the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin
provides four different types of diagnostic operators −
1. Dump operator
2. Describe operator
3. Explanation operator
4. Illustration operator
1. Dump Operator: This command is used to display the results on screen. It usually helps
in debugging.
grunt> Dump Relation_Name;
2. Describe Operator: The describe operator is used to view the schema of a relation.
Syntax:
The syntax of the describe operator is as follows –
grunt> Describe Relation_name;
3. Explain Operator: The explain operator is used to display the logical, physical, and
MapReduce execution plans of a relation.
grunt> explain Relation_name;
4. Illustrate Operator: The illustrate operator gives you the step-by-step execution of a
sequence of statements.
grunt> illustrate Relation_name;
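For illustration, a relation can first be loaded and then inspected with these operators; the HDFS path and the schema below are assumptions, and the same student relation is reused in the examples that follow:
grunt> student = LOAD 'hdfs://localhost:9820/pig_data/student.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, city:chararray);
grunt> Dump student;
grunt> Describe student;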
Grouping & Joining
1. Group Operator: The GROUP operator is used to group the data in one or more relations.
It collects the data having the same key.
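A sketch of the grouping, using the student relation assumed above:
grunt> group_data = GROUP student BY city;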
Verification:
Verify the relation group_data using the DUMP operator as shown below:
grunt> Dump group_data;
You can see the schema of the table after grouping the data using the describe command as
shown below:
In the same way, you can get the sample illustration of the schema using
the illustrate command as shown below:
2. Join Operator
The JOIN operator is used to combine records from two or more relations. While performing
a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these
keys match, the two particular tuples are matched, else the records are dropped. Joins can be
of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
Self - join
Self-join is used to join a table with itself as if the table were two relations, temporarily
renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times,
under different aliases (names). Therefore, let us load the contents of the file customer.txt as
two tables as shown below.
Syntax
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
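A sketch of the self-join described above, assuming customer.txt has id and name columns:
grunt> customers1 = LOAD 'hdfs://localhost:9820/pig_data/customer.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> customers2 = LOAD 'hdfs://localhost:9820/pig_data/customer.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;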
Verification
Verify the relation customers3 using the DUMP operator as shown below.
Output
Combining
Union Operator:
The UNION operator of Pig Latin is used to merge the content of two relations. To perform
UNION operation on two relations, their columns and domains must be identical.
Syntax:
Given below is the syntax of the UNION operator:
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Assume that we have two files namely student.txt and student1.txt in the directory of HDFS
as shown below.
student.txt
student1.txt
And we have loaded these two files into Pig with the relations student and student1 as shown
below.
Let us now merge the contents of these two relations using the UNION operator as shown
below.
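A sketch of the merge, assuming both files share the student schema used earlier:
grunt> student = LOAD 'hdfs://localhost:9820/pig_data/student.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, city:chararray);
grunt> student1 = LOAD 'hdfs://localhost:9820/pig_data/student1.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, city:chararray);
grunt> student3 = UNION student, student1;
grunt> Dump student3;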
Output
Filtering
1. Distinct Operator
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
grunt> Relation_name2 = DISTINCT Relation_name1;
Assume that we have a file named student.txt in the HDFS directory as shown below and we
have loaded this file into Pig with the relation name student as shown below.
Let us now remove the redundant (duplicate) tuples from the relation named student using
the DISTINCT operator, and store it as another relation named distinct_data as shown below.
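A sketch, reusing the student schema assumed earlier:
grunt> student = LOAD 'hdfs://localhost:9820/pig_data/student.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, city:chararray);
grunt> distinct_data = DISTINCT student;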
Verification
Output
2. Foreach Operator: The FOREACH operator is used to generate specified data
transformations based on the column data.
Syntax:
Given below is the syntax of FOREACH operator.
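grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);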
we have loaded this file into Pig with the relation name.
Let us now get the id, firstname, and city values of each student from the relation student and
store it into another relation named foreach_data using the foreach operator as shown below.
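A sketch, assuming the student schema used earlier:
grunt> foreach_data = FOREACH student GENERATE id, firstname, city;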
Verify the relation foreach_data using the DUMP operator as shown below.
Output
Sorting
Order By:
The ORDER BY operator is used to display the contents of a relation in a sorted order based
on one or more fields.
Syntax:
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);
Let us now sort the relation in an ascending order based on the firstname of the student and
store it into another relation named order_by_data using the ORDER BY operator as shown
below. Also verify the relation order_by_data using the DUMP operator.
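A sketch, assuming the student schema used earlier:
grunt> order_by_data = ORDER student BY firstname ASC;
grunt> Dump order_by_data;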
Output
Eval Functions
CONCAT(): The CONCAT() function of Pig Latin is used to concatenate two or more
expressions of the same type.
Syntax:
grunt> CONCAT (expression, expression, [...expression]);
student.txt
we have loaded this file into Pig with the relation name student as shown below.
In the above schema, you can observe that the name of the student is represented using two
chararray values namely firstname and lastname. Let us concatenate these two values using
the CONCAT() function.
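A sketch, assuming the firstname and lastname fields of the student relation used earlier:
grunt> student_name_concat = FOREACH student GENERATE CONCAT(firstname, lastname);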
Verify the relation student_name_concat using the DUMP operator as shown below.
Output
Word Count in Pig Latin
1. Here I have taken some sample text in which we will find different words
and no. of times they appear in our text.
Save this text as wc.txt anywhere on your local drive. In my case I have saved
this as wc.txt on D:\ location.
2. Now we need to put this txt file on our HDFS. To do that I have created a
new folder named pig_wc using this command –
hdfs dfs -mkdir /pig_wc
3. Now put the above created .txt file to hdfs using this command
hadoop fs -put D:/wc.txt /pig_wc
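4. A sketch of the word-count script itself, run from the Grunt shell (the relation names are assumptions):
grunt> lines = LOAD '/pig_wc/wc.txt' USING TextLoader() AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = GROUP words BY word;
grunt> wordcount = FOREACH grouped GENERATE group, COUNT(words);
grunt> Dump wordcount;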
1.3. Cygwin
Since some Hive 3.1.2 tools (such as the schema tool) are not compatible with Windows, we will
need the Cygwin tool to run some Linux commands.
Now, we should edit the Path user variable to add the following paths:
%HIVE_BIN%
%DERBY_HOME%\bin
4. Configuring Hive
4.1. Copying Derby libraries
We should copy the Derby library JAR files (from %DERBY_HOME%\lib) and paste them within the Hive libraries directory (C:\hadoop\hive\lib).
4.2. Configuring hive-site.xml
Now, we should go to the Apache Hive configuration directory (C:\hadoop\hive\conf), create a
new file "hive-site.xml", and paste the following XML code within this file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<description>Enable user impersonation for HiveServer2</description>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
<description>Client authentication types. NONE: no authentication check; LDAP: LDAP/AD
based authentication; KERBEROS: Kerberos/GSSAPI authentication; CUSTOM: custom
authentication provider (use with property hive.server2.custom.authentication.class)</description>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>True</value>
</property>
</configuration>
5. Starting Services
Now, let us open a command prompt, go to the Hive binaries directory
(C:\hadoop\hive\bin) and execute the following command:
hive
Start Hive
Create Table
Describe Table
Insert values into table
The INSERT command is used to load data into a Hive table. Inserts can be done to a table or a
partition.
INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
INSERT INTO is used to append the data into existing data in a table.
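Since the corresponding screenshots are not reproduced here, hedged HiveQL examples of these operations follow; the table names and columns (student, student_staging) are assumptions:
CREATE TABLE student (id INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DESCRIBE student;
INSERT INTO TABLE student VALUES (1, 'Arbaz', 8.5);
INSERT OVERWRITE TABLE student SELECT id, name, gpa FROM student_staging;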
Pre-Requisite:
We are going to make a standalone setup of HBase in our machine which requires:
Java JDK 1.8
HBase - Apache HBase
Step 2: Create the two folders shown below inside the HBase root folder, for the HBase data and the ZooKeeper data:
C:/hbase-2.3.4/hbase
C:/hbase-2.3.4/zookeeper
Step 3:
Open C:/hbase-2.3.4/bin/hbase.cmd. Search for below given lines and
remove %HEAP_SETTINGS% from that line.
Step 4:
Open C:/hbase-2.3.4/conf/hbase-env.cmd. Add the below lines to the file:
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-
XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true
Step 5:
Open C:/hbase-2.3.4/conf/hbase-site.xml. Add the below lines inside the <configuration> tag (the paths
must point to the folders created in Step 2).
<property>
<name>hbase.rootdir</name>
<value>file:///C:/hbase-2.3.4/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/C:/hbase-2.3.4/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
Step 6:
Setup the Environment variable for HBASE_HOME and add bin to the path variable.
Now we are all set to run HBase. To start HBase, execute the command below from the bin
folder.
Open Command Prompt and cd to HBase's bin directory.
Run start-hbase.cmd
Test the installation using HBase shell
HBASE Commands
General commands
Run the hbase shell.
status - Provides the status of HBase, for example, the number of servers.
is_enabled command: Checks if the table is enabled or not and returns true if
YES.
drop command: Drops a table from HBase. But in order to drop a table, we need
to disable it first.
We can see if our table has been deleted or not by using list command.
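A hedged HBase shell session illustrating these commands could look like the following; the table name 'student' and column family 'details' are assumptions, and a table must be disabled before it can be dropped:
create 'student', 'details'
list
status
is_enabled 'student'
disable 'student'
drop 'student'
list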
1. Prerequisites
1.1. Install Java 8
1.2.Install Python
7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window
SPARK Commands
The basic data structure of Spark is called an RDD (Resilient Distributed Datasets) which
contains an immutable collection of objects for distributed computing of records. All the
datasets of RDD are partitioned logically across multiple nodes of a cluster.
An RDD can be created by reading data from a file system, by parallelizing an existing collection in the
driver program, or by transforming an existing RDD.
a) To create a new RDD we use the following command:
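A minimal sketch in the Scala shell; the dataset name Newdata (reused in the later examples) and the file path are assumptions:
scala> val Newdata = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val FileData = sc.textFile("C:/data.txt")   // creating an RDD from a file instead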
1. Actions: It is used to perform certain required operations on the existing datasets. Following
are a few of the commands which can be used to perform the below actions on the created
datasets:
a) count() function to count the number of elements in RDD:
f) saveAsTextFile(path) function saves the dataset in the specified path of hdfs location
g) partitions.length function can be used to find the number of partitions in the RDD
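Hedged examples of these actions on the Newdata RDD sketched above (the HDFS output path is an assumption):
scala> Newdata.count()                       // number of elements in the RDD
scala> Newdata.saveAsTextFile("/spark_out")  // writes the RDD to the given path
scala> Newdata.partitions.length             // number of partitions of the RDD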
2. RDD Transformations
Transformation is used to form a new RDD from the existing ones. Since the inputs of the RDD
are immutable, the result formed upon transformation can be one or more RDD as output.
There are two types of transformations:
2.1.Narrow Transformations
2.2.Wide Transformations
2.1. Narrow Transformations – Each parent RDD is divided into various partitions and
among these only one partition will be used by the child RDD.
map() and filter() are the two basic kinds of narrow transformations; like all transformations they
are evaluated lazily and executed only when an action is called.
map(func) function operates on each of the elements in the dataset “value” iteratively to
produce the output RDD.
In this example, we are adding the value 5 to each of the elements of the dataset “Newdata” and
displaying the transformed output with the help of collect function.
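A sketch of this map example, using the Newdata RDD assumed earlier:
scala> val MapData = Newdata.map(x => x + 5)
scala> MapData.collect()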
filter(func) function is basically used to filter out the elements satisfying a particular
condition specified using the function.
In this example, we are trying to retrieve all the elements except number 4 of the dataset
“Newdata” and fetching the output via the collect function.
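A sketch of this filter example, using the Newdata RDD assumed earlier:
scala> val FilterData = Newdata.filter(x => x != 4)
scala> FilterData.collect()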
2.2. Wide Transformations – A single parent RDD partition is shared upon its various
multiple child RDD partitions.
groupByKey and reduceByKey are examples of wide transformations.
groupbyKey function groups the dataset values into key-value pairs according to the key
values from another RDD. This process involves shuffling to take place when the group
by function collects the data associated with a particular key and stores them in a single
key-value pair.
Example: In this example, we are assigning the integers 6,7 to the string value “key” and
integers 9,2 assigned to “val” which are displayed in the same key-value pair format in the
output.
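A sketch of this groupByKey example (the pair values follow the description above):
scala> val pairs = sc.parallelize(Seq(("key", 6), ("key", 7), ("val", 9), ("val", 2)))
scala> pairs.groupByKey().collect()   // groups the values by key, e.g. key -> [6, 7]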
The reduceByKey function also combines key-value pairs from different RDDs. It
combines the keys and their respective values into a single element after performing the
mentioned transformation.
Example: In this example, the common keys of the array “letters” are first parallelized by the
function and each letter is mapped with count 20 to it. The reduceByKey will add the values
having similar keys and saves in the variable Reduce. The output is then displayed using the
collect function.
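A sketch of this reduceByKey example; the contents of the letters array are assumptions:
scala> val letters = sc.parallelize(Array("a", "b", "a", "c"))
scala> val Reduce = letters.map(l => (l, 20)).reduceByKey(_ + _)
scala> Reduce.collect()   // adds the values of entries having the same key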
Word Count using SPARK
Steps to execute Spark word count:
1. Create a text file in your local machine and write some text into it.
6. Now, we can read the generated result by using the following command
7. Here, we split the existing data in the form of individual words by using the following
command
Now, we can read the generated result by using the following command.
8. Now, perform the map operation.
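A consolidated sketch of the word-count steps in the Scala shell; the file path and the variable names are assumptions:
scala> val data = sc.textFile("C:/data.txt")                   // read the input text file
scala> val splitdata = data.flatMap(line => line.split(" "))   // split the data into individual words
scala> val mapdata = splitdata.map(word => (word, 1))          // map each word to a count of 1
scala> val reducedata = mapdata.reduceByKey(_ + _)             // reduce by key to get the word counts
scala> reducedata.collect()                                    // read the generated result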
First of all, download the latest version of Apache Flume software from the website
https://flume.apache.org/.
Step 1
Open the website. Click on the download link on the left-hand side of the home page. It will
take you to the download page of Apache Flume.
Step 2
In the Download page, you can see the links for binary and source files of Apache Flume. Click
on the link apache-flume-1.9.0-bin.tar.gz
You will be redirected to a list of mirrors where you can start your download by clicking any
of these mirrors.
Step 3
Extract the archive and move it to the C:\apache-flume-1.9.0-bin directory.
Step 4
Set Path and Classpath for Flume:
FLUME_HOME=C:\apache-flume-1.9.0-bin
FLUME_CONF=%FLUME_HOME%\conf
CLASSPATH=%FLUME_HOME%\lib\*
PATH=C:\apache-flume-1.9.0-bin\bin
Step 5
Now, go to the Flume folder and in that, you will find the folder named conf. Open it and in
that, open the file called log4j.properties file with a text editor and make the following
changes:
flume.root.logger=DEBUG,console
#flume.root.logger=INFO,LOGFILE
Step 6
Copy the file flume-conf.properties.template, flume-env.ps1.template and rename them
to flume-conf.properties, flume-env.ps1 respectively.
We will create an application and get the tweets from it using the experimental twitter source
provided by Apache Flume. We will use the memory channel to buffer these tweets and HDFS
sink to push these tweets into the HDFS.
Step 1
To create a Twitter application, click on the following link https://apps.twitter.com/. Sign in to
your Twitter account. You will have a Twitter Application Management window where you
can create, delete, and manage Twitter Apps.
Apply for a developer account by filling in the basic information in the forms; once you get
the developer account, you may proceed to create an application on the developer portal.
Step 2
Click on the Create App button.
You will be redirected to a window where you will get an application form in which you have
to fill in your details in order to create the App.
Generate Consumer Keys and Access Token on this page. Remember them for future use.
2. Starting HDFS
Since we are storing the data in HDFS, we need to install / verify Hadoop. Start Hadoop and
create a folder in it to store Flume data. Follow the steps given below before configuring Flume.
Run the following command in the terminal:
start-all
Create a Directory in HDFS:
3. Configuring Flume
We have to configure the source, the channel, and the sink using the configuration file in the
conf folder.
Setting the classpath:
Set the CLASSPATH variable to the lib folder of Flume in the flume-env file as shown below.
export CLASSPATH=C:\flume\apache-flume-1.9.0-bin\lib\*
Example – Configuration File
Given below is an example of the configuration file. Copy this content and save as twitter.conf
in the conf folder of Flume.
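As the file contents are not reproduced here, a typical sketch of such a configuration is given below; the agent name TwitterAgent matches the execution command in the next section, the consumer/access keys are placeholders, and the HDFS path is an assumption:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9820/twitter_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel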
4. Execution
Browse through the Flume home directory and execute the application as shown below:
Run the following command in the terminal from the flume directory to stream the twitter data
in windows.
flume-ng agent --conf conf --conf-file ./conf/twitter.conf --name TwitterAgent -property
"flume.root.logger=DEBUG,console"
If everything goes fine, the streaming of tweets into HDFS will start. Given below is the
snapshot of the command prompt window while fetching tweets.
NOTE:
Download the MySQL-Connector (mysql-connector-java-8.0.21.jar)
https://downloads.mysql.com/archives/c-j/
Copy the mysql-connector jar file. You just download it and paste it to the sqoop/lib
folder.
(Screenshot: browsing the imported file in the HDFS web UI — the Head/Tail preview of Block 0 shows the imported rows as comma-separated city records for Argentina, e.g. entries for ARG / Mendoza / Godoy Cruz with city ID and population values.)
Export: To perform an export, first create an empty table with the same schema and description as
the imported table.
Now it is time to export your database table from HDFS to SQL. Use the given command:
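A sketch of such an export command; the database name, credentials, target table, HDFS export directory and field delimiter are assumptions:
sqoop export --connect jdbc:mysql://localhost:3306/testdb --username root --password <password> --table city_export --export-dir /user/sqoop/city --input-fields-terminated-by ','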