
DEPARTMENT OF INFORMATION TECHNOLOGY

Big Data with Hadoop Lab Manual


(17CI68)

B.Tech VII SEMESTER – R17


LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING
(AUTONOMOUS)
Affiliated to JNTUK, Kakinada & Approved by AICTE, New Delhi

NAAC Accredited with “B++” grade, Certified by ISO 9001:2015

L.B.REDDY NAGAR, MYLAVARAM-KRISHNA DIST.

DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE

This is to certify that this is a bonafide record of practical work done in the Big Data with
Hadoop Laboratory by Ponnam Vasavi, Regd. No. 18761A1245, of
IV B.Tech (VII Semester), IT branch, during the academic year 2021-2022.

Date: TEACHER-IN-CHARGE

Internal Examiner External Examiner


Pre-requisites:

 Java Programming
 Database Knowledge

Course Educational Objectives (CEO):


This course provides practical, foundation-level training that enables immediate and
effective participation in Big Data and other analytics projects using Hadoop and R.

Course Outcomes (COs): After the completion of this course, the student will be able to:
CO1: Prepare data for summarization, querying, and analysis.
CO2: Apply data modelling techniques to large data sets.
CO3: Create applications for Big Data analytics.

4. Course Articulation Matrix:

           Programme Outcomes
COs      PO1   PO2   PO3   PO4   PO5
CO1       3     2     -     -     -
CO2       2     -     -     1     3
CO3       1     3     -     3     -
CO4       1     2     2     3     1
CO5       -     2     3     3     1
1 = Slight (Low)   2 = Moderate (Medium)   3 = Substantial (High)
Index
S.No   Date                       Experiment                                                      Signature
1.     13/07/2022, 20/07/2022     Downloading and installing Hadoop; understanding different
                                  Hadoop modes. Startup scripts, configuration files.
2.     27/07/2022, 03/08/2022     Hadoop implementation of file management tasks, such as adding
                                  files and directories, retrieving files and deleting files.
3.     10/08/2022, 17/08/2022     Implementation of matrix multiplication with Hadoop MapReduce.
4.     24/08/2022, 07/09/2022     Implementation of a basic Word Count MapReduce program to
                                  understand the MapReduce paradigm.
5.     14/09/2022, 21/09/2022     Implementation of K-means clustering using MapReduce.
6.     28/09/2022, 12/10/2022     Installation of Hive along with practice examples.
7.     19/10/2022, 26/10/2022     Installation of HBase, installing Thrift along with practice examples.
8.     02/11/2022, 09/11/2022     Installation of R, along with practice examples in R.
9.     16/11/2022                 Downloading and installing Hadoop; understanding different
                                  Hadoop modes. Startup scripts, configuration files.
List of Experiments

Week-1:

Downloading and installing Hadoop; understanding different Hadoop modes. Startup scripts,
configuration files.

Week-2:

Hadoop implementation of file management tasks, such as adding files and directories,
retrieving files and deleting files.

Week-3:

Implementation of matrix multiplication with Hadoop MapReduce.

Week-4:

Implementation of a basic Word Count MapReduce program to understand the MapReduce paradigm.

Week-5:

Implementation of K-means clustering using MapReduce.

Week-6:

Installation of Hive along with practice examples.

Week-7:

Installation of HBase, installing Thrift along with practice examples.

Week-8:

Installation of R, along with practice examples in R.


EXPERIMENT-1
1. (i) Perform setting up and installing Hadoop in its three operating modes:
 Standalone
 Pseudo distributed
 Fully distributed
(ii) Use web-based tools to monitor your Hadoop setup.

Hadoop can run in three modes:


a) Standalone mode
b) Pseudo-distributed mode
c) Fully distributed mode
The software requirements for Hadoop installation are:
 Java Development Kit
 Hadoop framework
 Secure shell (SSH)

A) STANDALONE MODE:
 Installation of jdk 7
Command: sudo apt-get install openjdk-7-jdk

 Download and extract Hadoop


Command: wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz
Command: sudo mv hadoop-1.2.0 /usr/lib/hadoop

 Set the paths for Java and Hadoop


Command: sudo gedit $HOME/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_COMMON_HOME=/usr/lib/hadoop
export HADOOP_MAPRED_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin
 Check the Java and Hadoop versions
Command: java -version
Command: hadoop version

B) PSEUDO MODE:
A Hadoop single-node cluster runs on a single machine: the NameNode and DataNode daemons run
on the same host. The installation and configuration steps are given below:

 Installation of the secure shell server


Command: sudo apt-get install openssh-server
 Create an SSH key for passwordless SSH configuration
Command: ssh-keygen -t rsa -P ""

 Append the public key to the authorized keys file


Command: cat $HOME/.ssh/id_rsa.pub >>$HOME/.ssh/authorized_keys

/**************RESTART THE COMPUTER********************/

 Check the secure shell login


Command: ssh localhost

 Add JAVA_HOME directory in hadoop-env.sh file


Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Create namenode and datanode directories for Hadoop


Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode
Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

 Configure core-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>

 Configure hdfs-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>
 Configure mapred-site.xml
Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>

 Format the name node


Command: hadoop namenode -format

 Start the namenode, datanode


Command: start-dfs.sh

 Start the task tracker and job tracker


Command: start-mapred.sh

 To check if Hadoop started correctly


Command: jps
namenode
secondarynamenode
datanode
jobtracker
tasktracker

C) FULLY DISTRIBUTED MODE:


All the daemons, such as the NameNode and DataNodes, run on different machines. Data is
replicated on the slave machines according to the replication factor. The secondary NameNode
periodically stores a mirror image of the NameNode. The NameNode holds the metadata
describing where blocks are stored and how many replicas exist on the slave machines. The
master and slaves communicate with each other periodically. The configuration of a multi-node
cluster is given below:

 Configure the hosts in all nodes/machines


Command: sudo gedit /etc/hosts/
192.168.1.58 pcetcse1
192.168.1.4 pcetcse2
192.168.1.5 pcetcse3
192.168.1.7 pcetcse4
192.168.1.8 pcetcse5

 Passwordless SSH configuration

 Create an SSH key on the namenode/master.


Command: ssh-keygen -t rsa -P ""

 Copy the generated public key to all datanodes/slaves.


Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse2
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse3
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse4
Command: ssh-copy-id -i ~/.ssh/id_rsa.pub huser@pcetcse5

/**************RESTART ALL NODES/COMPUTERS/MACHINES ************/

NOTE: Verify the passwordless ssh environment from namenode to all datanodes as “huser”
user.
 Verify the passwordless SSH login to each node
Command: ssh pcetcse1
Command: ssh pcetcse2
Command: ssh pcetcse3
Command: ssh pcetcse4
Command: ssh pcetcse5

 Add JAVA_HOME directory in hadoop-env.sh file in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

 Creating namenode directory in namenode/master


Command: sudo mkdir -p /usr/lib/hadoop/dfs/namenode

 Creating datanode directory in datanodes/slaves


Command: sudo mkdir -p /usr/lib/hadoop/dfs/datanode

 Configure core-site.xml in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://pcetcse1:8020</value>
</property>

 Configure hdfs-site.xml in namenode/master


Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/lib/hadoop/dfs/namenode</value></property>
 Configure hdfs-site.xml in datanodes/slaves
Command: sudo gedit /usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/lib/hadoop/dfs/datanode</value>
</property>

 Configure mapred-site.xml in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>pcetcse1:8021</value>
</property>

 Configure the masters file on the namenode/master; give the secondary namenode hostname


Command: sudo gedit /usr/lib/hadoop/conf/masters
pcetcse2

 Configure the masters file on the datanodes/slaves; give the namenode hostname


Command: sudo gedit /usr/lib/hadoop/conf/masters
pcetcse1

 Configure slaves in all nodes/machines


Command: sudo gedit /usr/lib/hadoop/conf/slaves
pcetcse2
pcetcse3
pcetcse4
pcetcse5

 Format the name node


Command: hadoop namenode -format

 Start the namenode, datanode


Command: start-dfs.sh

 Start the task tracker and job tracker


Command: start-mapred.sh
 To check if Hadoop started correctly, run jps on all the nodes/machines
huser@pcetcse1:$ jps
namenode
jobtracker
huser@pcetcse2:$ jps
secondarynamenode
tasktracker
datanode
huser@pcetcse3:$ jps
datanode
tasktracker

huser@pcetcse4:$ jps
datanode
tasktracker

huser@pcetcse5:$ jps
datanode
tasktracker

Using the Hadoop monitoring web UIs


 NameNode UI
http://localhost:50070/
 HDFS live nodes list (linked from the NameNode UI)

 JobTracker UI
http://localhost:50030/
 NameNode logs
http://localhost:50070/logs

 TaskTracker UI
http://localhost:50060/
Aim: Word Count program using MapReduce.
Procedure in Eclipse:
Open the Eclipse IDE.
 Go to File -> New -> Java Project -> project name (wordcount). Check that the Java version is SE 1.8.
 Right-click on the project name (wordcount) -> New -> Package; name it com.lbrce.wordcount.
 Right-click on the package name -> New -> Class; create the classes with their respective names and
add the code (a minimal sketch is given at the end of this procedure).
 Right-click on the package -> Build Path -> Configure -> go to Libraries -> click on Add External
JARs, then add the required jar files.
 Right-click on the project (wordcount) -> Export -> type "jar" in the text field and click on JAR file,
then browse for a location, save the file name as wordcount -> Next -> Next -> browse for the main class
-> Next -> select the wordcount driver and click on Finish.
Procedure in WinSCP:
Open WinSCP.
Enter host address: 172.16.0.70
Username: student
Password: LbrceStudent
Two panels appear in the environment. In the right panel (server), open it2021 -> your respective
folder (e.g. 18761A1201). In the left panel, select your respective folder, select your jar file and
drag it to the right panel (server).
Procedure in Termius:
Host address: ssh [email protected]
Password: LbrceStudent
Then click on connect.
>>cd it2021
>>cd 18761A1201(your directory)
>>scp jar filename(i.e.,wordcount.jar) [email protected]:/home/hduser/it2021/18761A1201
>>enter password as ipc
>>ssh [email protected]
>>enter password as ipc
>>cd it2021/18761A1201
Create a file named wordcount1:
>>cat >>wordcount1
Enter the text whose word counts are to be found.
Then press Ctrl+D.
>> hadoop fs -mkdir /it2021/18761a1201
>> hadoop fs -put wordcount1 /it2021/18761a1201
>> hadoop jar jarfilename(wordcount.jar) /it2021/18761a1201/wordcount1
/it2021/18761a1201/wordcountoutput
>> hadoop fs -cat /it2021/18761a1201/wordcountoutput/part*
Output will be displayed.
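
A minimal sketch of the mapper, reducer and driver classes from which the wordcount jar is typically
built is given below, assuming the classic org.apache.hadoop.mapreduce API of Hadoop 1.x; the package
and class names follow the procedure above but are otherwise illustrative.

package com.lbrce.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Mapper: emits (word, 1) for every token of every input line.
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts received for each word.
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires mapper, combiner and reducer; input and output paths come from the command line.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}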

Aim: Matrix multiplication program using MapReduce.


Procedure in Eclipse:
Open the Eclipse IDE.
 Go to File -> New -> Java Project -> project name (matrixmultiplication). Check that the Java
version is SE 1.8.
 Right-click on the project name (matrixmultiplication) -> New -> Package; name it
com.lbrce.matrixmultiplication.
 Right-click on the package name -> New -> Class; create the classes with their respective names and
add the code.
 Right-click on the package -> Build Path -> Configure -> go to Libraries -> click on Add External
JARs, then add the required jar files.
 Right-click on the project (matrixmultiplication) -> Export -> type "jar" in the text field and click
on JAR file, then browse for a location, save the file name as matrixmultiplication -> Next -> Next ->
browse for the main class -> Next -> select the matrixmultiplication driver and click on Finish.
Procedure in WinSCP:
Open WinSCP.
Enter host address: 172.16.0.70
Username: student
Password: LbrceStudent
Two panels appear in the environment. In the right panel (server), open it2021 -> your respective
folder (e.g. 18761A1201). In the left panel, select your respective folder, select your jar file and
drag it to the right panel (server).
Procedure in Termius:
Host address: ssh [email protected]
Password: LbrceStudent
Then click on connect.
>>cd it2021
>>cd 18761A1201(your directory)
>>scp jar filename(i.e matrixmultiplication.jar)
[email protected]:/home/hduser/it2021/18761A1201
>>enter password as ipc
>>ssh [email protected]
>>enter password as ipc
>>cd it2021/18761A1201
Create a file named matrixmultiplication1 containing the two matrices (each line is
matrix,row,column,value):
>>cat >> matrixmultiplication1
M,0,0,1
M,0,1,2
M,1,0,3
M,1,1,4
N,0,0,1
N,0,1,3
N,1,0,4
N,1,1,5
Then press Ctrl+D.
>> hadoop fs -mkdir /it2021/18761a1201
>> hadoop fs -put matrixmultiplication1 /it2021/18761a1201
>> hadoop jar jarfilename(matrixmultiplication.jar) /it2021/18761a1201/ matrixmultiplication1
/it2021/18761a1201/matrixmultiplicationoutput
>> hadoop fs -cat /it2021/18761a1201/ matrixmultiplicationoutput/part*
Output will be displayed.
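
For the sample input above (each line is matrix,row,column,value), M = [[1, 2], [3, 4]] and
N = [[1, 3], [4, 5]], so the expected product written to matrixmultiplicationoutput is:
(0,0) = 1*1 + 2*4 = 9
(0,1) = 1*3 + 2*5 = 13
(1,0) = 3*1 + 4*4 = 19
(1,1) = 3*3 + 4*5 = 29
The exact key/value formatting of the part file depends on how the reducer writes its output.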
Aim: K-means program using MapReduce.
Procedure in Eclipse:
Open the Eclipse IDE.
 Go to File -> New -> Java Project -> project name (kmeans). Check that the Java version is SE 1.8.
 Right-click on the project name (kmeans) -> New -> Package; name it com.lbrce.kmeans.
 Right-click on the package name -> New -> Class; create the classes with their respective names and
add the code.
 Right-click on the package -> Build Path -> Configure -> go to Libraries -> click on Add External
JARs, then add the required jar files.
 Right-click on the project (kmeans) -> Export -> type "jar" in the text field and click on JAR file,
then browse for a location, save the file name as kmeans -> Next -> Next -> browse for the main class
-> Next -> select the kmeans driver and click on Finish.
Procedure in WinSCP:
Open WinSCP.
Enter host address: 172.16.0.70
Username: student
Password: LbrceStudent
Two panels appear in the environment. In the right panel (server), open it2021 -> your respective
folder (e.g. 18761A1201). In the left panel, select your respective folder, select the kmeans jar file
and drag it to the right panel (server).
Procedure in Termius:
Host address: ssh [email protected]
Password: LbrceStudent
Then click on connect.
>>cd it2021
>>cd 18761A1201(your directory)
>>scp jar filename(i.e.,kmeans.jar) [email protected]:/home/hduser/it2021/18761A1201
>>enter password as ipc
>>ssh [email protected]
>>enter password as ipc
>>cd it2021/18761A1201
Create the centroid and data point files:
>>cat >> kmeanscentroid
20
30
40
>>cat >> kmeansdatapoints
16
14
10
26
34
12
28
19
Then press Ctrl+D.
>> hadoop fs -mkdir /it2021/18761a1201
>> hadoop fs -put kmeanscentroid /it2021/18761a1201
>> hadoop fs -put kmeansdatapoints /it2021/18761a1201
>> hadoop jar jarfilename(kmeanscentrod.jar) /it2021/18761a1201/kmeanscentroid
/it2021/18761a1201/kmeansoutput
>> hadoop fs -cat /it2021/18761a1201/kmeansoutput/part*
Output will be displayed.
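
For the sample input above (initial centroids 20, 30 and 40; data points 16, 14, 10, 26, 34, 12, 28
and 19), the first MapReduce iteration assigns each point to its nearest centroid: {16, 14, 10, 12, 19}
to 20, {26, 34, 28} to 30, and no point to 40. The reducer then recomputes each centroid as the mean
of its assigned points, giving 71/5 = 14.2 and 88/3 = 29.33 (approximately), with 40 left unchanged.
Subsequent iterations repeat this until the centroids stop moving; the exact values written to
kmeansoutput depend on how the particular driver iterates and formats its results.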
EXPERIMENT -6
3) Install and run Hive, then use Hive to create, alter, and drop databases, tables, views,
functions and indexes.

 Download and extract Hive:


Command: wget https://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz
Command: tar zxvf apache-hive-0.14.0-bin.tar.gz
Command: sudo mv apache-hive-0.14.0-bin /usr/lib/hive
Command: sudo gedit $HOME/.bashrc
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/*.jar
export CLASSPATH=$CLASSPATH:/usr/lib/hive/lib/*.jar
Command: sudo cd $HIVE_HOME/conf
Command: sudo cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/usr/lib/hadoop

 Downloading Apache Derby


The following command is used to download Apache Derby. It takes some time to
download.
Command: wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
Command: tar zxvf db-derby-10.4.2.0-bin.tar.gz
Command: sudo mv db-derby-10.4.2.0-bin /usr/lib/derby
Command: sudo gedit $HOME/.bashrc

export DERBY_HOME=/usr/lib/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar:$DERBY_HOME/lib/derbyclient.jar
Command: sudo mkdir $DERBY_HOME/data
Command: sudo cd $HIVE_HOME/conf
Command: sudo cp hive-default.xml.template hive-site.xml
Command: sudo gedit $HIVE_HOME/conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
 Create a file named jpox.properties and add the following lines into it:

javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create =
true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Command: $HADOOP_HOME/bin/hadoop fs -mkdir /tmp


Command: $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Command: hive
Logging initialized using configuration in
jar:file:/home/hadoop/hive-0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
...

hive> show tables;


OK
Time Taken: 2.798 seconds

 Database and table creation, dropping:


hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> SHOW DATABASES;
default
userdb
hive> DROP DATABASE IF EXISTS userdb;
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
    > salary String, destination String)
    > COMMENT 'Employee details'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;

Example
We will insert the following data into the table. It is a text file named sample.txt in the
/home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Krian 40000 Hr Admin
1205 Kranthi 30000 Op Admin

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'


> OVERWRITE INTO TABLE employee;
hive> SELECT * FROM employee WHERE Salary>=40000;

+------+-------------+--------+-------------------+------+
| ID   | Name        | Salary | Designation       | Dept |
+------+-------------+--------+-------------------+------+
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Krian       | 40000  | Hr Admin          | HR   |
+------+-------------+--------+-------------------+------+

hive> ALTER TABLE employee RENAME TO emp;


hive> DROP TABLE IF EXISTS employee;

Functions:

Return Type  Signature                                Description
BIGINT       round(double a)                          Returns the rounded BIGINT value of the double.
BIGINT       floor(double a)                          Returns the maximum BIGINT value that is equal to
                                                      or less than the double.
BIGINT       ceil(double a)                           Returns the minimum BIGINT value that is equal to
                                                      or greater than the double.
double       rand(), rand(int seed)                   Returns a random number that changes from row
                                                      to row.
string       concat(string A, string B, ...)          Returns the string resulting from concatenating B
                                                      after A.
string       substr(string A, int start)              Returns the substring of A starting from the start
                                                      position till the end of string A.
string       substr(string A, int start, int length)  Returns the substring of A starting from the start
                                                      position with the given length.
string       upper(string A)                          Returns the string resulting from converting all
                                                      characters of A to upper case.
string       ucase(string A)                          Same as above.
string       lower(string A)                          Returns the string resulting from converting all
                                                      characters of A to lower case.

hive> SELECT round(2.6) from temp;


3.0
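
The string functions in the table can be exercised the same way; a sketch, where temp is any existing
table with at least one row and the literal arguments are only examples:

hive> SELECT concat('big', 'data'), upper('hive'), substr('hadoop', 1, 3) FROM temp;
bigdata HIVE had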
 Views:
Example
Let us take an example of a view. Assume the employee table as given below, with the
fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of
employees who earn a salary of more than Rs 30000. We store the result in a view
named emp_30000.

+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+

The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
> SELECT * FROM employee
> WHERE salary>30000;
 Indexes:
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
EXPERIMENT – 7

6) Installation of HBase, Installing thrift along with Practice examples.

Apache HBase Installation Modes


Apache HBase can be installed in three modes.

1) Standalone mode installation (no dependency on the Hadoop system)


 This is the default mode of HBase
 It runs against the local file system
 It doesn't use Hadoop HDFS
 Only the HMaster daemon runs
 Not recommended for production environments
 Runs in a single JVM
2) Pseudo-distributed mode installation (single-node Hadoop system + HBase installation)
 It runs on Hadoop HDFS
 All daemons run in a single node
 Not recommended for production environments
3) Fully distributed mode installation (multi-node Hadoop environment + HBase
installation)
 It runs on Hadoop HDFS
 All daemons run across all nodes present in the cluster
 Highly recommended for production environments

Step 1) Go to the Apache HBase download page to download HBase.
Step 2) Select the stable version (1.1.2).
Step 3) Click on hbase-1.1.2-bin.tar.gz. It will download the tar file. Copy the tar file into the
installation location.

HBase - Standalone mode installation:


Installation is performed on Ubuntu with Hadoop already installed.
Step 1) Place hbase-1.1.2-bin.tar.gz in /home/hduser.
Step 2) Unzip it by executing the command $ tar -xvf hbase-1.1.2-bin.tar.gz. It will unzip the
contents and create hbase-1.1.2 in the location /home/hduser.
Step 3) Open hbase-env.sh and set the JAVA_HOME path in it.

Step 4) Open the ~/.bashrc file and set the HBASE_HOME path.
Step 5) Open hbase-site.xml and place the following properties inside the file:
hduser@ubuntu$ gedit hbase-site.xml (code as below)

<property>

<name>hbase.rootdir</name>

<value>file:///home/hduser/HBASE/hbase</value>

</property>

<property>

<name>hbase.zookeeper.property.dataDir</name>

<value>/home/hduser/HBASE/zookeeper</value>
</property>

Here we are setting two properties:


one for the HBase root directory, and
a second one for the data directory used by ZooKeeper.
All HMaster and ZooKeeper activities refer to this hbase-site.xml.

Step 6) Open the hosts file present in /etc and add the IP addresses of the nodes.
Step 7) Now run start-hbase.sh from the hbase-1.1.2/bin location.
We can check with the jps command whether HMaster is running or not.

Step 8) The HBase shell can be started using "hbase shell", which enters interactive shell
mode. Once it enters shell mode, we can perform all types of
commands.

HBase Shell
HBase provides a shell through which you can communicate with HBase. HBase uses the Hadoop
File System to store its data. It has a master server and region servers. The data is stored
in the form of regions (tables). These regions are split up and stored in region servers.
The master server manages these region servers, and all these tasks take place on HDFS. Given
below are some of the commands supported by the HBase shell.

General Commands
status - Provides the status of HBase, for example, the number of servers.
version - Provides the version of HBase being used.
table_help - Provides help for table-reference commands.
whoami - Provides information about the user.

Data Definition Language


These are the commands that operate on the tables in HBase.
create - Creates a table.
list - Lists all the tables in HBase.
disable - Disables a table.
is_disabled - Verifies whether a table is disabled.
enable - Enables a table.
is_enabled - Verifies whether a table is enabled.
describe - Provides the description of a table.
alter - Alters a table.
exists - Verifies whether a table exists.
drop - Drops a table from HBase.
drop_all - Drops the tables matching the ‘regex’ given in the command.
Java Admin API - In addition to the above shell commands, Java provides an Admin API to achieve
DDL functionalities through programming. HBaseAdmin and HTableDescriptor, in the
org.apache.hadoop.hbase.client package, are the two important classes that provide
DDL functionalities.

Data Manipulation Language


put - Puts a cell value at a specified column in a specified row in a particular table.
get - Fetches the contents of row or a cell.
delete - Deletes a cell value in a table.
deleteall - Deletes all the cells in a given row.
scan - Scans and returns the table data.
count - Counts and returns the number of rows in a table.
truncate - Disables, drops, and recreates a specified table.
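
As an illustration of the shell commands listed above, a short hypothetical session that creates a
table with one column family, inserts a cell, reads it back, scans the table and finally drops it
could look like this (the table name emp and column family personal are only examples):

hbase(main):001:0> create 'emp', 'personal'
hbase(main):002:0> put 'emp', '1', 'personal:name', 'raju'
hbase(main):003:0> get 'emp', '1'
hbase(main):004:0> scan 'emp'
hbase(main):005:0> disable 'emp'
hbase(main):006:0> drop 'emp'

Note that a table must be disabled before it can be dropped.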
EXPERIMENT – 8

Installation of R, along with Practice examples in R.

Installation of R software in Windows and Linux environments

Requirements Analysis

Installation of R in Windows OS: The Comprehensive R Archive Network (CRAN) is a network of
websites that host the R program and mirror the original R website. The benefit of having this
network of websites is improved download speeds. For all intents and purposes, CRAN is the R
website and holds downloads (including old versions of the software) and documentation. R can be
installed on Windows 7/8/10/Vista and supports both the 32-bit and 64-bit versions. Go to the CRAN
website, select the latest installer R 3.4.0 for Windows and download the .exe file. Double-click
on the downloaded file and select Run as Administrator from the popup menu. Select the language to
be used for installation and follow the directions. The installation folder for R can be found in
C:\Program Files\R. The steps for installing R:

1. Click on the link https://cran.r-project.org/bin/windows/base/, which redirects you to the
download page.

2. Select the latest installer R-3.4.0 and download it. After the download, clicking on the setup
file opens the installation dialog box.

3. Clicking the 'Next' button starts the installation process and redirects you to the license
window; select 'Next' to continue.

4. After selecting Next in the previous step, the installation folder path is required. Select the
desired folder for installation; it is advisable to select the C: drive for smooth running of the
program.

5. Next, select the components for installation based on the requirements of your operating system
to avoid unwanted use of disk space.

6. In the next dialog box, we need to select the Start menu folder. Here, it is better to go with
the default option given by the installer.

7. After setting up the Start menu folder, check the additional options to complete the setup.

8. After clicking Next in the previous step, the installation procedure ends and the final window
is displayed. Click 'Finish' to exit the installation window.
Installing R-Studio

Installing and Configuring R-Studio in Windows: The Integrated Development Environment (IDE) for
R is R Studio, and it provides a variety of features such as an editor with direct code execution and
syntax highlighting, a console, tools for plotting graphs, history lookup, debugging, and an
environment for workspace creation. R Studio can be installed on any of the Windows platforms such
as Windows 7/8/10/Vista and can be configured within a few minutes. The basic requirement is R
version 2.11.1 or higher. The following are the steps involved in setting up R Studio:

1) Download the latest version of R Studio by clicking on the link
https://www.rstudio.com/products/rstudio/download/, which redirects you to the download page.
There are two versions of R Studio available, desktop and server. Based on your usage and
comfort, select the appropriate version to initiate your download. The latest desktop version of R
Studio is 1.0.136.

2) Download the .exe file and double-click on it to initiate the installation.

3) Click on the 'Next' button and it redirects you to select the installation folder. Select 'C:\' as
your installation directory, since R and R Studio must be installed in the same directory to avoid
path issues when running R programs.

4) Click 'Next' to continue, and a dialog box asking you to select the Start menu folder opens. It is
advisable to create your own folder to avoid any possible confusion; click on the Install button to
install R Studio. After completion of the installation, the installation procedure ends and the final
window is displayed. Click 'Finish' to exit the installation window.
Installation of R in Ubuntu
Installation of R in Ubuntu: Go to the Software Center, search for R Base and install it. Then open a
terminal and enter R to get the R command prompt in the terminal. Installation of R-Studio in Ubuntu:
open a terminal and type the installation commands (a typical sequence is sketched below).
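
The commands themselves are not reproduced in this record; a typical sequence on Ubuntu, assuming R
base from the standard repositories and the RStudio desktop .deb package from the RStudio download
page, is:

sudo apt-get update
sudo apt-get install r-base
sudo apt-get install gdebi-core
# Download the RStudio .deb from https://www.rstudio.com/products/rstudio/download/ and install it:
sudo gdebi rstudio-<version>-amd64.deb

Here <version> stands for whichever RStudio release was downloaded.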

Experiment-8
The R programming language has numerous libraries to create charts and graphs. A pie chart is a
representation of values as slices of a circle with different colors. The slices are labeled, and the
numbers corresponding to each slice are also represented in the chart.
In R the pie chart is created using the pie() function, which takes positive numbers as a vector input.
The additional parameters are used to control labels, color, title, etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
 x is a vector containing the numeric values used in the pie chart.
 labels is used to give description to the slices.
 radius indicates the radius of the circle of the pie chart.(value between −1 and +1).
 main indicates the title of the chart.
 col indicates the color palette.
 clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
Example
A very simple pie-chart is created using just the input vector and labels.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.


png(file = "city.png")

# Plot the chart.


pie(x,labels)

# Save the file.


dev.off()

# Create data for the graph.


x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.


png(file = "city_title_colours.jpg")

# Plot the chart with title and rainbow color pallet.


pie(x, labels, main = "City pie chart", col = rainbow(length(x)))

# Save the file.


dev.off()

A bar chart represents data as rectangular bars with the length of each bar proportional to the value
of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and
horizontal bars in the bar chart. In a bar chart, each bar can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
Following is the description of the parameters used −

 H is a vector or matrix containing numeric values used in bar chart.


 xlab is the label for x axis.
 ylab is the label for y axis.
 main is the title of the bar chart.
 names.arg is a vector of names appearing under each bar.
 col is used to give colors to the bars in the graph.
Example
A simple bar chart is created using just the input vector and the name of each bar.

# Create the data for the chart


H <- c(7,12,28,3,41)

# Give the chart file a name


png(file = "barchart.png")

# Plot the bar chart


barplot(H)

# Save the file


dev.off()
Boxplots are a measure of how well distributed the data in a data set is. A boxplot divides the data
set into quartiles and represents the minimum, maximum, median, first quartile and third quartile of
the data set. It is also useful for comparing the distribution of data across data sets by drawing
boxplots for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
 x is a vector or a formula.
 data is the data frame.
 notch is a logical value. Set as TRUE to draw a notch.
 varwidth is a logical value. Set as true to draw width of the box proportionate to the sample
size.
 names are the group labels which will be printed under each boxplot.
 main is used to give a title to the graph.

Example
# Give the chart file a name.
png(file = "boxplot.png")

# Plot the chart.


boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")

# Save the file.


dev.off()

A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is
similar to a bar chart, but the difference is that it groups the values into continuous ranges. Each
bar in a histogram represents the number of values present in that range.
R creates histogram using hist() function. This function takes a vector as an input and uses some
more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
 v is a vector containing numeric values used in histogram.
 main indicates title of the chart.
 col is used to set color of the bars.
 border is used to set border color of each bar.
 xlab is used to give description of x-axis.
 xlim is used to specify the range of values on the x-axis.
 ylim is used to specify the range of values on the y-axis.
 breaks is used to mention the width of each bar.
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.


png(file = "histogram.png")

# Create the histogram.


hist(v,xlab = "Weight",col = "yellow",border = "blue")

# Save the file.


dev.off()
A line chart is a graph that connects a series of points by drawing line segments between them.
These points are ordered in one of their coordinate (usually the x-coordinate) value. Line charts are
usually used in identifying the trends in data.
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab)
Following is the description of the parameters used −
 v is a vector containing the numeric values.
 type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw
both points and lines.
 xlab is the label for x axis.
 ylab is the label for y axis.
 main is the Title of the chart.
 col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input vector and the type parameter as "o". The script below
will create and save a line chart in the current R working directory.
# Create the data for the chart.
v <- c(7,12,28,3,41)

# Give the chart file a name.


png(file = "line_chart.jpg")

# Plot the bar chart.


plot(v,type = "o")

# Save the file.


dev.off()
Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two
variables. One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used −
 x is the data set whose values are the horizontal coordinates.
 y is the data set whose values are the vertical coordinates.
 main is the title of the graph.
 xlab is the label in the horizontal axis.
 ylab is the label in the vertical axis.
 xlim is the limits of the values of x used for plotting.
 ylim is the limits of the values of y used for plotting.
 axes indicates whether both axes should be drawn on the plot.
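
Example
A simple scatterplot can be created in the same way as the other charts; the sketch below uses the
built-in mtcars data set, and the file name and labels are only illustrative.

# Get two columns (weight and mileage) from the mtcars data set.
input <- mtcars[, c('wt', 'mpg')]

# Give the chart file a name.
png(file = "scatterplot.png")

# Plot the chart.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     main = "Weight vs Mileage")

# Save the file.
dev.off()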

Reading a CSV File


Following is a simple example of read.csv() function to read a CSV file available in your current
working directory −
data <- read.csv("input.csv")
print(data)

Analyzing the CSV File


By default the read.csv() function gives the output as a data frame. This can be easily checked as
follows. We can also check the number of columns and rows.
data <- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result −
[1] TRUE
[1] 5
[1] 8

# Create a data frame.


data <- read.csv("input.csv")
# Get the max salary from data frame.
sal <- max(data$salary)
print(sal)

Applying the NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA
values.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)

# Find mean.
result.mean <- mean(x)
print(result.mean)

# Find mean dropping NA values.


result.mean <- mean(x,na.rm = TRUE)
print(result.mean)

Mean
It is calculated by taking the sum of the values and dividing with the number of values in a data
series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
 x is the input vector.
 trim is used to drop some observations from both end of the sorted vector.
 na.rm is used to remove the missing values from the input vector.
Example

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)
When we execute the above code, it produces the following result −
[1] 8.22
Median
The middle most value in a data series is called the median. The median() function is used in R to
calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
 x is the input vector.
 na.rm is used to remove the missing values from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.


median.result <- median(x)
print(median.result)
When we execute the above code, it produces the following result −
[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike the mean
and median, the mode can apply to both numeric and character data.
R does not have a standard in-built function to calculate the mode, so we create a user-defined
function to calculate the mode of a data set in R. This function takes the vector as input and gives
the mode value as output.
Example
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.


v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.


result <- getmode(v)
print(result)

# Create the vector with characters.


charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.


result <- getmode(charv)
print(result)
Linear Regression :

library(tidyverse)

library(caret)

theme_set(theme_bw())

data("marketing", package = "datarium")

sample_n(marketing, 3)

set.seed(123)

training.samples <- marketing$sales %>%

createDataPartition(p = 0.8, list = FALSE)

train.data <- marketing[training.samples, ]

test.data <- marketing[-training.samples, ]

model <- lm(sales ~., data = train.data)

summary(model)

model <- lm(sales ~ youtube, data = train.data)

summary(model)$coef

model <- lm(sales ~ facebook, data = train.data)

summary(model)$coef

newdata <- data.frame(youtube = c(0, 1000))

model %>% predict(newdata)

model <- lm(sales ~ youtube + facebook + newspaper, data = train.data)

summary(model)$coef

ggplot(marketing, aes(x = youtube, y = sales)) +geom_point() +stat_smooth()

predictions <- model %>% predict(test.data)

RMSE(predictions, test.data$sales)

R2(predictions, test.data$sales)
Results

Call:

lm(formula = sales ~ ., data = train.data)

Residuals:

Min 1Q Median 3Q Max

-10.7142 -0.9939 0.3684 1.4494 3.3619

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.594142 0.420815 8.541 1.05e-14 ***

youtube 0.044636 0.001552 28.758 < 2e-16 ***

facebook 0.188823 0.009529 19.816 < 2e-16 ***

newspaper 0.002840 0.006442 0.441 0.66

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.043 on 158 degrees of freedom

Multiple R-squared: 0.8955, Adjusted R-squared: 0.8935

F-statistic: 451.2 on 3 and 158 DF, p-value: < 2.2e-16


VIVA – QUESTIONS
1. What are the basic differences between relational database and HDFS?
2. Explain “Big Data” and what are five V’s of Big Data?
3. What is Hadoop and its components.
4. What are HDFS and YARN?
5. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.
6. Compare HDFS with Network Attached Storage (NAS).
7. List the difference between Hadoop 1 and Hadoop 2.
8. What are active and passive “NameNodes”?
9. Why does one remove or add nodes in a Hadoop cluster frequently?
10. What happens when two clients try to access the same file in the HDFS?
11. How does NameNode tackle DataNode failures?
12. What will you do when NameNode is down?
13. What is a checkpoint?
14. How is HDFS fault tolerant?
15. Can NameNode and DataNode be commodity hardware?
16. Why do we use HDFS for applications having large data sets and not when
there are a lot of small files?
17. How do you define “block” in HDFS? What is the default block size in Hadoop 1
and in Hadoop 2? Can it be changed?
18. What does ‘jps’ command do?
19. How do you define “Rack Awareness” in Hadoop?
20. What is “speculative execution” in Hadoop?
21. How can I restart “NameNode” or all the daemons in Hadoop?
22. What is the difference between an “HDFS Block” and an “Input Split”?
23. Name the three modes in which Hadoop can run.
24. What is “MapReduce”? What is the syntax to run a “MapReduce” program?
25. What are the main configuration parameters in a “MapReduce” program?
26. State the reason why we can’t perform “aggregation” (addition) in mapper?
Why do we need the “reducer” for this?
27. What is the purpose of “RecordReader” in Hadoop?
28. Explain “Distributed Cache” in a “MapReduce Framework”.
29. How do “reducers” communicate with each other?
30. What does a “MapReduce Partitioner” do?
31. How will you write a custom partitioner?
32. What is a “Combiner”?
33. What do you know about “SequenceFileInputFormat”?
34. What are the benefits of Apache Pig over MapReduce?
35. What are the different data types in Pig Latin?
36. What are the different relational operations in “Pig Latin” you worked with?
37. What is a UDF?
38. What is “SerDe” in “Hive”?
39. Can the default “Hive Metastore” be used by multiple users (processes) at the same
time?
40. What is the default location where “Hive” stores table data?
