
GOVERNMENT ENGINEERING COLLEGE BIKANER

Lab Manual
Department of Computer Science and
Engineering

8th SEMESTER

Big Data Analytics Lab


Subject Code: 8CS4-21

Submitted To: Dr. Narpat Singh Sir
Submitted By: ASHISH KULHARI
Branch: CSE 4th Year
Roll no: 18EEBCS012


EXERCISE-1:-
AIM:- Implement the following data structures in Java:
a) Linked Lists b) Stacks c) Queues d) Sets e) Maps

SKELETON OF THE JAVA.UTIL.COLLECTION INTERFACE
public interface Collection<E> extends Iterable<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    Iterator<E> iterator();
    Object[] toArray();
    <T> T[] toArray(T[] a);
    boolean add(E e);
    boolean remove(Object o);
    boolean addAll(Collection<? extends E> c);
    boolean removeAll(Collection<?> c);
    boolean retainAll(Collection<?> c);
    void clear();
    boolean equals(Object o);
    int hashCode();
}
ALGORITHM for all Collection data structures:-
Steps for creating a collection
1. Choose a generic type parameter (E, T, K or V).
2. Create a model class, i.e. a Plain Old Java Object (POJO), of that type.
3. Generate setters and getters.
4. Create a collection object of type Set, List, Map or Queue.
5. Add objects to the collection: boolean add(E e)
6. Add another collection to the collection: boolean addAll(Collection<? extends E> c)
7. Remove or retain data from the collection: removeAll(Collection<?> c), retainAll(Collection<?> c)
8. Iterate over objects using Enumeration, Iterator or ListIterator: Iterator<E> iterator(), ListIterator<E> listIterator()
9. Display objects from the collection.
10. END
SAMPLE INPUT:


Sample Employee Data Set (employee.txt):

e100,james,asst.prof,cse,8000,16000,4000,8.7
e101,jack,asst.prof,cse,8350,17000,4500,9.2
e102,jane,assoc.prof,cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,assoc.prof,cse,16500,33000,8600,6.9
e106,david,assoc.prof,cse,18000,36000,9500,8.3
e107,daniel,asst.prof,cse,9400,19000,5000,7.9
e108,ramu,assoc.prof,cse,17000,34000,9000,6.8
e109,rani,asst.prof,cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3

EXPECTED OUTPUT:- Prints the information of each employee with all its attributes.
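A minimal Java sketch of the algorithm above; it assumes an Employee POJO built from the first four fields of each record in employee.txt (the class and variable names are our own) and exercises a LinkedList, a Stack, a Queue, a Set and a Map.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

// Model class (POJO) for one record of employee.txt
class Employee {
    private final String id, name, designation, dept;

    Employee(String id, String name, String designation, String dept) {
        this.id = id; this.name = name; this.designation = designation; this.dept = dept;
    }
    // getters (setters are analogous and omitted for brevity)
    public String getId()          { return id; }
    public String getName()        { return name; }
    public String getDesignation() { return designation; }
    public String getDept()        { return dept; }
    @Override public String toString() { return id + "," + name + "," + designation + "," + dept; }
}

public class CollectionDemo {
    public static void main(String[] args) throws IOException {
        List<Employee> list = new LinkedList<>();        // a) Linked List
        Deque<Employee> stack = new ArrayDeque<>();      // b) Stack (LIFO)
        Queue<Employee> queue = new ArrayDeque<>();      // c) Queue (FIFO)
        Set<String> designations = new HashSet<>();      // d) Set
        Map<String, Employee> byId = new HashMap<>();    // e) Map

        try (BufferedReader in = new BufferedReader(new FileReader("employee.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                Employee e = new Employee(f[0], f[1], f[2], f[3]);
                list.add(e);                             // boolean add(E e)
                stack.push(e);
                queue.offer(e);
                designations.add(e.getDesignation());
                byId.put(e.getId(), e);
            }
        }
        // Iterate and display objects from the collection
        for (Iterator<Employee> it = list.iterator(); it.hasNext(); ) {
            System.out.println(it.next());
        }
        System.out.println("Distinct designations: " + designations);
        System.out.println("Lookup e104: " + byId.get("e104"));
    }
}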

EXERCISE-2:-
AIM:-
i) Perform setting up and installing Hadoop in its three operating modes:
• Standalone
• Pseudo-Distributed
• Fully Distributed

ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-
1. Command for installing ssh is "sudo apt-get install ssh".
2. Command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into authorized_keys by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract Java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz.
5. Extract Eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz.
6. Extract Hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz.
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in the eclipse.ini file.


8. Export the Java path and Hadoop path in ~/.bashrc.

9. Check whether the installation is successful by checking the java version and hadoop version.
10. Check whether the Hadoop instance in standalone mode works correctly by running the built-in example jar for wordcount.
11. If the word count is displayed correctly in the part-r-00000 file, standalone mode is installed successfully.

ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:-
1. In order to install pseudo-distributed mode, we need to configure the Hadoop configuration files that reside in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the Java path.
3. Configure core-site.xml, which contains a property tag with a name and a value: name as fs.defaultFS and value as hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Run start-dfs.sh and start-yarn.sh to start the daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which lists all daemons. Create a directory in HDFS by using the command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command nano lendi.txt, copy it from the local directory to HDFS using the command hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample wordcount jar to check whether pseudo-distributed mode is working or not.
10. Display the contents of the output file by using the command hdfs dfs -cat /newdir/part-r-00000.

FULLY DISTRIBUTED MODE INSTALLATION: ALGORITHM
1. Stop all single-node clusters
$ stop-all.sh


2. Decide on one node as the NameNode (master) and the remaining nodes as DataNodes (slaves).

3. Copy the public key to all three hosts to get passwordless SSH access
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub lendi@l5sys24
4. Configure all configuration files to name the master and slave nodes.
$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml
5. Add the hostnames to the slaves file and save it.
$ nano slaves
6. Configure yarn-site.xml
$ nano yarn-site.xml
7. Do the following on the master node
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
8. Format the NameNode
9. Start the daemons on the master and slave nodes
10. END

INPUT: ubuntu@localhost> jps

OUTPUT:
DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager

EXERCISE-3:-
AIM:- Implement the following file management tasks in Hadoop:


• Adding files and directories


• Retrieving files
• Deleting Files

ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1 Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck; you should substitute your user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck
Step-2 Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat prints their contents to the console. To look at example.txt we can run:
hadoop fs -cat example.txt
Step-3 Deleting Files from HDFS
hadoop fs -rm example.txt
• Command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
• Adding a directory is done through the command "hdfs dfs -put lendi_english /".

Step-4 Copying Data from NFS to HDFS
• Command for copying from a local directory is "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/".
• View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
• Command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
• Command for deleting files is "hdfs dfs -rm -r /kartheek".
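The same add, retrieve and delete tasks can also be performed from Java; the following is a minimal sketch using the Hadoop FileSystem API, with paths mirroring the shell examples above (the class name and the local file names are only illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS must point at the running cluster, e.g. hdfs://localhost:9000
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Adding files and directories
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // Retrieving files (HDFS -> local filesystem)
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example.copy.txt"));

        // Deleting files (second argument = recursive)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}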


SAMPLE INPUT: Any data of structured, unstructured or semi-structured type.
EXPECTED OUTPUT:

EXERCISE-4:-
AIM:- Run a basic WordCount MapReduce program to understand the MapReduce paradigm.

ALGORITHM
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text
input data set. WordCount fits very well with the MapReduce programming model making it a great
example to understand the Hadoop Map/Reduce programming style. Our implementation consists of three
main parts:
1. Mapper
2. Reducer
3. Driver

Step-1. Write a Mapper

A Mapper overrides the "map" function of the class "org.apache.hadoop.mapreduce.Mapper", which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the WordCount map task is a line of text from the input data file, and the key is the line number: <line_number, line_of_text>. The map task outputs <word, one> for each word in the line of text.
Pseudo-code:
void Map (key, value) {
    for each word x in value:
        output.collect(x, 1);
}
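A Java version of this mapper, written against the org.apache.hadoop.mapreduce API named above (a sketch; the class name is our own).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, one);   // output <word, one> via the provided Context
        }
    }
}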
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program sums up the occurrences of each word into <word, occurrence> pairs.
Pseudo-code:
void Reduce (keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
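The corresponding reducer in Java (again a sketch with our own class name).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted by the mappers and writes <word, occurrence>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}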


Step-3. Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:
• Job Name: name of this job.
• Executable (Jar) Class: the main executable class; here, WordCount.
• Mapper Class: the class which overrides the "map" function; here, Map.
• Reducer Class: the class which overrides the "reduce" function; here, Reduce.
• Output Key: type of the output key; here, Text.
• Output Value: type of the output value; here, IntWritable.
• File Input Path
• File Output Path
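A driver that wires the configuration items listed above together; a sketch assuming the mapper and reducer classes from the previous steps, with the input and output paths taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count"); // Job Name
        job.setJarByClass(WordCount.class);              // Executable (Jar) Class
        job.setMapperClass(WordCountMapper.class);       // Mapper Class
        job.setReducerClass(WordCountReducer.class);     // Reducer Class
        job.setCombinerClass(WordCountReducer.class);    // optional local aggregation
        job.setOutputKeyClass(Text.class);               // Output Key type
        job.setOutputValueClass(IntWritable.class);      // Output Value type
        FileInputFormat.addInputPath(job, new Path(args[0]));    // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would typically be run as hadoop jar wordcount.jar WordCount <input dir> <output dir>, with the counts appearing in the part-r-00000 file under the output directory.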
INPUT:-
Set of data related to Shakespeare's comedies, glossary and poems.

OUTPUT:-

EXERCISE-5:-
AIM:- Write a MapReduce program that mines weather data.
ALGORITHM:- MAPREDUCE PROGRAM
Like WordCount, this program fits very well with the MapReduce programming model, making it a good example of the Hadoop MapReduce programming style; here the <key, value> pairs carry temperature readings rather than words. Our implementation consists of three main parts:


1. Mapper
2. Reducer
3. Main program
Step-1. Write a Mapper
A Mapper overrides the "map" function of the class "org.apache.hadoop.mapreduce.Mapper", which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the map task is a line of text from the input data file, and the key is the line number: <line_number, line_of_text>. The map task outputs <max_temp, one> and <min_temp, one> pairs from each line of text.
Pseudo-code:
void Map (key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map (key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the program sums up the values received for each temperature key into <key, aggregate> pairs.
Pseudo-code:
void Reduce (max_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
Step-3. Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:
Job Name: name of this job.
Executable (Jar) Class: the main executable class.
Mapper Class: the class which overrides the "map" function.
Reducer Class: the class which overrides the "reduce" function.


Output Key: type of the output key; here, Text.
Output Value: type of the output value; here, IntWritable.
File Input Path
File Output Path
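A hedged Java sketch of a mapper and reducer for this exercise, assuming a simple comma-separated record layout of year,temperature (the real layout depends on the weather data set used). Note that, unlike the word-count style pseudo-code above, this sketch aggregates with max() in the reducer to report the maximum temperature per year; a minimum-temperature job is symmetric, and the driver has the same shape as the WordCount driver of Exercise 4 with these classes plugged in.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parses "year,temperature" lines and emits <year, temperature>.
class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) return;                   // skip malformed records
        context.write(new Text(fields[0].trim()),
                      new IntWritable(Integer.parseInt(fields[1].trim())));
    }
}

// Reducer: keeps the maximum temperature seen for each year.
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new IntWritable(max));
    }
}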

INPUT:-

Set of Weather Data over the years

OUTPUT:-

EXERCISE-6:-
AIM:- Write a MapReduce program that implements matrix multiplication.
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse matrix format,
where each key is a pair of indices (i,j) and each value is the corresponding matrix element value. The
output files for matrix C=A*B are in the same format.
We have the following input parameters:
The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.


J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.
In the pseudo-code for the individual strategies below, we have intentionally avoided factoring common code for the purposes of clarity.
Steps
1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m.
Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb]
receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately
preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1

Reduce (key, valueList)


17. if key is (ib, kb, jb, 0)


18. // Save the A block.
19. sib = ib
20. skb = kb
21. Zero matrix A
22. for each value = (i, k, v) in valueList A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return // A[ib,kb] must be zero!
25. // Build the B block.
26. Zero matrix B
27. for each value = (k, j, v) in valueList B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
a. sum += A(i,k)*B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum

INPUT:-
Set of data sets over different clusters, taken as rows and columns.
OUTPUT:-
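For comparison, a hedged Java sketch of the simpler one-step strategy (not the blocked strategy in the pseudo-code above): the mapper replicates each A element across all columns of C and each B element across all rows of C, and the reducer computes one C(i,j) cell. It assumes input lines of the form "A,i,k,value" or "B,k,j,value" and that the dimensions I and J are passed through the job Configuration; all names here are illustrative, and the driver otherwise looks like the WordCount driver of Exercise 4.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits key "i,j" with value "A,k,a(i,k)" or "B,k,b(k,j)".
class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int I = context.getConfiguration().getInt("matrix.I", 0); // rows of A and C
        int J = context.getConfiguration().getInt("matrix.J", 0); // columns of B and C
        String[] f = value.toString().split(",");                 // matrix,row,col,value
        if (f[0].equals("A")) {
            for (int j = 0; j < J; j++)
                context.write(new Text(f[1] + "," + j), new Text("A," + f[2] + "," + f[3]));
        } else {
            for (int i = 0; i < I; i++)
                context.write(new Text(i + "," + f[2]), new Text("B," + f[1] + "," + f[3]));
        }
    }
}

// Reducer: for one cell (i,j), computes sum over k of a(i,k) * b(k,j).
class MatrixReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Double> a = new HashMap<>();
        Map<Integer, Double> b = new HashMap<>();
        for (Text v : values) {
            String[] f = v.toString().split(",");
            (f[0].equals("A") ? a : b).put(Integer.parseInt(f[1]), Double.parseDouble(f[2]));
        }
        double sum = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double bkj = b.get(e.getKey());
            if (bkj != null) sum += e.getValue() * bkj;
        }
        if (sum != 0) context.write(key, new DoubleWritable(sum));
    }
}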

EXERCISE-7:-
AIM:- Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter the data.
ALGORITHM
STEPS FOR INSTALLING APACHE PIG
1) Extract pig-0.15.0.tar.gz and move it to the home directory.


2) Set the environment for Pig in the bashrc file.
3) Pig can run in two modes: Local Mode and Hadoop Mode.
pig -x local and pig
4) Grunt shell
grunt>
5) LOADING Data into Grunt Shell
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) as (ATTRIBUTE : DataType1, ATTRIBUTE : DataType2, ...);
6) Describe Data
DESCRIBE DATA;
7) DUMP Data
DUMP DATA;
8) FILTER Data
FDATA = FILTER DATA BY ATTRIBUTE = VALUE;
9) GROUP Data
GDATA = GROUP DATA BY ATTRIBUTE;
10) Iterating Data
FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN, ATTRIBUTE = <VALUE>;
11) Sorting Data
SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;
12) LIMIT Data
LIMIT_DATA = LIMIT DATA COUNT;
13) JOIN Data
JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...), DATA2 BY (ATTRIBUTE3, ..., ATTRIBUTE_N);
INPUT: Website click count data
OUTPUT:


EXERCISE-8:-
AIM:- Install and run Hive, then use Hive to create, alter and drop databases, tables, views, functions and indexes.
ALGORITHM:
APACHE HIVE INSTALLATION STEPS
1) Install MySQL server
sudo apt-get install mysql-server
2) Configure the MySQL username and password.
3) Create a user and grant all privileges
mysql -uroot -proot
CREATE USER <USER_NAME> IDENTIFIED BY <PASSWORD>;
4) Extract and configure Apache Hive
tar xvfz apache-hive-1.0.1-bin.tar.gz
5) Move Apache Hive from the local directory to the home directory.
6) Set the CLASSPATH in bashrc
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin
7) Configuring hive-default.xml by adding My SQL Server Credentials
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value> jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
</value>
</property>
<property>


<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
8) Copy mysql-connector-java.jar to the hive/lib directory.

SYNTAX for HIVE Database Operations

DATABASE Creation
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping Tables in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]
Loading Data into table log_data
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax:
ALTER TABLE name RENAME TO new_name


ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Creating and Dropping Views

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)]
[COMMENT table_comment] AS SELECT ...
Dropping View
Syntax:
DROP VIEW view_name
Functions in HIVE
Mathematical Functions:- round(), ceil(), floor() etc.
String Functions:- substr(), upper(), concat(), regexp_replace() etc.
Date and Time Functions:- year(), month(), day(), to_date() etc.
Aggregate Functions:- sum(), min(), max(), count(), avg() etc.
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating Index


CREATE INDEX index_ip ON TABLE log_data(ip_address) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
Altering and Inserting Index
ALTER INDEX index_ip_address ON log_data REBUILD;
Storing Index Data in Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
DROP INDEX index_name ON table_name;
INPUT: Web server log data
OUTPUT:
