Big Data Analytics Lab
Lab Manual
Department of Computer Science and
Engineering
8th SEMESTER
Submitted To: Dr. Narpat Singh Sir
Submitted By: ASHISH KULHARI
Branch: CSE 4th Year
Roll no: 18EEBCS012
EXERCISE-1:-
AIM:- Implement the following Data Structures in Java
a) Linked Lists b) Stacks c) Queues d) Sets e) Maps
SKELETON OF JAVA.UTIL.COLLECTION INTERFACE
public interface Collection<E> extends Iterable<E> {
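    // Core methods declared by java.util.Collection (abridged)
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    Iterator<E> iterator();
    boolean add(E e);
    boolean remove(Object o);
    boolean containsAll(Collection<?> c);
    boolean addAll(Collection<? extends E> c);
    boolean removeAll(Collection<?> c);
    boolean retainAll(Collection<?> c);
    void clear();
}
The concrete classes used in this exercise implement this interface directly (LinkedList, ArrayDeque, HashSet) or follow the companion java.util.Map interface (HashMap). A minimal usage sketch is given below; the class name DataStructureDemo and the sample values are illustrative only.

import java.util.*;

public class DataStructureDemo {
    public static void main(String[] args) {
        // a) Linked List
        LinkedList<String> list = new LinkedList<>();
        list.add("Hadoop");
        list.addFirst("Big");
        list.addLast("Data");

        // b) Stack (LIFO)
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(10);
        stack.push(20);
        System.out.println("Stack top: " + stack.pop());   // prints 20

        // c) Queue (FIFO)
        Queue<Integer> queue = new LinkedList<>();
        queue.offer(1);
        queue.offer(2);
        System.out.println("Queue head: " + queue.poll()); // prints 1

        // d) Set (duplicates are ignored)
        Set<String> set = new HashSet<>(list);
        set.add("Hadoop");

        // e) Map (key/value pairs), used here as a small word count
        Map<String, Integer> map = new HashMap<>();
        for (String word : list) {
            map.merge(word, 1, Integer::sum);
        }
        System.out.println(list + " " + set + " " + map);
    }
}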
EXERCISE-2:-
AIM:-
i) Set up and install Hadoop in its three operating modes:
• Standalone
• Pseudo Distributed
• Fully Distributed
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-
1. Install ssh with the command "sudo apt-get install ssh".
2. Generate an ssh key pair with the command ssh-keygen -t rsa -P "".
3. Append the public key to the authorized keys file with the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract Java with the command tar xvfz jdk-8u60-linux-i586.tar.gz.
5. Extract Eclipse with the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz.
6. Extract Hadoop with the command tar xvfz hadoop-2.7.1.tar.gz.
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/, then configure the Java path in the eclipse.ini file (environment variables are sketched below).
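Hadoop and the example programs also need JAVA_HOME and HADOOP_HOME available in the shell. A minimal ~/.bashrc sketch is shown below; the extraction paths (/usr/lib/jvm/jdk1.8.0_60 and /home/lendi/hadoop-2.7.1) are assumptions that must be adjusted to match your own machine.
# Assumed install locations -- adjust to your own paths
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_60
export HADOOP_HOME=/home/lendi/hadoop-2.7.1
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Reload the shell configuration and verify the standalone installation
source ~/.bashrc
java -version
hadoop version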
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:-
1. To install pseudo-distributed mode we need to edit the Hadoop configuration files that reside in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by setting the Java path.
3. Configure core-site.xml, which contains a property tag with a name and a value: the name is fs.defaultFS and the value is hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml. (Minimal versions of these four files are sketched after these steps.)
7. Now format the name node with the command hdfs namenode -format.
8. Run start-dfs.sh and start-yarn.sh to start the daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps to view all running daemons. Create a directory in HDFS with the command hdfs dfs -mkdir /csedir, enter some data into lendi.txt with the command nano lendi.txt, copy it from the local directory into HDFS with the command hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample wordcount jar file to check whether pseudo-distributed mode is working or not.
10. Display the contents of the output file with the command hdfs dfs -cat /newdir/part-r-00000.
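The four configuration files in steps 3-6 can be kept minimal for a single-node setup. The sketch below is an assumption-level starting point: only fs.defaultFS comes from the steps above; the replication factor of 1, YARN as the MapReduce framework and the shuffle auxiliary service are the usual pseudo-distributed defaults and may need adjusting.
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>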
EXERCISE-3:-
AIM:-
Implement the following file management tasks in Hadoop: adding files and directories to HDFS, retrieving files, and deleting files.
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1 Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt .
hadoop fs -put example.txt /user/chuck
Step-2 Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat prints a file's contents to the console. To display example.txt, we can run the following command:
hadoop fs -cat example.txt
Step-3 Deleting Files from HDFS
hadoop fs -rm example.txt
• Command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
• Adding a file or directory is done through the command "hdfs dfs -put lendi_english /" (a consolidated sketch of these tasks follows below).
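The sketch below pulls the three file-management tasks together; the directory name /lendicse and the file name lendi.txt follow the examples used elsewhere in this manual and can be replaced with your own paths.
# Step-1: add a directory and a file to HDFS
hdfs dfs -mkdir /lendicse
hdfs dfs -put lendi.txt /lendicse

# Step-2: retrieve the file (copy it back locally, or print it to the console)
hdfs dfs -get /lendicse/lendi.txt ./lendi_copy.txt
hdfs dfs -cat /lendicse/lendi.txt

# Step-3: delete the file and then the directory
hdfs dfs -rm /lendicse/lendi.txt
hdfs dfs -rm -r /lendicse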
SAMPLE INPUT: Any data in structured, unstructured or semi-structured format.
EXPECTED OUTPUT:
EXERCISE-4:-
AIM:-
Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
ALGORITHM
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example for understanding the Hadoop Map/Reduce programming style. Our implementation consists of three main parts (a complete sketch follows the list):
1. Mapper
2. Reducer
3. Driver
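A minimal, complete sketch of the program is given below. It follows the standard org.apache.hadoop.mapreduce API shipped with Hadoop 2.x; the class name WordCount and the assumption that the input and output paths arrive as command-line arguments are illustrative choices.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits <word, 1> for every word in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures and submits the job
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it can be run against the pseudo-distributed cluster with, for example, hadoop jar wordcount.jar WordCount /csedir /newdir, after which the result appears in /newdir/part-r-00000 as shown in Exercise 2.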
OUTPUT:-
EXERCISE-5:-
AIM:- Write a MapReduce program that mines weather data.
ALGORITHM:-
MAPREDUCE PROGRAM
The weather-mining program scans a weather data set and reports on the maximum and minimum temperatures it contains. The task fits well with the MapReduce programming model, making it another good example of the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Main program
Step-1. Write a Mapper
A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper", which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the map task is a line of text from the input data file, and the key is its offset: <line_number, line_of_text>. The map task outputs a <temperature, one> pair for each maximum and minimum temperature reading found in the line of text.
Pseudo-code:
void Map (key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map (key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
Step-2 Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the program sums up the values collected for each temperature key and emits pairs such as <max_temp, occurrence>.
Pseudo-code:
void Reduce (max_temp, <list of values>) {
    for each x in <list of values>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of values>) {
    for each x in <list of values>:
        sum += x;
    final_output.collect(min_temp, sum);
}
Step-3 Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:
Job Name: the name of this job.
Executable (Jar) Class: the main executable class; here, the weather-mining class sketched below.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
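A sketch that follows the pseudo-code above is given below. It assumes each input line carries a date, a maximum temperature and a minimum temperature separated by whitespace; the class name WeatherData and that input layout are illustrative assumptions and must be adapted to the real data set.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherData {

  // Mapper: assumes each line is "<date> <max_temp> <min_temp>" and emits
  // <"MAX <temp>", 1> and <"MIN <temp>", 1>, mirroring the two Map functions above.
  public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().trim().split("\\s+");
      if (fields.length >= 3) {
        context.write(new Text("MAX " + fields[1]), one);
        context.write(new Text("MIN " + fields[2]), one);
      }
    }
  }

  // Reducer: sums the occurrence count of every max_temp / min_temp reading.
  public static class TempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: wires the Mapper and Reducer together as described above.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "weather data");
    job.setJarByClass(WeatherData.class);
    job.setMapperClass(TempMapper.class);
    job.setReducerClass(TempReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}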
INPUT:-
OUTPUT:-
EXERCISE-6:-
AIM:- Write a MapReduce program that implements matrix multiplication.
ALGORITHM
We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix element value. The output files for matrix C = A*B are in the same format.
We have the following input parameters:
• The path of the input file or directory for matrix A.
• The path of the input file or directory for matrix B.
• The path of the directory for the output files for matrix C.
• strategy = 1, 2, 3 or 4.
• R = the number of reducers.
• I = the number of rows in A and C.
• K = the number of columns in A and rows in B.
A sketch of the one-pass strategy follows this list.
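The sketch below implements the simplest one-pass (replication) strategy in a single job. The input line format "A,i,j,value" / "B,j,k,value", the class name MatrixMultiply and the extra parameter J (the number of columns in B and C) are assumptions made for illustration; the strategy and R parameters listed above are not used in this single-job version.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {

  // Mapper: replicates each element of A and B to every cell of C it contributes to.
  // Input lines are assumed to look like "A,i,j,value" or "B,j,k,value".
  public static class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      int numRows = conf.getInt("I", 0); // I = rows of A and C
      int numCols = conf.getInt("J", 0); // J = columns of B and C (assumed extra parameter)
      String[] f = value.toString().split(",");
      if (f[0].equals("A")) {
        for (int k = 0; k < numCols; k++) {          // A(i,j) feeds C(i,k) for every k
          context.write(new Text(f[1] + "," + k), new Text("A," + f[2] + "," + f[3]));
        }
      } else {
        for (int i = 0; i < numRows; i++) {          // B(j,k) feeds C(i,k) for every i
          context.write(new Text(i + "," + f[2]), new Text("B," + f[1] + "," + f[3]));
        }
      }
    }
  }

  // Reducer: for the cell (i,k), computes the sum over j of A(i,j) * B(j,k).
  public static class MatrixReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Map<Integer, Double> a = new HashMap<>();
      Map<Integer, Double> b = new HashMap<>();
      for (Text val : values) {
        String[] f = val.toString().split(",");
        Map<Integer, Double> target = f[0].equals("A") ? a : b;
        target.put(Integer.parseInt(f[1]), Double.parseDouble(f[2]));
      }
      double sum = 0.0;
      for (Map.Entry<Integer, Double> e : a.entrySet()) {
        Double bVal = b.get(e.getKey());
        if (bVal != null) {
          sum += e.getValue() * bVal;
        }
      }
      context.write(key, new Text(Double.toString(sum)));
    }
  }

  // Driver: arguments are <input path> <output path> <I> <J>.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("I", Integer.parseInt(args[2]));
    conf.setInt("J", Integer.parseInt(args[3]));
    Job job = Job.getInstance(conf, "matrix multiply");
    job.setJarByClass(MatrixMultiply.class);
    job.setMapperClass(MatrixMapper.class);
    job.setReducerClass(MatrixReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}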
EXERCISE-7:-
AIM:- Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter data.
ALGORITHM
STEPS FOR INSTALLING APACHE PIG
4) Grunt Shell
grunt>
5) LOADING Data into the Grunt Shell
DATA = LOAD <PATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE1 : DataType1, ATTRIBUTE2 : DataType2, ...);
6) DESCRIBE Data
DESCRIBE DATA;
7) DUMP Data
DUMP DATA;
8) FILTER Data
FDATA = FILTER DATA BY ATTRIBUTE == VALUE;
9) GROUP Data
GDATA = GROUP DATA BY ATTRIBUTE;
10) Iterating Data
FOR_DATA = FOREACH GDATA GENERATE group, AGGREGATE_FUN(DATA.ATTRIBUTE);
11) Sorting Data
SORT_DATA = ORDER DATA BY ATTRIBUTE ASC|DESC;
12) LIMIT Data
LIMIT_DATA = LIMIT DATA COUNT;
13) JOIN Data
JDATA = JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...), DATA2 BY (ATTRIBUTE3, ..., ATTRIBUTEN);
A worked script over click-count data is sketched below.
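The short Pig Latin sketch below ties these operators together. The file names clicks.txt and users.txt, their column layouts and the tab delimiter are assumptions chosen to match the click-count input named under INPUT.

-- Load click records: (user_id, url, clicks), tab-delimited
clicks = LOAD '/pigdata/clicks.txt' USING PigStorage('\t')
         AS (user_id:int, url:chararray, clicks:int);

-- Project and filter: keep only pages with at least 10 clicks
popular = FILTER clicks BY clicks >= 10;
pages   = FOREACH popular GENERATE url, clicks;

-- Group by url and sum the click counts
grouped = GROUP pages BY url;
totals  = FOREACH grouped GENERATE group AS url, SUM(pages.clicks) AS total_clicks;

-- Sort descending and keep the top 5
ranked  = ORDER totals BY total_clicks DESC;
top5    = LIMIT ranked 5;

-- Join with user data: (user_id, name)
users   = LOAD '/pigdata/users.txt' USING PigStorage('\t') AS (user_id:int, name:chararray);
joined  = JOIN clicks BY user_id, users BY user_id;

DUMP top5;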
INPUT: Website Click Count Data
OUTPUT:
EXERCISE-8:-
AIM:- Install and run Hive, then use Hive to create, alter and drop databases, tables, views, functions and indexes.
ALGORITHM:
Apache HIVE INSTALLATION STEPS
1) Install MySQL Server
sudo apt-get install mysql-server
2) Configure the MySQL user name and password for the Hive metastore in hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
8) Copy the mysql-connector-java.jar into the hive/lib directory.
Using an Index in Queries
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping an Index
DROP INDEX INDEX_NAME ON TABLE_NAME;
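A short HiveQL sketch covering the create/alter/drop operations named in the AIM is given below. The database name weblogs, the table layout and the index name are assumptions chosen to match the web server log input.

-- Databases
CREATE DATABASE IF NOT EXISTS weblogs;
USE weblogs;

-- Tables
CREATE TABLE access_log (
  ip STRING,
  request_time STRING,
  url STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ALTER TABLE access_log ADD COLUMNS (bytes INT);
ALTER TABLE access_log RENAME TO server_log;

-- Views
CREATE VIEW error_requests AS
SELECT ip, url, status FROM server_log WHERE status >= 400;

-- Indexes (compact index, as used with the SET command above)
CREATE INDEX url_index ON TABLE server_log (url)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX url_index ON server_log REBUILD;

-- Dropping objects
DROP INDEX url_index ON server_log;
DROP VIEW error_requests;
DROP TABLE server_log;
DROP DATABASE weblogs;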
INPUT: Web Server Log Data
OUTPUT: