CCS334 BIG DATA ANALYTICS
Name : ………………………………………………………
Reg. No : ………………………………………………………
Branch : ………………………………………………………
Year/Sem : ………………………………………………………
KARPAGAM INSTITUTE OF TECHNOLOGY
COIMBATORE - 641 105
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
conducted on ……………………
VISION
To impart quality technical education emphasizing innovations and research with social and
ethical values.
MISSION
COURSE OBJECTIVES:
LIST OF EXPERIMENTS:
1. Downloading and installing Hadoop; understanding different Hadoop modes, startup scripts and configuration files.
2. Hadoop implementation of file management tasks, such as adding files and directories, retrieving files and deleting files.
3. Implement matrix multiplication with Hadoop Map Reduce.
4. Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
5. Installation of Hive along with practice examples.
6. Installation of HBase, installing Thrift along with practice examples.
7. Practice importing and exporting data from various databases.
8. Run Pig Latin scripts to sort, group, join, project and filter the data.
COURSE OUTCOMES:
CO1: Describe big data and use cases from selected business domains.
CO2: Explain NoSQL big data management.
CO3: Experiment with Hadoop and HDFS
CO4: Make use of Hadoop to perform map reduce analytics
CO5: Make use of Hadoop related tools such as HBase, Cassandra, Pig and Hive for Big Data Analytics
INDEX
Columns: S.No | Date | Name of the Experiment | Page No. | Marks: Record (15), Practical Assessment (25), Viva (10), Total (50) | Staff Signature
3. Implementation of Matrix Multiplication with Hadoop Map Reduce (Page 10)
6. Installation of HBase, Installing Thrift along with Practice Examples (Page 18)
Ex. No: 01
Date:
DOWNLOADING AND INSTALLING HADOOP; UNDERSTANDING DIFFERENT HADOOP MODES, STARTUP SCRIPTS, CONFIGURATION FILES
AIM
To set up and install Hadoop in its three operating modes:
• Standalone
• Pseudo Distributed
• Fully Distributed
DESCRIPTION
Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform and additionally requires Cygwin to run. During the Cygwin installation process, you should include the OpenSSH package if you plan to run Hadoop in pseudo-distributed mode.
ALGORITHM
1. The command for installing ssh is "sudo apt-get install ssh".
2. The command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into rsa.pub by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract Java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz
5. Extract Eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract Hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in the eclipse.ini file.
8. Export the Java path and Hadoop path in ~/.bashrc.
9. Check whether the installation is successful by checking the Java version and Hadoop version.
10. Check whether the Hadoop instance in standalone mode works correctly by running the built-in WordCount example from the Hadoop examples jar.
11. If the word count is displayed correctly in the part-r-00000 file, standalone mode is installed successfully.
8. Type the commands start-dfs.sh and start-yarn.sh to start the daemons NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps to view all running daemons. Create a directory in HDFS using the command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command nano lendi.txt, copy it from the local directory to HDFS using the command hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample WordCount jar to check whether pseudo-distributed mode is working or not.
10. Display the contents of the output file by using the command hdfs dfs -cat /newdir/part-r-00000.
INPUT
ubuntu@localhost> jps
OUTPUT
DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager
RESULT
Thus, the experiment concludes with the successful installation of Hadoop in various modes, verified startup
scripts, and proper configuration file setup.
Ex. No: 02
Date:
HADOOP IMPLEMENTATION OF FILE MANAGEMENT TASKS, SUCH AS ADDING FILES AND DIRECTORIES, RETRIEVING FILES AND DELETING FILES
AIM
Implement the following file management tasks in Hadoop:
• Adding files and directories
• Retrieving files
• Deleting Files
DESCRIPTION
HDFS is a scalable distributed file system designed to scale to petabytes of data while running on top of the
underlying file system of the operating system. HDFS keeps track of where the data resides in a network by
associating the name of its rack (or network switch) with the dataset. This allows Hadoop to efficiently schedule
tasks to those nodes that contain data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides
a set of command-line utilities that work similarly to the Linux file commands and serve as your primary interface with HDFS. We are going to have a look at HDFS by interacting with it from the command line, covering the most common file management tasks in Hadoop: adding files and directories, retrieving files, and deleting files.
ALGORITHM
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
STEP-1: ADDING FILES AND DIRECTORIES TO HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
STEP-4: COPYING DATA FROM NFS TO HDFS
• The command for copying a directory from the local file system is "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/".
• View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
• The command for listing items in HDFS is "hdfs dfs -ls hdfs://localhost:9000/".
• The command for deleting files is "hdfs dfs -rm -r /kartheek".
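The same tasks can also be scripted from Python using the hdfs (HdfsCLI) package mentioned later in this manual. The sketch below is a minimal example under the assumption that the NameNode's WebHDFS interface is reachable at http://localhost:50070 (the Hadoop 2.x default; Hadoop 3.x uses port 9870); the user name and paths are placeholders.

# pip install hdfs
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (host, port and user are assumptions).
client = InsecureClient('http://localhost:50070', user='hadoop')

# Adding files and directories
client.makedirs('/user/hadoop/lendicse')                       # like: hdfs dfs -mkdir
client.upload('/user/hadoop/lendicse/glossary',                # like: hdfs dfs -copyFromLocal
              '/home/lendi/Desktop/shakes/glossary')

# Retrieving files
print(client.list('/user/hadoop/lendicse'))                    # like: hdfs dfs -ls
with client.read('/user/hadoop/lendicse/glossary') as reader:
    print(reader.read()[:200])                                 # like: hdfs dfs -cat (first 200 bytes)

# Deleting files
client.delete('/user/hadoop/lendicse/glossary')                # like: hdfs dfs -rm
client.delete('/user/hadoop/lendicse', recursive=True)         # like: hdfs dfs -rm -r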
INPUT
Input can be any data of structured, unstructured or semi-structured format.
OUTPUT
RESULT
Thus, the Hadoop implementation for file management tasks, including adding files and directories, retrieving files,
and deleting files, has been successfully executed and verified.
Ex. No: 03
Date:
IMPLEMENTATION OF MATRIX MULTIPLICATION WITH HADOOP MAP REDUCE
AIM
To write a Map Reduce Program that implements Matrix Multiplication
DESCRIPTION
We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix is represented as a record (i, j, value). As an example, consider a matrix and its relational representation. It is important to understand that this relation is a very inefficient representation if the matrix is dense. Say we have 5 rows and 6 columns; then we need to store only 30 values. But with the above relation we are storing 30 row_ids, 30 col_ids and 30 values, so in effect we are tripling the data. A natural question then arises: why do we need to store it in this format? In practice most matrices are sparse. In a sparse matrix not all cells hold values, so we do not have to store those cells in the database, and this format turns out to be very efficient for storing such matrices.
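As a quick illustration of the (i, j, value) relation described above, the short Python snippet below converts a small, hypothetical dense matrix into its sparse record form, keeping only the non-zero cells.

# A 2x3 dense matrix used purely as sample data.
dense = [
    [0, 0, 5],
    [7, 0, 0],
]

# Sparse representation: keep only the non-zero cells as (i, j, value) records.
sparse = [(i, j, v)
          for i, row in enumerate(dense)
          for j, v in enumerate(row)
          if v != 0]

print(sparse)  # [(0, 2, 5), (1, 0, 7)] - two records instead of six stored cells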
ALGORITHM
We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i, j) and each value is the corresponding matrix element. The output files for the matrix C = A*B are in the same format.
We have the following input parameters:
• I, K, J: the dimensions of the matrices (A is I x K, B is K x J, and C = A*B is I x J)
• IB, KB, JB: the block sizes used to partition A into IB x KB blocks and B into KB x JB blocks
• R: the number of reducers
In the pseudo-code for the individual strategies below, we have intentionally avoided factoring common code for
the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise, our focus here is on mastering the Map Reduce complexities, not on optimizing the sequential matrix multiplication algorithm for the individual blocks.
STEPS
1. setup()
2. var NIB = (I-1)/IB + 1
3. var NKB = (K-1)/KB + 1
4. var NJB = (J-1)/JB + 1
5. map(key, value)
6. if from matrix A with key = (i,k) and value = a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key = (k,j) and value = b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
reduce(key, valueList)
17. if key is (ib, kb, jb, 0)
18. // Save the A block.
19. sib = ib
20. skb = kb
21. zero matrix A
22. for each value = (i, k, v) in valueList: A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return // A[ib,kb] must be zero!
25. // Build the B block.
26. zero matrix B
27. for each value = (k, j, v) in valueList: B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B: sum += A(i,k)*B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum
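To connect the pseudo-code to something executable, the sketch below simulates the map, shuffle and reduce phases in plain Python for the simpler element-wise (non-blocked) strategy, using sparse (i, j, value) records as input. The matrices, dimensions and names here are hypothetical sample data and are not the block algorithm above; in a real job the same mapper and reducer logic would run inside Hadoop.

from collections import defaultdict

# Sparse (i, j, value) records for two small sample matrices (assumed data):
# A is 2x3 and B is 3x2, so C = A*B is 2x2.
A = [(0, 0, 1), (0, 2, 2), (1, 1, 3)]
B = [(0, 1, 4), (1, 0, 5), (2, 0, 6)]
I, K, J = 2, 3, 2

def mapper(record, tag):
    # Emit ((i, j), (tag, shared index, value)): each A element is replicated across
    # the J columns of C it contributes to, each B element across the I rows.
    if tag == 'A':
        i, k, a = record
        for j in range(J):
            yield (i, j), ('A', k, a)
    else:
        k, j, b = record
        for i in range(I):
            yield (i, j), ('B', k, b)

# Shuffle phase: group the intermediate values by key, as Hadoop would.
groups = defaultdict(list)
for rec in A:
    for key, val in mapper(rec, 'A'):
        groups[key].append(val)
for rec in B:
    for key, val in mapper(rec, 'B'):
        groups[key].append(val)

def reducer(key, values):
    # Join the A and B contributions on the shared index k and sum the products.
    a_vals = {k: v for tag, k, v in values if tag == 'A'}
    b_vals = {k: v for tag, k, v in values if tag == 'B'}
    total = sum(a_vals[k] * b_vals[k] for k in a_vals if k in b_vals)
    if total != 0:
        yield key, total

for key in sorted(groups):
    for cell, value in reducer(key, groups[key]):
        print(cell, value)  # e.g. (0, 0) 12, since A(0,2)*B(2,0) = 2*6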
INPUT
Two input matrices A and B, supplied as sets of (i, j, value) records in sparse matrix format.
OUTPUT
RESULT
Thus, the matrix multiplication implementation using Hadoop Map Reduce has been successfully completed and
verified.
Ex. No: 04
Date:
RUN A BASIC WORD COUNT MAP REDUCE PROGRAM TO UNDERSTAND MAP REDUCE PARADIGM
AIM
To Run a basic Word Count Map Reduce Program to understand Map Reduce Paradigm.
DESCRIPTION
Map Reduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability
across hundreds or thousands of servers in a Hadoop cluster. The Map Reduce concept is fairly simple to understand
for those who are familiar with clustered scale-out data processing solutions. The term Map Reduce actually refers
to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data
and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The
reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the
sequence of the name Map Reduce implies, the reduce job is always performed after the map job.
ALGORITHM
MAP REDUCE PROGRAM
Word Count is a simple program which counts the number of occurrences of each word in a given text input data
set. Word Count fits very well with the Map Reduce programming model making it a great example to understand
the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
PSEUDO-CODE
void Map (key, value)
{
for each word x in value:
output.collect(x, 1);
}
PSEUDO-CODE
void Reduce (keyword, <list of values>)
{
sum = 0;
for each x in <list of values>:
sum += x;
final_output.collect(keyword, sum);
}
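For a runnable version of the same logic, the two short scripts below target Hadoop Streaming in Python; the file names mapper.py and reducer.py and the input/output paths are placeholders. Streaming delivers the mapper output to the reducer sorted by key, which is what the reducer relies on when it sums runs of identical words.

# mapper.py - read lines from standard input and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - input arrives sorted by word, so identical words are adjacent;
# accumulate the counts for each run and emit "word<TAB>total"
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A typical invocation (all paths are placeholders) is: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /input -output /output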
INPUT
A set of data related to Shakespeare's comedies, glossary and poems.
OUTPUT
RESULT
Thus, the basic Word Count Map Reduce program was executed and successfully verified, providing a clear
understanding of the Map Reduce paradigm.
Ex. No: 05
Date:
INSTALLATION OF HIVE ALONG WITH PRACTICE EXAMPLES
AIM
To install Hive on the system and to work through practice examples using a Python (PyHive) program.
ALGORITHM
Step 1: Install PyHive
Step 2: Connect to the Hive server
Step 3: Create and connect to the database
Step 4: Query the table and run aggregations
Step 5: Close the cursor and connection
PROGRAM
Install PyHive (pip install pyhive, typically along with the thrift and thrift_sasl dependencies), then:
from pyhive import hive

# Connect to HiveServer2 (host, port and username are assumptions for a local setup)
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()
# Create a database
cursor.execute("CREATE DATABASE IF NOT EXISTS mydb")
cursor.execute("USE mydb")
# Create a table
cursor.execute("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, salary DOUBLE)")
# Create an external table over existing HDFS data (the location is a placeholder)
cursor.execute("CREATE EXTERNAL TABLE IF NOT EXISTS ext_employee (id INT, name STRING, salary DOUBLE) "
               "LOCATION '/path/to/external_data/'")
# Run aggregations
cursor.execute("SELECT AVG(salary) FROM employee")
result = cursor.fetchone()
print("Average Salary:", result[0])
# Close the cursor and connection
cursor.close()
conn.close()
OUTPUT
RESULT
Thus, the installation of Hive was completed and the practice examples were executed and verified successfully.
Ex. No: 06
Date:
INSTALLATION OF HBASE, INSTALLING THRIFT ALONG WITH PRACTICE EXAMPLES
AIM
To install HBase and Thrift and to work through practice examples using a Python program.
ALGORITHM
Step 1: Install HBase and Thrift
Step 2: Import the packages needed for the practice examples
Step 3: Create a connection to the HBase Thrift server using the protocol
Step 4: Define and put data into HBase
Step 5: Get the data back and close the HBase connection
PROGRAM
namespace python hbase
service HBase {
bool put(1: binary row, 2: binary column, 3: binary value),
binary get(1: binary row, 2: binary column),
}
thrift -r --gen py HBase.thrift
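In practice the generated bindings are rarely used directly; the happybase package wraps HBase's Thrift interface and follows the algorithm above. The sketch below is a minimal example under the assumption that the HBase Thrift server is running on localhost:9090; the table name, column family and values are placeholders.

# pip install happybase
import happybase

# Step 3: connect to the HBase Thrift server (host and port are assumptions)
connection = happybase.Connection(host='localhost', port=9090)

# Create a sample table with one column family if it does not exist yet
if b'employee' not in connection.tables():
    connection.create_table('employee', {'info': dict()})
table = connection.table('employee')

# Step 4: put data into HBase (row key, column family:qualifier, value)
table.put(b'row1', {b'info:name': b'Asha', b'info:salary': b'45000'})

# Step 5: get the data back and close the connection
row = table.row(b'row1')
print(row[b'info:name'], row[b'info:salary'])
connection.close()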
OUTPUT
RESULT
Thus, HBase and Thrift were installed and the practice examples were executed and verified successfully using a Python program.
Ex. No: 07
Date:
PRACTICE IMPORTING AND EXPORTING DATA FROM VARIOUS DATABASES
AIM
To write a Python program to practice importing and exporting data from various databases.
ALGORITHM
Step 1: Install the required libraries such as pyspark and hdfs
Step 2: Import the necessary packages
Step 3: Create a SparkSession using the PySpark package
Step 4: Import and export the data using PySpark
Step 5: Close the connection (stop the Spark session)
PROGRAM
# importing necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

if __name__ == "__main__":
    # creating the Spark session
    spark = SparkSession.builder.appName("ImportExportExample").getOrCreate()
    schema_lst = ["State", "Cases", "Recovered", "Deaths"]
    # sample rows (hypothetical data) standing in for records imported from a source database
    rd_df = [("Tamil Nadu", 3500, 3200, 45),
             ("Kerala", 2800, 2650, 30)]
    # creating the dataframe using createDataFrame function
    df = spark.createDataFrame(rd_df, schema_lst)
    df.show()
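The export and import steps themselves can be added at the end of the same main block. A minimal sketch using Spark's built-in CSV reader and writer follows; the output path is a placeholder, and a relational source such as MySQL would use spark.read.format("jdbc") with the appropriate connection options instead.

    # export: write the dataframe out (to the local file system or HDFS)
    df.write.mode("overwrite").csv("/tmp/covid_export", header=True)

    # import: read the exported data back into a new dataframe
    imported_df = spark.read.csv("/tmp/covid_export", header=True, inferSchema=True)
    imported_df.filter(col("Cases") > 3000).show()

    # close the connection by stopping the Spark session
    spark.stop()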
OUTPUT
RESULT
Thus, the program to practice importing and exporting data from various databases was executed and verified successfully.
Ex. No: 08
Date:
RUN APACHE PIG LATIN SCRIPTS TO SORT, GROUP, JOIN, PROJECT AND FILTER THE DATA
AIM
To run Apache Pig Latin scripts to sort, group, join, project and filter the data.
ALGORITHM
Step 1: Extract the pig-0.15.0.tar.gz and move to home directory
Step 2: Set the environment of PIG in bashrc file.
Step 3: Pig can run in two modes
Step 4: Grunt Shell
Step 5: LOADING Data into Grunt Shell
Step 6: Describe Data
Step 7: DUMP Data
Step 8: FILTER Data
Step 9: GROUP Data
Step 10: Iterating Data
Step 11: Sorting Data
Step 12: LIMIT Data
Step 13: JOIN Data
PROGRAM
Pig can run in local mode or Hadoop (MapReduce) mode:
pig -x local (local mode) and pig (Hadoop mode)
grunt>
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE1 : DataType1, ATTRIBUTE2 : DataType2, ...);
DESCRIBE DATA;
DUMP DATA;
FDATA = FILTER DATA BY ATTRIBUTE == VALUE;
GDATA = GROUP DATA BY ATTRIBUTE;
FOR_DATA = FOREACH GDATA GENERATE group, AGGREGATE_FUNCTION(DATA.ATTRIBUTE);
SORT_DATA = ORDER DATA BY ATTRIBUTE ASC|DESC;
LIMIT_DATA = LIMIT DATA COUNT;
JOIN_DATA = JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...), DATA2 BY (ATTRIBUTE3, ..., ATTRIBUTEN);
INPUT
Input as Website Click Count Data
OUTPUT
RESULT
Thus, running Apache Pig Latin scripts to sort, group, join, project and filter the data was executed and verified successfully.