CCS334 BIG DATA ANALYTICS
Name : ………………………………………………………
Reg. No : ………………………………………………………
Branch : ………………………………………………………
Year/Sem : ………………………………………………………
KARPAGAM INSTITUTE OF TECHNOLOGY
COIMBATORE - 641 105
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
conducted on ……………………
VISION
To impart quality technical education emphasizing innovations and research with social and
ethical values.
MISSION
COURSE OBJECTIVES:
LIST OF EXPERIMENTS:
1. Downloading and installing Hadoop; understanding different Hadoop modes, startup scripts and configuration files.
2. Hadoop implementation of file management tasks, such as adding files and directories, retrieving files and deleting files.
3. Implement matrix multiplication with Hadoop Map Reduce.
4. Run a basic Word Count Map Reduce program to understand the Map Reduce paradigm.
5. Installation of Hive along with practice examples.
6. Installation of HBase, installing Thrift along with practice examples.
7. Practice importing and exporting data from various databases.
8. Run Pig Latin scripts to sort, group, join, project and filter the data.
COURSE OUTCOMES:
CO1: Describe big data and use cases from selected business domains.
CO2: Explain NoSQL big data management.
CO3: Experiment with Hadoop and HDFS
CO4: Make use of Hadoop to perform map reduce analytics
CO5: Make use of Hadoop related tools such as HBase, Cassandra, Pig and Hive for Big Data Analytics
INDEX
Columns: S.No | Date | Name of the Experiment | Page No. | Marks: Record (15), Practical Assessment (25), Viva (10), Total (50) | Staff Signature
3. Implementation of Matrix Multiplication with Hadoop Map Reduce (Page 10)
6. Installation of HBase, Installing Thrift along with Practice Examples (Page 18)
Ex. No: 01
Date:
DOWNLOADING AND INSTALLING HADOOP; UNDERSTANDING DIFFERENT HADOOP MODES, STARTUP SCRIPTS, CONFIGURATION FILES
AIM
To set up and install Hadoop in its three operating modes:
• Standalone
• Pseudo Distributed
• Fully Distributed
DESCRIPTION
Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform and additionally requires Cygwin to run. During the Cygwin installation process, you should include the OpenSSH package if you plan to run Hadoop in pseudo-distributed mode.
ALGORITHM
1. The command for installing ssh is "sudo apt-get install ssh".
2. The command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into rsa.pub by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract Java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz
5. Extract Eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract Hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in the eclipse.ini file.
8. Export the Java path and Hadoop path in ~/.bashrc.
9. Check whether the installation is successful by checking the Java version and Hadoop version.
10. Check whether the Hadoop instance in standalone mode works correctly by running the built-in WordCount example from the Hadoop examples jar.
11. If the word count is displayed correctly in the part-r-00000 file, standalone mode is installed successfully.
8. Type the commands start-dfs.sh and start-yarn.sh to start the daemons NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps to view all running daemons. Create a directory in HDFS using the command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command nano lendi.txt, copy it from the local directory to HDFS using the command hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample WordCount jar to check whether pseudo-distributed mode is working or not.
10. Display the contents of the output file by using the command hdfs dfs -cat /newdir/part-r-00000.
INPUT
ubuntu@localhost> jps
OUTPUT
DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager
RESULT
Thus, the experiment concludes with the successful installation of Hadoop in various modes, verified startup
scripts, and proper configuration file setup.
Ex. No: 02
Date:
HADOOP IMPLEMENTATION OF FILE MANAGEMENT TASKS, SUCH AS ADDING FILES AND DIRECTORIES, RETRIEVING FILES AND DELETING FILES
AIM
Implement the following file management tasks in Hadoop:
• Adding files and directories
• Retrieving files
• Deleting Files
DESCRIPTION
HDFS is a scalable distributed file system designed to scale to petabytes of data while running on top of the
underlying file system of the operating system. HDFS keeps track of where the data resides in a network by
associating the name of its rack (or network switch) with the dataset. This allows Hadoop to efficiently schedule
tasks to those nodes that contain data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides
a set of command-line utilities that work similarly to the Linux file commands and serve as your primary interface with HDFS. We are going to have a look at HDFS by interacting with it from the command line, covering the most common file management tasks in Hadoop: adding files and directories, retrieving files, and deleting files.
ALGORITHM
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
STEP-1: ADDING FILES AND DIRECTORIES TO HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
STEP-4: COPYING DATA FROM NFS TO HDFS
• The command for copying a directory from the local file system is "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/".
• View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
• The command for listing items in HDFS is "hdfs dfs -ls hdfs://localhost:9000/".
• The command for deleting files is "hdfs dfs -rm -r /kartheek".
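The same tasks can also be scripted from Python using the hdfs (HdfsCLI) package mentioned later in this manual. The sketch below is a minimal example under the assumption that the NameNode's WebHDFS interface is reachable at http://localhost:50070 (the Hadoop 2.x default; Hadoop 3.x uses port 9870); the user name and paths are placeholders.

# pip install hdfs
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (host, port and user are assumptions).
client = InsecureClient('http://localhost:50070', user='hadoop')

# Adding files and directories
client.makedirs('/user/hadoop/lendicse')                       # like: hdfs dfs -mkdir
client.upload('/user/hadoop/lendicse/glossary',                # like: hdfs dfs -copyFromLocal
              '/home/lendi/Desktop/shakes/glossary')

# Retrieving files
print(client.list('/user/hadoop/lendicse'))                    # like: hdfs dfs -ls
with client.read('/user/hadoop/lendicse/glossary') as reader:
    print(reader.read()[:200])                                 # like: hdfs dfs -cat (first 200 bytes)

# Deleting files
client.delete('/user/hadoop/lendicse/glossary')                # like: hdfs dfs -rm
client.delete('/user/hadoop/lendicse', recursive=True)         # like: hdfs dfs -rm -r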
INPUT
Input can be any data of structured, unstructured or semi-structured format.
OUTPUT
RESULT
Thus, the Hadoop implementation for file management tasks, including adding files and directories, retrieving files,
and deleting files, has been successfully executed and verified.
Ex. No: 03
Date:
IMPLEMENTATION OF MATRIX MULTIPLICATION WITH HADOOP MAP REDUCE
AIM
To write a Map Reduce Program that implements Matrix Multiplication
DESCRIPTION
We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix is represented as a record (i, j, value). As an example, consider a matrix and its relational representation. It is important to understand that this relation is a very inefficient representation if the matrix is dense. Say we have 5 rows and 6 columns; then we need to store only 30 values. But with the above relation we are storing 30 row_ids, 30 col_ids and 30 values, so in effect we are tripling the data. A natural question then arises: why do we need to store it in this format? In practice most matrices are sparse. In a sparse matrix not all cells hold values, so we do not have to store those cells in the database, and this format turns out to be very efficient for storing such matrices.
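As a quick illustration of the (i, j, value) relation described above, the short Python snippet below converts a small, hypothetical dense matrix into its sparse record form, keeping only the non-zero cells.

# A 2x3 dense matrix used purely as sample data.
dense = [
    [0, 0, 5],
    [7, 0, 0],
]

# Sparse representation: keep only the non-zero cells as (i, j, value) records.
sparse = [(i, j, v)
          for i, row in enumerate(dense)
          for j, v in enumerate(row)
          if v != 0]

print(sparse)  # [(0, 2, 5), (1, 0, 7)] - two records instead of six stored cells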
ALGORITHM
We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i, j) and each value is the corresponding matrix element. The output files for the matrix C = A*B are in the same format.
We have the following input parameters:
• I, K, J: the dimensions of the matrices (A is I x K, B is K x J, and C = A*B is I x J)
• IB, KB, JB: the block sizes used to partition A into IB x KB blocks and B into KB x JB blocks
• R: the number of reducers
In the pseudo-code for the individual strategies below, we have intentionally avoided factoring common code for
the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise, our focus here is on mastering the Map Reduce complexities, not on optimizing the sequential matrix multiplication algorithm for the individual blocks.
STEPS
1. setup()
2. var NIB = (I-1)/IB + 1
3. var NKB = (K-1)/KB + 1
4. var NJB = (J-1)/JB + 1
5. map(key, value)
6. if from matrix A with key = (i,k) and value = a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key = (k,j) and value = b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
reduce(key, valueList)
17. if key is (ib, kb, jb, 0)
18. // Save the A block.
19. sib = ib
20. skb = kb
21. zero matrix A
22. for each value = (i, k, v) in valueList: A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return // A[ib,kb] must be zero!
25. // Build the B block.
26. zero matrix B
27. for each value = (k, j, v) in valueList: B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B: sum += A(i,k)*B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum
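To connect the pseudo-code to something executable, the sketch below simulates the map, shuffle and reduce phases in plain Python for the simpler element-wise (non-blocked) strategy, using sparse (i, j, value) records as input. The matrices, dimensions and names here are hypothetical sample data and are not the block algorithm above; in a real job the same mapper and reducer logic would run inside Hadoop.

from collections import defaultdict

# Sparse (i, j, value) records for two small sample matrices (assumed data):
# A is 2x3 and B is 3x2, so C = A*B is 2x2.
A = [(0, 0, 1), (0, 2, 2), (1, 1, 3)]
B = [(0, 1, 4), (1, 0, 5), (2, 0, 6)]
I, K, J = 2, 3, 2

def mapper(record, tag):
    # Emit ((i, j), (tag, shared index, value)): each A element is replicated across
    # the J columns of C it contributes to, each B element across the I rows.
    if tag == 'A':
        i, k, a = record
        for j in range(J):
            yield (i, j), ('A', k, a)
    else:
        k, j, b = record
        for i in range(I):
            yield (i, j), ('B', k, b)

# Shuffle phase: group the intermediate values by key, as Hadoop would.
groups = defaultdict(list)
for rec in A:
    for key, val in mapper(rec, 'A'):
        groups[key].append(val)
for rec in B:
    for key, val in mapper(rec, 'B'):
        groups[key].append(val)

def reducer(key, values):
    # Join the A and B contributions on the shared index k and sum the products.
    a_vals = {k: v for tag, k, v in values if tag == 'A'}
    b_vals = {k: v for tag, k, v in values if tag == 'B'}
    total = sum(a_vals[k] * b_vals[k] for k in a_vals if k in b_vals)
    if total != 0:
        yield key, total

for key in sorted(groups):
    for cell, value in reducer(key, groups[key]):
        print(cell, value)  # e.g. (0, 0) 12, since A(0,2)*B(2,0) = 2*6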
INPUT
Two input matrices A and B, supplied as sets of (i, j, value) records in sparse matrix format.
OUTPUT
RESULT
Thus, the matrix multiplication implementation using Hadoop Map Reduce has been successfully completed and
verified.
Ex. No: 04
Date:
RUN A BASIC WORD COUNT MAP REDUCE PROGRAM TO UNDERSTAND MAP REDUCE PARADIGM
AIM
To Run a basic Word Count Map Reduce Program to understand Map Reduce Paradigm.
DESCRIPTION
Map Reduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability
across hundreds or thousands of servers in a Hadoop cluster. The Map Reduce concept is fairly simple to understand
for those who are familiar with clustered scale-out data processing solutions. The term Map Reduce actually refers
to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data
and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The
reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the
sequence of the name Map Reduce implies, the reduce job is always performed after the map job.
ALGORITHM
MAP REDUCE PROGRAM
Word Count is a simple program which counts the number of occurrences of each word in a given text input data
set. Word Count fits very well with the Map Reduce programming model making it a great example to understand
the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
PSEUDO-CODE
void Map (key, value)
{
for each word x in value:
output.collect(x, 1);
}
PSEUDO-CODE
void Reduce (keyword, <list of values>)
{
sum = 0;
for each x in <list of values>:
sum += x;
final_output.collect(keyword, sum);
}
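For a runnable version of the same logic, the two short scripts below target Hadoop Streaming in Python; the file names mapper.py and reducer.py and the input/output paths are placeholders. Streaming delivers the mapper output to the reducer sorted by key, which is what the reducer relies on when it sums runs of identical words.

# mapper.py - read lines from standard input and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - input arrives sorted by word, so identical words are adjacent;
# accumulate the counts for each run and emit "word<TAB>total"
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A typical invocation (all paths are placeholders) is: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /input -output /output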
INPUT
A set of data related to Shakespeare's comedies, glossary and poems.
OUTPUT
RESULT
Thus, the basic Word Count Map Reduce program was executed and successfully verified, providing a clear
understanding of the Map Reduce paradigm.
Ex. No: 05
Date:
INSTALLATION OF HIVE ALONG WITH PRACTICE EXAMPLES
AIM
To install Hive on the system and to work through practice examples using a Python (PyHive) program.
ALGORITHM
Step 1: Install PyHive
Step 2: Connect to the Hive server
Step 3: Create and connect to the database
Step 4: Query the table and run aggregations
Step 5: Close the cursor and connection
PROGRAM
Install PyHive (pip install pyhive, typically along with the thrift and thrift_sasl dependencies), then:
from pyhive import hive

# Connect to HiveServer2 (host, port and username are assumptions for a local setup)
conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()
# Create a database
cursor.execute("CREATE DATABASE IF NOT EXISTS mydb")
cursor.execute("USE mydb")
# Create a table
cursor.execute("CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, salary DOUBLE)")
# Create an external table over existing HDFS data (the location is a placeholder)
cursor.execute("CREATE EXTERNAL TABLE IF NOT EXISTS ext_employee (id INT, name STRING, salary DOUBLE) "
               "LOCATION '/path/to/external_data/'")
# Run aggregations
cursor.execute("SELECT AVG(salary) FROM employee")
result = cursor.fetchone()
print("Average Salary:", result[0])
# Close the cursor and connection
cursor.close()
conn.close()
OUTPUT
RESULT
Thus, the installation of Hive was completed and the practice examples were executed and verified successfully.
Ex. No: 06
Date:
INSTALLATION OF HBASE, INSTALLING THRIFT ALONG WITH PRACTICE EXAMPLES
AIM
To install HBase and Thrift and to work through practice examples using a Python program.
ALGORITHM
Step 1: Install HBase and Thrift
Step 2: Import the packages needed for the practice examples
Step 3: Create a connection to the HBase Thrift server using the protocol
Step 4: Define and put data into HBase
Step 5: Get the data back and close the HBase connection
PROGRAM
namespace python hbase
service HBase {
bool put(1: binary row, 2: binary column, 3: binary value),
binary get(1: binary row, 2: binary column),
}
thrift -r --gen py HBase.thrift
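In practice the generated bindings are rarely used directly; the happybase package wraps HBase's Thrift interface and follows the algorithm above. The sketch below is a minimal example under the assumption that the HBase Thrift server is running on localhost:9090; the table name, column family and values are placeholders.

# pip install happybase
import happybase

# Step 3: connect to the HBase Thrift server (host and port are assumptions)
connection = happybase.Connection(host='localhost', port=9090)

# Create a sample table with one column family if it does not exist yet
if b'employee' not in connection.tables():
    connection.create_table('employee', {'info': dict()})
table = connection.table('employee')

# Step 4: put data into HBase (row key, column family:qualifier, value)
table.put(b'row1', {b'info:name': b'Asha', b'info:salary': b'45000'})

# Step 5: get the data back and close the connection
row = table.row(b'row1')
print(row[b'info:name'], row[b'info:salary'])
connection.close()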
OUTPUT
RESULT
Thus, HBase and Thrift were installed and the practice examples were executed and verified successfully using a Python program.
Ex. No: 07
Date:
PRACTICE IMPORTING AND EXPORTING DATA FROM VARIOUS DATABASES
AIM
To write a Python program to practice importing and exporting data from various databases.
ALGORITHM
Step 1: Install the required libraries such as pyspark and hdfs
Step 2: Import the necessary packages
Step 3: Create a SparkSession using the PySpark package
Step 4: Import and export the data using PySpark
Step 5: Close the connection (stop the Spark session)
PROGRAM
# importing necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

if __name__ == "__main__":
    # creating the Spark session
    spark = SparkSession.builder.appName("ImportExportExample").getOrCreate()
    schema_lst = ["State", "Cases", "Recovered", "Deaths"]
    # sample rows (hypothetical data) standing in for records imported from a source database
    rd_df = [("Tamil Nadu", 3500, 3200, 45),
             ("Kerala", 2800, 2650, 30)]
    # creating the dataframe using createDataFrame function
    df = spark.createDataFrame(rd_df, schema_lst)
    df.show()
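The export and import steps themselves can be added at the end of the same main block. A minimal sketch using Spark's built-in CSV reader and writer follows; the output path is a placeholder, and a relational source such as MySQL would use spark.read.format("jdbc") with the appropriate connection options instead.

    # export: write the dataframe out (to the local file system or HDFS)
    df.write.mode("overwrite").csv("/tmp/covid_export", header=True)

    # import: read the exported data back into a new dataframe
    imported_df = spark.read.csv("/tmp/covid_export", header=True, inferSchema=True)
    imported_df.filter(col("Cases") > 3000).show()

    # close the connection by stopping the Spark session
    spark.stop()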
OUTPUT
RESULT
Thus, the program to practice importing and exporting data from various databases was executed and verified successfully.
Ex. No: 08
Date:
RUN APACHE PIG LATIN SCRIPTS TO SORT, GROUP, JOIN, PROJECT AND FILTER THE DATA
AIM
To run Apache Pig Latin scripts to sort, group, join, project and filter the data.
ALGORITHM
Step 1: Extract the pig-0.15.0.tar.gz and move to home directory
Step 2: Set the environment of PIG in bashrc file.
Step 3: Pig can run in two modes
Step 4: Grunt Shell
Step 5: LOADING Data into Grunt Shell
Step 6: Describe Data
Step 7: DUMP Data
Step 8: FILTER Data
Step 9: GROUP Data
Step 10: Iterating Data
Step 11: Sorting Data
Step 12: LIMIT Data
Step 13: JOIN Data
PROGRAM
Pig can run in local mode or Hadoop (MapReduce) mode:
pig -x local (local mode) and pig (Hadoop mode)
grunt>
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE1 : DataType1, ATTRIBUTE2 : DataType2, ...);
DESCRIBE DATA;
DUMP DATA;
FDATA = FILTER DATA BY ATTRIBUTE == VALUE;
GDATA = GROUP DATA BY ATTRIBUTE;
FOR_DATA = FOREACH GDATA GENERATE group, AGGREGATE_FUNCTION(DATA.ATTRIBUTE);
SORT_DATA = ORDER DATA BY ATTRIBUTE ASC|DESC;
LIMIT_DATA = LIMIT DATA COUNT;
JOIN_DATA = JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...), DATA2 BY (ATTRIBUTE3, ..., ATTRIBUTEN);
INPUT
Input as Website Click Count Data
OUTPUT
RESULT
Thus, running Apache Pig Latin scripts to sort, group, join, project and filter the data was executed and verified successfully.