
BIG DATA AND ANALYTICS

Lab Manual
List of Experiments

Experiment Number  Experiment Title

1  Installation of Apache Hadoop
2  File management tasks in Hadoop
3  MapReduce program
4  Installation of Apache Pig & loading dataset
5  Installation of Apache Hive
6  Create, alter, and drop databases, tables, views, and indexes in Apache Hive
7  Queries to sort and aggregate the data in a table using Hive
8  HIVE operations
EXP. NO: 1 Installation of Apache Hadoop
Aim: To install Apache Hadoop in standalone mode.

Algorithm
Step 1: Download the latest version of Java from the following website.
https://www.oracle.com/in/java/technologies/downloads/
Step 2: Install Java in the C: folder and set the following PATH and environment variables.
Go to Settings -> Search -> Edit the system environment variables.

PATH
C:\jdk-20.0.2\bin
JAVA_HOME
C:\jdk-20.0.2

Step 3: Restart the system. After restarting, type the following commands in the
command prompt to check whether Java is properly installed on your system.
java -version
javac -version

Step 4: Download the latest version of Hadoop. The latest version of Hadoop can be
downloaded from the following official website.
https://hadoop.apache.org/releases.html
Step 5: Install Hadoop in the C: folder. You can now see the following folder installed in
the C: drive.
C:\hadoop-3.3.6
Step 6: Now open the hadoop-3.3.6 folder; you can see the following folders inside it.

Step 7: Create a folder named data inside the hadoop-3.3.6 folder. Inside data, create two
folders: namenode and datanode.

Step 8: Install Notepad++ from the following link.


https://notepad-plus-plus.org/downloads/

Step 9: Go to the following folder and edit the following files.


C:\hadoop-3.3.6\etc\hadoop
1. core-site.xml
2. mapred-site.xml
3. yarn-site.xml
4. hdfs-site.xml
5. hadoop-env.cmd
i) Open the core-site.xml file in Notepad++ and insert the following property between
<configuration> and </configuration>:

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

ii) Open the mapred-site.xml file in Notepad++ and insert the following property between
<configuration> and </configuration>:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
iii) Open the yarn-site.xml file in Notepad++ and insert the following properties between
<configuration> and </configuration>:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
iv) Open the hdfs-site.xml file in Notepad++ and insert the following properties between
<configuration> and </configuration>:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.3.6\data\namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.3.6\data\datanode</value>
</property>
v) Open the hadoop-env.cmd file in Notepad++ and change the following line:
JAVA_HOME=%JAVA_HOME%
to
JAVA_HOME=C:\jdk-20.0.2
Step 10: Create the following PATH entries and environment variables for Hadoop.
PATH
C:\hadoop-3.3.6\bin
C:\hadoop-3.3.6\sbin

HADOOP_HOME
C:\hadoop-3.3.6
Step 11: Restart the system. After restarting, type the following command in the command
prompt to check whether Hadoop is installed correctly.
hdfs namenode -format
Step 12: Open the command prompt and change directory to C:\hadoop-3.3.6\sbin.

Type start-all.cmd in the command prompt. The appearance of the following screen shows that
Hadoop was installed successfully.
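A minimal sketch of the whole verification sequence at the command prompt (using jps, the standard JDK process-listing tool, is an assumption added here, not a step from this manual):

rem format the NameNode once, before the first start
hdfs namenode -format
rem start the HDFS and YARN daemons
cd C:\hadoop-3.3.6\sbin
start-all.cmd
rem list running Java processes; NameNode, DataNode, ResourceManager, and NodeManager should appear
jps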
Result: Hadoop is installed successfully.
EXP. NO: 2

AIM:-

Implement the following file management tasks in Hadoop:

 Adding files and directories



 Retrieving files

 Deleting Files

DESCRIPTION:-

HDFS is a scalable distributed filesystem designed to scale to petabytes of data while


running on top of the underlying filesystem of the operating system. HDFS keeps track of where
the data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line
utilities that work similarly to the Linux file commands, and serve as your primary interface with
HDFS. We're going to have a look into HDFS by interacting with it from the command line. We
will take a look at the most common file management tasks in Hadoop, which include:

 Adding files and directories to HDFS



 Retrieving files from HDFS to local filesystem

 Deleting files from HDFS

ALGORITHM:-

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step-1

Adding Files and Directories to HDFS


Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn't automatically created
for you, though, so let's create it with the mkdir command. For the purpose of illustration, we
use chuck. You should substitute your user name in the example commands.

hadoop fs -mkdir /user/chuck


hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck

Step-2

Retrieving Files from HDFS

The Hadoop command get copies files from HDFS back to the local filesystem. To
retrieve example.txt, we can run the following command:

hadoop fs -get example.txt

Step-3

Deleting Files from HDFS

hadoop fs -rm example.txt


 Command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".

 Adding a directory is done through the command "hdfs dfs -put lendi_english /".

Step-4

Copying Data from NFS to HDFS


HADOOP AND BIG DATA

Copying from a local directory is done with the command "hdfs dfs -copyFromLocal

/home/lendi/Desktop/shakes/glossary /lendicse/"
 View the file by using the command "hdfs dfs -cat /lendicse/glossary".

 Command for listing of items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".

 Command for deleting files is "hdfs dfs -rm -r /kartheek".
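Putting the steps together, a minimal worked session (the file and directory names here are illustrative, not from this manual):

# create a working directory in HDFS
hadoop fs -mkdir -p /user/chuck
# copy a local file into HDFS
hadoop fs -put example.txt /user/chuck
# list it, view it, copy it back to the local filesystem, then delete it
hadoop fs -ls /user/chuck
hadoop fs -cat /user/chuck/example.txt
hadoop fs -get /user/chuck/example.txt example_copy.txt
hadoop fs -rm /user/chuck/example.txt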

Output

EXP NO: 3
MapReduce
Date:

AIM: To run a MapReduce program to calculate the frequency of a given word in a given file
Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair). Example – (Map function in Word
Count)

Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, Train, bus, bus, caR, CAR, car, BUS, TRAIN

Output
Convert into another set of data
(Key,Value)

(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(Train,1), (bus,1), (bus,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Reduce Function – Takes the output from Map as an input and combines those data
tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples
(output of Map function)

(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(Train,1), (bus,1), (bus,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output Converts into smaller set of tuples


(BUS,7), (CAR,7), (TRAIN,4)
Work Flow of Program

Workflow of MapReduce consists of 5 steps:
1. Splitting – The splitting parameter can be anything, e.g.
splitting by space, comma, semicolon, or even by a new line
('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different
clusters. In order to group them in the "Reduce" phase, data with the
same KEY should be on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – The last phase where all the data (individual result
sets from each cluster) is combined together to form a result.
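The wordcount class run in Step 5 below ships inside the hadoop-mapreduce-examples jar; for reference, a minimal sketch of that program in the style of the standard Apache WordCount tutorial is shown here (this listing is an assumption based on the stock example, not code taken from this manual):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reducer: sums the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local combining before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that this stock program is case-sensitive; reproducing the case-insensitive counts illustrated above would require lower-casing each token in the mapper.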

Step 1: After installing Java and Hadoop, open the command prompt, move to C:\hadoop-
3.3.6\sbin, and run start-all.cmd
Step 2: Create an input directory in HDFS.
hadoop fs -mkdir /input_dir

Step 3: Copy the input text file named input_file.txt in the input directory (input_dir) of
HDFS.
hadoop fs -put C:/input_file.txt /input_dir
Step 4: Verify content of the copied file.
hdfs dfs -cat /input_dir/input_file.txt

Step 5: Run the mapreduce program


hadoop jar C:/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar
wordcount /input_dir /output_dir
Step 6 : Verify content for generated output file.
hdfs dfs -cat /output_dir/*

Output:

BUS 7

CAR 7

TRAIN 4

To delete the input file:

hadoop fs -rm -r /input_dir/input_file.txt


EXP. No. 4 Installation of Apache Pig & loading dataset
Aim: To install Apache Pig in Windows.
Apache Pig is a data manipulation tool that is built over Hadoop’s MapReduce. Pig provides
us with a scripting language for easier and faster data manipulation. This scripting language
is called Pig Latin.
Apache Pig scripts can be executed in 3 ways as follows:
Using Grunt Shell (Interactive Mode) – Write the commands in the grunt shell and get the
output there itself using the DUMP command.
Using Pig Scripts (Batch Mode) – Write the Pig Latin commands in a single file with a .pig
extension and execute the script at the prompt (see the sketch after this list).
Using User-Defined Functions (Embedded Mode) – Write your own functions in languages
such as Java and then use them in the scripts.
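As an illustration of batch mode, a hypothetical two-line script (the file name and data are assumptions, not from this manual):

-- sample.pig: load a text file and print its contents
lines = LOAD 'input.txt' AS (line:chararray);
DUMP lines;

It would then be run from the prompt with: pig -x local sample.pig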

Steps in Installing Apache Pig


Step 1: Download the Pig version 0.17.0 tar file from the official Apache Pig site. Navigate
to the website https://downloads.apache.org/pig/latest/ and download the file
'pig-0.17.0.tar.gz'.
Then extract this tar file using the 7-Zip tool (use 7-Zip for faster extraction: first extract the
.tar.gz file by right-clicking on it and clicking on '7-Zip → Extract Here', then extract
the resulting .tar file in the same way). To get the same paths as shown in the diagram,
extract into the C: drive.

Step 2: Add the path variables of PIG_HOME and PIG_HOME\bin.


Click the Windows Button and in the search bar type ‘Environment Variables’. Then click on
the ‘Edit the system environment variables’.

Now click on the Path variable in the System variables. This will open a new tab. Then click
the ‘New’ button. And add the value C:\pig-0.17.0\bin in the text box. Then hit OK until all
tabs have closed.
Step 3: Correcting the Pig Command File
Find the file 'pig.cmd' in the bin folder of the Pig installation (C:\pig-0.17.0\bin).
Find the line:
set HADOOP_BIN_PATH=%HADOOP_HOME%\bin
Replace this line by:
set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec
And save this file. We are finally here. Now you are all set to start exploring Pig and its
environment.

There are 2 Ways of Invoking the grunt shell:

Local Mode: All the files are installed, accessed, and run in the local machine itself. No
need to use HDFS. The command for running Pig in local mode is as follows.

pig -x local

MapReduce Mode: The files are all present on HDFS. We need to load this data to
process it. The command for running Pig in MapReduce/HDFS mode is as follows.

pig -x mapreduce
Apache PIG CASE STUDY:
1. Download the dataset containing the Agriculture-related data about crops in various
regions and their area and produce. The link for the dataset –
https://www.kaggle.com/abhinand05/crop-production-in-india The dataset contains 7
columns, as follows.

2. Enter Pig local mode from the command prompt using

pig -x local

3. Load the dataset in the local mode

grunt> agriculture = LOAD 'F:/csv files/crop_production.csv' using PigStorage (',')


as ( State_Name:chararray , District_Name:chararray , Crop_Year:int ,
Season:chararray , Crop:chararray , Area:int , Production:int ) ;
4. Dump and describe the data set agriculture using

grunt> dump agriculture;

grunt> describe agriculture;
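5. As a further illustrative step (not part of the original case study), the loaded relation can be grouped and aggregated in the grunt shell, using the schema declared above:

grunt> by_crop = GROUP agriculture BY Crop;
grunt> totals = FOREACH by_crop GENERATE group AS Crop, SUM(agriculture.Production) AS Total_Production;
grunt> dump totals;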
EXP. NO. : 5 Installation of Apache Hive
Aim: To install Apache Hive and create a database in Hadoop.

Algorithm
Step 1: Download the latest version of Java from the following website.
https://www.oracle.com/in/java/technologies/downloads/
Step 2: Install Java in the C: folder and set the following PATH and environment variables.
Go to Settings -> Search -> Edit the system environment variables.

PATH
C:\jdk-20.0.2\bin
JAVA_HOME
C:\jdk-20.0.2

Step 3: Restart the system. After restarting, type the following commands in the
command prompt to check whether Java is properly installed on your system.
java -version
javac -version

Step 4: Download the latest version of Hadoop. The latest version of Hadoop can be
downloaded from the following official website.
https://hadoop.apache.org/releases.html
Step 5: Install Hadoop in the C: folder. You can now see the following folder installed in
the C: drive.
C:\hadoop-3.3.6
Step 6: Now open the hadoop-3.3.6 folder; you can see the following folders inside it.

Step 7: Create a folder named data inside the hadoop-3.3.6 folder. Inside data, create two
folders: namenode and datanode.

Step 8: Install Notepad++ from the following link.


https://notepad-plus-plus.org/downloads/

Step 9: Go to the following folder and edit the following files.


C:\hadoop-3.3.6\etc\hadoop
1. core-site.xml
2. mapred-site.xml
3. yarn-site.xml
4. hdfs-site.xml
5. hadoop-env.cmd
i) Open the core-site.xml file in Notepad++ and insert the following property between
<configuration> and </configuration>:

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

ii) Open the mapred-site.xml file in Notepad++ and insert the following property between
<configuration> and </configuration>:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
iii) Open the yarn-site.xml file in Notepad++ and insert the following properties between
<configuration> and </configuration>:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
iv) Open the hdfs-site.xml file in Notepad++ and insert the following properties between
<configuration> and </configuration>:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.3.6\data\namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.3.6\data\datanode</value>
</property>
v) Open the hadoop-env.cmd file in Notepad++ and change the following line:
JAVA_HOME=%JAVA_HOME%
to
JAVA_HOME=C:\jdk-20.0.2
Step 10: Create the following PATH entries and environment variables for Hadoop.
PATH
C:\hadoop-3.3.6\bin
C:\hadoop-3.3.6\sbin

HADOOP_HOME
C:\hadoop-3.3.6
Step 11: Restart the system. After restarting, type the following command in the command
prompt to check whether Hadoop is installed correctly.
hdfs namenode -format
Step 12: Open the command prompt and change directory to C:\hadoop-3.3.6\sbin.

Type start-all.cmd in the command prompt. The appearance of the following screen shows that
Hadoop was installed successfully.
Step 13: Download Apache Hive from the following site.
https://downloads.apache.org/hive/hive-3.1.2/
Step 14: Extract Apache Hive in the C: folder as follows.

Step 15: Download Apache Derby


https://archive.apache.org/dist/db/derby/db-derby-10.14.2.0/
Step 16: Extract Apache Derby in the C: folder as follows.

Step 17: Download hive-site.xml from the following site.


https://github.com/apache/hive/blob/master/data/conf/hive-site.xml
Step 18: Paste the hive-site.xml file into C:\apache-hive-3.1.2-bin\conf

Step 19: Go to the following location: C:\db-derby-10.14.2.0-bin\lib. Copy all the files
from this location.
Step 20: Paste the copied files into C:\apache-hive-3.1.2-bin\lib.

Step 21: Open the command prompt. In the command prompt, change directory to C:\hadoop-3.3.6\sbin.
Type start-all.cmd in the command prompt. The appearance of the following screen shows that
Hadoop started successfully.
Step 22: Open a command prompt in admin mode and go to C:\db-derby-10.14.2.0-bin\bin>

Step 23: Run the following command in the command prompt.

C:\db-derby-10.14.2.0-bin\bin>startNetworkServer -h 0.0.0.0

Step 24: Open another command prompt in admin mode and type hive (from C:\apache-hive-3.1.2-bin\bin).
Now Hive will start and you can see the following screen appear.

Now you are ready to work with hive.


Step 25: Type the following command and press Enter.
hive> CREATE DATABASE demo;
You can see the following screen.

Type the following command:

hive> SHOW DATABASES;

You can see the following screen.

Result: Hive is installed and a database is created successfully.


EXP. No: 6 Create, alter, and drop databases, tables, views, and indexes in Apache Hive
Aim: To create, alter, and drop databases, tables, views, and indexes.
Algorithm
Step 1: Start all your Hadoop daemons.
Open the command prompt. In the command prompt, change directory to C:\hadoop-3.3.6\sbin.

Type start-all.cmd in the command prompt. The appearance of the following screen shows that
Hadoop started successfully.
Type jps in the command prompt to check the running daemons.
Step 2: Launch hive from terminal.
Step 3: Syntax to create a database
Syntax:
CREATE DATABASE <database-name>;
Command:
CREATE DATABASE student_detail; -- this will create the database student_detail
SHOW DATABASES; -- list all the available databases
Step 4: Syntax to use database.
Syntax:
USE <database-name>;
Command:
USE student_detail;

Step 5: Syntax to Create Table in Hive.


CREATE TABLE [IF NOT EXISTS] <table-name> (
<column-name> <data-type>,
<column-name> <data-type> COMMENT 'Your Comment',
<column-name> <data-type>,
.
.
.
<column-name> <data-type>
)
COMMENT 'Add if you want'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 'Location On HDFS';
Note:
1. We can add a comment to the table as well as to each individual column.
2. ROW FORMAT DELIMITED shows that whenever a new line is encountered the new
record entry will start.
3. FIELDS TERMINATED BY ‘,’ shows that we are using ‘,’ delimiter to separate each
column.
4. We can also override the default database location with the LOCATION option.
So let’s create the table student_data in our student_detail database with the help of the
command shown below.

CREATE TABLE IF NOT EXISTS student_data(


Student_Name STRING COMMENT 'This col. stores the name of student',
Student_Rollno INT COMMENT 'This col. stores the rollno of student',
Student_Marks FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
We have successfully created the table student_data in our student_detail database with 3
different fields Student_Name, Student_Rollno, Student_Marks as STRING, INT, FLOAT
respectively.
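To verify the new table, data can be loaded and queried; a minimal sketch (the local CSV path here is an assumption, not from this manual):

LOAD DATA LOCAL INPATH 'C:/student_data.csv' INTO TABLE student_data;
SELECT * FROM student_data LIMIT 5;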

Step 6: Syntax to show Table in Hive.


Syntax:
SHOW TABLES [IN <database_name>];
Command:
SHOW TABLES IN student_detail;
Step 7: Syntax for altering a table in Hive

This command is used to add new columns to the end of the existing columns.
ALTER TABLE <tablename> ADD COLUMNS (
column1 datatype [COMMENT tablecomment],
column2 datatype [COMMENT tablecomment],
...
columnN datatype [COMMENT tablecomment]);

Example:
ALTER TABLE Employee ADD COLUMNS (
designation string COMMENT 'Employee designation',
age int COMMENT 'Employee Age');

This command is used to remove all the existing columns and replace them with the new
columns specified.

Syntax:
ALTER TABLE <tablename> REPLACE COLUMNS (
column1 datatype [COMMENT tablecomment],
column2 datatype [COMMENT tablecomment],
...
columnN datatype [COMMENT tablecomment]);

Example:
ALTER TABLE Employee REPLACE COLUMNS (
employee_id string COMMENT 'Employee id',
first_name string COMMENT 'Employee First name');

This command is used to rename the table.


Syntax:
ALTER TABLE <tablename> RENAME TO <new_tablename>;
Example:
ALTER TABLE Employee RENAME TO NewEmployee;

Step 8: Syntax to Create a View in Hive


Creating a view in Hive is straightforward. Here's the basic syntax:
Example In Hive
CREATE VIEW view_name AS SELECT column1, column2
FROM table_name
WHERE condition;
For instance, if we have a table called employee with fields id, name, department, and
salary, and we want to create a view that only shows the id and name fields, we could do so
with the following command:

Example In Hive
CREATE VIEW employee_view AS SELECT id, name
FROM employee;
Now you can query employee_view just like a regular table:
Example In Hive
SELECT * FROM employee_view;

Updating and Dropping Views in Hive


To change the definition of a view, you can use the CREATE OR REPLACE VIEW
command, like so:

Example In Hive
CREATE OR REPLACE VIEW employee_view AS SELECT id, name, department
FROM employee;
This command will update employee_view to include the department field.

To delete a view, you can use the DROP VIEW command:

Example In Hive
DROP VIEW IF EXISTS employee_view;
This command will delete employee_view if it exists.

Step 9: Creating indexes in Hive.


Creating an Index
An index is nothing but a pointer on a particular column of a table. Creating an index means
creating a pointer on a particular column of a table. (Note: index support was removed in
Hive 3.0, so the statements below apply to Hive 2.x and earlier.) Its syntax is as follows:

CREATE INDEX index_name


ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Example
Let us take an example for index. Use the same employee table that we have used earlier with
the fields Id, Name, Salary, Designation, and Dept. Create an index named index_salary on
the salary column of the employee table.

The following query creates an index:

hive> CREATE INDEX index_salary ON TABLE employee(salary)


AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column is modified, the changes are stored using an
index value.
Dropping an Index
The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>
The following query drops an index named index_salary:
hive> DROP INDEX index_salary ON employee;

Conclusion: Created databases, tables, views, and indexes in Apache Hive.


EXP NO: 7
Queries to sort and aggregate the data in a table using Hive
Date:

AIM: To Write queries to sort and aggregate the data in a table using Hive.

Description:

Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-
like query language called HiveQL. These queries are compiled into MapReduce jobs that are
executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the effort that
goes into writing and maintaining MapReduce jobs.

Hive supports database concepts like tables, columns, rows, and partitions. Both primitive
(integer, float, string) and complex data types (map, list, struct) are supported. Moreover, these
types can be composed to support structures of arbitrary complexity. The tables are
serialized/deserialized using default serializers/deserializers. Any new data format and type can be
supported by implementing the SerDe and ObjectInspector Java interfaces.

HiveQL - ORDER BY and SORT BY Clause


By using the HiveQL ORDER BY and SORT BY clauses, we can sort on a column. They return
the result set either in ascending or descending order. Here, we are going to execute these clauses
on the records of the below table:

HiveQL - ORDER BY Clause



In HiveQL, ORDER BY clause performs a complete ordering of the query result set. Hence, the
complete data is passed through a single reducer. This may take much time in the execution of
large datasets. However, we can use LIMIT to minimize the sorting time.

Example:

Select the database in which we want to create a table.

hive> use hiveql;

Now, create a table by using the following command:

hive> create table emp (Id int, Name string, Salary float, Department string)
row format delimited
fields terminated by ',';

Load the data into the table

hive> load data local inpath '/home/codegyani/hive/emp_data' into table emp;


Now, fetch the data in the descending order by using the following command

hive> select * from emp order by salary desc;
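Because the complete ordering runs through a single reducer, a LIMIT clause is often added to bound the work; for example (a sketch against the same table):

hive> select * from emp order by salary desc limit 10;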


HiveQL - SORT BY Clause


The HiveQL SORT BY clause is an alternative to the ORDER BY clause. It orders the data within each
reducer. Hence, it performs a local ordering, where each reducer's output is sorted
separately. It may also give a partially ordered result.

Example:

Let's fetch the data in the descending order by using the following command

hive> select * from emp sort by salary desc;


Cluster By:
CLUSTER BY is used as an alternative to both the DISTRIBUTE BY and SORT BY clauses in HiveQL.

The CLUSTER BY clause is used on tables present in Hive. Hive uses the columns in CLUSTER BY to
distribute the rows among reducers. CLUSTER BY columns will go to multiple reducers.

It ensures the sorting order of values present across multiple reducers.

For example, the CLUSTER BY clause is applied to the Id column of the employees_guru
table. The output of this query will be distributed across multiple reducers at the back end,
but at the front end it is an alternative clause for both SORT BY and DISTRIBUTE BY.

Example:
SELECT Id, Name from employees_guru CLUSTER BY Id;
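The aim also calls for aggregation; a sketch of aggregate queries against the emp table created earlier (the column names follow that table's schema):

hive> select department, count(*), avg(salary), max(salary) from emp group by department;
hive> select sum(salary) from emp;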

Result: Written queries to sort and aggregate the data in a table using HiveQL.
EXP NO: 8
HIVE OPERATIONS
Date:

AIM: To Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

RESOURCES:
VMWare, XAMPP Server, Web Browser, 1GB RAM, Hard Disk 80 GB.

PROGRAM LOGIC:
SYNTAX for HIVE Database Operations
DATABASE Creation
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <database_name>;
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS
file_format]
Loading Data into a Table
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO
TABLE u_data;
Alter Table in HIVE
Syntax
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Creating and Dropping View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT
column_comment], ...) ] [COMMENT table_comment] AS SELECT ...
Dropping View
Syntax:
DROP VIEW view_name
Functions in HIVE
String functions: substr(), upper(), lower(), concat(), regexp_replace(), etc.
Math functions: round(), ceil(), floor(), etc.
Date and time functions: year(), month(), day(), to_date(), etc.
Aggregate functions: sum(), min(), max(), count(), avg(), etc.
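For illustration, a few of these functions applied to the emp table from Experiment 7 (reusing that table is an assumption; any table with a string and a numeric column works):

SELECT upper(Name), substr(Name, 1, 3), round(Salary) FROM emp;
SELECT year('2024-01-15'), month('2024-01-15');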
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value,
...)] [IN TABLE index_table_name] [PARTITIONED
BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating Index
CREATE INDEX index_ip ON TABLE log_data(ip_address) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD;
Altering and Inserting Index
ALTER INDEX index_ip ON log_data REBUILD;
Storing Index Data in Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
DROP INDEX index_name ON table_name;
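For instance, the index created above can be removed with:

DROP INDEX index_ip ON log_data;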

OUTPUT:

Result: Used Hive to create, alter, and drop databases, tables, views, functions, and indexes.
