Big Data & Analytics Lab Manual
List of Experiments
8. HIVE operations
EXP. NO: 1 Installation of Apache Hadoop
Aim: To install Apache Hadoop in standalone mode.
Algorithm
Step 1: Download the latest version of Java from the following website.
https://www.oracle.com/in/java/technologies/downloads/
Step 2: Install Java in the C folder and set the following PATH and environment variables.
Go to Settings -> Search -> Edit the system environment variables.
PATH
C:\jdk-20.0.2\bin
JAVA_HOME
C:\jdk-20.0.2
Step 3: Restart the system. After restarting, type the following commands in the command prompt to check whether Java is properly installed in your system.
java -version
javac -version
Step 4: Download the latest version of Hadoop. It can be downloaded from the following official website.
https://hadoop.apache.org/releases.html
Step 5: Install Hadoop in the C folder. You will then see the following folder in the C drive.
C:\hadoop-3.3.6
Step 6: Now open the hadoop-3.3.6 folder; you can see the following folders inside it.
Step 7: Create a folder named data inside the hadoop-3.3.6 folder. Inside the data folder, create two folders named namenode and datanode.
Step 8: Edit the Hadoop configuration files located in C:\hadoop-3.3.6\etc\hadoop.
i) Open core-site.xml file in Notepad++ and insert the following script between <configuration> </configuration>.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
ii) Open mapred-site.xml file in Notepad++ and insert the following script between <configuration> </configuration>.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
iii) Open yarn-site.xml file in Notepad++ and insert the following script between <configuration> </configuration>.
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
iv) Open hdfs-site.xml file in Notepad++ and insert the following script between <configuration> </configuration>.
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.3.6\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.3.6\data\datanode</value>
</property>
Step 9: Open hadoop-env.cmd file in Notepad++ and change the following line:
set JAVA_HOME=%JAVA_HOME%
to
set JAVA_HOME=C:\jdk-20.0.2
Step 10: Create the following paths and environment variables for Hadoop.
PATH
C:\hadoop-3.3.6\bin
C:\hadoop-3.3.6\sbin
HADOOP_HOME
C:\hadoop-3.3.6
Step 11: Restart the system. After restarting, type the following command in the command prompt to check whether Hadoop is installed correctly.
hdfs namenode -format
Step 12: Open command prompt and change directory to C:\hadoop-3.3.6\sbin. Type start-all.cmd; once the NameNode, DataNode, ResourceManager and NodeManager windows start without errors, Hadoop was installed successfully.
Result: Hadoop is installed successfully.
HADOOP AND BIG DATA
EXP. NO: 2
AIM:- To add, retrieve, and delete data from HDFS using Hadoop shell commands.
DESCRIPTION:- HDFS is the distributed file system of Hadoop. Files are added to HDFS with the put command, copied back to the local filesystem with the get command, and removed with the rm command.
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1: Add a file to HDFS. The Hadoop command put copies a file from the local filesystem into HDFS.
Step-2: Retrieve the file. The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve example.txt, we can run the commands shown in the sketch after this list.
Step-3: Delete the file from HDFS using the rm command.
Step-4: Verify the result by listing the HDFS directory.
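The individual commands for these steps are reproduced below as a minimal sketch, assuming a local file example.txt and an HDFS directory /user/input (both names are examples only):
hadoop fs -mkdir -p /user/input (create a target directory in HDFS)
hadoop fs -put example.txt /user/input (Step-1: add the file to HDFS)
hadoop fs -get /user/input/example.txt . (Step-2: copy the file back to the local filesystem)
hadoop fs -rm /user/input/example.txt (Step-3: delete the file from HDFS)
hadoop fs -ls /user/input (Step-4: list the directory to verify)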
EXP. NO: 3 Word Count Using MapReduce
AIM: To run a MapReduce program to calculate the frequency of a given word in a given file.
Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair). Example – (Map function in Word
Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, Train, bus, bus, caR, CAR, car, BUS, TRAIN
Output
Convert into another set of data
(Key,Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data
tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples
(output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Workflow of MapReduce consists of 5 steps:
1. Splitting – the splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the Reduce phase, data with the same key should be on the same cluster.
4. Reduce – it is essentially a group-by phase.
5. Combining – the last phase, where all the data (the individual result sets from each cluster) are combined together to form a result.
Step 1: After installing Java and Hadoop, open the command prompt, move to C:\hadoop-3.3.6\sbin and run start-all.cmd.
Step 2: Create an input directory in HDFS.
hadoop fs -mkdir /input_dir
Step 3: Copy the input text file named input_file.txt in the input directory (input_dir) of
HDFS.
hadoop fs -put C:/input_file.txt /input_dir
Step 4: Verify content of the copied file.
hadoop fs -cat /input_dir/input_file.txt
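Step 5: Run the word-count job and view its result. A minimal sketch, assuming the examples jar bundled with Hadoop 3.3.6 (the jar path and the output directory name are assumptions):
hadoop jar C:\hadoop-3.3.6\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.3.6.jar wordcount /input_dir /output_dir
hadoop fs -cat /output_dir/part-r-00000 (the part file name may vary)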
Output:
BUS 7
CAR 4
TRAIN 6
To delete the file, the following commands can be used.
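A minimal sketch using the paths from the steps above:
hadoop fs -rm /input_dir/input_file.txt (remove the input file)
hadoop fs -rm -r /output_dir (remove the output directory recursively)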
Installation of Apache Pig
Step 2: Add Pig to the Path variable. Click on the Path variable in the System variables. This will open a new tab. Then click the 'New' button and add the value C:\pig-0.17.0\bin in the text box. Then hit OK until all tabs have closed.
Step 3: Correcting the Pig Command File
Find the file 'pig.cmd' in the bin folder of Pig (C:\pig-0.17.0\bin) and open it in a text editor.
Find the line:
set HADOOP_BIN_PATH=%HADOOP_HOME%\bin
Replace this line by:
set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec
And save this file. We are finally here. Now you are all set to start exploring Pig and its environment.
Apache Pig can be run in two modes.
Local Mode: All the files are installed, accessed, and run on the local machine itself; there is no need to use HDFS. The command for running Pig in local mode is as follows.
pig -x local
MapReduce Mode: The files are all present on HDFS. We need to load this data to process it. The command for running Pig in MapReduce/HDFS mode is as follows.
pig -x mapreduce
Apache PIG CASE STUDY:
1. Download the dataset containing agriculture-related data about crops in various regions and their area and produce. The link for the dataset:
https://www.kaggle.com/abhinand05/crop-production-in-india
The dataset contains 7 columns; a sample Pig Latin script that loads and analyses it is sketched below.
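A minimal Pig Latin sketch for this case study, assuming the seven columns are State_Name, District_Name, Crop_Year, Season, Crop, Area and Production (the column names and the HDFS path are assumptions, and the CSV header row should be removed or filtered before loading):
-- load the CSV into a relation with an explicit schema
crops = LOAD '/input_dir/crop_production.csv' USING PigStorage(',')
    AS (state:chararray, district:chararray, year:int, season:chararray, crop:chararray, area:double, production:double);
-- group all records by crop
by_crop = GROUP crops BY crop;
-- compute the total production for each crop
totals = FOREACH by_crop GENERATE group AS crop, SUM(crops.production) AS total_production;
DUMP totals;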
Installation of Apache Hive
Aim: To install Apache Hive.
Steps 1 to 10: Install Java and Hadoop, edit the Hadoop configuration files, and verify the installation exactly as described in EXP. NO: 1 (Installation of Apache Hadoop).
Step 11: Download Apache Hive from the following website.
https://downloads.apache.org/hive/hive-3.1.2/
Step 12: Extract Apache Hive in the C folder as follows.
Step 17: Go to the following location: C:\db-derby-10.14.2.0-bin\lib. Copy all the jar files from this location and paste them into C:\apache-hive-3.1.2-bin\lib.
Step 18: Open command prompt and change directory to C:\hadoop-3.3.6\sbin. Type start-all.cmd to start the Hadoop daemons.
Step 19: Open command prompt in admin mode, go to C:\db-derby-10.14.2.0-bin\bin and start the Derby network server (see the sketch below).
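A sketch of the standard Derby command (run from C:\db-derby-10.14.2.0-bin\bin; -h 0.0.0.0 accepts connections from any host, and the server listens on Derby's default port 1527):
startNetworkServer -h 0.0.0.0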
Step 21: Open another command prompt in admin mode and type the command hive. Hive will now start and the hive> prompt appears.
Type jps in the command prompt to check the running daemons.
Step 1: Start the Hadoop daemons using start-all.cmd.
Step 2: Launch Hive from the terminal.
Step 3: Create a database.
Syntax:
CREATE DATABASE <database-name>;
Command:
CREATE DATABASE student_detail; -- this will create database student_detail
SHOW DATABASES; -- list down all the available databases
Step 4: Syntax to use database.
Syntax:
USE <database-name>;
Command:
USE student_detail;
This command is used to add new columns to the end of the existing columns.
Syntax:
ALTER TABLE <tablename> ADD COLUMNS (
column1 datatype [COMMENT tablecomment],
column2 datatype [COMMENT tablecomment],
...
columnN datatype [COMMENT tablecomment]);
Example:
ALTER TABLE Employee ADD COLUMNS (
designation string COMMENT 'Employee designation',
age int COMMENT 'Employee Age');
This command is used to remove all the existing columns and replace them with the new columns specified.
Syntax:
ALTER TABLE <tablename> REPLACE COLUMNS (
column1 datatype [COMMENT tablecomment],
column2 datatype [COMMENT tablecomment],
...
columnN datatype [COMMENT tablecomment]);
Example:
ALTER TABLE Employee REPLACE COLUMNS (
employee_id string COMMENT 'Employee id',
first_name string COMMENT 'Employee First name');
Example In Hive
CREATE VIEW employee_view AS SELECT id, name
FROM employee;
Now you can query employee_view just like a regular table:
Example In Hive
SELECT * FROM employee_view;
Example In Hive
CREATE OR REPLACE VIEW employee_view AS SELECT id, name, department
FROM employee;
This command will update employee_view to include the department field.
Example In Hive
DROP VIEW IF EXISTS employee_view;
This command will delete employee_view if it exists.
AIM: To write queries to sort and aggregate the data in a table using Hive.
Description:
Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-like query language called HiveQL. These queries are compiled into MapReduce jobs that are executed on Hadoop. While Hive uses Hadoop for the execution of queries, it reduces the effort that goes into writing and maintaining MapReduce jobs.
Hive supports database concepts like tables, columns, rows and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported. Moreover, these types can be composed to support structures of arbitrary complexity. Tables are serialized/deserialized using default serializers/deserializers. Any new data format and type can be supported by implementing the SerDe and ObjectInspector Java interfaces.
In HiveQL, the ORDER BY clause performs a complete ordering of the query result set. Hence, the complete data is passed through a single reducer. This may take a long time for large datasets. However, we can use LIMIT to minimize the sorting time.
Example:
hive> create table emp (Id int, Name string , Salary float, Department string)
row format delimited
fields terminated by ',' ;
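Before sorting, the table needs some rows; a minimal loading sketch, assuming a comma-separated file C:/emp.csv (a hypothetical path):
hive> load data local inpath 'C:/emp.csv' into table emp;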
Now, fetch the data in the descending order by using the following command
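A minimal sketch of such a query, assuming the rows are ordered by the Salary column:
hive> select * from emp order by salary desc;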
Example:
Let's fetch the data in the descending order again, this time ordering within each reducer.
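A minimal sketch, assuming this example uses the Sort By clause (which sorts rows within each reducer) on the Salary column:
hive> select * from emp sort by salary desc;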
Cluster By:
Cluster By is used as an alternative for both the Distribute By and Sort By clauses in HiveQL. The Cluster By clause is used on tables present in Hive. Hive uses the columns in Cluster By to distribute the rows among reducers, so Cluster By columns will go to multiple reducers.
For example, the Cluster By clause is applied to the Id column of the employees_guru table. When this query is executed, the output is distributed to multiple reducers at the back end, but at the front end it acts as an alternative clause for both Sort By and Distribute By.
Example:
SELECT Id, Name from employees_guru CLUSTER BY Id;
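Since the aim also covers aggregation, a short aggregate sketch on the emp table created earlier, using HiveQL aggregate functions with GROUP BY:
hive> select Department, count(*), avg(Salary) from emp group by Department;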
Result: Written queries to sort and aggregate the data in a table using HiveQL.
EXP NO: 8
HIVE OPERATIONS
Date:
AIM: To use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
RESOURCES:
VMWare, XAMPP Server, Web Browser, 1GB RAM, Hard Disk 80 GB.
PROGRAM LOGIC:
SYNTAX for HIVE Database Operations
DATABASE Creation
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <database name>;
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Loading Data into a table
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Creating and Dropping View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)]
[COMMENT table_comment]
AS SELECT ...
Dropping View
Syntax:
DROP VIEW view_name
Functions in HIVE
String Functions:- substr(), upper(), concat(), regexp_replace() etc
Mathematical Functions:- round(), ceil(), floor() etc
Date and Time Functions:- year(), month(), day(), to_date() etc
Aggregate Functions :- sum(), min(), max(), count(), avg() etc
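A brief usage sketch of a few of these functions, assuming the emp table from the earlier experiment:
hive> select upper(Name), round(Salary) from emp;
hive> select year('2024-01-15'), month('2024-01-15');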
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[ [ROW FORMAT ...] STORED AS ... | STORED BY ... ]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating Index
CREATE INDEX index_ip ON TABLE log_data(ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
Altering and Inserting Index
ALTER INDEX index_ip ON log_data REBUILD;
Storing Index Data in Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
DROP INDEX index_name ON table_name;
OUTPUT:
Result: Used Hive to create, alter, and drop databases, tables, views, functions, and indexes.