
BIG DATA ANALYTICS

UNIT – V
Pig Latin: Installing and Running Pig, Comparison with Databases, Different Pig Latin
expressions, Ways of executing Pig Programs, Built-in functions in Pig, Data Processing
Operators, Pig Latin commands / Pig in Practice.

Hive: Installing Hive, The Hive Shell, Comparison with Traditional Databases, HiveQL,
Tables, User Defined Functions.

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by
representing them as data flows. Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language
provides various operators, with which programmers can develop their own functions for reading,
writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these
scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

Why Do We Need Apache Pig?

Programmers who are not proficient in Java often struggle when working with Hadoop, especially
while performing MapReduce tasks. Apache Pig is a boon for all such programmers.

 Using Pig Latin, programmers can perform MapReduce tasks easily without having to write
complex Java code.

 Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an
operation that would require 200 lines of code (LoC) in Java can often be done with as few as
10 LoC in Apache Pig. Ultimately, Apache Pig can reduce development time by almost 16 times.

 Pig Latin is an SQL-like language, and it is easy to learn Apache Pig if you are familiar with
SQL.

 Apache Pig provides many built-in operators to support data operations like joins, filters,
ordering, etc. In addition, it provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig:
Apache Pig comes with the following features −

 Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.

 Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are
good at SQL.

 Optimization opportunities − The tasks in Apache Pig optimize their execution automatically,
so the programmers need to focus only on semantics of the language.

 Extensibility − Using the existing operators, users can develop their own functions to read,
process, and write data.

 UDFs − Pig provides the facility to create User-Defined Functions in other programming
languages such as Java and to invoke or embed them in Pig scripts.

 Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as
unstructured. It stores the results in HDFS.

Apache Pig Vs SQL


 Listed below are the major differences between Apache Pig and SQL.

 Pig Latin is a procedural language, whereas SQL is a declarative language.

 In Apache Pig, schema is optional; we can store data without designing a schema (values are
referenced positionally as $0, $1, etc.). In SQL, schema is mandatory.

 The data model in Apache Pig is nested relational, while the data model used in SQL is flat
relational.

 Apache Pig provides limited opportunity for query optimization, whereas there is more
opportunity for query optimization in SQL.
Applications of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and
quick prototyping. Apache Pig is used −

 To process huge data sources such as web logs.

 To perform data processing for search platforms.

 To process time sensitive data loads.

Apache Pig - Architecture


The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level data
processing language which provides a rich set of data types and operators to perform various operations
on the data.

To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin
language and execute it using one of the execution mechanisms (Grunt shell, script, or embedded).
After execution, these scripts go through a series of transformations applied by the Pig framework
to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the
programmer’s job easy. The architecture of Apache Pig is shown below.

1. Parser – It checks the syntax of the script.

2. Optimizer – It performs activities such as merge, split, join, order by, group by, etc. It basically
tries to reduce the amount of data being sent to the next stage.

3. Compiler – It converts the code into MapReduce jobs.

4. Execution – Finally, the job is submitted and the code is executed. We can then use DUMP to
show the output on the screen, or use STORE to save the output to a text file or another type of
file.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such
as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.

Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored as a string
and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic
values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’

Tuple

A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A
tuple is similar to a row in a table of an RDBMS.

Example − (Raja, 30)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag.
Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a
table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same
number of fields or that the fields in the same position (column) have the same type.

Example − {(Raja, 30), (Mohammad, 45)}


A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, [email protected],}}

Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be
unique. The value can be of any type. A map is represented by ‘[]’.
Example − [name#Raja, age#30]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are
processed in any particular order).

5.1. Installing and Running Pig:


Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster, there is nothing
extra to install on the cluster: Pig launches jobs and interacts with HDFS (or other Hadoop filesystems)
from your workstation.
Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will need Cygwin).
Download a stable release from http://pig.apache.org/releases.html, and unpack the tarball in a suitable
place on your workstation:

% tar xzf pig-x.y.z.tar.gz

It’s convenient to add Pig’s binary directory to your command-line path. For example:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java installation.
Try typing pig -help to get usage instructions.

Running Apache Pig/ 2 Execution Types

1. Local Mode (Without Hadoop Cluster)

Run Pig in local mode for testing:

pig -x local

This opens the Grunt Shell, where you can enter Pig Latin commands interactively.

Example Pig Latin Script:


A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);


B = FILTER A BY age > 25;
DUMP B;

To execute a script saved as myscript.pig:


pig -x local myscript.pig


2. MapReduce Mode (Hadoop Cluster Required)

Run Pig in MapReduce mode (the default mode):

pig

Or run a Pig script:

pig myscript.pig

Output Verification

 DUMP Command: Displays output in the terminal.


 STORE Command: Saves the output to HDFS. Example:


STORE B INTO '/output' USING PigStorage(',');

Now you’re ready to write and execute Pig Latin scripts for data analysis!

Running Pig Programs


There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script
Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig. Alternatively, for very
short scripts, you can use the -e option to run a script specified as a string on the
command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
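
For programmatic access, a minimal sketch of the embedded approach is shown below (the file name, schema, and output path are assumptions for illustration, not taken from the text):

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws IOException {
        // Run Pig in local mode; ExecType.MAPREDUCE would submit jobs to a cluster.
        PigServer pigServer = new PigServer(ExecType.LOCAL);

        // Register Pig Latin statements, just as you would type them in Grunt.
        pigServer.registerQuery("A = LOAD 'data.txt' USING PigStorage(',') "
                + "AS (name:chararray, age:int);");
        pigServer.registerQuery("B = FILTER A BY age > 25;");

        // Equivalent to a STORE statement in a script.
        pigServer.store("B", "output_dir");
    }
}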

5.2. Comparison with Databases


Though traditional SQL is still the favorite of many and is popularly used in numerous organizations, Apache
Hive and Pig have become buzzwords in the big data world today. These tools provide easy
alternatives for carrying out the complex programming of MapReduce, helping data developers and analysts.

Organizations looking for open-source querying and programming to tame big data have adopted
Hive and Pig widely. At the same time, it is vital to pick the right platform and tool for
managing your data well. Hence it is essential to understand the differences between Hive, Pig, and SQL
and choose the most suitable option for the project.

Technical Differences between Hive vs Pig vs SQL

Does the comparison of Hive vs Pig vs SQL produce a single winner?
We have seen that there are significant differences among the three. All of them
perform specific functions and meet unique requirements of the business. Also, all three require
proper infrastructure and skills for their efficient use while working on data sets.

Hive vs Pig vs SQL

Nature of Language: Pig uses a procedural language called Pig Latin; Hive uses a declarative language
called HiveQL; SQL itself is a declarative language.

Definition: Pig is an open-source, high-level data flow language with a multi-query approach; Hive is an
open-source tool built with an analytical focus and used for analytical queries; SQL is a general-purpose
database language for analytical and transactional queries.

Suitable for: Pig is suitable for complex as well as nested data structures; Hive is ideal for batch
processing, i.e. OLAP (Online Analytical Processing); SQL is ideal for more straightforward business
demands for fast data analysis.

Operational for: Pig works on semi-structured and structured data; Hive is used only for structured data;
SQL is a domain-specific language for relational database management systems.

Compatibility: Pig works on top of MapReduce; Hive works on top of MapReduce; SQL is not compatible
with MapReduce programming.

Use of Schema: Pig has no concept of a schema for storing data; Hive supports schemas for data insertion
into tables; SQL requires strict use of schemas when storing data.

5.3. Different Pig Latin expressions


An expression is something that is evaluated to yield a value. Expressions can be used
in Pig as a part of a statement containing a relational operator. Pig has a rich variety of
expressions, many of which will be familiar from other programming languages.
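
For instance, a short sketch (assuming a hypothetical relation records with fields year, temperature, and quality) shows a few common expressions inside a FOREACH statement: a field reference, an arithmetic expression, and a conditional (bincond) expression.

projected = FOREACH records GENERATE year, temperature / 10.0,
            (quality == 0 OR quality == 1 ? 'good' : 'suspect');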

5.4. Ways of Executing Pig Programs


Apache Pig Run Modes

Apache Pig executes in two modes: Local Mode and MapReduce Mode.

Local Mode

o It executes in a single JVM and is used for development, experimentation, and prototyping.
o Here, files are installed and run using localhost.
o The local mode works on a local file system. The input and output data are stored in the local
file system.

The command for the local mode grunt shell:

1. $ pig -x local

MapReduce Mode

o The MapReduce mode is also known as Hadoop Mode.
o It is the default mode.
o In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster.
o It can be executed against a semi-distributed or fully distributed Hadoop installation.
o Here, the input and output data are present on HDFS.

The command for MapReduce mode:

1. $ pig

Or,

1. $ pig -x mapreduce

Ways to execute Pig Program

The following are the ways of executing a Pig program in local and MapReduce mode:

o Interactive Mode (Grunt) - In this mode, Pig is executed in the Grunt shell. To
invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can enter
Pig Latin statements and commands interactively at the command line.
o Batch Mode (Script) - In this mode, we can run a script file having a .pig extension.
These files contain Pig Latin commands.
o Embedded Mode - In this mode, we can define our own functions. These functions are
called UDFs (User Defined Functions). Here, we use programming languages like
Java and Python.

5.5. Built-in Functions in Pig


Pig comes with a collection of built-in functions, a selection of which are listed in
Table 11-7. The complete list of built-in functions, which includes a large number of
standard math and string functions, can be found in the documentation for each Pig
release.
Eval functions: AVG, COUNT, COUNT_STAR, SUM, TOKENIZE, MAX, MIN, SIZE, etc.
Load/Store functions: PigStorage(), TextLoader, HBaseStorage, JsonLoader, JsonStorage, etc.
Math functions: ABS, COS, SIN, TAN, CEIL, FLOOR, ROUND, RANDOM, etc.
String functions: TRIM, RTRIM, SUBSTRING, LOWER, UPPER, etc.
Date/time functions: GetDay, GetHour, GetYear, ToUnixTime, ToString, etc.
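
As a small illustration (a sketch only, assuming the student_details relation used elsewhere in this unit with fields firstname:chararray and age:int), aggregate functions are applied to a grouped relation, while string functions are applied per field:

grouped = GROUP student_details ALL;
stats = FOREACH grouped GENERATE COUNT(student_details), AVG(student_details.age), MAX(student_details.age);
names = FOREACH student_details GENERATE UPPER(firstname), SIZE(firstname);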
5.6. Data Processing Operators:
1. Loading and Storing data 2. Filtering data 3.Grouping and Joining data

4. Combining and Splitting data 5. Sorting data 6. Diagnostic Operators

The Load Operator: You can load data into Apache Pig from the file system (HDFS/ Local)
using LOAD operator of Pig Latin.

Syntax: Relation_name = LOAD 'Input file path' USING function as schema;

EX:grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'


USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );

Following is the description of the above statement.

Relation name: We have stored the data in the schema (relation) student.
Input file path: We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS.
Storage function: We have used the PigStorage() function. It loads and stores data as structured text files. It takes as a parameter the delimiter by which each entity of a tuple is separated. By default, it takes ‘\t’ as the delimiter.
Schema: We have stored the data using the following schema.

column: id, firstname, lastname, phone, city
datatype: int, chararray, chararray, chararray, chararray

The Store Operator: You can store the loaded data in the file system using the store operator.

Syntax: STORE Relation_name INTO ' required_directory_path ' [USING function];

Ex: grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING


PigStorage (',');
2.Filtering Data: Filter, Distinct, Foreach operators
Once you have some data loaded into a relation, the next step is often to filter it to
remove the data that you are not interested in. By filtering early in the processing pipeline,
you minimize the amount of data flowing through the system, which can improve efficiency.

i) The FILTER operator: is used to select the required tuples from a relation based on a
condition.

Syntax: grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example: filter_data = FILTER student_details BY city == 'Chennai';

ii) The DISTINCT operator: is used to remove redundant (duplicate) tuples from a relation.

Syntax: grunt> Relation_name2 = DISTINCT Relation_name1;


Example: grunt> distinct_data = DISTINCT student_details;
iii) The FOREACH operator: is used to generate specified data transformations based on the column
data.
Syntax: grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);

Example: grunt> foreach_data = FOREACH student_details GENERATE id,age,city;

3. Grouping and Joining Data: Group, Cogroup, Join and Cross operators
i) The GROUP operator is used to group the data in one or more relations. It collects the data having the
same key.
Syntax: grunt> Group_data = GROUP Relation_name BY age;
Example: grunt> group_data = GROUP student_details by age;
 Grouping by Multiple Columns
Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
 Group All
You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;

ii) The COGROUP operator : works more or less in the same way as the GROUP operator. The only
difference between the two operators is that the group operator is normally used with one relation, while
the cogroup operator is used in statements involving two or more relations.
Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS
directory /pig_data/
Ex: grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
grunt> Dump cogroup_data;
The cogroup operator groups the tuples from each relation according to age where each group depicts a
particular age value.

iii) The JOIN operator: is used to combine records from two or more relations. While performing a join
operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the
two particular tuples are matched, else the records are dropped. Joins can be of the following types −
 Self-join
 Inner-join
 Outer-join − left join, right join, and full join
This section explains with examples how to use the join operator in Pig Latin. Assume that we have two
files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS.
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least
one relation.
Syntax: grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Let us perform self-join operation on the relation customers, by joining the two
relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when
there is a match in both tables.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example:grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Outer Join: Unlike inner join, outer join returns all the rows from at least one of the relations. An outer
join operation is carried out in three ways −
 Left outer join
 Right outer join
 Full outer join
The left outer Join operation returns all rows from the left table, even if there are no matches in the right
relation.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY
customer_id;
Example:grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

The right outer join operation returns all rows from the right table, even if there are no matches in the
left table.
Syntax: grunt> Relation3_name = JOIN Relation1_name BY key RIGHT OUTER, Relation2_name BY key;
Example: grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
The full outer join operation returns rows when there is a match in one of the relations.
Syntax: grunt> Relation3_name = JOIN Relation1_name BY key FULL OUTER, Relation2_name BY key;
Example: grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
iv) The CROSS operator computes the cross-product of two or more relations.
Syntax: grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example: grunt> cross_data = CROSS customers, orders;

4. Combining and Splitting Data: Union and Split operators

i) The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION
operation on two relations, their columns and domains must be identical.
Syntax: grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example: grunt> student = UNION student1, student2;

ii) The SPLIT operator is used to split a relation into two or more relations.
Syntax: grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF
(condition2);
Example:
SPLIT student_details into student_details1 if age<23, student_details2 if (age>22 and age<25);

5. Sorting Operator: Order By and Limit operators


i) The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or
more fields.
Syntax: grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);
Example: grunt> order_by_data = ORDER student_details BY age DESC;
Verify the relation order_by_data using the DUMP operator as shown below.
grunt> Dump order_by_data;

ii) The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax: grunt> Result = LIMIT Relation_name number_of_tuples;
Example: grunt> limit_data = LIMIT student_details 4;

6. Diagnostic Operators: Dump, Describe, Explain and Illustrate

The load statement will simply load the data into the specified relation in Apache Pig. To verify
the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin
provides four different types of diagnostic operators −

 Dump operator
 Describe operator
 Explain operator
 Illustrate operator

i) The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is
generally used for debugging purposes.
Syntax:grunt> Dump Relation_Name
Example: grunt> Dump student
ii)The describe operator is used to view the schema of a relation.
Syntax : grunt> Describe Relation_name
Example: grunt> describe student;
iii)The explain operator is used to display the logical, physical, and MapReduce execution plans of a
relation.
Syntax : grunt> explain Relation_name;
Example: grunt> explain student;
iv)The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax: grunt> illustrate Relation _name;

Example: grunt> illustrate student;

5.7. Pig in Practice


There are some practical techniques that are worth knowing about when you are
developing and running Pig programs. This section covers some of them.

Parallelism
When running in MapReduce mode, it's important that the degree of parallelism
matches the size of the dataset. By default, Pig sets the number of reducers by
looking at the size of the input, using one reducer per 1GB of input, up to a maximum
of 999 reducers. You can override these parameters by setting
pig.exec.reducers.bytes.per.reducer (the default is 1000000000 bytes) and pig.exec.reducers.max
(default 999).
To explicitly set the number of reducers you want for each job, you can use a PARALLEL
clause for operators that run in the reduce phase. These include all the grouping and
joining operators (GROUP, COGROUP, JOIN, CROSS), as well as DISTINCT and
ORDER. The following line sets the number of reducers to 30 for the GROUP:
grouped_records = GROUP records BY year PARALLEL 30;
Alternatively, you can set the default_parallel option, and it will take effect for all
subsequent jobs:

grunt> set default_parallel 30

Parameter Substitution
If you have a Pig script that you run on a regular basis, then it’s quite common to want
to be able to run the same script with different parameters. For example, a script that
runs daily may use the date to determine which input files it runs over. Pig supports
parameter substitution, where parameters in the script are substituted with values supplied at runtime.
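
As a quick sketch (the script name, parameter names, and paths below are hypothetical), parameters are referenced in a script with a $ prefix and supplied on the command line with -param (or from a file with -param_file):

-- daily.pig
records = LOAD '$input/$date' USING PigStorage(',') AS (user:chararray, bytes:long);
totals = FOREACH (GROUP records BY user) GENERATE group, SUM(records.bytes);
STORE totals INTO 'output/$date';

Such a script might be invoked with something like:

% pig -param input=/logs -param date=2013-06-01 daily.pig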
5.8. Hive: Introduction
What is Hive?

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.

Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and
developed it further as an open-source project under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Features of Hive

 It stores the schema in a database and the processed data in HDFS.

 It is designed for OLAP.

 It provides SQL type language for querying called HiveQL or HQL.

 It is familiar, fast, scalable, and extensible

Architecture of Hive
Hive Client

Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types
of clients such as:

o Thrift Server - It is a cross-language service provider platform that serves requests from all those
programming languages that support Thrift.

o JDBC Driver - It is used to establish a connection between Hive and Java applications. The JDBC driver is
present in the class org.apache.hadoop.hive.jdbc.HiveDriver.

o ODBC Driver - It allows applications that support the ODBC protocol to connect to Hive.

Hive Services

The following are the services provided by Hive:

o Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and
commands.
o Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a web-based
GUI for executing Hive queries and commands.
o Hive MetaStore - It is a central repository that stores all the structure information of the various tables and
partitions in the warehouse. It also includes metadata for each column and its type, the serializers and
deserializers used to read and write data, and the corresponding HDFS files where the data is
stored.
o Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different clients and
provides it to Hive Driver.
o Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and JDBC/ODBC driver.
It transfers the queries to the compiler.
o Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the
different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan in the form of DAG of map-reduce tasks and
HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their
dependencies.
The Metastore
o The metastore is the central repository of Hive metadata. The metastore is divided into two pieces: a service
and the backing store for the data.
o By default, the metastore service runs in the same JVM as the Hive service and contains an embedded
Derby database instance backed by the local disk. This is called the embedded metastore configuration
(see Figure 12-2).

Figure 12-2. Metastore configurations

5.9. Installing Hive:


 In normal use, Hive runs on your workstation and converts your SQL query into a series of
MapReduce jobs for execution on a Hadoop cluster. Hive organizes data into tables, which
provide a means for attaching structure to data stored in HDFS. Metadata—such as table schemas
—is stored in a database called the metastore.
 When starting out with Hive, it is convenient to run the metastore on your local machine. In this
configuration, which is the default, the Hive table definitions that you create will be local to your
machine, so you can’t share them with other users.
 Installation of Hive is straightforward. Java 6 is a prerequisite; and on Windows, you will need
Cygwin, too. You also need to have the same version of Hadoop installed locally that your cluster
is running. Of course, you may choose to run Hadoop locally, either in standalone or pseudo-
distributed mode, while getting started with Hive.
 Download a release at http://hive.apache.org/releases.html, and unpack the tarball in a suitable
place on your workstation:
% tar xzf hive-x.y.z-dev.tar.gz
It’s handy to put Hive on your path to make it easy to launch:

% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_INSTALL/bin

Now type hive to launch the Hive shell:


% hive
hive>
5.10. The Hive Shell
In both interactive and non-interactive mode, Hive will print information to standard error, such as the
time taken to run a query, during the course of operation. You can suppress these messages using the
-S option at launch time, which has the effect of only showing the output result for queries.

Other useful Hive shell features include the ability to run commands on the host operating system by
using a ! prefix to the command and the ability to access Hadoop filesystems using the dfs command.
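
For example (a sketch; the table and path shown are hypothetical), -S can be combined with -e to run a single query silently, and ! and dfs can be used from inside the shell:

% hive -S -e 'SELECT COUNT(*) FROM dummy_table;'

hive> !pwd;
hive> dfs -ls /user/hive/warehouse;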

5.11. Comparison with Traditional Databases


While Hive resembles a traditional database in many ways (such as supporting an SQL interface), its
HDFS and MapReduce underpinnings mean that there are a number of architectural differences that
directly influence the features that Hive supports, which in turn affects the uses that Hive can be put to.

Schema on Read Versus Schema on Write

In a traditional database, a table’s schema is enforced at data load time. If the data being loaded doesn’t
conform to the schema, then it is rejected. This design is sometimes called schema on write, since the data
is checked against the schema when it is written into the database.

Hive, on the other hand, doesn’t verify the data when it is loaded, but rather when a query is issued. This
is called schema on read.

There are trade-offs between the two approaches. Schema on read makes for a very fast initial load, since
the data does not have to be read, parsed, and serialized to disk in the database’s internal format. The load
operation is just a file copy or move. It is more flexible, too: consider having two schemas for the same
underlying data, depending on the analysis being performed.

Schema on write makes query time performance faster, since the database can index columns and perform
compression on the data. The trade-off, however, is that it takes longer to load data into the database.
Furthermore, there are many scenarios where the schema is not known at load time, so there are no
indexes to apply, since the queries have not been formulated yet. These scenarios are where Hive shines.

Updates, Transactions, and Indexes

Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these
features have not been considered a part of Hive’s feature set. This is because Hive was built to operate
over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by
transforming the data into a new table. For a data warehousing application that runs over large portions of
the dataset, this works well.

However, there are workloads where updates (or insert appends, at least) are needed, or where indexes
would yield significant performance gains. On the transactions front, Hive doesn’t define clear semantics
for concurrent access to tables, which means applications need to build their own application-level
concurrency or locking mechanism. The Hive team is actively working on improvements in all these
areas.

Change is also coming from another direction: HBase integration. HBase has different storage
characteristics to HDFS, such as the ability to do row updates and column indexing, so we can expect to
see these features used by Hive in future releases. It is already possible to access HBase tables from Hive.

Difference Between RDBMS and Hive


Purpose: An RDBMS is used to maintain a database; Hive is used to maintain a data warehouse.

Query Language: An RDBMS uses SQL (Structured Query Language); Hive uses HQL (Hive Query Language).

Schema Flexibility: Schema is fixed in an RDBMS; schema varies in Hive.

Data Normalization: In an RDBMS, normalized data is stored; in Hive, both normalized and de-normalized
data are stored.

Table Structure: Tables in an RDBMS are sparse; tables in Hive are dense.

Partitioning Support: An RDBMS doesn't support partitioning; Hive supports automatic partitioning.

Partition Method: No partition method is used in an RDBMS; Hive uses a sharding method for partitioning.

5.12. HiveQL
HiveQL (Hive Query Language) is the SQL-like query language used to communicate with Apache Hive, a
data warehouse built on Hadoop. It is used for querying and managing massively distributed data held in
Hadoop's HDFS (Hadoop Distributed File System) or other comparable storage systems.

HiveQL is intended to make big data analysis and querying easier, especially for individuals who are
already comfortable with SQL.

Hive provides an SQL-like environment through HiveQL for interacting with tables, databases, and
queries, and offers different types of clauses for the various kinds of data processing and querying. Hive
also has JDBC connectivity.

Hive’s SQL dialect, called HiveQL, does not support the full SQL-92 specification. There are a number of
reasons for this. Being a fairly young project, it has not had time to provide the full repertoire of SQL-92
language constructs.
Data Types

Hive data types are divided into the following 5 different categories:

1. Numeric Type: TINYINT, SMALLINT, INT, BIGINT

2. Date/Time Types: TIMESTAMP, DATE, INTERVAL

3. String Types: STRING, VARCHAR, CHAR

4. Complex Types: STRUCT, MAP, UNION, ARRAY

5. Misc Types: BOOLEAN, BINARY

Hive supports both primitive and complex data types. Primitives include numeric, boolean, string, and
timestamp types. The complex data types include arrays, maps, and structs. Hive’s data types are listed
in Table 12-3. Note that the literals shown are those used from within HiveQL; they are not the serialized
form used in the table’s storage format.
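
For instance, a minimal sketch (the table and column names here are illustrative) shows how complex types are declared and accessed:

CREATE TABLE complex (
  c1 ARRAY<INT>,
  c2 MAP<STRING, INT>,
  c3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);

-- array indexing, map lookup, and struct field access
SELECT c1[0], c2['b'], c3.c FROM complex;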
Operators and Functions
The usual set of SQL operators is provided by Hive:

relational operators (such as x = 'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for
pattern matching),

arithmetic operators (such as x + 1 for addition), and

logical operators (such as x OR y for logical OR). The operators match those in MySQL, which deviates
from SQL-92 since || is logical OR, not string concatenation. Use the concat function for the latter in
both MySQL and Hive.

Hive comes with a large number of built-in functions—too many to list here—divided into categories
including mathematical and statistical functions, string functions, date functions (for operating on string
representations of dates), conditional functions, aggregate functions, and functions for working with
XML (using the xpath function) and JSON.

You can retrieve a list of functions from the Hive shell by typing SHOW FUNCTIONS. To get brief usage
instructions for a particular function, use the DESCRIBE command:

hive> DESCRIBE FUNCTION length;

length(str) - Returns the length of str

In the case when there is no built-in function that does what you want, you can write your own;

“User-Defined Functions”.

5.13. Tables
A Hive table is logically made up of the data being stored and the associated metadata describing the
layout of the data in the table. The data typically resides in HDFS, although it may reside in any Hadoop
filesystem, including the local filesystem or S3. Hive stores the metadata in a relational database—and
not in HDFS.
Multiple Database/Schema Support

Many relational databases have a facility for multiple namespaces, which allow users and applications to be segregated into different
databases or schemas. Hive supports the same facility, and provides commands such as CREATE DATABASE dbname, USE dbname, and DROP
DATABASE dbname. You can fully qualify a table by writing dbname.tablename. If no database is specified, tables belong to the default
database.

Create Table

We use the CREATE TABLE statement to create a table; the complete syntax is as follows.

CREATE TABLE IF NOT EXISTS <<database_name.>><<table_name>>


(column_name_1 data_type_1,
column_name_2 data_type_2,
.
.
column_name_n data_type_n)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Load Data into Table

Now that the table has been created, it's time to load data into it. We can load the data
from any local file on our system using the following syntax.

LOAD DATA LOCAL INPATH <<path of file on your local system>>


INTO TABLE
<<database_name.>><<table_name>> ;

Alter Table

In Hive, we can make multiple modifications to existing tables, such as renaming a table or
adding more columns to it. The commands to alter a table are very similar to
the SQL commands.

Here is the syntax to rename the table:

ALTER TABLE <<table_name>> RENAME TO <<new_name>> ;

Syntax to add more columns to the table:

## to add more columns


ALTER TABLE <<table_name>> ADD COLUMNS
(new_column_name_1 data_type_1,
new_column_name_2 data_type_2,
.
.
new_column_name_n data_type_n) ;

Drop Table

Dropping a table from the Hive meta store deletes the entire table with all rows and columns.

Syntax:

DROP TABLE <<table_name>>;

Managed Tables and External Tables:

When you create a table in Hive, by default Hive will manage the data, which means that Hive moves the
data into its warehouse directory. Alternatively, you may create an external table, which tells Hive to
refer to the data that is at an existing location outside the warehouse directory.

The difference between the two types of table is seen in the LOAD and DROP semantics. Let’s consider a
managed table first.

When you load data into a managed table, it is moved into Hive’s warehouse directory. For example:

CREATE TABLE managed_table (dummy STRING);

LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;

will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for the managed_table
table, which is hdfs://user/hive/warehouse/managed_table.

If the table is later dropped, using:

DROP TABLE managed_table;

then the table, including its metadata and its data, is deleted.

An external table behaves differently. You control the creation and deletion of the data. The location of
the external data is specified at table creation time:

CREATE EXTERNAL TABLE external_table (dummy STRING)

LOCATION '/user/tom/external_table';

LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;


With the EXTERNAL keyword, Hive knows that it is not managing the data, so it doesn’t move it to its
warehouse directory. Indeed, it doesn’t even check if the external location exists at the time it is
defined. This is a useful feature, since it means you can create the data lazily after creating the table.
When you drop an external table, Hive will leave the data untouched and only delete the metadata.

Managed vs. External Table – What’s the Difference?


Managed Table: Hive assumes that it owns the data for managed tables.
External Table: Hive assumes that it does not manage the data.

Managed Table: If a managed table or partition is dropped, the data and metadata associated with that
table or partition are deleted.
External Table: Dropping the table does not delete the data, although the metadata for the table will be
deleted.

Managed Table: Hive stores the data in its warehouse directory.
External Table: Hive stores the data in the LOCATION specified during creation of the table (generally
not the warehouse directory).

Managed Table: Provides ACID/transactional support.
External Table: Does not provide ACID/transactional support.

Managed Table: Statements such as ARCHIVE, UNARCHIVE, TRUNCATE, MERGE, and
CONCATENATE are supported.
External Table: These statements are not supported.

Managed Table: Query result caching is supported (saves the results of an executed Hive query for reuse).
External Table: Not supported.

Partitions and Buckets:


Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the
value of a partition column, such as date. Using partitions can make it faster to do queries on slices of
the data.

Tables or partitions may further be subdivided into buckets, to give extra structure to the data that may
be used for more efficient queries. For example, bucketing by user ID means we can quickly evaluate a
user-based query by running it on a randomized sample of the total set of users.

Partitions

A table may be partitioned in multiple dimensions. For example, in addition to partitioning logs by date,
we might also subpartition each date partition by country to permit efficient queries by location.
Partitions are defined at table creation time using the PARTITIONED BY clause, which takes a list of
column definitions.
For the hypothetical log files example, we might define a table with records comprising a timestamp
and the log line itself:

CREATE TABLE logs (ts BIGINT, line STRING)

PARTITIONED BY (dt STRING, country STRING);
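
Data is then loaded into specific partitions, and the defined partitions can be listed (a sketch; the local path and partition values are illustrative):

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');

SHOW PARTITIONS logs;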

Buckets

Hive bucketing, a.k.a. clustering, is a technique to split the data into more manageable files (by
specifying the number of buckets to create). The value of the bucketing column is hashed into a
user-specified number of buckets.

Bucketing can be created on just one column. You can also create bucketing on a partitioned table to
further split the data, which further improves the query performance of the partitioned table.

Each bucket is stored as a file within the table’s directory or the partition directories. Note that a
partition creates a directory, and you can have a partition on one or more columns; these are some of
the differences between Hive partitions and buckets.

From our example, we already have a partition on state which leads to around 50 subdirectories on a
table directory, and creating a bucketing 10 on zipcode column creates 10 files for each partitioned
subdirectory.

First, let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to
specify the columns to bucket on and the number of buckets:

CREATE TABLE bucketed_users (id INT, name STRING)

CLUSTERED BY (id) INTO 4 BUCKETS;

Example: creating bucketing on the zipcode column on top of a table partitioned by state, as sketched below.
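
A minimal sketch (the table definition here is an assumption for illustration):

CREATE TABLE customers_bucketed (id INT, name STRING, zipcode STRING)
PARTITIONED BY (state STRING)
CLUSTERED BY (zipcode) INTO 10 BUCKETS;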


Differences Between Hive Partitioning vs Bucketing

Below are some of the differences between Partitioning vs bucketing

PARTITIONING: A directory is created on HDFS for each partition.
BUCKETING: A file is created on HDFS for each bucket.

PARTITIONING: You can have one or more partition columns.
BUCKETING: You can have only one bucketing column.

PARTITIONING: You can't manage the number of partitions to create.
BUCKETING: You can manage the number of buckets to create by specifying the count.

PARTITIONING: (not applicable)
BUCKETING: Bucketing can be created on a partitioned table.

PARTITIONING: Uses PARTITIONED BY.
BUCKETING: Uses CLUSTERED BY.


Storage Formats:

There are two dimensions that govern table storage in Hive: the row format and the
file format. The row format dictates how rows, and the fields in a particular row, are
stored. In Hive parlance, the row format is defined by a SerDe, a portmanteau word
for a Serializer-Deserializer.

The file format dictates the container format for fields in a row. The simplest format is
a plain text file, but there are row-oriented and column-oriented binary formats available,
too.
The default storage format: Delimited text
When you create a table with no ROW FORMAT or STORED AS clauses, the default format is
delimited text, with a row per line.
Binary storage formats: Sequence files and RCFiles
Hadoop’s sequence file format is a general-purpose binary format for sequences of records (key-value
pairs). You can use sequence files in Hive by using the declaration STORED AS SEQUENCEFILE in the
CREATE TABLE statement.
One of the main benefits of using sequence files is their support for splittable compression.
If you have a collection of sequence files that were created outside Hive, then Hive will read them with
no extra configuration. If, on the other hand, you want tables populated from Hive to use compressed
sequence files for their storage, you need to set a few properties to enable compression:
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compress=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> INSERT OVERWRITE TABLE ...;

Sequence files are row-oriented. What this means is that the fields in each row are stored together, as the
contents of a single sequence file record.
Hive provides another binary storage format called RCFile, short for Record Columnar File. RCFiles are
similar to sequence files, except that they store data in a column-oriented fashion. RCFile breaks up the
table into row splits, then within each split stores the values for each row in the first column, followed by
the values for each row in the second column, and so on. This is shown diagrammatically in Figure 12-3.
Figure 12-3. Row-oriented versus column-oriented storage
A column-oriented layout permits columns that are not accessed in a query to be skipped.
Consider a query of the table in Figure 12-3 that processes only column 2. With row-oriented storage, like
a sequence file, the whole row (stored in a sequence file record) is loaded into memory, even though only
the second column is actually read.

Importing Data
We’ve already seen how to use the LOAD DATA operation to import data into a Hive table (or partition)
by copying or moving files to the table’s directory. You can also populate a table with data from another
Hive table using an INSERT statement, or at creation time using the CTAS construct, which is an
abbreviation used to refer to CREATE TABLE...AS SELECT.
INSERT OVERWRITE TABLE
Here’s an example of an INSERT statement:
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;

You can specify the partition dynamically, by determining the partition value from the SELECT
statement:
INSERT OVERWRITE TABLE target
PARTITION (dt)
SELECT col1, col2, dt
FROM source;
This is known as a dynamic-partition insert. This feature is off by default, so you need to enable it by
setting hive.exec.dynamic.partition to true first.
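
A minimal sketch of enabling and running a dynamic-partition insert, reusing the target and source tables from the example above:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE target
PARTITION (dt)
SELECT col1, col2, dt
FROM source;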

Multitable insert
In HiveQL, you can turn the INSERT statement around and start with the FROM clause, for the same
effect:
FROM source
INSERT OVERWRITE TABLE target
SELECT col1, col2;
The reason for this syntax becomes clear when you see that it’s possible to have multiple INSERT clauses
in the same query. This so-called multitable insert is more efficient than multiple INSERT statements.
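
For example (a sketch; the two target tables are hypothetical), several INSERT clauses can share a single scan of the source table:

FROM source
INSERT OVERWRITE TABLE target_by_col1
  SELECT col1, COUNT(1)
  GROUP BY col1
INSERT OVERWRITE TABLE target_by_col2
  SELECT col2, COUNT(1)
  GROUP BY col2;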

CREATE TABLE...AS SELECT


It’s often very convenient to store the output of a Hive query in a new table, perhaps because it is too
large to be dumped to the console or because there are further processing steps to carry out on the result.
The new table’s column definitions are derived from the columns retrieved by the SELECT clause. In the
following query, the target table has two columns named col1 and col2 whose types are the same as the
ones in the source table:
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
A CTAS operation is atomic, so if the SELECT query fails for some reason, then the table is not created.

Altering Tables
You can rename a table using the ALTER TABLE statement:
ALTER TABLE source RENAME TO target;
Hive allows you to change the definition for columns, add new columns, or even replace all existing
columns in a table with a new set.
For example, consider adding a new column:
ALTER TABLE target ADD COLUMNS (col3 STRING);

Dropping Tables
The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only
the metadata is deleted—the data is left untouched.
If you want to delete all the data in a table, but keep the table definition (like DELETE orTRUNCATE in
MySQL), then you can simply delete the data files. For example:
hive> dfs -rmr /user/hive/warehouse/my_table;
Hive treats a lack of files (or indeed no directory for the table) as an empty table.

5.14. User-Defined Functions:
 Sometimes the query you want to write can’t be expressed easily (or at all) using the built-in
functions that Hive provides.
 By writing a user-defined function (UDF), Hive makes it easy to plug in your own processing
code and invoke it from a Hive query.
 UDFs have to be written in Java, the language that Hive itself is written in.
 For other languages, consider using a SELECT TRANSFORM query, which allows you to stream
data through a user-defined script.
There are three types of UDF in Hive: (regular) UDFs, UDAFs (user-defined aggregate functions), and
UDTFs (user-defined table-generating functions). They differ in the numbers of rows that they accept as
input and produce as output:
• A UDF operates on a single row and produces a single row as its output. Most functions, such as
mathematical functions and string functions, are of this type.
• A UDAF works on multiple input rows and creates a single output row. Aggregate functions include
such functions as COUNT and MAX.
• A UDTF operates on a single row and produces multiple rows—a table—as output.

1.Writing a UDF:
To illustrate the process of writing and using a UDF, we’ll write a simple UDF to trim characters from the
ends of strings. Hive already has a built-in function called trim, so we’ll call ours strip. The code for the
Strip Java class is shown in Example 12-2.
Example 12-2. A UDF for stripping characters from the ends of strings
package com.hadoopbook.hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Strips whitespace (or the given characters) from both ends of the input string.
public class Strip extends UDF {
  private Text result = new Text();

  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }

  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }
}

A UDF must satisfy the following two properties:


1. A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.
2. A UDF must implement at least one evaluate() method.
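
To use the UDF from Hive, the compiled class is packaged in a JAR, added to the session, and registered as a function (a sketch; the JAR path and the table dummy are hypothetical):

hive> ADD JAR /path/to/hive-udfs.jar;
hive> CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
hive> SELECT strip('  bee  ') FROM dummy;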

2. Writing a UDAF:
An aggregate function is more difficult to write than a regular UDF, since values are aggregated in
chunks (potentially across many Map or Reduce tasks), so the implementation has to be capable of
combining partial aggregations into a final result.
Figure 12-4. Data flow with partial results for a UDAF

The init() method initializes the evaluator and resets its internal state.
The iterate() method is called every time there is a new value to be aggregated.
The terminatePartial() method is called when Hive wants a result for the partial aggregation.
The merge() method is called when Hive decides to combine one partial aggregation with another
The terminate() method is called when the final result of the aggregation is needed.
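
A skeleton that follows this method contract is sketched below for a UDAF computing an integer maximum (an illustration of the classic UDAF style, where a nested evaluator class implements UDAFEvaluator; not an example taken from the text):

package com.hadoopbook.hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Maximum extends UDAF {

  public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
    private IntWritable result;

    public void init() {
      // Reset internal state before a new aggregation.
      result = null;
    }

    public boolean iterate(IntWritable value) {
      // Called once for each new value to be aggregated.
      if (value == null) {
        return true;
      }
      if (result == null) {
        result = new IntWritable(value.get());
      } else {
        result.set(Math.max(result.get(), value.get()));
      }
      return true;
    }

    public IntWritable terminatePartial() {
      // Return the partial aggregation computed so far.
      return result;
    }

    public boolean merge(IntWritable other) {
      // Combine a partial aggregation produced by another task.
      return iterate(other);
    }

    public IntWritable terminate() {
      // Return the final result of the aggregation.
      return result;
    }
  }
}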
