Pig, Hive, and Jaql: IBM Information Management Cloud Computing Center of Competence IBM Toronto Lab

This document provides an overview and comparison of Pig, Hive, and Jaql for processing large datasets. It describes the key components, execution methods, and query languages of each system. Pig uses Pig Latin and processes data flows. Hive uses HiveQL, a SQL dialect, and operates on relational schemas. Jaql uses a JSON query language and can operate on complex nested data structures without a defined schema. All three systems translate queries to MapReduce jobs to distribute processing across a cluster.

Uploaded by

Franklin

Pig, Hive, and Jaql

IBM Information Management

Cloud Computing Center of Competence
IBM Toronto Lab

Agenda

Overview
Pig
Hive
Jaql

Similarities of Pig, Hive and Jaql

All translate their respective high-level languages to MapReduce jobs
All offer significant reductions in program size over Java
All provide points of extension to cover gaps in functionality
All provide interoperability with other languages
None support random reads/writes or low-latency queries

Comparing Pig, Hive, and Jaql

Characteristic | Pig | Hive | Jaql
Developed by | Yahoo! | Facebook | IBM
Language name | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures it operates on | Complex, nested | | JSON
Schema optional? | Yes | No, but data can have many schemas |
Relational complete? | Yes | |
Turing complete? | Yes when extended with Java UDFs | | Yes

Agenda

Overview
Pig
Hive
Jaql

Pig components
Two components

Language (called Pig Latin)
Compiler

Two execution environments

Local (single JVM): pig -x local
Distributed (Hadoop cluster): pig -x mapreduce, or simply pig

Running Pig

Script
pig scriptfile.pig

Grunt (command line)
pig (launches the command-line tool)

Embedded
Call into Pig from Java

Sample code
#pig
grunt> records = LOAD 'econ_assist.csv'
           USING PigStorage(',')
           AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
           GENERATE group,
           SUM(records.sum);
grunt> DUMP thesum;
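
For readers new to Pig Latin, the result of the grouping and summing in this sample can be sketched in plain Python. This only illustrates what the script computes, not how Pig executes it on a cluster; the sample rows are invented for the example.

```python
from collections import defaultdict

# Hypothetical rows standing in for econ_assist.csv: (country, sum)
records = [
    ("Kenya", 100),
    ("Peru", 40),
    ("Kenya", 25),
]

# GROUP records BY country: collect the tuples that share a key
grouped = defaultdict(list)
for country, amount in records:
    grouped[country].append((country, amount))

# FOREACH grouped GENERATE group, SUM(records.sum)
thesum = [(country, sum(amount for _, amount in bag))
          for country, bag in grouped.items()]

# DUMP thesum
for row in sorted(thesum):
    print(row)
```

Each group holds a bag of the original tuples, and the sum is taken over one field of that bag.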

Pig Latin
Statements, operations and commands

A Pig Latin program is a collection of statements.

A statement is an operation or a command

Example of an operation: LOAD 'statement.txt';

Example of a command: ls *.txt

Logical plan/physical plan

As each statement is processed, it is added to the logical plan
When a statement such as 'DUMP relation' is reached, the logical plan is compiled to a physical plan and executed
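
The deferred execution described here can be sketched with a toy plan object in Python. The `Plan` class and its methods are hypothetical names made up for this illustration; Pig's real planner is far more involved.

```python
class Plan:
    """Toy stand-in for a logical plan: a list of deferred steps."""

    def __init__(self):
        self.steps = []      # operations recorded so far
        self.executed = 0    # how many times the plan has been run

    def add(self, func):
        # Recording a statement does no work yet
        self.steps.append(func)
        return self

    def dump(self, data):
        # A DUMP-like trigger: only now are the recorded steps run, in order
        self.executed += 1
        for step in self.steps:
            data = step(data)
        return list(data)


plan = Plan()
plan.add(lambda rows: (r for r in rows if r[1] > 1985))   # like FILTER
plan.add(lambda rows: (r[0] for r in rows))               # like FOREACH ... GENERATE
assert plan.executed == 0          # nothing has run so far

result = plan.dump([(1, 1987), (2, 1980), (3, 1990)])
print(result)
```

Deferring work until a trigger statement lets the whole chain be examined before anything runs.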

Pig Latin statements

UDF Statements

Commands

Hadoop Filesystem (cat, ls, etc.)

Hadoop MapReduce (kill)
Utility (exec, help, quit, run, set)

Diagnostic Operators

DESCRIBE, EXPLAIN, ILLUSTRATE

Pig Latin Relational operators

Loading and storing (LOAD, STORE, DUMP)

Filtering (FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE)
Grouping and joining (JOIN, COGROUP, GROUP, CROSS)
Sorting (ORDER, LIMIT)
Combining and splitting (UNION, SPLIT)

Pig Latin Relations and schemata

Result of a relational operator is a relation

A relation is a set of tuples
Relations can be named using an alias

x = LOAD 'sample.txt' AS (id: int, year: int);   (x is the alias)
DUMP x;
(1,1987)   (a tuple)

Structure of a relation is a schema

DESCRIBE x;
x: {id: int, year: int}   (the schema)

Pig Latin expressions

Part of a statement containing a relational operator

Categories of expressions:
Constant
Field
Projection
Map lookup
Cast
Arithmetic
Conditional
Boolean
Comparison
Functional
Flatten

Pig Latin Data types

Simple types:
int, long, float, double, bytearray, chararray

Complex types:
Tuple: Sequence of fields of any type
Bag: Unordered collection of tuples
Map: Set of key-value pairs. Keys must be chararray.
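
As a rough analogy only, the three complex types can be pictured with Python's built-in containers. The mapping is an assumption made for illustration; Pig's types have their own semantics and the values below are invented.

```python
# A tuple: an ordered sequence of fields, which may differ in type
row = (1, 1987, "sample")

# A bag: a collection of tuples; order carries no meaning and duplicates are allowed
bag = [(1, 1987), (2, 1990), (1, 1987)]

# A map: key-value pairs with string (chararray) keys
info = {"country": "Kenya", "sum": 100}

assert all(isinstance(key, str) for key in info)
print(len(bag), row[1], info["country"])
```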

Pig Latin Function types

Eval
Input: One or more expressions
Output: An expression
Example: MAX

Filter
Input: Bag or map
Output: boolean
Example: IsEmpty

Pig Latin Function types

Load
Input: Data from external storage
Output: A relation
Example: PigStorage

Store
Input: A relation
Output: Data to external storage
Example: PigStorage

Pig Latin User-Defined Functions

Written in Java
Packaged in a JAR file
Register JAR file using the REGISTER statement
Optionally, alias it with DEFINE statement

Agenda

Overview
Pig
Hive
Jaql

Hive - Configuration
Three ways to configure Hive:

hive-site.xml
fs.default.name
mapred.job.tracker
Metastore configuration settings

hive --hiveconf
SET command in the Hive shell

Running Hive

Hive Shell

Interactive
hive

Script
hive -f myscript

Inline
hive -e 'SELECT * FROM mytable'

Hive services
hive --service servicename
where servicename can be:

hiveserver
server for Thrift, JDBC, ODBC clients

hwi
web interface

jar
hadoop jar with Hive jars in classpath

metastore
out of process metastore

Sample code
#hive
hive> CREATE TABLE foreign_aid
      (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
      GROUP BY country;

Hive - Metastore

Stores Hive metadata

Configurations
Embedded: in-process metastore, in-process database
Local: in-process metastore, out-of-process database
Remote: out-of-process metastore, out-of-process database

Hive Schema-On-Read

Faster loads into the database (simply copy or move)
Slower queries
Flexibility: multiple schemas for the same data
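
A minimal Python sketch of the idea, assuming comma-delimited text: the raw lines are stored untouched, and a schema is applied only when they are read. The function name and both schemas are hypothetical.

```python
# Raw file contents as they sit in storage; loading did not parse them
raw_lines = ["Kenya,100", "Peru,40"]

def read_with_schema(lines, schema):
    """Apply a schema (list of (name, converter)) while reading, not while loading."""
    for line in lines:
        fields = line.split(",")
        yield {name: convert(value) for (name, convert), value in zip(schema, fields)}

# Two hypothetical schemas over the same bytes
typed_schema = [("country", str), ("sum", int)]
text_schema = [("label", str), ("amount_text", str)]

typed_rows = list(read_with_schema(raw_lines, typed_schema))
text_rows = list(read_with_schema(raw_lines, text_schema))
print(typed_rows[0])
print(text_rows[0])
```

The parsing cost is paid on every read, which is why loads are fast and queries slower.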

Hive Query Language (HiveQL)

SQL dialect
Does not support full SQL92 specification
No support for:

HAVING clause in SELECT

Correlated subqueries
Subqueries outside FROM clauses
Updateable or materialized views
Stored procedures

Hive Query Language (HiveQL)

Extensions
MySQL-like extensions
MapReduce extensions

Multi-table insert, MAP, REDUCE, TRANSFORM clauses

Data Types

Simple
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING

Complex
ARRAY, MAP, STRUCT

Hive Query Language (HiveQL)

Built-in Functions

SHOW FUNCTIONS
DESCRIBE FUNCTION

Hive - Tables
Managed (CREATE TABLE)
LOAD: File moved into Hive's data warehouse directory
DROP: Both metadata and data deleted

External (CREATE EXTERNAL TABLE)
LOAD: No files moved
DROP: Only metadata deleted

Use EXTERNAL when:

Sharing data between Hive and other Hadoop applications

You wish to use multiple schemas on the same data
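
The difference in DROP behavior between managed and external tables can be sketched with a toy model in Python. The `Warehouse` class and the paths are hypothetical and exist only for this illustration.

```python
class Warehouse:
    """Toy model: a metadata catalog plus a set of data files."""

    def __init__(self, files):
        self.files = set(files)   # paths that currently exist
        self.tables = {}          # table name -> (path, is_external)

    def create(self, name, path, external=False):
        self.tables[name] = (path, external)

    def drop(self, name):
        path, external = self.tables.pop(name)   # metadata always goes
        if not external:
            self.files.discard(path)             # managed: data goes too


wh = Warehouse(files={"/warehouse/aid", "/shared/aid_raw"})
wh.create("aid", "/warehouse/aid")
wh.create("aid_raw", "/shared/aid_raw", external=True)
wh.drop("aid")
wh.drop("aid_raw")
print(sorted(wh.files))
```

After both drops, only the file behind the external table remains, so other applications can keep using it.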

Hive - Partitioning
Can make some queries faster
Divide data based on partition column
Use PARTITIONED BY clause when creating table
Use PARTITION clause when loading data
SHOW PARTITIONS will show a table's partitions
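
Why partitioning can speed up some queries can be sketched in Python, assuming each partition value is stored in its own directory. The layout, the `year` column, and the `scan` function are assumptions for this example.

```python
# Hypothetical layout: one directory per value of the partition column
partitions = {
    "year=2009": [("Kenya", 100), ("Peru", 40)],
    "year=2010": [("Kenya", 25)],
}

def scan(partitions, year=None):
    """Return (rows, partitions_read); a filter on the partition column skips directories."""
    wanted = [name for name in partitions
              if year is None or name == f"year={year}"]
    rows = [row for name in wanted for row in partitions[name]]
    return rows, len(wanted)

rows_2010, read_2010 = scan(partitions, year=2010)
all_rows, read_all = scan(partitions)
print(rows_2010, read_2010)
print(len(all_rows), read_all)
```

Only queries that filter on the partition column benefit; others still read every partition.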

Hive - Bucketing
Can make some queries faster
Supports sampling data
Use CLUSTERED BY clause when creating table
For sorted buckets, use SORTED BY clause when creating table
To query a sample of your data, use the TABLESAMPLE clause, which uses bucketing
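
A rough Python sketch of hash bucketing and bucket-based sampling follows. The bucket count, the sample keys, and the use of `zlib.crc32` are assumptions for illustration; Hive uses its own hash function.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Assign a row to a bucket by hashing the clustering column."""
    # crc32 is used here only because it is stable across runs; Hive's hash differs
    return zlib.crc32(key.encode("utf-8")) % num_buckets

countries = ["Kenya", "Peru", "Chile", "Nepal", "Ghana", "Laos"]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for country in countries:
    buckets[bucket_of(country)].append(country)

# Sampling reads a single bucket instead of the whole table
sample = buckets[0]
print(buckets)
```

Because equal keys always land in the same bucket, one bucket is a consistent slice of the data.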

Hive Storage formats

Delimited Text (default)

ROW FORMAT DELIMITED

SerDe: Serializer/Deserializer

ROW FORMAT SERDE serdename

e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy'

Binary SerDe

Row-oriented (Sequence file)

STORED AS SEQUENCEFILE

Column-oriented (RCFile)
STORED AS RCFILE

Hive User-Defined Functions

Written in Java
Three UDF types:

UDF
Input: single row, output: single row
UDAF
Input: multiple rows, output: single row
UDTF
Input: single row, output: multiple rows

Register UDF using ADD JAR

Create alias using CREATE TEMPORARY FUNCTION
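
The three input/output shapes can be sketched in Python. Hive UDFs themselves are written in Java, so the functions below are hypothetical stand-ins that show only how many rows go in and come out.

```python
def upper_udf(value):
    """UDF shape: one row in, one row out."""
    return value.upper()

def total_udaf(values):
    """UDAF shape: many rows in, one row out."""
    return sum(values)

def split_udtf(value):
    """UDTF shape: one row in, many rows out."""
    return value.split(",")

print(upper_udf("kenya"))
print(total_udaf([100, 25]))
print(split_udtf("a,b,c"))
```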

Agenda

Overview
Pig
Hive
Jaql

Jaql
A JSON Query Language
JSON = JavaScript Object Notation

Running Jaql

Jaql Shell

Interactive

jaqlshell

Batch

jaqlshell -b myscript.jaql

Inline

jaqlshell -e jaqlstatement

Modes

Cluster

jaqlshell -c

Minicluster

jaqlshell

Jaql
Query Language

Sources and sinks

e.g. Copy data from a local file (the source) to a new file on HDFS (the sink)

read(file('input.json')) -> write(hdfs('output'));

Core Operators
Filter

Group

Tee

Transform

Join

Sort

Expand

Union

Top

Jaql
Query Language

Variables

= operator binds source output to a variable

e.g. $tweets = read(hdfs('twitterfeed'));

Pipes, streams, and consumers

Pipe operator (->) streams data to a consumer

Pipe expects array as input

e.g. $tweets -> filter $.from_src == 'tweetdeck';

$ implicit variable referencing current array value
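
The pipe-into-filter pattern can be approximated in Python as follows. This is a sketch of the semantics only; `jaql_filter` and the sample records are hypothetical, and the lambda's argument plays the role of the implicit `$`.

```python
# Hypothetical records standing in for the twitterfeed array
tweets = [
    {"from_src": "tweetdeck", "text": "a"},
    {"from_src": "web", "text": "b"},
]

def jaql_filter(array, predicate):
    """Consumer: takes an array and keeps elements where predicate holds.

    The predicate's argument plays the role of Jaql's implicit `$`.
    """
    return [item for item in array if predicate(item)]

# $tweets -> filter $.from_src == 'tweetdeck';
from_tweetdeck = jaql_filter(tweets, lambda item: item["from_src"] == "tweetdeck")
print(from_tweetdeck)
```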

Jaql
Query Language

Categories of Built-in Functions

agg
number
string
function
random
record

Jaql
Data Storage

Data store examples

Amazon S3
DB2
HBase
HDFS
HTTP
JDBC
Local FS

Data format examples

JSON
CSV
XML

Thank you!
