Pig, Hive, and Jaql
IBM Information Management
Cloud Computing Center of Competence
IBM Toronto Lab
Agenda
Overview
Pig
Hive
Jaql
Similarities of Pig, Hive and Jaql
All translate their respective high-level languages to MapReduce jobs
All offer significant reductions in program size over Java
All provide points of extension to cover gaps in functionality
All provide interoperability with other languages
None support random reads/writes or low-latency queries
Comparing Pig, Hive, and Jaql
Characteristic                 | Pig                              | Hive                               | Jaql
Developed by                   | Yahoo!                           | Facebook                           | IBM
Language name                  | Pig Latin                        | HiveQL                             | Jaql
Type of language               | Data flow                        | Declarative (SQL dialect)          | Data flow
Data structures it operates on | Complex, nested                  |                                    | JSON
Schema optional?               | Yes                              | No, but data can have many schemas | Yes
Relational complete?           | Yes                              | Yes                                | Yes
Turing complete?               | Yes when extended with Java UDFs | Yes when extended with Java UDFs   | Yes
Pig components
Two components
Language (called Pig Latin)
Compiler
Two execution environments
Local (single JVM): pig -x local
Distributed (Hadoop cluster): pig -x mapreduce, or simply pig
Running Pig
Script: pig scriptfile.pig
Grunt (interactive command line): pig
Embedded: call into Pig from Java
Sample code
# pig
grunt> -- load comma-delimited rows as (country, sum)
grunt> records = LOAD 'econ_assist.csv'
       USING PigStorage(',')
       AS (country:chararray, sum:long);
grunt> -- group rows by country, then total each group
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
       GENERATE group, SUM(records.sum);
grunt> DUMP thesum;
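To make the data flow of the sample concrete, here is a minimal Python sketch of what GROUP followed by FOREACH ... GENERATE SUM computes. It is only a model of the semantics, not how Pig executes the script, and the rows are hypothetical stand-ins for the contents of econ_assist.csv.

```python
from collections import defaultdict

# Hypothetical rows standing in for econ_assist.csv: (country, sum)
records = [
    ("Kenya", 100),
    ("Peru", 40),
    ("Kenya", 25),
]

# GROUP records BY country: collect the amounts that share a key
grouped = defaultdict(list)
for country, amount in records:
    grouped[country].append(amount)

# FOREACH grouped GENERATE group, SUM(...): one output tuple per group
thesum = sorted((country, sum(amounts)) for country, amounts in grouped.items())

# DUMP thesum: print each result tuple
for row in thesum:
    print(row)
```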
Pig Latin
Statements, operations and commands
A Pig Latin program is a collection of statements.
A statement is an operation or a command.
Example of an operation: LOAD 'statement.txt';
Example of a command: ls *.txt
Logical plan/physical plan
As each statement is processed, it is added to the logical plan.
When a statement such as 'DUMP relation' is reached, the logical plan is compiled to a physical plan and executed.
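The deferred execution described above can be illustrated with a toy Python model. The class LazyPlan and its method names are hypothetical and exist only for this sketch; Pig's real planner is far more involved.

```python
class LazyPlan:
    """Toy model: statements accumulate and run only when output is requested."""

    def __init__(self):
        self.steps = []      # stands in for the logical plan
        self.executed = 0    # counts how many steps have actually run

    def add(self, name, func):
        # Recording a step does no work yet
        self.steps.append((name, func))
        return self

    def dump(self, data):
        # Requesting output triggers execution of every recorded step
        for _name, func in self.steps:
            data = func(data)
            self.executed += 1
        return data


plan = LazyPlan()
plan.add("filter", lambda rows: [r for r in rows if r[1] > 10])
plan.add("project", lambda rows: [r[0] for r in rows])

steps_run_before_dump = plan.executed   # nothing has run so far
result = plan.dump([("a", 5), ("b", 20), ("c", 30)])
```

In this sketch, adding statements only grows the plan; work happens when dump is called, which mirrors the role DUMP plays in triggering execution.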
Pig Latin statements
UDF statements: REGISTER, DEFINE
Commands
Hadoop filesystem (cat, ls, etc.)
Hadoop MapReduce (kill)
Utility (exec, help, quit, run, set)
Diagnostic operators: DESCRIBE, EXPLAIN, ILLUSTRATE
Pig Latin Relational operators
Loading and storing (LOAD, STORE, DUMP)
Filtering (FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE)
Grouping and joining (JOIN, COGROUP, GROUP, CROSS)
Sorting (ORDER, LIMIT)
Combining and splitting (UNION, SPLIT)
Pig Latin Relations and schemata
Result of a relational operator is a relation
A relation is a bag (an unordered collection) of tuples
Relations can be named using an alias
x = LOAD 'sample.txt' AS (id: int, year: int);   -- x is the alias
DUMP x;
(1,1987)                                          -- a tuple
Structure of a relation is a schema
DESCRIBE x;
x: {id: int, year: int}                           -- the schema
Pig Latin expressions
An expression is used as part of a statement that contains a relational operator
Categories of expressions:
Constant
Field
Projection
Map lookup
Cast
Arithmetic
Conditional
Boolean
Comparison
Functional
Flatten
Pig Latin Data types
Simple types: int, long, float, double, bytearray, chararray
Complex types:
Tuple: sequence of fields of any type
Bag: unordered collection of tuples
Map: set of key-value pairs; keys must be chararray
Pig Latin Function types
Eval
Input: One or more expressions
Output: An expression
Example: MAX
Filter
Input: Bag or map
Output: boolean
Example: IsEmpty
Load
Input: Data from external storage
Output: A relation
Example: PigStorage
Store
Input: A relation
Output: Data to external storage
Example: PigStorage
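The four function types differ mainly in what goes in and what comes out. The Python sketch below models each shape; all function names are hypothetical, a bag is modelled as a list of tuples, and nothing here reflects Pig's actual Java interfaces.

```python
import io

def load_delimited(source, sep=","):
    """Load-style function: external text in, relation (list of tuples) out."""
    return [tuple(line.rstrip("\n").split(sep)) for line in source if line.strip()]

def store_delimited(relation, sink, sep=","):
    """Store-style function: relation in, external text out."""
    for row in relation:
        sink.write(sep.join(row) + "\n")

def eval_max(bag):
    """Eval-style function: values in, a single value out (compare MAX)."""
    return max(t[0] for t in bag)

def filter_is_empty(bag_or_map):
    """Filter-style function: bag or map in, boolean out (compare IsEmpty)."""
    return len(bag_or_map) == 0

# Round trip through the load and store shapes using in-memory text
relation = load_delimited(io.StringIO("Kenya,125\nPeru,40\n"))
out = io.StringIO()
store_delimited(relation, out)
```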
Pig Latin User-Defined Functions
Written in Java
Packaged in a JAR file
Register JAR file using the REGISTER statement
Optionally, alias it with DEFINE statement
Hive - Configuration
Three ways to configure Hive:
hive-site.xml, which holds settings such as fs.default.name, mapred.job.tracker, and the metastore configuration
The -hiveconf option on the hive command line
The SET command in the Hive shell
Running Hive
Hive shell
Interactive: hive
Script: hive -f myscript
Inline: hive -e 'SELECT * FROM mytable'
Hive services
hive --service servicename
where servicename can be:
hiveserver: server for Thrift, JDBC, ODBC clients
hwi: web interface
jar: hadoop jar with Hive jars in classpath
metastore: out-of-process metastore
Sample code
# hive
hive> CREATE TABLE foreign_aid
      (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
      GROUP BY country;
Hive - Metastore
Stores Hive metadata
Configurations
Embedded: in-process metastore, in-process database
Local: in-process metastore, out-of-process database
Remote: out-of-process metastore, out-of-process database
Hive Schema-On-Read
Faster loads into the database (files are simply copied or moved)
Slower queries (the schema is applied when data is read)
Flexibility: multiple schemas for the same data
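A minimal Python sketch of the schema-on-read idea: the raw lines are stored untouched, and a schema is applied only when they are read, so the same data can be read under more than one schema. The sample lines, the schema format, and the function read_with_schema are assumptions made for this illustration.

```python
# Hypothetical raw file contents, stored as-is at load time
raw_lines = ["Kenya,125", "Peru,40"]

def read_with_schema(lines, schema):
    """Apply (name, converter) pairs to each line at read time."""
    rows = []
    for line in lines:
        fields = line.split(",")
        rows.append({name: conv(value) for (name, conv), value in zip(schema, fields)})
    return rows

# Two different schemas over the same stored data
as_typed = read_with_schema(raw_lines, [("country", str), ("sum", int)])
as_text = read_with_schema(raw_lines, [("c0", str), ("c1", str)])
```

The parsing cost is paid on every read, which is the trade-off behind the faster loads and slower queries noted above.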
Hive Query Language (HiveQL)
SQL dialect
Does not support the full SQL-92 specification
No support (in early Hive releases; later releases added some of these) for:
HAVING clause in SELECT
Correlated subqueries
Subqueries outside FROM clauses
Updateable or materialized views
Stored procedures
Extensions
MySQL-like extensions
MapReduce extensions
Multi-table insert, MAP, REDUCE, TRANSFORM clauses
Data Types
Simple
TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING
Complex
ARRAY, MAP, STRUCT
Built-in functions
SHOW FUNCTIONS lists them
DESCRIBE FUNCTION describes one
Hive - Tables
Managed: CREATE TABLE
LOAD: file moved into Hive's data warehouse directory
DROP: both metadata and data deleted
External: CREATE EXTERNAL TABLE
LOAD: no files moved
DROP: only metadata deleted
Use EXTERNAL when:
Sharing data between Hive and other Hadoop applications
You wish to use multiple schemas on the same data
Hive - Partitioning
Can make some queries faster
Divides data based on a partition column
Use the PARTITIONED BY clause when creating a table
Use the PARTITION clause when loading data
SHOW PARTITIONS lists a table's partitions
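Why partitioning can speed up some queries is easiest to see in a small model. The Python sketch below keeps rows in one list per partition value, so a query restricted to one value scans only that list. The rows and the choice of year as the partition column are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (country, year, amount); "year" is the partition column
rows = [("Kenya", 2009, 100), ("Peru", 2009, 40), ("Kenya", 2010, 25)]

# Loading: each row is placed in the partition for its year
partitions = defaultdict(list)
for country, year, amount in rows:
    partitions[year].append((country, amount))

def total_for_year(partitions, year):
    """Scan only the matching partition instead of every row."""
    return sum(amount for _country, amount in partitions.get(year, []))

total_2009 = total_for_year(partitions, 2009)
```

Queries that do not filter on the partition column gain nothing from this layout, which is why only some queries get faster.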
Hive - Bucketing
Can make some queries faster
Supports sampling data
Use the CLUSTERED BY clause when creating a table
For sorted buckets, also use the SORTED BY clause when creating the table
To query a sample of your data, use the TABLESAMPLE clause, which uses bucketing
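Bucketing assigns each row to one of a fixed number of buckets based on a hash of a column, and sampling then reads a subset of the buckets. The Python sketch below uses a plain modulo on an integer key as a stand-in hash; this is an assumption for illustration and not a statement of Hive's exact hash function. The bucket count and key values are also hypothetical.

```python
NUM_BUCKETS = 4

def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    # Stand-in hash: for an integer key, take it modulo the bucket count
    return user_id % num_buckets

# Distribute twenty hypothetical rows (keyed by user_id) across the buckets
buckets = [[] for _ in range(NUM_BUCKETS)]
for user_id in range(20):
    buckets[bucket_for(user_id)].append(user_id)

# Sampling one bucket out of four reads about a quarter of the rows
sample = buckets[0]
```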
Hive Storage formats
Delimited text (default): ROW FORMAT DELIMITED
SerDe (Serializer/Deserializer): ROW FORMAT SERDE serdename
e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy'
Binary SerDe
Row-oriented (sequence file): STORED AS SEQUENCEFILE
Column-oriented (RCFile): STORED AS RCFILE
Hive User-Defined Functions
Written in Java
Three UDF types:
UDF: input is a single row, output is a single row
UDAF: input is multiple rows, output is a single row
UDTF: input is a single row, output is multiple rows
Register UDF using ADD JAR
Create alias using CREATE TEMPORARY FUNCTION
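The three UDF types differ only in how many rows go in and come out. The Python functions below sketch those three shapes; the names are hypothetical, and real Hive UDFs are Java classes registered as described above.

```python
def udf_upper(value):
    """UDF shape: one row in, one row out."""
    return value.upper()

def udaf_total(values):
    """UDAF shape: many rows in, one row out."""
    return sum(values)

def udtf_split(value, sep=","):
    """UDTF shape: one row in, many rows out."""
    return value.split(sep)

upper = udf_upper("kenya")
total = udaf_total([100, 25, 40])
parts = udtf_split("a,b,c")
```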
Jaql
A JSON Query Language
JSON = JavaScript Object Notation
Running Jaql
Jaql shell
Interactive: jaqlshell
Batch: jaqlshell -b myscript.jaql
Inline: jaqlshell -e jaqlstatement
Modes
Cluster: jaqlshell -c
Minicluster: jaqlshell
Jaql - Query Language
Sources and sinks
e.g. Copy data from a local file (the source) to a new file on HDFS (the sink):
read(file('input.json')) -> write(hdfs('output'));
Core Operators
Filter
Group
Tee
Transform
Join
Sort
Expand
Union
Top
Variables
= operator binds source output to a variable
e.g. $tweets = read(hdfs('twitterfeed'));
Pipes, streams, and consumers
Pipe operator (->) streams data to a consumer
Pipe expects an array as input
e.g. $tweets -> filter $.from_src == 'tweetdeck';
$ is an implicit variable referencing the current array value
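The pipe-and-consumer model maps naturally onto chained generators. The Python sketch below mimics a filter followed by a transform over an array of records; the helper names, the sample records, and the text field are hypothetical, and only from_src comes from the example above.

```python
# Hypothetical array of records standing in for $tweets
tweets = [
    {"from_src": "tweetdeck", "text": "a"},
    {"from_src": "web", "text": "b"},
]

def jaql_filter(stream, predicate):
    """Consumer: takes an array (any iterable) and yields the matching items."""
    return (item for item in stream if predicate(item))

def jaql_transform(stream, func):
    """Consumer: takes an array and yields one mapped item per input item."""
    return (func(item) for item in stream)

# Roughly: tweets -> filter (from_src == 'tweetdeck') -> transform (text)
# Each lambda parameter plays the role of the implicit $ variable
filtered = jaql_filter(tweets, lambda t: t["from_src"] == "tweetdeck")
result = list(jaql_transform(filtered, lambda t: t["text"]))
```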
Categories of Built-in Functions
system
schema
core
xml
hadoop
regex
io
binary
array
date
index
nil
agg
number
string
function
random
record
Jaql - Data Storage
Data store examples: Amazon S3, DB2, HBase, HDFS, HTTP, JDBC, Local FS
Data format examples: JSON, CSV, XML
Thank you!