CIS - 468 - 04 - NOSQL Databases and Big Data Storage Systems

Yarmouk University

Faculty of Information Technology and Computer Sciences

CIS 468: Big Data Management

Topic 4: NoSQL, HBase, and Hive

Acknowledgements: these course materials are adapted from Dr. Rafat Hammad.


Most of these slides have been prepared from various online tutorials and presentations, with credit to their authors, and adapted for our course. Additional slides have been added from the references mentioned in the syllabus.
Part 1: Introduction to NoSQL

2
¡ Relational model with schemas
¡ Powerful, flexible query language (SQL)
¡ Transactional semantics: ACID
¡ Rich ecosystem, lots of tool support (MySQL, PostgreSQL, etc.)

3
 The name stands for Not Only SQL
 Typically does not use SQL as the query language
 A class of non-relational data storage systems
¡ The term NOSQL was introduced by Eric Evans when an event was organized to discuss open-source distributed databases
 It is not a replacement for an RDBMS but complements it
 All NoSQL offerings relax one or more of the ACID properties (see the CAP theorem, discussed shortly)

4
¡ Key features (advantages):
§ non-relational
§ don’t require a strict schema
§ data are replicated to multiple nodes (so, identical & fault-tolerant) and can be partitioned:
§ down nodes are easily replaced
§ no single point of failure
§ horizontally scalable
§ cheap, easy to implement (open-source)
§ massive write performance
§ fast key-value access

5
¡ Web apps have different needs (than the apps that RDBMS were designed for)
§ Low and predictable response time (latency)
§ Scalability & elasticity (at low cost!)
§ High availability
§ Flexible schemas / semi-structured data
§ Geographic distribution (multiple data centers)
¡ Web apps can (usually) do without
§ Transactions / strong consistency / integrity
§ Complex queries
6
¡ Google (BigTable)
¡ LinkedIn (Voldemort)
¡ Facebook (Cassandra)
¡ Twitter (HBase, Cassandra)
¡ Baidu (HyperTable)

7
¡ Three major papers were the seeds of the NoSQL movement:
§ BigTable (Google)
§ Dynamo (Amazon)
§ Ring partition and replication
§ Gossip protocol (discovery and error detection)
§ Distributed key-value data store
§ Eventual consistency
§ CAP Theorem (discussed in the next few slides)

8
¡ Suppose three properties of a distributed system (sharing data):
§ Consistency: all copies have the same value
§ Availability: reads and writes always succeed even if a node in the cluster goes down
§ Partition-tolerance: system properties (consistency and/or availability) hold even when network failures prevent some machines from communicating with others

9
¡ Brewer’s CAP Theorem:
§ For any system sharing data, it is “impossible” to guarantee all three of these properties simultaneously
§ You can have at most two of these three properties for any shared-data system
¡ Very large systems will “partition” at some point:
§ That leaves either C or A to choose from (a traditional DBMS prefers C over A and P)
§ In almost all cases, you would choose A over C (except in specific applications such as order processing)

10
Consistency (CAP triangle): all clients always have the same view of the data; once a writer has written, all readers will see that write.

¡ Two kinds of consistency:
§ Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
§ Weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)
11
¡ ACID
§ A DBMS is expected to support “ACID transactions,” processes that are:
§ Atomicity: either the whole process is done or none of it is
§ Consistency: only valid data are written
§ Isolation: one operation at a time
§ Durability: once committed, it stays that way
¡ CAP
§ Consistency: all copies of the data on the cluster are the same
§ Availability: the cluster always accepts reads and writes
§ Partition tolerance: guaranteed properties are maintained even when network failures prevent some machines from communicating with others

12
¡ A consistency model determines rules for visibility and apparent order of updates
¡ Example:
§ Row X is replicated on nodes M and N
§ Client A writes row X to node N
§ Some period of time t elapses
§ Client B reads row X from node M
§ Does client B see the write from client A?
§ Consistency is a continuum with tradeoffs
§ For NOSQL, the answer would be: “maybe”
§ The CAP theorem states: “strong consistency can’t be achieved at the same time as availability and partition-tolerance”
13
¡ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
¡ For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
¡ Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
¡ http://en.wikipedia.org/wiki/Eventual_consistency

14
¡ The types of large systems based on CAP aren’t ACID, they are BASE (http://queue.acm.org/detail.cfm?id=1394128):
§ Basically Available – the system seems to work all the time
§ Soft State – it doesn’t have to be consistent all the time
§ Eventually Consistent – it becomes consistent at some later time
¡ Everyone who builds big applications builds them on CAP and BASE: Google, Yahoo, Facebook, Amazon, eBay, etc.
15
Availability (CAP triangle): the system is available during software and hardware upgrades and node failures.

¡ Traditionally thought of as the server/process being available five 9’s (99.999%).
§ However, for a large node system, at almost any point in time there’s a good chance that a node is either down or there is a network disruption among the nodes.
§ Want a system that is resilient in the face of network disruption

16
Partition tolerance (CAP triangle): the system can continue to operate in the presence of network partitions.

17
CAP Theorem: you can have at most two of these three properties (Consistency, Availability, Partition tolerance) for any shared-data system.

18
¡ Key-Value stores
§ Simple K/V lookups (DHT)
¡ Column stores
§ Each key is associated with many attributes (columns)
§ NoSQL column stores are actually hybrid row/column
stores
§ Different from “pure” relational column stores!
¡ Document stores
§ Store semi-structured documents (JSON)
¡ Graph databases
§ Neo4j, etc.
§ Not exactly NoSQL
§ can’t satisfy the requirements for High Availability and
Scalability/Elasticity very well
19
¡ Focus on scaling to huge amounts of data
¡ Designed to handle massive load
¡ Based on Amazon’s Dynamo paper
¡ Data model: (global) collection of key-value pairs
¡ Dynamo ring partitioning and replication
¡ Example: (DynamoDB)
§ items having one or more attributes (name, value)
§ an attribute can be single-valued or multi-valued, like a set
§ items are combined into a table
20
¡ Basic API access:
§ get(key): extract the value given a key
§ put(key, value): create or update the value given its key
§ delete(key): remove the key and its associated value
§ execute(key, operation, parameters): invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)

21
¡ Pros:
§ very fast
§ very scalable (horizontally distributed to nodes based on key)
§ simple data model
§ eventual consistency
§ fault-tolerance

¡ Cons
 Can’t model more complex data structures such as objects

22
Key-value stores at a glance (Name, Producer, Data model, Querying):
§ SimpleDB (Amazon): set of couples (key, {attribute}), where an attribute is a couple (name, value). Querying: restricted SQL; select, delete, GetAttributes, and PutAttributes operations.
§ Redis (Salvatore Sanfilippo): set of couples (key, value), where the value is a simple typed value, a list, an ordered (according to ranking) or unordered set, or a hash value. Querying: primitive operations for each value type.
§ Dynamo (Amazon): like SimpleDB. Querying: simple get operation and put in a context.
§ Voldemort (LinkedIn): like SimpleDB. Querying: similar to Dynamo.

23
¡ Can model more complex objects
¡ Inspired by Lotus Notes
¡ Data model: collection of documents
¡ Document: JSON (JavaScript Object Notation is a data model of key-value pairs, which supports objects, records, structs, lists, arrays, maps, dates, and Booleans with nesting), XML, other semi-structured formats.
¡ Example: (MongoDB) document
§ { Name: "Jaroslav",
Address: "Malostranske nám. 25, 118 00 Praha 1",
Grandchildren: { Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1" },
Phones: [ "123-456-7890", "234-567-8963" ]
}

24
Document stores at a glance (Name, Producer, Data model, Querying):
§ MongoDB (10gen): object-structured documents stored in collections; each object has a primary key called ObjectId. Querying: manipulations with objects in collections (find objects via simple selections and logical expressions, delete, update).
§ Couchbase (Couchbase): document as a list of named (structured) items (JSON document). Querying: by key and key range, views via JavaScript and MapReduce.

25
¡ Based on Google’s BigTable paper
¡ Like column-oriented relational databases (store data in column order) but with a twist
¡ Tables similar to an RDBMS, but handle semi-structured data
¡ Data model:
§ Collection of Column Families
§ Column family = (key, value) where value = set of related columns (standard, super)
§ Indexed by row key, column key, and timestamp

26
27
28
29
¡ One column family can have a variable number of columns
¡ Cells within a column family are sorted “physically”
¡ Very sparse, most cells have null values
¡ Comparison: RDBMS vs column-based NOSQL
§ Query on multiple tables
§ RDBMS: must fetch data from several places on disk and glue it together
§ Column-based NOSQL: only fetch the column families of those columns that are required by a query (all columns in a column family are stored together on disk, so multiple rows can be retrieved in one read operation: data locality)

30
Column stores at a glance (Name, Producer, Data model, Querying):
§ BigTable (Google): set of couples (key, {value}). Querying: selection (by combination of row, column, and timestamp ranges).
§ HBase (Apache): groups of columns (a BigTable clone). Querying: JRuby IRB-based shell (similar to SQL).
§ Hypertable (Hypertable): like BigTable. Querying: HQL (Hypertable Query Language).
§ Cassandra (Apache, originally Facebook): columns, groups of columns corresponding to a key (supercolumns). Querying: simple selections on key, range queries, column or column ranges.
§ PNUTS (Yahoo): (hashed or ordered) tables, typed arrays, flexible schema. Querying: selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k).

31
¡ Focus on modeling the structure of data
(interconnectivity)
¡ Scales to the complexity of data
¡ Inspired by mathematical Graph Theory (G = (V, E))
¡ Data model:
§ (Property Graph) nodes and edges
§ Nodes may have properties (including ID)
§ Edges may have labels or roles
§ Key-value pairs on both
¡ Interfaces and query languages vary
¡ Single-step vs path expressions vs full recursion
¡ Example:
§ Neo4j, FlockDB, Pregel, InfoGrid …

32
¡ Advantages
§ Massive scalability
§ High availability
§ Lower cost (than competitive solutions at that scale)
§ (usually) predictable elasticity
§ Schema flexibility, sparse & semi-structured data
¡ Disadvantages
§ Don’t fully support relational features
§ no join, group by, order by operations (except within partitions)
§ no referential integrity constraints across partitions
§ No declarative query language (e.g., SQL) → more programming
§ Eventual consistency is not intuitive to program for
§ makes client applications more complicated
§ No easy integration with other applications that support SQL
§ Relaxed ACID (see the CAP theorem) → fewer guarantees

33
¡ NOSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
¡ Problems with cloud computing:
§ SaaS (Software as a Service, or on-demand software) applications require enterprise-level functionality, including ACID transactions, security, and other features associated with commercial RDBMS technology, i.e., NOSQL should not be the only option in the cloud
§ Hybrid solutions:
§ Voldemort with MySQL as one of the storage backends
§ deal with NOSQL data as semi-structured data → integrating RDBMS and NOSQL via SQL/XML

34
Part 2: Introduction to HBase

35
¡ HBase is an open-source, distributed, column-oriented database built on top of HDFS, based on BigTable
§ Distributed – uses HDFS for storage
§ Row/column store
§ Column-oriented – nulls are free
§ Multi-dimensional (versions)
§ Untyped – stores byte[]
¡ HBase is part of Hadoop
¡ HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets
§ Aims to support low-latency random access
36
¡ A sparse, distributed, persistent multi-dimensional sorted map
¡ Sparse
§ Sparse data is supported with no waste of costly storage space
§ HBase can handle the fact that we don’t (yet) know some information
§ HBase is a schema-less data store; that is, it’s fluid: we can add to, subtract from, or modify the schema as we go along
¡ Distributed and persistent
§ Persistent simply means that the data you store in HBase will persist or remain after our program or session ends
§ Just as HBase is an open-source implementation of BigTable, HDFS is an open-source implementation of GFS
§ HBase leverages HDFS to persist its data to disk storage
§ By storing data in HDFS, HBase offers reliability, availability, seamless scalability, and high performance, all on cost-effective distributed servers

37
¡ Multi-dimensional sorted map
§ A map (also known as an associative array) is an
abstract collection of key-value pairs, where the
key is unique.
§ The keys are stored in HBase and sorted.
§ Each value can have multiple versions, which
makes the data model multidimensional. By
default, data versions are implemented with a
timestamp.

38
HBase is built on top of YARN and HDFS.
HBase files are internally stored in HDFS.

39
¡ Both are distributed systems that scale to hundreds or
thousands of nodes
¡ HDFS is good for batch processing (scans over big files)
§ Not good for record lookup
§ Not good for incremental addition of small batches
§ Not good for updates
¡ HBase is designed to efficiently address the above
points
§ Fast record lookup
§ Support for record-level insertion
§ Support for updates (not in place)
¡ HBase updates are done by creating new versions of
values
40
If your application needs neither random reads nor random writes → stick to HDFS

41
¡ Tables have one primary index, the row key.
¡ No join operators.
¡ Scans and queries can select a subset of available columns, perhaps by using a wildcard.
¡ There are three types of lookups:
§ Fast lookup using row key and optional timestamp
§ Full table scan
§ Range scan from region start to end
¡ Limited atomicity and transaction support.
§ HBase supports batched mutations of single rows only.
§ Data is unstructured and untyped.
¡ Not accessed or manipulated via SQL.
§ Programmatic access via Java, HBase shell, Thrift (Ruby, Python, Perl, C++, ...), etc.
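A hedged HBase shell sketch of the three lookup types (the table and row names here are hypothetical):

# fast lookup by row key (a timestamp can optionally be given)
get 't1', 'row-0042'
# full table scan
scan 't1'
# range scan over a row-key interval
scan 't1', {STARTROW => 'row-0040', STOPROW => 'row-0050'}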
42
¡ Entities

¡ Relationships

¡ Examples: Concerts

43
¡ Entities → Tables
¡ Attributes → Columns
¡ Relationships → Foreign Keys
¡ Many-to-many → Junction tables
¡ Natural keys → Artificial IDs
¡ 1NF, 2NF, 3NF, BCNF, 4NF…

44
¡ Two types of data: too big, or not too big
¡ If the data is not too big, a relational database should be used
§ The model is less likely to change as your business needs change. You may want to ask different questions over time, but if you got the logical model correct, you’ll have the answers.
¡ The data is too big?
§ The relational model doesn’t acknowledge scale.
§ You need to:
§ Add indexes
§ Write really complex, messy SQL
§ Denormalize
§ Cache
§ ……
§ How can NoSQL/HBase help?
45
¡ Table: design-time namespace, has multiple sorted rows.
¡ Row:
§ Atomic key/value container, with one row key
§ Rows are sorted alphabetically by the row key as they are stored
§ Store data in such a way that related rows are near each other (e.g., a website domain)
¡ Column:
§ A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.
¡ A table schema only defines its Column Families
§ Column families physically co-locate a set of columns and their values
§ Column: a key in the k/v container inside a row
§ Value: a time-versioned value in the k/v container
§ Each column consists of any number of versions
§ Each column family has a set of storage properties, such as whether its values should be cached in memory, etc.
§ Columns within a family are sorted and stored together

46
¡ Column:
§ A column qualifier is added to a column family to provide the index for a given piece of data
§ Given a column family content, a column qualifier might be content:html, and another might be content:pdf
§ Column families are fixed at table creation, but column qualifiers are mutable and may differ greatly between rows.
¡ Timestamp: long milliseconds, sorted descending
§ A timestamp is written alongside each value, and is the identifier for a given version of a value.
§ By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell
¡ Cell:
§ A combination of row, column family, and column qualifier; contains a value and a timestamp, which represents the value’s version
¡ (Row, Family:<Column, Value>, Timestamp) → Value
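To make the model concrete, here is a hedged HBase shell sketch (it uses the webtable example of the following slides; note that only column families are declared at creation time, while qualifiers appear per put):

# create a table with two column families
create 'webtable', 'contents', 'anchor'
# write cells: row key, family:qualifier, value (timestamp assigned by the RegionServer)
put 'webtable', 'com.cnn.www', 'anchor:cnnsi.com', 'CNN'
put 'webtable', 'com.cnn.www', 'contents:html', '<html>...'
# read the row back; each cell is returned with its timestamp
get 'webtable', 'com.cnn.www'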

47
HBase is based on Google’s BigTable model: each cell is addressed by row key, column family:column, and timestamp, which together map to a value.

48
¡ Key
§ Byte array
§ Serves as the primary key for the table
§ Indexed for fast lookup
¡ Column Family
§ Has a name (string)
§ Contains one or more related columns
¡ Column Qualifier
§ Belongs to one column family
§ Included inside the row
§ familyName:columnName

Example: a table with column families “contents” and “anchor” (each cell is versioned by timestamp):

Row key        | Time Stamp | Column contents: | Column anchor:
com.apache.www | t12        | <html>           |
com.apache.www | t11        | <html>           |
com.apache.www | t10        |                  | anchor:apache.com = APACHE
com.cnn.www    | t15        |                  | anchor:cnnsi.com = CNN
com.cnn.www    | t13        |                  | anchor:my.look.ca = CNN.com
com.cnn.www    | t6         | <html>           |
com.cnn.www    | t5         | <html>           |
com.cnn.www    | t3         | <html>           |

(anchor:apache.com, anchor:cnnsi.com, etc. are column qualifiers within the “anchor” family.)

49
Version number for each row:
¡ Version Number
§ Unique within each key
§ By default → the system’s timestamp
§ Data type is Long
¡ Value
§ Byte array

(Same webtable example as the previous slide: each row key holds multiple versions of its cells, identified by the timestamps t3 … t15.)

50
Conceptual view: a table with column families “animal:” and “repairs:”

Row        | Timestamp | animal:type | animal:size | repairs:cost
enclosure1 | t2        | zebra       |             |
enclosure1 | t1        | lion        | big         | 1000 EUR
enclosure2 | …         | …           | …           | …

¡ Storage: every "cell" (i.e. the time-versioned value of one column in one row) is stored "fully qualified" (with its full row key, column family, column name, etc.) on disk

Column family animal:
(enclosure1, t2, animal:type) → zebra
(enclosure1, t1, animal:size) → big
(enclosure1, t1, animal:type) → lion

Column family repairs:
(enclosure1, t1, repairs:cost) → 1000 EUR

51
52
¡ Row:
§ The "row" is atomic, and gets flushed to disk periodically. But it doesn’t have to be flushed into just a single file!
§ It can be broken up into different files with different properties, and reads can look at just a subset.
¡ Column Family: divides columns into physical files
§ Columns within the same family are stored together
§ Why? The table is sparse, with many columns
§ No need to scan the whole row when accessing a few columns
§ Making each column its own file would generate too many files

¡ Row keys, column names, values: arbitrary bytes
¡ Table and column family names: printable characters
¡ Timestamps: long integers

53
¡ An HBase schema consists of several Tables
¡ Each table consists of a set of Column Families
§ Columns are not part of the schema
¡ HBase has Dynamic Columns
§ Because column names are encoded inside the cells
§ Different cells can have different columns

(The “Roles” column family has different columns in different cells.)

54
¡ The version number can be user-supplied
§ It does not even have to be inserted in increasing order
§ Version numbers are unique within each key
¡ A table can be very sparse
§ Many cells are empty
¡ Keys are indexed as the primary key

A conceptual view of an HBase table
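A hedged HBase shell sketch of user-supplied versions (table and column names are hypothetical; the family must be declared to retain multiple versions):

# keep up to 3 versions per cell in family 'cf'
create 't1', {NAME => 'cf', VERSIONS => 3}
# user-supplied version numbers (the trailing argument), not in increasing order
put 't1', 'row1', 'cf:q', 'value-two', 2000
put 't1', 'row1', 'cf:q', 'value-one', 1000
# read back several versions of the same cell
get 't1', 'row1', {COLUMN => 'cf:q', VERSIONS => 3}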

55
¡ Each column family is stored in a separate set of files on disk (HFiles)
¡ The key & version numbers are replicated with each column family
¡ Empty cells are not stored

56
57
¡ Column Families are stored separately on disk: access one without wasting I/O on the other
¡ HBase Regions
§ Each HTable is partitioned horizontally into regions
§ Regions are the HBase counterpart to HDFS blocks

Each range of rows will be one region

58
¡ Major Components
§ The MasterServer (HMaster)
§ One master server
§ Responsible for coordinating the slaves
§ Assigns regions, detects failures
§ Admin functions
§ The RegionServer (HRegionServer)
§ Many region servers
§ Manages data regions
§ Serves data for reads and writes (using a log)
§ Region (HRegion)
§ A subset of a table’s rows, like horizontal range partitioning
§ Automatically done
§ The HBase client

59
60
¡ HBase clusters can be huge and coordinating the operations
of the MasterServers, RegionServers, and clients can be a
daunting task, but that’s where Zookeeper enters the picture.
¡ Zookeeper is a distributed cluster of servers that collectively
provides reliable coordination and synchronization services
for clustered applications.

¡ HBase depends on ZooKeeper
¡ By default, HBase manages the ZooKeeper instance
§ E.g., starts and stops ZooKeeper
¡ HMaster and HRegionServers register themselves with ZooKeeper

61
¡ No real indexes
¡ Automatic partitioning
¡ Scale linearly and automatically with new
nodes
¡ Commodity hardware
¡ Fault tolerance
¡ Batch processing

62
63
¡ You need random write, random read, or both (but not neither; otherwise stick to HDFS)
¡ You need to do many thousands of operations per second on multiple TB of data
¡ Your access patterns are well-known and simple

64
Part 3: Introduction to Hive

65
¡ A data warehouse system for Hadoop that
§ facilitates easy data summarization
§ supports ad-hoc queries (still batch though…)
§ was created by Facebook
¡ A mechanism to project structure onto this data and query it using a SQL-like language – HiveQL
§ Interactive console, or
§ Execute scripts
§ Kicks off one or more MapReduce jobs in the background
¡ An ability to use indexes and built-in or user-defined functions
66
¡ Limitations of MapReduce
§ Have to use the M/R model
§ Not reusable
§ Error prone
§ For complex jobs:
§ Multiple stages of Map/Reduce functions
§ Just like asking the developer to write a specific physical execution plan in the database
¡ Hive is intuitive
§ Makes unstructured data look like tables regardless of how it really lays out
§ SQL-based queries can be issued directly against these tables
§ Generates a specific execution plan for each query

67
¡ A subset of SQL covering the most common
statements
¡ Agile data types: Array, Map, Struct, and JSON
objects
¡ User Defined Functions and Aggregates
¡ Regular Expression support
¡ MapReduce support
¡ JDBC support
¡ Partitions and Buckets (for performance
optimization)
¡ Views and Indexes
68
69
create table doc(
  text string
) row format delimited fields terminated by '\n' stored as textfile;

load data local inpath '/home/Words' overwrite into table doc;

SELECT word, COUNT(*) FROM doc LATERAL VIEW
explode(split(text, ' ')) temp AS word GROUP BY word;

70
71
Hive architecture: clients (the Command Line Interface, the Web Interface, and JDBC/ODBC applications via the Thrift Server) submit queries to the Driver (Compiler, Optimizer, Executor), which consults the Metastore.

¡ Metastore
§ The component that stores the system catalog and metadata about tables, columns, partitions, etc.
§ Stored in a relational RDBMS (built-in Derby)

72

¡ Driver: manages the lifecycle of a HiveQL statement as it moves through Hive.
§ Query Compiler: compiles HiveQL into map/reduce tasks
§ Optimizer: generates the best execution plan
§ Execution Engine: executes the tasks produced by the compiler in proper dependency order; the execution engine interacts with the underlying Hadoop instance

73

¡ Thrift Server
§ Cross-language support
§ Provides a Thrift interface and a JDBC/ODBC server, and provides a way of integrating Hive with other applications.

74

¡ Client Components
§ Including the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.

75
¡ Primitive types
§ Integers: TINYINT, SMALLINT, INT, BIGINT
§ Boolean: BOOLEAN
§ Floating point numbers: FLOAT, DOUBLE
§ Fixed point numbers: DECIMAL
§ String: STRING, CHAR, VARCHAR
§ Date and time types: TIMESTAMP, DATE
¡ Complex types
§ Structs: c has type {a INT; b INT}; c.a accesses the first field
§ Maps: M['group']
§ Arrays: ['a', 'b', 'c']; A[1] returns 'b'
¡ Example
§ list< map<string, struct< p1:int, p2:int > > >
§ Represents a list of associative arrays that map strings to structs that contain two ints
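A hedged HiveQL sketch combining these types (the table and column names are illustrative, not from the slides):

CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);
-- field access in queries: subordinates[0], deductions['tax'], address.city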

76
¡ Databases: namespaces that avoid naming conflicts for tables, views, partitions, columns, and so on.
¡ Tables: homogeneous units of data which have the same schema.
§ Analogous to tables in relational DBs.
§ Each table has a corresponding directory in HDFS.
§ An example table, page_views:
§ timestamp: INT; corresponds to a UNIX timestamp of when the page was viewed.
§ userid: BIGINT; identifies the user who viewed the page.
§ page_url: STRING; captures the location of the page.
§ referer_url: STRING; captures the location of the page from where the user arrived at the current page.
§ IP: STRING; captures the IP address from where the page request was made.

77
¡ Partitions:
§ Each table can have one or more partition keys, which determine how the data is stored
§ Example:
§ Given the table page_views, we can define two partition columns: date_partition of type STRING and country_partition of type STRING
§ All "US" data from "2009-12-23" is a partition of the page_views table
§ Partition columns are virtual columns; they are not part of the data itself but are derived on load
§ It is the user’s job to guarantee the relationship between partition name and data content
¡ Buckets: data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table
§ Example: the page_views table may be bucketed by userid (see the sketch below)
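A hedged HiveQL sketch of how partitions and buckets pay off at query time (assuming the page_view table created later in this part, partitioned by dt and country and bucketed by userid into 32 buckets):

-- partition pruning: only files under dt=2008-06-08/country=US are scanned
SELECT page_url, userid
FROM page_view
WHERE dt = '2008-06-08' AND country = 'US';

-- bucket sampling: reads roughly 1/32 of the data
SELECT * FROM page_view TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) s;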
78
Physical layout in HDFS: Tables (directories) → Partitions (subdirectories) → Buckets (files)

79
¡ Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]

See the full CREATE TABLE command at:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
80
¡ SerDe is a short name for "Serializer and Deserializer."
§ Describes how to load the data from the file into a representation that makes it look like a table
¡ Hive uses SerDe (and FileFormat) to read and write table rows:
¡ HDFS files → InputFileFormat → <key, value> → Deserializer → Row object
¡ Row object → Serializer → <key, value> → OutputFileFormat → HDFS files
¡ More details:
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe
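As a hedged illustration, a SerDe is named at table-creation time; this sketch uses the CSV SerDe bundled with Hive 0.14 and later (the table is illustrative):

CREATE TABLE csv_page_views (ts STRING, userid STRING, page_url STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;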

81
row_format:
  DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
  [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char]
  [LINES TERMINATED BY char]
  Default values: Ctrl+A, Ctrl+B, Ctrl+C, new line, respectively

file_format:
    SEQUENCEFILE
  | TEXTFILE -- (default, depending on the hive.default.fileformat configuration)
  | RCFILE -- (available in Hive 0.6.0 and later)
  | ORC -- (available in Hive 0.11.0 and later)
  | PARQUET -- (available in Hive 0.13.0 and later)
  | AVRO -- (available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

82
¡ Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

83
¡ To list existing tables in the warehouse
§ SHOW TABLES;
¡ To list tables with prefix 'page'
§ SHOW TABLES 'page.*';
¡ To list partitions of a table
§ SHOW PARTITIONS page_view;
¡ To list columns and column types of table.
§ DESCRIBE page_view;

84
¡ To rename an existing table to a new name
§ ALTER TABLE old_table_name RENAME TO new_table_name;
¡ To replace the columns of an existing table
§ ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);
¡ To add columns to an existing table
§ ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
¡ To rename a partition
§ ALTER TABLE table_name PARTITION old_partition_spec RENAME TO PARTITION new_partition_spec;
¡ To rename a column
§ ALTER TABLE table_name CHANGE old_col_name new_col_name column_type
¡ More details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable

85
¡ To drop a table
§ DROP TABLE [IF EXISTS] table_name
§ Example:
§ DROP TABLE page_view
¡ To drop a partition
§ ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
§ Example:
§ ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
86
¡ Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
¡ Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
§ Load data from a file in the local file system:
§ LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
§ Load data from a file in HDFS:
§ LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
§ The input data format must be the same as the table format!

87
¡ Insert rows into a table:
§ Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
¡ Inserting data into Hive tables from queries
§ Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement FROM from_statement;
§ Example:
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;
88
¡ Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
¡ Synopsis
§ The referenced column must be a column of the table being updated.
§ The value assigned must be an expression that Hive supports in the select clause. Thus arithmetic operators, UDFs, casts, literals, etc. are supported. Subqueries are not supported.
§ Only rows that match the WHERE clause will be updated.
§ Partitioning columns cannot be updated.
§ Bucketing columns cannot be updated.
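A hedged HiveQL sketch (UPDATE requires a transactional table: Hive 0.14+ with ACID enabled, ORC storage, and bucketing; the table and values are illustrative):

CREATE TABLE emps (id INT, name STRING, salary FLOAT)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');

-- give employee 42 a 10% raise
UPDATE emps SET salary = salary * 1.1 WHERE id = 42;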
89
¡ Select syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]

90
¡ Difference between Order By and Sort By
§ The former guarantees total order in the output, while the latter only guarantees ordering of the rows within a reducer
¡ Cluster By
§ Cluster By is a shortcut for both Distribute By and Sort By.
§ Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

(Diagram: with Distribute By, all rows sharing a key, e.g. x1, land on the same reducer but arrive unsorted; with Cluster By, each reducer additionally receives its rows sorted. See the HiveQL sketch below.)
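A hedged HiveQL sketch of the equivalence (table t and columns key, value are illustrative):

-- rows with the same key go to the same reducer, sorted by key within it
SELECT key, value FROM t DISTRIBUTE BY key SORT BY key;
-- CLUSTER BY is shorthand for exactly the distribute-and-sort above
SELECT key, value FROM t CLUSTER BY key;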
91
¡ Select column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
¡ Select all rows from partition ds=2008-08-15 of the invites table into an HDFS directory.
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
¡ Select all rows from the pokes table into a local directory.
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

92
¡ Count the number of distinct users by gender
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
¡ Multiple DISTINCT expressions in the same query are not allowed:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
93
¡ Hive does not support join conditions that are not equality conditions
§ it is very difficult to express such conditions as a map/reduce job
§ SELECT a.* FROM a JOIN b ON (a.id = b.id)
§ However, the following statement is not allowed:
§ SELECT a.* FROM a JOIN b ON (a.id <> b.id)
¡ More than 2 tables can be joined in the same query.
§ SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
¡ Example:
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

94
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

(Abstract Syntax Tree)


(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF bible k) (= (. (TOK_TABLE_OR_COL s)
word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k)
freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))

(the AST is then compiled into one or more MapReduce jobs)

95
¡ Built-in operators:
§ relational, arithmetic, logical, etc.
¡ Built-in functions:
§ mathematical, date functions, string functions, etc.
¡ Built-in aggregate functions:
§ max, min, count, etc.
¡ Built-in table-generating functions: transform a single input row to multiple output rows
§ explode(ARRAY): returns one row for each element from the array
§ explode(MAP): returns one row for each key-value pair from the input map, with two columns in each row
¡ Custom UDFs can be created
¡ More details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
96
¡ Create a table in Hive
create table doc(
  text string
) row format delimited fields terminated by '\n' stored as textfile;
¡ Load the file into the table
load data local inpath '/home/Words' overwrite into table doc;
¡ Compute word count using select
SELECT word, COUNT(*) FROM doc LATERAL VIEW
explode(split(text, ' ')) temp AS word GROUP BY word;
§ Lateral view is used in conjunction with user-defined table-generating functions such as explode()
§ A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to form a virtual table
97
¡ Load Data

¡ Two insertions from select

98
¡ Pros
§ An easy way to process large-scale data
§ Supports SQL-based queries
§ Provides more user-defined interfaces to extend
§ Programmability
§ Efficient execution plans for performance
§ Interoperability with other databases
¡ Cons
§ No easy way to append data
§ Files in HDFS are immutable
99
¡ Log processing
§ Daily Report
§ User Activity Measurement
¡ Data/Text mining
§ Machine learning (Training Data)
¡ Business intelligence
§ Advertising Delivery
§ Spam Detection

100
¡ https://hbase.apache.org/book.html
¡ https://cwiki.apache.org/confluence/display/Hive/Home#Home-HiveDocumentation
¡ http://www.tutorialspoint.com/hive/
¡ Hadoop: The Definitive Guide, HBase chapter
¡ Hadoop: The Definitive Guide, Hive chapter

101
102
