CIS - 468 - 04 - NOSQL Databases and Big Data Storage Systems
2
¡ Relational model with schemas
¡ Powerful, flexible query language (SQL)
¡ Transactional semantics: ACID
¡ Rich ecosystem, lots of tool support (MySQL,
PostgreSQL, etc.)
3
The name stands for Not Only SQL
Does not use SQL as its query language
Class of non-relational data storage systems
¡ The term NOSQL was introduced by Eric Evans for an event organized to discuss open-source distributed databases
It is not a replacement for an RDBMS but complements it
All NoSQL offerings relax one or more of the ACID
properties (will talk about the CAP theorem)
4
¡ Key features (advantages):
§ non-relational
§ don’t require strict schema
§ data are replicated to multiple nodes (so, identical & fault-tolerant) and
can be partitioned:
§ down nodes easily replaced
§ no single point of failure
§ horizontally scalable
§ cheap, easy to implement
(open-source)
§ massive write performance
§ fast key-value access
5
¡ Web apps have different needs (than the apps
that RDBMS were designed for)
§ Low and predictable response time (latency)
§ Scalability & elasticity (at low cost!)
§ High availability
§ Flexible schemas / semi-structured data
§ Geographic distribution (multiple data centers)
¡ Web apps can (usually) do without
§ Transactions / strong consistency / integrity
§ Complex queries
6
¡ Google (BigTable)
¡ LinkedIn (Voldemort)
¡ Facebook (Cassandra)
¡ Twitter (HBase, Cassandra)
¡ Baidu (HyperTable)
7
¡ Three major papers were the seeds of the
NoSQL movement
§ BigTable (Google)
§ Dynamo (Amazon)
§ Ring partition and replication
§ Gossip protocol (discovery and error detection)
§ Distributed key-value data store
§ Eventual consistency
§ CAP Theorem (discussed in the next few slides)
8
¡ Consider three properties of a distributed system (sharing data)
§ Consistency:
§ all copies have the same value
§ Availability:
§ reads and writes always succeed, even if a node in the cluster goes down
§ Partition-tolerance:
§ system properties (consistency and/or availability) hold even when network failures prevent some machines from communicating with others
[Figure: triangle with vertices C, A, P]
9
¡ Brewer’s CAP Theorem:
§ For any system sharing data, it is “impossible” to
guarantee simultaneously all of these three properties
§ You can have at most two of these three properties for
any shared-data system
¡ Very large systems will “partition” at some point:
§ That leaves either C or A to choose from (traditional
DBMS prefers C over A and P )
§ In almost all cases, you would choose A over C (except
in specific applications such as order processing)
10
[Figure: CAP triangle (Consistency, Availability, Partition tolerance). Consistency: all clients always have the same view of the data; once a writer has written, all readers will see that write.]
12
¡ A consistency model determines rules for
visibility and apparent order of updates
¡ Example:
§ Row X is replicated on nodes M and N
§ Client A writes row X to node N
§ Some period of time t elapses
§ Client B reads row X from node M
§ Does client B see the write from client A?
§ Consistency is a continuum with tradeoffs
§ For NOSQL, the answer would be: “maybe”
§ CAP theorem states: “strong consistency can't be
achieved at the same time as availability and
partition-tolerance”
13
¡ When no updates occur for a long period of time,
eventually all updates will propagate through the
system and all the nodes will be consistent
¡ For a given accepted update and a given node,
eventually either the update reaches the node or
the node is removed from service
¡ Known as BASE (Basically Available, Soft state,
Eventual consistency), as opposed to ACID
¡ http://en.wikipedia.org/wiki/Eventual_consistency
14
¡ The types of large systems based on CAP aren't ACID; they are BASE
(http://queue.acm.org/detail.cfm?id=1394128):
§ Basically Available - system seems to work all the time
§ Soft State - it doesn't have to be consistent all the time
§ Eventually Consistent - becomes consistent at some
later time
16
[Figure: CAP triangle. Partition tolerance: the system can continue to operate in the presence of network partitions.]
17
[Figure: CAP triangle. CAP Theorem: you can have at most two of these properties for any shared-data system.]
18
¡ Key-Value stores
§ Simple K/V lookups (DHT)
¡ Column stores
§ Each key is associated with many attributes (columns)
§ NoSQL column stores are actually hybrid row/column
stores
§ Different from “pure” relational column stores!
¡ Document stores
§ Store semi-structured documents (JSON)
¡ Graph databases
§ Neo4j, etc.
§ Not exactly NoSQL
§ can’t satisfy the requirements for High Availability and
Scalability/Elasticity very well
19
¡ Focus on scaling to huge amounts of data
¡ Designed to handle massive load
¡ Based on Amazon's Dynamo paper
¡ Data model: (global) collection of Key-value
pairs
¡ Dynamo ring partitioning and replication
¡ Example: (DynamoDB)
§ Items having one or more attributes (name, value)
§ An attribute can be single-valued or multi-valued (e.g., a set)
§ Items are combined into a table
20
¡ Basic API access:
§ get(key): extract the value given a key
§ put(key, value): create or update the value given
its key
§ delete(key): remove the key and its associated
value
§ execute(key, operation, parameters): invoke an operation on the value (given its key), where the value is a special data structure (e.g., List, Set, Map, etc.)
21
¡ Pros:
§ very fast
§ very scalable (horizontally distributed to nodes
based on key)
§ simple data model
§ eventual consistency
§ fault-tolerance
¡ Cons
§ Can't model more complex data structures such as objects
22
Name      | Producer             | Data model                                                                      | Querying
SimpleDB  | Amazon               | set of couples (key, {attribute}), where attribute is a couple (name, value)   | restricted SQL; Select, Delete, GetAttributes, and PutAttributes operations
Redis     | Salvatore Sanfilippo | set of couples (key, value), where value is a simple typed value, a list, an ordered (by ranking) or unordered set, or a hash value | primitive operations for each value type
Dynamo    | Amazon               | like SimpleDB                                                                   | simple get operation and put in a context
Voldemort | LinkedIn             | like SimpleDB                                                                   | similar to Dynamo
23
¡ Can model more complex objects
¡ Inspired by Lotus Notes
¡ Data model: collection of documents
¡ Document: JSON (JavaScript Object Notation: a key-value data model that supports objects, records, structs, lists, arrays, maps, dates, and Booleans, with nesting), XML, or other semi-structured formats
¡ Example: (MongoDB) document
§ {Name:"Jaroslav",
Address:"Malostranske nám. 25, 118 00 Praha 1”,
Grandchildren: {Claire: "7", Barbara: "6", "Magda: "3",
"Kirsten: "1", "Otis: "3", Richard: "1“}
Phones: [ “123-456-7890”, “234-567-8963” ]
}
24
[Table of document store systems: Name, Producer, Data model, Querying]
25
¡ Based on Google’s BigTable paper
¡ Like column oriented relational databases (store data in
column order) but with a twist
¡ Tables similar to those in an RDBMS, but they handle semi-structured data
¡ Data model:
§ Collection of Column Families
§ Column family = (key, value) where value = set of related columns (standard, super)
§ indexed by row key, column key and timestamp
26
27
28
29
¡ One column family can have variable
numbers of columns
¡ Cells within a column family are sorted
“physically”
¡ Very sparse, most cells have null values
¡ Comparison: RDBMS vs column-based NOSQL
§ Query on multiple tables
§ RDBMS: must fetch data from several places on disk and glue
together
§ Column-based NOSQL: only fetch the column families of those columns that are required by a query (all columns in a column family are stored together on disk, so multiple rows can be retrieved in one read operation: data locality)
30
[Table of column store systems: Name, Producer, Data model, Querying]
31
¡ Focus on modeling the structure of data
(interconnectivity)
¡ Scales to the complexity of data
¡ Inspired by mathematical Graph Theory (G = (V, E))
¡ Data model:
§ (Property Graph) nodes and edges
§ Nodes may have properties (including ID)
§ Edges may have labels or roles
§ Key-value pairs on both
¡ Interfaces and query languages vary
¡ Single-step vs path expressions vs full recursion
¡ Example:
§ Neo4j, FlockDB, Pregel, InfoGrid …
32
¡ Advantages
§ Massive scalability
§ High availability
§ Lower cost (than competitive solutions at that scale)
§ (usually) predictable elasticity
§ Schema flexibility, sparse & semi-structured data
¡ Disadvantages
§ Don’t fully support relational features
§ no join, group by, order by operations (except within partitions)
§ no referential integrity constraints across partitions
§ No declarative query language (e.g., SQL) → more programming
§ Eventual consistency is not intuitive to program for
§ Makes client applications more complicated
§ No easy integration with other applications that support SQL
§ Relaxed ACID (see the CAP theorem) → fewer guarantees
33
¡ NOSQL databases cover only a part of data-intensive cloud applications (mainly Web applications)
¡ Problems with cloud computing:
§ SaaS (Software as a Service or on-demand software)
applications require enterprise-level functionality,
including ACID transactions, security, and other
features associated with commercial RDBMS
technology, i.e. NOSQL should not be the only option
in the cloud
§ Hybrid solutions:
§ Voldemort with MySQL as one of its storage backends
§ treat NOSQL data as semi-structured data → integrating RDBMS and NOSQL via SQL/XML
34
Part 2: Introduction to HBase
35
¡ HBase is an open-source, distributed, column-oriented database built on top of HDFS and modeled on BigTable
§ Distributed – uses HDFS for storage
§ Row/column store
§ Column-oriented - nulls are free
§ Multi-Dimensional (Versions)
§ Untyped - stores byte[]
¡ HBase is part of Hadoop
¡ HBase is the Hadoop application to use when you
require real-time read/write random access to
very large datasets
§ Aim to support low-latency random access
36
¡ A sparse, distributed, persistent multi-dimensional sorted map
¡ Sparse
§ Sparse data is supported with no waste of costly storage space
§ HBase can handle the fact that we don’t (yet) know that information
§ HBase is a schema-less data store; that is, it's fluid: you can add to, subtract from, or modify the schema as you go along
¡ Distributed and persistent
§ Persistent simply means that the data you store in HBase will persist or remain after your program or session ends
§ Just as HBase is an open source implementation of BigTable, HDFS is an
open source implementation of GFS.
§ HBase leverages HDFS to persist its data to disk storage.
§ By storing data in HDFS, HBase offers reliability, availability, seamless
scalability and high performance — all on cost effective distributed
servers.
37
¡ Multi-dimensional sorted map
§ A map (also known as an associative array) is an
abstract collection of key-value pairs, where the
key is unique.
§ The keys are stored in HBase and sorted.
§ Each value can have multiple versions, which
makes the data model multidimensional. By
default, data versions are implemented with a
timestamp.
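A quick sketch of versions in the HBase shell (one of the access paths named later in these slides); the table notes and column cf:msg are made up for illustration:
create 'notes', {NAME => 'cf', VERSIONS => 3}              # keep up to 3 versions per cell
put 'notes', 'row1', 'cf:msg', 'first note'                # version = current timestamp
put 'notes', 'row1', 'cf:msg', 'second note'               # a newer version of the same cell
get 'notes', 'row1', {COLUMN => 'cf:msg', VERSIONS => 3}   # returns both versions, newest first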
38
HBase is built on top of YARN and HDFS
39
¡ Both are distributed systems that scale to hundreds or
thousands of nodes
¡ HDFS is good for batch processing (scans over big files)
§ Not good for record lookup
§ Not good for incremental addition of small batches
§ Not good for updates
¡ HBase is designed to efficiently address the above
points
§ Fast record lookup
§ Support for record-level insertion
§ Support for updates (not in place)
¡ HBase updates are done by creating new versions of
values
40
If your application has neither random reads nor writes → stick to HDFS
41
¡ Tables have one primary index, the row key.
¡ No join operators.
¡ Scans and queries can select a subset of
available columns, perhaps by using a
wildcard.
¡ There are three types of lookups:
§ Fast lookup using row key and optional timestamp.
§ Full table scan
§ Range scan from region start to end.
¡ Limited atomicity and transaction support.
§ HBase supports batched mutations of single rows only.
§ Data is unstructured and untyped.
¡ Not accessed or manipulated via SQL.
§ Programmatic access via Java, HBase shell, Thrift (Ruby, Python, Perl, C++, ..) etc.
42
¡ Entities
¡ Relationships
¡ Examples: Concerts
43
¡ Entities → Tables
¡ Attributes → Columns
¡ Relationships → Foreign Keys
¡ Many-to-many → Junction tables
¡ Natural keys → Artificial IDs
¡ 1NF, 2NF, BCNF, 3NF, 4NF…
44
¡ Two types of data: too big, or not too big
¡ If data is not too big, a relational database should be used
§ The model is less likely to change as your business needs
change. You may want to ask different questions over time, but
if you got the logical model correct, you'll have the answers.
¡ What if the data is too big?
§ The relational model doesn't acknowledge scale.
§ You need to:
§ Add indexes
§ Write really complex, messy SQL
§ Denormalize
§ Cache
§ ……
§ How can NoSQL/HBase help?
45
¡ Table: Design-time namespace, has multiple
sorted rows.
¡ Row:
§ Atomic key/value container, with one row key
§ Rows are sorted alphabetically by the row key as they are stored
§ store data in such a way that related rows are near each other (e.g., a website domain)
¡ Column:
§ A column in HBase consists of a column family and a column qualifier, which are delimited by
a : (colon) character.
¡ A table's schema only defines its Column Families
§ Column families physically co-locate a set of columns and their values
§ Column: a key in the k/v container inside a row
§ Value: a time-versioned value in the k/v container
§ Each column consists of any number of versions
§ Each column family has a set of storage properties, such as whether its values should be
cached in memory etc.
§ Columns within a family are sorted and stored together
46
¡ Column:
§ A column qualifier is added to a column family to provide the index for a given piece of data
§ Given a column family content, a column qualifier might be content:html, and another might
be content:pdf
§ Column families are fixed at table creation, but column qualifiers are mutable and may differ
greatly between rows.
¡ Cell:
§ A combination of row, column family, and column qualifier, and contains a value and a
timestamp, which represents the value’s version
¡ (Row, Family:Column, Timestamp) → Value
47
HBase is based on Google's Bigtable model
[Figure: Bigtable model showing Row key, Column Family, TimeStamp, and value]
48
¡ Key (Row key)
§ Byte array
§ Serves as the primary key for the table
§ Indexed for fast lookup
¡ Column Family
§ Has a name (string)
§ Contains one or more related columns
¡ Column Qualifier
§ Belongs to one column family
§ Included inside the row
§ familyName:columnName
[Example table with column families "contents:" and "anchor:"]
Row key        | Time Stamp | contents: | anchor:
com.apache.www | t12        | <html>…   |
com.apache.www | t11        | <html>…   |
com.apache.www | t10        |           | anchor:apache.com = "APACHE"
com.cnn.www    | t15        |           | anchor:cnnsi.com = "CNN"
com.cnn.www    | t13        |           | anchor:my.look.ca = "CNN.com"
com.cnn.www    | t6         | <html>…   |
com.cnn.www    | t5         | <html>…   |
com.cnn.www    | t3         | <html>…   |
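A hedged sketch of how this example table could be created and populated in the HBase shell (the table name webtable is an assumption for illustration):
create 'webtable', 'contents', 'anchor'
put 'webtable', 'com.apache.www', 'anchor:apache.com', 'APACHE'
put 'webtable', 'com.cnn.www', 'anchor:cnnsi.com', 'CNN'
put 'webtable', 'com.cnn.www', 'anchor:my.look.ca', 'CNN.com'
put 'webtable', 'com.cnn.www', 'contents:', '<html>...'
scan 'webtable'   # prints every cell with its row key, column, and timestamp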
49
¡ Version Number
§ Unique within each column key
§ By default → system's timestamp
§ Data type is Long
¡ Value
§ Byte array
[Same example table as the previous slide; every cell carries a version number.]
50
[Example table with columns Row, Timestamp and column families "animal:" and "repairs:"]
51
52
¡ Row:
§ The "row" is atomic, and gets flushed to disk periodically.
But it doesn't have to be flushed into just a single file!
§ It can be broken up into different files with different
properties, an reads can look at just a subset.
¡ Column Family: divide columns into physical files
§ Columns within the same family are stored together
§ Why? Table is sparse, many columns
§ No need to scan the whole row when accessing a few columns
§ One file per column would generate too many files
53
¡ HBase schema consists of several Tables
¡ Each table consists of a set of Column Families
§ Columns are not part of the schema
¡ HBase has Dynamic Columns
§ Because column names are encoded inside the cells
§ Different cells can have different columns
54
¡ The version number can be user-supplied
§ Versions do not even have to be inserted in increasing order
§ Version numbers are unique within each key
55
¡ Each column family is stored in a separate file
(called HTables)
¡ Key & Version numbers are replicated with
each column family
¡ Empty cells are not stored
56
57
¡ Column Families stored separately on disk:
access one without wasting I/O on the other
¡ HBase Regions
§ Each HTable (column family) is partitioned
horizontally into regions
§ Regions are the counterpart to HDFS blocks
58
¡ Major Components
§ The MasterServer (HMaster)
§ One master server
§ Responsible for coordinating the slaves
§ Assigns regions, detects failures
§ Admin functions
§ The RegionServer (HRegionServer)
§ Many region servers
§ Region (HRegion)
§ A subset of a table’s rows, like horizontal range partitioning
§ Automatically done
§ Manages data regions
§ Serves data for reads and writes (using a log)
§ The HBase client
59
60
¡ HBase clusters can be huge and coordinating the operations
of the MasterServers, RegionServers, and clients can be a
daunting task, but that’s where Zookeeper enters the picture.
¡ Zookeeper is a distributed cluster of servers that collectively
provides reliable coordination and synchronization services
for clustered applications.
61
¡ No real indexes
¡ Automatic partitioning
¡ Scale linearly and automatically with new
nodes
¡ Commodity hardware
¡ Fault tolerance
¡ Batch processing
62
63
¡ You need random write, random read, or both (if you need neither, stick to HDFS)
64
Part 3: Introduction to Hive
65
¡ A data warehouse system for Hadoop that
§ facilitates easy data summarization
§ supports ad-hoc queries (still batch though…)
§ created by Facebook
¡ A mechanism to project structure onto this data
and query the data using a SQL-like language –
HiveQL
§ Interactive-console –or-
§ Execute scripts
§ Kicks off one or more MapReduce jobs in the
background
¡ The ability to use indexes and built-in or user-defined functions
66
¡ Limitation of MR
§ Have to use M/R model
§ Not Reusable
§ Error prone
§ For complex jobs:
§ Multiple stage of Map/Reduce functions
§ Just like asking developers to write a specified physical execution plan in the database
¡ Hive is intuitive
§ Makes unstructured data look like tables, regardless of how it is really laid out
§ SQL-based queries can be issued directly against these tables
§ Generates a specified execution plan for each query
67
¡ A subset of SQL covering the most common
statements
¡ Agile data types: Array, Map, Struct, and JSON
objects
¡ User Defined Functions and Aggregates
¡ Regular Expression support
¡ MapReduce support
¡ JDBC support
¡ Partitions and Buckets (for performance
optimization)
¡ Views and Indexes
68
69
create table doc(
text string
) row format delimited fields terminated by '\n' stored as textfile;
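The word-count query that typically accompanies this example (a sketch; splitting on a single space is an assumption about the input):
SELECT word, count(1) AS cnt
FROM (SELECT explode(split(text, ' ')) AS word FROM doc) w
GROUP BY word
ORDER BY word;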
70
71
[Architecture: clients (Command Line Interface, Web Interface, JDBC/ODBC, Thrift Server) → Driver (Compiler, Optimizer, Executor), with the Metastore alongside]
¡ Metastore
§ The component that stores the system catalog and metadata about tables, columns, partitions, etc.
§ Stored in a relational RDBMS (built-in Derby)
72
73
¡ Thrift Server
§ Cross-language support
§ Provides a Thrift interface and a JDBC/ODBC server, offering a way of integrating Hive with other applications
74
¡ Client Components
§ Include the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.
75
¡ Primitive types
§ Integers: TINYINT, SMALLINT, INT, BIGINT.
§ Boolean: BOOLEAN.
§ Floating point numbers: FLOAT, DOUBLE.
§ Fixed point numbers: DECIMAL
§ String: STRING, CHAR, VARCHAR.
§ Date and time types: TIMESTAMP, DATE
¡ Complex types
§ Structs: c has type {a INT; b INT}. c.a to access the first field
§ Maps: M['group'].
§ Arrays: ['a', 'b', 'c'], A[1] returns 'b'.
¡ Example
§ list< map<string, struct< p1:int,p2:int > > >
§ Represents list of associative arrays that map strings to structs
that contain two ints
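A minimal sketch of declaring and querying these complex types (the table complex_demo and its columns are hypothetical):
CREATE TABLE complex_demo (
c STRUCT<a:INT, b:INT>,
m MAP<STRING, INT>,
arr ARRAY<STRING>
);
SELECT c.a, m['group'], arr[1] FROM complex_demo;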
76
¡ Databases: Namespaces function to avoid naming conflicts
for tables, views, partitions, columns, and so on.
¡ Tables: Homogeneous units of data which have the same
schema.
§ Analogous to tables in relational DBs.
§ Each table has corresponding directory in HDFS.
§ An example table: page_views:
§ timestamp—which is of INT type that corresponds to a UNIX timestamp
of when the page was viewed.
§ userid —which is of BIGINT type that identifies the user who viewed the
page.
§ page_url—which is of STRING type that captures the location of the
page.
§ referer_url—which is of STRING that captures the location of the page
from where the user arrived at the current page.
§ IP—which is of STRING type that captures the IP address from where
the page request was made.
77
¡ Partitions:
§ Each Table can have one or more partition Keys which
determines how the data is stored
§ Example:
§ Given the table page_views, we can define two partition columns: a date_partition of type STRING and a country_partition of type STRING (a pruning query is sketched at the end of this slide)
§ All "US" data from "2009-12-23" is a partition of the page_views
table
§ Partition columns are virtual columns, they are not part of
the data itself but are derived on load
§ It is the user's job to guarantee the relationship between
partition name and data content
¡ Buckets: Data in each partition may in turn be divided
into Buckets based on the value of a hash function of
some column of the Table
§ Example: the page_views table may be bucketed by userid
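For instance, a query restricted to one partition reads only that partition's data (a sketch using the partition columns defined above):
SELECT page_url, userid
FROM page_views
WHERE date_partition = '2009-12-23' AND country_partition = 'US';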
78
Tables (dir) → Partitions (dir) → Buckets (file)
79
¡ Syntax:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
81
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
Default values: Ctrl+A, Ctrl+B, Ctrl+C, new line, respectively
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
82
¡ Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime)
INTO 32 BUCKETS
83
¡ To list existing tables in the warehouse
§ SHOW TABLES;
¡ To list tables with prefix 'page'
§ SHOW TABLES 'page.*';
¡ To list partitions of a table
§ SHOW PARTITIONS page_view;
¡ To list columns and column types of table.
§ DESCRIBE page_view;
84
¡ To rename existing table to a new name
§ ALTER TABLE old_table_name RENAME TO new_table_name;
¡ To rename the columns of an existing table
§ ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);
¡ To add columns to an existing table
§ ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int
column', c2 STRING DEFAULT 'def val');
¡ To rename a partition
§ ALTER TABLE table_name PARTITION old_partition_spec RENAME TO
PARTITION new_partition_spec;
¡ To rename a column
§ ALTER TABLE table_name CHANGE old_col_name new_col_name
column_type
¡ For more details see:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable
85
¡ To drop a table
§ DROP TABLE [IF EXISTS] table_name
§ Example:
§ DROP TABLE page_view
¡ To drop a partition
§ ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
§ Example:
§ ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
86
¡ Hive does not do any transformation while loading
data into tables. Load operations are currently pure
copy/move operations that move datafiles into
locations corresponding to Hive tables.
¡ Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO
TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
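A hedged example (the local file path is hypothetical; the page_views partitions follow the earlier example):
LOAD DATA LOCAL INPATH '/tmp/page_views_us.txt'
OVERWRITE INTO TABLE page_views
PARTITION (date_partition='2009-12-23', country_partition='US');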
87
¡ Insert rows into a table:
§ Syntax
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1],
partcol2[=val2] ...)] VALUES values_row [, values_row ...]
§ Example:
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;
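An example of the VALUES form above (a hypothetical students(name, age, gpa) table; VALUES requires Hive 0.14 or later):
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);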
88
¡ Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
¡ Synopsis
§ The referenced column must be a column of the table
being updated.
§ The value assigned must be an expression that Hive
supports in the select clause. Thus arithmetic
operators, UDFs, casts, literals, etc. are supported.
Subqueries are not supported.
§ Only rows that match the WHERE clause will be
updated.
§ Partitioning columns cannot be updated.
§ Bucketing columns cannot be updated.
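A minimal example against the hypothetical students table above (note: in Hive, UPDATE only works on tables that support ACID transactions, e.g. transactional ORC tables):
UPDATE students
SET age = age + 1
WHERE name = 'fred flintstone';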
89
¡ Select Syntax:
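The general form (paraphrased from the Hive LanguageManual):
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]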
90
¡ Difference between Order By and Sort By
§ The former guarantees total order in the output while the latter only
guarantees ordering of the rows within a reducer
¡ Cluster By
§ Cluster By is a short-cut for both Distribute By and Sort By.
§ Hive uses the columns in Distribute By to distribute the rows among reducers.
All rows with the same Distribute By columns will go to the same reducer.
However, Distribute By does not guarantee clustering or sorting properties on
the distributed keys.
[Figure: rows x1..x4 spread over two reducers. Distribute By sends all rows with the same key to the same reducer but leaves them unsorted; Cluster By does the same and also sorts the rows within each reducer.]
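Two equivalent sketches (table t and columns col1, col2 are hypothetical):
SELECT col1, col2 FROM t DISTRIBUTE BY col1 SORT BY col1;
-- CLUSTER BY shorthand for the same thing:
SELECT col1, col2 FROM t CLUSTER BY col1;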
91
¡ Selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
92
¡ Count the number of distinct users by gender
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
94
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;
95
¡ Built-in operators:
§ relational, arithmetic, logical, etc.
¡ Built-in functions:
§ mathematical, date function, string function, etc.
¡ Built-in aggregate functions:
§ max, min, count, etc.
¡ Built-in table-generating functions: transform a single
input row to multiple output rows
§ explode(ARRAY): Returns one row for each element from the
array.
§ explode(MAP): Returns one row for each key-value pair from the
input map with two columns in each row
¡ Create Custom UDFs
¡ For more details see:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode
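Hedged examples of explode, reusing the doc table from the word-count example:
SELECT explode(split(text, ' ')) AS word FROM doc;
-- keep the original row alongside the generated rows:
SELECT d.text, w.word
FROM doc d LATERAL VIEW explode(split(d.text, ' ')) w AS word;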
96
¡ Create a table in Hive
98
¡ Pros
§ An easy way to process large-scale data
§ Support for SQL-based queries
§ Provides more user-defined interfaces to extend
§ Programmability
§ Efficient execution plans for performance
§ Interoperability with other databases
¡ Cons
§ No easy way to append data
§ Files in HDFS are immutable
99
¡ Log processing
§ Daily Report
§ User Activity Measurement
¡ Data/Text mining
§ Machine learning (Training Data)
¡ Business intelligence
§ Advertising Delivery
§ Spam Detection
100
¡ https://hbase.apache.org/book.html
¡ https://cwiki.apache.org/confluence/display/Hive/Home#Home-HiveDocumentation
¡ http://www.tutorialspoint.com/hive/
¡ Hadoop: The Definitive Guide, HBase chapter
¡ Hadoop: The Definitive Guide, Hive chapter
101
102