
Big SQL

August 22, 2014 © 2013 IBM Corporation


Agenda
 What is Big SQL?
– Why Big SQL?

 Big SQL deep dive
– Architecture
– SQL Features
– Client Drivers
– HBase Support

 Using Big SQL

 Conclusion



SQL Access for Hadoop: Why?
 Data warehouse augmentation is the leading Hadoop use case

[Figure: three augmentation patterns: (1) Pre-Processing Hub, (2) Query-able Archive, (3) Exploratory Analysis. BigInsights serves as a landing zone for all data and, together with Streams (real-time processing) and Information Integration, can combine Data Warehouse data with unstructured information.]

 MapReduce is difficult
– The MapReduce Java API is tedious and requires programming expertise
– Unfamiliar languages (e.g., Pig) also require special skills

 SQL support opens the data to a much wider audience
– Familiar, widely known syntax
– Common catalog for identifying data and structure
– Declarative: clear separation of the what (the data you're after) vs. the how (the processing)



Big SQL vs. Hive
 Big SQL is architecturally similar to Hive and reuses Hive components
– HCatalog
– Data storage (storage handlers, SerDes)

 Better SQL support
– Subqueries (see the sketch below)
– More functions (window functions, etc.)
– Better JDBC/ODBC drivers and integration with tools like Cognos

 Better performance
– Efficient local execution of point queries
– Better optimizer and support for complex queries
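
To make the subquery point concrete, here is a hedged sketch of a correlated subquery, the kind Hive 0.11 still could not express (it added only non-correlated subqueries). The employee table and its columns are illustrative, not from this deck:

-- Illustrative schema: find employees earning more than their
-- department's average. The inner query is correlated with the
-- outer row via e.dept, which Big SQL supports and Hive did not.
SELECT e.fname, e.lname, e.salary
FROM employee e
WHERE e.salary > (
    SELECT AVG(salary)
    FROM employee
    WHERE dept = e.dept
);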



SQL for Hadoop: What’s the Problem?
 SQL Access to data in Hadoop is challenging
– Data is in many formats
• CSV, JSON, Hive RCFile, HBase, ...
• Some formats (HBase composite keys) don’t map cleanly
to relational models
– No schemas or statistics
– Hadoop was not designed to be a query engine

 Hive (with HiveQL): limited query access for Hadoop
– SQL-like, but NOT SQL
• Limited data types: no varchar(n), decimal(p,s), etc.
• Limited join support
• No subqueries (Hive 0.11 adds non-correlated subquery support)
• No windowed aggregates (added in Hive 0.11)
– Very limited JDBC/ODBC driver
– Everything executes in MapReduce

 Cloudera Impala is not the answer either
– Subject to Hive's (0.9) query limitations and restrictions
– Support for a limited set of file formats
– Currently limited to broadcast joins (all tables must fit in memory)



Big SQL: Native SQL Query Access for Hadoop
 Native SQL access to data stored in BigInsights
– ANSI SQL-92+
– Standard syntax support (joins, data types, …)

 Real JDBC/ODBC drivers
– Prepared statements
– Cancel support
– Database metadata API support
– Secure socket connections (SSL)

 Optimization
– Leveraging MapReduce parallelism, or…
– Direct access for low-latency queries

 Varied data sources
– HBase (including secondary indexes)
– CSV, delimited files, sequence files
– JSON
– Hive tables

[Figure: an application issues SQL through a JDBC/ODBC driver to a JDBC/ODBC server, which feeds the Big SQL engine; the engine reads the data sources (Hive tables, HBase tables, CSV files) inside BigInsights.]



Architecture
 Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables

 The SQL engine analyzes incoming queries
– Separates the portion(s) to execute at the server vs. the portion(s) to execute on the cluster
– Re-writes the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces an execution plan
– Executes and coordinates the query

[Figure: an application connects through SQL and a JDBC/ODBC driver to the Big SQL server (network protocol, SQL engine, storage handlers for delimited files, sequence files, HBase, RDBMS, …), which runs on the head nodes alongside the JobTracker, NameNode, and Hive metastore; each compute node in the BigInsights cluster runs a TaskTracker, DataNode, and HBase RegionServer.]



Data Sources
 Big SQL utilizes HCatalog for data access

 The HCatalog project provides APIs to:
– Access the Hive catalogs (metastore)
– Utilize the Hive storage engines to read/write data

 Big SQL can access any data accessible from Hive + HCatalog, e.g.
– Delimited files, sequence files, RC files, partitioned tables, custom formats

 Existing Hive SerDes provide data encoding, e.g.
– Text, binary, Avro, Thrift, JSON, custom

 Big SQL also provides the ability to create virtual tables
– Data is synthesized via Jaql scripts

 Big SQL provides its own HBase storage handler

[Figure: the SQL engine talks to the HCatalog API, which dispatches to the Big SQL and Hive storage handlers (delimited files, sequence files, RC files, HBase, …).]



HBase Support
 Robust HBase support is a major Big SQL focus
 You should use Hive tables if
– You mainly run analytic (OLAP) queries that scan and aggregate all or much of the data
 You should use HBase tables if
– You want to look up single key values (one customer, one order)
– You want to update data

 Big SQL supports (see the sketch below)
– Insert (upsert)
– Delete/update (not transactional in case of server failure)
– Indexes
– Dense columns
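
A minimal sketch of what an HBase table definition looks like in Big SQL, assuming the CREATE HBASE TABLE ... COLUMN MAPPING form; the table name, column family (cf), and columns are illustrative, and the exact clause spelling may vary by BigInsights release:

-- Illustrative only: the HBase row key holds order_id and the
-- cf column family holds the remaining columns.
CREATE HBASE TABLE orders (
  order_id  VARCHAR(20),
  cust_name VARCHAR(50),
  amount    DOUBLE
)
COLUMN MAPPING (
  key       MAPPED BY (order_id),
  cf:name   MAPPED BY (cust_name),
  cf:amount MAPPED BY (amount)
);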



Using Big SQL



Starting and Stopping
 The Big SQL control script is located at
$BIGSQL_HOME/bin/bigsql

 The Big SQL server process can be started and stopped with
$BIGSQL_HOME/bin/bigsql start
$BIGSQL_HOME/bin/bigsql stop

$ $BIGSQL_HOME/bin/bigsql start
BigSQL running, pid 8479.

$ $BIGSQL_HOME/bin/bigsql stop
BigSQL pid 8479 stopped.



BigInsights client tools
 Client tools
– JSqsh
• Command-line client installed with the Big SQL server
– BigInsights console access
• Execute queries via the console web UI
– Big SQL Eclipse plugin
• Graphical query builder with syntax highlighting

 Most JDBC/ODBC-capable tools work



Creating and Loading a Table
 Step 1: Create the table – a sub-directory is created in the schema directory
 Step 2: Load some data – files are created in the table directory
 Step 3: Start querying your data! (See the sketch below.)

Using Existing Data
Let's repeat the previous example using "existing" data:

1. Create the table directory

$ hadoop fs -mkdir /tables/employee

2. Copy data into that directory

$ hadoop fs -copyFromLocal employee.data /tables/employee

3. Create an EXTERNAL table

USE biadmin;
CREATE EXTERNAL TABLE employee2 (
  empno INT,
  fname VARCHAR(100),
  lname VARCHAR(100),
  dept  VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/tables/employee';
Basic table definition
 All forms of CREATE TABLE begin like you would expect:

create table users
(
  id        int not null,
  office_id int not null,
  name      varchar(30) not null,
  phone     varchar(10) null
)
ROW FORMAT TEXT DELIMITED BY '\t'

– Big SQL supports nullability indicators, which Hive does not
– In Hive/Big SQL you always need to specify how the data is stored (the default is CSV)

 This defines the structure of the table
– The remainder of the definition describes the storage

 Names are stored lower case, regardless of the case provided



Data types
 Big SQL supports the following data types

smallint int[eger] bigint boolean decimal


float double real timestamp string
varchar(len) char(len) binary binary(len) varbinary(len)
array* struct* jsonTuple

 string is treated like varchar(32768) and binary like binary(32768)

 jsonTuple allows you to store complex objects as JSON and parse them on demand

 Big SQL does not directly dictate the storage format of a given data type!
– The SerDe that is used determines the storage representation
– Big SQL uses the Hive SerDes by default (LazySimpleSerDe and LazyBinarySerDe)
• Thus, Big SQL shares the Hive data representations
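
As a hedged sketch of these types in use (the table and columns are illustrative), note that varchar(n) and decimal(p,s) below are exactly the types the deck lists as missing from early Hive:

-- Illustrative definition exercising several Big SQL types.
CREATE TABLE readings (
  sensor_id INT,
  label     VARCHAR(64),
  value     DECIMAL(10,2),  -- fixed precision and scale
  taken_at  TIMESTAMP,
  payload   BINARY(256)     -- bounded binary
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';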
Loading Data
 Two possibilities: LOAD and LOAD USING
– LOAD simply moves files into the Hive warehouse folder
– LOAD USING runs a MapReduce job

 LOAD is very fast, but the files need to fit the table storage definition
– LOAD doesn't check the files; the first queries on bad data will fail

 Files can be in HDFS or in the local file system of the Big SQL server; LOCAL means the file is in the local file system
 LOAD works for Hive or HBase data

LOAD HIVE DATA LOCAL INPATH '/tmp/product_dim.del'
OVERWRITE INTO TABLE PRODUCT_DIM;

 At the moment only OVERWRITE is supported
 The file needs to fit the table definition: if the table is tab delimited, the file needs to be tab delimited
Loading from Databases
 Data sources supported
– Data can be loaded from any database that supports JDBC

 Sqoop and JDBC
– LOAD USING uses Sqoop internally
– Sqoop uses JDBC to connect to the source database
– Data nodes must be able to connect to the database server

 Input to the load can be a SQL query or a table (see the sketch below)

LOAD USING JDBC CONNECTION URL 'jdbc:db2://bigdata:50000/SALES'
WITH PARAMETERS (user='db2inst1', password='passw0rd')
FROM TABLE SALES_FACT SPLIT COLUMN PRODUCT_KEY
INTO HIVE TABLE SALES_FACT OVERWRITE;

 If the table does not have a primary key, a split column can be defined; Sqoop uses it for parallel loads
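
Since the input can also be a SQL query, here is a hedged sketch of that variant. The FROM SQL QUERY clause and the $CONDITIONS placeholder (which Sqoop substitutes to split the work across parallel tasks) are assumptions based on Sqoop conventions, not shown elsewhere in this deck:

-- Illustrative only: load the result of a query instead of a whole table.
LOAD USING JDBC CONNECTION URL 'jdbc:db2://bigdata:50000/SALES'
WITH PARAMETERS (user='db2inst1', password='passw0rd')
FROM SQL QUERY
  'SELECT PRODUCT_KEY, SALE_TOTAL FROM SALES_FACT
   WHERE SALE_TOTAL > 1000 AND $CONDITIONS'
SPLIT COLUMN PRODUCT_KEY
INTO HIVE TABLE BIG_SALES OVERWRITE;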
CTAS (Create Table As)
 CTAS statements allow you to create tables from SELECT statements
– Transform data
– Change encoding
– …
 No INSERT…SELECT yet
 Can be used to change the encoding, but not the partitioning
 Column names of the new table need to be specified; the data types are derived from the statement

CREATE TABLE PRODUCT ( KEY, NAME, DATE )
ROW FORMAT DELIMITED STORED AS RCFILE
AS
SELECT PRODUCT_KEY, PRODUCT_NAME,
       CAST( INTRODUCTION_DATE AS TIMESTAMP )
FROM PRODUCT_DIM_TMP;

 Any expression, joins, etc. can be used in the SELECT



Catalog Tables
 Big SQL provides a set of catalog tables
– These are views on the Hive metastore:
[localhost][foo] 1> select * from syscat.tables where tablename='users';
+------------+-----------+
| schemaname | tablename |
+------------+-----------+
| default    | users     |
+------------+-----------+
1 row in results(first row: 0.14s; total: 0.15s)

[localhost][foo] 1> select * from syscat.columns where tablename='users';


+------------+-----------+-----------+--------+-----------+-------+
| schemaname | tablename | name      | type   | precision | scale |
+------------+-----------+-----------+--------+-----------+-------+
| default    | users     | id        | INT    |        10 |     0 |
| default    | users     | office_id | INT    |        10 |     0 |
| default    | users     | name      | STRING |         0 |     0 |
| default    | users     | children  | ARRAY  |         0 |     0 |
+------------+-----------+-----------+--------+-----------+-------+
4 rows in results(first row: 0.19s; total: 0.21s)



SQL Support – Example Query

This query demonstrates sub-query support over Hive data together with OLAP (window) function support:

select *
from (select row_number() over (order by age asc) as rn,
             empno,
             name,
             age
      from employee2) as t
where rn <= 4;



"Point Queries"
 MapReduce incurs measurable overhead for the sake of resiliency
– Each mapper/reducer may involve JVM startup/shutdown
– Intermediate data is written to disk so partial failures can restart just the failed
portion of the query
– Overhead can be as high as 20-30 seconds per job
 For small data sets or certain data sources (e.g. HBase), MapReduce may be unnecessary
 Big SQL can run the query in the server for sub-second response times
– Automatically chosen for very simple queries:
SELECT C1, C2 FROM T1

– Can be provided as a query hint:

SELECT c1 FROM t1 /*+ accessmode='local' +*/ WHERE c2 > 10

– Or session setting:

set force local on;


SELECT c1 FROM t1 WHERE c2 > 10;



BigInsights Big SQL: Summary
 Big SQL provides robust, standards-based SQL support for data
stored in BigInsights
– ANSI SQL-92+
– ODBC/JDBC drivers

 Big SQL fully integrates with SQL applications and BI tooling
– Existing queries run with no or few modifications*
– Existing JDBC- and ODBC-compliant tools can be leveraged

 Big SQL provides faster and more reliable performance
– Big SQL uses more efficient access paths to the data (i.e., queries are decomposed into fewer statements)
– Inexpensive queries can be executed outside of MapReduce for much faster execution
– Big SQL is optimized to move less data over the network to the application in most cases

*on data with the same schema


Questions?

