Big Data: Sqoop
Sqoop Overview
Open source Apache project originally developed by Cloudera
The name is a contraction of ‘SQL-to-Hadoop’
Sqoop exchanges data between a database and HDFS
Can import all tables, a single table, or partial table into HDFS
Data can be imported in a variety of formats
Sqoop can also export data from HDFS to a database
Sqoop Formats
Sqoop typically generates a Java class for your import
Sqoop cannot load Avro files directly into Hive
Sqoop can load Parquet files directly into Hive
It is possible to do just the code generation without an actual import (use the codegen tool instead of import), as shown below
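A minimal sketch of a standalone code-generation run, assuming the hadoopguide database and widgets table used later in these slides; the --class-name value is an illustrative choice:
# generate and compile the Java class for the widgets table without importing any data
sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets \
  --class-name Widget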
Creating Databases in MySQL
% mysql -u root -p
Enter password:
mysql> CREATE DATABASE hadoopguide;
mysql> quit;
Populating a Database
% mysql hadoopguide
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -> widget_name VARCHAR(64) NOT NULL,
    -> price DECIMAL(10,2),
    -> design_date DATE,
    -> version INT,
    -> design_comment VARCHAR(100));
mysql> quit;
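To actually populate the table, a row can be inserted before quitting; the values below are purely illustrative and not taken from the slides:
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2020-02-10', 1, 'Connects two gizmos');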
Import Table into HDFS
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1
Sqoop will use its import tool to run a MapReduce job that connects to the database and reads the specified table
The -m 1 option defines one map task for the job
Review the Table Contents
hadoop fs -cat prints the imported file contents to the screen (example below)
widgets is the HDFS directory named after the imported table
part-m-00000 is the output file written by the single map task
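A minimal example, assuming the one-mapper import of the widgets table shown earlier:
# print the records the map task wrote to HDFS
hadoop fs -cat widgets/part-m-00000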
Sqoop Import Connectors
Import and export functionality is enabled by connectors
Common Connectors include …
MySQL
Oracle
SQL Server
DB2
Netezza
PostgreSQL
Generic JDBC
Third-party connectors are also available for Teradata and NoSQL stores
Import Process
The import process involves three steps
1. Examine the table details
2. Create and submit a job to the cluster
3. Fetch records from the table and write this data to HDFS
Generating a Java Source File
The JDBC API retrieves the metadata for the columns in the defined table
The RDBMS data types are mapped to data types in Java
VARCHAR = String
INTEGER = Integer
etc.
Creating a Java Class
The code generator uses this metadata to create a table-specific class whose objects hold a row of data
Filling the Fields
A cursor retrieves records with a query and uses them to populate the fields of the generated Java class for the defined table
Filling in Parallel
A splitting column is identified by Sqoop
Primary keys like id are good candidates for splitting columns
Use the --split-by argument to choose the column explicitly
SELECT MIN(id), MAX(id) FROM widgets determines the range, for example 0 to 100,000
Specify the number of mappers with -m 5
Each mapper's query uses a WHERE clause to cover its slice of the range
SELECT id, field2, ... WHERE id >= 0 AND id < 20000
SELECT id, field2, ... WHERE id >= 20000 AND id < 40000
... and so on, across 5 buckets (see the full command below)
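A sketch of the parallel import described above, using the split column and mapper count from this slide:
# Sqoop runs SELECT MIN(id), MAX(id) and divides the range among 5 map tasks
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets \
  --split-by id \
  -m 5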
Incremental Imports
What if we only want rows added since we last did our import?
--incremental append with --check-column (for example id) imports only rows whose check-column value is greater than --last-value (example below)
--incremental lastmodified uses a timestamp column to also pick up rows updated since the last import
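A minimal sketch of an append-mode incremental import; the --last-value of 100 is just an illustrative id from a previous run:
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets \
  --incremental append --check-column id --last-value 100 \
  -m 1
At the end of the job Sqoop reports the new value to pass as --last-value on the next run.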
Importing Large Files
Sqoop stores large imported CLOB and BLOB values in separate files in LobFile format
LobFiles use a 64-bit address space, so individual records can be very large
The LobFile format allows clients to hold a reference to a record without accessing its contents
Import by Column/Row
Import a subset of columns with --columns and a subset of rows with --where, as shown below
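A sketch using the widgets table from earlier; the column list and row filter are illustrative:
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets \
  --columns "id,widget_name,price" \
  --where "price > 1.00" \
  -m 1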
Query Based Importing
--query imports the result of an arbitrary SQL query instead of a whole table
The query must include the literal token $CONDITIONS in its WHERE clause; Sqoop replaces it with each mapper's range condition
A target directory (--target-dir) must be specified
--split-by provides the mechanism to organize the mapper tasks (for example, 3 ranges of account ids), as in the sketch below
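A sketch of a free-form query import; the query, split column, and target directory are illustrative (single quotes keep the shell from expanding $CONDITIONS):
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --query 'SELECT id, widget_name, price FROM widgets WHERE $CONDITIONS' \
  --split-by id \
  --target-dir /user/hadoop/widget_query \
  -m 3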
Sqoop with Hive
Sqoop brings RDBMS data into HDFS, where it can be combined with data already stored there
Sqoop is complementary to performing analysis in Hive
Example: using Hive to analyze a sales log file combined with the widgets product table we imported using Sqoop
[Diagram: a sales log file in HDFS joined with the widgets product table imported from the RDBMS]
Loading Log File in Hive
Review the contents of the sales.log file, then define a Hive table and load the file into it (sketch below)
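A minimal sketch of loading the log into Hive; the column names, types, and tab delimiter are assumptions about the sales.log layout, not taken from the slides:
hive> CREATE TABLE sales(widget_id INT, qty INT, street STRING, city STRING, state STRING, zip INT, sale_date STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH 'sales.log' INTO TABLE sales;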
Direct Mode Exports
Some databases offer tools to move data directly (enabled in Sqoop with --direct); for MySQL, direct-mode exports use the mysqlimport utility
CombineFileInputFormat is used to group the input files into a smaller number of map tasks
This can be much faster than going through JDBC
Unfortunately, direct mode cannot handle large objects (BLOB and CLOB columns)
Even with direct mode, JDBC is still used for the metadata
JDBC Exports
Sqoop uses JDBC to read the target table's metadata and generates a Java class that can parse the HDFS records
A MapReduce job reads the HDFS data files and parses the data based on the chosen strategy
The JDBC strategy creates batch INSERT statements, inserting many records per statement
Separate threads are used to read from HDFS and to communicate with the database (example below)
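A sketch of a basic JDBC export; the target table, export directory, and delimiter are illustrative (the delimiter shown is Hive's default Ctrl-A):
sqoop export --connect jdbc:mysql://localhost/hadoopguide \
  --table sales_summary \
  --export-dir /user/hive/warehouse/sales_summary \
  --input-fields-terminated-by '\0001' \
  -m 1
The target table must already exist in the database before the export runs.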
JDBC Parallel Threads
Export Transactions
Sqoop will spawn multiple tasks that export slices of the data in parallel
Results from one task may be visible before those of another
Sqoop commits results every few thousand rows
Follow-on applications should not read the target table until all results are available
A staging table can be specified with --staging-table; data is moved to the target table only after the export completes (see the sketch below)
The staging table must be empty, or cleared first with --clear-staging-table
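A sketch of an export through a staging table; the export directory and staging-table name are illustrative, and the staging table is assumed to have the same schema as widgets:
sqoop export --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets \
  --export-dir /user/hadoop/widgets_export \
  --staging-table widgets_stage --clear-staging-table \
  -m 4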
RDBMS Update Modes
Exports that update rows use --update-key to name the column(s) that identify an existing row
Two options exist for --update-mode
updateonly (the default) will only update records if they exist, no inserts
allowinsert updates records that exist and inserts those that do not (an "upsert")
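A sketch of an upsert-style export keyed on id; the export directory is illustrative:
sqoop export --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets \
  --export-dir /user/hadoop/widgets_export \
  --update-key id --update-mode allowinsert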
Summary
Sqoop exchanges data between a database and the Hadoop cluster
Tables are imported and exported using MapReduce jobs
Sqoop provides many options to control imports
Hive is often the destination for Sqoop imports
Sqoop can also export data from HDFS back to an RDBMS
Sqoop Documentation
http://sqoop.apache.org/
http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
http://sqoop.apache.org/docs/1.4.6/SqoopDevGuide.html
White, T. (2015). Hadoop: The Definitive Guide (4th ed.). Sebastopol, CA: O'Reilly Media.