Big Data: Sqoop

Sqoop is an open source tool used to transfer data between Hadoop and relational databases. It can import data from a database into HDFS or export data from HDFS to a database. When importing data, Sqoop examines the database table, generates Java code to read the table, then uses MapReduce to extract the data in parallel and write it to HDFS files in a variety of formats such as text, SequenceFiles, or Parquet. Sqoop supports incremental imports of new or updated data and can import a full database or selected tables. The imported data can then be analyzed using tools such as Hive.


Big Data

Sqoop
Sqoop Overview
— Open source Apache project originally developed by Cloudera
— The name is a contraction of ‘SQL-to-Hadoop’
— Sqoop exchanges data between a database and HDFS
— Can import all tables, a single table, or part of a table into HDFS
— Data can be imported in a variety of formats
— Sqoop can also export data from HDFS to a database

Image Source: Cloudera


2
Sqoop Tools
— sqoop COMMAND [ARGUMENTS]
— sqoop help
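— For example (a minimal sketch; the exact tool list printed varies by
Sqoop version):

sqoop help            # lists the available tools (import, export, codegen, ...)
sqoop help import     # shows the arguments accepted by the import tool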

Image Source: Hadoop The Definitive Guide


3
Sqoop File Formats
— Can import text or binary
— Difference
— Human readability - Text
— Compactness - Binary
— Best data storage (precise, complete) - Binary
— We’ll talk a lot more about binary formats in our Hive section
— SequenceFiles, Avro, and Parquet are all binary formats
— Avro and Parquet are flexible and widely supported
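— As a sketch, the output format is selected with a flag on the import
command (--as-sequencefile and --as-avrodatafile work the same way;
--as-parquetfile requires Sqoop 1.4.6 or later):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --as-textfile          # default: delimited text
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --as-parquetfile       # binary Parquet format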

4
Sqoop- Formats
— Sqoop typically generates a Java class for your import
— Sqoop cannot load Avro files directly into Hive
— Sqoop can load Parquet files directly into Hive
— It is possible to do only the code generation, without an actual
import (use the codegen tool instead of import)
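— A minimal sketch of generating the class without importing any data
(the --class-name value is just an example):

sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --class-name Widget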

5
Creating a Database in MySQL
% mysql -u root -p
Enter password:

mysql> CREATE DATABASE hadoopguide;

mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';

mysql> quit;

6
Populating a Database
% mysql hadoopguide

mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -> widget_name VARCHAR(64) NOT NULL,
    -> price DECIMAL(10,2),
    -> design_date DATE,
    -> version INT,
    -> design_comment VARCHAR(100));

mysql> INSERT INTO widgets VALUES (NULL, 'gear', 0.25, '2050-02-10', 1,
    -> 'pulls chain');

mysql> quit;

7
Import Table into HDFS
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1

— Sqoop will use its import tool to run a MapReduce job that connects to the
database and reads the specified table
— The -m 1 defines one map task for the job

8
Review the Table Contents
— hdfs dfs -cat prints the contents of a file to the screen
— widgets is the directory named after the imported table
— part-m-00000 is the output file written by the first (and only) map task

Image Source: Hadoop The Definitive Guide
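— A minimal sketch of inspecting the imported file (the values follow
the widgets row inserted earlier):

% hdfs dfs -cat widgets/part-m-00000
1,gear,0.25,2050-02-10,1,pulls chain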


9
Importing Data
— Client-side application that imports data from a database and
writes that data into HDFS
— Uses a MapReduce job that extracts rows from a table
— Uses the Java JDBC API to access data in the RDBMS

10
Sqoop Import Connectors
— Import and export functionality is enabled by connectors
— Common Connectors include …
— MySQL
— Oracle
— SQL Server
— DB2
— Netezza
— PostgreSQL
— Generic JDBC
— Third party connectors are also available to handle
Teradata and NoSQL stores

11
Import Process
Import process involves three steps
1. Examine table details
2. Create and submit a job to the
cluster
3. Fetch records from table and
write this data to HDFS

Image Source: Cloudera


12
JDBC Import Process

Image Source: Hadoop The Definitive Guide


13
Examine Table for Import
— Determine a primary key
— Runs a boundary query to determine the range of records to import
— Divides the range returned by the boundary query among the mappers
— Equalizes the load across the mapper tasks

14
Generating a Java Source File
— The JDBC API retrieves the metadata for the columns in the
defined table
— The RDBMS data types are mapped to data types in Java
VARCHAR = String
INTEGER = Integer
Etc …

15
Creating a Java Class
— The code generator uses this metadata to create a table-specific
class whose objects hold a row of data

Image Source: Hadoop The Definitive Guide


16
Java Class Functions
— Serialization methods
— The DBWritable interface is how the class interacts with JDBC
— A ResultSet provides a cursor to retrieve records
— readFields() populates the fields from a ResultSet (importing)
— write() inserts new rows into the table (exporting)
— The data read from the ResultSet is deserialized into the class fields

17
Filling the Fields
— Cursor retrieves records using a query to populate the fields in
the Java class for the defined table

— Using a simple query

Image Source: Hadoop The Definitive Guide
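— The generated query is essentially a full-table select; a sketch using
the widgets columns defined earlier:

SELECT id, widget_name, price, design_date, version, design_comment
FROM widgets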


18
Mapping in Sqoop
— The default number of mappers for a Sqoop job is 4
— This can be changed with the -m argument
— When you view the results in HDFS, there will be one part file per
mapper (so 4, by default)
— Use hdfs dfs -cat to view the contents of a file
— The files will be CSV (comma-separated values) by default
— Make sure the table is not being updated while importing

19
Filling in Parallel
— A splitting column will be identified by Sqoop
— Primary keys like id are good candidates for splitting columns
— Use the --split-by argument to choose the column explicitly
— Suppose SELECT MIN(id), MAX(id) FROM widgets yields a range of 0 to 100,000
— Specify the number of mappers with -m 5
— WHERE clauses then define the splits:
— SELECT id, field2, ... WHERE id >= 0 AND id < 20000
— SELECT id, field2, ... WHERE id >= 20000 AND id < 40000
... and so on, in 5 buckets
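— A sketch of the corresponding command (the flags are standard Sqoop
options; table and column follow the running example):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --split-by id -m 5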

20
Incremental Imports
— What if we only want rows added or changed since the last import?
— --check-column names the column to examine
— --last-value gives the largest value already imported; only rows with
a greater value are fetched
— --incremental append is an append mode for tables that only grow,
keyed by an increasing column such as id
— --incremental lastmodified is for tables that are updated in place and
needs a column holding a last-modified timestamp; Sqoop records the
time of the import for use as the next --last-value
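— A minimal sketch of an append-mode incremental import (the last-value
of 1 assumes only the single widget row inserted earlier has already
been imported):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --incremental append \
  --check-column id --last-value 1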

21
Importing Large Files
— Sqoop stores imported CLOB and BLOB columns in a LobFile
— The LobFile format uses a 64-bit address space, so a single record
can be very large
— The LobFile format allows clients to hold a reference to a record
without accessing its contents

Image Source: Hadoop The Definitive Guide


22
Imported Record with LobFile
— Primary record contains a reference to the LobFile

— The reference identifies the externally stored large object by file
format, filename, byte offset, and length

Image Source: Hadoop The Definitive Guide


23
Importing a Database
— import-all-tables imports an entire database
— Tables are stored as comma delimited files
— Location is your home HDFS directory
— Each table will be found in a subdirectory of the table name
— Adding --warehouse-dir will redefine the parent directory
of the import
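— A sketch of importing every table under a chosen parent directory
(the /mydata path is just an example):

sqoop import-all-tables --connect jdbc:mysql://localhost/hadoopguide \
  --warehouse-dir /mydata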

Image Source: Cloudera


24
Import Table Alternative
— The --table argument can also be given before --connect; the order of
the arguments does not matter
— The output can be changed from comma- to tab-delimited using
--fields-terminated-by "\t", as sketched below
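— A sketch combining both points:

sqoop import --table widgets \
  --connect jdbc:mysql://localhost/hadoopguide \
  --fields-terminated-by "\t"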

25
Import by Column/Row
— Import a subset of columns (--columns)

— Import only matching rows (--where)
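— A sketch selecting a subset of columns and rows (column names follow
the widgets table; --columns and --where are standard Sqoop options):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --columns "widget_name,price" \
  --where "price > 1.00"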

26
Query Based Importing
— --query replaces --table with a free-form SQL query
— The literal token $CONDITIONS must be included in the WHERE clause;
Sqoop substitutes the range predicate for each mapper
— A target directory must be specified with --target-dir
— --split-by provides the mechanism to organize the mapper tasks
(for example, 3 ranges of account ids)
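— A minimal sketch (the query and target directory are illustrative):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --query 'SELECT * FROM widgets WHERE $CONDITIONS' \
  --split-by id --target-dir /mydata/widgets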

27
Sqoop with Hive
— Sqoop brings RDBMS data into HDFS, where it can be combined with other data
— Sqoop is complementary to performing analysis in Hive
— Using Hive to analyze a sales data file combined with the
widget product file we imported using Sqoop

[Diagram: a sales log file in HDFS is joined with the widgets product
table imported from the RDBMS]

28
Loading Log File in Hive
— Review contents of the sales.log file

Image Source: Hadoop The Definitive Guide


29
Load Sales File into Hive
— Create a table in Hive including each of the fields in the sales
log file
— Select the sales.log file from the local directory
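— A sketch of what this looks like in the Hive shell (the column names
and the tab delimiter are assumptions about the sales.log layout):

hive> CREATE TABLE sales(widget_id INT, qty INT, street STRING,
    city STRING, state STRING, zip INT, sale_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> LOAD DATA LOCAL INPATH "sales.log" INTO TABLE sales;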

Image Source: Hadoop The Definitive Guide


30
Import directly to Hive
— Sqoop can import data from a RDBMS
— Table name is widgets (product table)
— The --hive-import option loads the widgets data directly into Hive
— A schema is inferred from the source table in the RDBMS
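— A sketch of the command (mirroring the earlier import, with
--hive-import added):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets -m 1 --hive-import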

Image Source: Hadoop The Definitive Guide


31
Calculating Integrated Data
— Using data from both tables, we can calculate which zip code
generates the most sales revenue
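— A sketch of the kind of Hive query used to build a zip_profits table
(column names are assumptions based on the sales and widgets tables
above):

hive> CREATE TABLE zip_profits
    AS SELECT SUM(w.price * s.qty) AS sales_vol, s.zip
    FROM sales s JOIN widgets w ON (s.widget_id = w.id)
    GROUP BY s.zip;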

Image Source: Hadoop The Definitive Guide


32
Exporting Data to a Database
— Sometimes it is useful to push data from HDFS to an RDBMS
— A common pattern: do batch processing on large data sets in Hadoop,
then export the results to a relational database for access by other
systems
— The target table must already exist in the database

33
Direct Mode Exports
— Some databases offer utilities for fast bulk loading, such as MySQL's
mysqlimport; Sqoop can use these in direct mode
— CombineFileInputFormat is used to group the input files into a
smaller number of map tasks
— This can be much faster than JDBC
— Unfortunately, direct mode cannot handle BLOBs
— Even with direct mode, JDBC is still used for the metadata

34
JDBC Exports
— Sqoop generates a Java class based on the target table
— A MapReduce job reads the HDFS data files and parses the data based
on the chosen strategy
— The JDBC strategy builds batch INSERT statements, inserting many
records per statement
— Separate threads are used to read from HDFS and to communicate with
the database

35
JDBC Parallel Threads

Image Source: Hadoop The Definitive Guide


36
Defining an Export Table
— We need a target table for our loading process
— The table must define its columns in the same order as the fields in
the HDFS files
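— A sketch of a suitable target table (the column names and types are
assumptions matching the zip_profits output):

mysql> CREATE TABLE sales_by_zip (volume DECIMAL(8,2), zip INTEGER);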

Image Source: Hadoop The Definitive Guide


37
Exporting data to the Table
— Connect to the RDBMS
— Identify the sales_by_zip table for loading
— Export from the HDFS directory containing the zip_profits data
— --input-fields-terminated-by identifies the field delimiter of the
source files (Hive's default is Ctrl-A, written '\0001')
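— A sketch of the export command (the warehouse path assumes Hive's
default location for the zip_profits table):

sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
  --table sales_by_zip \
  --export-dir /user/hive/warehouse/zip_profits \
  --input-fields-terminated-by '\0001'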

Image Source: Hadoop The Definitive Guide


38
Verify Export Results
— A simple SELECT statement validates the export, for example:
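mysql> SELECT * FROM sales_by_zip;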

39
Export Transactions
— Sqoop will spawn multiple tasks that export slices of the data in
parallel
— Results from one task may be visible in the database before another
task finishes
— Sqoop commits results every few thousand rows
— Applications that depend on the exported data should not read it
until the entire export completes
— A staging table can be defined with --staging-table and should be
emptied first with --clear-staging-table
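— A sketch using a staging table (the staging table name is illustrative;
it must already exist with the same schema as the target):

sqoop export --connect jdbc:mysql://localhost/hadoopguide \
  --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
  --input-fields-terminated-by '\0001' \
  --staging-table sales_by_zip_stage --clear-staging-table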

40
RDBMS Update Modes
— The --update-mode argument controls how exported rows are applied to
an existing table (rows are matched on the --update-key column)
— updateonly (the default) only updates records if they exist, no inserts
— allowinsert updates records that exist and inserts those that do not
(an upsert), where the database supports it
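— A sketch of an upsert-style export (the key column follows the
sales_by_zip example):

sqoop export --connect jdbc:mysql://localhost/hadoopguide \
  --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
  --input-fields-terminated-by '\0001' \
  --update-key zip --update-mode allowinsert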

41
Summary
— Sqoop exchanges data between a database and the Hadoop
cluster
— Tables are imported and exported using MapReduce jobs
— Sqoop provides many options to control imports
— Hive is often a recipient of Sqoop-imported data
— Sqoop can also export data from HDFS to an RDBMS

42
Sqoop Documentation
— http://sqoop.apache.org/
— http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
— http://sqoop.apache.org/docs/1.4.6/SqoopDevGuide.html
— White, T. (2015). Hadoop: The Definitive Guide (4th ed.). Sebastopol,
CA: O'Reilly Media.

43
