Big Data: Sqoop

Sqoop is an open source tool used to transfer data between Hadoop and relational databases. It can import data from a database into HDFS or export data from HDFS to a database. When importing data, Sqoop examines the database table, generates Java code to read the table, then uses MapReduce to extract the data in parallel and write it to HDFS files in a variety of formats such as text, SequenceFiles, or Parquet. Sqoop supports incremental imports of new or updated data and can import a full database or selected tables. The imported data can then be analyzed using tools such as Hive.


Big Data

Sqoop
Sqoop Overview
— Open source Apache project originally developed by Cloudera
— The name is a contraction of ‘SQL-to-Hadoop’
— Sqoop exchanges data between a database and HDFS
— Can import all tables, a single table, or part of a table into HDFS
— Data can be imported in a variety of formats
— Sqoop can also export data from HDFS to a database

Image Source: Cloudera


2
Sqoop Tools
— sqoop COMMAND [ARGUMENTS]
— sqoop help
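— For example (a minimal sketch; the exact tool list printed varies by
Sqoop version):

sqoop help            # lists the available tools (import, export, codegen, ...)
sqoop help import     # shows the arguments accepted by the import tool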

Image Source: Hadoop The Definitive Guide


3
Sqoop File Formats
— Can import text or binary
— Difference
— Human readability - Text
— Compactness - Binary
— Best data storage (precise, complete) - Binary
— We’ll talk a lot more about binary formats in our Hive section
— SequenceFiles, Avro, and Parquet are all binary formats
— Avro and Parquet are flexible and widely supported
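— As a sketch, the output format is selected with a flag on the import
command (--as-sequencefile and --as-avrodatafile work the same way;
--as-parquetfile requires Sqoop 1.4.6 or later):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --as-textfile          # default: delimited text
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --as-parquetfile       # binary Parquet format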

4
Sqoop- Formats
— Sqoop typically generates a Java class for your import
— Sqoop cannot load Avro files directly into Hive
— Sqoop can load Parquet files directly into Hive
— It is possible to do only the code generation, without an actual
import (use the codegen tool instead of import)
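— A minimal sketch of generating the class without importing any data
(the --class-name value is just an example):

sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --class-name Widget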

5
Creating a Database in MySQL
% mysql -u root -p
Enter password:

mysql> CREATE DATABASE hadoopguide;

mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';

mysql> quit;

6
Populating a Database
% mysql hadoopguide

mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    -> widget_name VARCHAR(64) NOT NULL,
    -> price DECIMAL(10,2),
    -> design_date DATE,
    -> version INT,
    -> design_comment VARCHAR(100));

mysql> INSERT INTO widgets VALUES (NULL, 'gear', 0.25, '2050-02-10', 1,
    -> 'pulls chain');

mysql> quit;

7
Import Table into HDFS
sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1

— Sqoop will use its import tool to run a MapReduce job that connects to the
database and reads the specified table
— The -m 1 defines one map task for the job

8
Review the Table Contents
— hdfs dfs -cat prints the contents of a file to the screen
— widgets is the directory named after the imported table
— part-m-00000 is the output file written by the first (and only) map task

Image Source: Hadoop The Definitive Guide
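— A minimal sketch of inspecting the imported file (the values follow
the widgets row inserted earlier):

% hdfs dfs -cat widgets/part-m-00000
1,gear,0.25,2050-02-10,1,pulls chain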


9
Importing Data
— Client-side application that imports data from a database and
writes that data into HDFS
— Uses a MapReduce job that extracts rows from a table
— Uses the Java JDBC API to access data in the RDBMS

10
Sqoop Import Connectors
— Import and export functionality is enabled by connectors
— Common Connectors include …
— MySQL
— Oracle
— SQL Server
— DB2
— Netezza
— PostgreSQL
— Generic JDBC
— Third party connectors are also available to handle
Teradata and NoSQL stores

11
Import Process
Import process involves three steps
1. Examine table details
2. Create and submit a job to the
cluster
3. Fetch records from table and
write this data to HDFS

Image Source: Cloudera


12
JDBC Import Process

Image Source: Hadoop The Definitive Guide


13
Examine Table for Import
— Determine a primary key
— Runs a boundary query to determine the range of records to import
— Divides the range returned by the boundary query among the mappers
— Equalizes the load across the mapper tasks

14
Generating a Java Source File
— The JDBC API retrieves the metadata for the columns in the
defined table
— The RDBMS data types are mapped to data types in Java
VARCHAR = String
INTEGER = Integer
Etc …

15
Creating a Java Class
— The code generator uses this metadata to create a table-specific
class whose objects hold a row of data

Image Source: Hadoop The Definitive Guide


16
Java Class Functions
— Serialization methods
— The DBWritable interface is how the class interacts with JDBC
— A ResultSet provides a cursor to retrieve records
— readFields() populates the fields from a ResultSet (importing)
— write() inserts new rows into the table (exporting)
— The data read from the ResultSet is deserialized into the class fields

17
Filling the Fields
— Cursor retrieves records using a query to populate the fields in
the Java class for the defined table

— Using a simple query

Image Source: Hadoop The Definitive Guide
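— The generated query is essentially a full-table select; a sketch using
the widgets columns defined earlier:

SELECT id, widget_name, price, design_date, version, design_comment
FROM widgets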


18
Mapping in Sqoop
— The default number of mappers for a Sqoop job is 4
— This can be changed with the -m argument
— When you view the results in HDFS, there will be one part file per
mapper (so 4, by default)
— Use hdfs dfs -cat to view the contents of a file
— The files will be CSV (comma-separated values) by default
— Make sure the table is not being updated while importing

19
Filling in Parallel
— A splitting column will be identified by Sqoop
— Primary keys like id are good candidates for splitting columns
— Use the --split-by argument to choose the column explicitly
— Suppose SELECT MIN(id), MAX(id) FROM widgets yields a range of 0 to 100,000
— Specify the number of mappers with -m 5
— WHERE clauses then define the splits:
— SELECT id, field2, ... WHERE id >= 0 AND id < 20000
— SELECT id, field2, ... WHERE id >= 20000 AND id < 40000
... and so on, in 5 buckets
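— A sketch of the corresponding command (the flags are standard Sqoop
options; table and column follow the running example):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --split-by id -m 5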

20
Incremental Imports
— What if we only want rows added or changed since the last import?
— --check-column names the column to examine
— --last-value gives the largest value already imported; only rows with
a greater value are fetched
— --incremental append is an append mode for tables that only grow,
keyed by an increasing column such as id
— --incremental lastmodified is for tables that are updated in place and
needs a column holding a last-modified timestamp; Sqoop records the
time of the import for use as the next --last-value
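— A minimal sketch of an append-mode incremental import (the last-value
of 1 assumes only the single widget row inserted earlier has already
been imported):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --incremental append \
  --check-column id --last-value 1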

21
Importing Large Files
— Sqoop stores imported CLOB and BLOB columns in a LobFile
— The LobFile format uses a 64-bit address space, so a single record
can be very large
— The LobFile format allows clients to hold a reference to a record
without accessing its contents

Image Source: Hadoop The Definitive Guide


22
Imported Record with LobFile
— Primary record contains a reference to the LobFile

— The reference identifies the externally stored large object by file
format, filename, byte offset, and length

Image Source: Hadoop The Definitive Guide


23
Importing a Database
— import-all-tables imports an entire database
— Tables are stored as comma delimited files
— Location is your home HDFS directory
— Each table will be found in a subdirectory of the table name
— Adding --warehouse-dir will redefine the parent directory
of the import
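— A sketch of importing every table under a chosen parent directory
(the /mydata path is just an example):

sqoop import-all-tables --connect jdbc:mysql://localhost/hadoopguide \
  --warehouse-dir /mydata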

Image Source: Cloudera


24
Import Table Alternative
— The --table argument can also be given before --connect; the order of
the arguments does not matter
— The output can be changed from comma- to tab-delimited using
--fields-terminated-by "\t", as sketched below
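— A sketch combining both points:

sqoop import --table widgets \
  --connect jdbc:mysql://localhost/hadoopguide \
  --fields-terminated-by "\t"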

25
Import by Column/Row
— Import a subset of columns (--columns)

— Import only matching rows (--where)
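— A sketch selecting a subset of columns and rows (column names follow
the widgets table; --columns and --where are standard Sqoop options):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets --columns "widget_name,price" \
  --where "price > 1.00"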

26
Query Based Importing
— --query replaces --table with a free-form SQL query
— The literal token $CONDITIONS must be included in the WHERE clause;
Sqoop substitutes the range predicate for each mapper
— A target directory must be specified with --target-dir
— --split-by provides the mechanism to organize the mapper tasks
(for example, 3 ranges of account ids)
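— A minimal sketch (the query and target directory are illustrative):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --query 'SELECT * FROM widgets WHERE $CONDITIONS' \
  --split-by id --target-dir /mydata/widgets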

27
Sqoop with Hive
— Sqoop brings RDBMS data into HDFS, where it can be combined with other data
— Sqoop is complementary to performing analysis in Hive
— Using Hive to analyze a sales data file combined with the
widget product file we imported using Sqoop

[Diagram: a sales log file in HDFS is joined with the widgets product
table imported from the RDBMS]

28
Loading Log File in Hive
— Review contents of the sales.log file

Image Source: Hadoop The Definitive Guide


29
Load Sales File into Hive
— Create a table in Hive including each of the fields in the sales
log file
— Select the sales.log file from the local directory
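— A sketch of what this looks like in the Hive shell (the column names
and the tab delimiter are assumptions about the sales.log layout):

hive> CREATE TABLE sales(widget_id INT, qty INT, street STRING,
    city STRING, state STRING, zip INT, sale_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive> LOAD DATA LOCAL INPATH "sales.log" INTO TABLE sales;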

Image Source: Hadoop The Definitive Guide


30
Import directly to Hive
— Sqoop can import data from a RDBMS
— Table name is widgets (product table)
— The --hive-import option loads the widgets data directly into Hive
— A schema is inferred from the source table in the RDBMS
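— A sketch of the command (mirroring the earlier import, with
--hive-import added):

sqoop import --connect jdbc:mysql://localhost/hadoopguide \
  --table widgets -m 1 --hive-import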

Image Source: Hadoop The Definitive Guide


31
Calculating Integrated Data
— Using data from both tables, we can calculate which zip code
generates the most sales revenue
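— A sketch of the kind of Hive query used to build a zip_profits table
(column names are assumptions based on the sales and widgets tables
above):

hive> CREATE TABLE zip_profits
    AS SELECT SUM(w.price * s.qty) AS sales_vol, s.zip
    FROM sales s JOIN widgets w ON (s.widget_id = w.id)
    GROUP BY s.zip;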

Image Source: Hadoop The Definitive Guide


32
Exporting Data to a Database
— Sometimes it is useful to push data from HDFS to an RDBMS
— A common pattern: do batch processing on large data sets in Hadoop,
then export the results to a relational database for access by other
systems
— The target table must already exist in the database

33
Direct Mode Exports
— Some databases offer utilities for fast bulk loading, such as MySQL's
mysqlimport; Sqoop can use these in direct mode
— CombineFileInputFormat is used to group the input files into a
smaller number of map tasks
— This can be much faster than JDBC
— Unfortunately, direct mode cannot handle BLOBs
— Even with direct mode, JDBC is still used for the metadata

34
JDBC Exports
— Sqoop generates a Java class based on the target table
— A MapReduce job reads the HDFS data files and parses the data based
on the chosen strategy
— The JDBC strategy builds batch INSERT statements, inserting many
records per statement
— Separate threads are used to read from HDFS and to communicate with
the database

35
JDBC Parallel Threads

Image Source: Hadoop The Definitive Guide


36
Defining an Export Table
— We need a target table for our loading process
— The table must define its columns in the same order as the fields in
the HDFS files
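— A sketch of a suitable target table (the column names and types are
assumptions matching the zip_profits output):

mysql> CREATE TABLE sales_by_zip (volume DECIMAL(8,2), zip INTEGER);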

Image Source: Hadoop The Definitive Guide


37
Exporting data to the Table
— Connect to the RDBMS
— Identify the sales_by_zip table for loading
— Export from the HDFS directory containing the zip_profits data
— --input-fields-terminated-by identifies the field delimiter of the
source files (Hive's default is Ctrl-A, written '\0001')
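— A sketch of the export command (the warehouse path assumes Hive's
default location for the zip_profits table):

sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
  --table sales_by_zip \
  --export-dir /user/hive/warehouse/zip_profits \
  --input-fields-terminated-by '\0001'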

Image Source: Hadoop The Definitive Guide


38
Verify Export Results
— A simple SELECT statement validates the export, for example:
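mysql> SELECT * FROM sales_by_zip;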

39
Export Transactions
— Sqoop will spawn multiple tasks that export slices of the data in
parallel
— Results from one task may be visible in the database before another
task finishes
— Sqoop commits results every few thousand rows
— Applications that depend on the exported data should not read it
until the entire export completes
— A staging table can be defined with --staging-table and should be
emptied first with --clear-staging-table
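— A sketch using a staging table (the staging table name is illustrative;
it must already exist with the same schema as the target):

sqoop export --connect jdbc:mysql://localhost/hadoopguide \
  --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
  --input-fields-terminated-by '\0001' \
  --staging-table sales_by_zip_stage --clear-staging-table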

40
RDBMS Update Modes
— The --update-mode argument controls how exported rows are applied to
an existing table (rows are matched on the --update-key column)
— updateonly (the default) only updates records if they exist, no inserts
— allowinsert updates records that exist and inserts those that do not
(an upsert), where the database supports it
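— A sketch of an upsert-style export (the key column follows the
sales_by_zip example):

sqoop export --connect jdbc:mysql://localhost/hadoopguide \
  --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
  --input-fields-terminated-by '\0001' \
  --update-key zip --update-mode allowinsert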

41
Summary
— Sqoop exchanges data between a database and the Hadoop
cluster
— Tables are imported and exported using MapReduce jobs
— Sqoop provides many options to control imports
— Hive is often a recipient of Sqoop-imported data
— Sqoop can also export data from HDFS to an RDBMS

42
Sqoop Documentation
— http://sqoop.apache.org/
— http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
— http://sqoop.apache.org/docs/1.4.6/SqoopDevGuide.html
— White, T. (2015). Hadoop: The Definitive Guide (4th ed.). Sebastopol,
CA: O'Reilly Media.

43
