Sqoop Big Data Tech

This document provides an overview of Apache Sqoop, a tool for transferring bulk data between relational databases and Hadoop. It discusses the key features and challenges of Sqoop 1, and how Sqoop 2 was designed to address these through a more modular architecture, improved security, and uniform functionality across connectors. The current status of Sqoop 2 is that it is the primary focus of the Sqoop community, with an initial release available for testing and feedback.

Uploaded by

linkranjit
Mastering Sqoop for Data Transfer for Big Data

Jarek Jarcec Cecho | Kathleen Ting

Who Are We?
• Jarek Jarcec Cecho
  • Apache Sqoop Committer, PMC Member
  • Software Engineer, Cloudera
  • [email protected]

• Kathleen Ting
  • Apache Sqoop Committer, PMC Member
  • Customer Operations Engineering Manager, Cloudera
  • [email protected], @kate_ting

What is Sqoop?
• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
  • Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To Hadoop ecosystem
  • HDFS (text, sequence file), Hive, HBase, Avro
• And vice versa

Why Sqoop?
• Efficient/Controlled resource utilization
  • Concurrent connections, Time of operation
• Datatype mapping and conversion
  • Automatic, and User override
• Metadata propagation
  • Sqoop Record
  • Hive Metastore
  • Avro
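The "automatic, and user override" behaviour of datatype mapping can be pictured with a small sketch. The defaults table is an illustrative assumption rather than Sqoop's full mapping, and both helper functions are hypothetical names; the `column=Type` rendering follows the format accepted by the Sqoop 1 `--map-column-java` option.

```python
# Illustrative defaults only; Sqoop's real SQL-to-Java mapping covers more types.
DEFAULT_JAVA_TYPES = {
    "INTEGER": "Integer",
    "BIGINT": "Long",
    "VARCHAR": "String",
    "DECIMAL": "java.math.BigDecimal",
    "TIMESTAMP": "java.sql.Timestamp",
}


def resolve_java_types(columns, overrides=None):
    """Map each column to a Java type: a user override wins, else the default.

    columns   -- dict of column name -> SQL type name
    overrides -- dict of column name -> Java type chosen by the user
    """
    overrides = overrides or {}
    resolved = {}
    for name, sql_type in columns.items():
        if name in overrides:
            resolved[name] = overrides[name]
        elif sql_type.upper() in DEFAULT_JAVA_TYPES:
            resolved[name] = DEFAULT_JAVA_TYPES[sql_type.upper()]
        else:
            raise ValueError(f"no mapping for column {name!r} of type {sql_type!r}")
    return resolved


def as_map_column_java(overrides):
    """Render overrides as a comma-separated column=Type list."""
    return ",".join(f"{col}={java_type}" for col, java_type in overrides.items())
```

Calling `resolve_java_types({"id": "BIGINT", "price": "DECIMAL"}, {"price": "String"})` keeps the default for `id` and applies the override for `price`.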

Sqoop 1
• Based on Connectors
  • Responsible for Metadata lookups, and Data Transfer
  • Majority of connectors are JDBC based
  • Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
  • HBase Import, Avro Support, ...

Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration and database
• JDBC model is enforced
Sqoop 1 Challenges
• Non-uniform functionality
  • Different connectors support different capabilities
• Overlapped/Duplicated functionality
  • Different connectors may implement same capabilities differently
• High coupling with Hadoop
  • Database vendors required to understand Hadoop idiosyncrasies in order to build connectors.

Sqoop 2

Sqoop 2 – Design Goals
• Security and Separation of Concerns
  • Role based access and use

• Ease of extension
  • No low-level Hadoop knowledge needed
  • No functional overlap between Connectors

• Ease of Use
  • Uniform functionality
  • Domain specific interactions

Sqoop 2: Connection vs Job metadata

There are two distinct sets of options to pass into Sqoop:
• Connection (distinct per database)
• Job (distinct per table)
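The split between the two option sets can be modelled as two small record types. This is only a sketch: the class and field names are assumptions chosen for illustration, not Sqoop 2's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """Options that are distinct per database (hypothetical fields)."""
    name: str
    jdbc_url: str
    username: str


@dataclass(frozen=True)
class Job:
    """Options that are distinct per table; refers to a Connection by name."""
    name: str
    connection: str
    table: str
    target_dir: str


# One connection, defined once, is reused by a separate job for each table.
sales_db = Connection("sales-db", "jdbc:mysql://db.example.com/sales", "etl")
jobs = [
    Job("import-orders", sales_db.name, "orders", "/data/orders"),
    Job("import-customers", sales_db.name, "customers", "/data/customers"),
]
```

Keeping database details in one place means a changed host or user is edited once, while each job carries only what varies per table.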

Sqoop 2: Workings
• Connectors register metadata
• Metadata enables creation of Connections and Jobs
• Connections and Jobs stored in Metadata Repository
• Operator runs Jobs that use appropriate connections
• Admins set policy for connection use
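The flow in the list above can be sketched with an in-memory stand-in for the metadata repository. Everything here is hypothetical (class name, method names, option names); it only illustrates how registered connector metadata can drive the creation and validation of connections and jobs.

```python
class MetadataRepository:
    """In-memory stand-in for the metadata repository (hypothetical API)."""

    def __init__(self):
        self.connectors = {}   # connector name -> required option names
        self.connections = {}  # connection name -> (connector name, options)
        self.jobs = {}         # job name -> (connection name, options)

    def register_connector(self, name, connection_options, job_options):
        """A connector declares which options connections and jobs must supply."""
        self.connectors[name] = {
            "connection": set(connection_options),
            "job": set(job_options),
        }

    def create_connection(self, name, connector, **options):
        """Validate options against the connector's metadata, then store them."""
        self._check(self.connectors[connector]["connection"], options)
        self.connections[name] = (connector, options)

    def create_job(self, name, connection, **options):
        """Validate job options using the metadata of the connection's connector."""
        connector, _ = self.connections[connection]
        self._check(self.connectors[connector]["job"], options)
        self.jobs[name] = (connection, options)

    def describe_job(self, name):
        """Resolve a job to the combined settings an operator would run with."""
        connection, job_options = self.jobs[name]
        connector, connection_options = self.connections[connection]
        return {"connector": connector, **connection_options, **job_options}

    @staticmethod
    def _check(required, supplied):
        missing = required - set(supplied)
        if missing:
            raise ValueError(f"missing options: {sorted(missing)}")
```

An operator who runs a job only needs its name; the repository resolves the connection and connector behind it.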

Sqoop 2: Security
• Support for secure access to external systems via role-based access to connection objects
  • Administrators create/edit/delete connections
  • Operators use connections
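A minimal sketch of the role split described above, assuming just the two roles and four actions named on the slide. The `Role` enum, the `PERMISSIONS` table and `authorize` are illustrative names, not part of Sqoop 2.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    OPERATOR = "operator"


# Actions each role may perform on connection objects, as listed on the slide.
PERMISSIONS = {
    Role.ADMIN: {"create", "edit", "delete"},
    Role.OPERATOR: {"use"},
}


def authorize(role, action):
    """Raise PermissionError unless the role is allowed to perform the action."""
    if action not in PERMISSIONS[role]:
        raise PermissionError(f"{role.value} may not {action} connections")
    return True
```

With this table, `authorize(Role.OPERATOR, "use")` succeeds while `authorize(Role.OPERATOR, "delete")` raises `PermissionError`.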

Current Status: Sqoop 2
• Primary focus of the Sqoop Community
• First cut: 1.99.1
• Bits and docs: http://sqoop.apache.org/

Demo  
