
Hadoop Ecosystem Tool

(Mod 2)
Prepared By: Dr. Kimmi Kumari
Contents
• Introduction to Sqoop
• Apache Flume
• Hive
• HBase
Overview of Sqoop in Hadoop
• Sqoop is a tool used to perform data transfer operations.
• Data is transferred between relational database management systems (RDBMS) and the Hadoop cluster.
Features of Sqoop
• Parallel Import/Export
• Import Results of an SQL Query
• Connectors For All Major RDBMS Databases
• Kerberos Security Integration
• Provides Full and Incremental Load
Sqoop Architecture
• The user submits an import/export command.
• Sqoop fetches data from different databases.
• Map tasks are launched to load the data onto HDFS.
Sqoop - Import All Tables
• Imports a set of tables from an RDBMS to HDFS
• The following syntax is used to import all tables:
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
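As an illustration (not part of the original slides), importing all tables from a hypothetical MySQL database named userdb might look like this:

$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/userdb \
  --username root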
Sqoop Workflow
Sqoop - Import All Tables
• Every table in that database must have a primary key field.
• To verify that the table data for the userdb database has been imported into HDFS, use the command below:
$HADOOP_HOME/bin/hadoop fs -ls
Sqoop Export
• Exports a set of files from HDFS back to RDBMS
• The target table must already exist in the RDBMS.
Modes of Sqoop Export
• Insert mode
• Update mode
Syntax for Sqoop Export
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Consider the table below.
• The following command exports the table data (stored in the emp_data file on HDFS) to the employee table in the db database on the MySQL server:
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
Importing Data from MySQL to HDFS
Step 1: Log in to MySQL.
Step 2: Create a database and table and insert data.
Step 3: Create a database and table in Hive into which the data should be imported.
Step 4: Run the import command on Hadoop.
Step 5: Check in Hive whether the data was imported successfully.
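A minimal sketch of these five steps, assuming a MySQL database named db with an employee table and a Hive database named userdb (all names are illustrative):

-- Steps 1-2: in the MySQL shell
CREATE DATABASE db;
USE db;
CREATE TABLE employee (empid INT PRIMARY KEY, empname VARCHAR(64));
INSERT INTO employee VALUES (1, 'Alice'), (2, 'Bob');

-- Step 3: in the Hive shell
CREATE DATABASE userdb;

-- Step 4: on the Hadoop node
$ sqoop import \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --hive-import \
  --hive-table userdb.employee

-- Step 5: back in Hive
SELECT * FROM userdb.employee;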
Apache Flume
• Collects, aggregates, and transports streaming data.
• Copies streaming data from various web servers to HDFS.
• Examples: data generated via social media, email messages, log files, etc.
Challenges of the PUT Command
• Transfers only one file at a time.
• Time consuming.
• Difficult and complicated.
Overview of Data Transfer over Apache Flume
Apache Flume Agent
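To make the agent concrete, here is a minimal sketch of a single-agent Flume configuration, assuming a netcat source, an in-memory channel, and an HDFS sink (agent, port, and path names are illustrative):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/flume/events
agent1.sinks.sink1.channel = ch1

The agent would then be started with something like:
$ flume-ng agent --conf conf --conf-file flume.conf --name agent1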
Setting Multi-Agent Flow
Consolidation
Sqoop vs Flume
• Sqoop works well for bulk data transfer, Flume for streaming data.
• In Sqoop, data loading is not driven by events, whereas in Flume it is event-driven.
• In Sqoop, HDFS is the destination of an import operation, whereas in Flume HDFS acts as the centralized store for the data.
Introduction to Apache Hive
• Open-source data warehousing framework.
• Resides on top of Hadoop.
• Used for data analysis.
• Provides an SQL-like query language called HiveQL.
• Originally developed by the Data Infrastructure Team at Facebook.
Hive Architecture
Where to Use Hive
Limitations of Hive
SQL
• Structured Query Language.
• SQL is a declarative language.
• SQL supports a schema for data storage.
HiveQL
• Hive’s SQL language is known as HiveQL.
• Combination of SQL-92, Oracle’s SQL language, and
MySQL.
Hive Data Types
Numeric Types
• INT, TINYINT, SMALLINT, BIGINT
• DOUBLE
• FLOAT
Date/Time Types
• TIMESTAMP
• DATE
String Types
• STRING
• VARCHAR
• CHAR
Complex Types
• STRUCT
• MAP
• ARRAY
Hive DDL Commands
• CREATE
• SHOW
• DESCRIBE
• USE
• DROP
• ALTER
• TRUNCATE
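A short illustrative HiveQL session exercising these DDL commands (database and table names are hypothetical):

CREATE DATABASE userdb;
USE userdb;
CREATE TABLE employee (empid INT, empname STRING);
SHOW TABLES;
DESCRIBE employee;
ALTER TABLE employee ADD COLUMNS (email STRING);
TRUNCATE TABLE employee;
DROP TABLE employee;
DROP DATABASE userdb;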
Hive DML Commands
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
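Likewise, a sketch of the DML commands listed above (paths and table names are illustrative; UPDATE and DELETE require a transactional Hive table):

LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;
INSERT INTO TABLE employee VALUES (3, 'Carol');
SELECT * FROM employee WHERE empid = 3;
UPDATE employee SET empname = 'Caroline' WHERE empid = 3;
DELETE FROM employee WHERE empid = 3;
EXPORT TABLE employee TO '/tmp/employee_backup';
IMPORT TABLE employee_copy FROM '/tmp/employee_backup';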
Hive Joining tables
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
Inner Join in HiveQL
select e1.empname, e2.department_name
from employee e1
join employee_department e2
on e1.empid = e2.depid;
Hive Partitions
• Apache Hive organizes tables into partitions.
• Partitioning is a way of dividing a table into related parts.
Why is Partitioning Important?
• It is difficult for Hadoop users to query huge amounts of data.
• Without partitioning, Hive reads the entire data set when a SQL query is submitted, which makes it inefficient to run MapReduce jobs over a large table.
• Partitioning increases query performance.
How to Create Partitions in Hive?
CREATE TABLE table_name (column1 data_type, column2 data_type)
PARTITIONED BY (partition1 data_type, partition2 data_type, ...);
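For example, a hypothetical sales table partitioned by year and month:

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (year INT, month INT);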
Types of Hive Partitioning
• Static Partitioning
• Dynamic Partitioning
Hive Static Partitioning
• Input data files are loaded individually into the partitioned table, one partition at a time.
• Altering a partition is allowed in static partitioning.
• Static partitioning works in strict mode.
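A sketch of a static-partition load into the hypothetical sales table above, with the partition values spelled out in the statement:

LOAD DATA LOCAL INPATH '/tmp/sales_2024_01.csv'
INTO TABLE sales PARTITION (year = 2024, month = 1);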
Hive Dynamic Partitioning
• A single insert statement loads data into all partitions of the table.
• Alterations on dynamic partitions are not allowed.
• Dynamic partitioning works in non-strict mode.
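A sketch of a dynamic-partition insert, where Hive derives the partition values from the query (staged_sales is a hypothetical staging table feeding the sales table defined earlier):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, year, month FROM staged_sales;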
Hive Partitioning – Advantages
• Distributes the execution load horizontally.
• Faster execution of queries.
• Each partition contains a lower volume of data, so less data is scanned per query.
Hive Partitioning – Disadvantages
• Can create too many small partitions/directories.
• Some queries dealing with huge volumes of data still take a long time to execute.
Bucketing in Hive
Decomposing table data sets into more manageable
parts
Why Bucketing?
• Gives effective results in a few scenarios, such as:
  – when there is a limited number of partitions, or
  – when partitions are of comparatively equal size.
Features of Bucketing in Hive
• Records with the same value in the bucketed column are always stored in the same bucket.
• The CLUSTERED BY clause is used to divide the table into buckets.
• Partitioning and bucketing can be applied to the same Hive table.
• Bucketed tables create almost equally distributed data file parts.
Advantages of Bucketing in Hive
• Efficient sampling.
• Faster query responses.
• Flexibility to keep the records in each bucket sorted by one or more columns.
Limitations of Bucketing in Hive
• It doesn't ensure that the table is properly populated.
• Data loading into buckets must be handled manually.
A bucketed, sorted user table
CREATE TABLE bucketed_user (
  firstname VARCHAR(64),
  lastname VARCHAR(64),
  address STRING,
  city VARCHAR(64),
  state VARCHAR(64),
  post STRING,
  phone1 VARCHAR(64),
  phone2 STRING,
  email STRING,
  web STRING
)
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
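A sketch of how such a table might be populated from a hypothetical staging table temp_user; on older Hive versions bucketing must be enforced explicitly, and the dynamic country partition requires non-strict mode:

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE bucketed_user PARTITION (country)
SELECT firstname, lastname, address, city, state, post,
       phone1, phone2, email, web, country
FROM temp_user;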
INTRODUCTION TO HBASE
• Distributed column-oriented database.
• Built on top of the Hadoop file system.
• Open-source project; horizontally scalable.
• Provides random, real-time read/write access to data in the Hadoop File System.
HDFS vs HBase
• HDFS is a distributed file system whereas HBase is a
database .
• HDFS does not support fast individual record lookups
whereas HBase does.
• HDFS offers high latency batch processing whereas
HBase offers low latency access to single rows from
billions of records
Column Oriented
• Store data tables as sections of columns of data, rather
than as rows of data.
• It is suitable for Online Analytical Processing (OLAP).
• Column-oriented databases are designed for huge tables.
Row Oriented
• It is suitable for Online Transaction Processing (OLTP).
• Such databases are designed for small number of rows
and columns.
HBase vs RDBMS
• HBase is schema-less; an RDBMS is governed by its schema.
• HBase is horizontally scalable; an RDBMS is hard to scale.
• HBase has no transactions; an RDBMS is transactional.
• HBase stores de-normalized data; an RDBMS stores normalized data.
• HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
Where to Use HBase
• For random, real-time read/write access to Big Data.
• To host very large tables on clusters of commodity hardware.
• HBase is modeled after Google's Bigtable, which runs on the Google File System, just as HBase runs on top of HDFS.
Applications of HBase
• Write-heavy applications.
• Applications needing fast random access to available data.
• Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Working with HBase Commands
1) HBase general commands – opening the HBase shell
2) create
3) list
4) disable
5) is_disabled
6) enable
7) is_enabled
8) describe
9) drop
10) put
11) get
12) delete
13) deleteall
14) scan
15) count
16) truncate
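An illustrative HBase shell session using several of these commands (the emp table and personal column family are hypothetical):

$ hbase shell
create 'emp', 'personal'
list
put 'emp', '1', 'personal:name', 'Alice'
get 'emp', '1'
scan 'emp'
count 'emp'
delete 'emp', '1', 'personal:name'
disable 'emp'
is_disabled 'emp'
drop 'emp'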