
Hadoop Ecosystem Tool

(Mod 2)
Prepared By: Dr. Kimmi Kumari
Contents
• Introduction to Sqoop
• Apache Flume
• Hive
• HBase
Overview of Sqoop in Hadoop
• Sqoop is a tool used to perform data transfer operations.
• Data is transferred between relational database management systems (RDBMS) and the Hadoop cluster.
Features of Sqoop
• Parallel Import/Export
• Import Results of an SQL Query
• Connectors For All Major RDBMS Databases
• Kerberos Security Integration
• Provides Full and Incremental Load
Sqoop Architecture
• The user submits an import/export command.
• Sqoop fetches data from different databases.
• Map tasks are launched to load the data onto HDFS.
Sqoop - Import All Tables
• Imports a set of tables from an RDBMS to HDFS
• The following syntax is used to import all tables:
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
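As an illustration (not part of the original slides), importing all tables from a hypothetical MySQL database named userdb might look like this:

$ sqoop import-all-tables \
  --connect jdbc:mysql://localhost/userdb \
  --username root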
Sqoop Workflow
Sqoop - Import All Tables
• Every table in that database must have a primary key field.
• To verify that the table data for the userdb database has been imported into HDFS, use the command below:
$HADOOP_HOME/bin/hadoop fs -ls
Sqoop Export
• Exports a set of files from HDFS back to RDBMS
• The target table must already exist in the RDBMS.
Modes of Sqoop Export
• Insert mode
• Update mode
Syntax for Sqoop Export
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Consider the table below.
• The following command exports the table data (stored in the emp_data file on HDFS) to the employee table in the db database on the MySQL server:
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
Importing Data from MySQL to HDFS
Step 1: Log in to MySQL.
Step 2: Create a database and table and insert data.
Step 3: Create a database and table in Hive into which the data should be imported.
Step 4: Run the import command on Hadoop.
Step 5: Check in Hive whether the data was imported successfully.
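A minimal sketch of these five steps, assuming a MySQL database named db with an employee table and a Hive database named userdb (all names are illustrative):

-- Steps 1-2: in the MySQL shell
CREATE DATABASE db;
USE db;
CREATE TABLE employee (empid INT PRIMARY KEY, empname VARCHAR(64));
INSERT INTO employee VALUES (1, 'Alice'), (2, 'Bob');

-- Step 3: in the Hive shell
CREATE DATABASE userdb;

-- Step 4: on the Hadoop node
$ sqoop import \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --hive-import \
  --hive-table userdb.employee

-- Step 5: back in Hive
SELECT * FROM userdb.employee;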
Apache Flume
• Collects, aggregates, and transports streaming data.
• Copies streaming data from various web servers to HDFS.
• Examples: data generated via social media, email messages, log files, etc.
Challenges of the PUT Command
• Transfers only one file at a time.
• Time consuming.
• Difficult and complicated.
Overview of Data Transfer over Apache Flume
Apache Flume Agent
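To make the agent concrete, here is a minimal sketch of a single-agent Flume configuration, assuming a netcat source, an in-memory channel, and an HDFS sink (agent, port, and path names are illustrative):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost:9000/flume/events
agent1.sinks.sink1.channel = ch1

The agent would then be started with something like:
$ flume-ng agent --conf conf --conf-file flume.conf --name agent1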
Setting Multi-Agent Flow
Consolidation
Sqoop vs Flume
• Sqoop works well for bulk data transfer, Flume for streaming data.
• In Sqoop, data loading is not driven by events, whereas in Flume it is event-driven.
• In Sqoop, HDFS is the destination of an import operation, whereas in Flume HDFS acts as the centralized store for the data.
Introduction to Apache Hive
• Open-source data warehousing framework.
• Resides on top of Hadoop.
• Used for data analysis.
• Provides an SQL-like query language called HiveQL.
• Originally developed by the Data Infrastructure Team at Facebook.
Hive Architecture
Where to Use Hive
Limitations of Hive
SQL
• Structured Query Language.
• SQL is a declarative language.
• SQL supports a schema for data storage.
HiveQL
• Hive’s SQL language is known as HiveQL.
• Combination of SQL-92, Oracle’s SQL language, and
MySQL.
Hive Data Types
Numeric Types
• INT, TINYINT, SMALLINT, BIGINT
• DOUBLE
• FLOAT
Date/Time Types
• TIMESTAMP
• DATE
String Types
• STRING
• VARCHAR
• CHAR
Complex Types
• STRUCT
• MAP
• ARRAY
Hive DDL Commands
• CREATE
• SHOW
• DESCRIBE
• USE
• DROP
• ALTER
• TRUNCATE
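A short illustrative HiveQL session exercising these DDL commands (database and table names are hypothetical):

CREATE DATABASE userdb;
USE userdb;
CREATE TABLE employee (empid INT, empname STRING);
SHOW TABLES;
DESCRIBE employee;
ALTER TABLE employee ADD COLUMNS (email STRING);
TRUNCATE TABLE employee;
DROP TABLE employee;
DROP DATABASE userdb;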
Hive DML Commands
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
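Likewise, a sketch of the DML commands listed above (paths and table names are illustrative; UPDATE and DELETE require a transactional Hive table):

LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;
INSERT INTO TABLE employee VALUES (3, 'Carol');
SELECT * FROM employee WHERE empid = 3;
UPDATE employee SET empname = 'Caroline' WHERE empid = 3;
DELETE FROM employee WHERE empid = 3;
EXPORT TABLE employee TO '/tmp/employee_backup';
IMPORT TABLE employee_copy FROM '/tmp/employee_backup';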
Hive Joining tables
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
Inner Join in HiveQL
select e1.empname, e2.department_name
from employee e1
join employee_department e2
on e1.empid = e2.depid;
Hive Partitions
• Apache Hive organizes tables into partitions.
• Partitioning is a way of dividing a table into related parts.
Why is Partitioning Important?
• It is difficult for Hadoop users to query huge amounts of data.
• Without partitioning, Hive reads the entire data set when a SQL query is submitted, which makes it inefficient to run MapReduce jobs over a large table.
• Partitioning increases query performance.
How to Create Partitions in Hive?
CREATE TABLE table_name (column1 data_type, column2 data_type)
PARTITIONED BY (partition1 data_type, partition2 data_type, ...);
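For example, a hypothetical sales table partitioned by year and month:

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (year INT, month INT);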
Types of Hive Partitioning
• Static Partitioning
• Dynamic Partitioning
Hive Static Partitioning
• Input data files are loaded individually into the partitioned table, one partition at a time.
• Altering a partition is allowed in static partitioning.
• Static partitioning works in strict mode.
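A sketch of a static-partition load into the hypothetical sales table above, with the partition values spelled out in the statement:

LOAD DATA LOCAL INPATH '/tmp/sales_2024_01.csv'
INTO TABLE sales PARTITION (year = 2024, month = 1);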
Hive Dynamic Partitioning
• A single insert statement loads data into all partitions of the table.
• Alterations on dynamic partitions are not allowed.
• Dynamic partitioning works in non-strict mode.
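A sketch of a dynamic-partition insert, where Hive derives the partition values from the query (staged_sales is a hypothetical staging table feeding the sales table defined earlier):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, year, month FROM staged_sales;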
Hive Partitioning – Advantages
• Distributes the execution load horizontally.
• Faster execution of queries.
• Each partition contains a lower volume of data, so less data is scanned per query.
Hive Partitioning – Disadvantages
• Can create too many small partitions/directories.
• Some queries dealing with huge volumes of data still take a long time to execute.
Bucketing in Hive
Decomposing table data sets into more manageable
parts
Why Bucketing?
• Gives effective results in a few scenarios, such as:
  – when there is a limited number of partitions, or
  – when partitions are of comparatively equal size.
Features of Bucketing in Hive
• Records with the same value in the bucketed column are always stored in the same bucket.
• The CLUSTERED BY clause is used to divide the table into buckets.
• Partitioning and bucketing can be applied to the same Hive table.
• Bucketed tables create almost equally distributed data file parts.
Advantages of Bucketing in Hive
• Efficient sampling.
• Faster query responses.
• Flexibility to keep the records in each bucket sorted by one or more columns.
Limitations of Bucketing in Hive
• It doesn't ensure that the table is properly populated.
• Data loading into buckets must be handled manually.
A bucketed, sorted user table
CREATE TABLE bucketed_user (
  firstname VARCHAR(64),
  lastname VARCHAR(64),
  address STRING,
  city VARCHAR(64),
  state VARCHAR(64),
  post STRING,
  phone1 VARCHAR(64),
  phone2 STRING,
  email STRING,
  web STRING
)
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
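A sketch of how such a table might be populated from a hypothetical staging table temp_user; on older Hive versions bucketing must be enforced explicitly, and the dynamic country partition requires non-strict mode:

SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE bucketed_user PARTITION (country)
SELECT firstname, lastname, address, city, state, post,
       phone1, phone2, email, web, country
FROM temp_user;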
INTRODUCTION TO HBASE
• Distributed column-oriented database.
• Built on top of the Hadoop file system.
• Open-source project; horizontally scalable.
• Provides random, real-time read/write access to data in the Hadoop File System.
HDFS vs HBase
• HDFS is a distributed file system whereas HBase is a
database .
• HDFS does not support fast individual record lookups
whereas HBase does.
• HDFS offers high latency batch processing whereas
HBase offers low latency access to single rows from
billions of records
Column Oriented
• Store data tables as sections of columns of data, rather
than as rows of data.
• It is suitable for Online Analytical Processing (OLAP).
• Column-oriented databases are designed for huge tables.
Row Oriented
• It is suitable for Online Transaction Processing (OLTP).
• Such databases are designed for small number of rows
and columns.
HBase vs RDBMS
• HBase is schema-less; an RDBMS is governed by its schema.
• HBase is horizontally scalable; an RDBMS is hard to scale.
• HBase has no transactions; an RDBMS is transactional.
• HBase stores de-normalized data; an RDBMS stores normalized data.
• HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
Where to Use HBase
• For random, real-time read/write access to Big Data.
• To host very large tables on clusters of commodity hardware.
• HBase is modeled after Google's Bigtable, which runs on the Google File System, just as HBase runs on top of HDFS.
Applications of HBase
• Write-heavy applications.
• Applications needing fast random access to available data.
• Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Working with HBase Commands
1) HBase general commands – opening the HBase shell
2) create
3) list
4) disable
5) is_disabled
6) enable
7) is_enabled
8) describe
9) drop
10) put
11) get
12) delete
13) deleteall
14) scan
15) count
16) truncate
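An illustrative HBase shell session using several of these commands (the emp table and personal column family are hypothetical):

$ hbase shell
create 'emp', 'personal'
list
put 'emp', '1', 'personal:name', 'Alice'
get 'emp', '1'
scan 'emp'
count 'emp'
delete 'emp', '1', 'personal:name'
disable 'emp'
is_disabled 'emp'
drop 'emp'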