Unit_IV_Hadoop

Hadoop is an open-source software framework designed for storing, processing, and analyzing large datasets in a distributed, scalable, and fault-tolerant manner. It consists of core components like HDFS for data storage and MapReduce for data processing, along with various ecosystem tools such as HBase, Flume, Sqoop, Hive, and Pig. Hadoop enables organizations to handle vast amounts of unstructured data efficiently, significantly reducing time and costs associated with data analysis.


Hadoop & Hadoop Ecosystem


 Hadoop – Basic Concepts
 HDFS
 MapReduce
 Hadoop Ecosystem
 HBase
 Flume
 Sqoop
 Hive and Pig


Hadoop – What is it?

A software framework for storing, processing, and analysing "Big Data"

 Distributed
 Scalable
 Fault-tolerant
 Open source
Hadoop’s History
 Based on work done by Google in the early 2000s
 The Google File System (2003)
 MapReduce (2004)
 Core concept: distribute the data as it is initially stored in the system
 Individual nodes can work on data local to those nodes
 No data transfer over the network is required for initial processing
Who uses Hadoop?
Traditional Large Scale Computation
 Traditionally, computation has been
processor-bound
 Relatively small amounts of data
 Lots of complex processing performed on
that data
 For decades, the primary push was to
increase the computing power of a single
machine
 Faster processor, more RAM
Distributed Systems

 The better solution: more computers
 Distributed systems: use multiple machines for a single job
Distributed System - Challenges
 Programming for traditional distributed systems is complex

 Data exchange requires synchronization

 Finite bandwidth is available

 Temporal dependencies are complicated

 It is difficult to deal with partial failures of the system


Traditional Systems: The Data Bottleneck
 Traditionally data is stored in a central location
 Data is copied to processors at runtime
 Fine for limited amounts of data
Data Bottleneck (Contd..)
 Modern systems have much more data
 Terabytes+ a day
 Petabytes+ total
 New Approach is required
Hadoop
 A radical new approach to distributed computing
 Distribute data when the data is stored
 Run computation where the data is
Hadoop: Very High-Level Overview
 Data is split into “blocks” when loaded

 Map tasks typically work on a single block


 A master program manages tasks
Core Hadoop Concepts
 Applications are written in high-level code
 – Developers do not worry about network programming
 Nodes talk to each other as little as possible
 – Developers should not write code for node communication
 Data is spread among machines in advance
 – Computation happens where the data is stored
 – Data is replicated multiple times on the system
 Hadoop is scalable and fault-tolerant


Scalability

 Adding nodes adds capacity proportionally


 Increasing load results in a gradual decline in performance, not failure of the system
Fault Tolerance
 Node failure is inevitable
 When a node fails:
 The system continues to function
 The master reassigns tasks to a different node
 Data replication means no loss of data
 Nodes which recover rejoin the cluster automatically
Hadoop-able Problems
What can Hadoop do?
Where does data come from?
 Science
 Medical imaging, sensor data, genome sequencing, weather
data, satellite feeds, etc.
 Industry
 Financial, pharmaceutical, manufacturing, insurance,
online, energy, retail data
 Legacy
 Sales data, customer behaviour, product databases,
accounting data, etc.
 System Data
 Log files, health & status feeds, activity streams, network
messages, Web Analytics, intrusion detection, spam filters
What is common across Hadoop-able Problems?
 Nature of the data (the 3 Vs)
 Volume
 Velocity
 Variety
 Nature of the analysis
 Batch Processing
 Parallel Processing
 Distributed Data - Spread data over a cluster of
servers and take the computation to the data
Benefits of Hadoop
 Previously impossible or impractical analyses become feasible
 Lower cost
 Less time
 Greater flexibility
 Near-linear scalability
 Ask bigger questions
Hadoop Components
Hadoop Core Components
 HDFS (Hadoop Distributed File System)
 Stores data on the cluster
 MapReduce
 Process data stored on the cluster
Hadoop Distributed File System (HDFS)
HDFS – Basic Concepts
 HDFS is a file system written in Java
 Sits on top of the native file system
 Provides storage for massive amounts of data
 Scalable
 Fault-tolerant
 Supports efficient processing with MapReduce
How files are stored (series of diagram slides)
Getting Data in and out of HDFS (series of diagram slides)
Example: Storing and Retrieving Files (series of diagram slides)
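
The original slides illustrate these steps with diagrams. As a minimal sketch (not taken from the slides), the snippet below moves a file into and out of HDFS using the standard Hadoop FileSystem Java API; the file paths are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Put a local file into HDFS (paths are illustrative only)
        fs.copyFromLocalFile(new Path("/tmp/sales.log"),
                             new Path("/user/dave/sales.log"));

        // Get it back out of HDFS to the local file system
        fs.copyToLocalFile(new Path("/user/dave/sales.log"),
                           new Path("/tmp/sales_copy.log"));

        fs.close();
    }
}

From the command line, the equivalent operations are typically performed with the hdfs dfs -put and hdfs dfs -get commands.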
Important points about HDFS
 HDFS performs best with a modest number of large files
 Millions, rather than billions, of files
 Each file typically 100 MB or more
 Files in HDFS are "write once"
 Files can be replaced but not changed
 HDFS is optimized for large, streaming reads of files, rather than random reads
MapReduce
MapReduce: Basic Concepts
 MapReduce is a method for distributing a task across multiple nodes
 Each node processes data stored on that node, where possible
 Consists of two phases:
 Map
 Reduce
Basic Terminology
 Job – a "full program": an execution of a Mapper and Reducer across a data set
 Task – an execution of a Mapper or a Reducer on a slice of data
 a.k.a. Task-In-Progress (TIP)
 Task Instance – an attempt to execute a task on a machine
MapReduce: High-Level Architecture (diagram slide)
 A MapReduce job is submitted by a client computer to the JobTracker, which runs on the master node
 Each slave node runs a TaskTracker, which launches and monitors the task instances running on that node
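
To connect this picture to code, a minimal driver sketch is shown below; it is illustrative rather than code from the slides. It configures and submits a word-count Job using Hadoop's built-in TokenCounterMapper and IntSumReducer classes, with the HDFS input and output paths passed as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // A Job is the "full program": one run of a Mapper and Reducer over a data set
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class); // built-in mapper: emits (token, 1)
        job.setReducerClass(IntSumReducer.class);     // built-in reducer: sums counts per key
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        // Submits the job to the cluster and waits for all map and reduce tasks to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}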


MapReduce: Components
 Mapper
 Each mapper operates on a single HDFS block
 Map tasks run on the node where the block is stored
 Shuffle & Sort
 Sorts and consolidates intermediate data from all mappers
 Happens after all map tasks are complete and before reduce tasks start
 Reducer
 Operates on the sorted/shuffled intermediate data (the map task output)
 Produces the final output
Example: Word Count (diagram slides – Word Count Mapper, Shuffle & Sort, Reducer, Mappers run in parallel)
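
As a minimal hand-written version of the word count example (a sketch, not code from the slides), the Mapper and Reducer below follow the standard Hadoop MapReduce Java API; the driver shown earlier could use these classes in place of the built-in ones.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in its input split
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer (would normally live in its own file): sums the counts for each word
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The Mapper emits (word, 1) pairs for its input split; after the shuffle & sort phase groups the pairs by word, the Reducer sums the counts for each word and writes the final output.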
Features of MapReduce
 Automatic parallelization and distribution
 Fault-tolerance
 Status and monitoring tools
 A clean abstraction for programmers
 MapReduce programs are usually written in Java
 MapReduce abstracts all the 'housekeeping' away from the developer
 The developer can concentrate simply on writing the Map and Reduce functions
Analyzing Log Data (example diagram slide)
Basic Cluster Configuration: HDFS (diagram slide)
Cluster Configuration: MapReduce (diagram slide)
Hadoop Ecosystem (overview diagram slides)
HBase
The Hadoop Database: HBase
RDBMS vs HBase (comparison slide)
When to use HBase?
 Use HDFS if
 You only append to the data set (no random writes)
 You usually read the whole data set (no random reads)
 Use HBase if
 Random reads and/or writes are needed (illustrated in the sketch below)
 Thousands of operations per second must be performed on TB+ of data
 Use an RDBMS if
 The data fits on a single node
 Full transaction support is needed
 Real-time query capabilities are needed
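
As an illustration of the random reads and writes HBase is built for, here is a minimal sketch using the HBase Java client API; the table name, column family, and row key are invented for the example and assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // A "purchases" table with a "details" column family is assumed to exist
             Table table = connection.getTable(TableName.valueOf("purchases"))) {

            // Random write: insert one row keyed by purchase ID
            Put put = new Put(Bytes.toBytes("purchase-1001"));
            put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("price"),
                          Bytes.toBytes("12500"));
            table.put(put);

            // Random read: fetch that single row back by key
            Get get = new Get(Bytes.toBytes("purchase-1001"));
            Result result = table.get(get);
            String price = Bytes.toString(
                    result.getValue(Bytes.toBytes("details"), Bytes.toBytes("price")));
            System.out.println("price = " + price);
        }
    }
}

Each Put and Get addresses a single row by key, which is the kind of low-latency access pattern HBase is designed for.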
Hadoop Ecosystem
Flume
Flume: Real-Time Data Import
 What is Flume?
 A service to move large amounts of data in
real time
 Example : Storing log files in HDFS
 Flume is
 Distributed (agents)
 Reliable and available
 Horizontally scalable
 Extensible
Flume: High-Level Overview (series of diagram slides)
Hadoop Ecosystem
Sqoop
Sqoop: Data Exchange with an RDBMS
 Sqoop = "SQL to Hadoop"
 Transfers data between an RDBMS and HDFS
 Uses a command-line tool or application connectors
 Allows incremental updates
Sqoop – Custom Connectors
 Custom connectors are available for
 MySQL
 Postgres
 Netezza
 Teradata
 Oracle (partnered with Quest Software)
 Not open source, but free to use


Data Centre Integration
Hadoop Ecosystem
Hive and Pig
High-Level Data Languages: Hive & Pig
 Motivation: MapReduce is powerful but hard to master
 Solution: Hive and Pig
 Built on MapReduce
 Leverage existing skill sets
 Data analysts who use SQL
 Programmers who use scripting languages
 Open-source Apache projects
 Hive – initially developed at Facebook
 Pig – initially developed at Yahoo!
HIVE
 What is Hive?
 HiveQL: an SQL-like interface to Hadoop
 Example:
 SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid
PIG
 What is Pig?
 Pig Latin: a dataflow language for transforming large data sets
 Example:
purchases = LOAD '/user/dave/purchases'
    AS (itemID, price, storeID, purchaserID);
bigticket = FILTER purchases BY price > 10000;
…..
Hive vs Pig

                      Hive                        Pig
Language              HiveQL (SQL-like)           Pig Latin (a dataflow language)
Schema                Table definitions stored    Schema optionally defined
                      in a metastore              at runtime
Programmatic access   JDBC, ODBC                  PigServer (Java API)

Hadoop Solutions – Examples
Data Processing – Traditional Approach (diagram slides)
Data Processing with Hadoop (diagram slide)
Conclusion
 Data is big and getting bigger
 Data is often unstructured and complex
 Hadoop is used to derive value from that data
 Benefits of Hadoop
 Handles less structured or unstructured data
 Significantly lower time and cost
 Can retain data indefinitely at much lower cost than traditional solutions
 Hadoop has become a valuable business intelligence tool and will become an increasingly important part of BI infrastructure
 Hadoop won't replace the EDW, but will complement the BI infrastructure of organizations with a large EDW
 Using Hadoop offloads the time- and resource-intensive processing of large data sets, freeing the data warehouse to serve user needs