Chapter 2 Introduction To Hadoop
Introduction
1. What is Hadoop?
2. Core Components of Hadoop
3. Hadoop Ecosystem
4. Physical Architecture
5. Hadoop Limitations
Hadoop
• Scalability
• No pre-processing required
• Handles unstructured data
• Divides data into blocks and chunks, storing them across multiple servers
• Processing is done in parallel across multiple connected machines
• Scalable
– From a single server to thousands of servers
• Fault tolerance
• Economical
• Handles hardware failure
– Ability to detect and handle failures at the application layer
Hadoop Assumptions
1. Hardware failure is the norm rather than the exception.
2. Applications need streaming access to their data sets.
3. Applications work with very large data sets.
4. Files follow a write-once, read-many access model.
5. Moving computation is cheaper than moving data.
6. Portability across heterogeneous hardware and software platforms is important.
Core Components of Hadoop
1. Hadoop HDFS (Hadoop Distributed File System)
2. Hadoop YARN
3. Hadoop MapReduce
• MapReduce is the key algorithm used to distribute work around a cluster.
• HDFS runs two types of nodes:
1. Name Node
2. Data Node
Name Node
• Functions of NameNode:
1. Manages the namespace of the file system in memory
2. Maintains inode information
3. Maps each inode to its list of blocks and their locations
4. Ensures authorization and authentication
5. Creates checkpoints and logs namespace changes
Data Node
• Functions of DataNode (a client read sketch follows this list):
1. Handles block storage on multiple volumes
2. Maintains block integrity
3. Periodically sends heartbeat signals and block reports to the NameNode
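A minimal Java sketch of how a client reads a file through these two node types: the FileSystem client asks the NameNode to resolve the path into block locations, and the bytes are then streamed from the DataNodes. It assumes a running HDFS; the path /user/demo/file.txt is a hypothetical example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);         // connects to the configured NameNode
            Path path = new Path("/user/demo/file.txt");  // hypothetical example path
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);             // the bytes come from DataNodes
                }
            }
        }
    }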
Hadoop Map-Reduce
• An algorithm that enables parallel processing.
• Two phases (sketched in the word-count example below):
1. Map Phase:
– The input forms a set of key-value pairs.
– The desired function is executed over each key-value pair to generate a set of intermediate key-value pairs.
2. Reduce Phase:
– The intermediate key-value pairs are grouped by key, and the values are combined according to the reduce algorithm provided by the user.
• HDFS is the storage system for both the input and output of MapReduce jobs.
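A minimal word-count sketch of the two phases, assuming the standard org.apache.hadoop.mapreduce API: map() emits an intermediate (word, 1) pair for every token, and reduce() receives each word together with all of its grouped counts.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for each input line, emit an intermediate (word, 1) pair.
    class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups the intermediate pairs by key, so
    // each call receives one word and all of its counts to combine.
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }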
Components of MapReduce
1. Job Tracker:
– Master which manages the jobs and resources in the cluster.
– It schedules each map task on a Task Tracker (see the driver sketch below).
– There is one Job Tracker per cluster.
2. Task Tracker:
– Slaves that run on every machine in the cluster.
– Responsible for running Map and Reduce tasks as instructed by the Job Tracker.
3. JobHistoryServer:
– Daemon that saves historical information about completed tasks.
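A minimal driver sketch of how a job is described and submitted to the master for scheduling. It assumes the TokenizerMapper and IntSumReducer classes from the word-count sketch above; the input and output HDFS paths are passed as arguments.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: describes the job and submits it to the cluster, where the
    // master schedules the map and reduce tasks on the worker nodes.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);  // from the sketch above
            job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }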
YARN (Yet Another Resource Negotiator)
• HDFS
1. It is the foundation for many other big data frameworks.
2. It provides scalable and reliable storage.
3. As the size of the data increases, commodity hardware can be added to increase storage capacity.
• YARN
1. Provides flexible scheduling and resource management over the HDFS storage.
2. Used at Yahoo to schedule jobs across 40,000 servers.
• MapReduce
1. A programming model.
2. Simplifies parallel computing.
3. Instead of dealing with the complexities of synchronization and scheduling, the programmer deals with only two functions: Map() and Reduce().
4. Used by Google for indexing websites.
Hadoop Ecosystem
• HIVE
1. A programming model.
2. Created at Facebook to issue SQL-like queries, using MapReduce, on their data in HDFS.
3. It is basically a data warehouse that provides ad-hoc queries, data summarization and analysis of huge data sets (see the query sketch below).
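A minimal sketch of issuing a SQL-like query through HiveServer2's JDBC interface; the endpoint, credentials, the logs table and the query itself are hypothetical examples.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hive compiles the query into jobs over the files stored in HDFS.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }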
• PIG
1. A high-level programming model.
2. Processes and analyses big data using user-defined functions, reducing programming effort.
3. Provides a bridge to query data on Hadoop, but unlike HIVE it uses a script-based implementation to make Hadoop data accessible to developers.
4. Created at Yahoo to model data-flow-based programs using MapReduce.
• Giraph
1. A specialized model for graph processing.
2. Used by Facebook to analyze the social graph.
• Spark
1. Real-time, in-memory data processing.
2. In-memory operation makes it up to 100x faster for some tasks.
3. Spark provides an easier-to-use alternative to MapReduce and offers performance up to 10 times faster for certain applications.
4. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python (see the Java sketch below).
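For comparison with the MapReduce word count above, the same job as a few RDD transformations in Spark's Java API; intermediate results stay in memory instead of being written back to disk. The local[*] master and the argument paths are hypothetical example settings.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines = sc.textFile(args[0]);  // hypothetical input path
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
                counts.saveAsTextFile(args[1]);                // hypothetical output path
            }
        }
    }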
• Storm
1. Storm is a complex event processor (CEP).
2. It also works as a distributed computation framework for processing fast, large streams of data.
3. Real-time, in-memory data processing.
• Flink
1. Flink is a data processing system and an alternative to MapReduce.
2. It comes with its own runtime, rather than building on top of MapReduce.
3. Real-time, in-memory data processing.
• HBase
1. It is the Hadoop database.
2. A NoSQL / non-relational distributed database.
3. It is a backing store for MapReduce job outputs.
4. HBase is column-oriented rather than row-oriented, for fast processing (see the sketch below).
5. Facebook also uses HBase for messaging.
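A minimal sketch of the HBase Java client writing and reading one cell, illustrating column-oriented addressing by row key, column family and qualifier. The messages table and its msg column family are hypothetical and must already exist on the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("messages"))) {
                // Write: row "user1", column family "msg", qualifier "text".
                table.put(new Put(Bytes.toBytes("user1"))
                    .addColumn(Bytes.toBytes("msg"), Bytes.toBytes("text"),
                               Bytes.toBytes("hello")));
                // Read the same cell back.
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                byte[] value = result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("text"));
                System.out.println(Bytes.toString(value));
            }
        }
    }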
• Cassandra
1. A free and open-source, distributed, wide-column database management system designed to handle large amounts of data across many commodity servers.
2. It provides high availability with no single point of failure.
3. A NoSQL / non-relational distributed database.
4. MapReduce can retrieve data from Cassandra.
• MongoDB
1. A NoSQL database.
2. A document-oriented database system.
3. It stores structured data as JSON-like documents (see the sketch below).
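A minimal sketch with the MongoDB Java driver showing what a JSON-like document looks like in practice; the server address, database and collection names are hypothetical.

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class MongoInsert {
        public static void main(String[] args) {
            // Hypothetical local server, database and collection names.
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> users =
                    client.getDatabase("demo").getCollection("users");
                // Stored as BSON; the collection needs no fixed schema.
                users.insertOne(new Document("name", "Alice")
                    .append("age", 30)
                    .append("skills", java.util.Arrays.asList("hadoop", "spark")));
            }
        }
    }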
• Zookeeper
1. A coordination service that gives you the tools you need to write correct distributed applications.
2. Manages the cluster.
3. Running all of these tools requires a centralized management system for synchronization and configuration, and to ensure high availability (see the sketch below).
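A minimal coordination sketch with the ZooKeeper Java client: one process publishes a piece of shared configuration as a znode that every other process in the cluster can read or watch. The ensemble address, znode path and payload are hypothetical.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfig {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Hypothetical ensemble address; the watcher fires once the session is up.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            // Publish one piece of shared configuration as a persistent znode.
            if (zk.exists("/app-config", false) == null) {
                zk.create("/app-config", "batch.size=64".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            // Any process in the cluster can read (and watch) the same path.
            System.out.println(new String(zk.getData("/app-config", false, null)));
            zk.close();
        }
    }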
Physical Architecture
• Combining a cloud environment with big data processing tools such as Hadoop provides the high-performance computing power needed to analyze vast amounts of data efficiently and cost-effectively.
[Figure: Cloud integration environment – a Hadoop cluster of storage nodes behind a switch, with HBase VMs, Zookeeper VMs, a database, a web console and an LDAP server VM.]
• Every Hadoop-compatible file system should provide location awareness for effective scheduling of work, as well as data awareness.
• A Hadoop application uses this information to find the data node and run the task there (see the sketch after the figure).
• HDFS replicates data to keep different copies of the data on different racks, to reduce the impact of a rack power or switch failure.
[Figure: Location awareness in a Hadoop-compatible file system – a client asks the NameNode for the block locations of File.txt; the NameNode replies that Block A is on DataNodes 3 and 4, Block B on DataNode 5, and Block C on DataNode 6, with the DataNodes spread across Racks 5, 7 and 9 behind separate switches.]
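A minimal sketch of the location awareness shown in the figure: the client asks the NameNode for the block locations of a file, which is the same information the scheduler uses to run tasks near the data. The path is a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/demo/file.txt");  // hypothetical path
            FileStatus status = fs.getFileStatus(path);
            // Ask the NameNode where every block of the file is stored.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
            }
        }
    }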
Hadoop Limitations
1. Security concerns
• The Hadoop security model is disabled by default due to its sheer complexity.
• It does not provide encryption at the storage and network levels.
2. Vulnerable by nature
• Written entirely in Java, a language widely exploited by cybercriminals.
3. General limitations
• Google mentions in its article that Hadoop may not be the only answer for big data.
• Google has its own Cloud Dataflow as a possible solution.
• Companies could be missing out on many other benefits by using Hadoop alone.
Thank You