Unit 5

Apache ZooKeeper is a centralized service for maintaining configuration data and providing synchronization in distributed systems, particularly used in Hadoop to manage distributed applications. It addresses common issues like race conditions and deadlocks, ensuring that distributed applications function cohesively. Apache Flume, on the other hand, is a distributed system designed for collecting and transferring data from various sources to centralized storage like HDFS, offering features like reliability, scalability, and fault tolerance.


Zookeeper

Apache Zookeeper: What is Apache Zookeeper?

• Introduction to Apache Zookeeper
• Why do we need Zookeeper in Hadoop?
• How does ZooKeeper in Hadoop work?
• Data Ingestion Tools

Data Ingestion Tools:
• Data ingestion is the transportation of data from assorted
sources to a storage medium where it can be accessed, used,
and analyzed by an organization.
• The destination is typically a data warehouse, data mart,
database, or a document store.
What is Apache ZooKeeper?
• ZooKeeper is a top-level Apache project that acts as a centralized service and is used to maintain naming and configuration data and to provide flexible and robust synchronization within distributed systems.
• ZooKeeper keeps track of the status of the Kafka cluster nodes, and it also keeps track of Kafka topics, partitions, etc.
• ZooKeeper itself allows multiple clients to perform simultaneous reads and writes and acts as a shared configuration service within the system.
• The ZooKeeper Atomic Broadcast (ZAB) protocol is the brains of the whole system, making it possible for ZooKeeper to act as an atomic broadcast system and issue orderly updates.
What is Apache Zookeeper
• Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster.
• ZooKeeper in Hadoop can be viewed as a centralized repository where distributed applications can put data and get data out of it.
• It is used to keep a distributed system functioning together as a single unit, through its synchronization, serialization, and coordination capabilities.
• For simplicity's sake, ZooKeeper can be thought of as a file system in which znodes store data, instead of files or directories storing data (see the sketch after this list).
• ZooKeeper is also a Hadoop administration tool used for managing jobs in the cluster.
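To make the file-system analogy concrete, the following is a minimal, hypothetical sketch using the standard ZooKeeper Java client. The connection string localhost:2181 and the znode path /app/config are placeholder assumptions, not part of the original material.

// A minimal sketch of the file-system analogy above, assuming an ensemble
// reachable at localhost:2181; the znode path /app/config is a placeholder.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (30 s session timeout, no-op watcher).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // A znode holds data directly, unlike a directory in a file system.
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/app/config", "batch.size=100".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read the shared value back.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}

Running this twice would fail on the second create call with a NodeExistsException; a real application would check for the node first or handle that exception.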
Introduction to Apache Zookeeper

• The formal definition of Apache ZooKeeper says that it is a distributed, open-source configuration and synchronization service, along with a naming registry, for distributed applications.
• Apache ZooKeeper is used to manage and coordinate large clusters of machines.
• For example, Apache Storm, which is used by Twitter for storing machine state data, has Apache ZooKeeper as the coordinator between machines.
Why do we need Zookeeper in Hadoop?
• Distributed applications are difficult to coordinate and work with, as they are much more error prone due to the huge number of machines attached to the network.
• Because so many machines are involved, race conditions and deadlocks are common problems when implementing distributed applications.
• A race condition occurs when a machine tries to perform two or more operations at a time; this is taken care of by ZooKeeper's serialization property.
• A deadlock occurs when two or more machines try to access the same shared resource at the same time.
• More precisely, each machine tries to access the other's resource, which locks up the system because neither releases its resource but instead waits for the other to release it.
• Synchronization in ZooKeeper helps to solve deadlocks (a sketch of the standard lock recipe follows this list).
• Another major issue with distributed applications is partial failure of a process, which can lead to data inconsistency.
• ZooKeeper handles this through atomicity, which means either the whole operation finishes or nothing persists after a failure.
• Thus ZooKeeper is an important part of Hadoop that takes care of these small but important issues so that developers can focus more on the functionality of the application.
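As an illustration of how ZooKeeper's primitives are used for synchronization, below is a hedged sketch of the well-known lock recipe built on ephemeral sequential znodes. The ensemble address, the pre-existing /locks parent znode, and the polling loop are simplifying assumptions; production recipes watch the predecessor znode instead of polling.

// Hypothetical sketch of the standard ZooKeeper lock recipe, assuming an
// ensemble at localhost:2181 and a pre-existing /locks parent znode.
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLockSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // Each contender creates an ephemeral, sequential znode; ZooKeeper's
        // ordered writes guarantee every contender sees the same sequence.
        String myNode = zk.create("/locks/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The contender holding the lowest sequence number owns the lock.
        // (A real recipe would watch the predecessor instead of polling.)
        while (true) {
            List<String> children = zk.getChildren("/locks", false);
            Collections.sort(children);
            if (myNode.endsWith(children.get(0))) {
                break; // we hold the lock
            }
            Thread.sleep(100);
        }

        // ... critical section ...

        zk.delete(myNode, -1); // release; ephemeral nodes also vanish if the client dies
        zk.close();
    }
}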
How does ZooKeeper in Hadoop work?
• Hadoop ZooKeeper is a distributed application that follows a simple client-server model, where clients are nodes that make use of the service and servers are nodes that provide the service.
• Multiple server nodes are collectively called a ZooKeeper ensemble.
• At any given time, each ZooKeeper client is connected to at least one ZooKeeper server.
• A master node is dynamically chosen by consensus within the ensemble; a ZooKeeper ensemble therefore usually has an odd number of servers so that a majority vote is always possible.
• If the master node fails, another master is chosen in no time and takes over from the previous master.
• Other than the master and slaves, there are also observers in ZooKeeper.
• Observers were brought in to address the issue of scaling: as slaves are added, write performance suffers because the voting process is expensive.
• Observers are therefore slaves that do not take part in the voting process but otherwise have the same duties as the other slaves (a sample ensemble configuration follows this list).
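For illustration, here is a hypothetical zoo.cfg sketch for a three-voter ensemble plus one observer; the hostnames zk1 to zk4, ports, and data directory are placeholders.

# Hypothetical zoo.cfg sketch for a 3-voter + 1-observer ensemble
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
# Observers replicate data and serve reads but do not vote:
server.4=zk4:2888:3888:observer
# (zk4's own zoo.cfg would additionally set: peerType=observer)

Because only server.1 to server.3 vote, the ensemble keeps its majority with three voters, while the observer adds read capacity without making leader election or writes more expensive.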
Writes in Zookeeper
• All writes in ZooKeeper go through the master node, thus it is guaranteed that all writes will be sequential.
• On performing a write operation to ZooKeeper, each server attached to that client persists the data along with the master.
• Thus, all the servers stay updated about the data.
• However, this also means that concurrent writes cannot be made, and the linear-writes guarantee can be problematic if ZooKeeper is used for a write-dominant workload.
• ZooKeeper in Hadoop is ideally used for coordinating message exchanges between clients, which involves fewer writes and more reads.
• ZooKeeper is helpful as long as the data is merely shared, but if the application writes data concurrently, ZooKeeper can get in the way of the application and impose strict ordering of operations.
Reads in Zookeeper

• ZooKeeper is best at reads, as reads can be concurrent.
• Reads are concurrent because each client is attached to a different server and all clients can read from the servers simultaneously; however, concurrent reads lead to eventual consistency, since the master is not involved.
• There can be cases where a client has an outdated view, which gets updated after a small delay (a sketch of the "sync then read" pattern follows this list).
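When a client cannot tolerate an outdated view, the documented pattern is to call sync() before the read. The sketch below is a minimal, hypothetical example; the ensemble address and the znode path /app/config are assumptions.

// Hedged sketch of the "sync then read" pattern, assuming a reachable
// ensemble and an existing znode /app/config (placeholders).
import org.apache.zookeeper.ZooKeeper;

public class FreshReadSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // sync() asks the client's server to catch up with the leader; because
        // requests on one session are processed in order, the read issued next
        // reflects all writes committed before the sync.
        zk.sync("/app/config", (rc, path, ctx) -> { }, null);
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}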
UNIT-5
Chapter-2
Apache Flume

• Introduction
• Architecture
• Data Flow
• Features and Limitations
• Applications
Introduction to Apache Flume

• Apache Flume is a distributed system for collecting, aggregating, and transferring data from external sources like Twitter, Facebook, and web servers to a central repository such as HDFS.
• It is mainly used for loading log data from different sources into Hadoop HDFS.
• Apache Flume is a highly robust and available service.
• It is extensible, fault-tolerant, and scalable.
Features of Apache Flume
1. Open-source
Apache Flume is an open-source distributed system. So it is available free of cost.
2. Data flow
Apache Flume allows its users to build multi-hop, fan-in, and fan-out flows. It also allows
for contextual routing as well as backup routes (fail-over) for the failed hops.
3. Reliability
In apache flume, the sources transfer events through the channel. The flume source puts
events in the channel which are then consumed by the sink. The sink transfers the event to
the next agent or to the terminal repository (like HDFS).
The events in the flume channel are removed only when they are stored in the next agent
channel or in the terminal repository.
In this way, the single-hop message delivery semantics in Apache Flume caters to
end-to-end reliability of the flow. Flume uses a transactional approach for guaranteeing
reliable delivery of the flume events.
4. Recoverability
The Flume events are staged in a Flume channel on each Flume agent; this manages recovery from failure. Apache Flume also supports a durable file channel, which can be backed by the local file system.
5. Steady flow
Apache Flume offers a steady data flow between read and write operations. When the rate at which data arrives exceeds the rate at which it can be written to the destination, Apache Flume acts as a mediator between the data producers and the centralized stores, and thus offers a steady flow of data between them.
6. Latency
Apache Flume caters to high throughput with lower latency.
7. Ease of use
With Flume, we can ingest stream data from multiple web servers and store it in any of the centralized stores such as HBase, Hadoop HDFS, etc.
8. Reliable message delivery
All the transactions in Apache Flume are channel-based. For each message, there are two transactions: one for the sender and one for the receiver. This ensures reliable message delivery.
9. Import of huge volumes of data
Along with log files, Apache Flume can also be used for importing huge volumes of data produced by e-commerce sites like Flipkart and Amazon, and networking sites like Twitter and Facebook.
10. Support for a variety of sources and sinks
Apache Flume supports a wide range of sources and sinks.
11. Streaming
Apache Flume gives us a reliable solution that helps us ingest online streaming data from different sources (such as email messages, network traffic, log files, social media, etc.) into HDFS.
12. Fault-tolerant and scalable
Flume is an extensible, reliable, highly available, and horizontally
scalable system. It is customizable for different types of sources and
sinks.
13. Inexpensive
It is an inexpensive system. It is less costly to install and operate. Its
maintenance is very economical.
14. Configuration
Apache Flume uses a simple, declarative configuration.
15. Documentation
Flume provides complete documentation with many good examples and patterns, which helps its users learn how Flume can be used and configured.
Limitations of Apache Flume:
• 1. Weak ordering guarantee
• Apache Flume offers weaker ordering guarantees than other systems such as message queues, in exchange for moving data more quickly and enabling cheaper fault tolerance. In Apache Flume's end-to-end reliability mode, events are delivered at least once, but with zero ordering guarantees.
• 2. Duplication
• Apache Flume does not guarantee that the messages delivered are 100% unique; in many scenarios, duplicate messages may appear.
• 3. Low scalability
• Sizing the hardware for a typical Apache Flume deployment is tricky for most businesses and is, in most cases, a matter of trial and error. Because of this, Flume's scalability is often under the lens.
• 4. Reliability issue
• The throughput that Apache Flume can handle depends highly on the backing store of the channel. If the backing store is not chosen wisely, there may be scalability and reliability issues.
• 5. Complex topology
• Flume topologies can be complex, and reconfiguration is challenging.
• Despite these drawbacks, Flume's advantages outweigh its disadvantages.
Flume Event

• A Flume event is the basic unit of the data that is transported inside Flume. A Flume event has a payload of a byte array.
• This payload is transferred from the source to the destination, accompanied by optional headers.
• The figure below depicts the structure of the Flume event; a client-SDK sketch of building and sending such an event follows.
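The sketch below shows, in a hedged way, how such an event (byte-array body plus optional headers) can be built and sent with the Flume client SDK (flume-ng-sdk). It assumes a Flume agent whose Avro source listens on localhost:41414, which is purely an assumption for illustration.

// Hypothetical sketch using the Flume client SDK: build an event with a
// byte-array body plus optional headers and send it to an agent whose Avro
// source is assumed to listen on localhost:41414.
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSketch {
    public static void main(String[] args) throws EventDeliveryException {
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            Map<String, String> headers = new HashMap<>();
            headers.put("source", "demo");                    // optional headers
            Event event = EventBuilder.withBody(
                    "sample log line".getBytes(StandardCharsets.UTF_8), headers);
            client.append(event);                             // one event, one delivery attempt
        } finally {
            client.close();
        }
    }
}

The append() call is intended to return only after the receiving source has handed the event to its channel, which ties into the transactional delivery described in the reliability feature above.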
Apache Flume Architecture:

1. Data Generators
Data generators generate real-time streaming data.
The data generated by data generators is collected by individual Flume agents running on them.
Common data generators are Facebook, Twitter, etc.
2. Flume Agent
A Flume agent is a JVM process. It receives events from clients or other agents and transfers them to the destination or to other agents.
Each agent consists of three components, a source, a channel, and a sink, through which data flows in Flume (a sample agent configuration appears at the end of this section).
a. Source
• A Flume source is the component of Flume Agent which consumes data (events) from data generators like a
web server and delivers it to one or more channels.
• The data generator sends data (events) to Flume in a format recognized by the target Flume source.
• Flume supports different types of sources. Each source receives events (data) from a specific data generator.
• Example of Flume sources: Avro source, Exec source, Thrift source, NetCat source, HTTP source, Scribe
source, twitter 1% source, etc.
b. Channel
• When a Flume source receives an event from a data generator, it stores it on one or more channels. A Flume
channel is a passive store that receives events from the Flume source and stores them till Flume sinks
consume them.
• Channel acts as a bridge between Flume sources and Flume sinks.
• Flume channels are fully transactional and can work with any number of Flume sources and sinks.
• Example of Flume Channel− Custom Channel, File system channel, JDBC channel, Memory channel, etc.
• c. Sink
• The Flume sink retrieves the events from the Flume channel and pushes them to a centralized store like HDFS or HBase, or passes them to the next agent.

Example of Flume Sink − HDFS sink, Avro sink, HBase sink, Elasticsearch sink, etc.
3. Data collector
• The data collector collects the data from individual agents and aggregates
them.
• It pushes the collected data to a centralized store.
4. Centralized store
• Centralized stores are Hadoop HDFS, HBase, etc.
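To tie the source, channel, and sink components together, here is a hypothetical single-agent configuration sketch in the properties format Flume reads; the agent name a1, the log path, and the HDFS URL are placeholders, not values from the original material.

# Hypothetical flume.conf sketch: one agent ("a1") tails a web-server log
# (exec source), buffers events in a memory channel, and writes to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

Such a file is typically started with the flume-ng command, for example: flume-ng agent --name a1 --conf conf --conf-file flume.conf.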
Applications of Apache Flume (Use Cases)
