Unit 5

Apache ZooKeeper is a centralized service for maintaining configuration data and providing synchronization in distributed systems, particularly used in Hadoop to manage distributed applications. It addresses common issues like race conditions and deadlocks, ensuring that distributed applications function cohesively. Apache Flume, on the other hand, is a distributed system designed for collecting and transferring data from various sources to centralized storage like HDFS, offering features like reliability, scalability, and fault tolerance.


Zookeeper

Apache Zookeeper: What is Apache Zookeeper?

• Introduction to Apache Zookeeper
• Why do we need Zookeeper in Hadoop?
• How does ZooKeeper in Hadoop work?
• Data Ingestion Tools

Data Ingestion Tools:
• Data ingestion is the transportation of data from assorted
sources to a storage medium where it can be accessed, used,
and analyzed by an organization.
• The destination is typically a data warehouse, data mart,
database, or a document store.
What is Apache ZooKeeper?
• ZooKeeper is a top-level Apache project that acts as a centralized service and is used to maintain naming and configuration data and to provide flexible and robust synchronization within distributed systems.
• ZooKeeper keeps track of the status of the Kafka cluster nodes, and it also keeps track of Kafka topics, partitions, etc.
• ZooKeeper itself allows multiple clients to perform simultaneous reads and writes and acts as a shared configuration service within the system.
• The ZooKeeper Atomic Broadcast (ZAB) protocol is the brains of the whole system, making it possible for ZooKeeper to act as an atomic broadcast system and issue orderly updates.
What is Apache Zookeeper
• Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster.
• ZooKeeper in Hadoop can be viewed as a centralized repository where distributed applications can put data and get data out of it.
• It is used to keep a distributed system functioning together as a single unit, through its synchronization, serialization, and coordination capabilities.
• For simplicity's sake, ZooKeeper can be thought of as a file system in which znodes store data, instead of files or directories storing data (see the sketch after this list).
• ZooKeeper is also a Hadoop administration tool used for managing jobs in the cluster.
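To make the file-system analogy concrete, the following is a minimal, hypothetical sketch using the standard ZooKeeper Java client. The connection string localhost:2181 and the znode path /app/config are placeholder assumptions, not part of the original material.

// A minimal sketch of the file-system analogy above, assuming an ensemble
// reachable at localhost:2181; the znode path /app/config is a placeholder.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (30 s session timeout, no-op watcher).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // A znode holds data directly, unlike a directory in a file system.
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/app/config", "batch.size=100".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read the shared value back.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}

Running this twice would fail on the second create call with a NodeExistsException; a real application would check for the node first or handle that exception.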
Introduction to Apache Zookeeper

• The formal definition of Apache ZooKeeper says that it is a distributed, open-source configuration and synchronization service, along with a naming registry, for distributed applications.
• Apache ZooKeeper is used to manage and coordinate large clusters of machines.
• For example, Apache Storm, which is used by Twitter for storing machine state data, has Apache ZooKeeper as the coordinator between machines.
Why do we need Zookeeper in Hadoop?
• Distributed applications are difficult to coordinate and work with, as they are much more error prone due to the huge number of machines attached to the network.
• Because so many machines are involved, race conditions and deadlocks are common problems when implementing distributed applications.
• A race condition occurs when a machine tries to perform two or more operations at a time; this is taken care of by ZooKeeper's serialization property.
• A deadlock occurs when two or more machines try to access the same shared resource at the same time.
• More precisely, each machine tries to access the other's resource, which locks up the system because neither releases its resource but instead waits for the other to release it.
• Synchronization in ZooKeeper helps to solve deadlocks (a sketch of the standard lock recipe follows this list).
• Another major issue with distributed applications is partial failure of a process, which can lead to data inconsistency.
• ZooKeeper handles this through atomicity, which means either the whole operation finishes or nothing persists after a failure.
• Thus ZooKeeper is an important part of Hadoop that takes care of these small but important issues so that developers can focus more on the functionality of the application.
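As an illustration of how ZooKeeper's primitives are used for synchronization, below is a hedged sketch of the well-known lock recipe built on ephemeral sequential znodes. The ensemble address, the pre-existing /locks parent znode, and the polling loop are simplifying assumptions; production recipes watch the predecessor znode instead of polling.

// Hypothetical sketch of the standard ZooKeeper lock recipe, assuming an
// ensemble at localhost:2181 and a pre-existing /locks parent znode.
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLockSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // Each contender creates an ephemeral, sequential znode; ZooKeeper's
        // ordered writes guarantee every contender sees the same sequence.
        String myNode = zk.create("/locks/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The contender holding the lowest sequence number owns the lock.
        // (A real recipe would watch the predecessor instead of polling.)
        while (true) {
            List<String> children = zk.getChildren("/locks", false);
            Collections.sort(children);
            if (myNode.endsWith(children.get(0))) {
                break; // we hold the lock
            }
            Thread.sleep(100);
        }

        // ... critical section ...

        zk.delete(myNode, -1); // release; ephemeral nodes also vanish if the client dies
        zk.close();
    }
}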
How does ZooKeeper in Hadoop work?
• Hadoop ZooKeeper is a distributed application that follows a simple client-server model, where clients are nodes that make use of the service and servers are nodes that provide the service.
• Multiple server nodes are collectively called a ZooKeeper ensemble.
• At any given time, each ZooKeeper client is connected to at least one ZooKeeper server.
• A master node is dynamically chosen by consensus within the ensemble; a ZooKeeper ensemble therefore usually has an odd number of servers so that a majority vote is always possible.
• If the master node fails, another master is chosen in no time and takes over from the previous master.
• Other than the master and slaves, there are also observers in ZooKeeper.
• Observers were brought in to address the issue of scaling: as slaves are added, write performance suffers because the voting process is expensive.
• Observers are therefore slaves that do not take part in the voting process but otherwise have the same duties as the other slaves (a sample ensemble configuration follows this list).
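For illustration, here is a hypothetical zoo.cfg sketch for a three-voter ensemble plus one observer; the hostnames zk1 to zk4, ports, and data directory are placeholders.

# Hypothetical zoo.cfg sketch for a 3-voter + 1-observer ensemble
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
# Observers replicate data and serve reads but do not vote:
server.4=zk4:2888:3888:observer
# (zk4's own zoo.cfg would additionally set: peerType=observer)

Because only server.1 to server.3 vote, the ensemble keeps its majority with three voters, while the observer adds read capacity without making leader election or writes more expensive.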
Writes in Zookeeper
• All writes in ZooKeeper go through the master node, thus it is guaranteed that all writes will be sequential.
• On performing a write operation to ZooKeeper, each server attached to that client persists the data along with the master.
• Thus, all the servers stay updated about the data.
• However, this also means that concurrent writes cannot be made, and the linear-writes guarantee can be problematic if ZooKeeper is used for a write-dominant workload.
• ZooKeeper in Hadoop is ideally used for coordinating message exchanges between clients, which involves fewer writes and more reads.
• ZooKeeper is helpful as long as the data is merely shared, but if the application writes data concurrently, ZooKeeper can get in the way of the application and impose strict ordering of operations.
Reads in Zookeeper

• ZooKeeper is best at reads, as reads can be concurrent.
• Reads are concurrent because each client is attached to a different server and all clients can read from the servers simultaneously; however, concurrent reads lead to eventual consistency, since the master is not involved.
• There can be cases where a client has an outdated view, which gets updated after a small delay (a sketch of the "sync then read" pattern follows this list).
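When a client cannot tolerate an outdated view, the documented pattern is to call sync() before the read. The sketch below is a minimal, hypothetical example; the ensemble address and the znode path /app/config are assumptions.

// Hedged sketch of the "sync then read" pattern, assuming a reachable
// ensemble and an existing znode /app/config (placeholders).
import org.apache.zookeeper.ZooKeeper;

public class FreshReadSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // sync() asks the client's server to catch up with the leader; because
        // requests on one session are processed in order, the read issued next
        // reflects all writes committed before the sync.
        zk.sync("/app/config", (rc, path, ctx) -> { }, null);
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}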
UNIT-5
Chapter-2
Apache Flume

• Introduction
• Architecture
• Data Flow
• Features and Limitations
• Applications
Introduction to Apache Flume

• Apache Flume is a distributed system for collecting, aggregating, and transferring data from external sources like Twitter, Facebook, and web servers to a central repository such as HDFS.
• It is mainly used for loading log data from different sources into Hadoop HDFS.
• Apache Flume is a highly robust and available service.
• It is extensible, fault-tolerant, and scalable.
Features of Apache Flume
1. Open-source
Apache Flume is an open-source distributed system. So it is available free of cost.
2. Data flow
Apache Flume allows its users to build multi-hop, fan-in, and fan-out flows. It also allows
for contextual routing as well as backup routes (fail-over) for the failed hops.
3. Reliability
In apache flume, the sources transfer events through the channel. The flume source puts
events in the channel which are then consumed by the sink. The sink transfers the event to
the next agent or to the terminal repository (like HDFS).
The events in the flume channel are removed only when they are stored in the next agent
channel or in the terminal repository.
In this way, the single-hop message delivery semantics in Apache Flume caters to
end-to-end reliability of the flow. Flume uses a transactional approach for guaranteeing
reliable delivery of the flume events.
4. Recoverability
The Flume events are staged in a Flume channel on each Flume agent; this manages recovery from failure. Apache Flume also supports a durable file channel, which can be backed by the local file system.
5. Steady flow
Apache Flume offers a steady data flow between read and write operations. When the rate at which data arrives exceeds the rate at which it can be written to the destination, Apache Flume acts as a mediator between the data producers and the centralized stores, and thus offers a steady flow of data between them.
6. Latency
Apache Flume caters to high throughput with lower latency.
7. Ease of use
With Flume, we can ingest stream data from multiple web servers and store it in any of the centralized stores such as HBase, Hadoop HDFS, etc.
8. Reliable message delivery
All the transactions in Apache Flume are channel-based. For each message, there are two transactions: one for the sender and one for the receiver. This ensures reliable message delivery.
9. Import of huge volumes of data
Along with log files, Apache Flume can also be used for importing huge volumes of data produced by e-commerce sites like Flipkart and Amazon, and networking sites like Twitter and Facebook.
10. Support for a variety of sources and sinks
Apache Flume supports a wide range of sources and sinks.
11. Streaming
Apache Flume gives us a reliable solution that helps us ingest online streaming data from different sources (such as email messages, network traffic, log files, social media, etc.) into HDFS.
12. Fault-tolerant and scalable
Flume is an extensible, reliable, highly available, and horizontally
scalable system. It is customizable for different types of sources and
sinks.
13. Inexpensive
It is an inexpensive system. It is less costly to install and operate. Its
maintenance is very economical.
14. Configuration
Apache Flume uses a simple, declarative configuration.
15. Documentation
Flume provides complete documentation with many good examples and patterns, which helps its users learn how Flume can be used and configured.
Limitations of Apache Flume:
• 1. Weak ordering guarantee
• Apache Flume offers weaker ordering guarantees than other systems such as message queues, in exchange for moving data more quickly and enabling cheaper fault tolerance. In Apache Flume's end-to-end reliability mode, events are delivered at least once, but with zero ordering guarantees.
• 2. Duplication
• Apache Flume does not guarantee that the messages delivered are 100% unique; in many scenarios, duplicate messages may appear.
• 3. Low scalability
• Sizing the hardware for a typical Apache Flume deployment is tricky for most businesses and is, in most cases, a matter of trial and error. Because of this, Flume's scalability is often under the lens.
• 4. Reliability issue
• The throughput that Apache Flume can handle depends highly on the backing store of the channel. If the backing store is not chosen wisely, there may be scalability and reliability issues.
• 5. Complex topology
• Flume topologies can be complex, and reconfiguration is challenging.
• Despite these drawbacks, Flume's advantages outweigh its disadvantages.
Flume Event

• A Flume event is the basic unit of the data that is transported inside Flume. A Flume event has a payload of a byte array.
• This payload is transferred from the source to the destination, accompanied by optional headers.
• The figure below depicts the structure of the Flume event; a client-SDK sketch of building and sending such an event follows.
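The sketch below shows, in a hedged way, how such an event (byte-array body plus optional headers) can be built and sent with the Flume client SDK (flume-ng-sdk). It assumes a Flume agent whose Avro source listens on localhost:41414, which is purely an assumption for illustration.

// Hypothetical sketch using the Flume client SDK: build an event with a
// byte-array body plus optional headers and send it to an agent whose Avro
// source is assumed to listen on localhost:41414.
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeEventSketch {
    public static void main(String[] args) throws EventDeliveryException {
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            Map<String, String> headers = new HashMap<>();
            headers.put("source", "demo");                    // optional headers
            Event event = EventBuilder.withBody(
                    "sample log line".getBytes(StandardCharsets.UTF_8), headers);
            client.append(event);                             // one event, one delivery attempt
        } finally {
            client.close();
        }
    }
}

The append() call is intended to return only after the receiving source has handed the event to its channel, which ties into the transactional delivery described in the reliability feature above.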
Apache Flume Architecture:

1. Data Generators
Data generators generate real-time streaming data.
The data generated by data generators is collected by individual Flume agents running on them.
Common data generators are Facebook, Twitter, etc.
2. Flume Agent
A Flume agent is a JVM process. It receives events from clients or other agents and transfers them to the destination or to other agents.
Each agent consists of three components, a source, a channel, and a sink, through which data flows in Flume (a sample agent configuration appears at the end of this section).
a. Source
• A Flume source is the component of Flume Agent which consumes data (events) from data generators like a
web server and delivers it to one or more channels.
• The data generator sends data (events) to Flume in a format recognized by the target Flume source.
• Flume supports different types of sources. Each source receives events (data) from a specific data generator.
• Example of Flume sources: Avro source, Exec source, Thrift source, NetCat source, HTTP source, Scribe
source, twitter 1% source, etc.
b. Channel
• When a Flume source receives an event from a data generator, it stores it on one or more channels. A Flume
channel is a passive store that receives events from the Flume source and stores them till Flume sinks
consume them.
• Channel acts as a bridge between Flume sources and Flume sinks.
• Flume channels are fully transactional and can work with any number of Flume sources and sinks.
• Example of Flume Channel− Custom Channel, File system channel, JDBC channel, Memory channel, etc.
• c. Sink
• The Flume sink retrieves the events from the Flume channel and pushes them to a centralized store like HDFS or HBase, or passes them to the next agent.

Example of Flume Sink − HDFS sink, Avro sink, HBase sink, Elasticsearch sink, etc.
3. Data collector
• The data collector collects the data from individual agents and aggregates
them.
• It pushes the collected data to a centralized store.
4. Centralized store
• Centralized stores are Hadoop HDFS, HBase, etc.
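To tie the source, channel, and sink components together, here is a hypothetical single-agent configuration sketch in the properties format Flume reads; the agent name a1, the log path, and the HDFS URL are placeholders, not values from the original material.

# Hypothetical flume.conf sketch: one agent ("a1") tails a web-server log
# (exec source), buffers events in a memory channel, and writes to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

Such a file is typically started with the flume-ng command, for example: flume-ng agent --name a1 --conf conf --conf-file flume.conf.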
Applications of Apache Flume (Use Cases)
