Kafka Clustering v1.0.0
CONTENTS
1. INTRODUCTION
2. LIFECYCLE
3. CLUSTERING
4. REFERENCES
INTRODUCTION
1. Publish and Subscribe:
Producers: These are the entities that generate and send messages (events) to Kafka
topics. Topics act as channels where events are categorized.
Consumers: They subscribe to topics and receive the messages. Kafka allows multiple
consumers to subscribe to a topic, providing a scalable and distributed system for
handling streams of data.
2. Storage:
Kafka is designed for durability and fault tolerance. Events are stored on disk, and Kafka
brokers (nodes in the Kafka cluster) are responsible for managing and storing these
events.
Data is distributed across multiple brokers, ensuring reliability and availability even if
some nodes fail. This distributed storage model contributes to the scalability and fault
tolerance of Kafka.
3. Stream Processing:
Developers can implement various operations on the event stream, such as filtering,
transformation, and aggregation. This enables the creation of reactive applications,
event-driven architectures, and real-time analytics.
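As a minimal sketch of such processing (assuming the Kafka Streams library is on the classpath; the topic names and application id here are made up for illustration), a small Java topology that filters and transforms an event stream might look like this:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamProcessingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-app");          // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("input-topic");   // hypothetical topic
        events.filter((key, value) -> value != null && !value.isEmpty()) // filtering
              .mapValues(value -> value.toUpperCase())                   // transformation
              .to("output-topic");                                       // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }
}

Aggregations (counts, windows, joins) are built the same way by chaining further operators on the stream.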
Key benefits of Apache Kafka include:
1. Low Latency:
Apache Kafka offers remarkably low end-to-end latency, making it well-suited for
real-time data streaming.
2. High Scalability:
Kafka's distributed design enables high scalability, allowing the system to handle
varying volumes and speeds of incoming messages.
The ability to scale Kafka up or down without causing downtime ensures flexibility
in adapting to changing application and processing demands.
3. High Fault Tolerance:
Data replication and distribution across servers (brokers) ensure that if one
server goes down, data remains available on others, ensuring continuous access to
information.
4. Multiple Integrations:
Kafka integrates with a wide range of external systems through its client APIs and
Kafka Connect. This flexibility allows organizations to easily incorporate Kafka into
their real-time data pipelines, connecting it with different applications and
leveraging its benefits in diverse ecosystems.
Comparison between Apache Kafka and a few other notable event streaming
technologies:
Kafka:
Kafka is widely adopted, has a mature ecosystem, and is known for its
scalability and fault tolerance.
RabbitMQ:
It is known for simplicity, ease of use, and support for various messaging
patterns.
Pulsar:
It is a newer alternative that separates serving from storage (using Apache
BookKeeper) and offers built-in multi-tenancy.
LIFECYCLE
1. Kafka Producer:
Producers are the client applications that publish (write) messages to Kafka topics.
2. Kafka Cluster:
Kafka brokers are individual Kafka server instances within a Kafka cluster.
Brokers handle the storage, retrieval, and replication of data. They are responsible
for receiving and serving messages from producers and consumers.
When a Kafka client (producer or consumer) wants to interact with the Kafka
cluster, it needs to connect to one of the brokers. The broker it initially connects to
is called the bootstrap broker.
The bootstrap broker provides the client with metadata about the Kafka cluster,
such as the list of brokers, partitions, and their leaders. Once a client has this
information, it can communicate directly with the appropriate broker for producing
or consuming messages.
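A minimal Java sketch of this bootstrapping (the broker address is an assumption): an AdminClient is pointed at a single bootstrap broker and asks it for the metadata of the whole cluster:

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class BootstrapSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // the bootstrap broker
        try (AdminClient admin = AdminClient.create(props)) {
            // The bootstrap broker returns metadata describing every broker in the cluster.
            admin.describeCluster().nodes().get().forEach(node ->
                    System.out.println("broker " + node.id() + " at " + node.host() + ":" + node.port()));
        }
    }
}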
3. Kafka Topic:
Topics are divided into partitions, which are the unit of parallelism and distribution.
The partitions that make up a topic are dispersed among the servers of the Kafka
cluster, and each server is in charge of the data and requests for its own partitions.
When a broker receives a message, it may also receive a key. The key can be used to
determine which partition a message should be sent to: messages with the same key
are always sent to the same partition. Because a topic is split across partitions,
several consumers can read from the same topic at the same time.
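As a short Java illustration (broker address and topic name are assumptions), sending two records with the same key guarantees they land in the same partition and therefore keep their relative order:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records use the key "user-42", so they hash to the same partition.
            producer.send(new ProducerRecord<>("myTopic", "user-42", "logged-in"));
            producer.send(new ProducerRecord<>("myTopic", "user-42", "viewed-page"));
        }
    }
}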
4. Kafka Broker:
A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka
brokers are stateless, so they use ZooKeeper for maintaining their cluster state.
One Kafka broker instance can handle hundreds of thousands of reads and writes
per second, and each broker can handle terabytes of messages without performance
impact. Kafka broker leader election is handled via ZooKeeper.
5. Kafka Connector:
Kafka Connect provides a common framework for defining connectors that
integrate Kafka with external systems and do the work of moving data in and
out of Kafka.
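As a sketch, using the FileStreamSource example connector that ships with Kafka (the file path and topic name are assumptions), a source connector is defined with a small properties file:

# connect-file-source.properties
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
# Read lines from this file and publish them to the topic below
file=/tmp/input.txt
topic=myTopic

It can then be run in standalone mode with: bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties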
6. Kafka Consumer:
Because Kafka brokers are stateless, the consumer has to track how many messages
it has consumed by using the partition offset. If the consumer acknowledges a
particular message offset, it implies that the consumer has consumed all prior
messages. The consumer issues an asynchronous pull request to the broker to have
a buffer of bytes ready to consume. Consumers can rewind or skip to any point in a
partition simply by supplying an offset value. Consumer offset values are tracked
with the help of ZooKeeper.
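A minimal Java sketch of this rewinding (broker address, topic, partition number, and offset are all assumptions): the consumer is assigned a partition explicitly and seeks to an arbitrary offset before polling:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekingConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "demo-group");              // hypothetical group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("myTopic", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 42); // rewind or skip to any offset in the partition
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}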
7. ZooKeeper:
ZooKeeper coordinates the cluster: it maintains cluster state and metadata for the
stateless brokers and assists with broker leader election.
Putting these components together, a message moves through the cluster as follows:
1. Producers:
Producers are applications or systems that generate and send messages to Kafka
topics.
2. Topics and Partitions:
Topics are divided into partitions, allowing parallel processing and scalability.
3. Partitioning and Replication:
Partitions are assigned to brokers. Each partition has a leader and zero or more
followers (replicas).
Replication ensures data durability and availability. Replicas are distributed across
brokers.
4. Leader Election:
The leader for each partition is dynamically elected from the in-sync replicas (ISRs)
by the Kafka controller.
The leader handles all reads and writes for its partition.
5. Message Ingestion:
Producers send messages to Kafka brokers. The producer may choose a specific
partition or let Kafka handle partitioning.
6. Storage in Logs:
Each partition is stored as an append-only log on disk, and messages are retained
for a configurable period regardless of whether they have been consumed.
7. Consumer Groups:
Consumers that share a group id divide a topic's partitions among themselves, so
each partition is read by exactly one consumer in the group.
8. Consumer Polling:
Consumers poll the brokers to fetch batches of messages from their assigned
partitions.
The offset, which represents the position in the log, is managed by Kafka and helps
track the progress of consumption.
Consumers commit their offsets to Kafka, indicating the point up to which they
have consumed messages.
Replication and leader election ensure fault tolerance and high availability.
Kafka uses ZooKeeper for cluster coordination, managing metadata, and leader
election.
Kafka adapts to dynamic changes, such as broker additions or removals, reassigning
partition leaders as needed.
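This layout can be inspected directly. Assuming a topic named myTopic and a broker at localhost:9092, the describe command lists each partition's leader, replicas, and in-sync replicas (Isr):

bin/kafka-topics.sh --describe --topic myTopic --bootstrap-server localhost:9092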
CLUSTERING
A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka
can be deployed in two basic forms:
1. Single-Node Cluster:
This is the simplest form of Kafka setup, where all components (ZooKeeper and Kafka
broker) run on a single machine. While suitable for development and testing, it lacks the
benefits of fault tolerance and high availability.
2. Multi-Node Cluster:
Multiple brokers run on separate machines, with partitions and their replicas
distributed across the nodes; this is the usual production deployment.
The two deployment options compare as follows:
1. Architecture:
Single-Node Cluster:
Suitable for development, testing, and scenarios with relatively low data volume.
Multi-Node Cluster:
Enables horizontal scaling, distributing data across multiple nodes for better
performance and fault tolerance.
2. Scalability:
Single-Node Cluster:
Limited to the capacity of a single machine, so it can only be scaled vertically.
Multi-Node Cluster:
Horizontal scalability allows for the addition of more machines (nodes) to the
cluster.
3. Fault Tolerance:
Single-Node Cluster:
Vulnerable to a single point of failure. If the machine or Kafka broker fails, the
entire system becomes unavailable.
Multi-Node Cluster:
Enhanced fault tolerance through data replication. Each partition has multiple
replicas distributed across different nodes.
Even if one or more nodes fail, the system remains operational, as replicas on other
nodes can take over.
4. High Availability:
Single-Node Cluster:
Offers no high availability: if the single broker is down, the whole system is
unavailable.
Multi-Node Cluster:
Replication and automatic leader election keep partitions available even while
individual nodes are down.
5. Parallel Processing:
Single-Node Cluster:
Parallelism is limited to what a single broker can serve.
Multi-Node Cluster:
Partitions spread across brokers let producers and consumers work against many
nodes in parallel.
6. Load Balancing:
Single-Node Cluster:
All producer and consumer traffic is handled by the single broker, so the load
cannot be distributed.
Multi-Node Cluster:
Distributes the load of producing and consuming messages across multiple nodes.
The following steps walk through setting up a single-node Kafka cluster on Windows.
Step 1: Prerequisites
Before you start, make sure you have the following installed on your Windows machine:
Java: Kafka requires Java to run. Download and install the latest version of Java from the
official website: https://www.oracle.com/java/technologies/javase-downloads.html
1. Extract the downloaded Kafka archive to a directory of your choice. For example, you can
extract it to C:\kafka .
2. Open the config directory and edit the server.properties file using a text editor.
Change the log.dirs property to a directory where Kafka will store its data. For
example, log.dirs=C:/kafka/data .
Step 5: Start ZooKeeper
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
Step 6: Start Kafka
.\bin\windows\kafka-server-start.bat .\config\server.properties
If you want to create topics with a desired number of partitions, then:
Open the server.properties file in a text editor. You can use Notepad or any other text editor of
your choice.
Add or modify the following lines in the server.properties file to configure the topic
defaults; for example, to default new topics to three partitions:
num.partitions=3
These settings will apply to newly created topics. If you want these settings to apply to existing
topics as well, you need to update the topic configuration separately.
Save the changes to the server.properties file and close the text editor.
If Kafka is already running, restart it to apply the new configuration: stop the Kafka server,
then start it again with the same commands used in Steps 5 and 6.
Now that the configuration is in place, you can create the three topics with the specified number
of partitions. For example, to create one of them (repeat with a different --topic name for each):
.\bin\windows\kafka-topics.bat --create --topic myTopic --bootstrap-server localhost:9092 --partitions 3
Now, you have Apache Kafka installed and running on your Windows machine. You can start
producing and consuming messages on the myTopic topic.
CONFIGURATION PROPERTIES
# zookeeper.properties
dataDir=C:\\zookeeper-3.6.3\\data
clientPort=2181
Explanation:
dataDir : Specifies the directory where ZooKeeper will store its data.
clientPort : Defines the port on which ZooKeeper clients (including Kafka) will connect.
Create a server.properties file for each Kafka broker with the following content:
Broker 1 ( server-1.properties ):
# server-1.properties
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=C:\\kafka_2.13-2.8.1\\kafka-logs-1
zookeeper.connect=localhost:2181
Broker 2 ( server-2.properties ):
# server-2.properties
broker.id=1
listeners=PLAINTEXT://localhost:9093
log.dirs=C:\\kafka_2.13-2.8.1\\kafka-logs-2
zookeeper.connect=localhost:2181
Explanation:
broker.id : A unique integer identifier for each broker in the cluster.
listeners : The address and port where the broker listens for connections.
log.dirs : The directory where the broker stores its log data.
zookeeper.connect : The connection string for ZooKeeper, specifying the address and port.
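With these files in place, each broker is started with its own properties file, each in its own command prompt (paths follow the Windows layout used above):

.\bin\windows\kafka-server-start.bat .\config\server-1.properties
.\bin\windows\kafka-server-start.bat .\config\server-2.properties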
Replication Configuration:
num.replica.fetchers=2
default.replication.factor=2
Explanation:
num.replica.fetchers : The number of fetcher threads a broker uses to replicate messages from partition leaders.
default.replication.factor : The replication factor for automatically created topics.
ZooKeeper Connection:
zookeeper.connect=localhost:2181
Producer Configuration:
1. Bootstrap Servers:
bootstrap.servers : A list of host and port pairs to use for establishing the initial
connection to the Kafka cluster.
bootstrap.servers=<broker1>:<port1>,<broker2>:<port2>
2. Acknowledgment Settings:
acks : The number of acknowledgments the producer requires before considering a
message as sent; acks=all waits until all in-sync replicas have received it.
acks=all
3. Retry Settings:
retries : The number of times the producer should retry sending a message in case of
transient failures.
delivery.timeout.ms : An upper bound on the total time, in milliseconds, to report
success or failure after a send, including retries.
retries=3
delivery.timeout.ms=30000
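To show how these settings fit together, a minimal Java sketch (broker address and topic name are assumptions) wires the acknowledgment and retry properties into a producer and checks the result of a send:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, 3);                 // retry transient failures
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 30000); // overall send deadline
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("myTopic", "hello"),
                    (metadata, exception) -> { // invoked once the send succeeds or finally fails
                        if (exception != null) exception.printStackTrace();
                        else System.out.println("stored at offset " + metadata.offset());
                    });
        } // close() flushes any pending sends
    }
}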
Consumer Configuration:
1. Bootstrap Servers:
bootstrap.servers : A list of host and port pairs used for establishing the initial
connection to the Kafka cluster.
bootstrap.servers=<broker1>:<port1>,<broker2>:<port2>
2. Offset Commit Settings:
enable.auto.commit : Whether the consumer's offsets are committed automatically in
the background.
auto.commit.interval.ms : How often, in milliseconds, offsets are auto-committed when
auto-commit is enabled.
enable.auto.commit=true
auto.commit.interval.ms=1000
3. Fetch Settings:
fetch.min.bytes : The minimum amount of data the server should return for a fetch
request.
fetch.max.wait.ms : The maximum amount of time the server should wait for more data to
arrive before sending a response.
fetch.min.bytes=1
fetch.max.wait.ms=500
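A matching Java consumer sketch (broker address, topic, and group id are assumptions) that applies the auto-commit and fetch settings above:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PollingConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // hypothetical group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");          // commit offsets in the background
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");     // every second
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("myTopic"));
            while (true) { // poll forever; offsets are auto-committed every second
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.println(record.value());
                }
            }
        }
    }
}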
Note:
1. You may need to adjust some settings based on your specific requirements and environment.
2. .sh scripts are used on Linux, and .bat scripts (run from cmd) are used on Windows.
Single-node broker configuration (server.properties):
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
# Replication configuration
default.replication.factor=1
# The address the socket server listens on
listeners=PLAINTEXT://localhost:9092
listeners : The address the socket server listens on. Adjust the hostname and port
accordingly.
Kafka relies on Zookeeper, so you need to have a Zookeeper ensemble running. Here is a
basic configuration:
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5
# The directory where ZooKeeper stores its snapshots and transaction logs
dataDir=/tmp/zookeeper
# The port on which clients connect
clientPort=2181
initLimit : The number of ticks that the ZooKeeper servers in quorum can take to
connect to a leader.
syncLimit : The number of ticks that can pass between sending a request and getting
an acknowledgment.
dataDir : The directory where ZooKeeper will store its snapshots and transaction logs.
After configuring, start Zookeeper and then Kafka. You can use the following commands:
# Start Zookeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
# Start Kafka
bin/kafka-server-start.sh config/server.properties
4. Create a Topic:
Open a new command prompt, navigate to the Kafka directory, and run the following command to
create a topic (the topic name is an example):
bin/kafka-topics.sh --create --topic myTopic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
5. Start a Producer:
Open a new command prompt, navigate to the Kafka directory, and run the following command to
start a console producer:
bin/kafka-console-producer.sh --topic myTopic --bootstrap-server localhost:9092
6. Start a Consumer:
Open a new command prompt, navigate to the Kafka directory, and run the following command to
start a console consumer:
bin/kafka-console-consumer.sh --topic myTopic --bootstrap-server localhost:9092 --from-beginning
This is a basic configuration for a single-node Kafka setup. Depending on your use case, you may
need to tweak additional settings or enable specific features. Always refer to the official Kafka
documentation for the most up-to-date and detailed information.
Setting up a multi-node Kafka cluster involves configuring both Kafka and ZooKeeper on each
node; the rest of the configuration is the same as for a single node. Below are the steps to set
up a Kafka multi-node cluster:
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5
# Ensemble members: one server.x line per ZooKeeper node (hostnames and ports are placeholders)
server.1=<host1>:2888:3888
server.2=<host2>:2888:3888
server.3=<host3>:2888:3888
Update the server.x configurations for each node with the appropriate hostnames and
ports.
# Broker ID for each Kafka broker (a unique integer per broker);
# set the matching line in each node's server.properties:
broker.id=1 # For tdslave01
broker.id=2 # For tdslave02
broker.id=3 # For tdslave03
broker.id=4 # For tdmaster01
broker.id=5 # For tdmaster02
On each node, start ZooKeeper first and then the Kafka broker:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
REFERENCES
https://kafka.apache.org/documentation/#configuration
https://sharebigdata.wordpress.com/2015/09/09/kafka-installation-single-node-setup/
https://www.bogotobogo.com/Hadoop/BigData_hadoop_Zookeeper_Kafka_single_node_Multiple_broker_cluster.php