
Fundamentals and Architecture of Apache Kafka®

Angelo Cesaro
Who am I?

• I’m Angelo!
• Consultant and Data Engineer at Cesaro.io
• More than 10 years of experience
• Worked at ServiceNow, Sky

• Follow me on
https://www.linkedin.com/in/angelocesaro
https://twitter.com/angelocesaro
https://github.com/cesaroangelo
Apache Kafka – Overview

• A distributed streaming platform used for building real-time data pipelines and mission-critical streaming applications, with the following characteristics:

1. Horizontally scalable
2. Fault tolerant
3. Really fast
4. Used by thousands of companies in production
Kafka's benefits over traditional message queues

There are a few key differences between Kafka and other traditional message queues:
• Durability and availability
1. The cluster can handle broker failures
2. Messages are replicated for reliability
• Very high throughput
• Data retention
• Excellent scalability
1. A small Kafka cluster can process a large number of messages
• Support for real-time and batch consumption
1. Kafka was born for real-time processing of data, but can also handle batch-oriented jobs, for example feeding data to Hadoop or a data warehouse
High-level view of a Kafka cluster

• Producers send data to the Kafka cluster
• Consumers read data from the Kafka cluster
• Brokers are the main storage and messaging components of the Kafka cluster

Note: the components above can be physical machines, VMs or Docker containers; Kafka works the same on any of those platforms.
Messages

• The basic unit of data in Kafka is the message; messages are the atomic unit of data sent by producers
• A message is a key-value pair
• All data is stored in Kafka as byte arrays (very important!)
• The producer provides serializers to convert the key and value to byte arrays (a sketch follows below)
• Key and value can be any data type
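As a minimal sketch of serializers in practice (using the third-party kafka-python client, which is not part of these slides; the broker address and topic name are assumptions):

from kafka import KafkaProducer
import json

# Serializers convert the key and value to byte arrays before sending
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed broker address
    key_serializer=lambda k: k.encode("utf-8"),  # string key -> bytes
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # dict value -> JSON bytes
)

# Key and value can be any data type, as long as a serializer maps them to bytes
producer.send("user-events", key="user-42", value={"action": "login"})
producer.flush()  # block until delivery
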
Topic

• Kafka keeps streams of messages called topics, which categorize messages into groups
• Developers can decide which topics should exist (see the sketch after this note); by default, Kafka auto-creates topics when they are first used
• Kafka has no limit on the number of topics that can be used
• Topics are logical representations that span across brokers

Note: by analogy, we can think of topics as tables in a DBMS; just as we separate data in a database into different tables, we do the same with topics
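For illustration, a topic can also be created explicitly instead of relying on auto-creation. A minimal sketch with kafka-python's admin client, assuming a local broker and a made-up topic name:

from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the cluster (the address is an assumption)
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create the topic explicitly rather than relying on auto-creation
admin.create_topics([NewTopic(name="page-views", num_partitions=3, replication_factor=1)])
admin.close()
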
Data partitioning

• Producers shard data over a group of partitions; this allows parallel access to the topic for increased throughput
• Each partition contains a subset of the messages, which are ordered and immutable
• Usually the message key is used to control which partition a message is assigned to (a sketch follows below)
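A small sketch of keyed partitioning with kafka-python (topic and key names are made up): messages with the same key hash to the same partition, which preserves per-key ordering.

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    key_serializer=str.encode,
    value_serializer=str.encode,
)

# All messages with key "order-1001" hash to the same partition,
# so they stay ordered relative to each other
for step in ("created", "paid", "shipped"):
    producer.send("orders", key="order-1001", value=step)
producer.flush()
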
Kafka components

• There are four key components in a Kafka system:
• Brokers
• Producers
• Consumers
• Zookeeper
Kafka broker

• Brokers receive and store data sent by the producers
• Brokers are server-class systems that provide messages to the consumers when requested
• Messages are spread across multiple partitions on different brokers
• Kafka provides a configurable retention policy for messages, and each message is identified by its offset number
• The commit log is an append-only data structure that lives in RAM for fast access and is flushed to disk periodically
• Producers send requests to the brokers, which append messages to the end of the log
• Consumers consume from a specific offset (usually the lowest available) and read all messages sequentially (a sketch follows below)
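As a sketch of sequential consumption from the lowest available offset (kafka-python again; topic, group and broker address are assumptions):

from kafka import KafkaConsumer

# auto_offset_reset="earliest" starts from the lowest available offset
# when the group has no committed position yet
consumer = KafkaConsumer(
    "orders",                            # hypothetical topic
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="order-readers",
    auto_offset_reset="earliest",
)

# Messages arrive sequentially; each one carries its partition and offset
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
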
Kafka producers

• Each producer writes data as messages to the Kafka cluster
• Producers can be written in any language
• Kafka provides a command line tool to send messages to the cluster
• Confluent develops a REST (Representational State Transfer) server which can be used by clients written in any language (see the sketch after this list)
• Confluent Enterprise includes an MQTT (Message Queuing Telemetry Transport) proxy that allows direct ingestion of IoT data
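As an illustration of language-agnostic production through the REST server, here is a hedged sketch that POSTs a record with Python's requests library; the proxy URL, topic name and payload are assumptions based on the REST Proxy v2 JSON format:

import requests

# Produce one JSON record through the REST Proxy (URL and topic are assumptions)
resp = requests.post(
    "http://localhost:8082/topics/iot-readings",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"key": "sensor-7", "value": {"temp_c": 21.5}}]},
)
resp.raise_for_status()
print(resp.json())  # contains the offsets of the produced records
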
Kafka consumers

• Each consumer pulls events from topics as they are written
• The latest messages read are tracked in a special 'consumer offsets' topic
• If necessary, consumers can be reset to start reading from a specific offset (the default behavior is set via a configuration parameter; see the sketch after the note below)

Note: other similar solutions tend to push events instead
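A minimal sketch (kafka-python, names assumed) of resetting a consumer to a specific offset by assigning a partition and seeking, instead of relying on the committed position:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")  # assumed broker address

# Manually assign partition 0 of a hypothetical topic and rewind to offset 42
tp = TopicPartition("orders", 0)
consumer.assign([tp])
consumer.seek(tp, 42)

# Reads now resume from offset 42 onwards
records = consumer.poll(timeout_ms=1000)
for partition, messages in records.items():
    for msg in messages:
        print(msg.offset, msg.value)
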


Distributed consumption

• Kafka scales consumption by combining multiple consumers into consumer groups
• Each consumer in a group is assigned a subset of the partitions for consumption (a sketch follows below)

It's important to know that traditional systems tend to be point-to-point: a message is gone once it has been consumed and can't be read again. Kafka was designed to work differently, allowing the data to be used multiple times.
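As a sketch of a consumer group in action (kafka-python, names assumed): run this same script in two terminals and the topic's partitions are split between the two processes.

from kafka import KafkaConsumer

# Both processes join the same group, so each gets a subset of the partitions
consumer = KafkaConsumer(
    "orders",                            # hypothetical topic
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="order-processing",         # same group id => work is shared
)

for msg in consumer:
    print("partition=%d offset=%d value=%s" % (msg.partition, msg.offset, msg.value))
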
Zookeeper

• Zookeeper is a centralized service that enables highly reliable distributed coordination
• It maintains configuration information (in this context, Kafka cluster configurations)
• It provides distributed synchronization
• It runs as a cluster and provides resiliency against failures
Kafka & Zookeeper

Kafka uses Zookeeper for various important features:
• Cluster management
• Storage of ACLs and passwords
• Failure detection and recovery

Note:
1. Kafka can't run without Zookeeper
2. In previous Kafka releases (<0.11), clients had to access Zookeeper; from 0.11 on, only the brokers need that access, so the cluster is isolated from the clients for better security and performance
Advantages of a pull architecture

• Ability to add more consumers to the system without reconfiguring the cluster
• Ability for a consumer to go offline and come back later, resuming from where it left off
• Consumers won't get overwhelmed by data: each consumer decides at what speed to fetch data, and slow consumers won't affect fast producers
Speeding up data transfer

Kafka is fast, but why?

• Kafka uses the system page cache (a Linux kernel feature) for producing and consuming messages
• The use of the page cache enables zero-copy transfer, which sends data directly from a local file channel to a remote socket, saving CPU cycles and memory bandwidth
Kafka metrics

• Kafka metrics can be exposed via JMX and viewed through JMX clients
• The types of metrics exposed are:
1. Gauge: instantaneous measurement of one value
2. Meter: measurement of ticks in a time range, e.g. one-minute rate, ten-minute rate, etc.
3. Histogram: measurement of the distribution of a value, e.g. 50th percentile, 98th percentile, etc.
4. Timer: measurement of timings (meter + histogram)

Kafka uses Yammer metrics on the broker and in the older (<0.9) clients. Newer clients use a new internal metrics package. Confluent plans to consolidate the JMX metrics packages in the future.
Why Replication?

• Each partition is stored on a broker
• If there were no replication and a broker went offline, the partitions stored on that broker would be unavailable and permanent data loss could occur
• Without redundancy, partitions are not available for reads and writes when the server goes offline, and if the server has a fatal crash the data is gone permanently
Kafka uses replication for durability and availability
Replica

• Each partition can have replicas
• Each replica is placed on a different broker
• Replicas are spread evenly across brokers for load balancing

We specify the replication factor at topic creation time


Rack awareness of replicas

• Rack awareness enables each replica to be placed on brokers in different racks, which helps improve performance and fault tolerance
• Each broker can be configured with a broker.rack property, e.g. rack-1, us-east-1a
• It's useful if we need to deploy Kafka on AWS across availability zones
• Rack awareness was introduced in Confluent 3.0
Replica configurations

• Increase the replication factor for better durability
• For auto-created topics, Kafka uses replication factor 1 by default; this needs to be configured accordingly in server.properties

kafka-topics --create --zookeeper zookeeper:2181 --partitions 1 --replication-factor 3 --topic mytopic
How brokers are involved in replication

• Brokers ensure strongly consistent replicas
• One replica is on the leader broker
• All messages produced go to the leader
• The leader propagates those messages to the follower brokers
• All consumers read messages from the leader

Note: it is very important to understand the above when troubleshooting ;)
Leaders and followers

• Leader:
1. Accepts all reads and writes
2. Manages replicas
3. Leader election rate (meter metric):
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
• Follower:
1. Provides fault tolerance
2. Keeps up with the leader
• There is a special thread running in the cluster that manages the current list of leaders and followers for every partition. It's a complex and mission-critical task; for this reason this information is replicated in Zookeeper and then cached on every broker for faster access
Partition leaders

• Leaders have to be evenly distributed across all brokers for two main reasons:
1. Leaders can change in case of failure
2. Leaders do more work, as discussed in the previous slides
Preferred replica

• When we create a topic, the preferred replica is set automatically
• It's the first replica in the list of assigned replicas

kafka-topics --zookeeper zookeeper:2181 --describe --topic my-topic
Topic:my-topic PartitionCount:1 ReplicationFactor:3 Configs:
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
In Sync Replica (ISR)

• The in-sync replicas (ISR) are a list of the replicas: the leader plus the followers
• A message is committed when it has been received by every replica in the list

Note for troubleshooting: where is the ISR list kept? On the leader
What does committed mean?

• In this context, committed means that the message has been received and written to disk by all replicas
• The data is not available for consumption until it has been committed
• Who decides when to commit a message? The leader has this responsibility (a sketch follows below)
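Related to this, a producer can ask to wait for that acknowledgement. A sketch with kafka-python (broker address and topic assumed), where acks="all" makes the leader confirm only after the whole ISR has the message:

from kafka import KafkaProducer

# acks="all": the send is acknowledged only once every in-sync replica
# has the message, i.e. once it is committed
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    acks="all",
)

future = producer.send("my-topic", b"important event")
metadata = future.get(timeout=10)  # raises if not acknowledged in time
print(metadata.topic, metadata.partition, metadata.offset)
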
Using Kafka command line tools

#create topic with replication factor 1 and partition 1
• kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
#delete topic with name test
• kafka-topics.sh --delete --zookeeper localhost:2181 --topic test
#list info regarding topic
• kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
#list topics
• kafka-topics.sh --list --zookeeper localhost:2181
Links!

• https://kafka.apache.org

• https://www.confluent.io

• https://www.cesaro.io
