
Fundamentals and Architecture of Apache Kafka®

Angelo Cesaro
Who am I?

• I’m Angelo!
• Consultant and Data Engineer at Cesaro.io
• More than 10 years of experience
• Worked at ServiceNow, Sky

• Follow me on
https://www.linkedin.com/in/angelocesaro
https://twitter.com/angelocesaro
https://github.com/cesaroangelo
Apache Kafka – Overview

• A distributed streaming platform used for building real-time data pipelines and mission-critical streaming applications, with the following characteristics:

1. Horizontally scalable
2. Fault tolerant
3. Really fast
4. Used by thousands of companies in production
Kafka's benefits over traditional message queues

There are a few key differences between Kafka and other traditional message queues:
• Durability and availability
1. The cluster can handle broker failures
2. Messages are replicated for reliability
• Very high throughput
• Data retention
• Excellent scalability
1. A small Kafka cluster can process a large number of messages
• Support for real-time and batch consumption
1. Kafka was born for real-time processing of data, but can also handle batch-oriented jobs, for example feeding data to Hadoop or a data warehouse
High-level view of a Kafka cluster

• Producers send data to the Kafka cluster
• Consumers read data from the Kafka cluster
• Brokers are the main storage and messaging components of the Kafka cluster

Note: the components above can be physical machines, VMs or Docker containers; Kafka works the same on any of those platforms.
Messages

• The basic unit of data in Kafka is the message; messages are the atomic unit of data sent by producers
• A message is a key-value pair
• All data is stored in Kafka as byte arrays (very important!)
• The producer provides serializers to convert the key and value to byte arrays (a sketch follows below)
• Key and value can be any data type
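As a minimal sketch of serializers in practice (using the third-party kafka-python client, which is not part of these slides; the broker address and topic name are assumptions):

from kafka import KafkaProducer
import json

# Serializers convert the key and value to byte arrays before sending
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed broker address
    key_serializer=lambda k: k.encode("utf-8"),  # string key -> bytes
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # dict value -> JSON bytes
)

# Key and value can be any data type, as long as a serializer maps them to bytes
producer.send("user-events", key="user-42", value={"action": "login"})
producer.flush()  # block until delivery
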
Topic

• Kafka keeps streams of messages called topics, which categorize messages into groups
• Developers can decide which topics should exist (see the sketch after this note); by default, Kafka auto-creates topics when they are first used
• Kafka has no limit on the number of topics that can be used
• Topics are logical representations that span across brokers

Note: by analogy, we can think of topics as tables in a DBMS; just as we separate data in a database into different tables, we do the same with topics
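For illustration, a topic can also be created explicitly instead of relying on auto-creation. A minimal sketch with kafka-python's admin client, assuming a local broker and a made-up topic name:

from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the cluster (the address is an assumption)
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create the topic explicitly rather than relying on auto-creation
admin.create_topics([NewTopic(name="page-views", num_partitions=3, replication_factor=1)])
admin.close()
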
Data partitioning

• Producers shard data over a group of partitions; this allows parallel access to the topic for increased throughput
• Each partition contains a subset of the messages, which are ordered and immutable
• Usually the message key is used to control which partition a message is assigned to (a sketch follows below)
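A small sketch of keyed partitioning with kafka-python (topic and key names are made up): messages with the same key hash to the same partition, which preserves per-key ordering.

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    key_serializer=str.encode,
    value_serializer=str.encode,
)

# All messages with key "order-1001" hash to the same partition,
# so they stay ordered relative to each other
for step in ("created", "paid", "shipped"):
    producer.send("orders", key="order-1001", value=step)
producer.flush()
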
Kafka components

• There are four key components in a Kafka system:
• Brokers
• Producers
• Consumers
• Zookeeper
Kafka broker

• Brokers receive and store data sent by the producers
• Brokers are server-class systems that provide messages to the consumers when requested
• Messages are spread across multiple partitions on different brokers
• Kafka provides a configurable retention policy for messages, and each message is identified by its offset number
• The commit log is an append-only data structure that lives in RAM for fast access and is flushed to disk periodically
• Producers send requests to the brokers, which append messages to the end of the log
• Consumers consume from a specific offset (usually the lowest available) and read all messages sequentially (a sketch follows below)
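As a sketch of sequential consumption from the lowest available offset (kafka-python again; topic, group and broker address are assumptions):

from kafka import KafkaConsumer

# auto_offset_reset="earliest" starts from the lowest available offset
# when the group has no committed position yet
consumer = KafkaConsumer(
    "orders",                            # hypothetical topic
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="order-readers",
    auto_offset_reset="earliest",
)

# Messages arrive sequentially; each one carries its partition and offset
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
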
Kafka producers

• Each producer writes data as messages to the Kafka cluster
• Producers can be written in any language
• Kafka provides a command line tool to send messages to the cluster
• Confluent develops a REST (Representational State Transfer) server which can be used by clients written in any language (see the sketch after this list)
• Confluent Enterprise includes an MQTT (Message Queuing Telemetry Transport) proxy that allows direct ingestion of IoT data
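As an illustration of language-agnostic production through the REST server, here is a hedged sketch that POSTs a record with Python's requests library; the proxy URL, topic name and payload are assumptions based on the REST Proxy v2 JSON format:

import requests

# Produce one JSON record through the REST Proxy (URL and topic are assumptions)
resp = requests.post(
    "http://localhost:8082/topics/iot-readings",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"key": "sensor-7", "value": {"temp_c": 21.5}}]},
)
resp.raise_for_status()
print(resp.json())  # contains the offsets of the produced records
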
Kafka consumers

• Each consumer pulls events from topics as they are written
• The latest messages read are tracked in a special 'consumer offsets' topic
• If necessary, consumers can be reset to start reading from a specific offset (the default behavior is set via a configuration parameter; see the sketch after the note below)

Note: other similar solutions tend to push events instead
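A minimal sketch (kafka-python, names assumed) of resetting a consumer to a specific offset by assigning a partition and seeking, instead of relying on the committed position:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")  # assumed broker address

# Manually assign partition 0 of a hypothetical topic and rewind to offset 42
tp = TopicPartition("orders", 0)
consumer.assign([tp])
consumer.seek(tp, 42)

# Reads now resume from offset 42 onwards
records = consumer.poll(timeout_ms=1000)
for partition, messages in records.items():
    for msg in messages:
        print(msg.offset, msg.value)
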


Distributed consumption

• Kafka scales consumption by combining multiple consumers into consumer groups
• Each consumer in a group is assigned a subset of the partitions for consumption (a sketch follows below)

It's important to know that traditional systems tend to be point-to-point: a message is gone once it has been consumed and can't be read again. Kafka was designed to work differently, allowing the data to be used multiple times.
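As a sketch of a consumer group in action (kafka-python, names assumed): run this same script in two terminals and the topic's partitions are split between the two processes.

from kafka import KafkaConsumer

# Both processes join the same group, so each gets a subset of the partitions
consumer = KafkaConsumer(
    "orders",                            # hypothetical topic
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="order-processing",         # same group id => work is shared
)

for msg in consumer:
    print("partition=%d offset=%d value=%s" % (msg.partition, msg.offset, msg.value))
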
Zookeeper

• Zookeeper is a centralized service that enables highly reliable distributed coordination
• It maintains configuration information (in this context, Kafka cluster configurations)
• It provides distributed synchronization
• It runs as a cluster and provides resiliency against failures
Kafka & Zookeeper

Kafka uses Zookeeper for various important features:
• Cluster management
• Storage of ACLs and passwords
• Failure detection and recovery

Note:
1. Kafka can't run without Zookeeper
2. In previous Kafka releases (<0.11), clients had to access Zookeeper; from 0.11 on, only the brokers need that access, so the cluster is isolated from the clients for better security and performance
Advantages of a pull architecture

• Ability to add more consumers to the system without reconfiguring the cluster
• Ability for a consumer to go offline and come back later, resuming from where it left off
• Consumers won't get overwhelmed by data: each consumer decides at what speed to fetch data, and slow consumers won't affect fast producers
Speeding up data transfer

Kafka is fast, but why?

• Kafka uses the system page cache (a Linux kernel feature) for producing and consuming messages
• The use of the page cache enables zero-copy transfer, which sends data directly from a local file channel to a remote socket, saving CPU cycles and memory bandwidth
Kafka metrics

• Kafka metrics can be exposed via JMX and viewed through JMX clients
• The types of metrics exposed are:
1. Gauge: instantaneous measurement of one value
2. Meter: measurement of ticks in a time range, e.g. one-minute rate, ten-minute rate, etc.
3. Histogram: measurement of the distribution of a value, e.g. 50th percentile, 98th percentile, etc.
4. Timer: measurement of timings (meter + histogram)

Kafka uses Yammer metrics on the broker and in the older (<0.9) clients. Newer clients use a new internal metrics package. Confluent plans to consolidate the JMX metrics packages in the future.
Why Replication?

• Each partition is stored on a broker
• If there were no replication and a broker went offline, the partitions stored on that broker would be unavailable and permanent data loss could occur
• Without redundancy, partitions are not available for reads and writes when the server goes offline, and if the server has a fatal crash the data is gone permanently
Kafka uses replication for durability and availability
Replica

• Each partition can have replicas
• Each replica is placed on a different broker
• Replicas are spread evenly across brokers for load balancing

We specify the replication factor at topic creation time


Rack awareness of replicas

• Rack awareness enables each replica to be placed on brokers in different racks, which helps improve performance and fault tolerance
• Each broker can be configured with a broker.rack property, e.g. rack-1, us-east-1a
• It's useful if we need to deploy Kafka on AWS across availability zones
• Rack awareness was introduced in Confluent 3.0
Replica configurations

• Increase the replication factor for better durability
• For auto-created topics, Kafka uses replication factor 1 by default; this needs to be configured accordingly in server.properties

kafka-topics --create --zookeeper zookeeper:2181 --partitions 1 --replication-factor 3 --topic mytopic
How brokers are involved in replication

• Brokers ensure strongly consistent replicas
• One replica is on the leader broker
• All messages produced go to the leader
• The leader propagates those messages to the follower brokers
• All consumers read messages from the leader

Note: it is very important to understand the above when troubleshooting ;)
Leaders and followers

• Leader:
1. Accepts all reads and writes
2. Manages replicas
3. Leader election rate (meter metric):
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
• Follower:
1. Provides fault tolerance
2. Keeps up with the leader
• There is a special thread running in the cluster that manages the current list of leaders and followers for every partition. It's a complex and mission-critical task; for this reason this information is replicated in Zookeeper and then cached on every broker for faster access
Partition leaders

• Leaders have to be evenly distributed across all brokers for two main reasons:
1. Leaders can change in case of failure
2. Leaders do more work, as discussed in the previous slides
Preferred replica

• When we create a topic, the preferred replica is set automatically
• It's the first replica in the list of assigned replicas

kafka-topics --zookeeper zookeeper:2181 --describe --topic my-topic
Topic:my-topic PartitionCount:1 ReplicationFactor:3 Configs:
Topic: my-topic Partition: 0 Leader: 1 Replicas: 1,2,0 Isr: 1,2,0
In Sync Replica (ISR)

• The in-sync replicas (ISR) are a list of the replicas: the leader plus the followers
• A message is committed when it has been received by every replica in the list

Note for troubleshooting: where is the ISR list kept? On the leader
What does committed mean?

• In this context, committed means that the message has been received and written to disk by all replicas
• The data is not available for consumption until it has been committed
• Who decides when to commit a message? The leader has this responsibility (a sketch follows below)
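Related to this, a producer can ask to wait for that acknowledgement. A sketch with kafka-python (broker address and topic assumed), where acks="all" makes the leader confirm only after the whole ISR has the message:

from kafka import KafkaProducer

# acks="all": the send is acknowledged only once every in-sync replica
# has the message, i.e. once it is committed
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    acks="all",
)

future = producer.send("my-topic", b"important event")
metadata = future.get(timeout=10)  # raises if not acknowledged in time
print(metadata.topic, metadata.partition, metadata.offset)
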
Using Kafka command line tools

#create topic with replication factor 1 and partition 1
• kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
#delete topic with name test
• kafka-topics.sh --delete --zookeeper localhost:2181 --topic test
#list info regarding topic
• kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
#list topics
• kafka-topics.sh --list --zookeeper localhost:2181
Links!

• https://kafka.apache.org

• https://www.confluent.io

• https://www.cesaro.io
