Fundamentals and Architecture of Apache Kafka
Apache Kafka®
Angelo Cesaro
Who am I?
• I’m Angelo!
• Consultant and Data Engineer at Cesaro.io
• More than 10 years of experience
• Worked at ServiceNow, Sky
• Follow me on
https://www.linkedin.com/in/angelocesaro
https://twitter.com/angelocesaro
https://github.com/cesaroangelo
Apache Kafka – Overview
1. Horizontally scalable
2. Fault tolerant
3. Really fast
4. Used by thousands of companies in production
Kafka’s benefits over traditional
message queues
There are a few key differences between Kafka and
traditional message queues:
• Durability and availability
1. Cluster can handle broker failures
2. Messages are replicated for reliability
• Very high throughput
• Data retention
• Excellent scalability
1. A small Kafka cluster can process a large number of messages
• Supports real-time and batch consumption
1. Kafka was born for real-time processing of data, but can also handle
batch-oriented jobs, for example feeding data to Hadoop or a data
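Data retention is what lets real-time and batch consumers share the same messages: reading does not remove anything from the log, and each consumer tracks its own position. The sketch below is a toy in-memory model of that idea, not Kafka's client API; `ToyLog` and all of its methods are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """Hypothetical in-memory stand-in for a single partition log."""
    records: list = field(default_factory=list)

    def append(self, value):
        """Append a record and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records):
        """Return up to max_records starting at offset; reading deletes nothing."""
        return self.records[offset:offset + max_records]


log = ToyLog()
for i in range(6):
    log.append(f"event-{i}")

# "Real-time" style consumer: polls one record at a time and tracks its own offset.
realtime_offset = 0
realtime_seen = []
while True:
    batch = log.read(realtime_offset, 1)
    if not batch:
        break
    realtime_seen.extend(batch)
    realtime_offset += len(batch)

# "Batch" style consumer: arrives later and reads everything retained in one go.
batch_seen = log.read(0, len(log.records))
```

Both consumers end up with the same records because the log keeps them after they are read; in a real cluster the retention settings decide how long that remains true.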
High level of a Kafka cluster
• Kafka metrics are exposed via JMX and can be viewed with JMX clients
• The types of metrics exposed are:
1. Gauge: instantaneous measurement of one value
2. Meter: rate of events (ticks) over a time window, e.g. one-minute rate, five-minute rate,
etc.
3. Histogram: distribution of a value, e.g. 50th percentile, 98th percentile, etc.
4. Timer: measurement of timings (a meter plus a histogram)
Kafka uses Yammer Metrics on the broker and in the older (<0.9) clients. New
clients use a new internal metrics package. Confluent plans to consolidate the JMX
metrics packages in the future.
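To make the four metric types concrete, here is a minimal Python sketch of each one. It is an illustration only, not the Yammer Metrics implementation: the class and method names are assumptions, the meter reports only a simple mean rate (the real library also keeps moving averages), and the histogram stores every sample instead of using a sampling reservoir.

```python
import math
import time


class Gauge:
    """Instantaneous value, read on demand from a callable."""
    def __init__(self, read_fn):
        self._read_fn = read_fn

    def value(self):
        return self._read_fn()


class Meter:
    """Counts events (ticks) and reports a mean rate per second."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


class Histogram:
    """Keeps all samples and answers percentile queries (nearest-rank)."""
    def __init__(self):
        self._samples = []

    def update(self, value):
        self._samples.append(value)

    def percentile(self, p):
        if not self._samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


class Timer:
    """Timer = meter (how often) + histogram (how long)."""
    def __init__(self, clock=time.monotonic):
        self.meter = Meter(clock)
        self.histogram = Histogram()

    def update(self, duration_ms):
        self.meter.mark()
        self.histogram.update(duration_ms)


# Usage with a fake clock so the rate is deterministic.
queue = [1, 2, 3]
queue_size = Gauge(lambda: len(queue))

fake_now = [0.0]
requests = Meter(clock=lambda: fake_now[0])
requests.mark(120)
fake_now[0] = 60.0  # pretend one minute has passed

latency = Histogram()
for ms in range(1, 101):
    latency.update(ms)

election = Timer(clock=lambda: fake_now[0])
election.update(35)
```

The injectable `clock` argument is a design choice for testability; a real metrics library reads the system clock itself.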
Why Replication?
• Leader:
1. Accepts all reads and writes
2. Manages replicas
3. Leader election rate (meter metric):
kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs
• Follower:
1. Provides fault tolerance
2. Keeps up with the leader
• A special thread running in the cluster manages the current list
of leaders and followers for every partition. This is a complex,
mission-critical task, so the information is replicated in ZooKeeper
and then cached on every broker for faster access
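The leader and follower roles above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not Kafka's replication protocol: `ToyPartition` and its methods are hypothetical, followers copy every write immediately, and failover simply promotes the first follower in the list.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartition:
    """Hypothetical model of one partition's replica set (not Kafka's real code)."""
    leader: int
    followers: list
    logs: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every broker holding a replica starts with an empty copy of the log.
        for broker in [self.leader, *self.followers]:
            self.logs.setdefault(broker, [])

    def write(self, message):
        """The leader accepts the write, then followers copy it to keep up."""
        self.logs[self.leader].append(message)
        for broker in self.followers:
            self.logs[broker].append(message)

    def read(self):
        """Reads are served from the leader's copy."""
        return list(self.logs[self.leader])

    def fail_leader(self):
        """Simulate a broker failure: promote the first follower to leader."""
        if not self.followers:
            raise RuntimeError("no follower available to take over")
        del self.logs[self.leader]
        self.leader = self.followers.pop(0)
        return self.leader


partition = ToyPartition(leader=1, followers=[2, 3])
partition.write("order-created")
partition.write("order-paid")

new_leader = partition.fail_leader()
after_failover = partition.read()
```

Because the followers kept up with the leader, the promoted replica can serve the same messages after the original leader is gone, which is the fault tolerance the slide describes.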
Partition leaders
• https://kafka.apache.org
• https://www.confluent.io
• https://www.cesaro.io