Apache Kafka, a powerful publish-subscribe messaging system, has emerged as the preferred choice for handling high-volume data streams at companies such as LinkedIn, Netflix, and Uber. At the heart of Kafka's performance is its partition design.
Partitions are the building blocks that enable Kafka to handle huge datasets effortlessly, providing seamless scalability, parallel processing, and high fault tolerance.
In this article, you'll learn exactly what Kafka partitions are, how they work behind the scenes, their benefits, real-world use cases, and how to leverage them effectively in your systems.
What is a Messaging System?
When our application is made up of a large number of microservices, these microservices often need to interact with each other. An example architecture is shown below.
Messaging System
Here, services 1, 2 and 3 are producers, since they produce data required by services 4, 5, 6 and 7, which are therefore called consumers, as they consume the data produced by the producers. A large number of connections is set up here: a separate data pipeline must be created and maintained for each producer-consumer pair. A messaging system solves this problem by reducing the number of connections (data pipelines), as shown below.
Messaging System reduces the number of data pipelines
So, the producers simply publish their data to a messaging system, and the consumers consume the required data from the messaging system itself, not from the individual microservices.
The number of connections has been reduced from (producers × consumers) to (producers + consumers). For example, 3 producers and 4 consumers need 3 × 4 = 12 direct pipelines, but only 3 + 4 = 7 connections through a messaging system.
What is Apache Kafka?
Apache Kafka (originally developed at LinkedIn and now maintained by the Apache Software Foundation) is a publish-subscribe messaging system. As we saw, a messaging system is responsible for transferring data between applications, so that the applications can focus on the data and its processing rather than on data transmission and pipeline maintenance.
A publish-subscribe messaging system is one in which some microservices (subscribers) subscribe to a particular topic and receive messages from that topic, while the publisher publishes messages to that topic. This places a time dependency on the consumers to consume the messages. When a subscriber (consumer) consumes a message, it does not send an acknowledgement back to the producer.
publish-subscribe messaging system
For more details, refer to What is Apache Kafka.
What is a Topic?
Just as a large database maintains separation of concerns using tables, Apache Kafka maintains separation of concerns using topics. In simple terms, topics are similar to tables in databases (just for understanding). For instance, an organization's database may hold data about clients, employees, finance, and so on, each maintained in its own table. Similarly, in a microservice architecture, there can be a lot of data that needs to be sent to other microservices; data that is related can be grouped under one category, so we create a topic for each category and put all the data of that category into one topic.
All the other microservices that are interested in that data will consume messages from that topic. For instance, on an e-commerce website, we might create a topic for payments-related data. An order-creation service needs this data to calculate freight charges (the cost of shipping a product from source to destination), and an invoice-creation service, which creates PDF invoices for orders, also needs the payment data. An add-to-cart service, on the other hand, runs before any payment has been made, so it does not need payments data; it needs product-description data instead, which might also be required by a product-listing service.
So, in this case, we can have two separate topics, payments-data and product-description-data, and the data from these topics is consumed by different microservices as per their requirements.
Example Kafka Topic: payments-data
Example Kafka Topic - product-details
Also, a topic is not simply a dump of data. It is a queue, i.e. it maintains the FIFO (First-In-First-Out) principle: the data that entered the topic first is consumed first.
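As a concrete illustration, here is a minimal sketch of creating the two example topics with Kafka's Java AdminClient. The partition counts, replication factor, and broker address are assumptions for illustration, not values from this article.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // NewTopic(name, numPartitions, replicationFactor);
            // the partition and replication values here are illustrative only.
            NewTopic payments = new NewTopic("payments-data", 3, (short) 1);
            NewTopic products = new NewTopic("product-description-data", 3, (short) 1);
            admin.createTopics(Arrays.asList(payments, products)).all().get();
        }
    }
}
```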
What are Apache Kafka Partitions?
Till now, we have understood that a major reason for developing a microservice based architecture was that if one service fails, it should not impact the other services. In simple terms, we should have some fallback mechanism in case of a failure so that our systems never go down.
For understanding Apache Kafka partitions, we need to understand a few more Kafka terminologies. The diagram below shows how a topic is actually stored in Kafka.
How topics are stored in Kafka
We know that a topic, in layman's terms, is just a dump of data (real-time data, since the producer keeps producing new data). A dump of data has to be stored somewhere on a physical machine (a server). In Kafka, that server is called a Broker.
A broker in Apache Kafka is responsible for maintaining topics (and thus the published messages). A broker is more than simply a storage service for topics; it also maintains consumer offsets and does other work, but that is beyond the scope of this discussion. A collection or group of multiple brokers is called a Kafka cluster.
For now, if we take a broker to be a machine that stores and maintains topics, what happens if the broker holding our topic goes down for some reason? If the diagram above were the whole story, a broker failure would mean we could no longer consume messages: multiple consumer applications would get choked while the producer kept producing messages to the stuck topic, and the lag in our systems would grow enormously.
To prevent this from happening we have 2 very important concepts in Kafka called partitioning and replication.
Each topic is split into multiple parts, and each part is called a partition. The messages within each partition are ordered and immutable, and each message inside a partition has a unique ID called an offset.
Partitions in Apache Kafka
Here, we see that the partitions have been distributed across the existing brokers in a round-robin fashion. So, if broker 1 goes down, we don't lose the entire topic; partition 2 is still available on another broker, and messages can be consumed from it. This approach is called decentralization and is also common in databases.
Note: The number of partitions to be made from a particular topic depends on a lot of factors such as the frequency of data coming into the topic, the number of consumers, etc. and is very specific to your use case and design of your system.
Why Use Partitions?
Partitions provide essential advantages for Kafka:
- Scalability: More partitions allow more parallelism, letting Kafka process huge data volumes.
- Parallel Processing: Multiple consumers can consume from different partitions at once, dramatically accelerating data consumption.
- Fault Tolerance: Replicating partitions across brokers keeps data safe and highly available even when a server fails.
- Increased Throughput: Splitting messages across partitions allows more efficient data processing and minimizes bottlenecks.
Key Kafka Partition Concepts:
- Topic: Stores related messages and may be split into multiple partitions.
- Brokers: Partitions are kept on different Kafka servers (brokers) for fault tolerance.
- Replicas: Partition copies guarantee data safety.
- Message Ordering: Messages are ordered within the same partition but not between partitions.
- Partition Key: Determines how messages are spread over partitions.
How Kafka Partitions Work
Apache Kafka handles massive streams of real-time data with ease by dividing topics into smaller pieces known as partitions.
1. Partition Storage
Every Kafka partition is kept as an independent log on a Kafka broker's (server's) disk. Messages are always appended sequentially to the end of this log; once written, messages cannot be modified or deleted. This append-only design lets Kafka sustain extremely high write rates and guarantees message ordering within a partition.
2. Partition Leaders & Followers
Kafka provides high availability and fault tolerance by replicating partitions across multiple brokers. Here's how it works (a sketch for inspecting these roles follows the list):
- Leader Partition: Each partition has a single leader that handles all incoming writes and serves consumer reads. Leaders determine Kafka's data distribution and performance.
- Replica Partitions (Followers): Copies of the leader partition on other brokers that serve as backups. They replicate the leader's data to guard against broker failures and prevent data loss.
- In-Sync Replicas (ISR): Replicas that are fully caught up with the leader partition. Only ISR replicas can be promoted to leader if the existing leader fails, providing continuous availability and guaranteed data delivery.
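The leader, follower, and ISR assignments of a topic can be inspected programmatically. Below is a minimal sketch using the Java AdminClient; the topic name and broker address are assumptions carried over from the earlier example.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // allTopicNames() requires kafka-clients 3.1+; older versions use all().
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("payments-data"))
                    .allTopicNames().get()
                    .get("payments-data");
            // For each partition, print its leader, full replica set, and ISR.
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```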
How Kafka Producers Use Partitions
Kafka producers write messages to Kafka topics. By default, a producer distributes messages across partitions round-robin style or, if a message key is provided, uses a hash of the key to pick the partition. Messages with the same key always land in the same partition, preserving their order, which is a must-have for applications such as financial transaction processing. A minimal producer sketch follows.
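Here is a minimal sketch of a keyed producer in Java. The topic name, key, value, and broker address are illustrative assumptions; the returned RecordMetadata shows the partition and offset at which each message was appended.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentsProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Using the customer ID as the key sends all of this customer's
            // payments to the same partition, preserving their order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("payments-data", "customer-42", "payment: $19.99");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("appended to partition %d at offset %d%n",
                    meta.partition(), meta.offset());
        }
    }
}
```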
Partitioning Strategies
- Round-Robin Partitioning: Distributes messages evenly across partitions but offers no ordering guarantees.
- Key-Based Partitioning: Keeps messages with the same key in order (see the sketch below).
- Sticky Partitioning: Added in Kafka 2.4; it batches unkeyed messages to one partition at a time, balancing batching efficiency with even message distribution.
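For keyed messages, Kafka's default partitioner hashes the serialized key with murmur2 and takes the result modulo the partition count. The sketch below reproduces that computation using the hash utilities shipped in kafka-clients; the key and partition count are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyToPartition {
    public static void main(String[] args) {
        byte[] keyBytes = "customer-42".getBytes(StandardCharsets.UTF_8);
        int numPartitions = 3; // assumed partition count of the topic

        // Same formula the default partitioner applies to keyed messages:
        // murmur2 hash of the key bytes, forced positive, modulo partitions.
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("key 'customer-42' maps to partition " + partition);
    }
}
```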
Consumer Groups and Partition Assignment
Kafka consumers consume data from partitions. Consumers are grouped together to enable parallel consumption from different partitions. Kafka assigns partitions to consumers automatically:
- Equal consumers and partitions: One partition per consumer.
- Fewer consumers than partitions: Multiple partitions per consumer.
- More consumers than partitions: Some consumers remain idle.
Consumer Offsets and Rebalancing
Consumers track their reading progress using "offsets." Kafka manages these offsets, allowing consumers to pause and resume seamlessly. If consumers join or leave a group, Kafka triggers a "rebalance," reassigning partitions to maintain balance.
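The sketch below shows a consumer joining a group, polling records, and committing offsets. The group ID and topic name are assumptions for illustration; Kafka assigns this consumer its share of the topic's partitions and rebalances automatically as group members come and go.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PaymentsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "invoice-service"); // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments-data"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
                consumer.commitSync(); // record our progress (offsets) explicitly
            }
        }
    }
}
```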
Advanced Kafka Partitioning Concepts
Kafka Streams and Kafka Connect
- Kafka Streams: Leverages partitions to build real-time processing applications with scalable, stateful processing (a minimal Streams sketch follows the list).
- Kafka Connect: Uses partitions to move data between Kafka and external systems with a high degree of parallelism.
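As a taste of how Kafka Streams builds on partitions, here is a minimal sketch that counts payment events per key. Processing scales out by partition: each application instance handles a share of the input topic's partitions. The topic names and application ID are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class PaymentCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-counts"); // assumed ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments-data");
        // Count events per key; records with the same key live in the same
        // partition, so each instance can aggregate its partitions locally.
        payments.groupByKey()
                .count()
                .toStream()
                .to("payment-counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```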
Monitoring and Tuning Kafka Partitions
Periodic monitoring can reveal performance bottlenecks such as "hot partitions", partitions receiving abnormally high traffic. Tools such as JMX or Prometheus/Grafana make it easy to monitor partition performance and guide fine-tuning.
What is Replication? How are partitions replicated?
Replicas are backups of partitions. We never read data from or write data to replicas directly; we read and write only on the partitions themselves, and that data is then replicated to the replicas. Replication factor is the Kafka property that decides how many copies of each partition must be kept. The default replication factor in Apache Kafka is 1, i.e. only the partition itself exists, with no duplicate copy. An example of a topic with replication factor 2 is shown below:
Replication of Apache Kafka Partitions
Here, the replicas have been assigned to the brokers in a round-robin fashion as well. Note that a replica is never placed on the same broker as its corresponding partition, as that would defeat the entire purpose of replication and partitioning.
Replication provides fault tolerance and prevents data loss.
That covers the core mechanics of Kafka partitions and replication. Next, let's look at how these ideas play out in real systems.
Real-World Examples of Kafka Partitions
Here are some real-world examples highlighting how Kafka's partitioning strategy brings scalability and consistency:
E-commerce platforms
Large e-commerce companies partition Kafka topics by customer user ID. This ensures each customer's orders are processed in the right order, providing a consistent customer experience and reliable order tracking, particularly during festive sale events.
Financial systems
Banks and financial systems generally process millions of transactions per second. With transaction IDs as a partition key, Kafka can offer correct transaction ordering, which is essential for fraud detection, payment processing, and regulatory compliance.
IoT solutions
Internet of Things (IoT) networks generate huge volumes of sensor data per second. By spreading sensor data evenly across multiple Kafka partitions, IoT businesses can do concurrent analytics, allowing for quicker analysis of real-time data, faster decision-making, and scalable handling of data.
Conclusion
Apache Kafka partitions are of paramount importance for handling massive streams of real-time data efficiently and dependably. By splitting topics into partitions distributed across brokers, Kafka achieves exceptional scalability, fault tolerance, and parallel processing. Through partition replication, Kafka remains highly available even when some servers fail, making it an ideal candidate for mission-critical applications in areas such as e-commerce, finance, and IoT.