Apache Kafka, a powerful publish-subscribe messaging system, has emerged as the preferred choice for handling high-volume data streams at companies such as LinkedIn, Netflix, and Uber. At the heart of Kafka's performance is its partition design.
Partitions are the building blocks that enable Kafka to handle huge datasets effortlessly, providing seamless scalability, parallel processing, and high fault tolerance.
In this article, you'll learn exactly what Kafka partitions are, how they work behind the scenes, their benefits, real-world use cases, and how to leverage them effectively in your systems.
What is a Messaging System?
When our application is made up of a large number of microservices, these microservices often need to interact with each other. An example architecture is shown below.
Messaging System
Here, services 1, 2 and 3 are producers, since they produce data required by services 4, 5, 6 and 7, which are therefore called consumers, as they consume the data produced by the producers. A large number of connections is set up here: a separate data pipeline must be created and maintained for each producer-consumer pair. A messaging system solves this problem by reducing the number of connections (data pipelines), as shown below.
Messaging System reduces the number of data pipelines
So, the producers simply publish their data to a messaging system, and the consumers consume the required data from the messaging system itself, not from the individual microservices.
The number of connections has been reduced from (producers × consumers) to (producers + consumers). For example, 3 producers and 4 consumers need 3 × 4 = 12 direct pipelines, but only 3 + 4 = 7 connections through a messaging system.
What is Apache Kafka?
Apache Kafka (originally developed at LinkedIn and now maintained by the Apache Software Foundation) is a publish-subscribe messaging system. As we saw, a messaging system is responsible for transferring data between applications, so that the applications can focus on the data and its processing rather than on data transmission and pipeline maintenance.
A publish-subscribe messaging system is one in which some microservices (subscribers) subscribe to a particular topic and receive messages from that topic, while the publisher publishes messages to that topic. This places a time dependency on the consumers to consume the messages. When a subscriber (consumer) consumes a message, it does not send an acknowledgement back to the producer.
publish-subscribe messaging system
For more details, refer to What is Apache Kafka.
What is a Topic?
Just as a large database maintains separation of concerns using tables, Apache Kafka maintains separation of concerns using topics. In simple terms, topics are similar to tables in databases (just for understanding). For instance, an organization's database may hold data about clients, employees, finance, and so on, each maintained in its own table. Similarly, in a microservice architecture, there can be a lot of data that needs to be sent to other microservices; data that is related can be grouped under one category, so we create a topic for each category and put all the data of that category into one topic.
All the other microservices that are interested in that data will consume messages from that topic. For instance, on an e-commerce website, we might create a topic for payments-related data. An order-creation service needs this data to calculate freight charges (the cost of shipping a product from source to destination), and an invoice-creation service, which creates PDF invoices for orders, also needs the payment data. An add-to-cart service, on the other hand, runs before any payment has been made, so it does not need payments data; it needs product-description data instead, which might also be required by a product-listing service.
So, in this case, we can have two separate topics, payments-data and product-description-data, and the data from these topics is consumed by different microservices as per their requirements.
Example Kafka Topic: payments-data
Example Kafka Topic - product-details
Also, a topic is not simply a dump of data. It is a queue, i.e. it maintains the FIFO (First-In-First-Out) principle: the data that entered the topic first is consumed first.
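As a concrete illustration, here is a minimal sketch of creating the two example topics with Kafka's Java AdminClient. The partition counts, replication factor, and broker address are assumptions for illustration, not values from this article.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // NewTopic(name, numPartitions, replicationFactor);
            // the partition and replication values here are illustrative only.
            NewTopic payments = new NewTopic("payments-data", 3, (short) 1);
            NewTopic products = new NewTopic("product-description-data", 3, (short) 1);
            admin.createTopics(Arrays.asList(payments, products)).all().get();
        }
    }
}
```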
What are Apache Kafka Partitions?
Till now, we have understood that a major reason for developing a microservice based architecture was that if one service fails, it should not impact the other services. In simple terms, we should have some fallback mechanism in case of a failure so that our systems never go down.
For understanding Apache Kafka partitions, we need to understand a few more Kafka terminologies. The diagram below shows how a topic is actually stored in Kafka.
How topics are stored in Kafka
We know that a topic, in layman's terms, is just a dump of data (real-time data, since the producer keeps producing new data). A dump of data has to be stored somewhere on a physical machine (a server). In Kafka, that server is called a Broker.
A broker in Apache Kafka is responsible for maintaining topics (and thus the published messages). A broker is more than simply a storage service for topics; it also maintains consumer offsets and does other work, but that is beyond the scope of this discussion. A collection or group of multiple brokers is called a Kafka cluster.
For now, if we take a broker to be a machine that stores and maintains topics, what happens if the broker holding our topic goes down for some reason? If the diagram above were the whole story, a broker failure would mean we could no longer consume messages: multiple consumer applications would get choked while the producer kept producing messages to the stuck topic, and the lag in our systems would grow enormously.
To prevent this from happening we have 2 very important concepts in Kafka called partitioning and replication.
Each topic is split into multiple parts, and each part is called a partition. The messages within each partition are ordered and immutable, and each message inside a partition has a unique ID called an offset.
Partitions in Apache Kafka
Here, we see that the partitions have been distributed across the existing brokers in a round-robin fashion. So, if broker 1 goes down, we don't lose the entire topic; partition 2 is still available on another broker, and messages can be consumed from it. This approach is called decentralization and is also common in databases.
Note: The number of partitions to be made from a particular topic depends on a lot of factors such as the frequency of data coming into the topic, the number of consumers, etc. and is very specific to your use case and design of your system.
Why Use Partitions?
Partitions provide essential advantages for Kafka:
- Scalability: More partitions allow more parallelism, letting Kafka process huge data volumes.
- Parallel Processing: Multiple consumers can consume from different partitions at once, dramatically accelerating data consumption.
- Fault Tolerance: Replicating partitions across brokers keeps data safe and highly available even when a server fails.
- Increased Throughput: Splitting messages across partitions allows more efficient data processing and minimizes bottlenecks.
Key Kafka Partition Concepts:
- Topic: Stores related messages and may be split into multiple partitions.
- Brokers: Partitions are kept on different Kafka servers (brokers) for fault tolerance.
- Replicas: Partition copies guarantee data safety.
- Message Ordering: Messages are ordered within the same partition but not between partitions.
- Partition Key: Determines how messages are spread over partitions.
How Kafka Partitions Work
Apache Kafka handles massive streams of real-time data with ease by dividing topics into smaller pieces known as partitions.
1. Partition Storage
Every Kafka partition is kept as an independent log on a Kafka broker's (server's) disk. Messages are always appended sequentially to the end of this log; once written, messages cannot be modified or deleted. This append-only design lets Kafka sustain extremely high write rates and guarantees message ordering within a partition.
2. Partition Leaders & Followers
Kafka provides high availability and fault tolerance by replicating partitions across multiple brokers. Here's how it works (a sketch for inspecting these roles follows the list):
- Leader Partition: Each partition has a single leader that handles all incoming writes and serves consumer reads. Leaders determine Kafka's data distribution and performance.
- Replica Partitions (Followers): Copies of the leader partition on other brokers that serve as backups. They replicate the leader's data to guard against broker failures and prevent data loss.
- In-Sync Replicas (ISR): Replicas that are fully caught up with the leader partition. Only ISR replicas can be promoted to leader if the existing leader fails, providing continuous availability and guaranteed data delivery.
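The leader, follower, and ISR assignments of a topic can be inspected programmatically. Below is a minimal sketch using the Java AdminClient; the topic name and broker address are assumptions carried over from the earlier example.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // allTopicNames() requires kafka-clients 3.1+; older versions use all().
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("payments-data"))
                    .allTopicNames().get()
                    .get("payments-data");
            // For each partition, print its leader, full replica set, and ISR.
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}
```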
How Kafka Producers Use Partitions
Kafka producers write messages to Kafka topics. By default, a producer distributes messages across partitions round-robin style or, if a message key is provided, uses a hash of the key to pick the partition. Messages with the same key always land in the same partition, preserving their order, which is a must-have for applications such as financial transaction processing. A minimal producer sketch follows.
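Here is a minimal sketch of a keyed producer in Java. The topic name, key, value, and broker address are illustrative assumptions; the returned RecordMetadata shows the partition and offset at which each message was appended.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentsProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Using the customer ID as the key sends all of this customer's
            // payments to the same partition, preserving their order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("payments-data", "customer-42", "payment: $19.99");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("appended to partition %d at offset %d%n",
                    meta.partition(), meta.offset());
        }
    }
}
```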
Partitioning Strategies
- Round-Robin Partitioning: Distributes messages evenly across partitions but offers no ordering guarantees.
- Key-Based Partitioning: Keeps messages with the same key in order (see the sketch below).
- Sticky Partitioning: Added in Kafka 2.4; it batches unkeyed messages to one partition at a time, balancing batching efficiency with even message distribution.
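For keyed messages, Kafka's default partitioner hashes the serialized key with murmur2 and takes the result modulo the partition count. The sketch below reproduces that computation using the hash utilities shipped in kafka-clients; the key and partition count are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyToPartition {
    public static void main(String[] args) {
        byte[] keyBytes = "customer-42".getBytes(StandardCharsets.UTF_8);
        int numPartitions = 3; // assumed partition count of the topic

        // Same formula the default partitioner applies to keyed messages:
        // murmur2 hash of the key bytes, forced positive, modulo partitions.
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("key 'customer-42' maps to partition " + partition);
    }
}
```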
Consumer Groups and Partition Assignment
Kafka consumers consume data from partitions. Consumers are grouped together to enable parallel consumption from different partitions. Kafka assigns partitions to consumers automatically:
- Equal consumers and partitions: One partition per consumer.
- Fewer consumers than partitions: Multiple partitions per consumer.
- More consumers than partitions: Some consumers remain idle.
Consumer Offsets and Rebalancing
Consumers track their reading progress using "offsets." Kafka manages these offsets, allowing consumers to pause and resume seamlessly. If consumers join or leave a group, Kafka triggers a "rebalance," reassigning partitions to maintain balance.
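The sketch below shows a consumer joining a group, polling records, and committing offsets. The group ID and topic name are assumptions for illustration; Kafka assigns this consumer its share of the topic's partitions and rebalances automatically as group members come and go.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PaymentsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "invoice-service"); // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments-data"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
                consumer.commitSync(); // record our progress (offsets) explicitly
            }
        }
    }
}
```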
Advanced Kafka Partitioning Concepts
Kafka Streams and Kafka Connect
- Kafka Streams: Leverages partitions to build real-time processing applications with scalable, stateful processing (a minimal Streams sketch follows the list).
- Kafka Connect: Uses partitions to move data between Kafka and external systems with a high degree of parallelism.
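As a taste of how Kafka Streams builds on partitions, here is a minimal sketch that counts payment events per key. Processing scales out by partition: each application instance handles a share of the input topic's partitions. The topic names and application ID are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class PaymentCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-counts"); // assumed ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments-data");
        // Count events per key; records with the same key live in the same
        // partition, so each instance can aggregate its partitions locally.
        payments.groupByKey()
                .count()
                .toStream()
                .to("payment-counts-output", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```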
Monitoring and Tuning Kafka Partitions
Periodic monitoring can reveal performance bottlenecks such as "hot partitions", partitions receiving abnormally high traffic. Tools such as JMX or Prometheus/Grafana make it easy to monitor partition performance and guide fine-tuning.
What is Replication? How are partitions replicated?
Replicas are backups of partitions. We never read data from or write data to replicas directly; we read and write only on the partitions themselves, and that data is then replicated to the replicas. Replication factor is the Kafka property that decides how many copies of each partition must be kept. The default replication factor in Apache Kafka is 1, i.e. only the partition itself exists, with no duplicate copy. An example of a topic with replication factor 2 is shown below:
Replication of Apache Kafka Partitions
Here, the replicas have been assigned to the brokers in a round-robin fashion as well. Note that a replica is never placed on the same broker as its corresponding partition, as that would defeat the entire purpose of replication and partitioning.
Replication provides fault tolerance and prevents data loss.
That covers the core mechanics of Kafka partitions and replication. Next, let's look at how these ideas play out in real systems.
Real-World Examples of Kafka Partitions
Here are some real-world examples highlighting how Kafka's partitioning strategy brings scalability and consistency:
E-commerce platforms
Large e-commerce companies partition Kafka topics by customer user ID. This ensures each customer's orders are processed in the right order, providing a consistent customer experience and reliable order tracking, particularly during festive sale events.
Financial systems
Banks and financial systems generally process millions of transactions per second. With transaction IDs as a partition key, Kafka can offer correct transaction ordering, which is essential for fraud detection, payment processing, and regulatory compliance.
IoT solutions
Internet of Things (IoT) networks generate huge volumes of sensor data per second. By spreading sensor data evenly across multiple Kafka partitions, IoT businesses can do concurrent analytics, allowing for quicker analysis of real-time data, faster decision-making, and scalable handling of data.
Conclusion
Apache Kafka partitions are of paramount importance for handling massive streams of real-time data efficiently and dependably. By splitting topics into partitions distributed across brokers, Kafka achieves exceptional scalability, fault tolerance, and parallel processing. Through partition replication, Kafka remains highly available even when some servers fail, making it an ideal candidate for mission-critical applications in areas such as e-commerce, finance, and IoT.