
Kafka Architecture

Last Updated : 31 Mar, 2025

Apache Kafka is a distributed streaming platform designed for building real-time data pipelines and streaming applications. It is known for its high throughput, low latency, fault tolerance, and scalability. This article delves into the architecture of Kafka, exploring its core components, functionalities, and the interactions between them.

Understanding Apache Kafka Architecture

Kafka is an open-source distributed streaming platform designed to handle large amounts of real-time data by providing a scalable, fault-tolerant, low-latency foundation for real-time processing. Kafka was originally developed by a team of engineers at LinkedIn and open-sourced in 2011. Thousands of organizations use Kafka for building event-driven architectures, real-time analytics, and streaming pipelines. Kafka's architecture is based on the publish-subscribe model, follows a distributed design, and runs as a cluster.

Core Components of Kafka Architecture

  1. Kafka Cluster: A Kafka cluster is a distributed system composed of multiple Kafka brokers working together to handle the storage and processing of real-time streaming data. It provides fault tolerance, scalability, and high availability for efficient data streaming and messaging in large-scale applications.
  2. Brokers: Brokers are the servers that form the Kafka cluster. Each broker is responsible for receiving, storing, and serving data. They handle the read and write operations from producers and consumers. Brokers also manage the replication of data to ensure fault tolerance.
  3. Topics and Partitions: Data in Kafka is organized into topics, which are logical channels to which producers send data and from which consumers read data. Each topic is divided into partitions, which are the basic unit of parallelism in Kafka. Partitions allow Kafka to scale horizontally by distributing data across multiple brokers.
  4. Producers: Producers are client applications that publish (write) data to Kafka topics. They send records to the appropriate topic and partition based on the partitioning strategy, which can be key-based or round-robin.
  5. Consumers: Consumers are client applications that subscribe to Kafka topics and process the data. They read records from the topics and can be part of a consumer group, which allows for load balancing and fault tolerance. Each consumer in a group reads data from a unique set of partitions.
  6. ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, distributed synchronization, and group services. In Kafka, ZooKeeper is used to manage and coordinate the Kafka brokers, and it appears in the architecture diagram as a separate component interacting with the Kafka cluster.
  7. Offsets: Offsets are unique identifiers assigned to each message in a partition. Consumers use these offsets to track their progress while consuming messages from a topic.
Figure: Core Components of Kafka Architecture
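
To make these components concrete, here is a minimal Java producer sketch that publishes a few records to a topic. The broker address localhost:9092, the topic name orders, and the record contents are illustrative assumptions, not values taken from this article.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // The record key influences which partition the record is written to.
                producer.send(new ProducerRecord<>("orders", "order-" + i, "order payload " + i));
            }
        } // close() flushes any buffered records before returning
    }
}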

Kafka APIs

Kafka provides several APIs to interact with the system:

  1. Producer API: Allows applications to send streams of data to topics in the Kafka cluster. It handles the serialization of data and the partitioning logic.
  2. Consumer API: Allows applications to read streams of data from topics. It manages the offsets of the data read so that consumers can track their position in each partition and resume where they left off (a minimal consumer sketch follows this list).
  3. Streams API: A Java library for building applications that process data in real-time. It allows for powerful transformations and aggregations of event data.
  4. Connector API: Provides a framework for connecting Kafka with external systems. Source connectors import data from external systems into Kafka topics, while sink connectors export data from Kafka topics to external systems.
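
As a counterpart to the producer sketch above, the following is a minimal use of the Consumer API. The group id orders-processor and the topic name orders are illustrative assumptions; with default settings, offsets are committed automatically in the background.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "orders-processor");        // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}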

Interactions in the Kafka Architecture

  • Producers to Kafka Cluster: Producers send data to the Kafka cluster. The data is published to specific topics, which are then divided into partitions and distributed across the brokers.
  • Kafka Cluster to Consumers: Consumers read data from the Kafka cluster. They subscribe to topics and consume data from the partitions assigned to them. The consumer group ensures that the load is balanced and that each partition is processed by only one consumer in the group.
  • ZooKeeper to Kafka Cluster: ZooKeeper coordinates and manages the Kafka cluster. It keeps track of the cluster's metadata, manages broker configurations, and handles leader elections for partitions.

The diagram below illustrates the relationship between partitions, offsets, and consumer groups in a Kafka-based system:

Figure: Interactions in the Kafka Architecture
  1. Partitions: The diagram shows three partitions labeled "Partition 0", "Partition 1", and "Partition 2". Each partition contains a sequence of records, each with an offset value ranging from 0 to 6. The offset is a unique identifier for each record within a partition, indicating its position.
  2. Consumer Group: The right side of the diagram shows a consumer group with three consumers. Each consumer is associated with a specific partition and an offset value:
    • Consumer 1: Associated with Partition 0, reading from offset 4.
    • Consumer 2: Associated with Partition 1, reading from offset 2.
    • Consumer 3: Associated with Partition 2, reading from offset 3.
  3. Data Flow: The arrows in the diagram indicate the flow of data from the partitions to the consumers. Each consumer reads data from its assigned partition starting from the specified offset, and each partition is consumed by only one consumer in the group at a time, so records are not processed by two members of the same group.
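
To illustrate offsets directly, the sketch below assigns a consumer to a single partition and seeks to a chosen offset before reading, similar to Consumer 1 in the diagram. The topic name orders, partition number 0, and starting offset 4 are assumptions made for illustration.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetSeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition0 = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition0)); // manual assignment, no group rebalancing
            consumer.seek(partition0, 4);                           // start reading at offset 4
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}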

Key Features of Kafka Architecture

  1. High Throughput and Low Latency: Kafka is designed to handle high volumes of data with low latency. It can process millions of messages per second with latencies as low as 10 milliseconds.
  2. Fault Tolerance: Kafka achieves fault tolerance through data replication. Each partition can have multiple replicas, and Kafka ensures that data is replicated across multiple brokers. This allows the system to continue operating even if some brokers fail.
  3. Durability: Kafka ensures data durability by persisting data to disk. Data is stored in a log-structured format, which allows for efficient sequential reads and writes.
  4. Scalability: Kafka's distributed architecture allows it to scale horizontally by adding more brokers to the cluster. This enables Kafka to handle increasing amounts of data without downtime.
  5. Real-Time Processing: Kafka supports real-time data processing through its Streams API and ksqlDB, a streaming database that allows for SQL-like queries on streaming data.
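
Several of these guarantees depend on client configuration as much as on the brokers. The snippet below is a sketch of producer settings commonly used to favor durability and fault tolerance; the values shown are illustrative assumptions, not recommendations taken from this article.

import java.util.Properties;

public class DurableProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
        props.put("acks", "all");                                   // wait for all in-sync replicas to acknowledge each write
        props.put("enable.idempotence", "true");                    // avoid duplicate writes when retries occur
        props.put("retries", Integer.toString(Integer.MAX_VALUE));  // keep retrying through transient broker failures
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}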

Apache Kafka Frameworks

Kafka is a distributed streaming platform that can be combined with several frameworks that extend its capabilities and integrate it with other systems. Some of the key frameworks in the Kafka ecosystem include:

  1. Kafka Connect: Kafka Connect is a framework for reliable and scalable streaming data integration between Apache Kafka and other systems. It is part of the Apache Kafka ecosystem and provides a pluggable way to connect Kafka with external systems such as databases and file systems. Kafka Connect offers ready-made connectors for common data sources and sinks, which simplifies the integration process.
  2. Kafka Streams: Kafka Streams is a client library for building applications and microservices that process and analyze data stored in Kafka topics. It provides a high-level API for stream processing tasks such as filtering, joining, and aggregating data streams (a small topology sketch follows the figure below).
Figure: Apache Kafka Frameworks
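
Below is a minimal Kafka Streams topology sketch that reads from one topic, filters records, and writes the matches to another topic. The application id, topic names, and filter condition are assumptions chosen for illustration.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SimpleStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter-app");  // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");           // assumed source topic
        // Keep only records whose value mentions "priority" and forward them to a sink topic.
        orders.filter((key, value) -> value != null && value.contains("priority"))
              .to("priority-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}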

3. Schema Registry: The Schema Registry is a component of Confluent Platform (a distribution of Kafka) that provides a centralized repository for storing and managing the Avro schemas used in Kafka messages. It ensures that producers and consumers serialize and deserialize data with compatible schemas (an example producer configuration follows the figure below).

Figure: Schema Registry
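
The following sketch shows producer properties for Avro serialization against a Schema Registry. It assumes Confluent's kafka-avro-serializer library is on the classpath and that a registry is reachable at http://localhost:8081; both the registry address and the broker address are illustrative assumptions.

import java.util.Properties;

public class AvroProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Serializers from Confluent's kafka-avro-serializer artifact (assumed to be on the classpath).
        props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Where the serializer registers and looks up schemas (assumed local registry).
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }
}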

Kafka Topic Management

One of the fundamental aspects of working with Kafka is managing topics. Topics are the categories to which records are sent by producers and from which records are received by consumers.

1. Creating Topics

To create a topic in Kafka, you can use the kafka-topics.sh script, which is included in the Kafka distribution. Here is an example command to create a Kafka topic:

./bin/kafka-topics.sh --create --topic topic_name --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1

Command Explanation:

  • --create: This flag is used to create a new topic.
  • --topic topic_name: Specifies the name of the topic that needs to be created.
  • --bootstrap-server localhost:9092: Specifies the Kafka broker to connect to. You can replace localhost:9092 with your actual broker address.
  • --replication-factor 1: Specifies the replication factor for the topic, which indicates how many copies of each partition should be maintained. In this example, it is set to 1.
  • --partitions 1: Specifies the number of partitions for the topic. In this example, it is set to 1.
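
Topics can also be created programmatically. The sketch below uses Kafka's Java AdminClient and mirrors the command above: the topic name, partition count, and replication factor match the CLI example, while the broker address remains an assumption.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 1 partition, replication factor 1, matching the CLI example above.
            NewTopic topic = new NewTopic("topic_name", 1, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get(); // wait for completion
        }
    }
}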

2. Topic Configurations

Kafka topics have several configurations that determine their behavior. If no configuration is provided when a topic is created, the broker's default server properties are used. You can set per-topic configurations at creation time with the kafka-topics tool and the --config option, and modify them later with the kafka-configs tool and the --alter option.

Example: Creating a Topic with Configurations

./bin/kafka-topics.sh --create --topic topic_name --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --config retention.ms=604800000

Example: Modifying Topic Configurations

./bin/kafka-configs.sh --alter --bootstrap-server localhost:9092 --entity-type topics --entity-name topic_name --add-config retention.ms=259200000
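
The same change can be made programmatically. The sketch below uses the AdminClient's incrementalAlterConfigs call to set retention.ms on the topic; the retention value matches the command above, and the broker address is an assumption.

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class AlterTopicConfigExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "topic_name");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Map.of(topic, Collections.singletonList(setRetention));
            admin.incrementalAlterConfigs(updates).all().get(); // wait for the broker to apply the change
        }
    }
}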

3. Topic Partitions and Replication

  • Partitions: Topic partitions are a fundamental concept in Kafka that allow data to be parallelized and distributed among multiple brokers. Each partition is an ordered collection of messages that are immutable. When creating a topic, you specify the number of partitions using the --partitions flag.
  • Replication: Kafka allows replication of data across multiple brokers to ensure data durability and fault tolerance. Each partition can have one or more replicas. Among these replicas, one serves as the leader, and the others serve as followers. The leader handles all read and write requests for the partition, while the followers replicate the data. If the leader replica fails, one of the follower replicas is elected as the new leader. The replication factor is specified using the --replication-factor flag when creating a topic.

Example: Creating a Topic with Partitions and Replication

./bin/kafka-topics.sh --create --topic topic_name --bootstrap-server localhost:9092 --replication-factor 3 --partitions 4
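
With more than one partition, the record key determines where each record lands: Kafka's default partitioner hashes the key, so records with the same key always go to the same partition and are kept in order there. The sketch below sends keyed records to the four-partition topic created above and prints the assigned partition; the keys and values are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String user : new String[] {"alice", "bob", "alice", "carol"}) {
                // Records sharing a key ("alice") are routed to the same partition.
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("topic_name", user, "event for " + user))
                        .get();
                System.out.printf("key=%s -> partition=%d offset=%d%n", user, meta.partition(), meta.offset());
            }
        }
    }
}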

Real-World Kafka Architectures

Apache Kafka is a versatile platform used in various real-world applications due to its high throughput, fault tolerance, and scalability. Here, we will explore three common Kafka architectures: Pub-Sub Systems, Stream Processing Pipelines, and Log Aggregation Architectures.

1. Pub-Sub Systems

In a publish-subscribe (pub-sub) system, producers publish messages to topics, and consumers subscribe to those topics to receive the messages. Kafka's architecture is well-suited for pub-sub systems due to its ability to handle high volumes of data and provide reliable message delivery.

Key Components

  • Producers: Applications that send data to Kafka topics.
  • Topics: Logical channels to which producers send data and from which consumers read data.
  • Consumers: Applications that subscribe to topics and process the data.
  • Consumer Groups: Groups of consumers that share the load of reading from topics.

A real-world example of a pub-sub system using Kafka could be a news feed application where multiple news sources (producers) publish articles to a topic, and various user applications (consumers) subscribe to receive updates in real-time.

2. Stream Processing Pipelines

Stream processing pipelines involve continuously ingesting, processing, and transforming data in real-time. Kafka's ability to handle high-throughput data streams and its integration with stream processing frameworks like Apache Flink and Apache Spark make it ideal for building such pipelines.

Key Components

  • Producers: Applications that send raw data streams to Kafka topics.
  • Topics: Channels where raw data is stored before processing.
  • Stream Processors: Applications or frameworks that consume raw data, process it, and produce transformed data.
  • Sink Topics: Topics where processed data is stored for further use.

A real-world example of a stream processing pipeline using Kafka could be a financial trading platform where market data (producers) is ingested in real-time, processed to detect trading signals (stream processors), and the results are stored in sink topics for further analysis.

3. Log Aggregation Architectures

Log aggregation involves collecting log data from various sources, centralizing it, and making it available for analysis. Kafka's durability and scalability make it an excellent choice for log aggregation systems.

Key Components

  • Log Producers: Applications or services that generate log data.
  • Log Topics: Kafka topics where log data is stored.
  • Log Consumers: Applications that read log data for analysis or storage in a centralized system.

A real-world example of a log aggregation architecture using Kafka could be a microservices-based application where each microservice produces logs. These logs are sent to Kafka topics, and a centralized logging system (like ELK Stack) consumes the logs for analysis and monitoring.

Kafka's architecture supports various real-world applications, including pub-sub systems, stream processing pipelines, and log aggregation architectures. Its ability to handle high-throughput data streams, provide fault tolerance, and scale horizontally makes it a powerful tool for building robust and scalable data-driven applications.

Advantages of Kafka Architecture

  • Decoupling of Producers and Consumers: Kafka decouples producers and consumers, allowing them to operate independently. This makes it easier to scale and manage the system.
  • Ordered and Immutable Logs: Kafka maintains the order of records within a partition and ensures that records are immutable. This guarantees the integrity and consistency of the data.
  • High Availability: Kafka's replication and fault tolerance mechanisms ensure high availability and reliability of the data.

Conclusion

Apache Kafka's architecture is designed to handle real-time data streaming and processing with high throughput, low latency, fault tolerance, and scalability. Its core components, including brokers, topics, partitions, producers, consumers, and ZooKeeper, work together to provide a robust and efficient data streaming platform. Kafka's APIs, such as the Producer, Consumer, Streams, and Connector APIs, offer powerful tools for building real-time data pipelines and applications. With its ability to handle large volumes of data and provide real-time processing, Kafka is a valuable tool for modern data-driven applications.
