INTERNET OF THINGS
TOPIC: APACHE KAFKA
Group members:
Taniya Souza [1DA21CS150]
Srinivasan R [1DA21CS143]
Yashwanth B K [1DA21CS171]
Yashwanth Gowda B [1DA21CS172]
Guide: Prof. Lavanya Santosh, CSE Dept
What is Apache Kafka?
•Apache Kafka is an open-source distributed event-streaming platform.
•Originally developed at LinkedIn; open-sourced through the Apache Incubator
in 2011 and graduated to a top-level Apache project in 2012.
•Designed to handle high-throughput, low-latency, real-time data streams.
Key Features of Kafka
Distributed System: Runs as a cluster of brokers for scalability and fault tolerance.
Durable Storage: Data is stored on disk and replicated across brokers.
High Throughput: Can handle millions of messages per second.
Low Latency: Ensures quick delivery of messages.
Decoupling Systems: Allows independent development and scaling of producers and consumers.
Why Use Kafka?
•Ideal for modern data-driven applications.
•Helps in building real-time analytics systems.
•Serves as a backbone for microservices communication.
•Ensures scalability to handle large datasets.
•Integrates with popular big data frameworks like Spark, Flink, and Hadoop.
Core Functions:
•Publish and Subscribe: Enables real-time messaging between producers and consumers through
topics.
•Durable Storage: Persistently stores data streams on disk, allowing replay and recovery.
•Scalable Partitioning: Divides topics into partitions for parallel and distributed data processing.
•Fault Tolerance: Ensures data availability and reliability through replication across brokers.
•Real-Time Stream Processing: Processes and analyzes data streams in real time using Kafka
Streams or external tools.
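The publish/subscribe and durable-storage ideas above can be illustrated with a tiny in-memory stand-in. This is a sketch, not real Kafka: the class and method names (`MiniTopic`, `publish`, `read`) are invented for illustration, but they mirror the core idea that a topic is a set of append-only partition logs that consumers can replay.

```python
class MiniTopic:
    """Toy stand-in for a Kafka topic: one append-only log per partition."""
    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, message, partition=0):
        """Producer side: append the message and return its offset."""
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Consumer side: data is retained after consumption,
        so any past offset can be re-read (replay/recovery)."""
        return self.partitions[partition][offset:]

topic = MiniTopic("sales")
topic.publish("order-1", partition=0)
topic.publish("order-2", partition=0)
print(topic.read(partition=0, offset=0))  # both messages, still replayable
```

Note how reading never deletes anything: retention is a property of the log, independent of which consumers have already seen the data.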
Kafka Architecture
Overview
• Kafka is a publish-subscribe messaging
system with the following components:
• Producers: Publish messages to topics.
• Consumers: Subscribe to topics to consume
messages.
• Brokers: Manage the storage and retrieval
of messages.
• Topics: Categories to which messages are
published.
• Partitions: Break down topics for scalability.
Kafka Topics
A topic is a logical channel for data streams.
Each topic is divided into partitions for parallel processing.
Data in topics is retained for a configurable period, even after
consumption.
Topics can have configurations for replication and data retention.
Example: A “Sales Data” topic could have partitions based on
regions.
Producers and Consumers
•Producers: Send data to Kafka topics.
•Push messages to specific partitions.
•Can define custom partitioning logic (e.g., based on keys).
•Consumers: Read data from topics.
•Join consumer groups for parallel processing.
•Kafka ensures that each partition is read by one consumer in a group.
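The "one partition per consumer within a group" rule can be sketched with a simple round-robin assignment. This is an illustration only, not Kafka's actual partition assignor (the real broker-side protocol supports pluggable range, round-robin, and sticky strategies):

```python
def assign_partitions(partitions, consumers):
    """Round-robin sketch: every partition goes to exactly one consumer
    in the group, so no two group members ever read the same partition."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions shared by a group of 2 consumers
print(assign_partitions(list(range(6)), ["c1", "c2"]))
```

With more consumers than partitions, the extra consumers simply receive nothing, which is why partition count caps a group's parallelism.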
Brokers and Clusters
• A Kafka cluster consists of multiple brokers.
• Brokers: Handle storage and management of data
streams.
• Each broker handles a subset of partitions.
• Collaborate to provide fault tolerance and scalability.
• Clusters use ZooKeeper (or KRaft, in newer versions)
for managing configurations and leader election.
Kafka Partitions
Topics are divided into partitions to distribute data
and allow parallelism.
Data Placement: Messages in partitions are stored
in the order they arrive.
Key-Based Partitioning: Ensures that messages
with the same key go to the same partition.
Example: A “User Activity” topic could have partitions
for different user IDs.
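Key-based partitioning boils down to hashing the key and taking it modulo the partition count. A minimal sketch (real Kafka clients use murmur2 for this; SHA-256 is used here only as a stable, readily available stand-in):

```python
import hashlib

def partition_for(key, num_partitions):
    """Deterministic hash of the key, so the same key always lands
    in the same partition and keeps its per-key ordering."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("user-42", 6)
p2 = partition_for("user-42", 6)
print(p1 == p2)  # True: all of user-42's events share one partition
```

One caveat worth noting: because the modulus is the partition count, changing the number of partitions changes where keys map, which is why partition counts are rarely changed on keyed topics.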
Offset and Message Ordering
• Offset: A unique identifier for each message in a partition.
• Used to keep track of consumed messages.
• Kafka guarantees message order within a partition but not
across partitions.
• Consumers can reset offsets for reprocessing messages.
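Offset tracking and replay can be shown with a few lines. The `poll` helper below is an invented name, but the mechanics match the slide: the consumer, not the log, owns its position, so resetting the offset replays old messages without touching the data.

```python
log = ["m0", "m1", "m2", "m3"]   # one partition's append-only log
committed = 0                     # this consumer's committed offset

def poll(log, offset, max_records=2):
    """Read from the given offset onward; the log itself is never mutated."""
    batch = log[offset:offset + max_records]
    return batch, offset + len(batch)

batch, committed = poll(log, committed)  # first two messages, offset -> 2
committed = 0                            # "seek" back to the beginning
batch, committed = poll(log, committed)  # replays the same messages
print(batch)
```

Since each consumer group keeps its own offsets, two groups can read the same topic at entirely different positions without interfering.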
Durability and Replication
•Kafka ensures durability by replicating data across brokers.
•Leader Replica: Handles all read and write requests for a partition.
•Follower Replicas: Maintain copies and take over if the leader fails.
•Acknowledgments: Producers can configure how many replicas must confirm
a message before it's considered successful.
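The acknowledgment trade-off can be modeled in a few lines. This is a toy model of the producer's `acks` setting, not the real client API: `produce`, and the replica lists, are invented for illustration, and all replicas are assumed alive and in sync.

```python
def produce(message, leader, followers, acks):
    """Toy model of the Kafka producer `acks` setting:
    acks=1     -> success once the leader replica has the message;
    acks="all" -> success only after every in-sync replica has it too."""
    leader.append(message)            # leader handles the write
    for follower in followers:
        follower.append(message)      # followers copy from the leader
    needed = 1 if acks == 1 else 1 + len(followers)
    confirmed = 1 + len(followers)    # every replica is alive in this sketch
    return confirmed >= needed

leader, f1, f2 = [], [], []
ok = produce("evt-1", leader, [f1, f2], acks="all")
print(ok, leader == f1 == f2)  # True True: confirmed and fully replicated
```

The trade-off: `acks=1` is faster but can lose a message if the leader dies before replication; `acks="all"` waits for the in-sync replicas and survives a leader failure.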
Use Cases of Kafka
•Real-Time Analytics:
Monitor and analyze social media feeds or website activities.
•Log Aggregation:
Centralized logging from distributed systems.
•Event Sourcing:
Capture application changes as a sequence of events.
•Data Integration:
Sync databases and applications.
•Stream Processing:
Process and analyze data in real-time with Kafka Streams or other tools.
Advantages of Kafka
•Scalability: Can scale horizontally by adding brokers.
•Flexibility: Works with multiple programming languages.
•Resilience: Fault-tolerant with replication and partitioning.
•Performance: Handles millions of events per second with low latency.
•Integration: Seamlessly integrates with popular tools like Spark and Flink.
Challenges with Kafka
Complex Setup: Requires expertise to configure and maintain.
Resource-Intensive: High memory usage for durability and performance.
Message Duplication: Can occur without proper configuration.
Operational Overhead: ZooKeeper dependency in older versions.
SUMMARY
Apache Kafka is a distributed platform for real-time data streaming and
processing, designed for high-throughput, low-latency, and fault-tolerant
communication.
Kafka uses topics for organizing data, partitions for scalability, and
replication for reliability, enabling efficient handling of massive data
streams.
Common applications include real-time analytics, event-driven
architectures, log aggregation, and data integration between diverse
systems.
THANK YOU