2. Key Concepts
1. Event (Message):
o A unit of data written to Kafka (like a log entry or JSON record).
o Example: "User A purchased item X."
2. Producer:
o Sends events (messages) into Kafka topics.
3. Consumer:
o Reads events from Kafka topics.
4. Topics:
o A named category in Kafka where messages are stored.
o Example: A topic called purchases stores all purchase events.
5. Partitions:
o Each topic is split into partitions for scalability.
o Messages in partitions are ordered.
6. Offset:
o A unique number identifying the position of a message in a partition.
7. Broker:
o A Kafka server that stores data and serves client requests.
o Kafka is a cluster of multiple brokers.
8. ZooKeeper:
o Coordinates the Kafka cluster by managing metadata, leader elections, etc.
o Not required in newer versions: Kafka 2.8+ can run in KRaft mode instead, and Kafka 4.0 removes ZooKeeper entirely.
9. Consumer Group:
o A group of consumers that work together to consume messages from a topic.
10. Kafka Connect:
o A tool for moving data between Kafka and external systems.
3. Kafka Architecture
1. Producers:
o Write messages to a topic.
2. Topics and Partitions:
o Topics are split into partitions to handle large-scale data.
o Messages in a partition are immutable and ordered.
3. Consumers:
o Read messages from topics in a pull-based model.
4. Brokers:
o Kafka servers that store topic data and respond to client requests.
5. Replication:
o Kafka replicates topic partitions across brokers to ensure fault tolerance.
6. ZooKeeper:
o Maintains cluster state and handles leader election.
o (Optional in Kafka 2.8+)
4. Key Functionalities
1. Publish-Subscribe:
o Kafka uses a topic-based publish-subscribe model.
2. Fault Tolerance:
o Data is replicated across brokers to ensure reliability.
3. Durability:
o Kafka stores data on disk, making it highly durable.
4. Scalability:
o Kafka handles large-scale data by scaling brokers and partitions.
5. Real-Time Processing:
o Kafka processes messages in near real-time.
5. Kafka Components in Detail
1. Installation and Setup
a) Create the Kafka service file (e.g., /etc/systemd/system/kafka.service):
[Unit]
Description=Apache Kafka Server
After=network.target zookeeper.service
[Service]
User=kafka
Group=kafka
ExecStart=/path/to/kafka/bin/kafka-server-start.sh /path/to/kafka/config/server.properties
ExecStop=/path/to/kafka/bin/kafka-server-stop.sh
Restart=on-failure
[Install]
WantedBy=multi-user.target
b) Enable Kafka Service:
sudo systemctl enable kafka
c) Start Kafka Service:
sudo systemctl start kafka
d) Check Kafka Service Status:
sudo systemctl status kafka
a) Create the ZooKeeper service file (e.g., /etc/systemd/system/zookeeper.service):
[Unit]
Description=Apache ZooKeeper Server
After=network.target
[Service]
Type=forking
User=zookeeper
Group=zookeeper
ExecStart=/path/to/zookeeper/bin/zkServer.sh start
ExecStop=/path/to/zookeeper/bin/zkServer.sh stop
Restart=on-failure
[Install]
WantedBy=multi-user.target
b) Enable ZooKeeper Service:
sudo systemctl enable zookeeper
c) Start ZooKeeper Service:
sudo systemctl start zookeeper
d) Check ZooKeeper Service Status:
sudo systemctl status zookeeper
2. Topics
Definition: A topic is like a folder where Kafka stores messages. Producers write to topics, and consumers
read from topics.
• Example:
o Topic name: orders
o Messages:
▪ { "order_id": 1, "user": "John", "amount": 250 }
▪ { "order_id": 2, "user": "Alice", "amount": 500 }
3. Partitions
Definition: Each topic is divided into partitions for scalability and parallelism.
• Key Points:
o Messages in a partition are ordered.
o Partitions are distributed across Kafka brokers.
o A topic can have multiple partitions.
Example:
• Topic: orders with 3 partitions.
• Partition 0: { "order_id": 1 }
• Partition 1: { "order_id": 2 }
• Partition 2: { "order_id": 3 }
Messages are assigned to partitions based on:
1. Key hashing: partition = hash(key) % number_of_partitions (Kafka hashes the key bytes with murmur2; see the sketch below).
2. Round-robin (if no key is provided).
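For non-null keys, the default partitioner hashes the key bytes with murmur2 (not Java's hashCode). A minimal sketch of that assignment logic, with an illustrative key and partition count:
java
// org.apache.kafka.common.utils.Utils ships with kafka-clients
byte[] keyBytes = "order-1".getBytes(StandardCharsets.UTF_8);
int numPartitions = 3;
int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
System.out.println("Key maps to partition " + partition);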
4. Producers
Definition: Producers send events (messages) to Kafka topics.
• Example: A payment gateway produces messages:
o Topic: transactions
o Messages:
▪ { "transaction_id": 101, "status": "success" }
▪ { "transaction_id": 102, "status": "failed" }
How Producers Work:
• Use Kafka’s Producer API.
• Specify:
o Topic: Where the message should go.
o Key (optional): Determines the partition.
o Value: The actual message.
Example Code (Java):
java
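// Minimal producer sketch; the broker address is an assumption, and the
// topic/payload follow the transactions example above.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// The key ("101") determines the partition; the value is the message payload.
producer.send(new ProducerRecord<>("transactions", "101",
        "{ \"transaction_id\": 101, \"status\": \"success\" }"));
producer.close();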
5. Consumers
Definition: Consumers read messages from Kafka topics.
• Example:
o Consumer reads from the orders topic:
▪ { "order_id": 1 }
▪ { "order_id": 2 }
How Consumers Work:
• Use Kafka’s Consumer API.
• Specify:
o Topic: Which topic to read from.
o Group ID: Groups consumers for parallel processing.
o Offset: Controls where to start reading:
▪ earliest (start from the beginning).
▪ latest (start from new messages).
Example Code (Java):
java
// Assumes a broker on localhost:9092 and the orders topic from the example above.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "order-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("orders"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Consumed record: " + record.value());
    }
}
6. Offset
Definition: The position of a message in a partition.
• Example:
o Partition 0 contains:
▪ Offset 0: { "order_id": 1 }
▪ Offset 1: { "order_id": 2 }
Consumers use offsets to track what they’ve read:
• Offset 0 → Read.
• Offset 1 → Next to be read.
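Offsets can also be set explicitly. A sketch that starts reading partition 0 of orders at offset 1 (consumer configured as in the earlier example):
java
TopicPartition p0 = new TopicPartition("orders", 0);
consumer.assign(Collections.singletonList(p0));
consumer.seek(p0, 1);  // the next poll() returns records from offset 1 onward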
7. Consumer Groups
Definition: A set of consumers sharing the workload.
• Example:
o Topic: orders with 3 partitions.
o Consumer Group: order-group with 3 consumers.
▪ Consumer 1 → Reads Partition 0.
▪ Consumer 2 → Reads Partition 1.
▪ Consumer 3 → Reads Partition 2.
If a consumer fails, its partitions are reassigned to the remaining consumers in the group (a rebalance). Consumers in excess of the partition count sit idle.
8. Kafka Brokers
Definition: Kafka servers that store and manage messages.
• Cluster Example:
o Broker 1: Stores Partition 0 of orders.
o Broker 2: Stores Partition 1 of orders.
o Broker 3: Stores Partition 2 of orders.
If one broker goes down, replicas (copies) on other brokers ensure data availability.
9. Replication
Definition: Each partition is copied (replicated) to other brokers for fault tolerance.
• Example:
o Topic: orders with 2 partitions.
o Replication Factor: 2.
o Partition 0:
▪ Leader: Broker 1.
▪ Replica: Broker 2.
o Partition 1:
▪ Leader: Broker 2.
▪ Replica: Broker 3.
The leader handles all reads and writes; follower replicas stay in sync and one of them takes over as leader if the current leader fails.
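Leaders and replicas can be inspected programmatically. A sketch reusing an Admin client as in the earlier topic-creation example (allTopicNames() requires a recent kafka-clients):
java
TopicDescription desc = admin.describeTopics(Collections.singletonList("orders"))
        .allTopicNames().get().get("orders");
for (TopicPartitionInfo p : desc.partitions()) {
    System.out.println("Partition " + p.partition()
            + ": leader=" + p.leader().id() + ", replicas=" + p.replicas());
}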
10. ZooKeeper
Definition: Manages Kafka cluster metadata, leader election, and configurations.
• Tasks:
o Keeps track of brokers.
o Handles partition leader election.
6. Kafka Connect
Definition: Kafka Connect moves data between Kafka and external systems using source connectors (which pull data into Kafka) and sink connectors (which push data out).
Source Connector Example (File):
json
{
"name": "file-source-connector",
"config": {
"connector.class": "FileStreamSource",
"tasks.max": "1",
"file": "/tmp/input.txt",
"topic": "file-topic"
}
}
Sink Connector Example (Database):
json
{
"name": "db-sink-connector",
"config": {
"connector.class": "JDBC",
"connection.url": "jdbc:mysql://localhost:3306/mydb",
"connection.user": "root",
"connection.password": "password",
"topics": "db-topic"
}
}
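Either connector definition is registered by POSTing the JSON to the Kafka Connect REST API (by default on port 8083 at the /connectors path of a Connect worker).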
7. Security in Kafka
1. Encryption:
o TLS (SSL): Secures communication between Kafka clients and brokers.
o Key configurations:
▪ ssl.keystore.location
▪ ssl.truststore.location
2. Authentication:
o SASL (Simple Authentication and Security Layer):
▪ Mechanisms: PLAIN, SCRAM, GSSAPI (Kerberos).
▪ Configurations for SASL/PLAIN:
properties
sasl.mechanism=PLAIN
security.protocol=SASL_SSL
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user" password="pass";
3. Authorization:
o ACLs (Access Control Lists):
▪ Define permissions for topics, consumer groups, etc.
▪ Example (grant a user read access to a topic):
bash
kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:alice --operation Read --topic orders
8. Reliability and Delivery Guarantees
1. Dead Letter Queues (Kafka Connect):
o Route records that repeatedly fail processing to a separate topic:
properties
errors.deadletterqueue.topic.name=my-dlq
errors.deadletterqueue.context.headers.enable=true
2. Idempotence and Exactly-Once Semantics (EOS):
o Ensure messages are not duplicated during retries.
o Idempotent producer:
properties
enable.idempotence=true
o Transactions:
java
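// Requires transactional.id in the producer config (which also enables idempotence).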
producer.initTransactions();
producer.beginTransaction();
producer.send(record);
producer.commitTransaction();
3. Cluster Balancing:
o Tools: Cruise Control for automatic rebalancing and monitoring.
9. Multi-Tenancy
• Namespace Management:
o Use prefixes or separate clusters for tenants.
o Example: Tenant-specific topics (tenantA.orders, tenantB.orders).
• Quota Management:
o Limit producer and consumer throughput:
properties
quota.producer.default=1000000
quota.consumer.default=2000000
• Storage Management:
o Segment size and retention settings:
properties
log.segment.bytes=1073741824
log.retention.hours=168
Practice Questions with Explanations
1. Kafka Topics and Partitions
Question 1:
You have a topic user-logins with 5 partitions and a replication factor of 3. What happens if one broker goes
offline?
• A) All partitions become unavailable.
• B) All partitions will have reduced replication.
• C) Only some partitions will become unavailable.
• D) The cluster will stop accepting writes.
Answer:
• B) All partitions will have reduced replication.
Explanation: With a replication factor of 3, every partition has replicas on three brokers, so each partition loses exactly one replica when a broker goes offline. Leadership fails over to a surviving in-sync replica, so all partitions remain available, but with reduced replication until the broker is restored or replicas are reassigned.
2. Schema Registry
Question 2:
You register a schema for a topic using Schema Registry. What happens if a producer sends a message that
does not match the schema?
• A) The message is rejected.
• B) The message is accepted but logged as a warning.
• C) The consumer fails to deserialize the message.
• D) The message is dropped silently.
Answer:
• A) The message is rejected.
Explanation: The schema-aware serializer validates data against the registered schema on the producer side; if the data does not conform, serialization fails and the message is rejected before it is written to the topic.
Question 3:
Which compatibility modes does Confluent Schema Registry support?
• A) Backward, Forward, Full.
• B) Strict, Relaxed, None.
• C) Additive, Non-Additive, Full.
• D) None of the above.
Answer:
• A) Backward, Forward, Full.
Explanation: Schema Registry supports compatibility modes to ensure smooth schema evolution.
Examples:
o Backward compatibility: consumers using the new schema can read data written with the old schema.
o Forward compatibility: consumers using the old schema can read data written with the new schema.
o Full compatibility: both backward and forward compatibility are maintained.
3. Kafka Connect
Question 4:
You configure a Kafka Connect source connector for a database. However, you notice duplicate messages in
the Kafka topic. What is the likely cause?
• A) The connector tasks are set too high.
• B) The source database has duplicate records.
• C) Offset management is not properly configured.
• D) The topic has too many partitions.
Answer:
• C) Offset management is not properly configured.
Explanation: Kafka Connect periodically commits source offsets to track what it has already read. Its default delivery guarantee is at-least-once, so if offset management fails (e.g., the connector restarts before offsets are committed, or is misconfigured), already-processed records are re-read and duplicates appear in the topic.
Question 5:
What property determines the maximum number of parallel tasks in Kafka Connect?
• A) tasks.parallel
• B) connect.tasks
• C) tasks.max
• D) max.tasks
Answer:
• C) tasks.max.
Explanation: tasks.max defines the number of parallel tasks that a connector can spawn, enabling
parallelism and scalability.
4. Kafka Streams
Question 6:
You are using Kafka Streams to aggregate order totals per user. Which Kafka Streams concept should you
use to store the state of the aggregation?
• A) KStream
• B) GlobalKTable
• C) KTable
• D) Processor API
Answer:
• C) KTable.
Explanation: A KTable is used for aggregations and maintains the latest state of a keyed record. In
this case, it stores the total order amounts per user.
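A minimal sketch of such an aggregation, given a StreamsBuilder named builder (the topic names and plain-numeric value format are assumptions):
java
KTable<String, Double> totalsPerUser = builder
        .stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
        .mapValues(Double::parseDouble)                      // value = order amount
        .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
        .reduce(Double::sum);                                // running total per user key
totalsPerUser.toStream().to("order-totals", Produced.with(Serdes.String(), Serdes.Double()));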
Question 7:
Write a Kafka Streams application to filter out transactions below $100 and write valid transactions to a new
topic.
Answer (Java):
java
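// Sketch answer; assumes String-serialized records where the value is a
// plain numeric amount, and illustrative topic names.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transaction-filter");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> transactions = builder.stream("transactions");
transactions
        .filter((key, value) -> Double.parseDouble(value) >= 100.0)  // keep >= $100
        .to("valid-transactions");

new KafkaStreams(builder.build(), props).start();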
6. Security
Question 10:
You are setting up TLS encryption for Kafka. Which configuration is required on the broker?
• A) ssl.enabled=true
• B) security.protocol=SSL
• C) ssl.keystore.location
• D) Both B and C.
Answer:
• D) Both B and C.
Explanation: For TLS encryption, you must specify the security protocol as SSL and provide the keystore
and truststore locations.
Question 12:
You need to replicate data between two Kafka clusters in different regions. Which tool should you use?
• A) Kafka Streams
• B) Confluent Replicator
• C) MirrorMaker 2
• D) Both B and C.
Answer:
• D) Both B and C.
Explanation:
• Confluent Replicator: Ideal for Confluent Platform users with schema registry support.
• MirrorMaker 2: Open-source solution for geo-replication.
Mock Scenario
Scenario:
You are managing a Kafka cluster with the following requirements:
1. Messages must be replicated across 3 brokers with fault tolerance.
2. Consumer lag should be minimized.
3. Integrate with a database for CDC (Change Data Capture).
Questions:
1. What replication factor should you configure for the topic?
o Answer: 3.
2. How do you monitor and reduce consumer lag?
o Answer: Monitor consumer lag (e.g., with kafka-consumer-groups.sh --describe --group <group>, or via consumer-lag metrics), and ensure consumers are evenly distributed across partitions; add consumers (up to the partition count) if lag persists.
3. Which Kafka Connect plugin is ideal for CDC?
o Answer: Debezium.