Apache Kafka
• kafka-replica-verification.sh --broker-list
br1,br2 --topic-white-list 'my-.*'
Topics, Partitions and offsets
● Topics: a particular stream of data
− Similar to a table in a database (without all the
constraints)
− You can have as many topics as you want
− A topic is identified by its name
● Topics are split into partitions
− Each partition is ordered
− Each message within a partition gets an incremental
ID, called an offset
Topic example
● Say you have a fleet of trucks; each truck reports
its GPS position to Kafka.
● You can have a topic "trucks_gps" that contains
the position of all trucks.
● Each truck will send a message to Kafka every 20
seconds; each message will contain the truck ID
and the truck position (latitude and longitude)
● We choose to create that topic with 10 partitions
(an arbitrary number)
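The truck example can be sketched as a producer-side message; this is a minimal illustration, assuming a JSON payload whose field names are hypothetical, with the truck ID as the message key so all positions of one truck land in the same partition:

```python
import json

def build_truck_message(truck_id, latitude, longitude):
    """Build a (key, value) pair for the trucks_gps topic from the example.

    Key = truck ID (keeps one truck's messages in one partition);
    value = the GPS position serialized as JSON.
    """
    key = str(truck_id).encode("utf-8")
    value = json.dumps({
        "truck_id": truck_id,       # illustrative field names
        "latitude": latitude,
        "longitude": longitude,
    }).encode("utf-8")
    return key, value

key, value = build_truck_message("truck_42", 48.8566, 2.3522)
```

A real producer would pass this key and value to its send call; the point here is only the key/value split.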
Topics, Partitions and offsets
● Offsets only have meaning within a specific partition
− E.g. offset 3 in partition 0 doesn't represent the same data as
offset 3 in partition 1
● Order is guaranteed only within a partition (not across
partitions)
● Data is kept only for a limited time (the default retention is one week)
● Once data is written to a partition, it can't be changed
(immutability)
● Data is assigned randomly to a partition unless a key is
provided (more on this later)
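A partition behaves like an append-only log where each message gets the next incremental offset. A toy in-memory sketch (not the real broker implementation) of the two bullets above:

```python
class Partition:
    """Toy model of one topic partition: an append-only, immutable log."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Append a message and return the offset it was assigned."""
        offset = len(self._log)     # offsets are incremental, starting at 0
        self._log.append(message)
        return offset

    def read(self, offset):
        """Read the message stored at a given offset."""
        return self._log[offset]

p0, p1 = Partition(), Partition()
p0.append("a")   # offset 0 in partition 0
p1.append("b")   # offset 0 in partition 1 -- same offset, different data
```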
Brokers
● A Kafka cluster is composed of multiple brokers (servers)
● Each broker is identified by its ID (an integer)
● Each broker contains certain topic partitions
● After connecting to any broker (called a bootstrap
broker), you will be connected to the entire cluster
● A good number to get started is 3 brokers, but big
clusters have over 100 brokers
● In these examples we choose to number brokers starting
at 100 (arbitrary)
Topic replication factor
● Topics should have a replication factor > 1
(usually between 2 and 3)
● This way if a broker is down, another broker can
serve the data
● Example: Topic-A with 2 partitions and replication
factor of 2
Topic replication factor
● Example: we lost Broker 102
● Result: Broker 101 & 103 can still serve the data
Concept of Leader for a Partition
● At any time only one broker can be a leader for a
given partition
● Only that leader can receive and serve data for a
partition
● The other brokers will synchronize the data
● Therefore each partition has one leader and
multiple ISRs (in-sync replicas)
Producers
● Producers write data to topics (which are made of
partitions)
● Producers automatically know which broker
and partition to write to
● In case of Broker failures, Producers will
automatically recover
Producers
● Producers can choose to receive
acknowledgment of data writes:
− acks=0: Producer won't wait for acknowledgment
(possible data loss)
− acks=1: Producer will wait for the leader's
acknowledgment (limited data loss)
− acks=all: Leader+replicas acknowledgment (no data
loss)
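These acknowledgment modes map directly to producer configuration; a sketch of a producer.properties fragment (the broker addresses are placeholders):

```properties
# Placeholder bootstrap brokers
bootstrap.servers=br1:9092,br2:9092
# Wait for the leader and all in-sync replicas to acknowledge each write
acks=all
# Retry transient failures instead of losing the message
retries=3
```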
Producers: Message Keys
● Producers can choose to send a key with the
message (string, number, etc.)
● If key=null, data is sent round-robin (broker 101,
then 102, then 103...)
● If a key is sent, then all messages for that key will
always go to the same partition
● A key is basically sent if you need message
ordering for a specific field (e.g. truck_id)
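Key-based routing boils down to hashing the key modulo the partition count. Kafka's default partitioner actually uses murmur2; the sketch below substitutes CRC32 from the Python standard library purely to stay self-contained:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition for a keyed message (CRC32 stands in for Kafka's murmur2)."""
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition (for a fixed partition count)
assert partition_for(b"truck_42", 10) == partition_for(b"truck_42", 10)
```

This is also why the Kafka Guarantees section says the key-to-partition mapping only holds while the number of partitions stays constant: changing the modulus changes the mapping.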
Consumers
● Consumers read data from a topic (identified by
name)
● Consumers know which broker to read from
● In case of broker failures, Consumers know how to
recover
● Data is read in order within each partition
Consumer Groups
● Consumers can read data in consumer groups
● Each consumer within a group reads from
exclusive partitions
● If you have more consumers than partitions,
some consumers will be inactive
Consumer Groups
What if too many consumers?
● If you have more consumers than partitions,
some consumers will be inactive
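The partition-to-consumer mapping inside a group can be sketched with a simple round-robin assignment (Kafka ships several real assignment strategies; this is only an illustration of the "exclusive partitions" and "inactive consumers" rules):

```python
def assign(partitions, consumers):
    """Round-robin partitions over consumers; surplus consumers stay idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: each partition is read by exactly one consumer
print(assign([0, 1, 2], ["c1", "c2"]))     # {'c1': [0, 2], 'c2': [1]}

# 2 partitions, 3 consumers: c3 gets nothing, i.e. it is inactive
print(assign([0, 1], ["c1", "c2", "c3"]))  # {'c1': [0], 'c2': [1], 'c3': []}
```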
Consumer Offsets
● Kafka stores the offsets at which a consumer group
has been reading
● The committed offsets live in a Kafka topic named
__consumer_offsets
● When a consumer in a group has processed data
received from Kafka, it should commit the offsets
● If a consumer dies, it will be able to read back from
where it left off thanks to the committed consumer
offsets!
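Offset commit and crash recovery can be sketched in a few lines; here a plain dict stands in for the __consumer_offsets topic:

```python
# Stand-in for the __consumer_offsets topic:
# (group, topic, partition) -> next offset to read
committed = {}

def commit(group, topic, partition, offset):
    """Record that everything before `offset` has been processed."""
    committed[(group, topic, partition)] = offset

def resume_position(group, topic, partition):
    """Where a restarted consumer should continue (0 if nothing committed)."""
    return committed.get((group, topic, partition), 0)

commit("my-group", "trucks_gps", 0, 42)
# After a crash, the consumer resumes at offset 42, not at 0
assert resume_position("my-group", "trucks_gps", 0) == 42
```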
Delivery Semantics for consumers
● Consumers choose when to commit offsets
● There are 3 delivery semantics:
● At most once:
− Offsets are committed as soon as the message is received
− If the processing goes wrong, the message will be lost (it won't be read again)
● At least once (usually preferred):
− Offsets are committed after the message is processed
− If the processing goes wrong, the message will be read again
− This can result in duplicate processing of messages. Make sure your processing is
idempotent (i.e. processing the messages again won't impact your systems)
● Exactly once:
− Can be achieved for Kafka => Kafka workflows using the Kafka Streams API
− For Kafka => external system workflows, use an idempotent consumer
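The "make your processing idempotent" advice for at-least-once delivery can be sketched by de-duplicating on a message ID before applying any side effect (the message shape and ID field here are illustrative):

```python
processed_ids = set()   # in practice this would live in a durable store
results = []            # stands in for the side effect (e.g. a DB write)

def handle(message):
    """Process a message safely under at-least-once: redeliveries are no-ops."""
    if message["id"] in processed_ids:
        return                        # duplicate delivery: skip the side effect
    processed_ids.add(message["id"])
    results.append(message["payload"])

handle({"id": 1, "payload": "a"})
handle({"id": 1, "payload": "a"})     # redelivered after a failed offset commit
handle({"id": 2, "payload": "b"})
assert results == ["a", "b"]          # each message took effect exactly once
```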
Kafka Broker Discovery
● Every Kafka broker is also called a "bootstrap
server"
● That means that you only need to connect to one
broker, and you will be connected to the entire
cluster.
● Each broker knows about all brokers, topics and
partitions (metadata)
Zookeeper
● Zookeeper manages brokers (keeps a list of them)
● Zookeeper helps in performing leader election for partitions
● Zookeeper sends notifications to Kafka in case of changes (e.g.
new topic, broker dies, broker comes up, topic deleted, etc.)
● Kafka can't work without Zookeeper
● Zookeeper by design operates with an odd number of servers (3,
5, 7)
● Zookeeper has a leader (handles writes); the rest of the servers are
followers (handle reads)
● (Zookeeper does NOT store consumer offsets with Kafka > v0.10)
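The odd server counts come from majority quorum: an ensemble of N servers can tolerate the loss of (N - 1) // 2 of them, so an even-sized ensemble buys no extra fault tolerance. A quick check:

```python
def tolerated_failures(ensemble_size):
    """A quorum needs a strict majority, so (N - 1) // 2 servers may fail."""
    return (ensemble_size - 1) // 2

for n in (3, 4, 5, 7):
    print(n, "servers tolerate", tolerated_failures(n), "failures")
# 4 servers tolerate no more failures than 3 -- hence the odd sizes
```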
Kafka Guarantees
● Messages are appended to a topic-partition in the order they are
sent
● Consumers read messages in the order stored in a topic-partition
● With a replication factor of N, producers and consumers can
tolerate up to N-1 brokers being down
● This is why a replication factor of 3 is a good idea:
− Allows for one broker to be taken down for maintenance
− Allows for another broker to be taken down unexpectedly
● As long as the number of partitions remains constant for a topic
(no new partitions), the same key will always go to the same
partition
Security
• Man-in-the-middle Attacks
• Encrypt
• Authorization
• Identity
• Authentication
• Role
• JAAS
• Principal and Role
• SSL
• Certificates and Keys
• SASL
• Simple Authentication and Security Layer
• SASL PLAIN: username and password stored on the Kafka brokers
• SASL SCRAM: credentials stored as salted hashes in Zookeeper (no dependency on the brokers)
• SASL GSSAPI: Kerberos-based (e.g. Active Directory)
ACL
• ACL to allow
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --add
• --allow-principal User:Bob
• --allow-principal User:Alice
• --allow-host Host1,Host2
• --operation Read
• --topic Test-topic
ACL
• ACL to deny
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --add
• --allow-principal User:*
• --allow-host *
• --deny-principal User:BadBob
• --deny-host bad-host
• --operation Read
• --topic Test-topic
ACL
• Removing ACL
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --remove
• --allow-principal User:Bob
• --allow-principal User:Alice
• --allow-host Host1,Host2
• --operation Read
• --topic Test-topic
Security
• Listing the ACL
• bin/kafka-acls.sh
• --authorizer kafka.security.auth.SimpleAclAuthorizer
• --authorizer-properties zookeeper.connect=localhost:2181
• --list
• --topic Test-topic
Security
• Convenient Methods
• bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer
--authorizer-properties zookeeper.connect=localhost:2181
--add --allow-principal User:Bob --producer --topic Test-topic
• bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer
--authorizer-properties zookeeper.connect=localhost:2181
--add --allow-principal User:Bob --consumer --topic test-topic
--group Group-1
JMX Metrics
• Download JMXTERM
• https://docs.cyclopsgroup.org/jmxterm
• Start the interactive shell
• java -jar jmxterm-1.0.0-uber.jar