Introduction to Kafka
and Kafka running in EC2
Kafka growing
What is Kafka?
Kafka’s growth is exploding: more than one-third of all Fortune 500 companies use Kafka. These companies include the top ten travel companies, seven of the top ten banks, eight of the top ten insurance companies, nine of the top ten telecom companies, and many more. LinkedIn, Microsoft, and Netflix process four-comma messages a day with Kafka (1,000,000,000,000, i.e., more than a trillion). Kafka is used for real-time streams of data, to collect big data, or to do real-time analysis (or both). Kafka is used with in-memory microservices to provide durability, and it can be used to feed events to CEP (complex event processing) systems and IoT/IFTTT-style automation systems.
Kafka often gets used in real-time streaming data architectures to provide real-time analytics. Since Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, Kafka is used in use cases where JMS, RabbitMQ, and AMQP may not even be considered due to volume and responsiveness. Kafka has higher throughput, reliability, and replication characteristics, which make it applicable for things like tracking service calls (it tracks every call) or tracking IoT sensor data, where a traditional MOM might not be considered.
Kafka works with Flume/Flafka, Spark Streaming, Storm, HBase, Flink, and Spark for real-time ingesting, analysis, and processing of streaming data. Kafka is a data stream used to feed Hadoop big data lakes. Kafka brokers support massive message streams for low-latency analysis in Hadoop or Spark. Also, Kafka Streams (a subproject) can be used for real-time analytics.
In short, Kafka gets used for stream processing, website activity tracking, metrics collection and monitoring, log aggregation, real-time analytics, CEP systems, ingesting data into Spark, ingesting data into Hadoop, CQRS, message replay, error recovery, and as a guaranteed distributed commit log for in-memory computing (microservices).
A lot of large companies that handle a lot of data use Kafka. LinkedIn, where it originated, uses it to track activity data and operational metrics. Twitter uses it as part of Storm to provide a stream processing infrastructure. Square uses Kafka as a bus to move all system events to various Square data centers (logs, custom events, metrics, and so on), with outputs to Splunk and Graphite (dashboards), and to implement an Esper-like/CEP alerting system. It gets used by other companies too, like Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, CloudFlare, Netflix, and many more.
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
The most important reason: Kafka’s great performance (throughput and latency), obtained through great engineering
Kafka has operational simplicity. Kafka is easy to set up and use, and it is easy to reason about how Kafka works. However, the main reason Kafka is very popular is its excellent performance. It has other characteristics as well, but so do other messaging systems. Kafka has great performance, is stable, provides reliable durability, has a flexible publish-subscribe/queue that scales well with N consumer groups, has robust replication, provides producers with tunable consistency guarantees, and preserves ordering at the shard level (the Kafka topic partition). In addition, Kafka works well with systems that have data streams to process, and it enables those systems to aggregate, transform, and load into other stores. But none of those characteristics would matter if Kafka were slow. The most important reason Kafka is popular is Kafka’s exceptional performance.
Kafka relies heavily on the OS kernel to move data around quickly, which keeps the JVM heap size small. It relies on the principle of zero copy. Kafka enables you to batch data records into chunks. These batches of data can be seen end to end from the producer to the file system (Kafka topic log) to the consumer. Batching allows for more efficient data compression and reduces I/O latency between consumers and producers. Kafka writes the immutable commit log to disk sequentially, thus avoiding random disk access and slow disk seeks. Kafka provides horizontal scale through sharding. It shards a topic log into hundreds, potentially thousands, of partitions. This sharding allows Kafka to handle massive load.
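As a rough sketch of how batching and compression get configured in practice, the settings below can be passed to the Kafka console producer (used in the labs later in this material); the broker address, topic name, and the specific values are assumptions for illustration only:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: enable producer-side batching and compression.
#   batch.size       - max bytes per batch sent to a partition
#   linger.ms        - wait up to 10 ms to fill a batch before sending
#   compression.type - compress whole batches (snappy assumed here)
kafka/bin/kafka-console-producer.sh \
  --broker-list localhost:9092 \
  --topic my-topic \
  --producer-property batch.size=65536 \
  --producer-property linger.ms=10 \
  --producer-property compression.type=snappy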
Streaming Architecture
Kafka gets used most often for real-time streaming of data into other systems. Kafka is a middle layer that decouples your real-time data pipelines. The Kafka core is not good for direct computations such as data aggregations or CEP. Kafka Streams, which is part of the Kafka ecosystem, does provide the ability to do real-time analytics. Kafka can be used to feed fast-lane systems (real-time and operational data systems) like Storm, Flink, Spark Streaming, and your services and CEP systems. Kafka is also used to stream data for batch data analysis. Kafka feeds Hadoop. It streams data into your big data platform or into an RDBMS, Cassandra, Spark, or even S3 for some future data analysis. These data stores often support data analysis, reporting, data science crunching, compliance auditing, and backups.
Kafka is fast because it avoids copying buffers in memory (zero copy) and streams data to immutable logs instead of using random access.
Kafka is used to feed data lakes like Hadoop, and to feed real-time analytics systems like Flink, Storm, and Spark Streaming.
Kafka feeds data to real-time analytics systems like Storm, Spark Streaming, Flink, and Kafka Streams.
The top ten travel companies, 7 of the top ten banks, 8 of the top ten insurance companies, 9 of the top ten telecom companies, LinkedIn, Microsoft, Netflix, and many more companies use Kafka.
What is Kafka?
❖ Distributed Streaming Platform
❖ Publish and Subscribe to streams of records
❖ Fault tolerant storage
❖ Replicates Topic Log Partitions to multiple servers
❖ Process records as they occur
❖ Fast, efficient IO, batching, compression, and more
❖ Used to decouple data streams
Kafka is a distributed streaming platform that is used to publish and subscribe to streams of records. Kafka gets used for fault-tolerant storage. Kafka replicates topic log partitions to multiple servers. Kafka is designed to allow your apps to process records as they occur. Kafka is fast and uses IO efficiently by batching and compressing records. Kafka gets used for decoupling data streams. Kafka is used to stream data into data lakes, applications, and real-time stream analytics systems.
Kafka decouples streams of data by allowing multiple consumer groups that can each control where they are in the topic partitions. The producers don’t know about the consumers. Since the Kafka broker delegates the log partition offset (where the consumer is in the record stream) to the clients (consumers), message consumption is flexible. This allows you to feed your high-latency daily or hourly data analysis in Spark and Hadoop at the same time you are feeding microservices real-time messages, sending events to your CEP system, and feeding data to your real-time analytics systems.
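To see that each consumer group really does track its own position, you can inspect a group's offsets with the kafka-consumer-groups.sh tool that ships with Kafka. A minimal sketch, assuming a broker on localhost:9092 and a group named mygroup (depending on your Kafka version you may also need to pass --new-consumer):
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: show current offset, log-end offset, and lag per partition for one
# consumer group. Each group keeps its own offsets, so a slow batch consumer
# and a fast real-time consumer never interfere with each other.
kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group mygroup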
Kafka communication between clients and servers uses a wire protocol over TCP that is versioned and documented. Kafka promises to maintain backward compatibility with older clients, and many languages are supported. The Kafka ecosystem also provides a REST proxy that allows easy integration via HTTP and JSON. Kafka also supports Avro via the Confluent Schema Registry for Kafka. Avro and the Schema Registry allow complex records to be produced and read by clients in many programming languages and allow for the evolution of the records. Kafka is truly polyglot.
Kafka Usage
❖ Build real-time streaming applications that react to streams
❖ Real-time data analytics
❖ Transform, react, aggregate, join real-time data flows
❖ Feed events to CEP for complex event processing
❖ Feed data lakes
❖ Build real-time streaming data pipelines
❖ Enable in-memory microservices (actors, Akka, Vert.x, QBit, RxJava)
Kafka allows you to build real-time streaming data pipelines. Kafka enables in-memory microservices (actors, Akka, Baratine.io, QBit, reactors, reactive, Vert.x, RxJava, Spring Reactor). Kafka allows you to build real-time streaming applications that react to streams to do real-time data analytics, transform, react, aggregate, and join real-time data flows, and perform CEP (complex event processing).
You can use Kafka to aid in gathering metrics/KPIs, to aggregate statistics from many sources, to implement event sourcing, and to use it with microservices (in-memory) and actor systems to implement in-memory services (an external commit log for distributed systems).
You can use Kafka to replicate data between nodes, to re-sync nodes, and to restore state. While it is true that Kafka is used for real-time data analytics and stream processing, you can also use it for log aggregation, messaging, click-stream tracking, audit trails, and much more.
In a world where data science and analytics are a big deal, capturing data to feed into your data lakes and real-time analytics systems is a big deal, and since Kafka can hold up to these kinds of strenuous use cases, Kafka is a big deal.
A Kafka cluster retains all published records, and if you don't set a limit, it will keep records until it runs out of disk space. You can set time-based limits (a configurable retention period), size-based limits (configurable based on size), or use compaction (which keeps the latest version of a record by key). You can, for example, set a retention policy of three days, two weeks, or a month. The records in the topic log are available for consumption until discarded by time, size, or compaction. Consumption speed is not impacted by size, as Kafka always writes to the end of the topic log.
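As a sketch of how these retention limits get applied to an existing topic, the kafka-configs.sh tool can set both time-based and size-based limits; the topic name, ZooKeeper address, and the three-day/1 GB values below are assumptions for illustration:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: keep records for three days (retention.ms=259200000) or until a
# partition reaches about 1 GB (retention.bytes), whichever comes first.
kafka/bin/kafka-configs.sh --alter \
  --zookeeper localhost:2181 \
  --entity-type topics --entity-name my-topic \
  --add-config retention.ms=259200000,retention.bytes=1073741824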
Kafka is a good storage system for records/messages. Kafka acts like a high-speed file system for commit log storage and replication. These characteristics make Kafka useful for all manner of applications. Records written to Kafka topics are persisted to disk and replicated to other servers for fault tolerance. Since modern drives are fast and quite large, this fits well and is very useful. Kafka producers can wait on acknowledgment, so messages are durable, as the producer write does not complete until the message replicates. The Kafka disk structure scales well. Modern disk drives have very high throughput when writing in large streaming batches. Also, Kafka clients/consumers can control the read position (offset), which allows for use cases like replaying the log if there was a critical bug (fix the bug and then replay). And since offsets are tracked per consumer group, which we talk about in Kafka Architecture, the consumers can be quite flexible (e.g., replaying the log).
Kafka Review
❖ How does Kafka decouple streams of data?
❖ What are some use cases for Kafka where you work?
❖ What are some common use cases for Kafka?
❖ How is Kafka like a distributed message storage
system?
❖ How does Kafka know when to delete old messages?
❖ Which programming languages does Kafka support?
It decouples streams of data by allowing multiple consumer groups that can each control where they are in the topic partitions. The producers don’t know about the consumers. Since the Kafka broker delegates the log partition offset (where the consumer is in the record stream) to the clients (consumers), message consumption is flexible. This allows you to feed your high-latency daily or hourly data analysis in Spark and Hadoop at the same time you are feeding microservices real-time messages, sending events to your CEP system, and feeding data to your real-time analytics systems.
Kafka feeds data to real-time analytics systems like Storm, Spark Streaming, Flink, and Kafka Streams.
What are some use cases for Kafka where you work?
Kafka Architecture
Kafka Fundamentals
❖ Records have a key (optional), value, and timestamp; immutable
❖ Topic: a stream of records (“/orders”, “/user-signups”), a feed name
❖ Log: topic storage on disk
❖ Partition / Segments (parts of a Topic Log)
❖ Producer: API to produce streams of records
❖ Consumer: API to consume a stream of records
❖ Broker: Kafka server that runs in a Kafka Cluster. Brokers form a cluster. A
cluster consists of many Kafka Brokers on many servers.
❖ ZooKeeper: does coordination of brokers/cluster topology. A consistent file
system for configuration information and leadership election for Broker Topic
Partition Leaders
Kafka consists of records, topics, consumers, producers, brokers, and clusters. Records can have a key (optional), a value, and a timestamp. Kafka records are immutable. A Kafka topic is a stream of records (“/orders”, “/user-signups”). You can think of a topic as a feed name. A topic has a log, which is the topic's storage on disk. A topic log is broken up into partitions and segments. The Kafka Producer API is used to produce streams of data records. The Kafka Consumer API is used to consume a stream of records from Kafka. A broker is a Kafka server that runs in a Kafka cluster. Kafka brokers form a cluster. The Kafka cluster consists of many Kafka brokers on many servers. Broker sometimes refers to more of a logical system or to Kafka as a whole. Kafka uses ZooKeeper to manage the cluster. ZooKeeper is used to coordinate the brokers/cluster topology. ZooKeeper is a consistent file system for configuration information. ZooKeeper gets used for leadership election of broker topic partition leaders.
[Diagram: Producers, a Kafka Cluster with a Topic, and Consumers]
The above shows producers sending records to a Kafka Topic and Consumers consuming records from the Kafka cluster.
Core Kafka
The above is the core Kafka system: Producers, Kafka itself (Topics, Nodes, Cluster, Brokers), and ZooKeeper, which Kafka uses.
Kafka uses ZooKeeper to do leadership election of Kafka Broker and Topic Partition pairs. Kafka uses ZooKeeper to manage service discovery for the Kafka Brokers that form the cluster.
ZooKeeper sends changes of the topology to Kafka, so each node in the cluster knows when a new broker joins, a broker dies, a topic is removed, a topic is added, etc. ZooKeeper provides an in-sync view of the Kafka Cluster configuration.
Kafka producers write to topics. Kafka consumers read from topics. A topic is associated with a log, which is a data structure on disk. Kafka appends records from producers to the end of a topic log. A topic log consists of many partitions that are spread over multiple files, which can be spread over multiple Kafka cluster nodes. Consumers read from Kafka topics at their own cadence and can pick where they are (offset) in the topic log.
Each consumer group tracks the offset from where it left off reading. Kafka distributes topic log partitions over different nodes in a cluster for high performance with horizontal scalability. Spreading partitions aids in writing data quickly. Topic log partitions are Kafka's way to shard reads and writes to the topic log. Also, partitions are needed to have multiple consumers in a consumer group work at the same time. Kafka replicates partitions to many nodes to provide failover.
[Diagram: a producer writing to Partition 0 of a topic log, read by Consumer Group A and Consumer Group B at their own offsets]
How can Kafka scale if multiple producers and consumers read from and write to the same Kafka topic log at the same time? First, Kafka is fast; Kafka writes to the filesystem sequentially, which is fast. On a modern fast drive, Kafka can easily write up to 700 MB of data or more a second. Kafka scales writes and reads by sharding topic logs into partitions (parts of a topic log). Recall that topic logs can be split into multiple partitions, which can be stored on multiple different servers, and those servers can use multiple disks.
Multiple producers can write to different partitions of the same topic. Multiple consumers from multiple consumer groups can read from different partitions efficiently.
Kafka Brokers
Each broker has an ID (number). Kafka brokers contain topic log partitions. Connecting to one broker bootstraps a client to the entire Kafka cluster. For failover, you want to start with at least three to five brokers. A Kafka cluster can have 10, 100, or 1,000 brokers if needed.
Recall that Kafka uses ZooKeeper to form Kafka brokers into a cluster, and each node in the Kafka cluster is called a Kafka broker.
Topic partitions can be replicated across multiple nodes for failover. The topic should have a replication factor greater than 1 (2, or 3). For example, if you are running in AWS, you would want to be able to survive a single availability zone outage.
If one Kafka broker goes down, then a Kafka broker that is an ISR (in-sync replica) can serve data.
Kafka uses replication for failover. Replication of Kafka topic log partitions allows for the failure of a rack or AWS availability zone (AZ). You need a replication factor of at least 3 to survive a single AZ failure. You need to use MirrorMaker, a Kafka utility that ships with Kafka core, for disaster recovery. MirrorMaker replicates a Kafka cluster to another data center or AWS region.
What MirrorMaker does is called mirroring, so as not to be confused with replication. We cover mirroring and disaster recovery quite a bit.
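A minimal sketch of mirroring with the MirrorMaker tool that ships with Kafka is shown below; the property file names, the whitelisted topic, and the assumption that the consumer config points at the source cluster while the producer config points at the target cluster are for illustration only:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: consume from the source cluster (mirror-consumer.properties) and
# republish to the target cluster (mirror-producer.properties) for DR.
kafka/bin/kafka-mirror-maker.sh \
  --consumer.config mirror-consumer.properties \
  --producer.config mirror-producer.properties \
  --whitelist 'my-topic'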
Kafka Review
❖ How does Kafka decouple streams of data?
❖ What are some use cases for Kafka where you work?
❖ What are some common use cases for Kafka?
❖ What is a Topic?
❖ What is a Broker?
❖ What is a Partition? Offset?
❖ Can Kafka run without Zookeeper?
❖ How do you implement failover in Kafka?
❖ How do you implement disaster recovery in Kafka?
Kafka versus MOM and Kinesis
Is Kafka a queue or a publish-and-subscribe system? Yes. It can be both. Kafka is like a queue for consumer groups, which we cover later. Basically, Kafka is a queue system per consumer in a consumer group, so it can do load balancing like JMS, RabbitMQ, etc. Kafka is like topics in JMS, RabbitMQ, and other MOM systems for multiple consumer groups.
Kafka has topics, and producers publish to the topics, and the subscribers (consumer groups) read from the topics. Kafka offers consumer groups, which is a named group of consumers. A consumer group acts as a subscription. Kafka broadcasts to multiple consumer groups using a single queue, as each consumer group tracks its own offset into the log. MOM is message-oriented middleware, and includes JMS implementations such as ActiveMQ, RabbitMQ, IBM MQ Series, and Tibco.
Kafka vs MOM
❖ By design, Kafka is better suited for scale than traditional MOM systems
due to the partitioned topic log
❖ Load divided among Consumers for read by partition
❖ Handles parallel consumers better than traditional MOM
❖ Also by moving location (partition offset) in log to client/consumer side of
equation instead of the broker, less tracking required by Broker and more
flexible consumers
❖ Kafka written with mechanical sympathy, modern hardware, cloud in mind
❖ Disks are faster
❖ Servers have tons of system memory
❖ Easier to spin up servers for scale out
By design, Kafka is better suited for scale than traditional MOM systems due to the partitioned topic log. Kafka can divide load among consumers by partition and send those messages/records in batches. Kafka handles parallel consumers better than traditional MOM and can even handle failover for consumers in a consumer group. Also, by moving the location (partition offset) in the log (message queue) to the client/consumer side of the equation instead of the broker, less tracking is required by the broker and you have more flexible consumers. Kafka was written with mechanical sympathy, modern hardware, and the cloud in mind: disk drives are faster and bigger, servers have tons of system memory, and it is easier to spin up servers for scale-out than when traditional MOM systems were first designed and built. When IBM MQ Series was first released, your mobile phone would have been considered a supercomputer. Expect messaging systems to catch up and for new messaging systems to come out.
Examples of streaming data:
❖ Stock prices
❖ IoT sensors
Kafka works with streaming data too. Kinesis Streams is like Kafka Core. Kinesis Analytics is like Kafka Streams. A Kinesis shard is like a Kafka partition. They are similar and get used in similar use cases. In Kinesis, data is stored in shards. In Kafka, data is stored in partitions.
Data is stored in Kinesis for 24 hours by default, and you can increase that up to 7 days.
Kafka records are stored for 7 days by default, and you can increase that until you run out of disk space. In fact, you can decide by the size of the data or by date. You can even use compaction with Kafka so it only stores the latest record per key in the log.
With Kinesis, data can be analyzed by AWS Lambda before it gets sent to S3 or Redshift.
With Kinesis, you pay for use by buying read and write units.
Kinesis Analytics allows you to perform SQL-like queries on data. Kafka Streams allows you to perform functional aggregations and mutations. Kafka integrates with Flink and Spark, which allow querying streams with SQL-like queries.
Kafka is more flexible than Kinesis, but you have to manage your own clusters, and it requires some dedicated DevOps resources to keep it going. Kinesis is sold as a service and does not require a DevOps team to keep it going; you pay for this. Depending on the use case, Kinesis could be faster to get started with and much easier than hiring staff to manage a Kafka cluster, but with less flexibility and potentially more expense. It really depends on your use case and volume of data which is the better fit.
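As a sketch of the compaction option mentioned above, a topic can be created with cleanup.policy=compact so only the latest record per key is retained; the topic name and counts below are assumptions for illustration:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: a log-compacted topic keeps the most recent value per key instead
# of discarding records purely by time or size.
kafka/bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 13 \
  --topic my-compacted-topic \
  --config cleanup.policy=compact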
Kafka Topics
Producers write to Kafka topics, and subscribers, organized into named groups called consumer groups, read from them. Topics are broken up into partitions for speed and scale.
Topic Partitions
❖ Topics are broken up into partitions
❖ Partitions decided usually by key of record
❖ Key of record determines which partition
❖ Partitions are used to scale Kafka across many servers
❖ Record sent to correct partition by key
❖ Partitions are used to facilitate parallel consumers
❖ Records are consumed in parallel up to the number of partitions
❖ Order guaranteed per partition
❖ Partitions can be replicated to multiple brokers
Kafka breaks topics up into partitions. A record is stored on a partition, usually by record key if the key is present, and round-robin if the key is missing (the default behavior). The record key, by default, determines which partition a producer sends the record to.
Kafka uses partitions to scale a topic across many servers for producer writes. Also, Kafka uses partitions to facilitate parallel consumers. Consumers consume records in parallel up to the number of partitions.
Order is guaranteed per partition. If partitioning by key, then all records for that key will be on the same partition, which is useful if you ever have to replay the log. Kafka can replicate partitions to multiple brokers for failover.
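To see key-based partitioning from the command line, the console producer can parse a key out of each input line; the key separator, topic, and sample input below are assumptions for illustration. All records with the same key (for example, the same employeeId) land on the same partition:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: read "key:value" lines from stdin; the key determines the partition.
kafka/bin/kafka-console-producer.sh \
  --broker-list localhost:9092 \
  --topic my-topic \
  --property parse.key=true \
  --property key.separator=:
# Example input lines:
#   employee-42:clocked-in
#   employee-42:clocked-out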
Kafka maintains record order only within a single partition. A partition is an ordered, immutable record sequence. Kafka continually appends to partitions, using the partition as a structured commit log. Records in partitions are assigned a sequential id number called the offset. The offset identifies each record's location within the partition. Topic partitions allow a Kafka log to scale beyond a size that will fit on a single server. A topic partition must fit on the servers that host it, but a topic can span many partitions hosted on many servers. Also, topic partitions are a unit of parallelism: a partition can only be worked on by one consumer in a consumer group at a time. Consumers can run in their own process or their own thread. If a consumer stops, Kafka spreads its partitions across the remaining consumers in the same consumer group.
[Diagram: anatomy of a topic log, with Partitions 0 through 3 shown as ordered sequences of offsets; writes are appended at the newer end of each partition]
Each partition can be replicated across a configurable number of Kafka servers; this is used for fault tolerance. Each partition has one leader server and zero or more follower servers. Leaders handle all read and write requests for a partition. Followers replicate leaders and take over if the leader dies. Kafka also uses partitions for parallel consumer handling within a group. Kafka distributes topic log partitions over servers in the Kafka cluster. Each server handles its share of data and requests by sharing partition leadership.
Kafka chooses one broker's partition replica as the leader, using ZooKeeper. The broker that has the partition leader handles all reads and writes of records for the partition. Kafka replicates writes to the leader partition to the followers (node/partition pairs). A follower that is in sync is called an ISR (in-sync replica). If a partition leader fails, Kafka chooses a new ISR as the new leader.
The record is considered "committed" when all ISRs for the partition have written it to their logs.
[Diagram: a partition leader and its followers spread across Kafka Broker 0, Kafka Broker 1, and Kafka Broker 2]
Topic Review
❖ What is an ISR?
❖ How does Kafka scale Consumers?
❖ What are leaders? followers?
❖ How does Kafka perform failover for Consumers?
❖ How does Kafka perform failover for Brokers?
What is an ISR? An ISR is an in-sync replica: a follower that is caught up with its partition leader and is therefore eligible to become the new leader if the leader fails.
Kafka scales consumers by partition such that each consumer gets its share of partitions. A consumer can have more than one partition, but a partition can only be used by one consumer in a consumer group at a time. If you only have one partition, then you can only have one consumer.
Leaders perform all reads and writes to a particular topic partition. Followers replicate leaders.
If a consumer in a consumer group dies, the partitions assigned to that consumer are divided up amongst the remaining consumers in that group.
If a broker dies, then Kafka divides up leadership of its topic partitions among the remaining brokers in the cluster.
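Returning to the consumer scaling answer above: if you need more parallel consumers than a topic has partitions, one option is to add partitions with kafka-topics.sh. This is a sketch only; the counts are assumptions, and note that adding partitions changes which partition a given key maps to from that point on:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: grow my-topic from 13 to 20 partitions so that up to 20 consumers
# in a single consumer group can consume in parallel.
kafka/bin/kafka-topics.sh --alter \
  --zookeeper localhost:2181 \
  --topic my-topic \
  --partitions 20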
Kafka Producers
Partition selection
Durability
Kafka Producer Architecture
Kafka Producers
❖ Producers send records to topics
Kafka producers send records to topics. The records are sometimes referred to as messages. The producer picks which partition to send a record to per topic.
The producer can send records round-robin. The producer could also implement a priority system by sending records to certain partitions based on the priority of the record.
Generally speaking, producers send records to a partition based on the record's key. The default partitioner for Java uses a hash of the record's key to choose the partition, or uses a round-robin strategy if the record has no key. The important concept here is that the producer picks the partition.
[Diagram: a producer writing to Partition 0 of a topic log, read by Consumer Group A]
Kafka Producers
❖ Producers write at their own cadence so order of Records
cannot be guaranteed across partitions
❖ Producer configures consistency level (acks=0, acks=all, acks=1)
❖ Producers pick the partition such that records/messages go
to a given partition based on the data
❖ Example have all the events of a certain 'employeeId' go to
same partition
❖ If order within a partition is not needed, a 'Round Robin'
partition strategy can be used so Records are evenly
distributed across partitions.
Producers write at their own cadence, so the order of records cannot be guaranteed across partitions. The producers configure their consistency/durability level (acks=0, acks=all, acks=1), which we will cover later. Producers pick the partition such that records/messages go to a given partition based on the data. For example, you could have all the events of a certain 'employeeId' go to the same partition. If order within a partition is not needed, a 'round robin' partition strategy can be used, so records get evenly distributed across partitions.
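As a rough sketch of the acks setting from the command line, the console producer accepts a required-acks option; the broker, topic, and the value -1 (wait for all in-sync replicas) are assumptions for illustration, and the flag name can vary between Kafka versions:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: producer durability levels.
#    0  - fire and forget (fastest, records can be lost)
#    1  - the partition leader acknowledges the write
#   -1  - all in-sync replicas acknowledge the write (strongest durability)
kafka/bin/kafka-console-producer.sh \
  --broker-list localhost:9092 \
  --topic my-topic \
  --request-required-acks -1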
Producer Review
Yes. A producer could have a burst of records, and a consumer does not have to be on the same page as the producer.
What is the default partition strategy for producers without using a key?
Round-Robin
Records with the same key get sent to the same partition.
You group consumers into a consumer group by use case or function of the group. One consumer group might be responsible for delivering records to high-speed, in-memory microservices while another consumer group is streaming those same records to Hadoop. Consumer groups have names to identify them from other consumer groups.
A consumer group has a unique id. Each consumer group is a subscriber to one or more Kafka topics. Each consumer group maintains its offset per topic partition.
If you need multiple subscribers, then you have multiple consumer groups. A record gets delivered to only one consumer in a consumer group.
Each consumer in a consumer group processes records, and only one consumer in that group will get any given record.
Recall that order is only guaranteed within a single partition. Since records are typically stored by key into a partition, order per partition is sufficient for most use cases.
❖ If Consumer dies, its partitions are split among remaining live Consumers
in Consumer Group
Kafka consumer consumption divides partitions over consumer instances within a consumer group.
Each consumer in the consumer group is an exclusive consumer of a "fair share" of partitions.
Consumer membership within a consumer group is handled by the Kafka protocol dynamically.
If a consumer dies, its partitions are split among the remaining live consumers in the consumer group.
[Diagram: Partition 0 of a topic log being read by Consumer Group A and Consumer Group B at their own offsets]
How does Kafka divide up a topic so multiple consumers in a consumer group can process a topic in parallel? Kafka makes you group consumers into consumer groups with a group id. Consumers with the same id are in the same consumer group.
A single Kafka broker becomes the group coordinator for the consumer group. This broker assigns partitions when new members arrive (note that older Kafka clients would talk directly to ZooKeeper; now the broker does the coordination and uses ZooKeeper in the background). This broker can reassign partitions when group members leave or if the topic configuration changes. When the consumer group is created, the offset is set according to the reset policy of the topic.
Consumers notify the Kafka broker when they have successfully processed a record, which advances the committed offset for that partition.
If a consumer fails before sending the commit offset to the Kafka broker, then a different consumer can continue from the last committed offset.
If a consumer fails after processing the record but before sending the commit to the broker, then some Kafka records could be reprocessed. In this scenario, Kafka implements at-least-once behavior, and you should make sure the records are processed idempotently.
Kafka stores consumer offsets in an internal topic (__consumer_offsets). These topics use log compaction, which means they only save the most recent value per key.
The "log end offset" is the offset of the last record written to a log partition and where producers will write next.
The "high watermark" is the offset of the last record that was successfully replicated to all of the partition's followers.
A consumer only reads up to the "high watermark." Consumers can't read un-replicated data.
Only a single consumer from the same consumer group can access a single partition. If the number of consumers in a group exceeds the partition count, then the extra consumers remain idle. Kafka can use the idle consumers for failover.
[Diagram: a Kafka cluster where Server 1 hosts partitions P2, P3, P4 and Server 2 hosts P0, P1, P5, with consumers C0, C1, C3 from two consumer groups reading from them]
Notice server 1 has topic partitions P2, P3, and P4 while server 2 has partitions P0, P1, and P5.
Notice that Consumer C0 from Consumer Group A is processing records from P0 and P2.
Notice that no single partition is shared by more than one consumer from the same consumer group.
Notice that each consumer gets its fair share of partitions for the topic.
Multi-threaded Consumers
❖ You can run more than one Consumer in a JVM process
❖ If processing records takes a while, a single Consumer can run multiple threads to process
records
❖ Harder to manage offset for each Thread/Task
❖ One Consumer runs multiple threads
❖ 2 messages on same partitions being processed by two different threads
❖ Hard to guarantee order without threads coordination
❖ PREFER: Multiple Consumers can run each processing record batches in their own thread
❖ Easier to manage offset
❖ Each Consumer runs in its thread
❖ Easier to manage failover (each process runs X num of Consumer threads)
You can run more than one Consumer in a JVM process by using threads.
If processing a record takes a while, a single consumer can run multiple threads to process records, but it is harder to manage offsets for each thread/task. If one consumer runs multiple threads, then two messages on the same partition could be processed by two different threads, which makes it hard to guarantee record delivery order without complex thread coordination. This setup might be appropriate if processing a single record takes a long time.
If you need to run multiple consumers, then run each consumer in its own thread. This way Kafka can deliver record batches to the consumer, and the consumer does not have to worry about offset ordering. A thread per consumer makes it easier to manage offsets. It is also simpler to manage failover (each process runs X number of consumer threads), as you can allow Kafka to do the brunt of the work.
Consumer Review
A consumer group is a group of related consumers that perform a task, like putting data into Hadoop or sending messages to a service. Consumer groups each have
unique offsets per partition. Different consumer groups can read from different locations in a partition.
Yes. Each consumer group has its own offset for every partition in the topic, independent of other consumer groups.
A consumer can see a record after the record gets fully replicated to all followers.
What happens if you run multiple consumers in many threads in the same JVM?
Let’s show a simple example using producers and consumers from the Kafka command line.
Download Kafka 0.10.2.x from the Kafka download page. Later versions will likely work, but this example was done with 0.10.2.x.
We unzipped the Kafka download and put it in ~/kafka-training/, and then renamed the Kafka install folder to kafka. Please do the same.
Run Kafka
Next, we are going to run ZooKeeper and then run the Kafka server/broker. We will use some Kafka command line utilities to create Kafka topics, send messages via a producer, and consume messages from the command line.
Run ZooKeeper
Kafka relies on ZooKeeper. To keep things simple, we will use a single ZooKeeper node. Kafka provides a startup script for ZooKeeper called zookeeper-server-start.sh, which is located at ~/kafka-training/kafka/bin/zookeeper-server-start.sh.
The Kafka distribution also provides a ZooKeeper config file which is set up to run ZooKeeper as a single node.
~/kafka-training/run-zookeeper.sh
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/zookeeper-server-start.sh \
kafka/config/zookeeper.properties
Run run-zookeeper.sh
~/kafka-training
$ ./run-zookeeper.sh
Kafka also provides a startup script for the Kafka server called kafka-server-start.sh which is located at ~/kafka-training/kafka/bin/kafka-server-start.sh.
The Kafka distribution also provides a Kafka config file which is set up to run Kafka as a single node, and it points to ZooKeeper running on localhost:2181.
To run Kafka, we create this script in kafka-training and run it in another terminal window.
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-server-start.sh \
kafka/config/server.properties
Run run-kafka.sh
$ ./run-kafka.sh
Now let’s create the topic that we will send records on.
Kafka also provides a utility to work with topics called kafka-topics.sh which is located at ~/kafka-training/kafka/bin/kafka-topics.sh. We will use this tool to create a topic called my-topic with a replication
factor of 1 since we only have one server. We will use thirteen partitions for my-topic, which means we could have up to 13 Kafka consumers.
To create the topic, create this script in kafka-training/lab1 and run it in another terminal window.
#!/usr/bin/env bash
cd ~/kafka-training
# Create a topic
kafka/bin/kafka-topics.sh --create \
--zookeeper localhost:2181 \
--replication-factor 1 --partitions 13 \
--topic my-topic
Run create-topic.sh
~/kafka-training/lab1
$ ./create-topic.sh
List Topics
You can see which topics Kafka is managing using kafka-topics.sh as follows.
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-topics.sh --list \
--zookeeper localhost:2181
Notice that we have to specify the location of the ZooKeeper cluster node which is running on localhost port 2181.
~/kafka-training/lab1
$ ./list-topics.sh
__consumer_offsets
_schemas
my-topic
The Kafka distribution provides a command utility to send messages from the command line. It starts up a terminal window where everything you type is sent to the Kafka topic. Kafka provides the utility kafka-console-producer.sh, which is located at ~/kafka-training/kafka/bin/kafka-console-producer.sh, to send messages to a topic on the command line.
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-console-producer.sh \
--broker-list localhost:9092 \
--topic my-topic
Notice that we specify the Kafka node which is running at localhost:9092. Next, run start-producer-console.sh and send at least four messages.
~/kafka-training/lab1
$ ./start-producer-console.sh
This is message 1
This is message 2
In order to see these messages, we will need to run the consumer console.
The Kafka distribution provides a command utility to see messages from the command line. It displays the messages in various modes. Kafka provides the utility kafka-console-consumer.sh, which is located at ~/kafka-training/kafka/bin/kafka-console-consumer.sh, to receive messages from a topic on the command line. Create the file ~/kafka-training/lab1/start-consumer-console.sh and run it.
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--topic my-topic \
--from-beginning
Notice that we specify the Kafka node which is running at localhost:9092 like we did before, but we also specify --from-beginning to read all of the messages from my-topic from the beginning. Next, run start-consumer-console.sh in another terminal.
$ ./start-consumer-console.sh
This is message 2
This is message 1
Notice that the messages are not coming in order. This is because we only have one consumer so it is reading the messages from all 13 partitions. Order is only
guaranteed within a partition.
Lab review: we used kafka-topics.sh to create the topic, kafka-topics.sh to list topics, kafka-console-producer.sh to send messages, and kafka-console-consumer.sh to read them.
How could we get the messages to come in order from the consumer?
Objectives
❖ Run many Kafka Brokers
❖ Create a replicated topic
❖ Demonstrate Pub / Sub
❖ Demonstrate load balancing of consumers
❖ Demonstrate consumer failover
❖ Demonstrate broker failover
In this lab, we are going to run many Kafka nodes on our development laptop, so you will need at least 16 GB of RAM on your local dev machine. You can run just two servers if you have less than 16 GB of memory. We are going to create a replicated topic. We then demonstrate consumer failover and broker failover. We also demonstrate load balancing of Kafka consumers. We show how, with many consumer groups, Kafka acts like publish/subscribe. But, when we put all of our consumers in the same group, Kafka will load share the messages across the consumers in that group (more like a queue than a topic in a traditional MOM sense).
If not already running, then start up ZooKeeper (./run-zookeeper.sh from the first tutorial). Also, shut down Kafka from the first lab.
Next, you need to copy server properties for three brokers (detailed instructions to follow). Then we will modify these Kafka server properties to add unique Kafka ports,
Kafka log locations, and unique broker ids. Then we will create three scripts to start these servers up using these properties, and then start the servers. Lastly, we create a replicated topic and use it to demonstrate Kafka consumer failover and Kafka broker failover.
In this section, we will copy the existing Kafka server.properties to server-0.properties, server-1.properties, and server-2.properties. Then we change server-0.properties to set log.dirs to "./logs/kafka-0". Then we modify server-1.properties to set the port to 9093, the broker id to 1, and log.dirs to "./logs/kafka-1". Lastly, we modify server-2.properties to use port 9094, broker id 2, and log.dirs "./logs/kafka-2".
~/kafka-training
$ mkdir -p lab2/config
$ cp kafka/config/server.properties lab2/config/server-0.properties
$ cp kafka/config/server.properties lab2/config/server-1.properties
$ cp kafka/config/server.properties lab2/config/server-2.properties
Modify server-x.properties
With your favorite text editor change server-0.properties so that log.dirs is set to ./logs/kafka-0. Leave the rest of the file the same. Make sure log.dirs is only defined once. Modify ~/kafka-training/lab2/config/
server-0.properties as follows:
broker.id=0
port=9092
log.dirs=./logs/kafka-0
...
With your favorite text editor change broker.id, port, and log.dirs of server-1.properties as follows. Modify ~/kafka-training/lab2/config/server-1.properties as follows:
broker.id=1
port=9093
log.dirs=./logs/kafka-1
...
With your favorite text editor change broker.id, port, and log.dirs of server-2.properties as follows. Modify ~/kafka-training/lab2/config/server-2.properties as follows.
broker.id=2
port=9094
log.dirs=./logs/kafka-2
...
Create Startup scripts for three Kafka servers. The startup scripts will just run kafka-server-start.sh with the corresponding properties file.
#!/usr/bin/env bash
CONFIG=`pwd`/config
cd ~/kafka-training
## Run Kafka
kafka/bin/kafka-server-start.sh \
"$CONFIG/server-0.properties"
#!/usr/bin/env bash
CONFIG=`pwd`/config
cd ~/kafka-training
## Run Kafka
kafka/bin/kafka-server-start.sh \
"$CONFIG/server-1.properties"
Create ~/kafka-training/lab2/start-3rd-server.sh like the last two but use server-2.properties for the config.
Notice we are passing the Kafka server properties files that we created in the last step.
Run Servers
$ ./start-1st-server.sh
...
$ ./start-2nd-server.sh
...
$ ./start-3rd-server.sh
Create Kafka replicated topic my-failsafe-topic. Now create a replicated topic that the console producers and console consumers can use.
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-topics.sh --create \
--zookeeper localhost:2181 \
--replication-factor 3 \
--partitions 13 \
--topic my-failsafe-topic
Notice that the replication factor gets set to 3, and the topic name is my-failsafe-topic, and like before it has 13 partitions. Then we just have to run the script to create the topic.
~/kafka-training/lab2
$ ./create-replicated-topic.sh
Start Kafka Consumer that uses the replicated topic. Next, create a script that starts the consumer and then start the consumer with the script (~/kafka-training/lab2/start-consumer-console-replicated.sh) as
follows:
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-console-consumer.sh \
--bootstrap-server localhost:9094,localhost:9092 \
--topic my-failsafe-topic \
--from-beginning
Notice that a list of Kafka servers is passed to the --bootstrap-server parameter. Only two of the three servers that we ran earlier get passed. Even though only one broker is needed, the consumer client will learn about the other brokers from just one server. Usually, you list multiple brokers in case there is an outage, so that the client can still connect.
$ ./start-consumer-console-replicated.sh
❖ Start producer
❖ Pass list of Kafka Brokers
Start Kafka Producer that uses the replicated topic. Next, we create a script that starts the producer. Then launch the producer with the script you create. Create ~/kafka-training/lab2/start-consumer-
producer-replicated.sh as follows:
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-console-producer.sh \
--broker-list localhost:9092,localhost:9093 \
--topic my-failsafe-topic
Notice we start the Kafka producer and pass it a list of Kafka brokers to use via the parameter --broker-list. Now use start-consumer-producer-replicated.sh to launch the producer as follows:
$ ./start-consumer-producer-replicated.sh
Now send some messages from the producer to Kafka and see those messages consumed by the consumer.
~/kafka-training/lab2
$ ./start-consumer-producer-replicated.sh
Hi Mom
Good!
~/kafka-training/lab2
$ ./start-consumer-console-replicated.sh
Hi Mom
Good!
Now start two more consumers in their own terminal windows and then send more messages from the producer. Start the producer as follows:
$ ./start-consumer-producer-replicated.sh
message 1
message 2
message 3
$ ./start-consumer-console-replicated.sh
Hi Mom
message 1
message 2
message 3
..
$ ./start-consumer-console-replicated.sh
Hi Mom
message 2
message 3
Notice that the messages are sent to all of the consumers because each consumer is in a different consumer group.
Now let’s change the consumers to all be in the same consumer group. Stop the producers and the consumers from before, but leave Kafka and ZooKeeper running. Now let’s modify the start-consumer-console-replicated.sh script to add a Kafka consumer group. We want to put all of the consumers in the same consumer group. This way the consumers will share the messages, as each consumer in the consumer group will get its share of partitions. Notice that the script is the same as before except we added --consumer-property group.id=mygroup, which will put every consumer that runs with this script into the mygroup consumer group. Now we just run the producer and three consumers.
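For reference, a sketch of the modified consumer script, assuming the same bootstrap servers as the earlier start-consumer-console-replicated.sh; the added group.id property is the only change described above:
#!/usr/bin/env bash
cd ~/kafka-training
# Same console consumer as before, but every instance that runs this script
# joins the mygroup consumer group, so the topic partitions are shared among
# the running consumers.
kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9094,localhost:9092 \
  --topic my-failsafe-topic \
  --from-beginning \
  --consumer-property group.id=mygroup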
$ ./start-consumer-console-replicated.sh
$ ./start-consumer-producer-replicated.sh
Now send seven messages from the Kafka producer console as follows:
$ ./start-consumer-producer-replicated.sh
m1
m2
m3
m4
m5
m6
m7
Notice that the messages are spread evenly among the consumers.
$ ./start-consumer-console-replicated.sh
m3
m5
$ ./start-consumer-console-replicated.sh
m2
m6
Consumer Failover
❖ Kill one consumer
❖ Send seven more
messages
❖ Load is spread to
remaining consumers
❖ Failover WORKS!
Next, let’s demonstrate consumer failover by killing one of the consumers and sending seven more messages. Kafka should divide up the work to the consumers that are running.
First, kill the third consumer (CTRL-C in the consumer terminal does the trick).
Now send seven more messages with the Kafka console-producer. Notice that the messages are spread evenly among the remaining consumers.
We killed one consumer, sent seven more messages, and saw Kafka spread the load to remaining consumers. Kafka consumer failover works!
You can use kafka-topics.sh to see how the Kafka topic is laid out among the Kafka brokers. The --describe option will show partitions, ISRs, and broker partition leadership.
#!/usr/bin/env bash
cd ~/kafka-training
kafka/bin/kafka-topics.sh --describe \
--topic my-failsafe-topic \
--zookeeper localhost:2181
Let’s run kafka-topics.sh --describe and see the topology of our my-failsafe-topic.
We are going to list which broker owns (is the leader of) which partition, and list the replicas and ISRs of each partition. ISRs are replicas that are up to date. Remember, there are 13 partitions.
Notice how each broker gets a share of the partitions as leaders and followers. Also, see how Kafka replicates the partitions on each broker.
Use Kafka topic describe to see that a new leader was elected!
You can stop the first broker by hitting CTRL-C in the broker terminal or by killing the broker process from another terminal. Now that the first Kafka broker has stopped, let’s use Kafka topics describe to see that new leaders were elected!
Notice how Kafka spreads the leadership over the 2nd and 3rd Kafka brokers.
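One hedged way to stop only the first broker (kafka-server-stop.sh would stop every broker on the machine) and then re-run describe; the grep pattern below is an assumption for illustration:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: find the broker started with server-0.properties, stop it, then
# describe the topic to watch partition leadership move to brokers 1 and 2.
kill $(ps ax | grep -i 'kafka.Kafka' | grep 'server-0.properties' | awk '{print $1}')
kafka/bin/kafka-topics.sh --describe \
  --topic my-failsafe-topic \
  --zookeeper localhost:2181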
Let’s prove that failover worked by sending two more messages from the producer console.
Then notice if the consumers still get the messages. Send the message m15 and m16.
Notice that the messages are spread evenly among the remaining live consumers.
One of the remaining consumers gets m16, and the other gets m15.
Why did the three consumers not load share the messages at first?
They did not load share at first because they were each in a different consumer group. Consumer groups each subscribe to a topic and maintain their own offsets per
partition in that topic.
We shut a consumer down. Then we sent more messages. We observed Kafka spreading messages to the remaining consumers.
We didn’t. We showed failover for Kafka brokers by shutting one down, then using the producer console to send two more messages. Then we saw that the producer
used the remaining Kafka brokers. Those Kafka brokers then delivered the messages to the live consumers.
What tool and option did we use to show ownership of partitions and the ISRs? We used kafka-topics.sh with the --describe option.
Kafka Connect
Kafka Ecosystem
❖ Kafka Streams
❖ Streams API to transform, aggregate, process records from a stream and
produce derivative streams
❖ Kafka Connect
❖ Connector API reusable producers and consumers
❖ (e.g., stream of changes from DynamoDB)
❖ Kafka REST Proxy
❖ Producers and Consumers over REST (HTTP)
❖ Schema Registry - Manages schemas using Avro for Kafka Records
❖ Kafka MirrorMaker - Replicate cluster data to another cluster
The core of Kafka is the brokers, topics, logs, partitions, and cluster. The core also consists of related tools like MirrorMaker. The aforementioned is Kafka as it exists in Apache.
The Kafka ecosystem consists of Kafka Core, Kafka Streams, Kafka Connect, the Kafka REST Proxy, and the Schema Registry. Most of the additional pieces of the Kafka ecosystem come from Confluent and are not part of Apache.
Kafka Streams is the Streams API to transform, aggregate, and process records from a stream and produce derivative streams. Kafka Connect is the connector API to create reusable producers and consumers (e.g., a stream of changes from DynamoDB). The Kafka REST Proxy is used to run producers and consumers over REST (HTTP). The Schema Registry manages schemas using Avro for Kafka records. Kafka MirrorMaker is used to replicate cluster data to another cluster.
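As a minimal sketch of Kafka Connect, the standalone worker that ships with Kafka can run the bundled file source connector; the property file names below ship with the Kafka distribution, but their exact names and contents can differ by version, so treat them as assumptions to verify:
#!/usr/bin/env bash
cd ~/kafka-training
# Sketch: run a standalone Connect worker with the bundled file source
# connector, which tails a text file and publishes each line to a Kafka topic.
kafka/bin/connect-standalone.sh \
  kafka/config/connect-standalone.properties \
  kafka/config/connect-file-source.properties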
Kafka Ecosystem
❖ Example: a video player app sends events for videos watched and videos paused
❖ can aggregate activity of many users to see what new videos are hot
❖ Solves hard problems: out of order records, aggregating/joining across streams, stateful
computations, and more
The Kafka Streams API builds on core Kafka primitives and has a life of its own.
A stream processor takes continual streams of records from input topics, performs some processing, transformation, or aggregation on the input, and produces one or more output streams.
For example, a video player application might take an input stream of events for videos watched and videos paused, output a stream of user preferences, and then gear new video recommendations based on recent user activity, or aggregate the activity of many users to see what new videos are hot.
The Kafka Streams API solves hard problems with out-of-order records, aggregating across multiple streams, joining data from multiple streams, allowing for stateful computations, and more.
[Diagram: apps and databases connected to a Kafka cluster via producers, consumers, and streams]
Kafka Streams enables real-time processing of streams. It can aggregate across multiple streams, join data from multiple streams, allow for stateful computations, and more.
Kafka Connect is the connector API for creating reusable producers and consumers (e.g., a stream of changes from DynamoDB).
Kafka Connect sources are sources of records. Kafka Connect sinks are destinations for records.
The Schema Registry manages schemas using Avro for Kafka records.
The Kafka REST Proxy is used to run producers and consumers over REST (HTTP). You could use it for easy integration with existing code bases.
References
❖ Learning Apache Kafka, Second Edition, by Nishant Garg, 2015, ISBN 978-1784393090, Packt Publishing
❖ Apache Kafka Cookbook, 1st Edition, by Saurabh Minni, 2015, ISBN 978-1785882449, Packt Publishing
❖ Kafka Streams for Stream processing: A few words about how Kafka works,
Serban Balamaci, 2017, Blog: Plain Ol' Java
❖ Kafka official documentation, 2017
❖ Why we need Kafka? Quora
❖ Why is Kafka Popular? Quora
❖ Why is Kafka so Fast? Stackoverflow
❖ Kafka growth exploding (Tech Republic)