Apache Kafka
Documentation
Kafka 1.0 Documentation
Prior releases: 0.7.x, 0.8.0, 0.8.1.X, 0.8.2.X, 0.9.0.X, 0.10.0.X, 0.10.1.X, 0.10.2.X, 0.11.0.X.
1. GETTING STARTED
1.1 Introduction
1.4 Ecosystem
1.5 Upgrading
2. APIS
3. CONFIGURATION
4. DESIGN
4.1 Motivation
4.2 Persistence
4.3 Efficiency
4.7 Replication
4.9 Quotas
5. IMPLEMENTATION
5.2 Messages
5.4 Log
5.5 Distribution
6. OPERATIONS
Modifying topics
Graceful shutdown
Balancing leadership
Decommissioning brokers
6.2 Datacenters
Ext4 Notes
6.6 Monitoring
6.7 ZooKeeper
Stable Version
Operationalization
7. SECURITY
New Clusters
Migrating Clusters
8. KAFKA CONNECT
8.1 Overview
Transformations
REST API
9. KAFKA STREAMS
9.5 Architecture
1. GETTING STARTED
1.1 Introduction
Apache Kafka is a distributed streaming platform. What exactly does that mean?
1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
2. It lets you store streams of records in a fault-tolerant way.
3. It lets you process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
1. Building real-time streaming data pipelines that reliably get data between systems or applications
2. Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts: Kafka is run as a cluster on one or more servers. The Kafka cluster stores streams of records in categories called topics.
Each record consists of a key, a value, and a timestamp.
APIs
Kafka has four core APIs:
1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.
2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
3. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
4. The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.
Topics and Logs
Let's first dive into the core abstraction Kafka provides for a stream of records: the topic.
A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log. The records are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
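For illustration (not part of the original page), here is a minimal sketch of a Java consumer taking control of its own position: it assigns itself a partition and seeks back to the beginning to reprocess old data. The topic name, group id, and broker address are assumptions.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "rewind-demo");                // hypothetical group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Assign a specific partition so this instance fully controls its own offset.
        TopicPartition tp = new TopicPartition("my-topic", 0);
        consumer.assign(Collections.singletonList(tp));

        // Jump back to the earliest retained offset and re-read history.
        consumer.seekToBeginning(Collections.singletonList(tp));
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records)
            System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
        consumer.close();
    }
}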
This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.
Distribution
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
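As a sketch of what that looks like with the Java producer (the topic name, key, and broker address are made up for illustration), records that carry the same key land in the same partition, while keyless records are spread across partitions:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // Records with the same key (a user id here) hash to the same partition,
        // so all events for that user stay ordered within that partition.
        producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /pricing"));
        // A record with no key is spread over partitions to balance load.
        producer.send(new ProducerRecord<>("page-views", "viewed /home"));
        producer.close();
    }
}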
Consumers
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
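A minimal sketch, assuming a hypothetical "page-views" topic and a "billing" group: every instance started with the same group.id shares the topic's partitions, while an instance started under a different group.id receives its own full copy of the stream.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "billing");                    // hypothetical logical subscriber
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("page-views"));
        // Every instance started with group.id "billing" shares the topic's partitions;
        // an instance started with a different group.id gets its own copy of every record.
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            records.forEach(record -> System.out.println(record.value()));
        }
    }
}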
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.
The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
Guarantees
At a high-level Kafka gives the following guarantees:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
More details on these guarantees are given in the design section of the documentation.
Kafka as a Messaging System
How does Kafka's notion of streams compare to a traditional enterprise messaging system?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber: once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka's model is that every topic has both these properties: it can scale processing and is also multi-subscriber, so there is no need to choose one or the other.
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism (the partition) within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
Kafka as a Storage System
Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.
Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
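As an illustrative sketch (the topic and broker address are assumptions), a Java producer can ask for full acknowledgement with acks=all and block on the returned future until the write has been replicated:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class DurableWriteExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        props.put("acks", "all");                           // wait until the write is fully replicated
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // Blocking on the future means this call only returns once the record has been
        // acknowledged according to the acks setting above.
        RecordMetadata meta = producer.send(
                new ProducerRecord<>("audit-log", "order-1001", "created")).get();
        System.out.println("stored durably at offset " + meta.offset());
        producer.close();
    }
}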
The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
For details about Kafka's commit log storage and replication design, please read this page.
Kafka for Stream Processing
It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.
In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.
For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.
It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.
This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.
The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
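A minimal Kafka Streams sketch of the retail example above; the "sales" and "reorders" topic names, the application id, and the broker address are hypothetical, and the filtering logic is only a stand-in for real processing:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ReorderStreamExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reorder-app");       // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read a stream of sales events, keep only low-stock notices, and write reorders.
        KStream<String, String> sales = builder.stream("sales");
        sales.filter((sku, event) -> event.contains("low-stock"))
             .mapValues(event -> "reorder: " + event)
             .to("reorders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}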
Putting the Pieces Together
This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.
A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.
A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.
Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.
By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives.
This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.
Likewise for streaming data pipelines the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably makes it possible to use it for critical data where the delivery of data must be guaranteed or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.
For more information on the guarantees, APIs, and capabilities Kafka provides see the rest of the documentation.
1.2 Use Cases
Here is a description of a few of the popular use cases for Apache Kafka. For an overview of a number of these areas in action, see this blog post.
Messaging
Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.
In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.
In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.
Website Activity Tracking
The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
Activity tracking is often very high volume as many activity messages are generated for each user page view.
Metrics
Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
Log Aggregation
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
Stream Processing
Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing. For example, a processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an "articles" topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users. Such processing pipelines create graphs of real-time data flows based on the individual topics. Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
Event Sourcing
Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
Commit Log
Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to Apache BookKeeper project.
1.3 Quick Start
This tutorial assumes you are starting fresh and have no existing Kafka or ZooKeeper data. Since Kafka console scripts are different for Unix-based and Windows platforms, on Windows platforms use bin\windows\ instead of bin/, and change the script extension to .bat.
Kafka uses ZooKeeper so you need to first start a ZooKeeper server if you don't already have one. You can use the convenience script packaged with kafka to get a quick-and-dirty single-node ZooKeeper instance.
Let's create a topic named "test" with a single partition and only one replica:
We can now see that topic if we run the list topic command:
Alternatively, instead of manually creating topics you can also configure your brokers to auto-create topics when a non-existent topic is published to.
Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message.
Run the producer and then type a few messages into the console to send to the server.
Kafka also has a command line consumer that will dump out messages to standard output.
If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.
All of the command line tools have additional options; running the command with no arguments will display usage information documenting them in more detail.
So far we have been running against a single broker, but that's no fun. For Kafka, a single broker is just a cluster of size one, so nothing much changes other than starting a few more broker instances. But just to get a feel for it, let's expand our cluster to three nodes (still all on our local machine).
First we make a config file for each of the brokers (on Windows use the copy command instead):
Now edit these new files and set the following properties:
config/server-1.properties:
    broker.id=1
    listeners=PLAINTEXT://:9093
    log.dir=/tmp/kafka-logs-1

config/server-2.properties:
    broker.id=2
    listeners=PLAINTEXT://:9094
    log.dir=/tmp/kafka-logs-2
The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.
We already have Zookeeper and our single node started, so we just need to start the two new nodes:
Okay but now that we have a cluster how can we know which broker is doing what? To see that run the "describe topics" command:
Here is an explanation of the output. The first line gives a summary of all the partitions, each additional line gives information about one partition. Since we have only one partition for this topic there is only one line.
"leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selecte
partitions.
"replicas" is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are cu
"isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.
Note that in my example node 1 is the leader for the only partition of the topic.
We can run the same command on the original topic we created to see where it is:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic:test  PartitionCount:1  ReplicationFactor:1  Configs:
    Topic: test  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
So there is no surprise there: the original topic has no replicas and is on server 0, the only server in our cluster when we created it.
Now let's test out fault-tolerance. Broker 1 was acting as the leader so let's kill it:
> ps aux | grep server-1.properties
7564 ttys002    0:15.91 /System/Library/Frameworks/JavaVM.framework/Versions/1.8/Home/bin/java...
> kill -9 7564
On Windows use:
> wmic process where "caption = 'java.exe' and commandline like '%server-1.properties%'" get processid
ProcessId
6016
> taskkill /pid 6016 /f
Leadership has switched to one of the slaves and node 1 is no longer in the in-sync replica set:
But the messages are still available for consumption even though the leader that took the writes originally is down:
Writing data from the console and writing it back to the console is a convenient place to start, but you'll probably want to use data from other sources or export data from Kafka to other systems. For many systems, instead of writing custom integration code you can use Kafka Connect to import or export data.
Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. In this quickstart we'll see how to run Kafka Connect with simple connectors that import data from a file to a Kafka topic and export data from a Kafka topic to a file.
Or on Windows:
Next, we'll start two connectors running in standalone mode, which means they run in a single, local, dedicated process. We provide three configuration files as parameters. The first is always the configuration for the Kafka Connect process, containing common configuration such as the Kafka brokers to connect to and the serialization format for data. The remaining configuration files each specify a connector to create. These files include a unique connector name, the connector class to instantiate, and any other configuration required by the connector.
> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties
These sample configuration files, included with Kafka, use the default local cluster configuration you started earlier and create two connectors: the first is a source connector that reads lines from an input file and produces each to a Kafka topic and the second is a sink connector that reads messages from a Kafka topic and produces each as a line in an output file.
During startup you'll see a number of log messages, including some indicating that the connectors are being instantiated. Once the Kafka Connect process has started, the source connector should start reading lines from test.txt and producing them to the topic connect-test, and the sink connector should start reading messages from the topic connect-test and write them to the file test.sink.txt. We can verify the data has been delivered through the entire pipeline by examining the contents of the output file:
Note that the data is being stored in the Kafka topic connect-test, so we can also run a console consumer to see the data in the topic (or use custom consumer code to process it):
2 {"schema":{"type":"string","optional":false},"payload":"foo"}
3 {"schema":{"type":"string","optional":false},"payload":"bar"}
4 ...
The connectors continue to process data, so we can add data to the file and see it move through the pipeline:
You should see the line appear in the console consumer output and in the sink file.
Kafka Streams is a client library for building mission-critical real-time applications and microservices, where the input and/or output data is stored in Kafka clusters. Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology to make these applications highly scalable, elastic, fault-tolerant, distributed, and much more. This quickstart example will demonstrate how to run a streaming application coded in this library.
1.4 Ecosystem
There are a plethora of tools that integrate with Kafka outside the main distribution. The ecosystem page lists many of these, including stream processing systems, Hadoop integration, monitoring, and deployment tools.
1.5 Upgrading
Upgrading from 0.8.x, 0.9.x, 0.10.0.x, 0.10.1.x, 0.10.2.x or 0.11.0.x to 1.0.0
Kafka 1.0.0 introduces wire protocol changes. By following the recommended rolling upgrade plan below, you guarantee no downtime during the upgrade. However, please review the notable changes in 1.0.0 before upgrading.
For a rolling upgrade:
1. Update server.properties on all brokers and add the following properties. CURRENT_KAFKA_VERSION refers to the version you are upgrading from. CURRENT_MESSAGE_FORMAT_VERSION refers to the message format version currently in use. If you have not overridden the message format previously, then CURRENT_MESSAGE_FORMAT_VERSION should be set to match CURRENT_KAFKA_VERSION.
2. Upgrade the brokers one at a time: shut down the broker, update the code, and restart it.
3. Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 1.0.
4. Restart the brokers one by one for the new protocol version to take effect.
Additional Upgrade Notes:
1. If you are willing to accept downtime, you can simply take all the brokers down, update the code and start them back up. They will start with the new protocol by default.
2. Bumping the protocol version and restarting can be done any time after the brokers are upgraded. It does not have to be immediately after. Similarly for the message format version.
Notable changes in 1.0.0
Topic deletion is now enabled by default, since the functionality is now stable. Users who wish to retain the previous behavior should set the broker config delete.topic.enable to false. Keep in mind that topic deletion removes data and the operation is not reversible (i.e. there is no "undelete" operation).
For topics that support timestamp search, if no offset can be found for a partition, that partition is now included in the search result with a null offset value. Previously, the partition was not included in the map. This change was made to make the search behavior consistent with the case of topics not supporting timestamp search.
If the inter.broker.protocol.version is 1.0 or later, a broker will now stay online to serve replicas on live log directories even if there are offline log directories. A log directory may become offline due to IOException caused by hardware failure. Users need to monitor the per-broker metric offlineLogDirectoryCount to check whether there is an offline log directory (a JMX polling sketch follows this list).
Upgrading your Streams application from 0.11.0 to 1.0.0 does not require a broker upgrade. A Kafka Streams 1.0.0 application can connect to 0.11.0, 0.10.2 and 0.10.1 brokers (it is not possible to connect to 0.10.0 brokers though).
If you are monitoring on streams metrics, you will need to make some changes to the metrics names in your reporting and monitoring code, because the metrics sensor hierarchy was changed.
A few public APIs, including ProcessorContext#schedule(), Processor#punctuate(), KStreamBuilder, and TopologyBuilder, are being deprecated by new APIs. We recommend making the corresponding code changes, which should be very minor since the new APIs look quite similar, when you upgrade.
See Streams API changes in 1.0.0 for more details.
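As a rough sketch of monitoring the offline log directory count mentioned above, the broker's metrics can be polled over JMX. The JMX port and the exact MBean name below are assumptions; verify them against your own broker before relying on this.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class OfflineLogDirCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with remote JMX enabled on port 9999.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // The full MBean name is an assumption based on Kafka's usual naming scheme;
            // check your broker's JMX tree if it differs.
            ObjectName name = new ObjectName("kafka.log:type=LogManager,name=OfflineLogDirectoryCount");
            Object offlineDirs = mbsc.getAttribute(name, "Value");
            System.out.println("Offline log directories: " + offlineDirs);
        } finally {
            connector.close();
        }
    }
}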
Upgrading from 0.8.x, 0.9.x, 0.10.0.x, 0.10.1.x or 0.10.2.x to 0.11.0.0
Kafka 0.11.0.0 introduces a new message format version as well as wire protocol changes. By following the recommended rolling upgrade plan below, you guarantee no downtime during the upgrade. However, please review the notable changes in 0.11.0.0 before upgrading.
Starting with version 0.10.2, Java clients (producer and consumer) have acquired the ability to communicate with older brokers. Version 0.11.0 clients can talk to version 0.10.0 or newer brokers. However, if your brokers are older than 0.10.0, you must upgrade all the brokers in the Kafka cluster before upgrading your clients. Version 0.11.0 brokers support 0.8.x and newer clients.
For a rolling upgrade:
1. Update server.properties on all brokers and add the following properties. CURRENT_KAFKA_VERSION refers to the version you are upgrading from. CURRENT_MESSAGE_FORMAT_VERSION refers to the message format version currently in use. If you have not overridden the message format previously, then CURRENT_MESSAGE_FORMAT_VERSION should be set to match CURRENT_KAFKA_VERSION.
2. Upgrade the brokers one at a time: shut down the broker, update the code, and restart it.
3. Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 0.11.0, but do not change log.message.format.version yet.
4. Restart the brokers one by one for the new protocol version to take effect.
5. Once all (or most) consumers have been upgraded to 0.11.0 or later, then change log.message.format.version to 0.11.0 on each broker and restart them one by one. Note that the older Scala consumer does not support the new message format, so to avoid the performance cost of down-conversion (or to take advantage of exactly once semantics), the new Java consumer must be used.
Additional Upgrade Notes:
1. If you are willing to accept downtime, you can simply take all the brokers down, update the code and start them back up. They will start with the new protocol by default.
2. Bumping the protocol version and restarting can be done any time after the brokers are upgraded. It does not have to be immediately after. Similarly for the message format version.
3. It is also possible to enable the 0.11.0 message format on individual topics using the topic admin tool ( bin/kafka-topics.sh ) before updating the global setting log.message.format.version.
4. If you are upgrading from a version prior to 0.10.0, it is NOT necessary to first update the message format to 0.10.0 before you switch to 0.11.0.
Upgrading your Streams application from 0.10.2 to 0.11.0 does not require a broker upgrade. A Kafka Streams 0.11.0 application can connect to 0.11.0, 0.10.2 and 0.10.1 brokers (it is not possible to connect to 0.10.0 brokers though).
If you specify customized key.serde, value.serde and timestamp.extractor in configs, it is recommended to use their replacement configuration parameters, as these configs are deprecated.
See Streams API changes in 0.11.0 for more details.
Notable changes in 0.11.0.0
Unclean leader election is now disabled by default. The new default favors durability over availability. Users who wish to retain the previous behavior should set the broker config unclean.leader.election.enable to true.
Producer configs block.on.buffer.full, metadata.fetch.timeout.ms and timeout.ms have been removed. They were initially deprecated in Kafka 0.9.0.0.
The offsets.topic.replication.factor broker config is now enforced upon auto topic creation. Internal auto topic creation will fail with a GROUP_COORDINATOR_NOT_AVAILABLE error until the cluster size meets this replication factor requirement.
When compressing data with snappy, the producer and broker will use the compression scheme's default block size (2 x 32 KB) instead of 1 KB in order to improve the compression ratio. There have been reports of data compressed with the smaller block size being 50% larger than when compressed with the larger block size. For the snappy case, a producer with 5000 partitions will require an additional 315 MB of JVM heap.
Similarly, when compressing data with gzip, the producer and broker will use 8 KB instead of 1 KB as the buffer size. The default for gzip is excessively low (512 bytes).
The broker configuration max.message.bytes now applies to the total size of a batch of messages. Previously the setting applied to batches of compressed messages, or to non-compressed messages individually. A message batch may consist of only a single message, so in most cases, the limitation on the size of individual messages is only reduced by the overhead of the batch format. However, there are some subtle implications for message format conversion (see below for more detail). Note also that while previously the broker would ensure that at least one message is returned in each fetch request (regardless of the total and partition-level fetch sizes), the same behavior now applies to one message batch.
GC log rotation is enabled by default, see KAFKA-3754 for details.
Deprecated constructors of RecordMetadata, MetricName and Cluster classes have been removed.
Added user headers support through a new Headers interface providing user headers read and write access.
ProducerRecord and ConsumerRecord expose the new Headers API via Headers headers() method call.
ExtendedSerializer and ExtendedDeserializer interfaces are introduced to support serialization and deserialization for headers. Headers will be ignored if the configured serializer and deserializer are not the above classes.
A new config, group.initial.rebalance.delay.ms, was introduced. This config specifies the time, in milliseconds, that the GroupCoordinator will delay the initial consumer rebalance. The rebalance will be further delayed by the value of group.initial.rebalance.delay.ms as new members join the group, up to a maximum of max.poll.interval.ms. The default value for this is 3 seconds. During development and testing it might be desirable to set this to 0 in order to not delay test execution time.
org.apache.kafka.common.Cluster#partitionsForTopic, partitionsForNode and availablePartitionsForTopic methods will return an empty list instead of null (which is considered a bad practice) in case the metadata for the required topic does not exist.
Streams API configuration parameters timestamp.extractor, key.serde, and value.serde were deprecated and replaced by default.timestamp.extractor, default.key.serde, and default.value.serde, respectively.
For offset commit failures in the Java consumer's commitAsync APIs, we no longer expose the underlying cause when instances of RetriableCommitFailedException are passed to the commit callback. See KAFKA-5052 for more detail; a callback sketch follows this list.
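A short illustrative sketch of a commitAsync callback that reacts to the RetriableCommitFailedException case described above; the topic, group id, and broker address are made up:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.consumer.OffsetCommitCallback;
import org.apache.kafka.clients.consumer.RetriableCommitFailedException;
import org.apache.kafka.common.TopicPartition;

public class AsyncCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "async-commit-demo");          // hypothetical group
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("page-views"));
        ConsumerRecords<String, String> records = consumer.poll(1000);
        records.forEach(record -> System.out.println(record.value()));

        consumer.commitAsync(new OffsetCommitCallback() {
            @Override
            public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
                if (exception instanceof RetriableCommitFailedException) {
                    // The underlying cause is no longer attached here; the exception type
                    // itself signals that retrying the commit may succeed.
                    System.err.println("commit failed but may be retried: " + exception);
                } else if (exception != null) {
                    System.err.println("commit failed permanently: " + exception);
                }
            }
        });
        consumer.close();
    }
}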
Notes on Exactly Once Semantics
Kafka 0.11.0 includes support for idempotent and transactional capabilities in the producer. Idempotent delivery ensures that messages are delivered exactly once to a particular topic partition during the lifetime of a single producer. Transactional delivery allows producers to send data to multiple partitions such that either all messages are successfully delivered, or none of them are. Together, these capabilities enable "exactly once semantics" in Kafka. More details on these features are available in the user guide, but below we add a few specific notes on enabling them in an upgraded cluster (a producer sketch follows the notes). Note that enabling EoS is not required and there is no impact on the broker's behavior if unused.
1. Only the new Java producer and consumer support exactly once semantics.
2. These features depend crucially on the 0.11.0 message format. Attempting to use them on an older format will result in unsupported version errors.
3. Transaction state is stored in a new internal topic __transaction_state. This topic is not created until the first attempt to use a transactional request API. Similar to the consumer offsets topic, there are several settings to control the topic's configuration. For example, transaction.state.log.min.isr controls the minimum ISR for this topic. See the configuration section in the user guide for a full list of options.
4. For secure clusters, the transactional APIs require new ACLs which can be turned on with the bin/kafka-acls.sh tool.
5. EoS in Kafka introduces new request APIs and modifies several existing ones. See KIP-98 for the full details.
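For illustration only, a transactional send might look like the following sketch; the transactional.id, topic names, and error handling are simplified assumptions rather than a prescribed pattern:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.ProducerFencedException;

public class TransactionalProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("transactional.id", "transfer-app-1");     // hypothetical id, unique per producer instance
        props.put("enable.idempotence", "true");              // implied by transactional.id, shown for clarity
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            // Both records are committed atomically even though they go to different topics.
            producer.send(new ProducerRecord<>("debits", "acct-1", "-10"));
            producer.send(new ProducerRecord<>("credits", "acct-2", "+10"));
            producer.commitTransaction();
        } catch (ProducerFencedException e) {
            // Another producer with the same transactional.id took over; do not abort, just close.
        } catch (KafkaException e) {
            producer.abortTransaction();
        } finally {
            producer.close();
        }
    }
}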
Notes on the new message format in 0.11.0
The 0.11.0 message format includes several major enhancements in order to support better delivery semantics for the producer (see KIP-98) and improved replication fault tolerance (see KIP-101). Although the new format contains more information to make these improvements possible, we have made the batch format much more efficient. As long as the number of messages per batch is more than 2, you can expect lower overall overhead. For smaller batches, however, there may be a small performance impact. See here for the results of our initial performance analysis of the new message format. You can also find more detail on the message format in the KIP-98 proposal.
One of the notable differences in the new message format is that even uncompressed messages are stored together as a single batch. This has a few implications for the broker configuration max.message.bytes, which limits the size of a single batch. First, if an older client produces messages to a topic partition using the old format, and the messages are individually smaller than max.message.bytes, the broker may reject them after they are merged into a single batch during the up-conversion process. Generally this can happen when the aggregate size of the individual messages is larger than max.message.bytes. There is a similar effect for older consumers reading messages down-converted from the new format: if the fetch size is not set at least as large as max.message.bytes, the consumer may not be able to make progress even if the individual uncompressed messages are smaller than the configured fetch size. This behavior does not impact the Java client for 0.10.1.0 and later since it uses an updated fetch protocol which ensures that at least one message can be returned even if it exceeds the fetch size. To get around these problems, you should ensure 1) that the producer's batch size is not set larger than max.message.bytes, and 2) that the consumer's fetch size is set at least as large as max.message.bytes.
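A configuration sketch of those two recommendations, assuming the broker is left at its default max.message.bytes (roughly 1 MB) and using max.partition.fetch.bytes as the consumer's per-partition fetch limit; the exact numbers are illustrative:

import java.util.Properties;

public class BatchSizingConfigs {
    // Assumed broker-side limit; Kafka's default max.message.bytes is roughly 1 MB.
    static final int BROKER_MAX_MESSAGE_BYTES = 1000012;

    // 1) Producer: keep the batch size at or below the broker's batch limit.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("batch.size", "16384");   // well under BROKER_MAX_MESSAGE_BYTES
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    // 2) Consumer: make sure a single partition fetch can hold a full batch.
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sizing-demo");
        props.put("max.partition.fetch.bytes", Integer.toString(BROKER_MAX_MESSAGE_BYTES));
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}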
Most of the discussion on the performance impact of upgrading to the 0.10.0 message format remains pertinent to the 0.11.0 upgrade. This mainly affects clusters that are not secured with TLS since "zero-copy" transfer is already not possible in that case. In order to avoid the cost of down-conversion, you should ensure that consumer applications are upgraded to the latest 0.11.0 client. Significantly, since the old consumer has been deprecated in 0.11.0.0, it does not support the new message format. You must upgrade to use the new consumer to use the new message format without the cost of down-conversion. Note that 0.11.0 consumers support backwards compatibility with 0.10.0 brokers and upward, so it is possible to upgrade the clients first before the brokers.
Upgrading from 0.8.x, 0.9.x, 0.10.0.x or 0.10.1.x to 0.10.2.0
0.10.2.0 has wire protocol changes. By following the recommended rolling upgrade plan below, you guarantee no downtime during the upgrade. However, please review the notable changes in 0.10.2.0 before upgrading.
Starting with version 0.10.2, Java clients (producer and consumer) have acquired the ability to communicate with older brokers. Version 0.10.2 clients can talk to version 0.10.0 or newer brokers. However, if your brokers are older than 0.10.0, you must upgrade all the brokers in the Kafka cluster before upgrading your clients. Version 0.10.2 brokers support 0.8.x and newer clients.
Note: If you are willing to accept downtime, you can simply take all the brokers down, update the code and start all of them. They will start with the new protocol by default.
Note: Bumping the protocol version and restarting can be done any time after the brokers were upgraded. It does not have to be immediately after.
Upgrading your Streams application from 0.10.1 to 0.10.2 does not require a broker upgrade. A Kafka Streams 0.10.2 application can connect to 0.10.2 and 0.10.1 brokers (it is not possible to connect to 0.10.0 brokers though).
You need to recompile your code. Just swapping the Kafka Streams library jar file will not work and will break your application.
If you use a custom (i.e., user implemented) timestamp extractor, you will need to update this code, because the TimestampExtractor interface was changed.
If you register custom metrics, you will need to update this code, because the StreamsMetric interface was changed.
See Streams API changes in 0.10.2 for more details.
Notable changes in 0.10.2.1
The default values for two configurations of the StreamsConfig class were changed to improve the resiliency of Kafka Streams applications. The internal Kafka Streams producer retries default value was changed from 0 to 10. The internal Kafka Streams consumer max.poll.interval.ms default value was changed from 300000 to Integer.MAX_VALUE.
Notable changes in 0.10.2.0
The Java clients (producer and consumer) have acquired the ability to communicate with older brokers. Version 0.10.2 clients can talk to version 0.10.0 or newer brokers. Note that some features are not available or are limited when older brokers are used.
Several methods on the Java consumer may now throw InterruptException if the calling thread is interrupted. Please refer to the KafkaConsumer Javadoc for a more in-depth explanation of this change.
Java consumer now shuts down gracefully. By default, the consumer waits up to 30 seconds to complete pending requests. A new close API with timeout has been added to KafkaConsumer to control the maximum wait time.
Multiple regular expressions separated by commas can be passed to MirrorMaker with the new Java consumer via the --whitelist option. This makes the behaviour consistent with MirrorMaker when used with the old Scala consumer.
Upgrading your Streams application from 0.10.1 to 0.10.2 does not require a broker upgrade. A Kafka Streams 0.10.2 application can connect to 0.10.2 and 0.10.1 brokers (it is not possible to connect to 0.10.0 brokers though).
The Zookeeper dependency was removed from the Streams API. The Streams API now uses the Kafka protocol to manage internal topics instead of modifying Zookeeper directly. This eliminates the need for privileges to access Zookeeper directly and "StreamsConfig.ZOOKEEPER_CONFIG" should not be set in the Streams app any more. If the Kafka cluster is secured, Streams apps must have the required security privileges to create new topics.
Several new fields including "security.protocol", "connections.max.idle.ms", "retry.backoff.ms", "reconnect.backoff.ms" and "request.timeout.ms" were added to the StreamsConfig class. Users should pay attention to the default values and set these if needed. For more details please refer to Kafka Streams Configs.
KIP-88: OffsetFetchRequest v2 supports retrieval of offsets for all topics if the topics array is set to null .
KIP-88: OffsetFetchResponse v2 introduces a top-level error_code field.
KIP-103: UpdateMetadataRequest v3 introduces a listener_name field to the elements of the end_points array.
KIP-108: CreateTopicsRequest v1 introduces a validate_only field.
KIP-108: CreateTopicsResponse v1 introduces an error_message field to the elements of the topic_errors array.
Upgrading from 0.8.x, 0.9.x or 0.10.0.x to 0.10.1.0
0.10.1.0 has wire protocol changes. By following the recommended rolling upgrade plan below, you guarantee no downtime during the upgrade. However, please notice the potential breaking changes in 0.10.1.0 before upgrade.
Note: Because new protocols are introduced, it is important to upgrade your Kafka clusters before upgrading your clients (i.e. 0.10.1.x clients only support 0.10.1.x or later brokers while 0.10.1.x brokers also support older clients).
2. Upgrade the brokers one at a time: shut down the broker, update the code, and restart it.
3. Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 0.10.1.0
4. If your previous message format is 0.10.0, change log.message.format.version to 0.10.1 (this is a no-op as the message format is the same for both 0.10.0 and 0.10.1). If your previous message format version is lower than 0.10.0, do not change log.message.format.version yet - this parameter should only change once all consumers have been upgraded to 0.10.0.0 or later.
5. Restart the brokers one by one for the new protocol version to take effect.
6. If log.message.format.version is still lower than 0.10.0 at this point, wait until all consumers have been upgraded to 0.10.0 or later, then change log.message.format.version to 0.10.1 on each broker and restart them one by one.
Note: If you are willing to accept downtime, you can simply take all the brokers down, update the code and start all of them. They will start with the new protocol by default.
Note: Bumping the protocol version and restarting can be done any time after the brokers were upgraded. It does not have to be immediately after.
Potential breaking changes in 0.10.1.0
The log retention time is no longer based on last modified time of the log segments. Instead it will be based on the largest timestamp of the messages in a log segment.
The log rolling time is no longer depending on log segment create time. Instead it is now based on the timestamp in the messages. More specifically, if the timestamp of the first message in the segment is T, the log will be rolled out when a new message has a timestamp larger than or equal to T + log.roll.ms.
The open file handlers of 0.10.0 will increase by ~33% because of the addition of time index files for each segment.
The time index and offset index share the same index size configuration. Since each time index entry is 1.5x the size of offset index entry, you may need to increase log.index.size.max.bytes to avoid potential frequent log rolling.
Due to the increased number of index files, on some brokers with a large number of log segments (e.g. >15K), the log loading process during broker startup could be longer. Based on our experiment, setting the num.recovery.threads.per.data.dir to one may reduce the log loading time.
Upgrading your Streams application from 0.10.0 to 0.10.1 does require a broker upgrade because a Kafka Streams 0.10.1 application can only connect to 0.10.1 brokers.
There are a couple of API changes that are not backward compatible (cf. Streams API changes in 0.10.1 for more details). Thus, you need to update and recompile your code. Just swapping the Kafka Streams library jar file will not work and will break your application.
Notable changes in 0.10.1.0
The new Java consumer is no longer in beta and we recommend it for all new development. The old Scala consumers are still supported, but they will be deprecated in the next release and will be removed in a future major release.
The --new-consumer / --new.consumer switch is no longer required to use tools like MirrorMaker and the Console Consumer with the new consumer; one simply needs to pass a Kafka broker to connect to instead of the ZooKeeper ensemble. In addition, usage of the Console Consumer with the old consumer has been deprecated and it will be removed in a future major release.
Kafka clusters can now be uniquely identified by a cluster id. It will be automatically generated when a broker is upgraded to 0.10.1.0. The cluster id is available via the kafka.server:type=KafkaServer,name=ClusterId metric and it is part of the Metadata response. Serializers, client interceptors and metric reporters can receive the cluster id by implementing the ClusterResourceListener interface.
The BrokerState "RunningAsController" (value 4) has been removed. Due to a bug, a broker would only be in this state briefly before transitioning out of it and hence the impact of the removal should be minimal. The recommended way to detect if a given broker is the controller is via the kafka.controller:type=KafkaController,name=ActiveControllerCount metric.
The new Java Consumer now allows users to search offsets by timestamp on partitions.
The new Java Consumer now supports heartbeating from a background thread. There is a new configuration max.poll.interval.ms which controls the maximum time between poll invocations before the consumer will proactively leave the group (5 minutes by default). The value of the configuration request.timeout.ms must always be larger than max.poll.interval.ms because this is the maximum time that a JoinGroup request can block on the server while the consumer is rebalancing, so we have changed its default value to just above 5 minutes. Finally, the default value of session.timeout.ms has been adjusted down to 10 seconds, and the default value of max.poll.records has been changed to 500. A configuration sketch follows this list.
When using an Authorizer and a user doesn't have Describe authorization on a topic, the broker will no longer return TOPIC_AUTHORIZATION_FAILED errors to requests since this leaks topic names. Instead, the UNKNOWN_TOPIC_OR_PARTITION error code will be returned. This may cause unexpected timeouts or delays when using the producer and consumer since Kafka clients will typically retry automatically on unknown topic errors. You should consult the client logs if you suspect this could be happening.
Fetch responses have a size limit by default (50 MB for consumers and 10 MB for replication). The existing per partition limits also apply (1 MB for consumers and replication). Note that neither of these limits is an absolute maximum as explained in the next point.
Consumers and replicas can make progress if a message larger than the response/partition size limit is found. More concretely, if the first message in the first non-empty partition of the fetch is larger than either or both limits, the message will still be returned.
Overloaded constructors were added to kafka.api.FetchRequest and kafka.javaapi.FetchRequest to allow the caller to specify the order of the partitions (since order is significant in v3). The previously existing constructors were deprecated and the partitions are shuffled before the request is sent to avoid starvation issues.
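The configuration sketch referenced in the list above; the values are illustrative, and request.timeout.ms is raised to stay above max.poll.interval.ms as the note on rebalancing requires:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollTuningExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "slow-processing-app");        // hypothetical group
        props.put("max.poll.interval.ms", "600000");   // allow up to 10 minutes between polls
        props.put("request.timeout.ms", "605000");     // must stay above max.poll.interval.ms
        props.put("session.timeout.ms", "10000");      // liveness via the background heartbeat thread
        props.put("max.poll.records", "100");          // cap the work handled per poll
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("page-views"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            records.forEach(record -> process(record.value()));  // heartbeats continue in the background
        }
    }

    static void process(String value) { /* slow work goes here */ }
}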
Upgrading from 0.8.x or 0.9.x to 0.10.0.0
0.10.0.0 has potential breaking changes (please review before upgrading) and possible performance impact following the upgrade. By following the recommended rolling upgrade plan below, you guarantee no downtime and no performance impact during and following the upgrade.
Note: Because new protocols are introduced, it is important to upgrade your Kafka clusters before upgrading your clients.
Notes to clients with version 0.9.0.0: Due to a bug introduced in 0.9.0.0, clients that depend on ZooKeeper (old Scala high-level Consumer and MirrorMaker if used with the old consumer) will not work with 0.10.0.x brokers. Therefore, 0.9.0.0 clients should be upgraded to 0.9.0.1 before brokers are upgraded to 0.10.0.x. This step is not necessary for 0.8.X or 0.9.0.1 clients.
2. Upgrade the brokers. This can be done a broker at a time by simply bringing it down, updating the code, and restarting it.
3. Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 0.10.0.0. NOTE: You shouldn't touch log.message.format.version yet - this parameter should only change once all consumers have been upgraded to 0.10.0.0 or later.
4. Restart the brokers one by one for the new protocol version to take effect.
5. Once all consumers have been upgraded to 0.10.0, change log.message.format.version to 0.10.0 on each broker and restart them one by one.
Note: If you are willing to accept downtime, you can simply take all the brokers down, update the code and start all of them. They will start with the new protocol by default.
Note: Bumping the protocol version and restarting can be done any time after the brokers were upgraded. It does not have to be immediately after.
Potential performance impact following upgrade to 0.10.0.0
The message format in 0.10.0 includes a new timestamp field and uses relative offsets for compressed messages. The on disk message format can be configured through log.message.format.version in the server.properties file. The default on-disk message format is 0.10.0. If a consumer client is on a version before 0.10.0.0, it only understands message formats before 0.10.0. In this case, the broker is able to convert messages from the 0.10.0 format to an earlier format before sending the response to the consumer on an older version. However, the broker can't use zero-copy transfer in this case. Reports from the Kafka community on the performance impact have shown CPU utilization going from 20% before to 100% after an upgrade, which forced an immediate upgrade of all clients to bring performance back to normal. To avoid such message conversion before consumers are upgraded to 0.10.0.0, one can set log.message.format.version to 0.8.2 or 0.9.0 when upgrading the broker to 0.10.0.0. This way, the broker can still use zero-copy transfer to send the data to the old consumers. Once consumers are upgraded, one can change the message format to 0.10.0 on the broker and enjoy the new message format that includes new timestamp and improved compression. The conversion is supported to ensure compatibility and can be useful to support a few apps that have not updated to newer clients yet, but is impractical to support all consumer traffic on even an overprovisioned cluster. Therefore, it is critical to avoid the message conversion as much as possible when brokers have been upgraded but the majority of clients have not.
Note: By setting the message format version, one certifies that all existing messages are on or below that message format version. Otherwise consumers before 0.10.0.0 might break. In particular, after the message format is set to 0.10.0, one should not change it back to an earlier format as it may break consumers on versions before 0.10.0.0.
Note: Due to the additional timestamp introduced in each message, producers sending small messages may see a message throughput degradation because of the increased overhead. Likewise, replication now transmits an additional 8 bytes per message. If you're running close to the network capacity of your cluster, it's possible that you'll overwhelm the network cards and see failures and performance issues due to the overload.
Note: If you have enabled compression on producers, you may notice reduced producer throughput and/or lower compression rate on the broker in some cases. When receiving compressed messages, 0.10.0 brokers avoid recompressing the messages, which in general reduces the latency and improves the throughput. In certain cases, however, this may reduce the batching size on the producer, which could lead to worse throughput. If this happens, users can tune linger.ms and batch.size of the producer for better throughput. In addition, the producer buffer used for compressing messages with snappy is smaller than the one used by the broker, which may have a negative impact on the compression ratio for the messages on disk. We intend to make this configurable in a future Kafka release.
Potential breaking changes in 0.10.0.0
Starting from Kafka 0.10.0.0, the message format version in Kafka is represented as the Kafka version. For example, message format 0.9.0 refers to the highest message version supported by Kafka 0.9.0.
Message format 0.10.0 has been introduced and it is used by default. It includes a timestamp field in the messages and relative offsets are used for compressed messages.
ProduceRequest/Response v2 has been introduced and it is used by default to support message format 0.10.0.
FetchRequest/Response v2 has been introduced and it is used by default to support message format 0.10.0.
MessageFormatter interface was changed from def writeTo(key: Array[Byte], value: Array[Byte], output: PrintStream) to def writeTo(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]], output: PrintStream)
MessageReader interface was changed from def readMessage(): KeyedMessage[Array[Byte], Array[Byte]] to def
readMessage(): ProducerRecord[Array[Byte], Array[Byte]]
MessageFormatter's package was changed from kafka.tools to kafka.common
MessageReader's package was changed from kafka.tools to kafka.common
MirrorMakerMessageHandler no longer exposes the handle(record: MessageAndMetadata[Array[Byte], Array[Byte]]) method as it was never called.
The 0.7 KafkaMigrationTool is no longer packaged with Kafka. If you need to migrate from 0.7 to 0.10.0, please migrate to 0.8 first and then follow the documented upgrade process to upgrade from 0.8 to 0.10.0.
The new consumer has standardized its APIs to accept java.util.Collection as the sequence type for method parameters. Client code may have to be updated to work with the 0.10.0 client library.
LZ4-compressed message handling was changed to use an interoperable framing specification (LZ4f v1.5.1). To maintain compatibility with old clients, this change only applies to Message format 0.10.0 and later. Clients that Produce/Fetch LZ4-compressed messages using v0/v1 (Message format 0.9.0) should continue to use the 0.9.0 framing implementation. Clients that use Produce/Fetch protocols v2 or later should use interoperable LZ4f framing. A list of interoperable LZ4 libraries is available at http://www.lz4.org/
Starting from Kafka 0.10.0.0, a new client library named Kafka Streams is available for stream processing on data stored in Kafka topics. This new client library only works with 0.10.x and upward versioned brokers due to message format changes mentioned above. For more details please read the Streams documentation.
The default value of the con guration parameter receive.buffer.bytes is now 64K for the new consumer.
The new consumer now exposes the configuration parameter exclude.internal.topics to restrict internal topics (such as the consumer offsets topic) from accidentally being included in regular expression subscriptions. By default, it is enabled.
The old Scala producer has been deprecated. Users should migrate their code to the Java producer included in the kafka-clients JAR as soon as possible.
The new consumer API has been marked stable.
0.9.0.0 has potential breaking changes (please review before upgrading) and an inter-broker protocol change from previous versions. This means that upgraded brokers and clients may not be compatible with older versions. It is important that you upgrade your Kafka cluster before upgrading your clients. If you are using MirrorMaker, downstream clusters should be upgraded first as well.
1. Update server.properties file on all brokers and add the following property: inter.broker.protocol.version=0.8.2.X
2. Upgrade the brokers. This can be done a broker at a time by simply bringing it down, updating the code, and restarting it.
3. Once the entire cluster is upgraded, bump the protocol version by editing inter.broker.protocol.version and setting it to 0.9.0.0.
4. Restart the brokers one by one for the new protocol version to take effect
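As a sketch of steps 1 and 3, the relevant server.properties entry is first pinned to the old version during the rolling restart and then bumped once every broker runs the new code (0.8.2.X stands for whichever 0.8.2 release you are upgrading from):
# step 1: while old and new brokers coexist
inter.broker.protocol.version=0.8.2.X
# step 3: after all brokers have been upgraded to the new code
inter.broker.protocol.version=0.9.0.0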
Note: If you are willing to accept downtime, you can simply take all the brokers down, update the code and start all of them. They will start with the new protocol by default.
Note: Bumping the protocol version and restarting can be done any time after the brokers were upgraded. It does not have to be immediately after.
The new broker id generation feature can be disabled by setting broker.id.generation.enable to false.
Configuration parameter log.cleaner.enable is now true by default. This means topics with a cleanup.policy=compact will now be compacted by default, and 128 MB of heap will be allocated to the cleaner process via log.cleaner.dedupe.buffer.size. You may want to review log.cleaner.dedupe.buffer.size and the other log.cleaner configuration values based on your usage of compacted topics.
The default value of the configuration parameter fetch.min.bytes for the new consumer is now 1.
Deprecations in 0.9.0.0
Altering topic configuration from the kafka-topics.sh script (kafka.admin.TopicCommand) has been deprecated. Going forward, please use the kafka-configs.sh script (kafka.admin.ConfigCommand) for this functionality.
The kafka-consumer-offset-checker.sh (kafka.tools.ConsumerOffsetChecker) has been deprecated. Going forward, please use kafka-consumer-groups.sh (kafka.admin.ConsumerGroupCommand) for this functionality.
The kafka.tools.ProducerPerformance class has been deprecated. Going forward, please use org.apache.kafka.tools.ProducerPerformance for this functionality (kafka-producer-perf-test.sh will also be changed to use the new class).
The producer config block.on.buffer.full has been deprecated and will be removed in a future release. Currently its default value has been changed to false. The KafkaProducer will no longer throw BufferExhaustedException but instead will use the max.block.ms value to block, after which it will throw a TimeoutException. If the block.on.buffer.full property is set to true explicitly, it will set max.block.ms to Long.MAX_VALUE and metadata.fetch.timeout.ms will not be honoured
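For example, a producer that previously set block.on.buffer.full=true can instead bound how long send() may block by setting max.block.ms explicitly; the timeout below is an illustrative value:
# fail sends with a TimeoutException after 60 seconds instead of blocking indefinitely
max.block.ms=60000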
0.8.2 is fully compatible with 0.8.1. The upgrade can be done one broker at a time by simply bringing it down, updating the code, and restarting it.
0.8.1 is fully compatible with 0.8. The upgrade can be done one broker at a time by simply bringing it down, updating the code, and restarting it.
Release 0.7 is incompatible with newer releases. Major changes were made to the API, ZooKeeper data structures, and protocol, and configuration in order to add replication (which was missing in 0.7). The upgrade from 0.7 to later versions requires a special tool for migration. This migration can be done without downtime.
2. APIS
Kafka includes five core APIs:
1. The Producer API allows applications to send streams of data to topics in the Kafka cluster.
2. The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
3. The Streams API allows transforming streams of data from input topics to output topics.
4. The Connect API allows implementing connectors that continually pull from some source system or application into Kafka or push from Kafka into some sink system or application.
5. The AdminClient API allows managing and inspecting topics, brokers, and other Kafka objects.
Kafka exposes all its functionality over a language independent protocol which has clients available in many programming languages. Only the Java clients are maintained as part of the main Kafka project; the others are available as independent open source projects. A list of non-Java clients is available here.
The Producer API allows applications to send streams of data to topics in the Kafka cluster.
Examples showing how to use the producer are given in the javadocs.
To use the producer, you can use the following maven dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
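As a quick orientation before the javadocs, a minimal producer might look like the following sketch; the broker address, topic name and String serializers are assumptions for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // send() is asynchronous; close() flushes any buffered records before exiting
        producer.send(new ProducerRecord<>("my-topic", "key", "value")); // assumed topic name
        producer.close();
    }
}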
The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
Examples showing how to use the consumer are given in the javadocs.
To use the consumer, you can use the following maven dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
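A minimal consumer sketch, under the same assumptions about broker address, group id and topic name, might look like this:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "my-group");                // assumed consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic name
        // poll once with a 1 second timeout and print whatever was fetched
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records)
            System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
        consumer.close();
    }
}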
The Streams API allows transforming streams of data from input topics to output topics.
Examples showing how to use this library are given in the javadocs
To use Kafka Streams you can use the following maven dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>1.0.0</version>
</dependency>
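A minimal Streams application that simply copies records from one topic to another might look like the sketch below; the application id, broker address and topic names are assumptions for illustration:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe-example"); // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Copy every record from the assumed input topic to the assumed output topic
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("streams-input-topic").to("streams-output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}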
The Connect API allows implementing connectors that continually pull from some source data system into Kafka or push from Kafka into some sink data system.
Many users of Connect won't need to use this API directly, though; they can use pre-built connectors without needing to write any code. Additional information on using Connect is available here.
Those who want to implement custom connectors can see the javadoc.
The AdminClient API supports managing and inspecting topics, brokers, acls, and other Kafka objects.
To use the AdminClient API, add the following Maven dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>1.0.0</version>
</dependency>
For more information about the AdminClient APIs, see the javadoc.
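For orientation, a minimal sketch that creates a topic through the AdminClient might look like this; the broker address, topic name, partition count and replication factor are assumptions for illustration:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class AdminClientExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Create an assumed topic "my-new-topic" with 3 partitions and replication factor 1,
            // blocking until the broker confirms the creation
            NewTopic topic = new NewTopic("my-new-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}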
A more limited legacy producer and consumer API is also included in Kafka. These old Scala APIs are deprecated and only still available for compatibility purposes. Information on them can be found here.
3. CONFIGURATION
Kafka uses key-value pairs in the property file format for configuration. These values can be supplied either from a file or programmatically. The essential broker configurations are the following:
broker.id
log.dirs
zookeeper.connect
Topic-level configurations and defaults are discussed in more detail below.
NAME    DESCRIPTION    TYPE    DEFAULT    VALID VALUES    IMPORTANCE
zookeeper.co
Zookeeper host string string h
nnect
DEPRECATED: only used when
`advertised.listeners` or `listeners` are not set.
Use `advertised.listeners` instead. Hostname to
publish to ZooKeeper for clients to use. In IaaS
advertised.ho environments, this may need to be different from
string null h
st.name the interface to which the broker binds. If this is
not set, it will use the value for `host.name` if
con gured. Otherwise it will use the value
returned from
java.net.InetAddress.getCanonicalHostName().
Listeners to publish to ZooKeeper for clients to
use, if different than the `listeners` con g
property. In IaaS environments, this may need to
advertised.list
be different from the interface to which the string null h
eners
broker binds. If this is not set, the value for
`listeners` will be used. Unlike `listeners` it is not
valid to advertise the 0.0.0.0 meta-address.
DEPRECATED: only used when
`advertised.listeners` or `listeners` are not set.
Use `advertised.listeners` instead. The port to
advertised.po publish to ZooKeeper for clients to use. In IaaS
int null h
rt environments, this may need to be different from
the port to which the broker binds. If this is not
set, it will publish the same port that the broker
binds to.
auto.create.to
Enable auto creation of topic on the server boolean true h
pics.enable
auto.leader.re Enables auto leader balancing. A background
balance.enabl thread checks and triggers leader balance if boolean true h
e required at regular intervals
background.t The number of threads to use for various
int 10 [1,...] h
hreads background processing tasks
The broker id for this server. If unset, a unique
broker id will be generated.To avoid con icts
broker.id between zookeeper generated broker id's and int -1 h
user con gured broker id's, generated broker ids
start from reserved.broker.max.id + 1.
listener.security.protocol.map must also be set.
Specify hostname as 0.0.0.0 to bind to all
interfaces. Leave hostname empty to bind to
default interface. Examples of legal listener lists:
PLAINTEXT://myhost:9092,SSL://:9091
CLIENT://0.0.0.0:9092,REPLICATION://localhost:
9093
The directory in which the log data is kept /tmp/kafka-
log.dir string h
(supplemental for log.dirs property) logs
The directories in which the log data is kept. If
log.dirs string null h
not set, the value in log.dir is used
log. ush.inter The number of messages accumulated on a log 92233720368
long [1,...] h
val.messages partition before messages are ushed to disk 54775807
The maximum time in ms that a message in any
log. ush.inter topic is kept in memory before ushed to disk. If
long null h
val.ms not set, the value in
log. ush.scheduler.interval.ms is used
log. ush.offs The frequency with which we update the
et.checkpoint persistent record of the last ush which acts as int 60000 [0,...] h
.interval.ms the log recovery point
log. ush.sche
The frequency in ms that the log usher checks 92233720368
duler.interval. long h
whether any log needs to be ushed to disk 54775807
ms
log. ush.start
.offset.check The frequency with which we update the
int 60000 [0,...] h
point.interval. persistent record of log start offset
ms
log.retention.
The maximum size of the log before deleting it long -1 h
bytes
The number of hours to keep a log le before
log.retention.
deleting it (in hours), tertiary to log.retention.ms int 168 h
hours
property
The number of minutes to keep a log le before
log.retention. deleting it (in minutes), secondary to
int null h
minutes log.retention.ms property. If not set, the value in
log.retention.hours is used
The number of milliseconds to keep a log le
log.retention.
before deleting it (in milliseconds), If not set, the long null h
ms
value in log.retention.minutes is used
The maximum time before a new log segment is
log.roll.hours rolled out (in hours), secondary to log.roll.ms int 168 [1,...] h
property
batches and this limit only applies to a single
record in that case.
equests blocking the network threads
DEPRECATED: Used only when dynamic default
quotas are not con gured for or in Zookeeper.
quota.consu 92233720368
Any consumer distinguished by long [1,...] h
mer.default 54775807
clientId/consumer group will get throttled if it
fetches more bytes than this value per-second
DEPRECATED: Used only when dynamic default
quotas are not con gured for , or in Zookeeper.
quota.produc 92233720368
Any producer distinguished by clientId will get long [1,...] h
er.default 54775807
throttled if it produces more bytes than this
value per-second
Minimum bytes expected for each fetch
replica.fetch.
response. If not enough bytes, wait up to int 1 h
min.bytes
replicaMaxWaitTimeMs
max wait time for each fetcher request issued
by follower replicas. This value should always be
replica.fetch.
less than the replica.lag.time.max.ms at all int 500 h
wait.max.ms
times to prevent frequent shrinking of ISR for
low throughput topics
replica.high.w
atermark.che The frequency with which the high watermark is
long 5000 h
ckpoint.interv saved out to disk
al.ms
If a follower hasn't sent any fetch requests or
replica.lag.ti hasn't consumed up to the leaders log end
long 10000 h
me.max.ms offset for at least this time, the leader will
remove the follower from isr
replica.socket
.receive.buffe The socket receive buffer for network requests int 65536 h
r.bytes
The socket timeout for network requests. Its
replica.socket
value should be at least int 30000 h
.timeout.ms
replica.fetch.wait.max.ms
The con guration controls the maximum
amount of time the client will wait for the
request.timeo response of a request. If the response is not
int 30000 h
ut.ms received before the timeout elapses the client
will resend the request if necessary or fail the
request if retries are exhausted.
The SO_RCVBUF buffer of the socket sever
socket.receiv
sockets. If the value is -1, the OS default will be int 102400 h
e.buffer.bytes
used.
socket.reques The maximum number of bytes in a socket
int 104857600 [1,...] h
t.max.bytes request
The SO_SNDBUF buffer of the socket sever
socket.send.b
sockets. If the value is -1, the OS default will be int 102400 h
uffer.bytes
used.
The maximum allowed timeout for transactions.
If a clients requested transaction time exceed
transaction.m this, then the broker will return an error in
ax.timeout.m InitProducerIdRequest. This prevents a client int 900000 [1,...] h
s from too large of a timeout, which can stall
consumers reading from topics included in the
transaction.
transaction.st Batch size for reading from the transaction log
ate.log.load.b segments when loading producer ids and int 5242880 [1,...] h
uffer.size transactions into the cache.
transaction.st
Overridden min.insync.replicas con g for the
ate.log.min.is int 2 [1,...] h
transaction topic.
r
transaction.st
The number of partitions for the transaction
ate.log.num.p int 50 [1,...] h
topic (should not change after deployment).
artitions
The replication factor for the transaction topic
transaction.st
(set higher to ensure availability). Internal topic
ate.log.replic short 3 [1,...] h
creation will fail until the cluster size meets this
ation.factor
replication factor requirement.
transaction.st The transaction topic segment bytes should be int 104857600 [1,...] h
ate.log.segm kept relatively small in order to facilitate faster
ent.bytes log compaction and cache loads
The maximum amount of time in ms that the
transactional. transaction coordinator will wait before
id.expiration. proactively expire a producer's transactional id int 604800000 [1,...] h
ms without receiving any transaction status updates
from it.
unclean.leade Indicates whether to enable replicas not in the
r.election.ena ISR set to be elected as leader as a last resort, boolean false h
ble even though doing so may result in data loss
zookeeper.co The max time that the client waits to establish a
nnection.time connection to zookeeper. If not set, the value in int null h
out.ms zookeeper.session.timeout.ms is used
zookeeper.se
ssion.timeout Zookeeper session timeout int 6000 h
.ms
zookeeper.set
Set client to use secure ACLs boolean false h
.acl
broker.id.gen Enable automatic broker id generation on the
eration.enabl server. When enabled the value con gured for boolean true m
e reserved.broker.max.id should be reviewed.
Rack of the broker. This will be used in rack
broker.rack aware replication assignment for fault tolerance. string null m
Examples: `RACK1`, `us-east-1d`
Idle connections timeout: the server socket
connections.
processor threads close the connections that long 600000 m
max.idle.ms
idle more than this
controlled.sh
utdown.enabl Enable controlled shutdown of the server boolean true m
e
controlled.sh Controlled shutdown can fail for multiple
utdown.max.r reasons. This determines the number of retries int 3 m
etries when such failure happens
Before each retry, the system needs time to
controlled.sh recover from the state that caused the previous
utdown.retry. failure (Controller fail over, replica lag etc). This long 5000 m
backoff.ms con g determines the amount of time to wait
before retrying.
controller.soc
The socket timeout for controller-to-broker
ket.timeout.m int 30000 m
channels
s
inter.broker.lis Name of listener used for communication string null m
tener.name between brokers. If this is unset, the listener
name is de ned by security.inter.broker.protocol.
It is an error to set this and
security.inter.broker.protocol properties at the
same time.
Specify which version of the inter-broker
protocol will be used. This is typically bumped
inter.broker.pr
after all brokers were upgraded to a new version.
otocol.versio string 1.0-IV0 m
Example of some valid values are: 0.8.0, 0.8.1,
n
0.8.1.1, 0.8.2, 0.8.2.0, 0.8.2.1, 0.9.0.0, 0.9.0.1
Check ApiVersion for the full list.
log.cleaner.ba The amount of time to sleep when there are no
long 15000 [0,...] m
ckoff.ms logs to clean
log.cleaner.de
The total memory used for log deduplication
dupe.buffer.si long 134217728 m
across all cleaner threads
ze
log.cleaner.de
lete.retention. How long are delete records retained? long 86400000 m
ms
Enable the log cleaner process to run on the
server. Should be enabled if using any topics
log.cleaner.en with a cleanup.policy=compact including the
boolean true m
able internal offsets topic. If disabled those topics
will not be compacted and continually grow in
size.
Log cleaner dedupe buffer load factor. The
log.cleaner.io.
percentage full the dedupe buffer can become.
buffer.load.fa double 0.9 m
A higher value will allow more log to be cleaned
ctor
at once but will lead to more hash collisions
log.cleaner.io. The total memory used for log cleaner I/O
int 524288 [0,...] m
buffer.size buffers across all cleaner threads
log.cleaner.io. The log cleaner will be throttled so that the sum
1.797693134
max.bytes.per of its read and write i/o will be less than this double m
8623157E308
.second value on average
log.cleaner.mi
The minimum ratio of dirty log to total log for a
n.cleanable.ra double 0.5 m
log to eligible for cleaning
tio
log.cleaner.mi The minimum time a message will remain
n.compaction uncompacted in the log. Only applicable for logs long 0 m
.lag.ms that are being compacted.
log.cleaner.th The number of background threads to use for
int 1 [0,...] m
reads log cleaning
The default cleanup policy for segments beyond
log.cleanup.p the retention window. A comma separated list of [compact,
list delete m
olicy valid policies. Valid policies are: "delete" and delete]
"compact"
log.index.inte The interval with which we add an entry to the
int 4096 [0,...] m
rval.bytes offset index
log.index.size
The maximum size in bytes of the offset index int 10485760 [4,...] m
.max.bytes
Specify the message format version the broker
will use to append messages to the logs. The
value should be a valid ApiVersion. Some
examples are: 0.8.2, 0.9.0.0, 0.10.0, check
ApiVersion for more details. By setting a
log.message.
particular message format version, the user is
format.versio string 1.0-IV0 m
certifying that all the existing messages on disk
n
are smaller or equal than the speci ed version.
Setting this value incorrectly will cause
consumers with older versions to break as they
will receive messages with a format that they
don't understand.
log.message. The maximum difference allowed between the long 92233720368 m
timestamp.dif timestamp when a broker receives a message 54775807
ference.max. and the timestamp speci ed in the message. If
ms log.message.timestamp.type=CreateTime, a
message will be rejected if the difference in
timestamp exceeds this threshold. This
con guration is ignored if
log.message.timestamp.type=LogAppendTime.
The maximum timestamp difference allowed
should be no greater than log.retention.ms to
avoid unnecessarily frequent log rolling.
De ne whether the timestamp in the message is
log.message. [CreateTime,
message create time or log append time. The
timestamp.ty string CreateTime LogAppendTi m
value should be either `CreateTime` or
pe me]
`LogAppendTime`
Should pre allocate le when create new
log.preallocat
segment? If you are using Kafka on Windows, boolean false m
e
you probably need to set it to true.
log.retention. The frequency in milliseconds that the log
check.interval cleaner checks whether any log is eligible for long 300000 [1,...] m
.ms deletion
max.connecti The maximum number of connections we allow
int 2147483647 [1,...] m
ons.per.ip from each ip address
max.connecti
Per-ip or hostname overrides to the default
ons.per.ip.ove string "" m
maximum number of connections
rrides
num.partition
The default number of log partitions per topic int 1 [1,...] m
s
The fully quali ed name of a class that
implements the KafkaPrincipalBuilder interface,
which is used to build the KafkaPrincipal object
used during authorization. This con g also
supports the deprecated PrincipalBuilder
interface which was previously used for client
authentication over SSL. If no principal builder is
de ned, the default behavior depends on the
security protocol in use. For SSL authentication,
principal.build the principal name will be the distinguished
class null m
er.class name from the client certi cate if one is
provided; otherwise, if client authentication is
not required, the principal name will be
ANONYMOUS. For SASL authentication, the
principal will be derived using the rules de ned
by
sasl.kerberos.principal.to.local.rule
s if GSSAPI is in use, and the SASL
authentication ID for other mechanisms. For
PLAINTEXT, the principal will be ANONYMOUS.
producer.purg
atory.purge.in The purge interval (in number of requests) of the
int 1000 m
terval.request producer request purgatory
s
queued.max.r The number of queued bytes allowed before no
long -1 m
equest.bytes more requests are read
replica.fetch. The amount of time to sleep when fetch
int 1000 [0,...] m
backoff.ms partition error occurs.
The number of bytes of messages to attempt to
fetch for each partition. This is not an absolute
maximum, if the rst record batch in the rst
non-empty partition of the fetch is larger than
replica.fetch. this value, the record batch will still be returned
int 1048576 [0,...] m
max.bytes to ensure that progress can be made. The
maximum record batch size accepted by the
broker is de ned via message.max.bytes
(broker con g) or max.message.bytes (topic
con g).
replica.fetch.r Maximum bytes expected for the entire fetch int 10485760 [0,...] m
esponse.max. response. Records are fetched in batches, and if
bytes the rst record batch in the rst non-empty
partition of the fetch is larger than this value, the
record batch will still be returned to ensure that
progress can be made. As such, this is not an
absolute maximum. The maximum record batch
size accepted by the broker is de ned via
message.max.bytes (broker con g) or
max.message.bytes (topic con g).
reserved.brok
Max number that can be used for a broker.id int 1000 [0,...] m
er.max.id
The list of SASL mechanisms enabled in the
sasl.enabled. Kafka server. The list may contain any
list GSSAPI m
mechanisms mechanism for which a security provider is
available. Only GSSAPI is enabled by default.
sasl.kerberos.
Kerberos kinit command path. string /usr/bin/kinit m
kinit.cmd
sasl.kerberos.
Login thread sleep time between refresh
min.time.befo long 60000 m
attempts.
re.relogin
A list of rules for mapping from principal names
to short names (typically operating system
usernames). The rules are evaluated in order
and the rst rule that matches a principal name
is used to map it to a short name. Any later rules
sasl.kerberos. in the list are ignored. By default, principal
principal.to.lo names of the form list DEFAULT m
cal.rules {username}/{hostname}@{REALM} are mapped
to {username}. For more details on the format
please see security authorization and acls. Note
that this con guration is ignored if an extension
of KafkaPrincipalBuilder is provided by the
principal.builder.class con guration.
The Kerberos principal name that Kafka runs as.
sasl.kerberos.
This can be de ned either in Kafka's JAAS string null m
service.name
con g or in Kafka's con g.
sasl.kerberos.
Percentage of random jitter added to the
ticket.renew.ji double 0.05 m
renewal time.
tter
sasl.kerberos. Login thread will sleep until the speci ed
ticket.renew. window factor of time from last refresh to
double 0.8 m
window.facto ticket's expiry has been reached, at which time it
r will try to renew the ticket.
sasl.mechani
SASL mechanism used for inter-broker
sm.inter.brok string GSSAPI m
communication. Default is GSSAPI.
er.protocol
Security protocol used to communicate between
security.inter. brokers. Valid values are: PLAINTEXT, SSL,
broker.protoc SASL_PLAINTEXT, SASL_SSL. It is an error to set string PLAINTEXT m
ol this and inter.broker.listener.name properties at
the same time.
A list of cipher suites. This is a named
combination of authentication, encryption, MAC
ssl.cipher.suit and key exchange algorithm used to negotiate
list null m
es the security settings for a network connection
using TLS or SSL network protocol. By default all
the available cipher suites are supported.
Con gures kafka broker to request client
authentication. The following settings are
common:
ssl.client.auth=required If set to
required client authentication is required.
[required,
ssl.client.auth=requested This means
ssl.client.auth string none requested, m
client authentication is optional. unlike
none]
requested , if this option is set client can
choose not to provide authentication
information about itself
ssl.client.auth=none This means client
authentication is not needed.
er.algorithm SSL connections. Default value is the key
manager factory algorithm con gured for the
Java Virtual Machine.
The location of the key store le. This is optional
ssl.keystore.l
for client and can be used for two-way string null m
ocation
authentication for client.
The store password for the key store le. This is
ssl.keystore.p
optional for client and only needed if password null m
assword
ssl.keystore.location is con gured.
ssl.keystore.t The le format of the key store le. This is
string JKS m
ype optional for client.
The SSL protocol used to generate the
SSLContext. Default setting is TLS, which is ne
for most cases. Allowed values in recent JVMs
ssl.protocol are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and string TLS m
SSLv3 may be supported in older JVMs, but their
usage is discouraged due to known security
vulnerabilities.
The name of the security provider used for SSL
ssl.provider connections. Default value is the default security string null m
provider of the JVM.
The algorithm used by trust manager factory for
ssl.trustmana SSL connections. Default value is the trust
string PKIX m
ger.algorithm manager factory algorithm con gured for the
Java Virtual Machine.
ssl.truststore.
The location of the trust store le. string null m
location
The password for the trust store le. If a
ssl.truststore.
password is not set access to the truststore is password null m
password
still available, but integrity checking is disabled.
ssl.truststore.
The le format of the trust store le. string JKS m
type
The alter con gs policy class that should be
alter.con g.p used for validation. The class should implement
olicy.class.na the class null lo
me org.apache.kafka.server.policy.AlterCo
nfigPolicy interface.
authorizer.cla The authorizer class that should be used for
string "" lo
ss.name authorization
The create topic policy class that should be
create.topic.p used for validation. The class should implement
olicy.class.na the class null lo
me org.apache.kafka.server.policy.CreateT
opicPolicy interface.
Map between listener names and security
protocols. This must be defined for the same
security protocol to be usable in more than one
port or IP. For example, internal and external
tra c can be separated even if SSL is required
for both. Concretely, the user could de ne
listeners with names INTERNAL and EXTERNAL
and this property as:
PLAINTEXT:P
`INTERNAL:SSL,EXTERNAL:SSL`. As shown, key
LAINTEXT,SS
and value are separated by a colon and map
listener.securi L:SSL,SASL_P
entries are separated by commas. Each listener
ty.protocol.m string LAINTEXT:SA lo
name should only appear once in the map.
ap SL_PLAINTEX
Different security (SSL and SASL) settings can
T,SASL_SSL:S
be con gured for each listener by adding a
ASL_SSL
normalised pre x (the listener name is
lowercased) to the con g name. For example, to
set a different keystore for the INTERNAL
listener, a con g with name
`listener.name.internal.ssl.keystore.location`
would be set. If the con g for the listener name
is not set, the con g will fallback to the generic
con g (i.e. `ssl.keystore.location`).
metric.report A list of classes to use as metrics reporters. list "" lo
ers Implementing the
org.apache.kafka.common.metrics.Metric
sReporter interface allows plugging in classes
that will be noti ed of new metric creation. The
JmxReporter is always included to register JMX
statistics.
metrics.num. The number of samples maintained to compute
int 2 [1,...] lo
samples metrics.
metrics.recor
The highest recording level for metrics. string INFO lo
ding.level
metrics.samp The window of time a metrics sample is
long 30000 [1,...] lo
le.window.ms computed over.
quota.window The number of samples to retain in memory for
int 11 [1,...] lo
.num client quotas
quota.window
The time span of each sample for client quotas int 1 [1,...] lo
.size.seconds
replication.qu
The number of samples to retain in memory for
ota.window.n int 11 [1,...] lo
replication quotas
um
replication.qu
The time span of each sample for replication
ota.window.si int 1 [1,...] lo
quotas
ze.seconds
ssl.endpoint.i
The endpoint identi cation algorithm to validate
denti cation. string null lo
server hostname using server certi cate.
algorithm
ssl.secure.ran
The SecureRandom PRNG implementation to
dom.impleme string null lo
use for SSL cryptography operations.
ntation
transaction.a
bort.timed.ou
The interval at which to rollback transactions
t.transaction. int 60000 [1,...] lo
that have timed out
cleanup.interv
al.ms
transaction.re
The interval at which to remove transactions
move.expired.
that have expired due to
transaction.cl int 3600000 [1,...] lo
transactional.id.expiration.ms
eanup.interva
passing
l.ms
zookeeper.sy
How far a ZK follower can be behind a ZK leader int 2000 lo
nc.time.ms
More details about broker configuration can be found in the Scala class kafka.server.KafkaConfig.
Configurations pertinent to topics have both a server default as well as an optional per-topic override. If no per-topic configuration is given, the server default is used. The override can be set at topic creation time by giving one or more --config options, changed or set later using the alter configs command, and removed again when it is no longer needed. For example, a max message size override on my-topic can be removed with:
> bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --alter --delete-config max.message.bytes
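For reference, a creation-time override (here a custom max message size and flush rate) and a later addition look like the following sketches; the topic name and values are illustrative:
> bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic my-topic --partitions 1 --replication-factor 1 --config max.message.bytes=64000 --config flush.messages=1
> bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --alter --add-config max.message.bytes=128000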
The following are the topic-level configurations. The server's default configuration for this property is given under the Server Default Property heading. A given server default config value only applies to a topic if it does not have an explicit topic config override.
NAME    DESCRIPTION    TYPE    DEFAULT    VALID VALUES    SERVER DEFAULT PROPERTY    IMPORTANCE
ignored if
message.timestamp.type=Log
AppendTime.
De ne whether the timestamp
in the message is message
log.message.
message.tim create time or log append time.
string CreateTime timestamp.ty m
estamp.type The value should be either
pe
`CreateTime` or
`LogAppendTime`
This con guration controls
how frequently the log
compactor will attempt to
clean the log (assuming log
compaction is enabled). By
default we will avoid cleaning a
log where more than 50% of
log.cleaner.mi
min.cleanable the log has been compacted.
double 0.5 [0,...,1] n.cleanable.ra m
.dirty.ratio This ratio bounds the
tio
maximum space wasted in the
log by duplicates (at 50% at
most 50% of the log could be
duplicates). A higher ratio will
mean fewer, more e cient
cleanings but will mean more
wasted space in the log.
The minimum time a message
log.cleaner.mi
min.compacti will remain uncompacted in the
long 0 [0,...] n.compaction m
on.lag.ms log. Only applicable for logs
.lag.ms
that are being compacted.
When a producer sets acks to
"all" (or "-1"), this con guration
speci es the minimum number
of replicas that must
acknowledge a write for the
write to be considered
successful. If this minimum
cannot be met, then the
producer will raise an
exception (either
NotEnoughReplicas or
NotEnoughReplicasAfterAppen
min.insync.re min.insync.re
d). int 1 [1,...] m
plicas plicas
When used together,
min.insync.replicas and acks
allow you to enforce greater
durability guarantees. A typical
scenario would be to create a
topic with a replication factor
of 3, set min.insync.replicas to
2, and produce with acks of
"all". This will ensure that the
producer raises an exception if
a majority of replicas do not
receive a write.
True if we should preallocate
log.preallocat
preallocate the le on disk when creating a boolean false m
e
new log segment.
This con guration controls the
maximum size a partition
(which consists of log
segments) can grow to before
we will discard old log
segments to free up space if
retention.byte we are using the "delete" log.retention.
long -1 m
s retention policy. By default bytes
there is no size limit only a time
limit. Since this limit is
enforced at the partition level,
multiply it by the number of
partitions to compute the topic
retention in bytes.
retention.ms This con guration controls the long 604800000 log.retention. m
maximum time we will retain a ms
log before we will discard old
log segments to free up space
if we are using the "delete"
retention policy. This
represents an SLA on how
soon consumers must read
their data.
This con guration controls the
segment le size for the log.
Retention and cleaning is
segment.byte log.segment.
always done a le at a time so int 1073741824 [14,...] m
s bytes
a larger segment size means
fewer les but less granular
control over retention.
This con guration controls the
size of the index that maps
offsets to le positions. We
segment.inde log.index.size
preallocate this index le and int 10485760 [0,...] m
x.bytes .max.bytes
shrink it only after log rolls. You
generally should not need to
change this setting.
The maximum random jitter
subtracted from the scheduled
segment.jitter log.roll.jitter.
segment roll time to avoid long 0 [0,...] m
.ms ms
thundering herds of segment
rolling
This con guration controls the
period of time after which
Kafka will force the log to roll
segment.ms long 604800000 [0,...] log.roll.ms m
even if the segment le isn't full
to ensure that retention can
delete or compact old data.
Indicates whether to enable
unclean.leade replicas not in the ISR set to be unclean.leade
r.election.ena elected as leader as a last boolean false r.election.ena m
ble resort, even though doing so ble
may result in data loss.
NAME    DESCRIPTION    TYPE    DEFAULT    VALID VALUES    IMPORTANCE
A list of host/port pairs to use for establishing
the initial connection to the Kafka cluster. The
client will make use of all servers irrespective of
which servers are speci ed here for
bootstrappingthis list only impacts the initial
hosts used to discover the full set of servers.
bootstrap.ser
This list should be in the form list h
vers
host1:port1,host2:port2,... . Since these
servers are just used for the initial connection to
discover the full cluster membership (which may
change dynamically), this list need not contain
the full set of servers (you may want more than
one, though, in case a server is down).
Serializer class for key that implements the
key.serializer org.apache.kafka.common.serialization. class h
Serializer interface.
Serializer class for value that implements the
value.serializ
org.apache.kafka.common.serialization. class h
er
Serializer interface.
acks The number of acknowledgments the producer string 1 [all, -1, 0, 1] h
requires the leader to have received before
considering a request complete. This controls
the durability of records that are sent. The
following settings are allowed:
acks=0 If set to zero then the producer will
not wait for any acknowledgment from the
server at all. The record will be immediately
added to the socket buffer and considered
sent. No guarantee can be made that the
server has received the record in this case,
and the retries con guration will not take
effect (as the client won't generally know of
any failures). The offset given back for each
record will always be set to -1.
acks=1 This will mean the leader will write
the record to its local log but will respond
without awaiting full acknowledgement from
all followers. In this case should the leader
fail immediately after acknowledging the
record but before the followers have
replicated it then the record will be lost.
acks=all This means the leader will wait
for the full set of in-sync replicas to
acknowledge the record. This guarantees that
the record will not be lost as long as at least
one in-sync replica remains alive. This is the
strongest available guarantee. This is
equivalent to the acks=-1 setting.
together into fewer requests whenever multiple
records are being sent to the same partition.
This helps performance on both the client and
the server. This con guration controls the
default batch size in bytes.
receive.buffer The size of the TCP receive buffer (SO_RCVBUF) int 32768 [-1,...] m
.bytes to use when reading data. If the value is -1, the
OS default will be used.
The con guration controls the maximum
amount of time the client will wait for the
response of a request. If the response is not
received before the timeout elapses the client
request.timeo will resend the request if necessary or fail the
int 30000 [0,...] m
ut.ms request if retries are exhausted. This should be
larger than replica.lag.time.max.ms (a broker
con guration) to reduce the possibility of
message duplication due to unnecessary
producer retries.
JAAS login context parameters for SASL
connections in the format used by JAAS
sasl.jaas.con
con guration les. JAAS con guration le password null m
g
format is described here. The format for the
value is: ' (=)*;'
The Kerberos principal name that Kafka runs as.
sasl.kerberos.
This can be de ned either in Kafka's JAAS string null m
service.name
con g or in Kafka's con g.
SASL mechanism used for client connections.
sasl.mechani This may be any mechanism for which a security
string GSSAPI m
sm provider is available. GSSAPI is the default
mechanism.
Protocol used to communicate with brokers.
security.proto
Valid values are: PLAINTEXT, SSL, string PLAINTEXT m
col
SASL_PLAINTEXT, SASL_SSL.
The size of the TCP send buffer (SO_SNDBUF) to
send.buffer.b
use when sending data. If the value is -1, the OS int 131072 [-1,...] m
ytes
default will be used.
ssl.enabled.pr The list of protocols enabled for SSL TLSv1.2,TLSv
list m
otocols connections. 1.1,TLSv1
ssl.keystore.t The le format of the key store le. This is
string JKS m
ype optional for client.
The SSL protocol used to generate the
SSLContext. Default setting is TLS, which is ne
for most cases. Allowed values in recent JVMs
ssl.protocol are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and string TLS m
SSLv3 may be supported in older JVMs, but their
usage is discouraged due to known security
vulnerabilities.
The name of the security provider used for SSL
ssl.provider connections. Default value is the default security string null m
provider of the JVM.
ssl.truststore.
The le format of the trust store le. string JKS m
type
When set to 'true', the producer will ensure that
exactly one copy of each message is written in
the stream. If 'false', producer retries due to
broker failures, etc., may write duplicates of the
retried message in the stream. Note that
enable.idemp enabling idempotence requires
boolean false lo
otence max.in.flight.requests.per.connection
to be less than or equal to 5, retries to be
greater than 0 and acks must be 'all'. If these
values are not explicitly set by the user, suitable
values will be chosen. If incompatible values are
set, a Con gException will be thrown.
A list of classes to use as interceptors.
Implementing the
org.apache.kafka.clients.producer.Prod
interceptor.cl ucerInterceptor interface allows you to
list null lo
asses intercept (and possibly mutate) the records
received by the producer before they are
published to the Kafka cluster. By default, there
are no interceptors.
max.in. ight.r The maximum number of unacknowledged int 5 [1,...] lo
equests.per.c requests the client will send on a single
onnection connection before blocking. Note that if this
setting is set to be greater than 1 and there are
failed sends, there is a risk of message re-
ordering due to retries (i.e., if retries are
enabled).
The period of time in milliseconds after which
we force a refresh of metadata even if we
metadata.ma
haven't seen any partition leadership changes to long 300000 [0,...] lo
x.age.ms
proactively discover any new brokers or
partitions.
A list of classes to use as metrics reporters.
Implementing the
org.apache.kafka.common.metrics.Metric
metric.report
sReporter interface allows plugging in classes list "" lo
ers
that will be noti ed of new metric creation. The
JmxReporter is always included to register JMX
statistics.
metrics.num. The number of samples maintained to compute
int 2 [1,...] lo
samples metrics.
metrics.recor [INFO,
The highest recording level for metrics. string INFO lo
ding.level DEBUG]
metrics.samp The window of time a metrics sample is
long 30000 [0,...] lo
le.window.ms computed over.
The maximum amount of time in milliseconds to
wait when reconnecting to a broker that has
repeatedly failed to connect. If provided, the
reconnect.ba backoff per host will increase exponentially for
long 1000 [0,...] lo
ckoff.max.ms each consecutive connection failure, up to this
maximum. After calculating the backoff
increase, 20% random jitter is added to avoid
connection storms.
The base amount of time to wait before
attempting to reconnect to a given host. This
reconnect.ba
avoids repeatedly connecting to a host in a tight long 50 [0,...] lo
ckoff.ms
loop. This backoff applies to all connection
attempts by the client to a broker.
The amount of time to wait before attempting to
retry.backoff. retry a failed request to a given topic partition.
long 100 [0,...] lo
ms This avoids repeatedly sending requests in a
tight loop under some failure scenarios.
sasl.kerberos.
Kerberos kinit command path. string /usr/bin/kinit lo
kinit.cmd
sasl.kerberos.
Login thread sleep time between refresh
min.time.befo long 60000 lo
attempts.
re.relogin
sasl.kerberos.
Percentage of random jitter added to the
ticket.renew.ji double 0.05 lo
renewal time.
tter
sasl.kerberos. Login thread will sleep until the speci ed
ticket.renew. window factor of time from last refresh to
double 0.8 lo
window.facto ticket's expiry has been reached, at which time it
r will try to renew the ticket.
A list of cipher suites. This is a named
combination of authentication, encryption, MAC
ssl.cipher.suit and key exchange algorithm used to negotiate
list null lo
es the security settings for a network connection
using TLS or SSL network protocol. By default all
the available cipher suites are supported.
ssl.endpoint.i
The endpoint identi cation algorithm to validate
denti cation. string null lo
server hostname using server certi cate.
algorithm
The algorithm used by key manager factory for
ssl.keymanag SSL connections. Default value is the key
string SunX509 lo
er.algorithm manager factory algorithm con gured for the
Java Virtual Machine.
ssl.secure.ran The SecureRandom PRNG implementation to string null lo
dom.impleme use for SSL cryptography operations.
ntation
For those interested in the legacy Scala producer configs, information can be found here.
In 0.9.0.0 we introduced the new Java consumer as a replacement for the older Scala-based simple and high-level consumers. The configs for both new and old consumers are described below.
NAME    DESCRIPTION    TYPE    DEFAULT    VALID VALUES    IMPORTANCE
which can improve server throughput a bit at the
cost of some additional latency.
A unique string that identi es the consumer
group this consumer belongs to. This property is
required if the consumer uses either the group
group.id string "" h
management functionality by using
subscribe(topic) or the Kafka-based offset
management strategy.
The expected time between heartbeats to the
consumer coordinator when using Kafka's group
management facilities. Heartbeats are used to
ensure that the consumer's session stays active
heartbeat.inte and to facilitate rebalancing when new
int 3000 h
rval.ms consumers join or leave the group. The value
must be set lower than session.timeout.ms ,
but typically should be set no higher than 1/3 of
that value. It can be adjusted even lower to
control the expected time for normal rebalances.
The maximum amount of data per-partition the
server will return. Records are fetched in
batches by the consumer. If the rst record
batch in the rst non-empty partition of the fetch
is larger than this limit, the batch will still be
max.partition. returned to ensure that the consumer can make
int 1048576 [0,...] h
fetch.bytes progress. The maximum record batch size
accepted by the broker is de ned via
message.max.bytes (broker con g) or
max.message.bytes (topic con g). See
fetch.max.bytes for limiting the consumer
request size.
The timeout used to detect consumer failures
when using Kafka's group management facility.
The consumer sends periodic heartbeats to
indicate its liveness to the broker. If no
heartbeats are received by the broker before the
session.timeo expiration of this session timeout, then the
int 10000 h
ut.ms broker will remove this consumer from the group
and initiate a rebalance. Note that the value
must be in the allowable range as con gured in
the broker con guration by
group.min.session.timeout.ms and
group.max.session.timeout.ms .
ssl.key.passw The password of the private key in the key store
password null h
ord le. This is optional for client.
The location of the key store le. This is optional
ssl.keystore.l
for client and can be used for two-way string null h
ocation
authentication for client.
The store password for the key store le. This is
ssl.keystore.p
optional for client and only needed if password null h
assword
ssl.keystore.location is con gured.
ssl.truststore.
The location of the trust store le. string null h
location
The password for the trust store le. If a
ssl.truststore.
password is not set access to the truststore is password null h
password
still available, but integrity checking is disabled.
What to do when there is no initial offset in
Kafka or if the current offset does not exist any
more on the server (e.g. because that data has
been deleted):
earliest: automatically reset the offset to the
earliest offset
auto.offset.re [latest,
latest: automatically reset the offset to the string latest m
set earliest, none]
latest offset
none: throw exception to the consumer if no
previous offset is found for the consumer's
group
anything else: throw exception to the
consumer.
max.idle.ms milliseconds speci ed by this con g.
enable.auto.c If true the consumer's offset will be periodically
boolean true m
ommit committed in the background.
Whether records from internal topics (such as
exclude.intern offsets) should be exposed to the consumer. If
boolean true m
al.topics set to true the only way to receive records
from an internal topic is subscribing to it.
The maximum amount of data the server should
return for a fetch request. Records are fetched in
batches by the consumer, and if the rst record
batch in the rst non-empty partition of the fetch
is larger than this value, the record batch will still
be returned to ensure that the consumer can
fetch.max.byt
make progress. As such, this is not a absolute int 52428800 [0,...] m
es
maximum. The maximum record batch size
accepted by the broker is de ned via
message.max.bytes (broker con g) or
max.message.bytes (topic con g). Note that
the consumer performs multiple fetches in
parallel.
Controls how to read messages written
transactionally. If set to read_committed ,
consumer.poll() will only return transactional
messages which have been committed. If set to
read_uncommitted ' (the default),
consumer.poll() will return all messages, even
transactional messages which have been
aborted. Non-transactional messages will be
returned unconditionally in either mode.
con guration les. JAAS con guration le
format is described here. The format for the
value is: ' (=)*;'
The Kerberos principal name that Kafka runs as.
sasl.kerberos.
This can be de ned either in Kafka's JAAS string null m
service.name
con g or in Kafka's con g.
SASL mechanism used for client connections.
sasl.mechani This may be any mechanism for which a security
string GSSAPI m
sm provider is available. GSSAPI is the default
mechanism.
Protocol used to communicate with brokers.
security.proto
Valid values are: PLAINTEXT, SSL, string PLAINTEXT m
col
SASL_PLAINTEXT, SASL_SSL.
The size of the TCP send buffer (SO_SNDBUF) to
send.buffer.b
use when sending data. If the value is -1, the OS int 131072 [-1,...] m
ytes
default will be used.
ssl.enabled.pr The list of protocols enabled for SSL TLSv1.2,TLSv
list m
otocols connections. 1.1,TLSv1
ssl.keystore.t The le format of the key store le. This is
string JKS m
ype optional for client.
The SSL protocol used to generate the
SSLContext. Default setting is TLS, which is ne
for most cases. Allowed values in recent JVMs
ssl.protocol are TLS, TLSv1.1 and TLSv1.2. SSL, SSLv2 and string TLS m
SSLv3 may be supported in older JVMs, but their
usage is discouraged due to known security
vulnerabilities.
The name of the security provider used for SSL
ssl.provider connections. Default value is the default security string null m
provider of the JVM.
ssl.truststore.
The le format of the trust store le. string JKS m
type
The frequency in milliseconds that the
auto.commit.i
consumer offsets are auto-committed to Kafka int 5000 [0,...] lo
nterval.ms
if enable.auto.commit is set to true .
Automatically check the CRC32 of the records
consumed. This ensures no on-the-wire or on-
check.crcs disk corruption to the messages occurred. This boolean true lo
check adds some overhead, so it may be
disabled in cases seeking extreme performance.
An id string to pass to the server when making
requests. The purpose of this is to be able to
client.id track the source of requests beyond just ip/port string "" lo
by allowing a logical application name to be
included in server-side request logging.
The maximum amount of time the server will
fetch.max.wai block before answering the fetch request if there
int 500 [0,...] lo
t.ms isn't su cient data to immediately satisfy the
requirement given by fetch.min.bytes.
A list of classes to use as interceptors.
Implementing the
org.apache.kafka.clients.consumer.Cons
interceptor.cl
umerInterceptor interface allows you to list null lo
asses
intercept (and possibly mutate) records received
by the consumer. By default, there are no
interceptors.
The period of time in milliseconds after which
we force a refresh of metadata even if we
metadata.ma
haven't seen any partition leadership changes to long 300000 [0,...] lo
x.age.ms
proactively discover any new brokers or
partitions.
metric.report A list of classes to use as metrics reporters. list "" lo
ers Implementing the
org.apache.kafka.common.metrics.Metric
sReporter interface allows plugging in classes
that will be noti ed of new metric creation. The
JmxReporter is always included to register JMX
statistics.
The essential old consumer configurations are the following:
group.id
zookeeper.connect
group.id: A string that uniquely identifies the group of consumer processes to which this consumer belongs. By setting the same group id, multiple processes indicate that they are all part of the same consumer group.
zookeeper.connect: Specifies the ZooKeeper connection string in the form hostname:port where host and port are the host and port of a ZooKeeper server. To allow connecting through other ZooKeeper nodes when that ZooKeeper machine is down you can also specify multiple hosts in the form hostname1:port1,hostname2:port2,hostname3:port3. The server may also have a ZooKeeper chroot path as part of its ZooKeeper connection string which puts its data under some path in the global ZooKeeper namespace. If so the consumer should use the same chroot path in its connection string. For example to give a chroot path of /chroot/path you would give the connection string as hostname1:port1,hostname2:port2,hostname3:port3/chroot/path.
consumer.id (default: null): Generated automatically if not set.
socket.timeout.ms (default: 30 * 1000): The socket timeout for network requests. The actual timeout set will be max.fetch.wait + socket.timeout.ms.
socket.receive.buffer.bytes (default: 64 * 1024): The socket receive buffer for network requests.
fetch.message.max.bytes (default: 1024 * 1024): The number of bytes of messages to attempt to fetch for each topic-partition in each fetch request. These bytes will be read into memory for each partition, so this helps control the memory used by the consumer. The fetch request size must be at least as large as the maximum message size the server allows or else it is possible for the producer to send messages larger than the consumer can fetch.
num.consumer.fetchers (default: 1): The number of fetcher threads used to fetch data.
auto.commit.enable (default: true): If true, periodically commit to ZooKeeper the offset of messages already fetched by the consumer. This committed offset will be used when the process fails as the position from which the new consumer will begin.
auto.commit.interval.ms (default: 60 * 1000): The frequency in ms that the consumer offsets are committed to zookeeper.
queued.max.message.chunks (default: 2): Max number of message chunks buffered for consumption. Each chunk can be up to fetch.message.max.bytes.
rebalance.max.retries (default: 4): When a new consumer joins a consumer group the set of consumers attempt to "rebalance" the load to assign partitions to each consumer. If the set of consumers changes while this assignment is taking place the rebalance will fail and retry. This setting controls the maximum number of attempts before giving up.
fetch.min.bytes (default: 1): The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request.
fetch.wait.max.ms (default: 100): The maximum amount of time the server will block before answering the fetch request if there isn't sufficient data to immediately satisfy fetch.min.bytes.
rebalance.backoff.ms (default: 2000): Backoff time between retries during rebalance. If not set explicitly, the value in zookeeper.sync.time.ms is used.
refresh.leader.backoff.ms (default: 200): Backoff time to wait before trying to determine the leader of a partition that has just lost its leader.
auto.offset.reset (default: largest): What to do when there is no initial offset in ZooKeeper or if an offset is out of range:
* smallest: automatically reset the offset to the smallest offset
* largest: automatically reset the offset to the largest offset
* anything else: throw exception to the consumer
consumer.timeout.ms (default: -1): Throw a timeout exception to the consumer if no message is available for consumption after the specified interval.
exclude.internal.topics (default: true): Whether messages from internal topics (such as offsets) should be exposed to the consumer.
client.id (default: group id value): The client id is a user-specified string sent in each request to help trace calls. It should logically identify the application making the request.
zookeeper.session.timeout.ms (default: 6000): ZooKeeper session timeout. If the consumer fails to heartbeat to ZooKeeper for this period of time it is considered dead and a rebalance will occur.
zookeeper.connection.timeout.ms (default: 6000): The max time that the client waits while establishing a connection to zookeeper.
zookeeper.sync.time.ms (default: 2000): How far a ZK follower can be behind a ZK leader.
offsets.storage (default: zookeeper): Select where offsets should be stored (zookeeper or kafka).
offsets.channel.backoff.ms (default: 1000): The backoff period when reconnecting the offsets channel or retrying failed offset fetch/commit requests.
offsets.channel.socket.timeout.ms (default: 10000): Socket timeout when reading responses for offset fetch/commit requests. This timeout is also used for ConsumerMetadata requests that are used to query for the offset manager.
offsets.commit.max.retries (default: 5): Retry the offset commit up to this many times on failure. This retry count only applies to offset commits during shut-down. It does not apply to commits originating from the auto-commit thread. It also does not apply to attempts to query for the offset coordinator before committing offsets, i.e. if a consumer metadata request fails for any reason, it will be retried and that retry does not count toward this limit.
dual.commit.enabled (default: true): If you are using "kafka" as offsets.storage, you can dual commit offsets to ZooKeeper (in addition to Kafka). This is required during migration from zookeeper-based offset storage to kafka-based offset storage. With respect to any given consumer group, it is safe to turn this off after all instances within that group have been migrated to the new version that commits offsets to the broker (instead of directly to ZooKeeper).
partition.assignment.strategy (default: range): Select between the "range" or "roundrobin" strategy for assigning partitions to consumer streams. The round-robin partition assignor lays out all the available partitions and all the available consumer threads. It then proceeds to do a round-robin assignment from partition to consumer thread. If the subscriptions of all consumer instances are identical, then the partitions will be uniformly distributed (i.e., the partition ownership counts will be within a delta of exactly one across all consumer threads). Round-robin assignment is permitted only if: (a) every topic has the same number of streams within a consumer instance, and (b) the set of subscribed topics is identical for every consumer instance within the group. Range partitioning works on a per-topic basis. For each topic, we lay out the available partitions in numeric order and the consumer threads in lexicographic order. We then divide the number of partitions by the total number of consumer streams (threads) to determine the number of partitions to assign to each consumer. If it does not evenly divide, then the first few consumers will have one extra partition.
More details about consumer configuration can be found in the scala class kafka.consumer.ConsumerConfig .
rest.advertised.port: If this is set, this is the port that will be given out to other workers to connect to. (type: int; default: null; importance: low)
rest.host.name: Hostname for the REST API. If this is set, it will only bind to this interface. (type: string; default: null; importance: low)
rest.port: Port for the REST API to listen on. (type: int; default: 8083; importance: low)
retry.backoff.ms: The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios. (type: long; default: 100; valid values: [0,...]; importance: low)
sasl.kerberos.kinit.cmd: Kerberos kinit command path. (type: string; default: /usr/bin/kinit; importance: low)
sasl.kerberos.min.time.before.relogin: Login thread sleep time between refresh attempts. (type: long; default: 60000; importance: low)
sasl.kerberos.ticket.renew.jitter: Percentage of random jitter added to the renewal time. (type: double; default: 0.05; importance: low)
sasl.kerberos.ticket.renew.window.factor: Login thread will sleep until the specified window factor of time from last refresh to ticket's expiry has been reached, at which time it will try to renew the ticket. (type: double; default: 0.8; importance: low)
ssl.cipher.suites: A list of cipher suites. This is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. By default all the available cipher suites are supported. (type: list; default: null; importance: low)
ssl.endpoint.identification.algorithm: The endpoint identification algorithm to validate server hostname using server certificate. (type: string; default: null; importance: low)
ssl.keymanager.algorithm: The algorithm used by key manager factory for SSL connections. Default value is the key manager factory algorithm configured for the Java Virtual Machine. (type: string; default: SunX509; importance: low)
ssl.secure.random.implementation: The SecureRandom PRNG implementation to use for SSL cryptography operations. (type: string; default: null; importance: low)
ssl.trustmanager.algorithm: The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine. (type: string; default: PKIX; importance: low)
status.storage.partitions: The number of partitions used when creating the status storage topic. (type: int; default: 5; valid values: [1,...]; importance: low)
status.storage.replication.factor: Replication factor used when creating the status storage topic. (type: short; default: 3; valid values: [1,...]; importance: low)
task.shutdown.graceful.timeout.ms: Amount of time to wait for tasks to shutdown gracefully. This is the total amount of time, not per task. All tasks have shutdown triggered, then they are waited on sequentially. (type: long; default: 5000; importance: low)
bootstrap.servers: A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping; this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down). (importance: high)
replication.factor: The replication factor for change log topics and repartition topics created by the stream processing application. (type: int; default: 1; importance: high)
state.dir: Directory location for state store. (type: string; default: /tmp/kafka-streams; importance: high)
cache.max.bytes.buffering: Maximum number of memory bytes to be used for buffering across all threads. (type: long; default: 10485760; valid values: [0,...]; importance: medium)
client.id: An ID prefix string used for the client IDs of internal consumer, producer and restore-consumer, with pattern '<client.id>-StreamThread-<threadSequenceNumber>-<consumer|producer|restore-consumer>'. (type: string; default: ""; importance: medium)
default.deserialization.exception.handler: Exception handling class that implements the org.apache.kafka.streams.errors.DeserializationExceptionHandler interface. (type: class; default: org.apache.kafka.streams.errors.LogAndFailExceptionHandler; importance: medium)
default.key.serde: Default serializer / deserializer class for key that implements the org.apache.kafka.common.serialization.Serde interface. (type: class; default: org.apache.kafka.common.serialization.Serdes$ByteArraySerde; importance: medium)
default.timestamp.extractor: Default timestamp extractor class that implements the org.apache.kafka.streams.processor.TimestampExtractor interface. (type: class; default: org.apache.kafka.streams.processor.FailOnInvalidTimestamp; importance: medium)
default.value.serde: Default serializer / deserializer class for value that implements the org.apache.kafka.common.serialization.Serde interface. (type: class; default: org.apache.kafka.common.serialization.Serdes$ByteArraySerde; importance: medium)
num.standby.replicas: The number of standby replicas for each task. (type: int; default: 0; importance: medium)
num.stream.threads: The number of threads to execute stream processing. (type: int; default: 1; importance: medium)
processing.guarantee: The processing guarantee that should be used. Possible values are at_least_once (default) and exactly_once. (type: string; default: at_least_once; valid values: [at_least_once, exactly_once]; importance: medium)
security.protocol: Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, SASL_SSL. (type: string; default: PLAINTEXT; importance: medium)
application.server: A host:port pair pointing to an embedded user defined endpoint that can be used for discovering the locations of state stores within a single KafkaStreams application. (type: string; default: ""; importance: low)
buffered.records.per.partition: The maximum number of records to buffer per partition. (type: int; default: 1000; importance: low)
commit.interval.ms: The frequency with which to save the position of the processor. (Note, if 'processing.guarantee' is set to 'exactly_once', the default value is 100, otherwise the default value is 30000.) (type: long; default: 30000; importance: low)
connections.max.idle.ms: Close idle connections after the number of milliseconds specified by this config. (type: long; default: 540000; importance: low)
key.serde: Serializer / deserializer class for key that implements the org.apache.kafka.common.serialization.Serde interface. This config is deprecated, use default.key.serde instead. (type: class; default: null; importance: low)
metadata.max.age.ms: The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions. (type: long; default: 300000; valid values: [0,...]; importance: low)
metric.reporters: A list of classes to use as metrics reporters. Implementing the org.apache.kafka.common.metrics.MetricsReporter interface allows plugging in classes that will be notified of new metric creation. The JmxReporter is always included to register JMX statistics. (type: list; default: ""; importance: low)
metrics.num.samples: The number of samples maintained to compute metrics. (type: int; default: 2; valid values: [1,...]; importance: low)
metrics.recording.level: The highest recording level for metrics. (type: string; default: INFO; valid values: [INFO, DEBUG]; importance: low)
metrics.sample.window.ms: The window of time a metrics sample is computed over. (type: long; default: 30000; valid values: [0,...]; importance: low)
partition.grouper: Partition grouper class that implements the org.apache.kafka.streams.processor.PartitionGrouper interface. (type: class; default: org.apache.kafka.streams.processor.DefaultPartitionGrouper; importance: low)
poll.ms: The amount of time in milliseconds to block waiting for input. (type: long; default: 100; importance: low)
receive.buffer.bytes: The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used. (type: int; default: 32768; valid values: [0,...]; importance: low)
reconnect.backoff.max.ms: The maximum amount of time in milliseconds to wait when reconnecting to a broker that has repeatedly failed to connect. If provided, the backoff per host will increase exponentially for each consecutive connection failure, up to this maximum. After calculating the backoff increase, 20% random jitter is added to avoid connection storms. (type: long; default: 1000; valid values: [0,...]; importance: low)
reconnect.backoff.ms: The base amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all connection attempts by the client to a broker. (type: long; default: 50; valid values: [0,...]; importance: low)
request.timeout.ms: The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted. (type: int; default: 40000; valid values: [0,...]; importance: low)
retry.backoff.ms: The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios. (type: long; default: 100; valid values: [0,...]; importance: low)
rocksdb.config.setter: A Rocks DB config setter class or class name that implements the org.apache.kafka.streams.state.RocksDBConfigSetter interface. (type: class; default: null; importance: low)
send.buffer.bytes: The size of the TCP send buffer (SO_SNDBUF) to use when sending data. If the value is -1, the OS default will be used. (type: int; default: 131072; valid values: [0,...]; importance: low)
state.cleanup.delay.ms: The amount of time in milliseconds to wait before deleting state when a partition has migrated. Only state directories that have not been modified for at least state.cleanup.delay.ms will be removed. (type: long; default: 600000; importance: low)
timestamp.extractor: Timestamp extractor class that implements the org.apache.kafka.streams.processor.TimestampExtractor interface. This config is deprecated, use default.timestamp.extractor instead. (type: class; default: null; importance: low)
value.serde: Serializer / deserializer class for value that implements the org.apache.kafka.common.serialization.Serde interface. This config is deprecated, use default.value.serde instead. (type: class; default: null; importance: low)
windowstore.changelog.additional.retention.ms: Added to a windows maintainMs to ensure data is not deleted from the log prematurely. Allows for clock drift. Default is 1 day. (type: long; default: 86400000; importance: low)
zookeeper.connect: Zookeeper connect string for Kafka topics management. This config is deprecated and will be ignored as Streams API does not use Zookeeper anymore. (type: string; default: ""; importance: low)
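As a minimal sketch (not part of the original reference; the application id, broker address, and topic names are placeholders), the Streams settings above are normally supplied through a java.util.Properties object when building the application:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-streams-app");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Trivial pass-through topology, just to show how the config is attached.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                       // placeholder topics

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}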
retry.backoff.ms: The amount of time to wait before attempting to retry a failed request. This avoids repeatedly sending requests in a tight loop under some failure scenarios. (importance: low)
sasl.kerberos.kinit.cmd: Kerberos kinit command path. (type: string; default: /usr/bin/kinit; importance: low)
sasl.kerberos.min.time.before.relogin: Login thread sleep time between refresh attempts. (type: long; default: 60000; importance: low)
sasl.kerberos.ticket.renew.jitter: Percentage of random jitter added to the renewal time. (type: double; default: 0.05; importance: low)
sasl.kerberos.ticket.renew.window.factor: Login thread will sleep until the specified window factor of time from last refresh to ticket's expiry has been reached, at which time it will try to renew the ticket. (type: double; default: 0.8; importance: low)
ssl.cipher.suites: A list of cipher suites. This is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. By default all the available cipher suites are supported. (type: list; default: null; importance: low)
ssl.endpoint.identification.algorithm: The endpoint identification algorithm to validate server hostname using server certificate. (type: string; default: null; importance: low)
ssl.keymanager.algorithm: The algorithm used by key manager factory for SSL connections. Default value is the key manager factory algorithm configured for the Java Virtual Machine. (type: string; default: SunX509; importance: low)
ssl.secure.random.implementation: The SecureRandom PRNG implementation to use for SSL cryptography operations. (type: string; default: null; importance: low)
ssl.trustmanager.algorithm: The algorithm used by trust manager factory for SSL connections. Default value is the trust manager factory algorithm configured for the Java Virtual Machine. (type: string; default: PKIX; importance: low)
4. DESIGN
4.1 Motivation
We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.
It would have to have high-throughput to support high volume event streams such as real-time log aggregation.
It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.
It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.
We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.
Finally in cases where the stream is fed into other data systems for serving, we knew the system would have to be able to guarantee fault-tolerance in the presence of machine failures.
Supporting these uses led us to a design with a number of unique elements, more akin to a database log than a traditional messaging system. We will outline some elements of the design in the following sections.
4.2 Persistence
Kafka relies heavily on the filesystem for storing and caching messages. There is a general perception that "disks are slow" which makes people skeptical that a persistent structure can offer competitive performance. In fact disks are both much slower and much faster than people expect depending on how they are used; and a properly designed disk structure can often be as fast as the network.
The key fact about disk performance is that the throughput of hard drives has been diverging from the latency of a disk seek for the last decade. As a result the performance of linear writes on a JBOD configuration with six 7200rpm SATA RAID-5 array is about 600MB/sec but the performance of random writes is only about 100k/sec, a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns, and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes. A further discussion of this issue can be found in this ACM Queue article; they actually find that sequential disk access can in some cases be faster than random memory access!
To compensate for this performance divergence, modern operating systems have become increasingly aggressive in their use of main memory for disk caching. A modern OS will happily divert all free memory to disk caching with little performance penalty when the memory is reclaimed. All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.
Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:
1. The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
2. Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.
As a result of these factors using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure: we at least double the available cache by having automatic access to all free memory, and likely double again by storing a compact byte structure rather than individual objects. Doing so will result in a cache of up to 28-30GB on a 32GB machine without GC penalties. Furthermore, this cache will stay warm even if the service is restarted, whereas the in-process cache will need to be rebuilt in memory (which for a 10GB cache may take 10 minutes) or else it will need to start with a completely cold cache (which likely means terrible initial performance). This also greatly simplifies the code as all logic for maintaining coherency between the cache and filesystem is now in the OS, which tends to do so more efficiently and more correctly than one-off in-process attempts. If your disk usage favors linear reads then read-ahead is effectively pre-populating this cache with useful data on each disk read.
This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's pagecache.
This style of pagecache-centric design is described in an article on the design of Varnish here (along with a healthy dose of arrogance).
The persistent data structures used in messaging systems are often a per-consumer queue with an associated BTree or other general-purpose random access data structures to maintain metadata about messages. BTrees are the most versatile data structure available, and make it possible to support a wide variety of transactional and non-transactional semantics in the messaging system. They do come with a fairly high cost, though: Btree operations are O(log N). Normally O(log N) is considered essentially equivalent to constant time, but this is not true for disk operations. Disk seeks come at 10 ms a pop, and each disk can do only one seek at a time so parallelism is limited. Hence even a handful of disk seeks leads to very high overhead. Since storage systems mix very fast cached operations with very slow physical disk operations, the observed performance of tree structures is often superlinear as data increases with fixed cache; i.e. doubling your data makes things much worse than twice as slow.
Intuitively a persistent queue could be built on simple reads and appends to files as is commonly the case with logging solutions. This structure has the advantage that all operations are O(1) and reads do not block writes or each other. This has obvious performance advantages since the performance is completely decoupled from the data size: one server can now take full advantage of a number of cheap, low-rotational speed 1+TB SATA drives. Though they have poor seek performance, these drives have acceptable performance for large reads and writes and come at 1/3 the price and 3x the capacity.
Having access to virtually unlimited disk space without any performance penalty means that we can provide some features not usually found in a messaging system. For example, in Kafka, instead of attempting to delete messages as soon as they are consumed, we can retain messages for a relatively long period (say a week). This leads to a great deal of flexibility for consumers, as we will describe.
4.3 Efficiency
We have put significant effort into efficiency. One of our primary use cases is handling web activity data, which is very high volume: each page view may generate dozens of writes. Furthermore, we assume each message published is read by at least one consumer (often many), hence we strive to make consumption as cheap as possible.
We have also found, from experience building and running a number of similar systems, that efficiency is a key to effective multi-tenant operations. If the downstream infrastructure service can easily become a bottleneck due to a small bump in usage by the application, such small changes will often create problems. By being very fast we help ensure that the application will tip-over under load before the infrastructure. This is particularly important when trying to run a centralized service that supports dozens or hundreds of applications on a centralized cluster as changes in usage patterns are a near-daily occurrence.
We discussed disk efficiency in the previous section. Once poor disk access patterns have been eliminated, there are two common causes of inefficiency in this type of system: too many small I/O operations, and excessive byte copying.
The small I/O problem happens both between the client and the server and in the server's own persistent operations.
To avoid this, our protocol is built around a "message set" abstraction that naturally groups messages together. This allows network requests to group messages together and amortize the overhead of the network roundtrip rather than sending a single message at a time. The server in turn appends chunks of messages to its log in one go, and the consumer fetches large linear chunks at a time.
This simple optimization produces orders of magnitude speed up. Batching leads to larger network packets, larger sequential disk operations, contiguous memory blocks, and so on, all of which allows Kafka to turn a bursty stream of random message writes into linear writes that flow to the consumers.
The other inefficiency is in byte copying. At low message rates this is not an issue, but under load the impact is significant. To avoid this we employ a standardized binary message format that is shared by the producer, the broker, and the consumer (so data chunks can be transferred without modification between them).
The message log maintained by the broker is itself just a directory of files, each populated by a sequence of message sets that have been written to disk in the same format used by the producer and consumer. Maintaining this common format allows optimization of the most important operation: network transfer of persistent log chunks. Modern unix operating systems offer a highly optimized code path for transferring data out of pagecache to a socket; in Linux this is done with the sendfile system call.
To understand the impact of sendfile, it is important to understand the common data path for transfer of data from file to socket:
1. The operating system reads data from the disk into pagecache in kernel space
2. The application reads the data from kernel space into a user-space buffer
3. The application writes the data back into kernel space into a socket buffer
4. The operating system copies the data from the socket buffer to the NIC buffer where it is sent over the network
This is clearly inefficient, there are four copies and two system calls. Using sendfile, this re-copying is avoided by allowing the OS to send the data from pagecache to the network directly. So in this optimized path, only the final copy to the NIC buffer is needed.
We expect a common use case to be multiple consumers on a topic. Using the zero-copy optimization above, data is copied into pagecache exactly once and reused on each consumption instead of being stored in memory and copied out to user-space every time it is read. This allows messages to be consumed at a rate that approaches the limit of the network connection.
This combination of pagecache and sendfile means that on a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks whatsoever as they will be serving data entirely from cache.
For more background on the sendfile and zero-copy support in Java, see this article.
In some cases the bottleneck is actually not CPU or disk but network bandwidth. This is particularly true for a data pipeline that needs to send messages between data centers over a wide-area network. Of course, the user can always compress its messages one at a time without any support needed from Kafka, but this can lead to very poor compression ratios as much of the redundancy is due to repetition between messages of the same type (e.g. field names in JSON or user agents in web logs or common string values). Efficient compression requires compressing multiple messages together rather than compressing each message individually.
Kafka supports this with an efficient batching format. A batch of messages can be clumped together, compressed, and sent to the server in this form. This batch of messages will be written in compressed form and will remain compressed in the log and will only be decompressed by the consumer.
Kafka supports GZIP, Snappy and LZ4 compression protocols. More details on compression can be found here.
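As a minimal sketch (broker address and topic name are placeholders), enabling batch compression is a single producer setting; the broker stores and serves the batch in the same compressed form described above:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("compression.type", "lz4");                // other options: gzip, snappy, none
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
        }
    }
}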
Load balancing
The producer sends data directly to the broker that is the leader for the partition without any intervening routing tier. To help the producer do this all Kafka nodes can answer a request for metadata about which servers are alive and where the leaders for the partitions of a topic are at any given time to allow the producer to appropriately direct its requests.
The client controls which partition it publishes messages to. This can be done at random, implementing a kind of random load balancing, or it can be done by some semantic partitioning function. We expose the interface for semantic partitioning by allowing the user to specify a key to partition by and using this to hash to a partition (there is also an option to override the partition function if need be). For example if the key chosen was a user id then all data for a given user would be sent to the same partition. This in turn will allow consumers to make locality assumptions about their consumption. This style of partitioning is explicitly designed to allow locality-sensitive processing in consumers, as the sketch below illustrates.
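A minimal sketch of semantic partitioning (the topic, broker address, and events are placeholders): records that share a key, here a user id, hash to the same partition, so a single consumer sees all events for that user in order.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-123";                      // partitioning key
            // Both records hash to the same partition because they share a key.
            producer.send(new ProducerRecord<>("page-views", userId, "viewed /home"));
            producer.send(new ProducerRecord<>("page-views", userId, "viewed /cart"));
        }
    }
}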
Asynchronous send
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of bytes and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and fewer, larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
Details on configuration and the api for the producer can be found elsewhere in the documentation.
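As a hedged illustration (the values are arbitrary examples, not recommendations), the batching knobs described above map onto a few producer settings: accumulate up to roughly 64 KB per partition, or wait at most 10 ms, whichever comes first.

import java.util.Properties;

public class BatchingConfigExample {
    public static Properties batchingProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("batch.size", "65536");                    // max bytes per per-partition batch
        props.put("linger.ms", "10");                        // extra latency to wait for a fuller batch
        props.put("buffer.memory", "33554432");              // total memory for buffering (32 MB)
        return props;
    }
}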
The Kafka consumer works by issuing "fetch" requests to the brokers leading the partitions it wants to consume. The consumer specifies its offset in the log with each request and receives back a chunk of log beginning from that position. The consumer thus has significant control over this position and can rewind it to re-consume data if need be.
An initial question we considered is whether consumers should pull data from brokers or brokers should push data to the consumer. In this respect Kafka follows a more traditional design, shared by most messaging systems, where data is pushed to the broker from the producer and pulled from the broker by the consumer. Some logging-centric systems, such as Scribe and Apache Flume, follow a very different push-based path where data is pushed downstream. There are pros and cons to both approaches. However, a push-based system has difficulty dealing with diverse consumers as the broker controls the rate at which data is transferred. The goal is generally for the consumer to be able to consume at the maximum possible rate; unfortunately, in a push system this means the consumer tends to be overwhelmed when its rate of consumption falls below the rate of production (a denial of service attack, in essence). A pull-based system has the nicer property that the consumer simply falls behind and catches up when it can. This can be mitigated with some kind of backoff protocol by which the consumer can indicate it is overwhelmed, but getting the rate of transfer to fully utilize (but never over-utilize) the consumer is trickier than it seems. Previous attempts at building systems in this fashion led us to go with a more traditional pull model.
Another advantage of a pull-based system is that it lends itself to aggressive batching of data sent to the consumer. A push-based system must choose to either send a request immediately or accumulate more data and then send it later without knowledge of whether the downstream consumer will be able to immediately process it. If tuned for low latency, this will result in sending a single message at a time only for it to end up being buffered anyway, which is wasteful. A pull-based design fixes this as the consumer always pulls all available messages after its current position in the log (or up to some configurable max size). So one gets optimal batching without introducing unnecessary latency.
The deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this we have parameters in our pull request that allow the consumer request to block in a "long poll" waiting until data arrives (and optionally waiting until a given number of bytes is available to ensure large transfer sizes).
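As a hedged illustration (values are examples only), the long-poll behaviour described above is driven by two consumer settings: the broker holds a fetch request until at least fetch.min.bytes are available or fetch.max.wait.ms has elapsed.

import java.util.Properties;

public class LongPollConfigExample {
    public static Properties fetchProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "example-group");              // placeholder group id
        props.put("fetch.min.bytes", "65536");                // wait for at least 64 KB of data...
        props.put("fetch.max.wait.ms", "500");                // ...but block no longer than 500 ms
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}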
You could imagine other possible designs which would be only pull, end-to-end. The producer would locally write to a local log, and brokers would pull from that with consumers pulling from them. A similar type of "store-and-forward" producer is often proposed. This is intriguing but we felt not very suitable for our target use cases which have thousands of producers. Our experience running persistent data systems at scale led us to feel that involving thousands of disks in the system across many applications would not actually make things more reliable and would be a nightmare to operate. And in practice we have found that we can run a pipeline with strong SLAs at large scale without a need for producer persistence.
Consumer Position
Keeping track of what has been consumed is, surprisingly, one of the key performance points of a messaging system.
Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker either records that fact locally immediately or it may wait for acknowledgement from the consumer. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else this state could go. Since the data structures used for storage in many messaging systems scale poorly, this is also a pragmatic choice--since the broker knows what is consumed it can immediately delete it, keeping the data size small.
What is perhaps not obvious is that getting the broker and consumer to come into agreement about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say because it crashes or the request times out or whatever) that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature which means that messages are only marked as sent, not consumed, when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. The second problem is around performance, now the broker must keep multiple states about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). Tricky problems must be dealt with, like what to do with messages that are sent but never acknowledged.
Kafka handles this differently. Our topic is divided into a set of totally ordered partitions, each of which is consumed by exactly one consumer within each subscribing consumer group at any given time. This means that the position of a consumer in each partition is just a single integer, the offset of the next message to consume. This makes the state about what has been consumed very small, just one number for each partition. This state can be periodically checkpointed. This makes the equivalent of message acknowledgements very cheap.
There is a side benefit of this decision. A consumer can deliberately rewind back to an old offset and re-consume data. This violates the common contract of a queue, but turns out to be an essential feature for many consumers. For example, if the consumer code has a bug and is discovered after some messages are consumed, the consumer can re-consume those messages once the bug is fixed.
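A minimal sketch of such a rewind (the topic, partition, broker address, and offset are placeholders): the consumer explicitly assigns a partition and seeks back to an earlier offset to re-consume data.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "replay-group");                // placeholder group id
        props.put("enable.auto.commit", "false");             // manage offsets manually while replaying
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("example-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42L);                           // rewind to an earlier offset (placeholder value)
            consumer.poll(1000);                              // returns records from offset 42 onwards
        }
    }
}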
Scalable persistence allows for the possibility of consumers that only periodically consume, such as batch data loads that periodically bulk-load data into an offline system such as Hadoop or a relational data warehouse.
In the case of Hadoop we parallelize the data load by splitting the load over individual map tasks, one for each node/topic/partition combination, allowing full parallelism in the loading. Hadoop provides the task management, and tasks which fail can restart without danger of duplicate data, they simply restart from their original position.
Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:
* At most once: messages may be lost but are never redelivered.
* At least once: messages are never lost but may be redelivered.
* Exactly once: this is what people actually want, each message is delivered once and only once.
It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.
Many systems claim to provide "exactly once" delivery semantics, but it is important to read the fine print, most of these claims are misleading (i.e. they don't translate to the case where consumers or producers can fail, cases where there are multiple consumer processes, or cases where data written to disk can be lost).
Kafka's semantics are straight-forward. When publishing a message we have a notion of the message being "committed" to the log. Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive". The definition of committed message and alive partition, as well as a description of which types of failures we attempt to handle, will be described in more detail in the next section. For now let's assume a perfect, lossless broker and try to understand the guarantees to the producer and consumer. If a producer attempts to publish a message and experiences a network error it cannot be sure if this error happened before or after the message was committed. This is similar to the semantics of inserting into a database table with an autogenerated key.
Prior to 0.11.0.0, if a producer failed to receive a response indicating that a message was committed, it had little choice but to resend the message. This provides at-least-once delivery semantics since the message may be written to the log again during resending if the original request had in fact succeeded. Since 0.11.0.0, the Kafka producer also supports an idempotent delivery option which guarantees that resending will not result in duplicate entries in the log. To achieve this, the broker assigns each producer an ID and deduplicates messages using a sequence number that is sent by the producer along with every message. Also beginning with 0.11.0.0, the producer supports the ability to send messages to multiple topic partitions using transaction-like semantics: i.e. either all messages are successfully written or none of them are. The main use case for this is exactly-once processing between Kafka topics (described below).
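A hedged sketch of the idempotent delivery option just described (broker address and topic are placeholders): with enable.idempotence set, the broker de-duplicates retried sends using the producer id and per-message sequence numbers.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("enable.idempotence", "true");             // implies acks=all and retries enabled
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Even if this send is retried internally, the broker will not write a duplicate.
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
        }
    }
}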
Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait on the message being committed this can take on the order of 10 ms. However the producer can also specify that it wants to perform the send completely asynchronously or that it wants to wait only until the leader (but not necessarily the followers) have the message.
Now let's describe the semantics from the point-of-view of the consumer. All replicas have the exact same log with the same offsets. The consumer controls its position in this log. If the consumer never crashed it could just store this position in memory, but if the consumer fails and we want this topic partition to be taken over by another process the new process will need to choose an appropriate position from which to start processing.
Let's say the consumer reads some messages -- it has several options for processing the messages and updating its position.
1. It can read the messages, then save its position in the log, and finally process the messages. In this case there is a possibility that the consumer process crashes after saving its position but before saving the output of its message processing. In this case the process that took over processing would start at the saved position even though a few messages prior to that position had not been processed. This corresponds to "at-most-once" semantics as in the case of a consumer failure messages may not be processed.
2. It can read the messages, process the messages, and finally save its position. In this case there is a possibility that the consumer process crashes after processing messages but before saving its position. In this case when the new process takes over the first few messages it receives will already have been processed. This corresponds to the "at-least-once" semantics in the case of consumer failure. In many cases messages have a primary key and so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself). The sketch after this list contrasts the two options.
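A hedged sketch contrasting the two options above (the process() helper and the consumer wiring are placeholders): committing the offset before processing gives at-most-once behaviour, processing first and committing afterwards gives at-least-once.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemanticsSketch {
    // Option 1: save position first, then process (at-most-once).
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        consumer.commitSync();                        // position saved before processing
        for (ConsumerRecord<String, String> r : records) process(r);
    }

    // Option 2: process first, then save position (at-least-once).
    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> r : records) process(r);
        consumer.commitSync();                        // position saved after processing
    }

    private static void process(ConsumerRecord<String, String> r) {
        // placeholder for application logic
    }
}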
So what about exactly once semantics (i.e. the thing you actually want)? When consuming from a Kafka topic and producing to another topic (as in a Kafka Streams application), we can leverage the new transactional producer capabilities in 0.11.0.0 that were mentioned above. The consumer's position is stored as a message in a topic, so we can write the offset to Kafka in the same transaction as the output topics receiving the processed data. If the transaction is aborted, the consumer's position will revert to its old value and the produced data on the output topics will not be visible to other consumers, depending on their "isolation level." In the default "read_uncommitted" isolation level, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed," the consumer will only return messages from transactions which were committed (and any messages which were not part of a transaction). A sketch of this read-process-write pattern follows.
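A hedged sketch of the read-process-write pattern just described (the transactional id, topics, group id, and the transformation are placeholders): the consumed offsets and the produced records are committed in a single transaction.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class TransactionalPipelineSketch {
    static void pipe(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        producer.initTransactions();                          // requires transactional.id to be set on the producer
        consumer.subscribe(Collections.singletonList("input-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> r : records) {
                producer.send(new ProducerRecord<>("output-topic", r.key(), r.value().toUpperCase()));
                offsets.put(new TopicPartition(r.topic(), r.partition()), new OffsetAndMetadata(r.offset() + 1));
            }
            producer.sendOffsetsToTransaction(offsets, "example-group"); // same group id the consumer uses
            producer.commitTransaction();                     // offsets and output become visible atomically
        }
    }
}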
When writing to an external system, the limitation is in the need to coordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage of the consumer position and the storage of the consumer's output. But this can be handled more simply and generally by letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, consider a Kafka Connect connector which populates data in HDFS along with the offsets of the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
So effectively Kafka supports exactly-once delivery in Kafka Streams, and the transactional producer/consumer can be used generally to provide exactly-once delivery when transferring and processing data between Kafka topics. Exactly-once delivery for other destination systems generally requires cooperation with such systems, but Kafka provides the offset which makes implementing this feasible (see also Kafka Connect). Otherwise, Kafka guarantees at-least-once delivery by default, and allows the user to implement at-most-once delivery by disabling retries on the producer and committing offsets in the consumer prior to processing a batch of messages.
4.7 Replication
Kafka replicates the log for each topic's partitions across a configurable number of servers (you can set this replication factor on a topic-by-topic basis). This allows automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures.
Other messaging systems provide some replication-related features, but, in our (totally biased) opinion, this appears to be a tacked-on thing, not heavily used, and with large downsides: slaves are inactive, throughput is heavily impacted, it requires fiddly manual configuration, etc. Kafka is meant to be used with replication by default; in fact we implement un-replicated topics as replicated topics where the replication factor is one.
The unit of replication is the topic partition. Under non-failure conditions, each partition in Kafka has a single leader and zero or more followers. The total number of replicas including the leader constitute the replication factor. All reads and writes go to the leader of the partition. Typically, there are many more partitions than brokers and the leaders are evenly distributed among brokers. The logs on the followers are identical to the leader's log; all have the same offsets and messages in the same order (though, of course, at any given time the leader may have a few as-yet unreplicated messages at the end of its log).
Followers consume messages from the leader just as a normal Kafka consumer would and apply them to their own log. Having the followers pull from the leader has the nice property of allowing the follower to naturally batch together log entries they are applying to their log.
As with most distributed systems, automatically handling failures requires having a precise definition of what it means for a node to be "alive". For Kafka node liveness has two conditions:
1. A node must be able to maintain its session with ZooKeeper (via ZooKeeper's heartbeat mechanism)
2. If it is a slave it must replicate the writes happening on the leader and not fall "too far" behind
We refer to nodes satisfying these two conditions as being "in sync" to avoid the vagueness of "alive" or "failed". The leader keeps track of the set of "in sync" nodes. If a follower dies, gets stuck, or falls behind, the leader will remove it from the list of in sync replicas. The determination of stuck and lagging replicas is controlled by the replica.lag.time.max.ms configuration.
In distributed systems terminology we only attempt to handle a "fail/recover" model of failures where nodes suddenly cease working and then later recover (perhaps without knowing that they have died). Kafka does not handle so-called "Byzantine" failures in which nodes produce arbitrary or malicious responses (perhaps due to bugs or foul play).
We can now more precisely define that a message is considered committed when all in sync replicas for that partition have applied it to their log. Only committed messages are ever given out to the consumer. This means that the consumer need not worry about potentially seeing a message that could be lost if the leader fails. Producers, on the other hand, have the option of either waiting for the message to be committed or not, depending on their preference for tradeoff between latency and durability. This preference is controlled by the acks setting that the producer uses. Note that topics have a setting for the "minimum number" of in-sync replicas that is checked when the producer requests acknowledgment that a message has been written to the full set of in-sync replicas. If a less stringent acknowledgement is requested by the producer, then the message can be committed, and consumed, even if the number of in-sync replicas is lower than the minimum (e.g. it can be as low as just the leader).
The guarantee that Kafka offers is that a committed message will not be lost, as long as there is at least one in sync replica alive, at all times.
Kafka will remain available in the presence of node failures after a short fail-over period, but may not remain available in the presence of network partitions.
At its heart a Kafka partition is a replicated log. The replicated log is one of the most basic primitives in distributed data systems, and there are many approaches for implementing one. A replicated log can be used by other systems as a primitive for implementing other distributed systems in the state-machine style.
A replicated log models the process of coming into consensus on the order of a series of values (generally numbering the log entries 0, 1, 2, ...). There are many ways to implement this, but the simplest and fastest is with a leader who chooses the ordering of values provided to it. As long as the leader remains alive, all followers need to only copy the values and ordering the leader chooses.
Of course if leaders didn't fail we wouldn't need followers! When the leader does die we need to choose a new leader from among the followers. But followers themselves may fall behind or crash so we must ensure we choose an up-to-date follower. The fundamental guarantee a log replication algorithm must provide is that if we tell the client a message is committed, and the leader fails, the new leader we elect must also have that message. This yields a tradeoff: if the leader waits for more followers to acknowledge a message before declaring it committed then there will be more potentially electable leaders.
If you choose the number of acknowledgements required and the number of logs that must be compared to elect a leader such that there is guaranteed to be an overlap, then this is called a Quorum.
A common approach to this tradeoff is to use a majority vote for both the commit decision and the leader election. This is not what Kafka does, but let's explore it anyway to understand the tradeoffs. Let's say we have 2f+1 replicas. If f+1 replicas must receive a message prior to a commit being declared by the leader, and if we elect a new leader by electing the follower with the most complete log from at least f+1 replicas, then, with no more than f failures, the leader is guaranteed to have all committed messages. This is because among any f+1 replicas, there must be at least one replica that contains all committed messages. That replica's log will be the most complete and therefore will be selected as the new leader. There are many remaining details that each algorithm must handle (such as precisely defining what makes a log more complete, ensuring log consistency during leader failure or changing the set of servers in the replica set) but we will ignore these for now.
This majority vote approach has a very nice property: the latency is dependent on only the fastest servers. That is, if the replication factor is three, the latency is determined by the faster slave not the slower one.
There are a rich variety of algorithms in this family including ZooKeeper's Zab, Raft, and Viewstamped Replication. The most similar academic publication we are aware of to Kafka's actual implementation is PacificA from Microsoft.
The downside of majority vote is that it doesn't take many failures to leave you with no electable leaders. To tolerate one failure requires three copies of the data, and to tolerate two failures requires five copies of the data. In our experience having only enough redundancy to tolerate a single failure is not enough for a practical system, but doing every write five times, with 5x the disk space requirements and 1/5th the throughput, is not very practical for large volume data problems. This is likely why quorum algorithms more commonly appear for shared cluster configuration such as ZooKeeper but are less common for primary data storage. For example in HDFS the namenode's high-availability feature is built on a majority-vote-based journal, but this more expensive approach is not used for the data itself.
Kafka takes a slightly different approach to choosing its quorum set. Instead of majority vote, Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught-up to the leader. Only members of this set are eligible for election as leader. A write to a Kafka partition is not considered committed until all in-sync replicas have received the write. This ISR set is persisted to ZooKeeper whenever it changes. Because of this, any replica in the ISR is eligible to be elected leader. This is an important factor for Kafka's usage model where there are many partitions and ensuring leadership balance is important. With this ISR model and f+1 replicas, a Kafka topic can tolerate f failures without losing committed messages.
For most use cases we hope to handle, we think this tradeoff is a reasonable one. In practice, to tolerate f failures, both the majority vote and the ISR approach will wait for the same number of replicas to acknowledge before committing a message (e.g. to survive one failure a majority quorum needs three replicas and one acknowledgement and the ISR approach requires two replicas and one acknowledgement). The ability to commit without the slowest servers is an advantage of the majority vote approach. However, we think it is ameliorated by allowing the client to choose whether they block on the message commit or not, and the additional throughput and disk space due to the lower required replication factor is worth it.
Another important design distinction is that Kafka does not require that crashed nodes recover with all their data intact. It is not uncommon for replication algorithms in this space to depend on the existence of "stable storage" that cannot be lost in any failure-recovery scenario without potential consistency violations. There are two primary problems with this assumption. First, disk errors are the most common problem we observe in real operation of persistent data systems and they often do not leave data intact. Secondly, even if this were not a problem, we do not want to require the use of fsync on every write for our consistency guarantees as this can reduce performance by two to three orders of magnitude. Our protocol for allowing a replica to rejoin the ISR ensures that before rejoining, it must fully re-sync again even if it lost unflushed data in its crash.
Note that Kafka's guarantee with respect to data loss is predicated on at least one replica remaining in sync. If all the nodes replicating a partition die, this guarantee no longer holds.
However a practical system needs to do something reasonable when all the replicas die. If you are unlucky enough to have this occur, it is important to consider what will happen. There are two behaviors that could be implemented:
1. Wait for a replica in the ISR to come back to life and choose this replica as the leader (hopefully it still has all its data).
2. Choose the first replica (not necessarily in the ISR) that comes back to life as the leader.
This is a simple tradeoff between availability and consistency. If we wait for replicas in the ISR, then we will remain unavailable as long as those replicas are down. If such replicas were destroyed or their data was lost, then we are permanently down. If, on the other hand, a non-in-sync replica comes back to life and we allow it to become leader, then its log becomes the source of truth even though it is not guaranteed to have every committed message. Prior to version 0.11.0.0 Kafka chose the second strategy by default, favoring a potentially inconsistent replica when all replicas in the ISR were dead; since 0.11.0.0 this "unclean" leader election is disabled by default, and it can be re-enabled with the configuration property unclean.leader.election.enable to support use cases where uptime is preferable to consistency.
This dilemma is not specific to Kafka. It exists in any quorum-based scheme. For example in a majority voting scheme, if a majority of servers suffer a permanent failure, then you must either choose to lose 100% of your data or violate consistency by taking what remains on an existing server as your new source of truth.
When writing to Kafka, producers can choose whether they wait for the message to be acknowledged by 0, 1 or all (-1) replicas. Note that "acknowledgement by all replicas" does not guarantee that the full set of assigned replicas have received the message. By default, with acks=all, acknowledgement happens as soon as all the current in-sync replicas have received the message. For example, if a topic is configured with only two replicas and one fails (i.e., only one in sync replica remains), then writes that specify acks=all will succeed. However, these writes could be lost if the remaining replica also fails. Although this ensures maximum availability of the partition, this behavior may be undesirable to some users who prefer durability over availability. Therefore, we provide two topic-level configurations that can be used to prefer message durability over availability:
1. Disable unclean leader election - if all replicas become unavailable, then the partition will remain unavailable until the most recent leader becomes available again. This effectively prefers unavailability over the risk of message loss. See the previous section on Unclean Leader Election for clarification.
2. Specify a minimum ISR size - the partition will only accept writes if the size of the ISR is above a certain minimum, in order to prevent the loss of messages that were written to just a single replica, which subsequently becomes unavailable. This setting only takes effect if the producer uses acks=all and guarantees that the message will be acknowledged by at least this many in-sync replicas. This setting offers a trade-off between consistency and availability. A higher setting for minimum ISR size guarantees better consistency since the message is guaranteed to be written to more replicas, which reduces the probability that it will be lost. However, it reduces availability since the partition will be unavailable for writes if the number of in-sync replicas drops below the minimum threshold. A configuration sketch illustrating both settings follows this list.
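A minimal sketch of how these preferences might be expressed, assuming a hypothetical topic name and ZooKeeper address (values are illustrative; producers writing to the topic would also need acks=all as described above):
> bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name my-durable-topic --add-config unclean.leader.election.enable=false,min.insync.replicas=2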
Replica Management
The above discussion on replicated logs really covers only a single log, i.e. one topic partition. However a Kafka cluster will manage hundreds or thousands of these partitions. We attempt to balance partitions within a cluster in a round-robin fashion to avoid clustering all partitions for high-volume topics on a small number of nodes. Likewise we try to balance leadership so that each node is the leader for a proportional share of its partitions.
It is also important to optimize the leadership election process as that is the critical window of unavailability. A naive implementation of leader election would end up running an election per partition for all partitions a node hosted when that node failed. Instead, we elect one of the brokers as the "controller". This controller detects failures at the broker level and is responsible for changing the leader of all affected partitions in a failed broker. The result is that we are able to batch together many of the required leadership change notifications which makes the election process far cheaper and faster for a large number of partitions. If the controller fails, one of the surviving brokers will become the new controller.
4.8 Log Compaction
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failure, or reloading caches after application restarts during operational maintenance. Let's dive into these use cases in more detail and then describe how compaction works.
So far we have described only the simpler approach to data retention where old log data is discarded after a fixed period of time or when the log reaches some predetermined size. This works well for temporal event data such as logging where each record stands alone. However an important class of data streams are the log of changes to keyed, mutable data (for example, the changes to a database table).
Let's discuss a concrete example of such a stream. Say we have a topic containing user email addresses; every time a user updates their email address we send a message to this topic using their user id as the primary key. Now say we send the following messages over some time period for a user with id 123, each message corresponding to a change in email address (messages for other ids are omitted):
http://kafka.apache.org/documentation.html#introduction 65/130
11/8/2017 Apache Kafka
123 => [email protected]
        .
        .
        .
123 => [email protected]
        .
        .
        .
123 => [email protected]
Log compaction gives us a more granular retention mechanism so that we are guaranteed to retain at least the last update for each primary key (e.g. [email protected]). By doing this we guarantee that the log contains a full snapshot of the final value for every key, not just keys that changed recently. This means downstream consumers can restore their own state off this topic without us having to retain a complete log of all changes.
Let's start by looking at a few use cases where this is useful, then we'll see how it can be used.
1. Database change subscription. It is often necessary to have a data set in multiple data systems, and often one of these systems is a database of some kind (either a RDBMS or perhaps a new-fangled key-value store). For example you might have a database, a cache, a search cluster, and a Hadoop cluster. Each change to the database will need to be reflected in the cache, the search cluster, and eventually in Hadoop. In the case that one is only handling the real-time updates you only need recent log. But if you want to be able to reload the cache or restore a failed search node you may need a complete data set.
2. Event sourcing. This is a style of application design which co-locates query processing with application design and uses a log of changes as the primary store for the application.
3. Journaling for high-availability. A process that does local computation can be made fault-tolerant by logging out changes that it makes to its local state so another process can reload these changes and carry on if it should fail. A concrete example of this is handling counts, aggregations, and other "group by"-like processing in a stream query system. Samza, a real-time stream-processing framework, uses this feature for exactly this purpose.
In each of these cases one needs primarily to handle the real-time feed of changes, but occasionally, when a machine crashes or data needs to be re-loaded or re-processed, one needs to do a full load. Log compaction allows feeding both of these use cases off the same backing topic. This style of usage of a log is described in more detail in this blog post.
The general idea is quite simple. If we had infinite log retention, and we logged each change in the above cases, then we would have captured the state of the system at each time from when it first began. Using this complete log, we could restore to any point in time by replaying the first N records in the log. This hypothetical complete log is not very practical for systems that update a single record many times as the log will grow without bound even for a stable dataset. The simple log retention mechanism which throws away old updates will bound space but the log is no longer a way to restore the current state: now restoring from the beginning of the log no longer recreates the current state as old updates may not be captured at all.
Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.
This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or time and other topics where retention is enforced by compaction.
This functionality is inspired by one of LinkedIn's oldest and most successful pieces of infrastructure: a database changelog caching service called Databus. Unlike most log-structured storage systems Kafka is built for subscription and organizes data for fast linear reads and writes. Unlike Databus, Kafka acts as a source-of-truth store so it is useful even in situations where the upstream data source would not otherwise be replayable.
Here is a high-level picture that shows the logical structure of a Kafka log with the offset for each message.
http://kafka.apache.org/documentation.html#introduction 66/130
11/8/2017 Apache Kafka
The head of the log is identical to a traditional Kafka log. It has dense, sequential offsets and retains all messages. Log compaction adds an option for handling the tail of the log. The picture above shows a log with a compacted tail. Note that the messages in the tail of the log retain the original offset assigned when they were first written; that never changes. Note also that all offsets remain valid positions in the log, even if the message with that offset has been compacted away; in this case this position is indistinguishable from the next highest offset that does appear in the log. For example, in the picture above the offsets 36, 37, and 38 are all equivalent positions and a read beginning at any of these offsets would return a message set beginning with 38.
Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. This delete marker will cause any prior message with that key to be removed (as would any new message with that key), but delete markers are special in that they will themselves be cleaned out of the log after a period of time to free up space. The point in time at which deletes are no longer retained is marked as the "delete retention point" in the above diagram.
The compaction is done in the background by periodically recopying log segments. Cleaning does not block reads and can be throttled to use no more than a configurable amount of I/O throughput to avoid impacting producers and consumers. The actual process of compacting a log segment recopies the segment while retaining only the latest record for each key.
Log compaction guarantees the following:
1. Any consumer that stays caught-up to within the head of the log will see every message that is written; these messages will have sequential offsets. The topic's min.compaction.lag.ms can be used to guarantee the minimum length of time that must pass after a message is written before it could be compacted. I.e. it provides a lower bound on how long each message will remain in the (uncompacted) head of the log.
2. Ordering of messages is always maintained. Compaction will never re-order messages, just remove some.
3. The offset for a message never changes. It is the permanent identifier for a position in the log.
4. Any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. Additionally, all delete markers for deleted records will be seen, provided the consumer reaches the head of the log in a time period less than the topic's delete.retention.ms setting (the default is 24 hours). In other words: since the removal of delete markers happens concurrently with reads, it is possible for a consumer to miss delete markers if it lags by more than delete.retention.ms.
http://kafka.apache.org/documentation.html#introduction 67/130
11/8/2017 Apache Kafka
Log Compaction Details
Log compaction is handled by the log cleaner, a pool of background threads that recopy log segment files, removing records whose key appears in the head of the log. Each compactor thread works as follows:
1. It chooses the log that has the highest ratio of log head to log tail
2. It creates a succinct summary of the last offset for each key in the head of the log
3. It recopies the log from beginning to end removing keys which have a later occurrence in the log. New, clean segments are swapped into the log immediately so the additional disk space required is just one additional log segment (not a full copy of the log).
4. The summary of the log head is essentially just a space-compact hash table. It uses exactly 24 bytes per entry. As a result with 8GB of cleaner buffer one cleaner iteration can clean around 366GB of log head (assuming 1k messages).
Configuring The Log Cleaner
The log cleaner is enabled by default. This will start the pool of cleaner threads. To enable log cleaning on a particular topic you can add the log-specific property
log.cleanup.policy=compact
This can be done either at topic creation time or using the alter topic command.
The log cleaner can be configured to retain a minimum amount of the uncompacted "head" of the log. This is enabled by setting the compaction time lag.
log.cleaner.min.compaction.lag.ms
This can be used to prevent messages newer than a minimum message age from being subject to compaction. If not set, all log segments are eligible for compaction except for the last segment, i.e. the one currently being written to. The active segment will not be compacted even if all of its messages are older than the minimum compaction time lag.
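As an illustrative sketch (the topic name, ZooKeeper address and lag value are hypothetical), compaction can be enabled per topic either at creation time or afterwards; note that the topic-level property names are cleanup.policy and min.compaction.lag.ms, while the broker-level defaults carry the log. prefix shown above:
> bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic user-emails --partitions 1 --replication-factor 3 --config cleanup.policy=compact --config min.compaction.lag.ms=3600000
> bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name user-emails --add-config cleanup.policy=compact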
4.9 Quotas
Kafka cluster has the ability to enforce quotas on requests to control the broker resources used by clients. Two types of client quotas can be enforced by Kafka brokers for each group of clients sharing a quota:
1. Network bandwidth quotas define byte-rate thresholds (since 0.9)
2. Request rate quotas define CPU utilization thresholds as a percentage of network and I/O threads (since 0.11)
It is possible for producers and consumers to produce/consume very high volumes of data or generate requests at a very high rate and thus monopolize broker resources, cause network saturation and generally DOS other clients and the brokers themselves. Having quotas protects against these issues and is all the more important in large multi-tenant clusters where a small set of badly behaved clients can degrade user experience for the well behaved ones. In fact, when running Kafka as a service this even makes it possible to enforce API limits according to an agreed upon contract.
Client groups
The identity of Kafka clients is the user principal which represents an authenticated user in a secure cluster. In a cluster that supports unauthenticated clients, user principal is a grouping of unauthenticated users chosen by the broker using a configurable PrincipalBuilder. Client-id is a logical grouping of clients with a meaningful name chosen by the client application. The tuple (user, client-id) defines a secure logical group of clients that share both user principal and client-id.
Quotas can be applied to (user, client-id), user or client-id groups. For a given connection, the most specific quota matching the connection is applied. All connections of a quota group share the quota configured for the group. For example, if (user="test-user", client-id="test-client") has a produce quota of 10MB/sec, this is shared across all producer instances of user "test-user" with the client-id "test-client".
http://kafka.apache.org/documentation.html#introduction 68/130
11/8/2017 Apache Kafka
Quota Configuration
Quota configuration may be defined for (user, client-id), user and client-id groups. It is possible to override the default quota at any of the quota levels that needs a higher (or even lower) quota. The mechanism is similar to the per-topic log config overrides. User and (user, client-id) quota overrides are written to ZooKeeper under /config/users and client-id quota overrides are written under /config/clients. These overrides are read by all brokers and are effective immediately. This lets us change quotas without having to do a rolling restart of the entire cluster. See here for details. Default quotas for each group may also be updated dynamically using the same mechanism.
The order of precedence of quota configuration is:
1. /config/users/<user>/clients/<client-id>
2. /config/users/<user>/clients/<default>
3. /config/users/<user>
4. /config/users/<default>/clients/<client-id>
5. /config/users/<default>/clients/<default>
6. /config/users/<default>
7. /config/clients/<client-id>
8. /config/clients/<default>
Broker properties (quota.producer.default, quota.consumer.default) can also be used to set defaults of network bandwidth quotas for client-id groups. These properties are being deprecated and will be removed in a later release. Default quotas for client-id can be set in ZooKeeper similar to the other quota overrides and defaults.
Network Bandwidth Quotas
Network bandwidth quotas are defined as the byte rate threshold for each group of clients sharing a quota. By default, each unique client group receives a fixed quota in bytes/sec as configured by the cluster. This quota is defined on a per-broker basis. Each group of clients can publish/fetch a maximum of X bytes/sec per broker before clients are throttled.
Request Rate Quotas
Request rate quotas are defined as the percentage of time a client can utilize on request handler I/O threads and network threads of each broker within a quota window. A quota of n% represents n% of one thread, so the quota is out of a total capacity of ((num.io.threads + num.network.threads) * 100)%. Each group of clients may use a total percentage of up to n% across all I/O and network threads in a quota window before being throttled. Since the number of threads allocated for I/O and network threads are typically based on the number of cores available on the broker host, request rate quotas represent the total percentage of CPU that may be used by each group of clients sharing the quota.
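As a minimal sketch (the user name, ZooKeeper address and percentage are hypothetical), a request rate quota can be assigned with the kafka-configs.sh tool:
> bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'request_percentage=200' --entity-type users --entity-name user1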
Enforcement
By default, each unique client group receives a fixed quota as configured by the cluster. This quota is defined on a per-broker basis. Each client can utilize this quota per broker before it gets throttled. We decided that defining these quotas per broker is much better than having a fixed cluster-wide bandwidth per client because that would require a mechanism to share client quota usage among all the brokers. This can be harder to get right than the quota implementation itself!
How does a broker react when it detects a quota violation? In our solution, the broker does not return an error; rather it attempts to slow down a client exceeding its quota. It computes the amount of delay needed to bring a guilty client under its quota and delays the response for that time. This approach keeps the quota violation transparent to clients (outside of client-side metrics). This also keeps them from having to implement any special backoff and retry behavior which can get tricky. In fact, bad client behavior (retry without backoff) can exacerbate the very problem quotas are trying to solve.
Byte-rate and thread utilization are measured over multiple small windows (e.g. 30 windows of 1 second each) in order to detect and correct quota violations quickly. Typically, having large measurement windows (for e.g. 10 windows of 30 seconds each) leads to large bursts of traffic followed by long delays which is not great in terms of user experience.
5. IMPLEMENTATION
http://kafka.apache.org/documentation.html#introduction 69/130
11/8/2017 Apache Kafka
5.1 Network Layer
The network layer is a fairly straight-forward NIO server, and will not be described in great detail. The sendfile implementation is done by giving the MessageSet interface a writeTo method. This allows the file-backed message set to use the more efficient transferTo implementation instead of an in-process buffered write. The threading model is a single acceptor thread and N processor threads which handle a fixed number of connections each. This design has been pretty thoroughly tested elsewhere and found to be simple to implement and fast. The protocol is kept quite simple to allow for future implementation of clients in other languages.
5.2 Messages
Messages consist of a variable-length header, a variable-length opaque key byte array and a variable-length opaque value byte array. The format of the header is described in the following section. Leaving the key and value opaque is the right decision: there is a great deal of progress being made on serialization libraries right now, and any particular choice is unlikely to be right for all uses. Needless to say a particular application using Kafka would likely mandate a particular serialization type as part of its usage. The RecordBatch interface is simply an iterator over messages with specialized methods for bulk reading and writing to an NIO Channel.
5.3 Message Format
Messages (aka Records) are always written in batches. The technical term for a batch of messages is a record batch, and a record batch contains one or more records. In the degenerate case, we could have a record batch containing a single record. Record batches and records have their own headers. The format of each is described below for Kafka version 0.11.0 and later (message format version v2, or magic=2). Click here for details about message formats 0 and 1.
5.3.1 Record Batch
The following is the on-disk format of a RecordBatch.
baseOffset: int64
batchLength: int32
partitionLeaderEpoch: int32
magic: int8 (current magic value is 2)
crc: int32
attributes: int16
    bit 0~2:
        0: no compression
        1: gzip
        2: snappy
        3: lz4
    bit 3: timestampType
    bit 4: isTransactional (0 means not transactional)
    bit 5: isControlBatch (0 means not a control batch)
    bit 6~15: unused
lastOffsetDelta: int32
firstTimestamp: int64
maxTimestamp: int64
producerId: int64
producerEpoch: int16
baseSequence: int32
records: [Record]
Note that when compression is enabled, the compressed record data is serialized directly following the count of the number of records.
The CRC covers the data from the attributes to the end of the batch (i.e. all the bytes that follow the CRC). It is located after the magic byte, which means that clients must parse the magic byte before deciding how to interpret the bytes between the batch length and the magic byte. The partition leader epoch field is not included in the CRC computation to avoid the need to recompute the CRC when this field is assigned for every batch that is received by the broker. The CRC-32C (Castagnoli) polynomial is used for the computation.
On compaction: unlike the older message formats, magic v2 and above preserves the first and last offset/sequence numbers from the original batch when the log is cleaned. This is required in order to be able to restore the producer's state when the log is reloaded. If we did not retain the last sequence number, for example, then after a partition leader failure, the producer might see an OutOfSequence error. The base sequence number must be preserved for duplicate checking (the broker checks incoming Produce requests for duplicates by verifying that the first and last sequence numbers of the incoming batch match the last from that producer). As a result, it is possible to have empty batches in the log when all the records in the batch are cleaned but the batch is still retained in order to preserve a producer's last sequence number. One oddity here is that the baseTimestamp field is not preserved during compaction, so it will change if the first record in the batch is compacted away.
A control batch contains a single record called the control record. Control records should not be passed on to applications. Instead, they are used by consumers to filter out aborted transactional messages.
The schema for the value of a control record is dependent on the type. The value is opaque to clients.
5.3.2 Record
Record level headers were introduced in Kafka 0.11.0. The on-disk format of a record with Headers is delineated below.
length: varint
attributes: int8
    bit 0~7: unused
timestampDelta: varint
offsetDelta: varint
keyLength: varint
key: byte[]
valueLen: varint
value: byte[]
Headers => [Header]
headerKeyLength: varint
headerKey: String
headerValueLength: varint
Value: byte[]
We use the same varint encoding as Protobuf. More information on the latter can be found here. The count of headers in a record is also encoded as a varint.
5.4 Log
A log for a topic named "my_topic" with two partitions consists of two directories (namely my_topic_0 and my_topic_1) populated with data files containing the messages for that topic. The format of the log files is a sequence of "log entries"; each log entry is a 4 byte integer N storing the message length which is followed by the N message bytes. Each message is uniquely identified by a 64-bit integer offset giving the byte position of the start of this message in the stream of all messages ever sent to that topic on that partition. The on-disk format of each message is given below. Each log file is named with the offset of the first message it contains. So the first file created will be 00000000000.kafka, and each additional file will have an integer name roughly S bytes from the previous file where S is the max log file size given in the configuration.
The exact binary format for records is versioned and maintained as a standard interface so record batches can be transferred between producer, broker, and client without recopying or conversion when desirable. The previous section included details about the on-disk format of records.
The use of the message offset as the message id is unusual. Our original idea was to use a GUID generated by the producer, and maintain a mapping from GUID to offset on each broker. But since a consumer must maintain an ID for each server, the global uniqueness of the GUID provides no value. Furthermore, the complexity of maintaining the mapping from a random id to an offset requires a heavy weight index structure which must be synchronized with disk, essentially requiring a full persistent random-access data structure. Thus to simplify the lookup structure we decided to use a simple per-partition atomic counter which could be coupled with the partition id and node id to uniquely identify a message; this makes the lookup structure simpler, though multiple seeks per consumer request are still likely. However once we settled on a counter, the jump to directly using the offset seemed natural: both, after all, are monotonically increasing integers unique to a partition. Since the offset is hidden from the consumer API this decision is ultimately an implementation detail and we went with the more efficient approach.
http://kafka.apache.org/documentation.html#introduction 71/130
11/8/2017 Apache Kafka
Writes
The log allows serial appends which always go to the last file. This file is rolled over to a fresh file when it reaches a configurable size (say 1GB). The log takes two configuration parameters: M, which gives the number of messages to write before forcing the OS to flush the file to disk, and S, which gives a number of seconds after which a flush is forced. This gives a durability guarantee of losing at most M messages or S seconds of data in the event of a system crash.
Reads
Reads are done by giving the 64-bit logical offset of a message and an S-byte max chunk size. This will return an iterator over the messages contained in the S-byte buffer. S is intended to be larger than any single message, but in the event of an abnormally large message, the read can be retried multiple times, each time doubling the buffer size, until the message is read successfully. A maximum message and buffer size can be specified to make the server reject messages larger than some size, and to give a bound to the client on the maximum it needs to ever read to get a complete message. It is likely that the read buffer ends with a partial message; this is easily detected by the size delimiting.
The actual process of reading from an offset requires first locating the log segment file in which the data is stored, calculating the file offset from the global offset value, and then reading from that file offset. The search is done as a simple binary search variation against an in-memory range maintained for each file.
The log provides the capability of getting the most recently written message to allow clients to start subscribing as of "right now". This is also useful in the case the consumer fails to consume its data within its SLA-specified number of days. In this case when the client attempts to consume a non-existent offset it is given an OutOfRangeException and can either reset itself or fail as appropriate to the use case.
Deletes
Data is deleted one log segment at a time. The log manager allows pluggable delete policies to choose which files are eligible for deletion. The current policy deletes any log with a modification time of more than N days ago, though a policy which retained the last N GB could also be useful. To avoid locking reads while still allowing deletes that modify the segment list we use a copy-on-write style segment list implementation that provides consistent views to allow a binary search to proceed on an immutable static snapshot view of the log segments while deletes are progressing.
Guarantees
The log provides a configuration parameter M which controls the maximum number of messages that are written before forcing a flush to disk. On startup a log recovery process is run that iterates over all messages in the newest log segment and verifies that each message entry is valid. A message entry is valid if the sum of its size and offset are less than the length of the file AND the CRC32 of the message payload matches the CRC stored with the message. In the event corruption is detected the log is truncated to the last valid offset.
Note that two kinds of corruption must be handled: truncation in which an unwritten block is lost due to a crash, and corruption in which a nonsense block is ADDED to the file. The reason for this is that in general the OS makes no guarantee of the write order between the file inode and the actual block data so in addition to losing written data the file can gain nonsense data if the inode is updated with a new size but a crash occurs before the block containing that data is written. The CRC detects this corner case, and prevents it from corrupting the log (though the unwritten messages are, of course, lost).
5.5 Distribution
Consumer Offset Tracking
The high-level consumer tracks the maximum offset it has consumed in each partition and periodically commits its offset vector so that it can resume from those offsets in the event of a restart. Kafka provides the option to store all the offsets for a given consumer group in a designated broker (for that group) called the offset manager. i.e., any consumer instance in that consumer group should send its offset commits and fetches to that offset manager (broker). The high-level consumer handles this automatically. If you use the simple consumer you will need to manage offsets manually. This is currently unsupported in the Java simple consumer which can only commit or fetch offsets in ZooKeeper. If you use the Scala simple consumer you can discover the offset manager and explicitly commit or fetch offsets to the offset manager. A consumer can look up its offset manager by issuing a GroupCoordinatorRequest to any Kafka broker and reading the GroupCoordinatorResponse which will contain the offset manager. The consumer can then proceed to commit or fetch offsets from the offset manager broker. In case the offset manager moves, the consumer will need to rediscover the offset manager. If you wish to manage your offsets manually, you can take a look at these code samples that explain how to issue OffsetCommitRequest and OffsetFetchRequest.
When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. The offset manager sends a successful offset commit response to the consumer only after all the replicas of the offsets topic receive the offsets. In case the offsets fail to replicate within a configurable timeout, the offset commit will fail and the consumer may retry the commit after backing off. (This is done automatically by the high-level consumer.) The brokers periodically compact the offsets topic since it only needs to maintain the most recent offset commit per partition. The offset manager also caches the offsets in an in-memory table in order to serve offset fetches quickly.
When the offset manager receives an offset fetch request, it simply returns the last committed offset vector from the offsets cache. In case the offset manager was just started or if it just became the offset manager for a new set of consumer groups (by becoming a leader for a partition of the offsets topic), it may need to load the offsets topic partition into the cache. In this case, the offset fetch will fail with an OffsetsLoadInProgress exception and the consumer may retry the OffsetFetchRequest after backing off. (This is done automatically by the high-level consumer.)
Kafka consumers in earlier releases store their offsets by default in ZooKeeper. It is possible to migrate these consumers to commit offsets into Kafka by following these steps:
http://kafka.apache.org/documentation.html#introduction 73/130
11/8/2017 Apache Kafka
2. Do a rolling bounce of your consumers and then verify that your consumers are healthy.
3. Set dual.commit.enabled=false in your consumer con g.
4. Do a rolling bounce of your consumers and then verify that your consumers are healthy.
A roll-back (i.e., migrating from Kafka back to ZooKeeper) can also be performed using the above steps if you set offsets.storage=zookeeper.
ZooKeeper Directories
The following gives the ZooKeeper structures and algorithms used for co-ordination between consumers and brokers.
Notation
When an element in a path is denoted [xyz], that means that the value of xyz is not fixed and there is in fact a ZooKeeper znode for each possible value of xyz. For example /topics/[topic] would be a directory named /topics containing a sub-directory for each topic name. Numerical ranges are also given such as [0...5] to indicate the subdirectories 0, 1, 2, 3, 4, 5. An arrow -> is used to indicate the contents of a znode. For example /hello -> world would indicate a znode /hello containing the value "world".
Broker Node Registry
This is a list of all present broker nodes, each of which provides a unique logical broker id which identifies it to consumers (which must be given as part of its configuration). On startup, a broker node registers itself by creating a znode with the logical broker id under /brokers/ids. The purpose of the logical broker id is to allow a broker to be moved to a different physical machine without affecting consumers. An attempt to register a broker id that is already in use (say because two servers are configured with the same broker id) results in an error.
Since the broker registers itself in ZooKeeper using ephemeral znodes, this registration is dynamic and will disappear if the broker is shut down or dies (thus notifying consumers it is no longer available).
Broker Topic Registry
Each broker registers itself under the topics it maintains and stores the number of partitions for that topic.
Consumers and Consumer Groups
Consumers of topics also register themselves in ZooKeeper, in order to coordinate with each other and balance the consumption of data. Consumers can also store their offsets in ZooKeeper by setting offsets.storage=zookeeper. However, this offset storage mechanism will be deprecated in a future release. Therefore, it is recommended to migrate offsets storage to Kafka.
Multiple consumers can form a group and jointly consume a single topic. Each consumer in the same group is given a shared group_id. For example if one consumer is your foobar process, which is run across three machines, then you might assign this group of consumers the id "foobar". This group id is provided in the configuration of the consumer, and is your way to tell the consumer which group it belongs to. The consumers in a group divide up the partitions as fairly as possible; each partition is consumed by exactly one consumer in a consumer group.
Consumer Id Registry
In addition to the group_id which is shared by all consumers in a group, each consumer is given a transient, unique consumer_id (of the form hostname:uuid) for identification purposes. Consumer ids are registered in the following directory.
Each of the consumers in the group registers under its group and creates a znode with its consumer_id. The value of the znode contains a map of <topic, #streams>. This id is simply used to identify each of the consumers which is currently active within a group. This is an ephemeral node so it will disappear if the consumer process dies.
http://kafka.apache.org/documentation.html#introduction 74/130
11/8/2017 Apache Kafka
Consumer Offsets
Consumers track the maximum offset they have consumed in each partition. This value is stored in a ZooKeeper directory if offsets.storage=zookeeper.
Partition Owner Registry
Each broker partition is consumed by a single consumer within a given consumer group. The consumer must establish its ownership of a given partition before any consumption can begin. To establish its ownership, a consumer writes its own id in an ephemeral node under the particular broker partition it is claiming.
Cluster Id
The cluster id is a unique and immutable identifier assigned to a Kafka cluster. The cluster id can have a maximum of 22 characters and the allowed characters are defined by the regular expression [a-zA-Z0-9_\-]+, which corresponds to the characters used by the URL-safe Base64 variant with no padding. Conceptually, it is auto-generated when a cluster is started for the first time.
Implementation-wise, it is generated when a broker with version 0.10.1 or later is successfully started for the first time. The broker tries to get the cluster id from the /cluster/id znode during startup. If the znode does not exist, the broker generates a new cluster id and creates the znode with this cluster id.
Broker node registration
The broker nodes are basically independent, so they only publish information about what they have. When a broker joins, it registers itself under the broker node registry directory and writes information about its host name and port. The broker also registers the list of existing topics and their logical partitions in the broker topic registry. New topics are registered dynamically when they are created on the broker.
Consumer rebalancing algorithm
The consumer rebalancing algorithm allows all the consumers in a group to come into consensus on which consumer is consuming which partitions. Consumer rebalancing is triggered on each addition or removal of both broker nodes and other consumers within the same group. For a given topic and a given consumer group, broker partitions are divided evenly among consumers within the group. A partition is always consumed by a single consumer. This design simplifies the implementation. Had we allowed a partition to be concurrently consumed by multiple consumers, there would be contention on the partition and some kind of locking would be required. If there are more consumers than partitions, some consumers won't get any data at all. During rebalancing, we try to assign partitions to consumers in such a way that reduces the number of broker nodes each consumer has to connect to.
http://kafka.apache.org/documentation.html#introduction 75/130
11/8/2017 Apache Kafka
When rebalancing is triggered at one consumer, rebalancing should be triggered in other consumers within the same group at about the same time.
6. OPERATIONS
Here is some information on actually running Kafka as a production system based on usage and experience at LinkedIn. Please send us any additional tips you know of.
6.1 Basic Kafka Operations
This section will review the most common operations you will perform on your Kafka cluster. All of the tools reviewed in this section are available under the bin/ directory of the Kafka distribution and each tool will print details on all possible commandline options if it is run with no arguments.
Adding and removing topics
You have the option of either adding topics manually or having them be created automatically when data is first published to a non-existent topic. If topics are auto-created then you may want to tune the default topic configurations used for auto-created topics.
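Topics are added and modified using the topic tool. As a sketch (the topic name, partition and replication counts, ZooKeeper address and the x=y config override are placeholders):
> bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic my_topic_name --partitions 20 --replication-factor 3 --config x=y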
The replication factor controls how many servers will replicate each message that is written. If you have a replication factor of 3 then up to 2 servers can fail before you will lose access to your data. We recommend you use a replication factor of 2 or 3 so that you can transparently bounce machines without interrupting data consumption.
The partition count controls how many logs the topic will be sharded into. There are several impacts of the partition count. First each partition must fit entirely on a single server. So if you have 20 partitions the full data set (and read and write load) will be handled by no more than 20 servers (not counting replicas). Finally the partition count impacts the maximum parallelism of your consumers. This is discussed in greater detail in the concepts section.
Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id. Since a typical folder name can not be over 255 characters long, there will be a limitation on the length of topic names. We assume the number of partitions will not ever be above 100,000. Therefore, topic names cannot be longer than 249 characters. This leaves just enough room in the folder name for a dash and a potentially 5 digit long partition id.
The configurations added on the command line override the default settings the server has for things like the length of time data should be retained. The complete set of per-topic configurations is documented here.
Modifying topics
You can change the con guration or partitioning of a topic using the same topic tool.
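For example, to add partitions (a sketch; the topic name and partition count are placeholders):
> bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40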
Be aware that one use case for partitions is to semantically partition data, and adding partitions doesn't change the partitioning of existing data so this may disturb consumers if they rely on that partition. That is, if data is partitioned by hash(key) % number_of_partitions then this partitioning will potentially be shuffled by adding partitions but Kafka will not attempt to automatically redistribute data in any way.
To remove a config:
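A sketch of the corresponding command, assuming the kafka-configs.sh tool (the topic name and the config key x are placeholders):
> bin/kafka-configs.sh --zookeeper zk_host:port/chroot --alter --entity-type topics --entity-name my_topic_name --delete-config x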
Kafka does not currently support reducing the number of partitions for a topic.
Instructions for changing the replication factor of a topic can be found here.
Graceful shutdown
The Kafka cluster will automatically detect any broker shutdown or failure and elect new leaders for the partitions on that machine. This will occur whether a server fails or it is brought down intentionally for maintenance or configuration changes. For the latter cases Kafka supports a more graceful mechanism for stopping a server than just killing it. When a server is stopped gracefully it has two optimizations it will take advantage of:
1. It will sync all its logs to disk to avoid needing to do any log recovery when it restarts (i.e. validating the checksum for all messages in the tail of the log). Log recovery takes time so this speeds up intentional restarts.
2. It will migrate any partitions the server is the leader for to other replicas prior to shutting down. This will make the leadership transfer faster and minimize the time each partition is unavailable to a few milliseconds.
Syncing the logs will happen automatically whenever the server is stopped other than by a hard kill, but the controlled leadership migration requires using a special setting:
controlled.shutdown.enable=true
Note that controlled shutdown will only succeed if all the partitions hosted on the broker have replicas (i.e. the replication factor is greater than 1 and at least one of these replicas is alive). This is generally what you want since shutting down the last replica would make that topic partition unavailable.
Balancing leadership
Whenever a broker stops or crashes, leadership for that broker's partitions transfers to other replicas. This means that by default when the broker is restarted it will only be a follower for all its partitions, meaning it will not be used for client reads and writes.
To avoid this imbalance, Kafka has a notion of preferred replicas. If the list of replicas for a partition is 1,5,9 then node 1 is preferred as the leader to either node 5 or 9 because it is earlier in the replica list. You can have the Kafka cluster try to restore leadership to the restored replicas by running the command:
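A sketch of this command, assuming the kafka-preferred-replica-election.sh tool that ships in bin/ (the ZooKeeper address is a placeholder):
> bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot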
Since running this command can be tedious you can also configure Kafka to do this automatically by setting the following configuration:
auto.leader.rebalance.enable=true
Balancing Replicas Across Racks
The rack awareness feature spreads replicas of the same partition across different racks. This extends the guarantees Kafka provides for broker-failure to cover rack-failure, limiting the risk of data loss should all the brokers on a rack fail at once. The feature can also be applied to other broker groupings such as availability zones in EC2.
You can specify that a broker belongs to a particular rack by adding a property to the broker config:
broker.rack=my-rack-id
http://kafka.apache.org/documentation.html#introduction 77/130
11/8/2017 Apache Kafka
When a topic is created, modified or replicas are redistributed, the rack constraint will be honoured, ensuring replicas span as many racks as they can (a partition will span min(#racks, replication-factor) different racks).
The algorithm used to assign replicas to brokers ensures that the number of leaders per broker will be constant, regardless of how brokers are distributed across racks. This ensures balanced throughput.
However if racks are assigned different numbers of brokers, the assignment of replicas will not be even. Racks with fewer brokers will get more replicas, meaning they will use more storage and put more resources into replication. Hence it is sensible to configure an equal number of brokers per rack.
Mirroring data between clusters
We refer to the process of replicating data between Kafka clusters as "mirroring" to avoid confusion with the replication that happens amongst the nodes in a single cluster. Kafka comes with a tool for mirroring data between Kafka clusters. The tool consumes from a source cluster and produces to a destination cluster. A common use case for this kind of mirroring is to provide a replica in another datacenter. This scenario will be discussed in more detail in the next section.
You can run many such mirroring processes to increase throughput and for fault-tolerance (if one process dies, the others will take over the additional load).
Data will be read from topics in the source cluster and written to a topic with the same name in the destination cluster. In fact the mirror maker is little more than a Kafka consumer and producer hooked together.
The source and destination clusters are completely independent entities: they can have different numbers of partitions and the offsets will not be the same. For this reason the mirror cluster is not really intended as a fault-tolerance mechanism (as the consumer position will be different); for that we recommend using normal in-cluster replication. The mirror maker process will, however, retain and use the message key for partitioning so order is preserved on a per-key basis.
Here is an example showing how to mirror a single topic (named my-topic) from an input cluster:
> bin/kafka-mirror-maker.sh
      --consumer.config consumer.properties
      --producer.config producer.properties --whitelist my-topic
Note that we specify the list of topics with the --whitelist option. This option allows any regular expression using Java-style regular expressions. So you could mirror two topics named A and B using --whitelist 'A|B'. Or you could mirror all topics using --whitelist '*'. Make sure to quote any regular expression to ensure the shell doesn't try to expand it as a file path. For convenience we allow the use of ',' instead of '|' to specify a list of topics.
Sometimes it is easier to say what it is that you don't want. Instead of using --whitelist to say what you want to mirror you can use --blacklist to say what to exclude. This also takes a regular expression argument. However, --blacklist is not supported when the new consumer has been enabled (i.e. when bootstrap.servers has been defined in the consumer configuration).
Combining mirroring with the configuration auto.create.topics.enable=true makes it possible to have a replica cluster that will automatically create and replicate all data in a source cluster even as new topics are added.
Checking consumer position
Sometimes it's useful to see the position of your consumers. We have a tool that will show the position of all consumers in a consumer group as well as how far behind the end of the log they are. To run this tool on a consumer group named my-group consuming a topic named my-topic, the command would look like this:
http://kafka.apache.org/documentation.html#introduction 78/130
11/8/2017 Apache Kafka
> bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group my-group

  Note: This will only show information about consumers that use ZooKeeper (not those using the Java consumer API).

  TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
  my-topic   0          2               4               2    my-group_consumer-1
  my-topic   1          2               3               1    my-group_consumer-1
  my-topic   2          2               3               1    my-group_consumer-2
Managing Consumer Groups
With the ConsumerGroupCommand tool, we can list, describe, or delete consumer groups. Note that deletion is only available when the group metadata is stored in ZooKeeper. When using the new consumer API (where the broker handles coordination of partition handling and rebalance), the group is deleted when the last committed offset for that group expires. For example, to list all consumer groups across all topics:
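A sketch of the list command (the broker address is a placeholder):
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list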
To view offsets, as mentioned earlier, we "describe" the consumer group like this:
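A sketch of the describe command for the hypothetical group my-group used above:
> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group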
If you are using the old high-level consumer and storing the group metadata in ZooKeeper (i.e. offsets.storage=zookeeper), pass --zookeeper instead of --bootstrap-server:
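For example (the ZooKeeper address and group name are placeholders):
> bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe --group my-group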
Expanding your cluster
Adding servers to a Kafka cluster is easy, just assign them a unique broker id and start up Kafka on your new servers. However these new servers will not automatically be assigned any data partitions, so unless partitions are moved to them they won't be doing any work until new topics are created. So usually when you add machines to your cluster you will want to migrate some existing data to these machines.
The process of migrating data is manually initiated but fully automated. Under the covers what happens is that Kafka will add the new server as a follower of the partition it is migrating and allow it to fully replicate the existing data in that partition. When the new server has fully replicated the contents of this partition and joined the in-sync replica set, one of the existing replicas will delete their partition's data.
The partition reassignment tool can be used to move partitions across brokers. An ideal partition distribution would ensure even data load and partition sizes across all brokers. The partition reassignment tool does not have the capability to automatically study the data distribution in a Kafka cluster and move partitions around to attain an even load distribution. As such, the admin has to figure out which topics or partitions should be moved around.
The partition reassignment tool can run in 3 mutually exclusive modes:
--generate: In this mode, given a list of topics and a list of brokers, the tool generates a candidate reassignment to move all partitions of the specified topics to the new brokers. This option merely provides a convenient way to generate a partition reassignment plan given a list of topics and target brokers.
--execute: In this mode, the tool kicks off the reassignment of partitions based on the user provided reassignment plan (using the --reassignment-json-file option). This can either be a custom reassignment plan hand crafted by the admin or provided by using the --generate option.
--verify: In this mode, the tool verifies the status of the reassignment for all partitions listed during the last --execute. The status can be either successfully completed, failed or in progress.
The partition reassignment tool can be used to move some topics off of the current set of brokers to the newly added brokers. This is typically useful while expanding an existing cluster since it is easier to move entire topics to the new set of brokers, than moving one partition at a time. When used to do this, the user should provide a list of topics that should be moved to the new set of brokers and a target list of new brokers. The tool then evenly distributes all partitions for the given list of topics across the new set of brokers. During this move, the replication factor of the topic is kept constant. Effectively the replicas for all partitions for the input list of topics are moved from the old set of brokers to the newly added brokers.
For instance, the following example will move all partitions for topics foo1,foo2 to the new set of brokers 5,6. At the end of this move, all partitions for topics foo1 and foo2 will only exist on brokers 5,6.
Since the tool accepts the input list of topics as a json file, you first need to identify the topics you want to move and create the json file as follows:
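A sketch of such a file (the file name topics-to-move.json is arbitrary; the topics match the example above):
> cat topics-to-move.json
{"topics": [{"topic": "foo1"},
            {"topic": "foo2"}],
 "version":1
}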
Once the json file is ready, use the partition reassignment tool to generate a candidate assignment:
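For example (the ZooKeeper address and broker list are placeholders consistent with the example above):
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics-to-move.json --broker-list "5,6" --generate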
The tool generates a candidate assignment that will move all partitions from topics foo1,foo2 to brokers 5,6. Note, however, that at this point the partition movement has not started, it merely tells you the current assignment and the proposed new assignment. The current assignment should be saved in case you want to rollback to it. The new assignment should be saved in a json file (e.g. expand-cluster-reassignment.json) to be input to the tool with the --execute option as follows:
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --execute
Current partition replica assignment

{"version":1,
 "partitions":[{"topic":"foo1","partition":2,"replicas":[1,2]},
               {"topic":"foo1","partition":0,"replicas":[3,4]},
               {"topic":"foo2","partition":2,"replicas":[1,2]},
               {"topic":"foo2","partition":0,"replicas":[3,4]},
               {"topic":"foo1","partition":1,"replicas":[2,3]},
               {"topic":"foo2","partition":1,"replicas":[2,3]}]
}

Save this to use as the --reassignment-json-file option during rollback
Successfully started reassignment of partitions
{"version":1,
 "partitions":[{"topic":"foo1","partition":2,"replicas":[5,6]},
               {"topic":"foo1","partition":0,"replicas":[5,6]},
               {"topic":"foo2","partition":2,"replicas":[5,6]},
               {"topic":"foo2","partition":0,"replicas":[5,6]},
               {"topic":"foo1","partition":1,"replicas":[5,6]},
               {"topic":"foo2","partition":1,"replicas":[5,6]}]
}
http://kafka.apache.org/documentation.html#introduction 80/130
11/8/2017 Apache Kafka
Finally, the --verify option can be used with the tool to check the status of the partition reassignment. Note that the same expand-cluster-reassignment.json (used with the --execute option) should be used with the --verify option:
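A sketch of the verify invocation (status output omitted; the ZooKeeper address is a placeholder):
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --verify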
The partition reassignment tool can also be used to selectively move replicas of a partition to a specific set of brokers. When used in this manner, it is assumed that the user knows the reassignment plan and does not require the tool to generate a candidate reassignment, effectively skipping the --generate step and moving straight to the --execute step.
For instance, the following example moves partition 0 of topic foo1 to brokers 5,6 and partition 1 of topic foo2 to brokers 2,3:
The first step is to hand craft the custom reassignment plan in a json file:
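A sketch of such a file (the file name custom-reassignment.json is arbitrary; the partitions and brokers match the example above):
> cat custom-reassignment.json
{"version":1,"partitions":[{"topic":"foo1","partition":0,"replicas":[5,6]},
                           {"topic":"foo2","partition":1,"replicas":[2,3]}]}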
Then, use the json file with the --execute option to start the reassignment process:
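For example (the ZooKeeper address is a placeholder):
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file custom-reassignment.json --execute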
The --verify option can be used with the tool to check the status of the partition reassignment. Note that the same custom-reassignment.json (used with the --execute option) should be used with the --verify option:
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file custom-reassignment.json --verify
Status of partition reassignment:
Reassignment of partition [foo1,0] completed successfully
Reassignment of partition [foo2,1] completed successfully
Decommissioning brokers
The partition reassignment tool does not have the ability to automatically generate a reassignment plan for decommissioning brokers yet. As such, the admin has to come up with a reassignment plan to move the replica for all partitions hosted on the broker to be decommissioned to the rest of the brokers. This can be relatively tedious as the reassignment needs to ensure that all the replicas are not moved from the decommissioned broker to only one other broker. To make this process effortless, we plan to add tooling support for decommissioning brokers in the future.
Increasing replication factor
Increasing the replication factor of an existing partition is easy. Just specify the extra replicas in the custom reassignment json file and use it with the --execute option to increase the replication factor of the specified partitions.
http://kafka.apache.org/documentation.html#introduction 81/130
11/8/2017 Apache Kafka
For instance, the following example increases the replication factor of partition 0 of topic foo from 1 to 3. Before increasing the replication factor, the partition's only replica existed on broker 5. As part of increasing the replication factor, we will add more replicas on brokers 6 and 7.
The first step is to hand craft the custom reassignment plan in a json file:
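A sketch of such a file (the file name increase-replication-factor.json is arbitrary; the topic, partition and brokers match the example above):
> cat increase-replication-factor.json
{"version":1,
 "partitions":[{"topic":"foo","partition":0,"replicas":[5,6,7]}]}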
Then, use the json file with the --execute option to start the reassignment process:
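For example (the ZooKeeper address is a placeholder):
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --execute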
The --verify option can be used with the tool to check the status of the partition reassignment. Note that the same increase-replication-factor.json (used with the --execute option) should be used with the --verify option:
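For example (status output omitted):
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --verify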
You can also verify the increase in replication factor with the kafka-topics tool:
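A sketch of the describe command (output omitted):
> bin/kafka-topics.sh --zookeeper localhost:2181 --topic foo --describe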
Limiting Bandwidth Usage during Data Migration
Kafka lets you apply a throttle to replication traffic, setting an upper bound on the bandwidth used to move replicas from machine to machine. This is useful when rebalancing a cluster, bootstrapping a new broker or adding or removing brokers, as it limits the impact these data-intensive operations will have on users.
There are two interfaces that can be used to engage a throttle. The simplest, and safest, is to apply a throttle when invoking kafka-reassign-partitions.sh, but kafka-configs.sh can also be used to view and alter the throttle values directly.
So for example, if you were to execute a rebalance, with the below command, it would move partitions at no more than 50MB/s.
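A sketch of such an invocation (the reassignment file name bigger-cluster.json and the ZooKeeper address are placeholders; the throttle is given in bytes/sec):
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --execute --reassignment-json-file bigger-cluster.json --throttle 50000000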
When you execute this script you will see the throttle engage:
Should you wish to alter the throttle, during a rebalance, say to increase the throughput so it completes quicker, you can do this by re-running the execute command, passing the same reassignment-json-file:
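For example, re-running with a higher throttle (values are illustrative):
> bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --execute --reassignment-json-file bigger-cluster.json --throttle 700000000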
Once the rebalance completes the administrator can check the status of the rebalance using the --verify option. If the rebalance has completed, the throttle will be removed via the --verify command. It is important that administrators remove the throttle in a timely manner once rebalancing completes by running the command with the --verify option. Failure to do so could cause regular replication traffic to be throttled.
When the --verify option is executed, and the reassignment has completed, the script will confirm that the throttle was removed:
http://kafka.apache.org/documentation.html#introduction 82/130
11/8/2017 Apache Kafka
Status of partition reassignment:
Reassignment of partition [my-topic,1] completed successfully
Reassignment of partition [mytopic,0] completed successfully
Throttle was removed.
The administrator can also validate the assigned configs using kafka-configs.sh. There are two pairs of throttle configuration used to manage the throttling process. The first pair refers to the throttle value itself. This is configured, at a broker level, using the dynamic properties:
leader.replication.throttled.rate
follower.replication.throttled.rate
Then there is the enumerated set of throttled replicas:
leader.replication.throttled.replicas
follower.replication.throttled.replicas
which are configured per topic. All four config values are automatically assigned by kafka-reassign-partitions.sh (discussed below).
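To inspect the assigned values, the broker-level and topic-level configs can be described with kafka-configs.sh; a sketch, with the output (which lists the four properties above) omitted and the ZooKeeper address a placeholder:
> bin/kafka-configs.sh --describe --zookeeper localhost:2181 --entity-type brokers
> bin/kafka-configs.sh --describe --zookeeper localhost:2181 --entity-type topics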
This shows the throttle applied to both leader and follower side of the replication protocol. By default both sides are assigned the same throttled throughput value.
Here we see the leader throttle is applied to partition 1 on broker 102 and partition 0 on broker 101. Likewise the follower throttle is applied to partition 1 on broker 101 and partition 0 on broker 102.
By default kafka-reassign-partitions.sh will apply the leader throttle to all replicas that exist before the rebalance, any one of which might be leader. It will apply the follower throttle to all move destinations. So if there is a partition with replicas on brokers 101,102, being reassigned to 102,103, the leader throttle, for that partition, would be applied to 101,102 and the follower throttle would be applied to 103 only.
If required, you can also use the --alter switch on kafka-configs.sh to alter the throttle configurations manually.
Download Some care should be taken when using throttled replication. In particular:
The throttle should be removed in a timely manner once reassignment completes (by running kafka-reassign-partitions.sh --verify).

If the throttle is set too low, in comparison to the incoming write rate, it is possible for replication to not make progress. This occurs when: max(BytesInPerSec) > throttle, where BytesInPerSec is the metric that monitors the write throughput of producers into each broker.
The administrator can monitor whether replication is making progress, during the rebalance, using the metric:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
The lag should constantly decrease during replication. If the metric does not decrease the administrator should increase the throttle throughput as described above.
Setting quotas

Quota overrides and defaults may be configured at (user, client-id), user or client-id levels as described here. By default, clients receive an unlimited quota. It is possible to set custom quotas for each (user, client-id), user or client-id group.

It is possible to set default quotas for each (user, client-id), user or client-id group by specifying the --entity-default option instead of --entity-name.
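For example, a custom quota for a hypothetical user1/clientA pair could be set along these lines:

> bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048,request_percentage=200' --entity-type users --entity-name user1 --entity-type clients --entity-name clientA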
Describe quota for a given (user, client-id):

> bin/kafka-configs.sh --zookeeper localhost:2181 --describe --entity-type users --entity-name user1 --entity-type clients --entity-name clientA
Configs for user-principal 'user1', client-id 'clientA' are producer_byte_rate=1024,consumer_byte_rate=2048,request_percentage=200
Describe quota for a given user:
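For example (user1 being a placeholder principal):

> bin/kafka-configs.sh --zookeeper localhost:2181 --describe --entity-type users --entity-name user1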
If entity name is not specified, all entities of the specified type are described. For example, describe all users:
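For example:

> bin/kafka-configs.sh --zookeeper localhost:2181 --describe --entity-type users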
It is possible to set default quotas that apply to all client-ids by setting these configs on the brokers. These properties are applied only if quota overrides or defaults are not configured in ZooKeeper. By default, each client-id receives an unlimited quota. The following sets the default quota per producer and consumer client-id to 10MB/sec.

quota.producer.default=10485760
quota.consumer.default=10485760

Note that these properties are being deprecated and may be removed in a future release. Defaults configured using kafka-configs.sh take precedence over these properties.
6.2 Datacenters
Some deployments will need to manage a data pipeline that spans multiple datacenters. Our recommended approach to this is to deploy a local Kafka cluster in each datacenter with application instances in each datacenter interacting only with their local cluster and mirroring between clusters (see the documentation on the mirror maker tool for how to do this).

This deployment pattern allows datacenters to act as independent entities and allows us to manage and tune inter-datacenter replication centrally. This allows each facility to stand alone and operate even if the inter-datacenter links are unavailable: when this occurs the mirroring falls behind until the link is restored, at which time it catches up.

For applications that need a global view of all data you can use mirroring to provide clusters which have aggregate data mirrored from the local clusters in all datacenters. These aggregate clusters are used for reads by applications that require the full data set.
This is not the only possible deployment pattern. It is possible to read from or write to a remote Kafka cluster over the WAN, though obviously this will add whatever latency is required to get to the cluster.
Kafka naturally batches data in both the producer and consumer so it can achieve high throughput even over a high-latency connection. To allow this though it may be necessary to increase the TCP socket buffer sizes for the producer, consumer, and broker using the socket.send.buffer.bytes and socket.receive.buffer.bytes configurations. The appropriate way to set this is documented here.
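A minimal sketch of the broker-side settings in server.properties, with purely illustrative buffer sizes:

# values are examples only; size the buffers for your own WAN latency
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576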
It is generally not advisable to run a single Kafka cluster that spans multiple datacenters over a high-latency link. This will incur very high replication latency both for Kafka writes and ZooKeeper writes, and neither Kafka nor ZooKeeper will remain available in all locations if the network between locations is unavailable.
The most important old Scala producer configurations control:

acks
compression
sync vs async production
batch size (for async producers)

The most important new Java producer configurations control:

acks
compression
batch size

Here is an example production server configuration:
# ZooKeeper
zookeeper.connect=[list of ZooKeeper servers]

# Log configuration
num.partitions=8
default.replication.factor=3
log.dir=[List of directories. Kafka should have its own dedicated disk(s) or SSD(s).]

# Other configurations
broker.id=[An integer. Start with 0 and increment by 1 for each new broker.]
listeners=[list of listeners]
auto.create.topics.enable=false
min.insync.replicas=2
queued.max.requests=[number of concurrent requests]
Our client configuration varies a fair amount between different use cases.

From a security perspective, we recommend you use the latest released version of JDK 1.8 as older freely available versions have disclosed security vulnerabilities. LinkedIn is currently running JDK 1.8 u5 (looking to upgrade to a newer version) with the G1 collector. If you decide to use the G1 collector (the current default) and you are still on JDK 1.7, make sure you are on u51 or newer. LinkedIn tried out u21 in testing, but they had a number of problems with the GC implementation in that version. LinkedIn's tuning looks like this:
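The exact flags are not reproduced here; a G1 configuration along the following lines is representative of that style of tuning (heap size and pause targets are illustrative and should be sized for your own brokers):

-Xmx6g -Xms6g -XX:MetaspaceSize=96m -XX:+UseG1GC
-XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
-XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80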
For reference, here are the stats on one of LinkedIn's busiest clusters (at peak):
60 brokers
50k partitions (replication factor 2)
800k messages/sec in
300 MB/sec inbound, 1 GB/sec+ outbound
The tuning looks fairly aggressive, but all of the brokers in that cluster have a 90% GC pause time of about 21ms, and they're doing less than 1 young GC per second.
We are using dual quad-core Intel Xeon machines with 24GB of memory.
You need sufficient memory to buffer active readers and writers. You can do a back-of-the-envelope estimate of memory needs by assuming you want to be able to buffer for 30 seconds and compute your memory need as write_throughput*30.
The disk throughput is important. We have 8x7200 rpm SATA drives. In general disk throughput is the performance bottleneck, and more disks is better. Depending on how you configure flush behavior you may or may not benefit from more expensive disks (if you force flush often then higher RPM SAS drives may be better).
OS
Kafka should run well on any unix system and has been tested on Linux and Solaris.
We have seen a few issues running on Windows and Windows is not currently a well supported platform, though we would be happy to change that.

It is unlikely to require much OS-level tuning, but there are two potentially important OS-level configurations:

File descriptor limits: Kafka uses file descriptors for log segments and open connections. If a broker hosts many partitions, consider that the broker needs at least (number_of_partitions)*(partition_size/segment_size) to track all log segments in addition to the number of connections the broker makes. We recommend at least 100000 allowed file descriptors for the broker processes as a starting point.
Max socket buffer size: can be increased to enable high-performance data transfer between data centers as described here.
We recommend using multiple drives to get good throughput and not sharing the same drives used for Kafka data with application logs or other OS filesystem activity to ensure good latency. You can either RAID these drives together into a single volume or format and mount each drive as its own directory. Since Kafka has replication the redundancy provided by RAID can also be provided at the application level. This choice has several tradeoffs.
If you configure multiple data directories partitions will be assigned round-robin to data directories. Each partition will be entirely in one of the data directories. If data is not well balanced among partitions this can lead to load imbalance between disks.
RAID can potentially do better at balancing load between disks (although it doesn't always seem to) because it balances load at a lower level. The primary downside of RAID is that it is usually a big performance hit for write throughput and reduces the available disk space.
Another potential benefit of RAID is the ability to tolerate disk failures. However our experience has been that rebuilding the RAID array is so I/O intensive that it effectively disables the server, so this does not provide much real availability improvement.
Kafka always immediately writes all data to the filesystem and supports the ability to configure the flush policy that controls when data is forced out of the OS cache and onto disk using the flush. This flush policy can be controlled to force data to disk after a period of time or after a certain number of messages has been written. There are several choices in this configuration.
Kafka must eventually call fsync to know that data was flushed. When recovering from a crash, for any log segment not known to be fsync'd Kafka will check the integrity of each message by checking its CRC and also rebuild the accompanying offset index file as part of the recovery process executed on startup.
Note that durability in Kafka does not require syncing data to disk, as a failed node will always recover from its replicas.
We recommend using the default flush settings which disable application fsync entirely. This means relying on the background flush done by the OS and Kafka's own background flush. This provides the best of all worlds for most uses: no knobs to tune, great throughput and latency, and full recovery guarantees. We generally feel that the guarantees provided by replication are stronger than sync to local disk, however the paranoid still may prefer having both, and application level fsync policies are still supported.
The drawback of using application level flush settings is that it is less efficient in its disk usage pattern (it gives the OS less leeway to re-order writes) and it can introduce latency as fsync in most Linux filesystems blocks writes to the file, whereas the background flushing does much more granular page-level locking.
In general you don't need to do any low-level tuning of the filesystem, but in the next few sections we will go over some of this in case it is useful.
In Linux, data written to the filesystem is maintained in pagecache until it must be written out to disk (due to an application-level fsync or the OS's own flush policy). The flushing of data is done by a set of background threads called pdflush (or in post 2.6.32 kernels "flusher threads").

Pdflush has a configurable policy that controls how much dirty data can be maintained in cache and for how long before it must be written back to disk. This policy is described here. When Pdflush cannot keep up with the rate of data being written it will eventually cause the writing process to block, incurring latency in the writes to slow down the accumulation of data.
Using pagecache has several advantages over an in-process cache for storing data that will be written out to disk:
The I/O scheduler will batch together consecutive small writes into bigger physical writes which improves throughput.
The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head which improves throughput.
It automatically uses all the free memory on the machine
Filesystem Selection
Kafka uses regular files on disk, and as such it has no hard dependency on a specific filesystem. The two filesystems which have the most usage, however, are EXT4 and XFS. Historically, EXT4 has had more usage, but recent improvements to the XFS filesystem have shown it to have better performance characteristics for Kafka's workload with no compromise in stability.
Comparison testing was performed on a cluster with significant message loads, using a variety of filesystem creation and mount options. The primary metric in Kafka that was monitored was the "Request Local Time", indicating the amount of time append operations were taking. XFS resulted in much better local times (160ms vs. 250ms+ for the best EXT4 configuration), as well as lower average wait times. The XFS performance also showed less variability in disk performance.
For any filesystem used for data directories, on Linux systems, the following options are recommended to be used at mount time:

noatime: This option disables updating of a file's atime (last access time) attribute when the file is read. This can eliminate a significant number of filesystem writes, especially in the case of bootstrapping consumers. Kafka does not rely on the atime attributes at all, so it is safe to disable this. An example mount entry is shown below.
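For example, a hypothetical /etc/fstab entry for a dedicated data drive (device and mount point are placeholders):

/dev/sdb1  /var/kafka-logs  ext4  defaults,noatime  0  2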
XFS Notes
The XFS filesystem has a significant amount of auto-tuning in place, so it does not require any change in the default settings, either at filesystem creation time or at mount. The only tuning parameters worth considering are:
largeio: This affects the preferred I/O size reported by the stat call. While this can allow for higher performance on larger disk writes, in practice it had minimal or no effect on performance.
nobarrier: For underlying devices that have battery-backed cache, this option can provide a little more performance by disabling periodic write flushes. However, if the underlying device is well-behaved, it will report to the filesystem that it does not require flushes, and this option will have no effect.
EXT4 Notes
EXT4 is a serviceable choice of filesystem for the Kafka data directories, however getting the most performance out of it will require adjusting several mount options. In addition, these options are generally unsafe in a failure scenario, and will result in much more data loss and corruption. For a single broker failure, this is not much of a concern as the disk can be wiped and the replicas rebuilt from the cluster. In a multiple failure scenario, such as a power outage, this can mean underlying filesystem (and therefore data) corruption that is not easily recoverable. The following options can be adjusted:
data=writeback: Ext4 defaults to data=ordered which puts a strong order on some writes. Kafka does not require this ordering as it does very paranoid data recovery on all unflushed log. This setting removes the ordering constraint and seems to significantly reduce latency.
Disabling journaling: Journaling is a tradeoff: it makes reboots faster after server crashes but it introduces a great deal of additional locking which adds variance to write performance. Those who don't care about reboot time and want to reduce a major source of write latency variability can turn off journaling entirely.
commit=num_secs: This tunes the frequency with which ext4 commits to its metadata journal. Setting this to a lower value reduces the loss of unflushed data during a crash. Setting this to a higher value will improve throughput.
nobh: This setting controls additional ordering guarantees when using data=writeback mode. This should be safe with Kafka as we do not depend on write ordering, and it improves throughput and latency.
delalloc: Delayed allocation means that the filesystem avoids allocating any blocks until the physical write occurs. This allows ext4 to allocate a large extent instead of smaller pages and helps ensure the data is written sequentially. This feature is great for throughput. It does seem to involve some locking in the filesystem which adds a bit of latency variance.
6.6 Monitoring

Kafka uses Yammer Metrics for metrics reporting in the server and Scala clients. The Java clients use Kafka Metrics, a built-in metrics registry that minimizes transitive dependencies pulled into client applications. Both expose metrics via JMX and can be configured to report stats using pluggable stats reporters to hook up to your monitoring system.

The easiest way to see the available metrics is to fire up jconsole and point it at a running Kafka client or server; this will allow browsing all metrics with JMX.

We do graphing and alerting on the following metrics:
Message in rate: kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
Byte in rate: kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
Request rate: kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower}
Byte out rate: kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec
Log flush rate and time: kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
# of under replicated partitions (|ISR| < |all replicas|): kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions (normal value: 0)
# of under minIsr partitions (|ISR| < min.insync.replicas): kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount (normal value: 0)
# of offline log directories: kafka.log:type=LogManager,name=OfflineLogDirectoryCount (normal value: 0)
Is controller active on broker: kafka.controller:type=KafkaController,name=ActiveControllerCount (only one broker in the cluster should have 1)
Leader election rate: kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs (non-zero when there are broker failures)
Unclean leader election rate: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec (normal value: 0)
Partition counts: kafka.server:type=ReplicaManager,name=PartitionCount (mostly even across brokers)
Leader replica counts: kafka.server:type=ReplicaManager,name=LeaderCount (mostly even across brokers)
ISR shrink rate: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec (if a broker goes down, ISR for some of the partitions will shrink; when the broker is up again, ISR will be expanded once the replicas are fully caught up; other than that, the expected value for both ISR shrink rate and expansion rate is 0)
ISR expansion rate: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec (see above)
Max lag in messages btw follower and leader replicas: kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica (lag should be proportional to the maximum batch size of a produce request)
Lag in messages per follower replica: kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+) (lag should be proportional to the maximum batch size of a produce request)
Requests waiting in the producer purgatory: kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce (non-zero if ack=-1 is used)
Requests waiting in the fetch purgatory: kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch (size depends on fetch.wait.max.ms in the consumer)
Request total time: kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower} (broken into queue, local, remote and response send time)
Time the request waits in the request queue: kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request is processed at the leader: kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time the request waits for the follower: kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower} (non-zero for produce requests when ack=-1)
Time the request waits in the response queue: kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower}
Time to send the response: kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower}
Number of messages the consumer lags behind the producer by (published by the consumer, not the broker): Old consumer: kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+); New consumer: kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id}, attribute records-lag-max
The average fraction of time the network processors are idle: kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent (between 0 and 1, ideally > 0.3)
The average fraction of time the request handler threads are idle: kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent (between 0 and 1, ideally > 0.3)
Bandwidth quota metrics per (user, client-id), user or client-id: kafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+). Two attributes. throttle-time indicates the amount of time in ms the client was throttled (ideally 0). byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Request quota metrics per (user, client-id), user or client-id: kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+). Two attributes. throttle-time indicates the amount of time in ms the client was throttled (ideally 0). request-time indicates the percentage of time spent in broker network and I/O threads to process requests from the client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Requests exempt from throttling: kafka.server:type=Request. exempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling.
The following metrics are available on producer/consumer/connector/streams instances. For specific metrics, please see the following sections. All of them are reported under the MBean kafka.[producer|consumer|connect]:type=[producer|consumer|connect]-metrics,client-id=([-.\w]+):

connection-close-rate: Connections closed per second in the window.
connection-creation-rate: New connections established per second in the window.
network-io-rate: The average number of network operations (reads or writes) on all connections per second.
outgoing-byte-rate: The average number of outgoing bytes sent per second to all servers.
request-rate: The average number of requests sent per second.
request-size-avg: The average size of all requests in the window.
request-size-max: The maximum size of any request sent in the window.
incoming-byte-rate: Bytes/second read off all sockets.
response-rate: Responses received per second.
select-rate: Number of times the I/O layer checked for new I/O to perform per second.
io-wait-time-ns-avg: The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
io-wait-ratio: The fraction of time the I/O thread spent waiting.
io-time-ns-avg: The average length of time for I/O per select call in nanoseconds.
io-ratio: The fraction of time the I/O thread spent doing I/O.
connection-count: The current number of active connections.
Producer monitoring

The following metrics are reported under the MBean kafka.producer:type=producer-metrics,client-id=([-.\w]+):

waiting-threads: The number of user threads blocked waiting for buffer memory to enqueue their records.
buffer-total-bytes: The maximum amount of buffer memory the client can use (whether or not it is currently used).
buffer-available-bytes: The total amount of buffer memory that is not being used (either unallocated or in the free list).
bufferpool-wait-time: The fraction of time an appender waits for space allocation.
kafka.producer:type=producer-metrics,client-id="{client-id}"

records-per-request-avg: The average number of records per request.
request-latency-avg: The average request latency in ms.
request-latency-max: The maximum request latency in ms.
requests-in-flight: The current number of in-flight requests awaiting a response.

kafka.producer:type=producer-topic-metrics,client-id="{client-id}",topic="{topic}"

byte-rate: The average number of bytes sent per second for a topic.
byte-total: The total number of bytes sent for a topic.
compression-rate: The average compression rate of record batches for a topic.
record-error-rate: The average per-second number of record sends that resulted in errors for a topic.
record-error-total: The total number of record sends that resulted in errors for a topic.
record-retry-rate: The average per-second number of retried record sends for a topic.
record-retry-total: The total number of retried record sends for a topic.
record-send-rate: The average number of records sent per second for a topic.
record-send-total: The total number of records sent for a topic.
Consumer monitoring

The following metrics are reported under the MBean kafka.consumer:type=consumer-coordinator-metrics,client-id=([-.\w]+):

commit-latency-avg: The average time taken for a commit request.
commit-latency-max: The max time taken for a commit request.
commit-rate: The number of commit calls per second.
assigned-partitions: The number of partitions currently assigned to this consumer.
heartbeat-response-time-max: The max time taken to receive a response to a heartbeat request.
heartbeat-rate: The average number of heartbeats per second.
join-time-avg: The average time taken for a group rejoin.
join-time-max: The max time taken for a group rejoin.
join-rate: The number of group joins per second.
sync-time-avg: The average time taken for a group sync.
sync-time-max: The max time taken for a group sync.
sync-rate: The number of group syncs per second.
last-heartbeat-seconds-ago: The number of seconds since the last controller heartbeat.
kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}"

bytes-consumed-rate: The average number of bytes consumed per second for a topic.
bytes-consumed-total: The total number of bytes consumed for a topic.
fetch-size-avg: The average number of bytes fetched per request for a topic.
fetch-size-max: The maximum number of bytes fetched per request for a topic.
records-consumed-rate: The average number of records consumed per second for a topic.
records-consumed-total: The total number of records consumed for a topic.
records-per-request-avg: The average number of records in each request for a topic.
Connect Monitoring

A Connect worker process contains all the producer and consumer metrics as well as metrics specific to Connect. The worker process itself has a number of metrics, while each connector and task have additional metrics.
kafka.connect:type=connect-worker-metrics

connector-count: The number of connectors run in this worker.
connector-startup-attempts-total: The total number of connector startups that this worker has attempted.
connector-startup-failure-percentage: The average percentage of this worker's connectors starts that failed.
connector-startup-failure-total: The total number of connector starts that failed.
connector-startup-success-percentage: The average percentage of this worker's connectors starts that succeeded.
connector-startup-success-total: The total number of connector starts that succeeded.
task-count: The number of tasks run in this worker.
task-startup-attempts-total: The total number of task startups that this worker has attempted.
task-startup-failure-percentage: The average percentage of this worker's tasks starts that failed.
task-startup-failure-total: The total number of task starts that failed.
task-startup-success-percentage: The average percentage of this worker's tasks starts that succeeded.
task-startup-success-total: The total number of task starts that succeeded.
kafka.connect:type=connect-worker-rebalance-metrics

completed-rebalances-total: The total number of rebalances completed by this worker.
epoch: The epoch or generation number of this worker.
leader-name: The name of the group leader.
rebalance-avg-time-ms: The average time in milliseconds spent by this worker to rebalance.
rebalance-max-time-ms: The maximum time in milliseconds spent by this worker to rebalance.
rebalancing: Whether this worker is currently rebalancing.
time-since-last-rebalance-ms: The time in milliseconds since this worker completed the most recent rebalance.
kafka.connect:type=connector-metrics,connector="{connector}"

connector-class: The name of the connector class.
connector-type: The type of the connector. One of 'source' or 'sink'.
connector-version: The version of the connector class, as reported by the connector.
status: The status of the connector. One of 'unassigned', 'running', 'paused', 'failed', or 'destroyed'.
kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}"

batch-size-avg: The average size of the batches processed by the connector.
batch-size-max: The maximum size of the batches processed by the connector.
offset-commit-avg-time-ms: The average time in milliseconds taken by this task to commit offsets.
offset-commit-failure-percentage: The average percentage of this task's offset commit attempts that failed.
offset-commit-max-time-ms: The maximum time in milliseconds taken by this task to commit offsets.
offset-commit-success-percentage: The average percentage of this task's offset commit attempts that succeeded.
pause-ratio: The fraction of time this task has spent in the pause state.
running-ratio: The fraction of time this task has spent in the running state.
status: The status of the connector task. One of 'unassigned', 'running', 'paused', 'failed', or 'destroyed'.
kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"

offset-commit-completion-rate: The average per-second number of offset commit completions that were completed successfully.
offset-commit-completion-total: The total number of offset commit completions that were completed successfully.
offset-commit-seq-no: The current sequence number for offset commits.
offset-commit-skip-rate: The average per-second number of offset commit completions that were received too late and skipped/ignored.
offset-commit-skip-total: The total number of offset commit completions that were received too late and skipped/ignored.
partition-count: The number of topic partitions assigned to this task belonging to the named sink connector in this worker.
put-batch-avg-time-ms: The average time taken by this task to put a batch of sink records.
put-batch-max-time-ms: The maximum time taken by this task to put a batch of sink records.
sink-record-active-count: The number of records that have been read from Kafka but not yet completely committed/flushed/acknowledged by the sink task.
sink-record-active-count-avg: The average number of records that have been read from Kafka but not yet completely committed/flushed/acknowledged by the sink task.
sink-record-active-count-max: The maximum number of records that have been read from Kafka but not yet completely committed/flushed/acknowledged by the sink task.
sink-record-lag-max: The maximum lag in terms of number of records that the sink task is behind the consumer's position for any topic partitions.
sink-record-read-rate: The average per-second number of records read from Kafka for this task belonging to the named sink connector in this worker. This is before transformations are applied.
sink-record-read-total: The total number of records read from Kafka by this task belonging to the named sink connector in this worker, since the task was last restarted.
sink-record-send-rate: The average per-second number of records output from the transformations and sent/put to this task belonging to the named sink connector in this worker. This is after transformations are applied and excludes any records filtered out by the transformations.
sink-record-send-total: The total number of records output from the transformations and sent/put to this task belonging to the named sink connector in this worker, since the task was last restarted.
kafka.connect:type=source-task-metrics,connector="{connector}",task="{task}"

poll-batch-avg-time-ms: The average time in milliseconds taken by this task to poll for a batch of source records.
poll-batch-max-time-ms: The maximum time in milliseconds taken by this task to poll for a batch of source records.
source-record-active-count: The number of records that have been produced by this task but not yet completely written to Kafka.
source-record-active-count-avg: The average number of records that have been produced by this task but not yet completely written to Kafka.
source-record-active-count-max: The maximum number of records that have been produced by this task but not yet completely written to Kafka.
source-record-poll-rate: The average per-second number of records produced/polled (before transformation) by this task belonging to the named source connector in this worker.
source-record-poll-total: The total number of records produced/polled (before transformation) by this task belonging to the named source connector in this worker.
source-record-write-rate: The average per-second number of records output from the transformations and written to Kafka for this task belonging to the named source connector in this worker. This is after transformations are applied and excludes any records filtered out by the transformations.
source-record-write-total: The number of records output from the transformations and written to Kafka for this task belonging to the named source connector in this worker, since the task was last restarted.
Streams Monitoring

A Kafka Streams instance contains all the producer and consumer metrics as well as additional metrics specific to streams. By default Kafka Streams has metrics with two recording levels: debug and info. The debug level records all metrics, while the info level records only the thread-level metrics.

Note that the metrics have a 3-layer hierarchy. At the top level there are per-thread metrics. Each thread has tasks, with their own metrics. Each task has a number of processor nodes, with their own metrics. Each task also has a number of state stores and record caches, all with their own metrics.

Use the following configuration option to specify which metrics you want collected:

metrics.recording.level="info"
Thread Metrics

The following metrics are reported under the MBean kafka.streams:type=stream-metrics,client-id=([-.\w]+):

commit-latency-avg: The average execution time in ms for committing, across all running tasks of this thread.
commit-latency-max: The maximum execution time in ms for committing across all running tasks of this thread.
poll-latency-avg: The average execution time in ms for polling, across all running tasks of this thread.
poll-latency-max: The maximum execution time in ms for polling across all running tasks of this thread.
process-latency-avg: The average execution time in ms for processing, across all running tasks of this thread.
process-latency-max: The maximum execution time in ms for processing across all running tasks of this thread.
punctuate-latency-avg: The average execution time in ms for punctuating, across all running tasks of this thread.
punctuate-latency-max: The maximum execution time in ms for punctuating across all running tasks of this thread.
commit-rate: The average number of commits per second across all tasks.
poll-rate: The average number of polls per second across all tasks.
process-rate: The average number of process calls per second across all tasks.
punctuate-rate: The average number of punctuates per second across all tasks.
task-created-rate: The average number of newly created tasks per second.
task-closed-rate: The average number of tasks closed per second.
skipped-records-rate: The average number of skipped records per second.
Task Metrics

Processor Node Metrics
The following metrics are reported under the MBean kafka.streams:type=stream-processor-node-metrics,client-id=([-.\w]+),task-id=([-.\w]+),processor-node-id=([-.\w]+):

process-latency-avg: The average process execution time in ns.
process-latency-max: The maximum process execution time in ns.
punctuate-latency-avg: The average punctuate execution time in ns.
punctuate-latency-max: The maximum punctuate execution time in ns.
create-latency-avg: The average create execution time in ns.
create-latency-max: The maximum create execution time in ns.
destroy-latency-avg: The average destroy execution time in ns.
destroy-latency-max: The maximum destroy execution time in ns.
process-rate: The average number of process operations per second.
punctuate-rate: The average number of punctuate operations per second.
create-rate: The average number of create operations per second.
destroy-rate: The average number of destroy operations per second.
forward-rate: The average rate of records being forwarded downstream, from source nodes only, per second.
Others
We recommend monitoring GC time and other stats and various server stats such as CPU utilization, I/O service time, etc. On the client side, we recommend monitoring the message/byte rate (global and per topic), request rate/size/time, and on the consumer side, max lag in messages among all partitions and min fetch request rate. For a consumer to keep up, max lag needs to be less than a threshold and min fetch rate needs to be larger than 0.
Audit
The final alerting we do is on the correctness of the data delivery. We audit that every message that is sent is consumed by all consumers and measure the lag for this to occur. For important topics we alert if a certain completeness is not achieved in a certain time period. The details of this are discussed in KAFKA-260.
6.7 ZooKeeper
Stable version
The current stable branch is 3.4 and the latest release of that branch is 3.4.9.
Operationalizing ZooKeeper
Redundancy in the physical/hardware/network layout: try not to put them all in the same rack, decent (but don't go nuts) hardware, try to keep redundant power and network paths, etc. A typical ZooKeeper ensemble has 5 or 7 servers, which tolerates 2 and 3 servers down, respectively. If you have a small deployment, then using 3 servers is acceptable, but keep in mind that you'll only be able to tolerate 1 server down in this case.
I/O segregation: if you do a lot of write type traffic you'll almost definitely want the transaction logs on a dedicated disk group. Writes to the transaction log are synchronous (but batched for performance), and consequently, concurrent writes can significantly affect performance. ZooKeeper snapshots can be one such source of concurrent writes, and ideally should be written on a disk group separate from the transaction log. Snapshots are written to disk asynchronously, so it is typically ok to share with the operating system and message log files. You can configure a server to use a separate disk group with the dataLogDir parameter.
Application segregation: Unless you really understand the application patterns of other apps that you want to install on the same box, it can be a good idea to run ZooKeeper in isolation (though this can be a balancing act with the capabilities of the hardware).
Use care with virtualization: It can work, depending on your cluster layout and read/write patterns and SLAs, but the tiny overheads introduced by the virtualization layer can add up and throw off ZooKeeper, as it can be very time sensitive.
ZooKeeper configuration: It's java, make sure you give it 'enough' heap space (we usually run them with 3-5G, but that's mostly due to the data set size we have here). Unfortunately we don't have a good formula for it, but keep in mind that allowing for more ZooKeeper state means that snapshots can become large, and large snapshots affect recovery time. In fact, if the snapshot becomes too large (a few gigabytes), then you may need to increase the initLimit parameter to give enough time for servers to recover and join the ensemble.
Monitoring: Both JMX and the 4 letter words (4lw) commands are very useful, they do overlap in some cases (and in those cases we prefer the 4 letter commands, they seem more predictable, or at the very least, they work better with the LI monitoring infrastructure).
Don't overbuild the cluster: large clusters, especially in a write heavy usage pattern, mean a lot of intracluster communication (quorums on the writes and subsequent cluster member updates), but don't underbuild it (and risk swamping the cluster). Having more servers adds to your read capacity.
Overall, we try to keep the ZooKeeper system as small as will handle the load (plus standard growth capacity planning) and as simple as possible. We try not to do anything fancy with the configuration or application layout as compared to the official release as well as keep it as self contained as possible. For these reasons, we tend to skip the OS packaged versions, since they have a tendency to try to put things in the OS standard hierarchy, which can be 'messy', for want of a better way to word it.
7. SECURITY
In release 0.9.0.0, the Kafka community added a number of features that, used either separately or together, increase security in a Kafka cluster. The following security measures are currently supported:

1. Authentication of connections to brokers from clients (producers and consumers), other brokers and tools, using either SSL or SASL. Kafka supports the following SASL mechanisms: GSSAPI (Kerberos), PLAIN, SCRAM-SHA-256 and SCRAM-SHA-512.

It's worth noting that security is optional - non-secured clusters are supported, as well as a mix of authenticated, unauthenticated, encrypted and non-encrypted clients. The guides below explain how to configure and use the security features in both clients and brokers.
7.2 Encryption and Authentication using SSL
Apache Kafka allows clients to connect over SSL. By default, SSL is disabled but can be turned on as needed.
1. Generate SSL key and certificate for each Kafka broker

The first step of deploying one or more brokers with the SSL support is to generate the key and the certificate for each machine in the cluster. You can use Java's keytool utility to accomplish this task. We will generate the key into a temporary keystore initially so that we can export and sign it later with CA.

keytool -keystore server.keystore.jks -alias localhost -validity {validity} -genkey -keyalg RSA

Note: By default the property ssl.endpoint.identification.algorithm is not defined, so hostname verification is not performed. In order to enable hostname verification, set the following property:

ssl.endpoint.identification.algorithm=HTTPS
Once enabled, clients will verify the server's fully qualified domain name (FQDN) against one of the following two fields:

1. Common Name (CN)
2. Subject Alternative Name (SAN)

Both fields are valid, RFC-2818 recommends the use of SAN however. SAN is also more flexible, allowing for multiple DNS entries to be declared. Another advantage is that the CN can be set to a more meaningful value for authorization purposes. To add a SAN field append the following argument -ext SAN=DNS:{FQDN} to the keytool command:

keytool -keystore server.keystore.jks -alias localhost -validity {validity} -genkey -keyalg RSA -ext SAN=DNS:{FQDN}

The following command can be run afterwards to verify the contents of the generated certificate:
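One way to do this is with keytool's list command:

keytool -list -v -keystore server.keystore.jks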
After the first step, each machine in the cluster has a public-private key pair, and a certificate to identify the machine. The certificate, however, is unsigned, which means that an attacker can create such a certificate to pretend to be any machine.

2. Creating your own CA

Therefore, it is important to prevent forged certificates by signing them for each machine in the cluster. A certificate authority (CA) is responsible for signing certificates. CA works like a government that issues passports: the government stamps (signs) each passport so that the passport becomes difficult to forge. Other governments verify the stamps to ensure the passport is authentic. Similarly, the CA signs the certificates, and the cryptography guarantees that a signed certificate is computationally difficult to forge. Thus, as long as the CA is a genuine and trusted authority, the clients have high assurance that they are connecting to the authentic machines.
openssl req -new -x509 -keyout ca-key -out ca-cert -days 365

The generated CA is simply a public-private key pair and certificate, and it is intended to sign other certificates.

The next step is to add the generated CA to the clients' truststore so that the clients can trust this CA:
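This import uses keytool, as in step 2 of the combined script further below:

keytool -keystore client.truststore.jks -alias CARoot -import -file ca-cert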
Note: If you configure the Kafka brokers to require client authentication by setting ssl.client.auth to be "requested" or "required" on the Kafka brokers config then you must provide a truststore for the Kafka brokers as well and it should have all the CA certificates that clients' keys were signed by.

In contrast to the keystore in step 1 that stores each machine's own identity, the truststore of a client stores all the certificates that the client should trust. Importing a certificate into one's truststore also means trusting all certificates that are signed by that certificate. As in the analogy above, trusting the government (CA) also means trusting all passports (certificates) that it has issued. This attribute is called the chain of trust, and it is particularly useful when deploying SSL on a large Kafka cluster. You can sign all certificates in the cluster with a single CA, and have all machines share the same truststore that trusts the CA. That way all machines can authenticate all other machines.
3. Signing the certificate

The next step is to sign all certificates generated by step 1 with the CA generated in step 2. First, you need to export the certificate from the keystore:
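The export is done with keytool (this matches step 3 of the combined script below):

keytool -keystore server.keystore.jks -alias localhost -certreq -file cert-file

Then sign it with the CA: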
openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days {validity} -CAcreateserial -passin pass:{ca-password}
Finally, you need to import both the certificate of the CA and the signed certificate into the keystore:
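Both imports use keytool, again matching step 3 of the script below:

keytool -keystore server.keystore.jks -alias CARoot -import -file ca-cert
keytool -keystore server.keystore.jks -alias localhost -import -file cert-signed

The steps above are collected in the following example script: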
#!/bin/bash
#Step 1
keytool -keystore server.keystore.jks -alias localhost -validity 365 -keyalg RSA -genkey
#Step 2
openssl req -new -x509 -keyout ca-key -out ca-cert -days 365
keytool -keystore server.truststore.jks -alias CARoot -import -file ca-cert
keytool -keystore client.truststore.jks -alias CARoot -import -file ca-cert
#Step 3
keytool -keystore server.keystore.jks -alias localhost -certreq -file cert-file
openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days 365 -CAcreateserial -passin pass:{ca-password}
keytool -keystore server.keystore.jks -alias CARoot -import -file ca-cert
keytool -keystore server.keystore.jks -alias localhost -import -file cert-signed
Kafka Brokers support listening for connections on multiple ports. We need to configure the following property in server.properties, which must have one or more comma-separated values:

listeners

If SSL is not enabled for inter-broker communication (see below for how to enable it), both PLAINTEXT and SSL ports will be necessary:

listeners=PLAINTEXT://host.name:port,SSL://host.name:port

The following SSL configs are needed on the broker side:
ssl.keystore.location=/var/private/ssl/server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=test1234

Note: ssl.truststore.password is technically optional but highly recommended. If a password is not set access to the truststore is still available, but integrity checking is disabled. Optional settings that are worth considering:
1. ssl.client.auth=none ("required" => client authentication is required, "requested" => client authentication is requested and a client without certs can still connect. The usage of "requested" is discouraged as it provides a false sense of security and misconfigured clients will still connect successfully.)
2. ssl.cipher.suites (Optional). A cipher suite is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol. (Default is an empty list)
3. ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1 (list out the SSL protocols that you are going to accept from clients. Do note that SSL is deprecated in favor of TLS and using SSL in production is not recommended)
4. ssl.keystore.type=JKS
5. ssl.truststore.type=JKS
6. ssl.secure.random.implementation=SHA1PRNG
If you want to enable SSL for inter-broker communication, add the following to the server.properties file (it defaults to PLAINTEXT):

security.inter.broker.protocol=SSL
Due to import regulations in some countries, the Oracle implementation limits the strength of cryptographic algorithms available by default. If stronger algorithms are needed (for example, AES with 256-bit keys), the JCE Unlimited Strength Jurisdiction Policy Files must be obtained and installed in the JDK/JRE. See the JCA Providers Documentation for more information.

The JRE/JDK will have a default pseudo-random number generator (PRNG) that is used for cryptography operations, so it is not required to configure the implementation used with the ssl.secure.random.implementation setting. However, there are performance issues with some implementations (notably, the default chosen on Linux systems, NativePRNG, utilizes a global lock). In cases where performance of SSL connections becomes an issue, consider explicitly setting the implementation to be used. The SHA1PRNG implementation is non-blocking, and has shown very good performance characteristics under heavy load (50 MB/sec of produced message traffic, plus replication traffic, per-broker).
Once you start the broker you should be able to see the configured SSL endpoint in the server.log. To check quickly if the server keystore and truststore are setup properly you can run the following command:
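One quick check is to open a TLS connection with openssl, adjusting the host and port to your SSL listener; the server's certificate should appear in the output:

openssl s_client -debug -connect localhost:9093 -tls1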
If the certificate does not show up or if there are any other error messages then your keystore is not setup properly.
SSL is supported only for the new Kafka Producer and Consumer, the older API is not supported. The configs for SSL will be the same for both producer and consumer.

If client authentication is not required in the broker, then the following is a minimal configuration example:
security.protocol=SSL
ssl.truststore.location=/var/private/ssl/client.truststore.jks
ssl.truststore.password=test1234
Note: ssl.truststore.password is technically optional but highly recommended. If a password is not set access to the truststore is still available, but integrity checking is disabled. If client authentication is required, then a keystore must be created like in step 1 and the following must also be configured:

ssl.keystore.location=/var/private/ssl/client.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
Other configuration settings that may also be needed depending on your requirements and the broker configuration:

1. ssl.provider (Optional). The name of the security provider used for SSL connections. Default value is the default security provider of the JVM.
2. ssl.cipher.suites (Optional). A cipher suite is a named combination of authentication, encryption, MAC and key exchange algorithm used to negotiate the security settings for a network connection using TLS or SSL network protocol.
3. ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1. It should list at least one of the protocols configured on the broker side.
4. ssl.truststore.type=JKS
5. ssl.keystore.type=JKS
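For example, the console clients can be pointed at a properties file containing the client SSL settings above (file name, broker address and topic are placeholders):

kafka-console-producer.sh --broker-list localhost:9093 --topic test --producer.config client-ssl.properties
kafka-console-consumer.sh --bootstrap-server localhost:9093 --topic test --consumer.config client-ssl.properties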
Kafka uses the Java Authentication and Authorization Service (JAAS) for SASL configuration.

KafkaServer is the section name in the JAAS file used by each KafkaServer/Broker. This section provides SASL configuration options for the broker, including any SASL client connections made by the broker for inter-broker communication.

The Client section is used to authenticate a SASL connection with ZooKeeper. It also allows the brokers to set SASL ACL on ZooKeeper nodes which locks these nodes down so that only the brokers can modify them. It is necessary to have the same principal name across all brokers. If you want to use a section name other than Client, set the system property zookeeper.sasl.clientconfig to the appropriate name (e.g., -Dzookeeper.sasl.clientconfig=ZkClient).

ZooKeeper uses "zookeeper" as the service name by default. If you want to change this, set the system property zookeeper.sasl.client.username to the appropriate name (e.g., -Dzookeeper.sasl.client.username=zk).

Clients may configure JAAS using the client configuration property sasl.jaas.config or using the static JAAS config file similar to brokers.

Clients may specify JAAS configuration as a producer or consumer property without creating a physical configuration file. This mode also enables different producers and consumers within the same JVM to use different credentials by specifying different properties for each client. If both the static JAAS configuration system property java.security.auth.login.config and the client property sasl.jaas.config are specified, the client property will be used.
To configure SASL authentication on the clients using the static JAAS config file:
1. Add a JAAS config file with a client login section named KafkaClient. Configure a login module in KafkaClient for the selected mechanism as described in the examples for setting up GSSAPI (Kerberos), PLAIN or SCRAM. For example, GSSAPI credentials may be configured as:
KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/kafka_client.keytab"
    principal="[email protected]";
};
2. Pass the JAAS config file location as a JVM parameter to each client JVM. For example:
-Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf
SASL may be used with PLAINTEXT or SSL as the transport layer using the security protocol SASL_PLAINTEXT or SASL_SSL respectively. If SASL_SSL is used, then SSL must also be configured.
1. SASL mechanisms
GSSAPI (Kerberos)
PLAIN
SCRAM-SHA-256
SCRAM-SHA-512
1. Configure a SASL port in server.properties, by adding at least one of SASL_PLAINTEXT or SASL_SSL to the listeners parameter, which contains one or more comma-separated values:
listeners=SASL_PLAINTEXT://host.name:port
If you are only configuring a SASL port (or if you want the Kafka brokers to authenticate each other using SASL) then make sure you set the same SASL protocol for inter-broker communication:
security.inter.broker.protocol=SASL_PLAINTEXT (or SASL_SSL)
2. Select one or more supported mechanisms to enable in the broker and follow the steps to configure SASL for the mechanism. To enable multiple mechanisms in the broker, follow the steps here.
SASL authentication is only supported for the new Java Kafka producer and consumer; the older API is not supported.
To configure SASL authentication on the clients, select a SASL mechanism that is enabled in the broker for client authentication and follow the steps to configure SASL for the selected mechanism.
1. Prerequisites
1. Kerberos
If your organization is already using a Kerberos server (for example, by using Active Directory), there is no need to install a new server just for Kafka. Otherwise you will need to install one; your Linux vendor likely has packages for Kerberos and a short guide on how to install and configure it (Ubuntu, Redhat). Note that if you are using Oracle Java, you will need to download JCE policy files for your Java version and copy them to $JAVA_HOME/jre/lib/security.
2. Create Kerberos Principals
If you are using the organization's Kerberos or Active Directory server, ask your Kerberos administrator for a principal for each Kafka broker in your cluster and for every operating system user that will access Kafka with Kerberos authentication (via clients and tools).
If you have installed your own Kerberos, you will need to create these principals yourself (a sketch of the commands follows this list).
3. Make sure all hosts can be reached using hostnames - it is a Kerberos requirement that all your hosts can be resolved with their FQDNs.
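A minimal sketch of the principal-creation commands referenced in item 2 above, assuming a local MIT Kerberos KDC; replace the hostname, realm and keytab path with your own values:
sudo /usr/sbin/kadmin.local -q 'addprinc -randkey kafka/{hostname}@{REALM}'
sudo /usr/sbin/kadmin.local -q "ktadd -k /etc/security/keytabs/{keytabname}.keytab kafka/{hostname}@{REALM}"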
1. Add a suitably modified JAAS file similar to the one below to each Kafka broker's config directory, let's call it kafka_server_jaas.conf for this example (note that each broker should have its own keytab):
KafkaServer {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/kafka_server.keytab"
    principal="kafka/[email protected]";
};

// Zookeeper client authentication
Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    storeKey=true
    keyTab="/etc/security/keytabs/kafka_server.keytab"
    principal="kafka/[email protected]";
};
The KafkaServer section in the JAAS file tells the broker which principal to use and the location of the keytab where this principal is stored. It allows the broker to login using the keytab specified in this section. See the notes above for more details on ZooKeeper SASL configuration.
2. Pass the JAAS and optionally the krb5 file locations as JVM parameters to each Kafka broker (see here for more details):
-Djava.security.krb5.conf=/etc/kafka/krb5.conf
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
3. Make sure the keytabs configured in the JAAS file are readable by the operating system user who is starting the kafka broker.
4. Configure SASL port and SASL mechanisms in server.properties as described here. For example:
listeners=SASL_PLAINTEXT://host.name:port
security.inter.broker.protocol=SASL_PLAINTEXT
sasl.mechanism.inter.broker.protocol=GSSAPI
sasl.enabled.mechanisms=GSSAPI
We must also configure the service name in server.properties, which should match the principal name of the kafka brokers. In the above example, the principal is "kafka/[email protected]", so:
sasl.kerberos.service.name=kafka
On the clients, configure the Kerberos credentials via the sasl.jaas.config property, for example:
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
    useKeyTab=true \
    storeKey=true \
    keyTab="/etc/security/keytabs/kafka_client.keytab" \
    principal="[email protected]";
For command-line utilities like kafka-console-consumer or kafka-console-producer, kinit can be used along with
"useTicketCache=true" as in:
sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
useTicketCache=true;
JAAS configuration for clients may alternatively be specified as a JVM parameter similar to brokers as described here. Clients use the login section named KafkaClient. This option allows only one user for all client connections from a JVM.
2. Make sure the keytabs configured in the JAAS configuration are readable by the operating system user who is starting the kafka client.
3. Optionally pass the krb5 file locations as JVM parameters to each client JVM (see here for more details):
-Djava.security.krb5.conf=/etc/kafka/krb5.conf
SASL/PLAIN is a simple username/password authentication mechanism that is typically used with TLS for encryption to implement secure authentication. Kafka supports a default implementation for SASL/PLAIN which can be extended for production use as described here. The username is used as the authenticated Principal for configuration of ACLs etc.
1. Add a suitably modified JAAS file similar to the one below to each Kafka broker's config directory, let's call it kafka_server_jaas.conf for this example:
KafkaServer {
    org.apache.kafka.common.security.plain.PlainLoginModule required
    username="admin"
    password="admin-secret"
    user_admin="admin-secret"
    user_alice="alice-secret";
};
This configuration defines two users (admin and alice). The properties username and password in the KafkaServer section are used by the broker to initiate connections to other brokers. In this example, admin is the user for inter-broker communication. The set of properties user_userName defines the passwords for all users that connect to the broker, and the broker validates all client connections including those from other brokers using these properties.
2. Pass the JAAS config file location as a JVM parameter to each Kafka broker:
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
3. Configure SASL port and SASL mechanisms in server.properties as described here. For example:
listeners=SASL_SSL://host.name:port
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=PLAIN
sasl.enabled.mechanisms=PLAIN
On the clients, configure the PLAIN login module with the client's username and password via the sasl.jaas.config property:
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
    username="alice" \
    password="alice-secret";
The options username and password are used by clients to configure the user for client connections. In this example, clients connect to the broker as user alice. Different clients within a JVM may connect as different users by specifying different user names and passwords in sasl.jaas.config.
JAAS configuration for clients may alternatively be specified as a JVM parameter similar to brokers as described here. Clients use the login section named KafkaClient. This option allows only one user for all client connections from a JVM.
Configure the following properties in producer.properties or consumer.properties:
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
SASL/PLAIN should be used only with SSL as the transport layer to ensure that clear passwords are not transmitted on the wire without encryption.
The default implementation of SASL/PLAIN in Kafka specifies usernames and passwords in the JAAS configuration file as described here. To avoid storing passwords on disk, you can plug in your own implementation of javax.security.auth.spi.LoginModule that provides usernames and passwords from an external source. The login module implementation should provide username as the public credential and password as the private credential of the Subject. The default implementation org.apache.kafka.common.security.plain.PlainLoginModule can be used as an example.
In production systems, external authentication servers may implement password authentication. Kafka brokers can be integrated with these servers by adding your own implementation of javax.security.sasl.SaslServer. The default implementation included in Kafka in the package org.apache.kafka.common.security.plain can be used as an example to get started.
New providers must be installed and registered in the JVM. Providers can be installed by adding provider classes to the normal CLASSPATH or bundled as a jar file and added to JAVA_HOME/lib/ext.
Providers can be registered statically by adding a provider to the security properties file JAVA_HOME/lib/security/java.security:
security.provider.n=providerClassName
where providerClassName is the fully qualified name of the new provider and n is the preference order with lower numbers indicating higher preference.
Alternatively, you can register providers dynamically at runtime by invoking Security.addProvider at the beginning of the client application or in a static initializer in the login module. For example:
Security.addProvider(new PlainSaslServerProvider());
Salted Challenge Response Authentication Mechanism (SCRAM) is a family of SASL mechanisms that addresses the security concerns with traditional mechanisms that perform username/password authentication like PLAIN and DIGEST-MD5. The mechanism is defined in RFC 5802. Kafka supports SCRAM-SHA-256 and SCRAM-SHA-512 which can be used with TLS to perform secure authentication. The username is used as the authenticated Principal for configuration of ACLs etc. The default SCRAM implementation in Kafka stores SCRAM credentials in Zookeeper and is suitable for use in Kafka installations where Zookeeper is on a private network. Refer to Security Considerations for more details.
The SCRAM implementation in Kafka uses Zookeeper as the credential store. Credentials can be created in Zookeeper using kafka-configs.sh. For each SCRAM mechanism enabled, credentials must be created by adding a config with the mechanism name.
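For example, a sketch that creates SCRAM credentials for user alice, assuming Zookeeper at localhost:2181 (the password and iteration count here are placeholders):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'SCRAM-SHA-256=[iterations=8192,password=alice-secret],SCRAM-SHA-512=[password=alice-secret]' --entity-type users --entity-name alice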
Credentials for inter-broker communication must be created before Kafka brokers are started. Client credentials may be created and updated dynamically, and updated credentials will be used to authenticate new connections.
The default iteration count of 4096 is used if iterations are not specified. A random salt is created and the SCRAM identity consisting of salt, iterations, StoredKey and ServerKey are stored in Zookeeper. See RFC 5802 for details on SCRAM identity and the individual fields.
The following examples also require a user admin for inter-broker communication which can be created using:
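A sketch of that command, again against a local Zookeeper (replace the secret with your own):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'SCRAM-SHA-256=[password=admin-secret],SCRAM-SHA-512=[password=admin-secret]' --entity-type users --entity-name admin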
Credentials may be deleted for one or more SCRAM mechanisms using the --delete option:
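For example, a sketch that deletes the SCRAM-SHA-512 credential for alice:
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --delete-config 'SCRAM-SHA-512' --entity-type users --entity-name alice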
1. Add a suitably modified JAAS file similar to the one below to each Kafka broker's config directory, let's call it kafka_server_jaas.conf for this example:
KafkaServer {
    org.apache.kafka.common.security.scram.ScramLoginModule required
    username="admin"
    password="admin-secret";
};
The properties username and password in the KafkaServer section are used by the broker to initiate connections to other brokers. In this example, admin is the user for inter-broker communication.
2. Pass the JAAS config file location as a JVM parameter to each Kafka broker:
-Djava.security.auth.login.config=/etc/kafka/kafka_server_jaas.conf
3. Configure SASL port and SASL mechanisms in server.properties as described here. For example:
listeners=SASL_SSL://host.name:port
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-256 (or SCRAM-SHA-512)
sasl.enabled.mechanisms=SCRAM-SHA-256 (or SCRAM-SHA-512)
On the clients, configure the SCRAM login module with the client's username and password via the sasl.jaas.config property:
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
    username="alice" \
    password="alice-secret";
The options username and password are used by clients to configure the user for client connections. In this example, clients connect to the broker as user alice. Different clients within a JVM may connect as different users by specifying different user names and passwords in sasl.jaas.config.
JAAS configuration for clients may alternatively be specified as a JVM parameter similar to brokers as described here. Clients use the login section named KafkaClient. This option allows only one user for all client connections from a JVM.
Configure the following properties in producer.properties or consumer.properties:
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256 (or SCRAM-SHA-512)
The default implementation of SASL/SCRAM in Kafka stores SCRAM credentials in Zookeeper. This is suitable for production use in installations where Zookeeper is secure and on a private network.
Kafka supports only the strong hash functions SHA-256 and SHA-512 with a minimum iteration count of 4096. Strong hash functions combined with strong passwords and high iteration counts protect against brute force attacks if Zookeeper security is compromised.
SCRAM should be used only with TLS-encryption to prevent interception of SCRAM exchanges. This protects against dictionary or brute force attacks and against impersonation if Zookeeper is compromised.
The default SASL/SCRAM implementation may be overridden using custom login modules in installations where Zookeeper is not secure. See here for details.
For more details on security considerations, refer to RFC 5802.
1. Specify configuration for the login modules of all enabled mechanisms in the KafkaServer section of the JAAS config file. For example:
KafkaServer {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/etc/security/keytabs/kafka_server.keytab"
principal="kafka/[email protected]";
org.apache.kafka.common.security.plain.PlainLoginModule required
username="admin"
password="admin-secret"
user_admin="admin-secret"
user_alice="alice-secret";
};
2. Enable the SASL mechanisms in server.properties:
sasl.enabled.mechanisms=GSSAPI,PLAIN,SCRAM-SHA-256,SCRAM-SHA-512
3. Specify the SASL security protocol and mechanism for inter-broker communication in server.properties if required:
security.inter.broker.protocol=SASL_PLAINTEXT (or SASL_SSL)
sasl.mechanism.inter.broker.protocol=GSSAPI (or one of the other enabled mechanisms)
4. Follow the mechanism-specific steps in GSSAPI (Kerberos), PLAIN and SCRAM to configure SASL for the enabled mechanisms.
The SASL mechanism can be modified in a running cluster using the following sequence:
1. Enable the new SASL mechanism by adding the mechanism to sasl.enabled.mechanisms in server.properties for each broker. Update the JAAS config file to include both mechanisms as described here. Incrementally bounce the cluster nodes.
2. Restart clients using the new mechanism.
3. To change the mechanism of inter-broker communication (if this is required), set sasl.mechanism.inter.broker.protocol in
server.properties to the new mechanism and incrementally bounce the cluster again.
4. To remove the old mechanism (if this is required), remove the old mechanism from sasl.enabled.mechanisms in server.properties and remove the entries for the old mechanism from the JAAS config file. Incrementally bounce the cluster again.
Kafka ships with a pluggable Authorizer and an out-of-box authorizer implementation that uses zookeeper to store all the acls. Kafka acls are defined in the general format of "Principal P is [Allowed/Denied] Operation O From Host H On Resource R". You can read more about the acl structure in KIP-11. In order to add, remove or list acls you can use the Kafka authorizer CLI. By default, if a Resource R has no associated acls, no one other than super users is allowed to access R. If you want to change that behavior, you can include the following in server.properties:
allow.everyone.if.no.acl.found=true
One can also add super users in server.properties like the following (note that the delimiter is semicolon since SSL user names may contain a comma):
super.users=User:Bob;User:Alice
By default, the SSL user name will be the distinguished name of the client certificate (of the form "CN=...,OU=..."). One can change that by setting a customized PrincipalBuilder in server.properties, for example:
principal.builder.class=CustomizedPrincipalBuilderClass
By default, the SASL user name will be the primary part of the Kerberos principal. One can change that by setting sasl.kerberos.principal.to.local.rules to a customized rule in server.properties. The format of sasl.kerberos.principal.to.local.rules is a list where each rule works in the same way as the auth_to_local in the Kerberos configuration file (krb5.conf). Each rule starts with RULE: and contains an expression in the format [n:string](regexp)s/pattern/replacement/g. See the Kerberos documentation for more details. An example of adding a rule to properly translate [email protected] to user while also keeping the default rule in place is:
sasl.kerberos.principal.to.local.rules=RULE:[1:$1@$0](.*@MYDOMAIN.COM)s/@.*//,DEFAULT
The Kafka Authorization management CLI can be found under the bin directory with all the other CLIs. The CLI script is called kafka-acls.sh. The following lists all the options that the script supports:
OPTION | DESCRIPTION | DEFAULT | OPTION TYPE
--add | Indicates to the script that the user is trying to add an acl. | | Action
--remove | Indicates to the script that the user is trying to remove an acl. | | Action
--list | Indicates to the script that the user is trying to list acls. | | Action
--authorizer | Fully qualified class name of the authorizer. | kafka.security.auth.SimpleAclAuthorizer | Configuration
--authorizer-properties | key=val pairs that will be passed to the authorizer for initialization. For the default authorizer the example values are: zookeeper.connect=localhost:2181 | | Configuration
--cluster | Specifies the cluster as resource. | | Resource
--topic [topic-name] | Specifies the topic as resource. | | Resource
--group [group-name] | Specifies the consumer-group as resource. | | Resource
--allow-principal | Principal in PrincipalType:name format that will be added to the ACL with Allow permission. You can specify multiple --allow-principal options in a single command. | | Principal
--deny-principal | Principal in PrincipalType:name format that will be added to the ACL with Deny permission. You can specify multiple --deny-principal options in a single command. | | Principal
--allow-host | IP address from which the principals listed in --allow-principal will have access. | if --allow-principal is specified, defaults to *, which translates to "all hosts" | Host
--deny-host | IP address from which the principals listed in --deny-principal will be denied access. | if --deny-principal is specified, defaults to *, which translates to "all hosts" | Host
--operation | Operation that will be allowed or denied. Valid values are: Read, Write, Create, Delete, Alter, Describe, ClusterAction, All | All | Operation
--producer | Convenience option to add/remove acls for the producer role. This will generate acls that allow WRITE, DESCRIBE on topic and CREATE on cluster. | | Convenience
--consumer | Convenience option to add/remove acls for the consumer role. This will generate acls that allow READ, DESCRIBE on topic and READ on consumer-group. | | Convenience
--force | Convenience option to assume yes to all queries and do not prompt. | | Convenience
Examples
Adding Acls
Suppose you want to add an acl "Principals User:Bob and User:Alice are allowed to perform Operation Read and Write on Topic Test-topic from IP 198.51.100.0 and IP 198.51.100.1". You can do that by executing the CLI with the following options:
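A sketch of that command, assuming the default authorizer with Zookeeper at localhost:2181:
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Bob --allow-principal User:Alice --allow-host 198.51.100.0 --allow-host 198.51.100.1 --operation Read --operation Write --topic Test-topic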
By default, all principals that don't have an explicit acl that allows access for an operation to a resource are denied. In rare cases where an allow acl is defined that allows access to all but some principal, we will have to use the --deny-principal and --deny-host options. For example, if we want to allow all users to Read from Test-topic but only deny User:BadBob from IP 198.51.100.3, we can do so using the following command:
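For example (same assumptions as above; the wildcards are quoted to avoid shell expansion):
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal "User:*" --allow-host "*" --deny-principal User:BadBob --deny-host 198.51.100.3 --operation Read --topic Test-topic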
Note that --allow-host and --deny-host only support IP addresses (hostnames are not supported). The above examples add acls to a topic by specifying --topic [topic-name] as the resource option. Similarly a user can add acls to the cluster by specifying --cluster and to a consumer group by specifying --group [group-name].
Removing Acls
Removing acls is pretty much the same. The only difference is that instead of the --add option users will have to specify the --remove option. To remove the acls added by the first example above we can execute the CLI with the following options:
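A sketch of the removal, mirroring the add command above:
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --remove --allow-principal User:Bob --allow-principal User:Alice --allow-host 198.51.100.0 --allow-host 198.51.100.1 --operation Read --operation Write --topic Test-topic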
Similarly, to add Alice as a consumer of Test-topic with consumer group Group-1 we just have to pass the --consumer option:
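For example (the --consumer convenience option requires both the topic and the consumer group):
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:Alice --consumer --topic Test-topic --group Group-1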
Note that for the --consumer option we must also specify the consumer group. In order to remove a principal from a producer or consumer role we just need to pass the --remove option.
You can secure a running cluster via one or more of the supported protocols discussed previously. This is done in phases, as illustrated below.
The specific steps for configuring SSL and SASL are described in sections 7.2 and 7.3. Follow these steps to enable security for your desired protocol(s).
The security implementation lets you configure different protocols for both broker-client and broker-broker communication. These must be enabled in separate bounces. A PLAINTEXT port must be left open throughout so brokers and/or clients can continue to communicate.
When performing an incremental bounce, stop the brokers cleanly via a SIGTERM. It's also good practice to wait for restarted replicas to return to the ISR list before moving onto the next node.
As an example, say we wish to encrypt both broker-client and broker-broker communication with SSL. In the first incremental bounce, an SSL port is opened on each node:
listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092
We then restart the clients, changing their con g to point at the newly opened, secured port:
bootstrap.servers = [broker1:9092,...]
security.protocol = SSL
...etc
In the second incremental server bounce we instruct Kafka to use SSL as the broker-broker protocol (which will use the same SSL port):
listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092
security.inter.broker.protocol=SSL
In the final bounce we secure the cluster by closing the PLAINTEXT port:
listeners=SSL://broker1:9092
security.inter.broker.protocol=SSL
Alternatively we might choose to open multiple ports so that different protocols can be used for broker-broker and broker-client communication. Say we wished to use SSL encryption throughout (i.e. for broker-broker and broker-client communication) but we'd like to add SASL authentication to the broker-client connection also. We would achieve this by opening two additional ports during the first bounce:
listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092,SASL_SSL://broker1:9093
We would then restart the clients, changing their config to point at the newly opened, SASL & SSL secured port:
bootstrap.servers = [broker1:9093,...]
security.protocol = SASL_SSL
...etc
The second server bounce would switch the cluster to use encrypted broker-broker communication via the SSL port we previously opened on port 9092:
listeners=PLAINTEXT://broker1:9091,SSL://broker1:9092,SASL_SSL://broker1:9093
security.inter.broker.protocol=SSL
The nal bounce secures the cluster by closing the PLAINTEXT port.
listeners=SSL://broker1:9092,SASL_SSL://broker1:9093
security.inter.broker.protocol=SSL
ZooKeeper can be secured independently of the Kafka cluster. The steps for doing this are covered in section 7.6.2.
7.6.1 New clusters
To enable ZooKeeper authentication on brokers, there are two necessary steps:
1. Create a JAAS login file and set the appropriate system property to point to it as described above
2. Set the configuration property zookeeper.set.acl in each broker to true
The metadata stored in ZooKeeper for the Kafka cluster is world-readable, but can only be modified by the brokers. The rationale behind this decision is that the data stored in ZooKeeper is not sensitive, but inappropriate manipulation of that data can cause cluster disruption. We also recommend limiting the access to ZooKeeper via network segmentation (only brokers and some admin tools need access to ZooKeeper if the new Java consumer and producer clients are used).
7.6.2 Migrating clusters
If you are running a version of Kafka that does not support security or simply with security disabled, and you want to make the cluster secure, then you need to execute the following steps to enable ZooKeeper authentication with minimal disruption to your operations:
1. Perform a rolling restart setting the JAAS login file, which enables brokers to authenticate. At the end of the rolling restart, brokers are able to manipulate znodes with strict ACLs, but they will not create znodes with those ACLs
2. Perform a second rolling restart of brokers, this time setting the configuration parameter zookeeper.set.acl to true, which enables the use of secure ACLs when creating znodes
3. Execute the ZkSecurityMigrator tool. To execute the tool, there is this script: ./bin/zookeeper-security-migration.sh with zookeeper.acl set to secure. This tool traverses the corresponding sub-trees changing the ACLs of the znodes
It is also possible to turn off authentication in a secure cluster. To do it, follow these steps:
1. Perform a rolling restart of brokers setting the JAAS login file, which enables brokers to authenticate, but setting zookeeper.set.acl to false. At the end of the rolling restart, brokers stop creating znodes with secure ACLs, but are still able to authenticate and manipulate all znodes
2. Execute the ZkSecurityMigrator tool. To execute the tool, run this script ./bin/zookeeper-security-migration.sh with zookeeper.acl set to unsecure. This tool traverses the corresponding sub-trees changing the ACLs of the znodes
3. Perform a second rolling restart of brokers, this time omitting the system property that sets the JAAS login file
To see the full list of parameters of the migration tool, run:
./bin/zookeeper-security-migration.sh --help
It is also necessary to enable authentication on the ZooKeeper ensemble. To do it, we need to perform a rolling restart of the servers setting a few properties. Please refer to the ZooKeeper documentation for more detail.
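For reference, a hedged sketch of the server-side properties typically added to each ZooKeeper node's configuration to enable SASL authentication; treat these as illustrative and confirm them against the ZooKeeper documentation for your version:
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
jaasLoginRenew=3600000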
8. KAFKA CONNECT
8.1 Overview
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export job can deliver data from Kafka topics into secondary storage and query systems or into batch systems for offline analysis.
Kafka Connect features include:
A common framework for Kafka connectors - Kafka Connect standardizes integration of other data systems with Kafka, simplifying connector development, deployment, and management
Distributed and standalone modes - scale up to a large, centrally managed service supporting an entire organization or scale down to development, testing, and small production deployments
REST interface - submit and manage connectors to your Kafka Connect cluster via an easy to use REST API
Automatic offset management - with just a little information from connectors, Kafka Connect can manage the offset commit process automatically so connector developers do not need to worry about this error prone part of connector development
Distributed and scalable by default - Kafka Connect builds on the existing group management protocol. More workers can be added to scale up a Kafka Connect cluster.
Streaming/batch integration - leveraging Kafka's existing capabilities, Kafka Connect is an ideal solution for bridging streaming and batch data systems
The quickstart provides a brief example of how to run a standalone version of Kafka Connect. This section describes how to configure, run, and manage Kafka Connect in more detail.
Kafka Connect currently supports two modes of execution: standalone (single process) and distributed.
In standalone mode all work is performed in a single process. This configuration is simpler to set up and get started with and may be useful in situations where only one worker makes sense (e.g. collecting log files), but it does not benefit from some of the features of Kafka Connect such as fault tolerance. You can start a standalone process with the following command:
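A sketch of the command, using the example worker configuration shipped with Kafka (the connector property files are whatever connectors you want to run):
bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]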
The first parameter is the configuration for the worker. This includes settings such as the Kafka connection parameters, serialization format, and how frequently to commit offsets. The provided example should work well with a local cluster running with the default configuration provided by config/server.properties. It will require tweaking to use with a different configuration or production deployment. All workers (both standalone and distributed) require a few configs:
key.converter - Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the keys in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro.
value.converter - Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the values in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro.
The parameters that are configured here are intended for producers and consumers used by Kafka Connect to access the configuration, offset and status topics. For configuration of Kafka source and Kafka sink tasks, the same parameters can be used but need to be prefixed with consumer. and producer. respectively. The only parameter that is inherited from the worker configuration is bootstrap.servers, which in most cases will be sufficient, since the same cluster is often used for all purposes. A notable exception is a secured cluster, which requires extra parameters to allow connections. These parameters will need to be set up to three times in the worker configuration, once for management access, once for Kafka sinks and once for Kafka sources.
The remaining parameters are connector configuration files. You may include as many as you want, but all will execute within the same process (on different threads).
Distributed mode handles automatic balancing of work, allows you to scale up (or down) dynamically, and offers fault tolerance both in the active tasks and for configuration and offset commit data. Execution is very similar to standalone mode:
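For example, using the sample worker configuration shipped with Kafka:
bin/connect-distributed.sh config/connect-distributed.properties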
The difference is in the class which is started and the configuration parameters which change how the Kafka Connect process decides where to store configurations, how to assign work, and where to store offsets and task statuses. In the distributed mode, Kafka Connect stores the offsets, configs and task statuses in Kafka topics. It is recommended to manually create the topics for offsets, configs and statuses in order to achieve the desired number of partitions and replication factors. If the topics are not yet created when starting Kafka Connect, the topics will be auto created with the default number of partitions and replication factor, which may not be best suited for its usage.
In particular, the following configuration parameters, in addition to the common settings mentioned above, are critical to set before starting your cluster:
group.id (default connect-cluster) - unique name for the cluster, used in forming the Connect cluster group; note that this must not conflict with consumer group IDs
config.storage.topic (default connect-configs) - topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated, compacted topic. You may need to manually create the topic to ensure the correct configuration as auto created topics may have multiple partitions or be automatically configured for deletion rather than compaction
offset.storage.topic (default connect-offsets) - topic to use for storing offsets; this topic should have many partitions, be replicated, and be configured for compaction
status.storage.topic (default connect-status) - topic to use for storing statuses; this topic can have multiple partitions, and should be replicated and configured for compaction
Note that in distributed mode the connector configurations are not passed on the command line. Instead, use the REST API described below to create, modify, and destroy connectors.
Connector configurations are simple key-value mappings. For standalone mode these are defined in a properties file and passed to the Connect process on the command line. In distributed mode, they will be included in the JSON payload for the request that creates (or modifies) the connector.
Most configurations are connector dependent, so they can't be outlined here. However, there are a few common options:
name - Unique name for the connector. Attempting to register again with the same name will fail.
connector.class - The Java class for the connector
tasks.max - The maximum number of tasks that should be created for this connector. The connector may create fewer tasks if it cannot achieve this level of parallelism.
key.converter - (optional) Override the default key converter set by the worker.
value.converter - (optional) Override the default value converter set by the worker.
The connector.class config supports several formats: the full name or alias of the class for this connector. If the connector is org.apache.kafka.connect.file.FileStreamSinkConnector, you can either specify this full name or use FileStreamSink or FileStreamSinkConnector to make the configuration a bit shorter.
Sink connectors also have one additional option to control their input:
topics - A list of topics to use as input for this connector
For any other options, you should consult the documentation for the connector.
Transformations
Connectors can be configured with transformations to make lightweight message-at-a-time modifications. They can be convenient for data massaging and event routing.
A transformation chain can be specified in the connector configuration:
transforms - List of aliases for the transformations, specifying the order in which the transformations will be applied.
transforms.$alias.type - Fully qualified class name for the transformation.
transforms.$alias.$transformationSpecificConfig - Configuration properties for the transformation
For example, let's take the built-in file source connector and use a transformation to add a static field.
Throughout the example we'll use the schemaless JSON data format. To use the schemaless format, we changed the following two lines in connect-standalone.properties from true to false:
key.converter.schemas.enable
value.converter.schemas.enable
The file source connector reads each line as a String. We will wrap each line in a Map and then add a second field to identify the origin of the event. To do this, we use two transformations:
HoistField to place the input line inside a Map
InsertField to add the static field. In this example we'll indicate the record came from a file connector
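A sketch of the resulting connector properties file; the connector name, file and topic are illustrative, and the transformation aliases MakeMap and InsertSource match the discussion below:
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
transforms=MakeMap, InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source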
All the lines starting with transforms were added for the transformations. You can see the two transformations we created: "InsertSource" and "MakeMap" are aliases that we chose to give the transformations. The transformation types are based on the list of built-in transformations you can see below. Each transformation type has additional configuration: HoistField requires a configuration called "field", which is the name of the field in the map that will include the original String from the file. The InsertField transformation lets us specify the field name and the value that we are adding.
When we ran the file source connector on my sample file without the transformations, and then read them using kafka-console-consumer.sh, the results were:
"foo"
"bar"
"hello world"
We then create a new file connector, this time after adding the transformations to the configuration file. This time, the results will be:
{"line":"foo","data_source":"test-file-source"}
{"line":"bar","data_source":"test-file-source"}
{"line":"hello world","data_source":"test-file-source"}
You can see that the lines we've read are now part of a JSON map, and there is an extra field with the static value we specified. This is just one example of what you can do with transformations.
Several widely-applicable data and routing transformations are included with Kafka Connect:
org.apache.kafka.connect.transforms.InsertField
Insert field(s) using attributes from the record metadata or a configured static value.
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.InsertField$Key) or value (org.apache.kafka.connect.transforms.InsertField$Value).
org.apache.kafka.connect.transforms.ReplaceField
Filter or rename fields.
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.ReplaceField$Key) or value (org.apache.kafka.connect.transforms.ReplaceField$Value).
org.apache.kafka.connect.transforms.MaskField
Mask specified fields with a valid null value for the field type (i.e. 0, false, empty string, and so on).
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.MaskField$Key) or value (org.apache.kafka.connect.transforms.MaskField$Value).
org.apache.kafka.connect.transforms.ValueToKey
Replace the record key with a new key formed from a subset of fields in the record value.
org.apache.kafka.connect.transforms.HoistField
Wrap data using the specified field name in a Struct when schema present, or a Map in the case of schemaless data.
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.HoistField$Key) or value (org.apache.kafka.connect.transforms.HoistField$Value).
field - Field name for the single field that will be created in the resulting Struct or Map. (type: string, importance: medium)
org.apache.kafka.connect.transforms.ExtractField
Extract the specified field from a Struct when schema present, or a Map in the case of schemaless data. Any null values are passed through unmodified.
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.ExtractField$Key) or value (org.apache.kafka.connect.transforms.ExtractField$Value).
org.apache.kafka.connect.transforms.SetSchemaMetadata
Set the schema name, version or both on the record's key (org.apache.kafka.connect.transforms.SetSchemaMetadata$Key) or value (org.apache.kafka.connect.transforms.SetSchemaMetadata$Value) schema.
org.apache.kafka.connect.transforms.TimestampRouter
Update the record's topic field as a function of the original topic value and the record timestamp.
This is mainly useful for sink connectors, since the topic field is often used to determine the equivalent entity name in the destination system (e.g. database table or search index name).
org.apache.kafka.connect.transforms.RegexRouter
Update the record topic using the configured regular expression and replacement string.
Under the hood, the regex is compiled to a java.util.regex.Pattern . If the pattern matches the input topic,
java.util.regex.Matcher#replaceFirst() is used with the replacement string to obtain the new topic.
regex - Regular expression to use for matching. (type: string, valid values: valid regex, importance: high)
replacement - Replacement string. (type: string, importance: high)
org.apache.kafka.connect.transforms.Flatten
Flatten a nested data structure, generating names for each field by concatenating the field names at each level with a configurable delimiter character. Applies to Struct when schema present, or a Map in the case of schemaless data. The default delimiter is '.'.
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.Flatten$Key) or value (org.apache.kafka.connect.transforms.Flatten$Value).
org.apache.kafka.connect.transforms.Cast
Cast fields or the entire key or value to a specific type, e.g. to force an integer field to a smaller width. Only simple primitive types are supported - integers, floats, boolean, and string.
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.Cast$Key) or value (org.apache.kafka.connect.transforms.Cast$Value).
spec - List of fields and the type to cast them to, of the form field1:type,field2:type to cast fields of Maps or Structs, or a single type to cast the entire value. Valid types are int8, int16, int32, int64, float32, float64, boolean, and string. (type: list, valid values: list of colon-delimited pairs, e.g. foo:bar,abc:xyz, importance: high)
org.apache.kafka.connect.transforms.TimestampConverter
Convert timestamps between different formats such as Unix epoch, strings, and Connect Date/Timestamp types. Applies to individual fields or to the entire value.
Use the concrete transformation type designed for the record key (org.apache.kafka.connect.transforms.TimestampConverter$Key) or value (org.apache.kafka.connect.transforms.TimestampConverter$Value).
REST API
Since Kafka Connect is intended to be run as a service, it also provides a REST API for managing connectors. By default, this service runs on port 8083. The following are the currently supported endpoints:
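A non-exhaustive sketch of the commonly used endpoints (consult the REST API reference for the authoritative list):
GET /connectors - return a list of active connectors
POST /connectors - create a new connector; the request body should be a JSON object containing a string name field and an object config field with the connector configuration parameters
GET /connectors/{name} - get information about a specific connector
GET /connectors/{name}/config - get the configuration parameters for a specific connector
PUT /connectors/{name}/config - update the configuration parameters for a specific connector
GET /connectors/{name}/status - get current status of the connector
GET /connectors/{name}/tasks - get a list of tasks currently running for a connector
GET /connectors/{name}/tasks/{taskid}/status - get current status of the task
PUT /connectors/{name}/pause - pause the connector and its tasks
PUT /connectors/{name}/resume - resume a paused connector
POST /connectors/{name}/restart - restart a connector
POST /connectors/{name}/tasks/{taskid}/restart - restart an individual task
DELETE /connectors/{name} - delete a connector, halting all tasks and deleting its configuration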
Kafka Connect also provides a REST API for getting information about connector plugins:
GET /connector-plugins - return a list of connector plugins installed in the Kafka Connect cluster. Note that the API only checks for connectors on the worker that handles the request, which means you may see inconsistent results, especially during a rolling upgrade if you add new connector jars
PUT /connector-plugins/{connector-type}/config/validate - validate the provided configuration values against the configuration definition. This API performs per config validation, returns suggested values and error messages during validation.
This guide describes how developers can write new connectors for Kafka Connect to move data between Kafka and other systems. It briefly reviews a few key concepts and then describes how to create a simple connector.
To copy data between Kafka and another system, users create a Connector for the system they want to pull data from or push data to. Connectors come in two flavors: SourceConnectors import data from another system (e.g. JDBCSourceConnector would import a relational database into Kafka) and SinkConnectors export data (e.g. HDFSSinkConnector would export the contents of a Kafka topic to an HDFS file).
Connectors do not perform any data copying themselves: their configuration describes the data to be copied, and the Connector is responsible for breaking that job into a set of Tasks that can be distributed to workers. These Tasks also come in two corresponding flavors: SourceTask and SinkTask.
With an assignment in hand, each Task must copy its subset of the data to or from Kafka. In Kafka Connect, it should always be possible to frame these assignments as a set of input and output streams consisting of records with consistent schemas. Sometimes this mapping is obvious: each file in a set of log files can be considered a stream with each parsed line forming a record using the same schema and offsets stored as byte offsets in the file. In other cases it may require more effort to map to this model: a JDBC connector can map each table to a stream, but the offset is less clear. One possible mapping uses a timestamp column to generate queries incrementally returning new data, and the last queried timestamp can be used as the offset.
Each stream should be a sequence of key-value records. Both the keys and values can have complex structure -- many primitive types are provided, but arrays, objects, and nested data structures can be represented as well. The runtime data format does not assume any particular serialization format; this conversion is handled internally by the framework.
In addition to the key and value, records (both those generated by sources and those delivered to sinks) have associated stream IDs and offsets. These are used by the framework to periodically commit the offsets of data that have been processed so that in the event of failures, processing can resume from the last committed offsets, avoiding unnecessary reprocessing and duplication of events.
Dynamic Connectors
Not all jobs are static, so Connector implementations are also responsible for monitoring the external system for any changes that might require reconfiguration. For example, in the JDBCSourceConnector example, the Connector might assign a set of tables to each Task. When a new table is created, it must discover this so it can assign the new table to one of the Tasks by updating its configuration. When it notices a change that requires reconfiguration (or a change in the number of Tasks), it notifies the framework and the framework updates any corresponding Tasks.
Developing a connector only requires implementing two interfaces, the Connector and Task. A simple example is included with the source code for Kafka in the file package. This connector is meant for use in standalone mode and has implementations of a SourceConnector/SourceTask to read each line of a file and emit it as a record and a SinkConnector/SinkTask that writes each record to a file.
The rest of this section will walk through some code to demonstrate the key steps in creating a connector, but developers should also refer to the full example source code as many details are omitted for brevity.
Connector Example
We'll cover the SourceConnector as a simple example. SinkConnector implementations are very similar. Start by creating the class that inherits from SourceConnector and add a couple of fields that will store parsed configuration information (the filename to read from and the topic to send data to):
The easiest method to fill in is taskClass(), which defines the class that should be instantiated in worker processes to actually read the data:
@Override
public Class<? extends Task> taskClass() {
    return FileStreamSourceTask.class;
}
We will define the FileStreamSourceTask class below. Next, we add some standard lifecycle methods, start() and stop():
@Override
public void start(Map<String, String> props) {
    // The complete version includes error handling as well.
    filename = props.get(FILE_CONFIG);
    topic = props.get(TOPIC_CONFIG);
}

@Override
public void stop() {
    // Nothing to do since no background monitoring is required.
}
Finally, the real core of the implementation is in taskConfigs(). In this case we are only handling a single file, so even though we may be permitted to generate more tasks as per the maxTasks argument, we return a list with only one entry:
@Override
public List<Map<String, String>> taskConfigs(int maxTasks) {
    ArrayList<Map<String, String>> configs = new ArrayList<>();
    // Only one input stream makes sense.
    Map<String, String> config = new HashMap<>();
    if (filename != null)
        config.put(FILE_CONFIG, filename);
    config.put(TOPIC_CONFIG, topic);
    configs.add(config);
    return configs;
}
Although not used in the example, SourceTask also provides two APIs to commit offsets in the source system: commit and commitRecord. The APIs are provided for source systems which have an acknowledgement mechanism for messages. Overriding these methods allows the source connector to acknowledge messages in the source system, either in bulk or individually, once they have been written to Kafka. The commit API stores the offsets in the source system, up to the offsets that have been returned by poll. The implementation of this API should block until the commit is complete. The commitRecord API saves the offset in the source system for each SourceRecord after it is written to Kafka. As Kafka Connect will record offsets automatically, SourceTasks are not required to implement them. In cases where a connector does need to acknowledge messages in the source system, only one of the APIs is typically required.
Even with multiple tasks, this method implementation is usually pretty simple. It just has to determine the number of input tasks, which may require contacting the remote service it is pulling data from, and then divvy them up. Because some patterns for splitting work among tasks are so common, some utilities are provided in ConnectorUtils to simplify these cases.
Note that this simple example does not include dynamic input. See the discussion in the next section for how to trigger updates to task configurations.
Next we'll describe the implementation of the corresponding SourceTask. The implementation is short, but too long to cover completely in this guide. We'll use pseudo-code to describe most of the implementation, but you can refer to the source code for the full example.
Just as with the connector, we need to create a class inheriting from the appropriate base Task class. It also has some standard lifecycle methods:
These are slightly simplified versions, but show that these methods should be relatively simple and the only work they should perform is allocating or freeing resources. There are two points to note about this implementation. First, the start() method does not yet handle resuming from a previous offset, which will be addressed in a later section. Second, the stop() method is synchronized. This will be necessary because SourceTasks are given a dedicated thread which they can block indefinitely, so they need to be stopped with a call from a different thread in the Worker.
Next, we implement the main functionality of the task, the poll() method which gets events from the input system and returns a List<SourceRecord>:
@Override
public List<SourceRecord> poll() throws InterruptedException {
    try {
        ArrayList<SourceRecord> records = new ArrayList<>();
        while (streamValid(stream) && records.isEmpty()) {
            LineAndOffset line = readToNextLine(stream);
            if (line != null) {
                Map<String, Object> sourcePartition = Collections.singletonMap("filename", filename);
                Map<String, Object> sourceOffset = Collections.singletonMap("position", streamOffset);
                records.add(new SourceRecord(sourcePartition, sourceOffset, topic, Schema.STRING_SCHEMA, line));
            } else {
                Thread.sleep(1);
            }
        }
        return records;
    } catch (IOException e) {
        // Underlying stream was killed, probably as a result of calling stop. Allow to return
        // null, and driving thread will handle any shutdown if necessary.
    }
    return null;
}
Again, we've omitted some details, but we can see the important steps: the poll() method is going to be called repeatedly, and for each call it will loop trying to read records from the file. For each line it reads, it also tracks the file offset. It uses this information to create an output SourceRecord with four pieces of information: the source partition (there is only one, the single file being read), source offset (byte offset in the file), output topic name, and output value (the line, and we include a schema indicating this value will always be a string). Other variants of the SourceRecord constructor can also include a specific output partition and a key.
Note that this implementation uses the normal Java InputStream interface and may sleep if data is not available. This is acceptable because Kafka Connect provides each task with a dedicated thread. While task implementations have to conform to the basic poll() interface, they have a lot of flexibility in how they are implemented. In this case, an NIO-based implementation would be more efficient, but this simple approach works, is quick to implement, and is compatible with older versions of Java.
Sink Tasks
The previous section described how to implement a simple SourceTask. Unlike SourceConnector and SinkConnector, SourceTask and SinkTask have very different interfaces because SourceTask uses a pull interface and SinkTask uses a push interface. Both share the common lifecycle methods, but the SinkTask interface is quite different:
The SinkTask documentation contains full details, but this interface is nearly as simple as the SourceTask. The put() method should contain most of the implementation, accepting sets of SinkRecords, performing any required translation, and storing them in the destination system. This method does not need to ensure the data has been fully written to the destination system before returning. In fact, in many cases internal buffering will be useful so an entire batch of records can be sent at once, reducing the overhead of inserting events into the downstream data store. The SinkRecords contain essentially the same information as SourceRecords: Kafka topic, partition, offset and the key and value.
The flush() method is used during the offset commit process, which allows tasks to recover from failures and resume from a safe point such that no events will be missed. The method should push any outstanding data to the destination system and then block until the write has been acknowledged. The offsets parameter can often be ignored, but is useful in some cases where implementations want to store offset information in the destination store to provide exactly-once delivery. For example, an HDFS connector could do this and use atomic move operations to make sure the flush() operation atomically commits the data and offsets to a final location in HDFS.
The SourceTask implementation included a stream ID (the input filename) and offset (position in the file) with each record. The framework uses this to commit offsets periodically so that in the case of a failure, the task can recover and minimize the number of events that are reprocessed and possibly duplicated (or to resume from the most recent offset if Kafka Connect was stopped gracefully, e.g. in standalone mode or due to a job reconfiguration). This commit process is completely automated by the framework, but only the connector knows how to seek back to the right position in the input stream to resume from that location.
To correctly resume upon startup, the task can use the SourceContext passed into its initialize() method to access the offset data. In initialize(), we would add a bit more code to read the offset (if it exists) and seek to that position:
Of course, you might need to read many keys for each of the input streams. The OffsetStorageReader interface also allows you to issue bulk reads to efficiently load all offsets, then apply them by seeking each input stream to the appropriate position.
Kafka Connect is intended to define bulk data copying jobs, such as copying an entire database rather than creating many jobs to copy each table individually. One consequence of this design is that the set of input or output streams for a connector can vary over time.
Source connectors need to monitor the source system for changes, e.g. table additions/deletions in a database. When they pick up changes, they should notify the framework via the ConnectorContext object that reconfiguration is necessary. For example, in a SourceConnector:
if (inputsChanged())
    this.context.requestTaskReconfiguration();
The framework will promptly request new configuration information and update the tasks, allowing them to gracefully commit their progress before reconfiguring them. Note that in the SourceConnector this monitoring is currently left up to the connector implementation. If an extra thread is required to perform this monitoring, the connector must allocate it itself.
Ideally this code for monitoring changes would be isolated to the Connector and tasks would not need to worry about them. However, changes can also affect tasks, most commonly when one of their input streams is destroyed in the input system, e.g. if a table is dropped from a database. If the Task encounters the issue before the Connector, which will be common if the Connector needs to poll for changes, the Task will need to handle the subsequent error. Thankfully, this can usually be handled simply by catching and handling the appropriate exception.
SinkConnectors usually only have to handle the addition of streams, which may translate to new entries in their outputs (e.g., a new database table). The framework manages any changes to the Kafka input, such as when the set of input topics changes because of a regex subscription. SinkTasks should expect new input streams, which may require creating new resources in the downstream system, such as a new table in a database. The trickiest situation to handle in these cases may be conflicts between multiple SinkTasks seeing a new input stream for the first time and simultaneously trying to create the new resource. SinkConnectors, on the other hand, will generally require no special code for handling a dynamic set of streams.
Kafka Connect allows you to validate connector configurations before submitting a connector to be executed and can provide feedback about errors and recommended values. To take advantage of this, connector developers need to provide an implementation of config() to expose the configuration definition to the framework.
The following code in FileStreamSourceConnector defines the configuration and exposes it to the framework.
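A sketch of that definition (the exact constants and documentation strings in the shipped connector may differ):

private static final ConfigDef CONFIG_DEF = new ConfigDef()
    .define(FILE_CONFIG, ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Source filename.")
    .define(TOPIC_CONFIG, ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "The topic to publish data to.");

@Override
public ConfigDef config() {
    return CONFIG_DEF;
}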
The ConfigDef class is used for specifying the set of expected configurations. For each configuration, you can specify the name, the type, the default value, the documentation, the group information, the order in the group, the width of the configuration value and the name suitable for display in the UI. Plus, you can provide special validation logic used for single configuration validation by overriding the Validator class. Moreover, there may be dependencies between configurations; for example, the valid values and visibility of a configuration may change according to the values of other configurations. To handle this, ConfigDef allows you to specify the dependents of a configuration and to provide an implementation of Recommender to get valid values and set visibility of a configuration given the current configuration values.
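Putting those pieces together, a hypothetical database connector might define one setting with a Validator and a dependent setting whose valid values come from a Recommender. This is only a hedged sketch: the setting names and the listTablesFor() helper (assumed to return a List<Object> of table names) are made up for illustration.

ConfigDef configDef = new ConfigDef()
    .define("db.type", ConfigDef.Type.STRING, "mysql",
            ConfigDef.ValidString.in("mysql", "postgres"), ConfigDef.Importance.HIGH,
            "Type of database to connect to.", "Database", 1, ConfigDef.Width.SHORT,
            "Database Type", Arrays.asList("db.table"), null)
    .define("db.table", ConfigDef.Type.STRING, ConfigDef.NO_DEFAULT_VALUE, null,
            ConfigDef.Importance.HIGH, "Table to copy.", "Database", 2,
            ConfigDef.Width.MEDIUM, "Table", Collections.<String>emptyList(),
            new ConfigDef.Recommender() {
                @Override
                public List<Object> validValues(String name, Map<String, Object> parsedConfig) {
                    // Valid tables depend on the currently configured database type.
                    return listTablesFor((String) parsedConfig.get("db.type"));
                }
                @Override
                public boolean visible(String name, Map<String, Object> parsedConfig) {
                    return parsedConfig.get("db.type") != null;
                }
            });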
Also, the validate() method in Connector provides a default validation implementation which returns a list of allowed configurations together with configuration errors and recommended values for each configuration. However, it does not use the recommended values for configuration validation. You may provide an override of the default implementation for customized configuration validation, which may use the recommended values.
The FileStream connectors are good examples because they are simple, but they also have trivially structured data -- each line is just a string. Almost all practical connectors will need schemas with more complex data formats.
To create more complex data, you'll need to work with the Kafka Connect data API. Most structured records will need to interact with two classes in addition to primitive types: Schema and Struct.
The API documentation provides a complete reference, but here is a simple example creating a Schema and Struct:
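The following is a hedged example along those lines; the schema name, field names and values are arbitrary, and the classes live in org.apache.kafka.connect.data.

Schema schema = SchemaBuilder.struct().name("com.example.Person")
    .field("name", Schema.STRING_SCHEMA)
    .field("age", Schema.INT32_SCHEMA)
    .field("admin", SchemaBuilder.bool().defaultValue(false).build())
    .build();

Struct struct = new Struct(schema)
    .put("name", "Barbara Liskov")
    .put("age", 75);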
If you are implementing a source connector, you'll need to decide when and how to create schemas. Where possible, you should avoid recomputing them as much as possible. For example, if your connector is guaranteed to have a fixed schema, create it statically and reuse a single instance.
However, many connectors will have dynamic schemas. One simple example of this is a database connector. Considering even just a single table, the schema will not be predefined for the entire connector (as it varies from table to table). But it also may not be fixed for a single table over the lifetime of the connector, since the user may execute an ALTER TABLE command. The connector must be able to detect these changes and react appropriately.
Sink connectors are usually simpler because they are consuming data and therefore do not need to create schemas. However, they should take just as much care to validate that the schemas they receive have the expected format. When the schema does not match -- usually indicating the upstream producer is generating invalid data that cannot be correctly translated to the destination system -- sink connectors should throw an exception to indicate this error to the system.
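For example, a SinkTask might check incoming records against the schema its destination table was created with and raise org.apache.kafka.connect.errors.DataException on a mismatch. A hedged sketch, where expectedSchema is a hypothetical field of the task:

private void checkValueSchema(SinkRecord record) {
    Schema valueSchema = record.valueSchema();
    if (valueSchema == null || !valueSchema.equals(expectedSchema))
        // DataException signals unrecoverable bad data to the Connect framework.
        throw new DataException("Record value schema " + valueSchema
                + " does not match destination schema " + expectedSchema);
}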
Kafka Connect's REST layer provides a set of APIs to enable administration of the cluster. This includes APIs to view the configuration of connectors and the status of their tasks, as well as to alter their current behavior (e.g. changing configuration and restarting tasks).
When a connector is first submitted to the cluster, the workers rebalance the full set of connectors in the cluster and their tasks so that each worker has approximately the same amount of work. This same rebalancing procedure is also used when connectors increase or decrease the number of tasks they require, or when a connector's configuration is changed. You can use the REST API to view the current status of a connector and its tasks, including the id of the worker to which each was assigned. For example, querying the status of a file source (using GET /connectors/file-source/status) might produce output like the following:
{
  "name": "file-source",
  "connector": {
    "state": "RUNNING",
    "worker_id": "192.168.1.208:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "192.168.1.209:8083"
    }
  ]
}
Connectors and their tasks publish status updates to a shared topic (configured with status.storage.topic) which all workers in the cluster monitor. Because the workers consume this topic asynchronously, there is typically a (short) delay before a state change is visible through the status API. The following states are possible for a connector or one of its tasks:
UNASSIGNED: The connector/task has not yet been assigned to a worker.
RUNNING: The connector/task is running.
PAUSED: The connector/task has been administratively paused.
FAILED: The connector/task has failed (usually by raising an exception, which is reported in the status output).
In most cases, connector and task states will match, though they may be different for short periods of time when changes are occurring or if tasks have failed. For example, when a connector is first started, there may be a noticeable delay before the connector and its tasks have all transitioned to the RUNNING state. States will also diverge when tasks fail, since Connect does not automatically restart failed tasks. To restart a connector/task manually, you can use the restart APIs listed above. Note that if you try to restart a task while a rebalance is taking place, Connect will return a 409 (Conflict) status code. You can retry after the rebalance completes, but it might not be necessary since rebalances effectively restart all the connectors and tasks in the cluster.
It's sometimes useful to temporarily stop the message processing of a connector. For example, if the remote system is undergoing maintenance, it would be preferable for source connectors to stop polling it for new data instead of filling logs with exception spam. For this use case, Connect offers a pause/resume API. While a source connector is paused, Connect will stop polling it for additional records. While a sink connector is paused, Connect will stop pushing new messages to it. The pause state is persistent, so even if you restart the cluster, the connector will not begin message processing again until it has been resumed. Note that there may be a delay before all of a connector's tasks have transitioned to the PAUSED state, since it may take time for them to finish whatever processing they were in the middle of when being paused. Additionally, failed tasks will not transition to the PAUSED state until they have been restarted.
9. KAFKA STREAMS
Kafka Streams is a client library for processing and analyzing data stored in Kafka. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, exactly-once processing semantics and simple yet efficient management of application state.
Kafka Streams has a low barrier to entry: you can quickly write and run a small-scale proof-of-concept on a single machine, and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka's parallelism model.