Navigating Phases of Operational Maturity with Kafka Streams
Table of Contents
Introduction
Overview: The Four Phases of Operational Maturity
Phase I: Experimentation
Phase II: Pre-Production
Phase III: Production
Phase IV: Scale
Conclusion
Introduction
Many companies have benefited from the commoditization of message brokers such as
Apache Kafka, which makes real-time data available across the silos of an engineering
organization. To use that data effectively, companies build architectures around event-driven
applications. These applications power a wide variety of solutions, from mission-critical
business logic to ETL pipelines that feed downstream analytics technologies.
One of the most popular open source event-driven application frameworks is Kafka Streams.
It owes its success to many characteristics, including:
• A powerful API that handles complicated semantics, such as exactly-once event
processing, out of the box.
• Deployment as a plain library, with no external dependencies beyond a Kafka broker.
• Integration with existing CI/CD, tooling, and monitoring.
The ease of adopting Kafka Streams leads engineering teams to believe that simple code
will translate into simple production operations. Reality often diverges from this expectation.
This whitepaper describes the typical adoption journey of Kafka Streams within an
organization, highlights the challenges faced at each phase, and provides best practices to
ensure smooth transitions as your company's adoption of Kafka Streams matures.
Overview: The Four Phases of Operational Maturity

When companies adopt any technology, they typically pass through four phases of operational maturity:
Experimentation: The first phase, experimentation, is largely driven by developers and plays a
large part in the emotional attachment to the technology. This phase covers everything from
product discovery and playing around with a quickstart to making first attempts to implement
business logic within the confines of the framework.

Pre-Production: The second phase, pre-production, covers the effort to go from an initial
implementation to a pre-production environment. Generally, security review, performance and
correctness testing, and any cross-component integration testing happen during this phase.

Production: The third phase begins when an engineering team's solution starts serving
production traffic. At this point, teams begin to invest more in supporting functionality like
CI/CD, observability, alerting, and operational tooling.

Scale: The fourth phase is typified by early production success that results in scaling in one of
two directions: the application itself handles more traffic, or the framework is picked up by
other parts of the organization. To be successful at scale, many companies have spent years
building custom solutions that cater to their needs.
Each of these stages introduces a new set of challenges that affect increasingly wide parts of
the organization with increasing severity. Kafka Streams in particular excels at the earliest
phases of adoption and requires more dedicated expertise to handle the later phases.
The remaining sections describe the typical roadblocks faced by teams at each phase and
highlight solutions to those roadblocks. They also highlight work done by leading companies
that have been successful at the later stages of the adoption journey.
Phase I: Experimentation

The Open Source Software (OSS) version of Kafka Streams excels during the experimentation
phase.
Unlike many other frameworks that require a centralized orchestration system to execute jobs,
Kafka Streams applications run just like any other Java application (and can even run
embedded in a microservice that performs other business logic, though this is not
recommended). As a result, developers who try out Kafka Streams on their local machines can
quickly stand up a quickstart application and modify it to implement their desired business
logic.
Beyond the deployment model, the API is expressive and familiar: it mirrors the Java `Stream`
API, something Java developers frequently use in standard applications. They are familiar
with concepts like `filter` and `map`, and any engineer with a background in data
processing will recognize the `join` and `aggregate` functions. Much of the complicated nature
of real-time event processing is hidden elegantly behind this powerfully simple API.
These two characteristics (the deployment model and the simple API) drive the initial success
that developers achieve when experimenting with Kafka Streams.
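To make this concrete, the sketch below shows a minimal Kafka Streams application using the API operations mentioned above. The topic names ("orders", "orders-normalized") and configuration values are hypothetical placeholders, not part of any real quickstart.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class OrderNormalizerApp {

    // Build the processing topology: read, filter, transform, write.
    public static Topology buildTopology() {
        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null && !value.isEmpty())
               .mapValues(value -> value.toUpperCase())
               .to("orders-normalized", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }

    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-normalizer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        final KafkaStreams streams = new KafkaStreams(buildTopology(), props);
        // Close cleanly on SIGTERM so offsets and state are flushed.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        streams.start();
    }
}
```

Note that the topology can be built and inspected without a running broker; only `streams.start()` requires one, which is part of what makes local experimentation so frictionless.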
The challenges faced during this stage are at the level of an individual developer and can be
solved with quality online resources:
Deploying a Kafka Broker: There are various good Docker Compose based setups that can spin
up a local broker. See kafka-stack-docker-compose as an example.

Generating Sample Data: There are a few existing open source and paid tools for generating
sample data. The kafka-console-producer is good for manual testing, while the Datagen
Connector (kafka-connect-datagen) and shadowtraffic.io can generate constant streams of
sample data.

Understanding Time Semantics: See educational resources like
https://kafka.apache.org/0110/documentation/streams/core-concepts#streams_time and
https://www.youtube.com/watch?v=QHBkbDKFnIM to better understand how time plays into
Kafka Streams applications.
Phase II: Pre-Production

The pre-production phase of operational maturity starts to involve more of the organization,
particularly to ensure that the design of the new component meets engineering and security
non-functional requirements. This phase is still primarily driven by an individual engineering
team, but it will require occasional collaboration with other technical teams, such as the
Information Security team.
The challenges faced often lie in the difficulty of estimating how many resources will be
necessary and of developing reliable functional correctness tests.
Estimating Resources: Sizing a Kafka Streams deployment is more of an art than a science.
Generally, it's recommended to start with the smallest possible cluster size and slowly
increase the allocation of resources until the cluster is in a stable state. For a full guide on
sizing applications, see https://www.responsive.dev/blog/a-size-for-every-stream.

Functional Correctness Testing: The premier way to test Kafka Streams applications is to use
the built-in TopologyTestDriver, which allows you to generate sample input and verify the
specific output.

Performance Testing: Testing performance can be challenging with Kafka Streams. Many
organizations deploy dedicated pre-production environments that "shadow" a percentage of
traffic from production. Using this traffic, it is possible to deploy a smaller Kafka Streams
cluster with identical code to estimate performance characteristics.
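As an illustration of the TopologyTestDriver approach, the sketch below pipes a single record through a trivial uppercasing topology and reads back the result. It assumes the kafka-streams-test-utils dependency is on the classpath, and the topic names are hypothetical.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TestOutputTopic;
import org.apache.kafka.streams.TopologyTestDriver;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class UppercaseTopologyTest {

    // Run one record through the topology and return the output value.
    public static String runSingleRecord(final String value) {
        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(v -> v.toUpperCase())
               .to("output", Produced.with(Serdes.String(), Serdes.String()));

        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topology-test");
        // The test driver never connects to this address.
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

        try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
            final TestInputTopic<String, String> in = driver.createInputTopic(
                "input", Serdes.String().serializer(), Serdes.String().serializer());
            final TestOutputTopic<String, String> out = driver.createOutputTopic(
                "output", Serdes.String().deserializer(), Serdes.String().deserializer());
            in.pipeInput("key", value);
            return out.readValue();
        }
    }

    public static void main(final String[] args) {
        assert runSingleRecord("hello").equals("HELLO");
    }
}
```

Because the driver executes the topology synchronously in-process, tests like this run in milliseconds with no broker, which makes them well suited to CI.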
Phase III: Production

As with other OSS, the freely available components are a solid core technology without the
auxiliary functions required to operate successfully in production. Consider the analogy of
Java: the core technology, the JVM, is a powerful starting point, but running it in production
requires an entire ecosystem of metrics and observability solutions, performance profilers,
development tooling, build systems, and more.
Kafka Streams is no exception. Once an application is deployed in production, engineering
teams realize that they must develop expertise to sift through hundreds of metrics to identify
which are necessary for alerting and which are necessary to debug problems, build tooling to
reset offsets, write runbooks for triage and incident handling, and more.
This is also where Responsive begins to provide significant value over open source Kafka
Streams. The problems of the production phase are mostly related to availability and
developer productivity:
Observability: There are some open source resources that can help with setting up
dashboards; kafka-streams-dashboards is an excellent starting point. Responsive provides
observability out of the box and selects the most important Kafka Streams metrics for
monitoring the health of an application.

Support: When things go wrong, it can be challenging to figure out what to do. The best ways
to get support for free are the Responsive Discord and the Confluent Community Slack
#kafka-streams channel.

Management & Tooling: Most companies build in-house solutions to the various problems they
face. Common operations include resetting offsets, scaling up/down, fixing imbalanced
partition assignments, and debugging the contents of state stores.
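One common in-house observability pattern is a small helper that filters the map returned by `KafkaStreams#metrics()` down to a hand-picked subset for alerting. The sketch below is illustrative only: the metric names shown are real Kafka Streams metric names, but which ones matter most varies by application.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class MetricsReporter {

    // A hand-picked subset of the hundreds of metrics Kafka Streams exposes.
    private static final Set<String> KEY_METRICS =
        Set.of("records-lag-max", "commit-latency-avg", "process-rate");

    // Reduce a full metrics map (e.g. from KafkaStreams#metrics())
    // to just the values worth shipping to an alerting system.
    public static Map<String, Object> selectKeyMetrics(
            final Map<MetricName, ? extends Metric> metrics) {
        final Map<String, Object> selected = new HashMap<>();
        for (final Map.Entry<MetricName, ? extends Metric> entry : metrics.entrySet()) {
            if (KEY_METRICS.contains(entry.getKey().name())) {
                selected.put(entry.getKey().name(), entry.getValue().metricValue());
            }
        }
        return selected;
    }
}
```

In practice a helper like this would be invoked periodically, e.g. `selectKeyMetrics(streams.metrics())` from a scheduled reporter thread, with the results pushed to whatever monitoring system the team already runs.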
Phase IV: Scale

The final phase of a Kafka Streams adoption journey is scaling. This covers both scaling the
technical demands on a single application and scaling to many applications across the
organization. The challenges here are often handled at the highest level of an engineering
organization and may have wide-reaching implications for budgeting as maintenance and
infrastructure costs grow.
Kafka Streams is particularly difficult to scale. At Responsive, we've seen many companies
deal with extended outages (particularly relating to rebalance loops and extended state
restoration) as they scale to state store sizes that exceed 4-8 GB per partition. Our core insight
is that Kafka Streams is a distributed database in disguise: just like other distributed
databases, Kafka Streams manages the replication, partitioning, and distribution of significant
amounts of RocksDB state across multiple nodes. This is compounded by the colocation of
compute (the stream processing) and storage.
Availability: Significant state exacerbates rough edges that are papered over at smaller scale.
Many companies running RocksDB-based Kafka Streams at a scale of >4 GB per partition face
multiple-hour outages.
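Configuration can mitigate (though not eliminate) the restoration problem. The sketch below shows the relevant Kafka Streams settings; the specific values are illustrative starting points, not recommendations.

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class HighAvailabilityConfig {

    public static Properties availabilityProps() {
        final Properties props = new Properties();
        // Keep a warm copy of each state store on another node so that
        // failover does not require a full restore from the changelog.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // Allow extra "warmup" replicas so active tasks only move to a
        // new node once its state has caught up.
        props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 2);
        // Treat a client as "caught up" if it is within this many records
        // of the end of the changelog.
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L);
        return props;
    }
}
```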
To succeed at significant technical and organizational scale, companies have invested
thousands of engineering hours in home-grown solutions. Some publicly referenced solutions
are listed below:
Conclusion
This document covered the four phases of Kafka Streams operational maturity:
experimentation, pre-production, production, and scale. Open source Kafka Streams excels out
of the box at the first two stages but requires significant investment in the latter two.