[ whitepaper - may 2024 ]

Navigating Phases of
Operational Maturity with
Kafka Streams.
Table of Contents

Introduction
Overview: The Four Phases of Operational Maturity
Phase I: Experimentation
Phase II: Pre-Production
Phase III: Production
Phase IV: Scale
Conclusion

Introduction
Many companies have benefited from the commoditization of message brokers such as
Apache Kafka, which make realtime data available across the silos of an engineering
organization. To use that data effectively, companies deploy architectures built on
event-driven applications. These applications power a wide variety of solutions, from
mission-critical business logic to ETL pipelines that feed downstream analytics technologies.

One of the most popular open source event-driven application frameworks, if not the most
popular, is Apache Kafka Streams (Kafka Streams). It owes its success to many characteristics,
some of which are listed below:

• It has a powerful API that handles complicated semantics such as exactly-once event
processing out-of-the-box.
• It can be deployed as a library without external dependencies aside from a Kafka broker.
• It integrates with existing CI/CD, tooling and monitoring.
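
As a minimal sketch of what "out-of-the-box" means for exactly-once processing, the snippet below builds a configuration that sets the Kafka Streams `processing.guarantee` property to `exactly_once_v2` (the value used by Kafka Streams 3.0 and later). It relies only on `java.util.Properties`; the class name, method name, application id and broker address are hypothetical placeholders chosen for illustration.

```java
import java.util.Properties;

class ExactlyOnceConfig {
    // Builds a minimal Kafka Streams configuration with exactly-once enabled.
    // The application id and bootstrap servers are supplied by the caller.
    static Properties exactlyOnceProperties(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Switches the processing guarantee from the default (at-least-once)
        // to exactly-once.
        props.setProperty("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical values for a local experiment.
        Properties props = exactlyOnceProperties("demo-app", "localhost:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In an actual application these properties would be passed to the `KafkaStreams` constructor together with the topology; no other code changes are assumed here.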

For an in-depth analysis of the technologies that power event-driven applications, refer to
Foundations for Stateful Event-Driven Applications (available on Responsive’s resource page).

The simplicity of Kafka Streams’ adoption leads engineering teams to believe that simple code
will translate to simple production operations: reality often diverges from this expectation.

This whitepaper describes the typical adoption journey of Kafka Streams within an
organization and the challenges faced at each phase, and provides some best practices to
ensure smooth transitions as your company’s adoption of Kafka Streams matures.

© 2024 Responsive Computing, Inc. | responsive.dev 2


Overview: The Four Phases of Operational Maturity

When companies adopt any technology they typically see four phases of operational maturity:

• Experimentation: The first phase, experimentation, is largely driven by developers and plays
a large part in their emotional attachment to the technology. This phase covers everything
from product discovery and playing around with a quickstart to making first attempts to
implement business logic within the confines of the framework.
• Pre-Production: The second phase, pre-production, covers the effort to go from an initial
implementation to a pre-production environment. Security review, performance and
correctness testing, and any cross-component integration testing generally happen during
this phase.
• Production: The third phase begins when an engineering team’s solution starts serving
production traffic. At this point, teams begin to invest more in supporting functionality
like CI/CD, observability, alerting and operational tooling.
• Scale: The fourth phase is typified by early production success that results in scaling in
one of two directions: the application itself handles more traffic, or the framework is
picked up by other parts of the organization. To be successful at scale, many companies have
spent years building custom solutions that cater to their needs.

Each of these stages introduces a new set of challenges that affect wider parts of the
organization with increasing severity. Kafka Streams in particular excels at the earliest
phases of adoption and requires more dedicated expertise to handle the later phases.

The remaining sections describe typical roadblocks faced by engineering teams at each
phase and highlight solutions to those roadblocks. They also highlight the work done by
leading companies that have been successful at the later stages of the adoption journey.

Phase I: Experimentation

The Open Source Software (OSS) version of Kafka Streams excels during the experimentation
phase.

Unlike many other frameworks that require a centralized orchestration system to execute jobs,
Kafka Streams applications run just like any other Java application (and can even run
embedded in a microservice that completes other business logic, though this is not
recommended). As a result, developers who try Kafka Streams on their local machine can
quickly stand up a quickstart application and modify it to implement their desired business
logic.

Beyond the deployment model, the API is expressive and familiar: it mirrors the Java `Stream`
API, something Java developers frequently use in standard applications. They are familiar
with concepts like `filter` and `map`, and engineers with a background in data
processing will be familiar with `join` and `aggregate` functions. Much of the complicated nature
of realtime event processing is hidden elegantly behind this powerfully simple API.
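
To illustrate the familiarity described above, here is a small word-count sketch written against the plain Java `Stream` API only, with no Kafka dependency. The class and method names are hypothetical. The Kafka Streams DSL offers operators with similar names and roles on `KStream` (for example `flatMapValues`, `filter`, `groupBy` and `count`), but this block should be read as an analogy rather than as Kafka Streams code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamApiAnalogy {
    // Splits each line into lower-cased words, drops empty tokens, and counts
    // occurrences per word. A TreeMap keeps the result sorted by word.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for a stream of events.
        List<String> lines = List.of("hello kafka", "Hello streams");
        System.out.println(countWords(lines));
    }
}
```

The main difference in a streaming setting is that the input never ends, so the counts are continuously updated state rather than a one-off result.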

These two characteristics (the deployment model and the simple API) drive the initial success
that developers achieve when experimenting with Kafka Streams.

The challenges faced during this stage are at the level of an individual developer and can be
solved with quality online resources:


• Deploying a Kafka Broker: There are various good Docker Compose based setups that can spin
up a local broker. See kafka-stack-docker-compose as an example.
• Generating Sample Data: There are a few existing open source and paid tools for generating
sample data. The Kafka console producer is good for manual testing, while the Datagen
Connector (kafka-connect-datagen) and shadowtraffic.io allow generating constant streams of
sample data.
• Understanding Time Semantics: See educational resources like
https://kafka.apache.org/0110/documentation/streams/core-concepts#streams_time and
https://www.youtube.com/watch?v=QHBkbDKFnIM to get a better understanding of how time plays
into Kafka Streams applications.

Phase II: Pre-Production

The pre-production phase of operational maturity starts to involve more of the organization,
particularly to ensure that the design of the new component meets engineering and security
nonfunctional requirements:

• Functional Correctness Testing
• Performance Testing & Cost Considerations
• Security Review

This phase is still primarily driven by an individual engineering team within an organization, but
will require occasional collaboration with other technical teams such as the Information
Security team.

Kafka Streams as a technology is set up to succeed in most aspects of the pre-production
phase. Notably, if an organization has deployed any other Java applications, it is
already set up to estimate the costs. Similarly, security is generally not a concern since all
the data remains within the organization’s environmental boundaries.

The challenges faced usually lie in estimating how many resources will be necessary and in
developing reliable functional correctness tests.


• Estimating Resources: Sizing a Kafka Streams deployment is more of an art than a science.
Generally it’s recommended to start with the smallest possible cluster size and to slowly
increase the allocation of resources until the cluster is in a stable state. For a full guide
on sizing applications see https://www.responsive.dev/blog/a-size-for-every-stream
• Functional Correctness Testing: The premier way to test Kafka Streams applications is to
use the built-in TopologyTestDriver. This allows you to generate sample input and verify the
specific output.
• Performance Testing: Testing performance can be challenging with Kafka Streams. Many
organizations deploy dedicated pre-production environments that “shadow” a percentage of
traffic from production. Using this traffic, it is possible to deploy a smaller Kafka Streams
cluster with identical code to estimate performance characteristics.
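
One way to picture "shadowing a percentage of traffic" is as a deterministic sampling decision per record key. The sketch below is an assumption about how such a decision could be made, not a description of any particular company's setup: the class name, the bucket count of 100 and the use of `String.hashCode` are all hypothetical choices for illustration. Sampling by key (rather than at random per record) means every record for a sampled key reaches the shadow environment, which matters for stateful operations such as aggregations and joins.

```java
class ShadowTrafficSampler {
    private final int percent;

    // percent is the share of keys (0-100) that should be copied to the
    // shadow environment.
    ShadowTrafficSampler(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        this.percent = percent;
    }

    // Maps the key to one of 100 buckets and shadows the lowest `percent`
    // buckets. floorMod keeps the bucket non-negative for negative hash codes.
    // Assumption: String.hashCode is good enough for a sketch; a real setup
    // would likely want a better-distributed hash function.
    boolean shouldShadow(String key) {
        int bucket = Math.floorMod(key.hashCode(), 100);
        return bucket < percent;
    }

    public static void main(String[] args) {
        ShadowTrafficSampler sampler = new ShadowTrafficSampler(10);
        int shadowed = 0;
        for (int i = 0; i < 10_000; i++) {
            if (sampler.shouldShadow("key-" + i)) {
                shadowed++;
            }
        }
        System.out.println("Shadowed " + shadowed + " of 10000 hypothetical keys");
    }
}
```

The sketch covers only the routing decision; how the selected records are copied into the pre-production cluster is outside its scope.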

There are public references to the work that companies have invested in making the
pre-production phase successful. For example, Bloomberg gave a talk in 2022 about their
integration testing setup for Kafka Streams:
https://www.confluent.io/events/current-2022/verifying-apache-kafka-based-data-pipelines/
In this talk they cover their usage of test containers to automate testing and their custom
framework for testing producer/consumer interactions driven by Kafka Streams.

Phase III: Production

The production phase is characterized by successfully processing production events. While
this milestone often marks an ending, it’s more accurately described as a beginning: the real
challenges with Kafka Streams start at this point. At this phase, the project tends to have
business impact and visibility at higher levels in the organization. Outages are monitored by
engineering directors or VPs.

As with other OSS, the freely available components are solid core technology without the
auxiliary functions that are required to successfully operate in production. Take Java as a
comparison: the core technology, the JVM, is a powerful starting point. In order to
successfully run it in production, however, an entire ecosystem is required: there are metrics
and observability solutions, performance profilers, development tooling, build systems, and more.

Kafka Streams is no exception here. Once deployed in production, engineering teams realize
that they must develop the expertise to sift through hundreds of metrics and identify which
are necessary for alerting and which are necessary for debugging problems, build tooling to
reset offsets, write runbooks for triaging and handling incidents, and more.

This is also where Responsive begins to provide significant value over Open Source Kafka
Streams:

1. Responsive provides a robust architecture that decreases the complexity of what
application teams operate.
2. When there are issues, organizations have a team of experts with over twenty years of
experience a phone call away.
3. Organizations don’t need to spend time building observability and management tooling that
would otherwise occupy years of engineering time.

The production phase problems are mostly related to availability and developer productivity:


• Observability: There are some open source resources that can help with setting up
dashboards. kafka-streams-dashboards is an excellent starting point. Responsive provides
observability out of the box, and selects the most important Kafka Streams metrics for
monitoring the health of an application.
• Support: When things go wrong, it can be challenging to figure out what to do. The best
way to get support for free is the Responsive Discord or the Confluent Community Slack
#kafka-streams channel.
• Management & Tooling: Most companies build in-house solutions to the various problems they
face. Common operations include resetting offsets, scaling up or down, fixing imbalanced
partition assignment and debugging the contents of state stores. Many of these do not have
existing solutions in open source.


As with pre-production improvements, various companies have invested significantly in
streamlining production operations:

• Michelin developed Kstreamplify (https://blogit.michelin.io/kstreamplify/) to improve error
handling in Kafka Streams with features such as Dead Letter Queues, property injection for
unified configuration across their streams deployments and more.
• Wix developed Greyhound (https://github.com/wix/greyhound) to provide a higher-level
interface to Kafka and to express richer semantics such as parallel message handling or retry
policies with ease.

Phase IV: Scale

The final phase of a Kafka Streams adoption journey is scaling. This covers both scaling the
technical demands on a single application and scaling to many applications across the
organization. The challenges here are often handled at the highest level of an engineering
organization and may have wider-reaching implications for budgeting as maintenance and
infrastructure costs grow.

Kafka Streams is particularly difficult to scale. At Responsive, we’ve seen many companies
deal with extended outages (particularly relating to rebalance loops and extended state
restoration) as they scale to state store sizes that exceed 4-8GB per partition. Our core insight
is that Kafka Streams is a distributed database in disguise: just like other distributed
databases, Kafka Streams manages replication, partitioning and distribution of significant
amounts of RocksDB state across multiple nodes. This is compounded by the co-location of
compute (the stream processing) and storage.
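
The per-partition figure above lends itself to a quick back-of-the-envelope check. The sketch below is a hedged illustration only: the 4GB threshold is the lower bound of the range quoted in the text, while the total state size, partition count and restore throughput are hypothetical inputs. The restore-time estimate assumes restoration is purely throughput-bound and ignores rebalances, so it should be read as a rough lower bound rather than a prediction.

```java
class StateSizeEstimate {
    // Lower bound of the 4-8GB per-partition range mentioned in the text.
    static final double RISK_THRESHOLD_GB = 4.0;

    // Average state per partition, assuming state is evenly distributed.
    static double gbPerPartition(double totalStateGb, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        return totalStateGb / partitions;
    }

    // True when the average per-partition state exceeds the threshold.
    static boolean exceedsThreshold(double totalStateGb, int partitions) {
        return gbPerPartition(totalStateGb, partitions) > RISK_THRESHOLD_GB;
    }

    // Hours needed to restore stateGb at a sustained rate of mbPerSecond
    // (using 1GB = 1024MB). Assumption: throughput is the only bottleneck.
    static double restoreHours(double stateGb, double mbPerSecond) {
        if (mbPerSecond <= 0) {
            throw new IllegalArgumentException("mbPerSecond must be positive");
        }
        return stateGb * 1024.0 / mbPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        double totalStateGb = 1024.0; // hypothetical: 1TB of total state
        int partitions = 64;          // hypothetical partition count
        double mbPerSecond = 50.0;    // hypothetical restore throughput

        System.out.printf("GB per partition: %.1f%n", gbPerPartition(totalStateGb, partitions));
        System.out.printf("Exceeds threshold: %b%n", exceedsThreshold(totalStateGb, partitions));
        System.out.printf("Full restore, hours: %.1f%n", restoreHours(totalStateGb, mbPerSecond));
    }
}
```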

While Kafka Streams does an adequate job of hiding this complexity from developers early in
their adoption cycle, at scale these challenges grow super-linearly. From a technical
perspective, what appear as rough edges at low scale can cause significant outage times at
high scale. One customer observed rebalance loops with 1TB of state that would take over 30
hours to fully resolve, and much of that time was spent with total degradation of the service.

From an organizational perspective, each application team builds up its own expertise in
managing and operating its applications, which leads to significant inefficiency in
engineering resource utilization. More often than not, one or two engineers emerge as the
subject matter experts in Kafka Streams and the entire organization is bottlenecked on their
ability to triage and remediate incidents. This creates additional risk for organizations, as
they constantly fear the churn of these key experts.

To summarize, the main challenges of the scale phase are:


• Availability: Significant state exacerbates rough edges that are papered over at smaller
scale. Many companies running RocksDB-based Kafka Streams at a scale of >4GB per partition
face multiple-hour outages. To our knowledge, there are no commercially available solutions
outside of Responsive that fix this problem at the core.
• Organizational Strain: Typically one or two engineers emerge as the subject matter experts
in Kafka Streams. This bottlenecks organizations on their ability to triage and remediate
incidents and introduces risk if these experts churn.

To succeed at significant technical and organizational scale, companies have invested
thousands of engineering hours in home-grown solutions. Some publicly referenced solutions
are listed below:

• Walmart augmented their Kafka Streams architecture with a custom Cassandra-backed
realtime pipeline to make it scale to their technical needs
(https://www.confluent.io/blog/walmart-real-time-inventory-management-using-kafka/)
• Bloomberg deploys their applications with 2TB of RAM and 160 CPUs to handle the scale
they need (https://dl.acm.org/doi/pdf/10.1145/3448016.3457556)

Conclusion
This document covered the four different phases of Kafka Streams operational maturity:
experimentation, pre-production, production and scale. Open Source Kafka Streams excels out
of the box at the first two stages, but requires significant investment in the latter two.

Responsive closes this gap by providing a solution to the most pressing problems at all stages
of operational maturity, particularly in the production and scale phases.
