Kafka Streams
For Developers
using
Java/SpringBoot
Dilip Sundarraj
About Me
• Dilip
Source Code
Thank You!
Prerequisites
• Java 17
• Docker
[Diagram: Source Connector and Sink Connector in the Kafka ecosystem]
Kafka Streams
Introduction to Kafka Streams API
• Kafka Streams is a Java library which primarily focuses on:
• Data Enrichment
• Transformation of Data
• Aggregating the data
• Joining data from multiple Kafka topics
• The Kafka Streams API uses a functional programming style:
• Lambdas
• map, filter, flatMap operators similar to Modern Java features
• The Kafka Streams API is built on top of Java 8
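As a quick taste of that functional style, here is a minimal sketch of a Streams DSL topology. The topic names, application id, and broker address are illustrative assumptions, not part of the course setup.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class StreamsDslSketch {

    public static void main(String[] args) {
        var streamsBuilder = new StreamsBuilder();

        streamsBuilder
            .stream("input-topic", Consumed.with(Serdes.String(), Serdes.String())) // source processor
            .filter((key, value) -> value != null && !value.isBlank())              // drop empty events
            .mapValues(value -> value.toUpperCase())                                // transform each event
            .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));   // sink processor

        var props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-dsl-sketch");       // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");        // assumed broker address

        var kafkaStreams = new KafkaStreams(streamsBuilder.build(), props);
        kafkaStreams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(kafkaStreams::close));
    }
}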
Data Flow in a Kafka Streams App
[Diagram: 1. the Kafka Streams App reads records from the Kafka Broker → 2. aggregates, transforms, joins, etc. → 3. writes the results back to the Kafka Broker]
Behind the scenes, the Streams API uses the Producer and Consumer APIs to read from and write back to Kafka topics.
How is this different from Consumer API ?
• Consumer applications built using Consumer API are stateless.
• Kafka Consumer App Data Flow :
• Read the event
• Process the event
• Move on to the next event
• Consumer apps are great for notifications or for processing each event independently of the others.
• Consumer apps do not have an easy way to join or aggregate events.
Streams API use cases
• Stateful and Stateless applications
• Stateless applications are similar to what we build using the Consumer API
• Stateful Operations:
• Retail:
• Calculating the total number of orders in realtime
• Calculating the total revenue made in realtime
• Entertainment
• Total number of tickets sold in realtime
• Total revenue generated by a movie in realtime
Streams API Implementations
• Streams DSL
• This is a high level API
• Use operators to build your stream processing logic
• Processor API
• This is a low level API
• Complex compared to Streams DSL
• Streams DSL is built on top of Processor API
Kafka Streams Terminologies
Topology
&
Processor
Topology & Processors
• A Kafka Streams processing application has a series of processors.
[Diagram: Source Topic → Source Processor → Stream Processors → Sink Processors, organized into Sub-topology 1 and Sub-topology 2]
How Does the Data Flow in a Kafka Streams Topology?
• At any given point of time, only one record gets processed in the topology.
[Diagram: records C, B, A flowing one at a time through Source Processor → Sink Processor]
Introduction to KStreams API
[Diagram: Topology with Stream Processor → Sink Processor]
KStream
• KStream is an abstraction in Kafka Streams which holds each event in the
Kafka topic
[Diagram: Topic (key, value) records flowing into the KStream]
• KStream gives you access to all the records in the Kafka topic.
• Any new event is made available to the KStream.
• KStream can be called a record stream or a log.
• KStream treats each event independent of one another.
• Each event will be executed by the whole topology.
• The record stream is infinite.
• Analogy: inserts into a DB table.
Greeting Kafka Streams App
[Diagram: greetings topic ("Good Morning!") → Greeting Kafka Streams App → greetings-uppercase topic ("GOOD MORNING!")]
Prerequisites for Greetings App
• Kafka environment
• Topology
filter
• This operator is used to drop records in the Kafka stream that do not meet a certain criteria.
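A small sketch of the filter operator, assuming the greetings KStream built earlier in this section (the minimum-length criteria is just an example):

// Keep only greetings that are non-null and longer than 5 characters;
// everything else is dropped from the stream.
var filteredGreetings = greetingsStream
        .filter((key, value) -> value != null && value.length() > 5);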
[Diagram: the value "Apple" exploded into the individual records A, P, P, L, E]
[Diagram: Topic1 → KStream1 and Topic2 → KStream2 combined into one stream and written to CombinedTopic]
Serialization/Deserialization
in
Kafka Streams API
Serdes
• Serdes is the factory class in Kafka Streams that takes care of handling
serialization and deserialization of key and value.
Consumed.with(Serdes.String(), Serdes.String())
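Besides passing Serdes explicitly through Consumed.with(...), a default key/value Serde can also be set once for the whole application. A small sketch, assuming a props object like the one used when starting the app:

// Default Serdes used when an operator does not specify them explicitly
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class);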
Enhanced Greeting Message
{
  "message": "Good Morning",
  "timeStamp": "2022-12-04T05:26:31.060293"
}
What’s needed to build a Custom Serde ?
• Serializer
• Deserializer
• Serde that holds the Serializer and Deserializer
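A minimal sketch of those three pieces for the enhanced greeting message, using Jackson (with the JavaTimeModule for LocalDateTime). The Greeting record and its field names are assumptions based on the JSON shown above.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.Serializer;

import java.time.LocalDateTime;

// Assumed domain type matching the JSON above
record Greeting(String message, LocalDateTime timeStamp) { }

class GreetingSerializer implements Serializer<Greeting> {
    private final ObjectMapper objectMapper =
            new ObjectMapper().registerModule(new JavaTimeModule());

    @Override
    public byte[] serialize(String topic, Greeting greeting) {
        try {
            return objectMapper.writeValueAsBytes(greeting); // Greeting -> JSON bytes
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

class GreetingDeserializer implements Deserializer<Greeting> {
    private final ObjectMapper objectMapper =
            new ObjectMapper().registerModule(new JavaTimeModule());

    @Override
    public Greeting deserialize(String topic, byte[] data) {
        try {
            return data == null ? null : objectMapper.readValue(data, Greeting.class); // JSON bytes -> Greeting
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

// The Serde simply wraps the Serializer and Deserializer
class GreetingSerdeFactory {
    static Serde<Greeting> greetingSerde() {
        return Serdes.serdeFrom(new GreetingSerializer(), new GreetingDeserializer());
    }
}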
Overview of the retail App
[Diagram: ABCMarket retail app — general and restaurant orders]
Data Model for the Order
General Order
{
  "orderId": 12345,
  "locationId": "store_1234",
  "finalAmount": 27.00,
  "orderType": "GENERAL",
  "orderLineItems": [
    {
      "item": "Bananas",
      "count": 2,
      "amount": 2.00
    },
    {
      "item": "Iphone Charger",
      "count": 1,
      "amount": 25.00
    }
  ],
  "orderedDateTime": "2022-12-05T08:55:27"
}

Restaurant Order
{
  "orderId": 12345,
  "locationId": "store_1234",
  "finalAmount": 15.00,
  "orderType": "RESTAURANT",
  "orderLineItems": [
    {
      "item": "Pizza",
      "count": 2,
      "amount": 12.00
    },
    {
      "item": "Coffee",
      "count": 1,
      "amount": 3.00
    }
  ],
  "orderedDateTime": "2022-12-05T08:55:27"
}
New Business requirements for our app
• Split the restaurant and general orders and publish them into different topics (as sketched below).
• Publish just the transaction amount of these two types of orders.
• Transform the Order type to a Revenue type.
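One way to sketch the first requirement with the split operator. The Order type, its orderType accessor, the destination topic names, the orderSerde, and the existing ordersStream are all assumptions made for illustration.

Map<String, KStream<String, Order>> branches = ordersStream
        .split(Named.as("orders-"))
        .branch((key, order) -> "RESTAURANT".equals(order.orderType()),
                Branched.as("restaurant"))
        .defaultBranch(Branched.as("general"));

// "orders-" prefix + branch name gives the map keys
branches.get("orders-restaurant")
        .to("restaurant-orders", Produced.with(Serdes.String(), orderSerde));
branches.get("orders-general")
        .to("general-orders", Produced.with(Serdes.String(), orderSerde));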
Kafka Streams - Internals
Kafka Streams - Topology, Task & Threads
• A Topology in Kafka Streams is a data pipeline, or template, for how the data is going to flow and be handled in our Kafka Streams application.
[Diagram: the Topology is broken down into "Tasks", which are executed by Stream Threads]
Tasks in Kafka Streams
• A Task is basically the unit of work in a Kafka Streams application
• For Example, If the source topic has four partitions then we will have four
tasks created behind the scenes for the Kafka Streams application
• You can modify the number of stream threads based on how many tasks you want to execute in parallel.
Kafka Streams - Default Stream Threads
[Diagram: Topic with partitions P1–P4 → Tasks T1–T4 → Kafka Streams App with num.stream.threads=1 → a single Stream Thread executes T1, T2, T3, T4]
Kafka Streams - Parallelism - Approach 1
[Diagram: Topic with partitions P1–P4 → Tasks T1–T4 → Kafka Streams App with num.stream.threads=4 → four Stream Threads, one task each]
Kafka Streams - Parallelism - Approach 1
[Diagram: Topic with partitions P1–P4 → Tasks T1–T4 → Kafka Streams App with num.stream.threads=2 → two Stream Threads, each executing two tasks]
Kafka Streams - Parallelism - Approach 2
[Diagram: Topic with partitions P1–P4 → Tasks T1–T4 distributed across Stream Threads]
What’s the ideal number of Stream Threads ?
• Keep the Stream threads size equal to the number of cores in the machine.
Runtime.getRuntime().availableProcessors()
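A sketch of wiring that recommendation into the Streams configuration (the application id and broker address are illustrative):

var props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-stream-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// One stream thread per available core on this machine
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG,
        Runtime.getRuntime().availableProcessors());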
Kafka Streams & Consumer Groups
• Kafka Streams behind the scenes uses the Consumer API to stream from a topic.
• Tasks are split across multiple instances using the consumer group concept.
• Fault Tolerance: if one instance goes down, then a rebalance will be triggered to redistribute the tasks.
Errors In Kafka Streams
[Diagram: error points in a Kafka Streams App and their handlers]
• Deserialization errors → DeserializationExceptionHandler
• Application errors (Topology) → StreamsUncaughtExceptionHandler
• Serialization/production errors → ProductionExceptionHandler
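A sketch of how these handlers could be wired in, assuming the props and kafkaStreams objects from earlier; the specific handler choices here are illustrative, not the course's recommendation.

// Deserialization errors: log the poison-pill record and keep processing
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
        LogAndContinueExceptionHandler.class);

// Serialization / production errors: the default handler fails the task
props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
        DefaultProductionExceptionHandler.class);

// Uncaught errors in the topology: decide whether to replace the thread or shut down
kafkaStreams.setUncaughtExceptionHandler(exception ->
        StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);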
KTable
• KTable is an abstraction in Kafka Streams which holds latest value for a given
key.
[Diagram: Topic (key, value) records 1: (A, Alpha), 2: (B, Bus), 3: (B, Baby) → KTable holds (A, Alpha) and (B, Baby)]
• KTable is also known as an update-stream or a changelog.
• A record that comes with the same key updates the previous value.
• Any record without a key is ignored.
• KTable represents the latest value for a given key in a Kafka record.
• Analogy (Relational DB): you can think of this as an update operation in the table for a given primary key.
How to create a KTable ?
• Data in RocksDB is also maintained in a changelog topic for fault tolerance.
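A minimal sketch of creating a KTable; the topic and store names are taken from the words examples used later in this section and should be treated as assumptions.

var wordsTable = streamsBuilder
        .table("words",
                Consumed.with(Serdes.String(), Serdes.String()),
                // Backs the KTable with a named state store (RocksDB) plus a changelog topic
                Materialized.as("words-store"));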
When to use a KTable?
• Any business use case that requires the streaming app to maintain the latest value for a given key can benefit from KTable.
• Example
• A stock trading app that needs to maintain the latest value for a given stock symbol.
KTable - Under the Hood
• When does KTable decide to emit the data to the downstream nodes?
• cache.max.bytes.buffering
• commit.interval.ms
cache.max.bytes.buffering
• KTable uses a cache internally for deduplication, and the cache serves as a buffer.
• This buffer holds the previous value; any new message with the same key updates the current value.
• Caching also helps limit the amount of data written into RocksDB.
• So, if the buffer size grows beyond the cache.max.bytes.buffering value, then the data will be emitted to the downstream nodes.
commit.interval.ms
• commit.interval.ms = 30000 (30 seconds)
• KTable emits the data to the downstream processor nodes once the commit interval has elapsed.
GlobalKTable
var wordsGlobalTable = streamsBuilder
.globalTable("words", Consumed.with(Serdes.String(), Serdes.String())
, Materialized.as("words-global-store")
);
[Diagram: Topic partitions P1–P4, Tasks T1–T4, Stream Threads]
• Aggregation of Data
• count , reduce and aggregate
• Joining Data
• join, leftJoin, outerJoin
• Windowing Data
• windowedBy
Aggregation of Data
• Aggregation is the concept of combining multiple input values to produce one
output.
1. Group records by key
2. Aggregate the records
• Two configurations control when the aggregated results are emitted downstream:
• cache.max.bytes.buffering
• commit.interval.ms
Results of Aggregation
• Aggregation results are present in the StateStore and in an internal Kafka topic.
Aggregation operators in Kafka Streams
• We have three operators in Kafka Streams library to support aggregation:
• count
• reduce
• aggregate
count Operator
• This is used to count the number of different events that share the same key.
var groupedString = wordsTable
.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
.count(Named.as("count-per-alphabet"));
• Example: Count — the first record for key A produces [A : 1], the second updates it to [A : 2], and a record for key B produces [B : 1].
reduce operator
• This operator is used to reduce multiple values to a single value that shares
the same key.
• The return type of the reduce operator must be the same type as the value.
• The aggregated value gets stored in an internal topic.
• Example:
reduce: [A : Apple] + [A : Alpha] → [A : Apple-Alpha], then + [A : Ant] → [A : Apple-Alpha-Ant]
reduce: [B : Bus] + [B : Baby] → [B : Bus-Baby]
• Use this operator if you have a use case to continuously aggregate the events in this fashion.
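A sketch of that continuous concatenation with reduce, assuming the words stream used in the count example:

var wordsReduced = wordsStream
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        // Combine the previous aggregate and the new value; the result type stays String
        .reduce((previousValue, newValue) -> previousValue + "-" + newValue);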
aggregate
• The aggregate operator is similar to the reduce operator.
• The aggregated type can be different from the actual Kafka records that we are acting on.
Topic (key, value): (A, Apple), (A, Adam)

After (A, Apple):
{
  "key": "A",
  "values": ["Apple"],
  "running_count": 1
}

After aggregating (A, Adam):
{
  "key": "A",
  "values": ["Apple", "Adam"],
  "running_count": 2
}
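A sketch of how a running data model like the one above could be built with aggregate. The AlphabetWordAggregate type, its methods, and its Serde are assumptions made for illustration.

var aggregatedTable = wordsStream
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        .aggregate(
                AlphabetWordAggregate::new,                              // initializer: empty aggregate
                (key, value, aggregate) -> aggregate.update(key, value), // add the value, bump running_count
                Materialized.with(Serdes.String(), alphabetWordAggregateSerde)); // aggregate type differs from the value type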
[Diagram: aggregation results are kept in the state store (RocksDB) and an internal topic, and exposed to clients via a REST API]
Aggregation in OrderManagement Service
[Diagram: Topic A and Topic B joined into a joined stream]
• Joins will only be performed if there is a matching key in both the topics.
• How is this different from the merge operator?
• The merge operator has no condition or check to combine events.
Join operators in Kafka Stream
• join
• leftJoin
• outerJoin
Types of Joins in Kafka Streams
1. KStream-KTable: join, leftJoin
2. KStream-GlobalKTable: join, leftJoin
3. KTable-KTable: join, leftJoin, outerJoin
4. KStream-KStream: join, leftJoin, outerJoin
join
(key, value)
alphabet_abbreviations topic: (A, Apple)
alphabets topic: (A, "A is the first letter in the English alphabet")

Joined Stream:
{
  "abbreviation": "Apple",
  "description": "A is the first letter in the English alphabet"
}
• Joins will get triggered if there is a matching record for the same key.
• To achieve a resulting data model like this, we would need a ValueJoiner.
• This is also called an innerJoin.
• The join won't happen if the records from the topics don't share the same key.
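A sketch of that inner join using a ValueJoiner, with the abbreviations read as a KStream and the descriptions as a KTable. The Alphabet result type is an assumption made for illustration.

// Combines the two joined values into one result object (assumed record type)
record Alphabet(String abbreviation, String description) { }

var abbreviationsStream = streamsBuilder
        .stream("alphabet_abbreviations", Consumed.with(Serdes.String(), Serdes.String()));
var alphabetsTable = streamsBuilder
        .table("alphabets", Consumed.with(Serdes.String(), Serdes.String()));

ValueJoiner<String, String, Alphabet> valueJoiner = Alphabet::new;

var joinedStream = abbreviationsStream
        .join(alphabetsTable, valueJoiner); // emits only when both sides have the same key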
Types of Joins in Kafka Streams
• KStream - KTable
• KStream - GlobalKTable
• KTable - KTable
• KStream - KStream
Join KStream - KStream
• The KStream-KStream join is a little different compared to the other ones.
• A KStream is an infinite stream which represents a log of everything that happened.
[Diagram: Topic1 → KStream1 (primary) with the record (A, "A is the first letter in the English alphabet"); Topic2 → KStream2 (secondary) with the record (A, Apple); the records are joined within a time window of 5 secs]
leftJoin
• The join is triggered when a record on the left side of the join is received.
• If there is no matching record on the right side, then the join will be triggered with a null value for the right side.
outerJoin
• The join will be triggered if there is a record on either side of the join.
• When a record is received on either side of the join, and if there is no matching record on the other side, then the join will populate a null value for the other side in the combined result.
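A sketch of the windowed KStream-KStream variants, using a 5-second join window like the diagram above. It assumes both alphabet topics are read as KStreams this time and reuses the valueJoiner from the earlier sketch; for leftJoin/outerJoin the joiner must tolerate a null value on the missing side.

var fiveSecondWindow = JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofSeconds(5));

var innerJoined = abbreviationsStream
        .join(alphabetsStream, valueJoiner, fiveSecondWindow,
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

// leftJoin: the right side may be null; outerJoin: either side may be null
var leftJoined = abbreviationsStream
        .leftJoin(alphabetsStream, valueJoiner, fiveSecondWindow,
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));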
Co-Partitioning or Prerequisites in Joins
• The source topics used for Joins should have the same number of partitions.
• Each partition is assigned to a task in Kafka Streams.
• This guarantees that the related tasks are together and join will work as
expected.
• KStream-KStream: join, leftJoin, outerJoin
• KStream-KTable: join, leftJoin
• KTable-KTable: join, leftJoin, outerJoin
• KStream-GlobalKTable: join, leftJoin
Joins in OrderManagement Service
• In order to group events by time, we need to extract the time from the Kafka
record.
Time Concepts
• Event Time
• Represents the time the record gets produced from the Kafka Producer.
• Ingestion Time
• Represents the time the record gets appended to the Kafka topic in the log.
• Processing Time
• This is the time the record gets processed by the consumer.
TimeStamp Extractor in Kafka Streams
• FailOnInvalidTimestamp
• This implementation extracts the timestamp from the Consumer Record.
• Throws a StreamsException if the timestamp is invalid.
• LogAndSkipOnInvalidTimestamp
• Parses the record timestamp, similar to the implementation of
FailOnInvalidTimestamp.
• Hopping Window
• Sliding Window
• Session Window
wordsStream
.groupByKey()
.windowedBy(tumblingWindow)
.count()
Tumbling Window
Duration windowSize = Duration.ofSeconds(5);
[Diagram: records A, B, C arriving over time are bucketed into fixed, non-overlapping windows — 0–5, 6–10, 11–15 seconds]
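Putting the window definition together with the earlier windowedBy snippet — a minimal sketch, assuming the wordsStream used before:

var tumblingWindow = TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(5));

var windowedCounts = wordsStream
        .groupByKey()
        .windowedBy(tumblingWindow)   // buckets records into fixed, non-overlapping 5-second windows
        .count();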
RealTime Examples
• Entertainment:
• Total number of tickets sold on the opening day.
• Total revenue made on the opening day.
• We can achieve this by defining a tumbling window for a day.
Duration windowSize = Duration.ofDays(1);
• Suppressed.untilWindowCloses
• Only emit the results after the defined window is exhausted.
• Suppressed.untilTimeLimit
• Only emit the results if there are no successive events for the defined time.
Suppression config for Buffer
Buffer Config (Config 2) and BufferFull Strategy (Config 3)
wordsStream
.groupByKey()
.windowedBy(hoppingWindow)
.count()
.suppress(Suppressed
.untilWindowCloses(Suppressed.BufferConfig.unbounded().shutDownWhenFull())
)
Hopping Windows
• Hopping windows are fixed time windows that may overlap.
• Time windows/buckets are created from the clock of the machine the app runs on.
Hopping Windows
Duration windowSize = Duration.ofSeconds(5);
Duration advanceBySize = Duration.ofSeconds(3); // assumed advance interval, smaller than the window size
var hoppingWindow = TimeWindows
        .ofSizeWithNoGrace(windowSize)
        .advanceBy(advanceBySize);
Sliding Windows
• This is a fixed time window, but the windows created are not defined by the machine's running clock.
[Diagram: three A records on a 0–23 second timeline]
• Tumbling Window
Order Message
{
"orderId": 12345,
"locationId": "store_1234",
"finalAmount": 27.00,
"orderType": "GENERAL",
"orderLineItems": [
{
"item": "Bananas",
"count": 2,
"amount": 2.00
},
{
"item": "Iphone Charger",
"count": 1,
"amount": 25.00
}
],
"orderedDateTime": "2022-12-05T08:55:27"
}
Kafka Streams Application using SpringBoot
• The Spring framework is one of the most popular JVM frameworks.
• JSON Serialization/Deserialization.
• Unit and Integration tests for the Spring Kafka Streams using JUnit5.
Querying Materialized Views
[Diagram: a client queries the state store (RocksDB) through a REST API]
Topic (key, value): (A, Apple), (B, Ball), (A, Ant), (A, Alpha), (B, Baby)
• There are two configurations that control this:
• cache.max.bytes.buffering
• commit.interval.ms
• An integration test for our Kafka Streams app interacts with a real Kafka environment.
• This simulates the behavior of the app running in a production environment.
Options for Integration Tests
• EmbeddedKafka
• TestContainers
Grace Period
• Data delays are a common concern in streaming applications.
• Events could be delayed because the producing app has some issue or
the Kafka broker is down.
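A sketch of allowing for such late events with a grace period on the window definition; the two-minute grace value is an arbitrary example.

// Window closes after 5 seconds, but late records are still accepted
// for up to 2 more minutes before they are dropped.
var windowWithGrace = TimeWindows.ofSizeAndGrace(
        Duration.ofSeconds(5),   // window size
        Duration.ofMinutes(2));  // grace period for delayed events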
[Diagram: Kafka Streams App processing loop — 1. Consume records → 2. Process records → 3. Produce records → 3.1 ACK → 4. Commit offsets]
Same Record
[Diagram: the same loop — 1. Consume records → 2. Process records (2.1 update the State Store) → 3. Produce records → 3.1 ACK → 4. Commit offsets; if the app fails before the offsets are committed, the same record is consumed and processed again]
retries > 0
[Diagram: 1. Consume records → 2. Process records (2.1 update the State Store) → 3. Produce records → 3.1 ACK → 4. Commit offsets; with retries > 0, a retried produce can write duplicate records to the output topic]
Are there any issues with Duplicates?
• It depends on the use case.
• Duplicates not Allowed:
• Finance related transactions
• Revenue Calculation
• Duplicates Allowed:
• Likes on a post
• Number of viewers on a live stream
How to guarantee Exactly Once Processing ?
[Diagram: the processing loop again — 1. Consume records → 2. Process records → 3. Produce records → 3.1 ACK → 4. Commit offsets — with producing the records and committing the offsets highlighted]
• Committing offsets and producing the output records need to succeed or fail together for exactly-once processing.
Exactly Once Processing in Kafka Streams
• Idempotent Producer

Kafka Streams App configuration:
• transactional.id = abc
• processing.guarantee = exactly_once_v2

Downstream Consumer App configuration:
• isolation.level = read_committed

try {
    producer.beginTxn();
    records = consumer.poll();
    producer.send("changelog");
    producer.send("output");
    producer.sendConsumerOffsets();
    // commit the txn
    producer.commitTxn();
} catch (KafkaException e) {
}

[Diagram: the writes to the changelog topic (P0), the output topic (P1), and the __consumer_offsets topic (P3) each get the data (D) followed by a commit marker (C); a Consumer App with isolation.level = read_committed only sees committed data]
Transactions Error Scenario
Source Topic: P0, P1, P3
Transaction Coordinator (Broker): tracks the pid & epoch in the __transaction_state topic.

Kafka Streams App configuration:
• transactional.id = abc
• processing.guarantee = exactly_once_v2

try {
    producer.beginTxn();
    records = consumer.poll();
    producer.send("changelog");
    producer.send("output");
    producer.sendConsumerOffsets();
    // commit the txn
    producer.commitTxn();
} catch (KafkaException e) {
    producer.abortTxn();
}

[Diagram: when the transaction fails, the changelog (P0), output (P1), and __consumer_offsets (P3) partitions receive the data (D) followed by an abort marker (A) instead of a commit marker, so read_committed consumers never see the aborted data]
Transactions : Idempotent Producer
Transaction Coordinator (Broker): tracks the pid & epoch in the __transaction_state topic.

Kafka Streams App configuration:
• transactional.id = abc
• processing.guarantee = exactly_once_v2

[Diagram: the app's topology and state store produce to topic partitions P0 and P1, whose leader and follower replicas live on Broker 1 and Broker 2; each record carries the producer id and sequence number (Data: pid: 1, seq: 0), so a retried send of the same record can be recognized as a duplicate]
• If a duplicate record is published into the source topic, then the Kafka Streams app has no clue that the record was already processed.
• Kafka Transactions are not applicable for consumers that read from a Kafka topic and write to a DB.
Performance of Kafka Transactions
• Enabling Transactions adds a performance overhead to the Kafka Producer.
• Additional call to the broker to register the producer id.
• Additional call to register the partitions with the transaction coordinator.
• On the consumer end, the consumer needs to wait for records until the transaction is complete because of the read_committed configuration.
• If the transaction takes a longer time, then the wait period for consumers goes up.
Running Kafka Streams as Multiple Instances
[Diagram: multiple instances of the app, each with application.id = orders-stream-app and its own Stream Thread, sharing the partitions of the Orders topic]
Running Kafka Streams as Multiple Instances
• Each instance needs to know the address of the other instances.
[Diagram: a Client calls one instance of the orders app; data owned by another instance is fetched from that instance over HTTP]
A. Spring WebClient
KeyBased Queries with Multiple Instances - Overview
• Each instance needs to know the
instance that holds the data for a
given key.
curl -i http://localhost:8080/v1/orders/count/general_orders?location_id=store_4567
curl -i http://localhost:8080/v1/orders/count/general_orders?location_id=store_1234
{"locationId":"store_4567","orderCount":1}
{"locationId":"store_1234","orderCount":1}
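A sketch of how an instance could figure out which host owns a given key before answering such a request. The state store name, the application.server value, and the props/kafkaStreams objects are assumptions.

// Each instance advertises its own host:port so the others can find it
props.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "localhost:8080");

// At query time: which instance holds the data for this location id?
KeyQueryMetadata metadata = kafkaStreams.queryMetadataForKey(
        "general_orders-count-store",            // assumed state store name
        "store_4567",                            // the key from the request
        Serdes.String().serializer());

HostInfo owner = metadata.activeHost();
// If owner is this instance, query the local state store;
// otherwise forward the request to http://<owner.host()>:<owner.port()> (e.g., via Spring WebClient).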