Apache Kafka
Introduction
Chapter 1
Agenda
– Investigate Kafka Streams, the Kafka API for building stream processing applications
Chapter 2
01: Introduction
06: Conclusion
Systems Complexity
– A single ETL (Extract, Transform, Load) process to move data to a central location for analysis
– Codebase grows
– “I can’t start to analyze today’s data until the overnight ingest process has run”
Real-Time Processing: Often a Better Approach
▪ Examples:
– Fraud detection
– etc.
– However, this is changing over time, as more ‘stream processing’ systems emerge
– Kafka Streams
– Apache Storm
– Apache Samza
– etc.
Kafka’s Origins
– Twitter, Netflix, Goldman Sachs, Hotels.com, IBM, Spotify, Uber, Square, Cisco…
Use Cases
– Log aggregation
– Stream processing
– Messaging
Kafka Fundamentals
▪ Brokers are the main storage and messaging components of the Kafka cluster
Kafka Messages
– Producer provides serializers to convert the key and value to byte arrays
Topics
– A Topic is a logical representation of a stream of related messages
– Each message is assigned to a Topic by the Producer that writes it
Kafka Components
– Producers
– Brokers
– Consumers
– ZooKeeper
– Confluent develops and supports a REST (REpresentational State Transfer) server which can be used by
clients written in any language
– By default, the Producer assigns each keyed message to a Partition based on its key:
– hash(key) % number_of_partitions
Broker Basics
▪ Brokers receive and store messages when they are sent by the Producers
▪ Each Partition is stored on the Broker’s disk as one or more log files
▪ Kafka provides a configurable retention policy for messages to manage log file growth
Consumer Basics
– As messages are written to a Topic, the Consumer will automatically retrieve them
– By default, each Consumer will receive all the messages in the Topic
What is ZooKeeper?
– Cluster management
▪ In earlier versions of Kafka, the Consumer needed access to the ZooKeeper quorum
– How Kafka Connect compares to writing your own data transfer system
▪ Kafka Connect is a framework for streaming data between Apache Kafka and other
data systems
▪ Kafka Connect is open source, and is part of the Apache Kafka distribution
▪ It is simple, scalable, and reliable
▪ Internally, Kafka Connect is a Kafka client using the standard Producer and
Consumer APIs
– Features fault tolerance and automatic load balancing when running in distributed mode
– No coding required
– Pluggable/extendable by developers
Connect Basics
▪ Connectors are logical jobs responsible for managing the copying of data between Kafka and another
system
▪ Connector Sources read data from an external data system into Kafka
▪ Connector jobs are broken down into Tasks, which do the actual copying of the data
▪ Splitting the workload into these smaller pieces provides parallelism and scalability
Converting Data
▪ Converters determine the format of the data written to or read from Kafka (analogous to Serializers)
– key.converter
– value.converter
– Avro: AvroConverter
– JSON: JsonConverter
– String: StringConverter
▪ Best Practice is to use an Avro Converter and Schema Registry with the Connectors
Benefits
– Moves data between Kafka and external systems in a scalable, reliable, and configurable way
Off-The-Shelf Connectors
– FileStream
– JDBC
– HDFS
– Elasticsearch
– AWS S3
– Replicator
– JMS
– See https://www.confluent.io/product/connectors/
▪ JDBC Source periodically polls a relational database for new or recently modified
rows
– Creates a record for each row, and Produces that record as a Kafka message
▪ Records from each table are Produced to their own Kafka topic
▪ The Connector can detect new and updated rows in several ways:
– Using an incrementing ID column, a timestamp column, or a combination of the two
– It can also be configured to bulk-load entire tables on each poll
▪ Continuously polls from Kafka and writes to HDFS (Hadoop Distributed File System)
▪ Works with secure HDFS and the Hive Metastore, using Kerberos
– Field Partitioner
– Time Partitioner
– Custom Partitioners
Two Modes: Standalone and Distributed
– Standalone mode
– Use case: testing and development, or when a process should not be distributed
– Distributed mode
– Uses Kafka's group coordination protocol so that Connector work is shared across multiple Worker processes
– InsertField: insert a field using attributes from the message metadata or from a configured static value
– ValueToKey: replace the key with a new key formed from a subset of fields in the
value payload
– TimestampRouter: update the topic field as a function of the original topic value
and timestamp
▪ Connectors can be added, modified, and deleted via a REST API on port 8083
– In standalone mode, this is an alternative to editing the connector configuration properties file; changes made this way do not persist after a worker restart
– In distributed mode, connector configurations are managed through the REST API and do persist after a worker process restart
In this Hands-On Exercise, you will run Kafka Connect to pull data from a MySQL database into a
Kafka topic
Please refer to the Hands-On Exercise Manual
Chapter Review
▪ Kafka Connect provides a scalable, reliable way to transfer data from external systems into Kafka, and
vice versa
▪ Many off-the-shelf Connectors are provided by Confluent, and many others are under development by
third parties
▪ Kafka Streams is a lightweight Java library for building distributed stream processing applications using
Kafka
▪ Supports windowing operations, and stateful processing including distributed joins and aggregation
– Spark Streaming
– Apache Storm
– Apache Samza
– etc.
▪ Continuous transformations
▪ Continuous queries
▪ Microservices
▪ Event-driven systems
▪ …and more
▪ Many people are currently building their own stream processing applications
▪ Using Kafka Streams is much easier than taking the ‘do it yourself’ approach
– Means you can focus on the application logic, not the low-level plumbing
▪ There is no ‘installation’!
▪ Kafka Streams is a library. Add it to your application like any other library
▪ There is no cluster!
▪ Unlearn your bad habits: ‘do cool stuff with data’ != ‘must have a cluster’
Kafka Streams Basic Concepts (1)
▪ A processor topology defines the data flow through the stream processors
KStreams and KTables
– Each record represents a self-contained piece of data in the unbounded data set
▪ If we were to treat the stream as a KStream and sum up the values for apple, the result would be 6
▪ If we were to treat the stream as a KTable and sum up the values for apple, the result would be 5
– The second record is treated as an update to the first, because they have the same key
▪ Typically, if you are going to treat a topic as a KTable it makes sense to configure log compaction on the
topic
– Reliable
– Fault-tolerant
– Scalable
– Interactive Queries
Stateful Computations
▪ Example: count() will cause the creation of a state store to keep track of counts
– Default: RocksDB (key-value store) to allow for local state that is larger than available RAM
API Option 1: Low-Level Processor API
▪ May appeal to people who are familiar with Storm, Samza, etc.
– Or to people who require functionality that is not yet available in the DSL
API Option 2: Kafka Streams DSL
– Includes high-level operations such as aggregations, which combine multiple input records together into a single output record
Windowing: Example
▪ Each Kafka Streams application must provide serdes (serializers and deserializers) for the data types of
the keys and values of the records
– These will be used by some of the operations which transform the data
Available Serdes
– See https://github.com/confluentinc/examples/tree/kafka-0.10.0.0-cp3.0.0/kafkastreams/src/main/java/io/confluent/examples/streams/utils
▪ Create a KStream or KTable object from one or more Kafka Topics using KStreamBuilder (its stream() and table() methods)
▪ Example:
Transforming a Stream
– filter
– Creates a new KStream containing only records from the previous KStream which meet some
specified criteria
– map
– Creates a new KStream by transforming each element in the current stream into a different element
in the new stream
– mapValues
– Creates a new KStream by transforming the value of each element in the current stream into a
different element in the new stream
– flatMap
– Creates a new KStream by transforming each element in the current stream into zero or more
different elements in the new stream
– flatMapValues
– Creates a new KStream by transforming the value of each element in the current stream into zero
or more different elements in the new stream
– countByKey
– Counts the number of instances of each key in the stream; results in a new, ever-updating KTable
– reduceByKey
– Combines values of the stream using a supplied Reducer into a new, ever-updating KTable
– Example:
▪ We often want to write to a topic but then continue to process the data
▪ Example:
Processing Guarantees
▪ If a machine fails, it is possible that some records may be processed more than once
– Whether this is acceptable or not depends on the use-case
▪ Exactly-once processing semantics will be supported in the next major release of Kafka
▪ Check the documentation to learn about processing windowed tables, joining tables, joining streams
and tables…
– Simple
– Powerful
– Flexible
– Scalable
Conclusion - Chapter 6
▪ Together, Kafka Connect and Kafka Streams give us a way to import data from external systems, manipulate or modify that data in real time, and then export it to destination systems for storage or further processing