Chandan Prakash's Blog
Recently, while exploring some ingestion technologies, I got a chance to look into Kafka
Connect (KC) in detail. As a developer, the two things that intrigued me were:
1. WHY it exists: with Kafka already around, why do we need KC, why should we use it,
and what different purpose does it serve?
2. HOW it works: how it does what it does, the internal code flow, the distributed
architecture, and how it guarantees fault tolerance, parallelism, rebalancing, delivery
semantics, etc.
If you too have the above two questions in mind, this post might be helpful to you.
Note: My findings about the HOW part are based on my personal understanding of the KC
source code; I have not actually run it yet, so take it with a pinch of salt.
Corrections are welcome!
Kafka Connect is an open source framework, built as another layer on top of core Apache
Kafka, to support large-scale streaming data:
1. import from any external system (called a Source) like MySQL, HDFS, etc. into the Kafka
broker cluster
2. export from the Kafka cluster to any external system (called a Sink) like HDFS, S3, etc.
For the above two responsibilities, KC works in two modes:
1. Source Connector: imports data from a Source into Kafka
2. Sink Connector: exports data from Kafka to a Sink
The core code of the KC framework is part of the Apache Kafka code base, as a module
named “connect”. To use Kafka Connect with a specific data source/sink, a corresponding
source/sink connector needs to be implemented by overriding the abstract classes provided
by the KC framework (Connector, Source/Sink Task, etc.). Most of the time this is done by
the companies or community associated with the specific data source/sink (HDFS, MySQL,
S3, etc.) and then reused by application developers.
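
To make the abstraction concrete, below is a minimal, hypothetical sketch of what overriding those classes looks like on the source side. The class names (MySourceConnector, MySourceTask), the single "topic" config key and the fake readOneValueFromExternalSystem() call are made up purely for illustration; real connectors like the HDFS or S3 ones are far more involved.

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical connector: the KC framework calls these methods to learn which
// Task class to run and how to split the work across task instances.
public class MySourceConnector extends SourceConnector {
    private Map<String, String> props;

    @Override public String version() { return "0.0.1"; }

    // Called once when the connector is deployed to the KC cluster.
    @Override public void start(Map<String, String> props) { this.props = props; }

    // The Task implementation that worker processes will instantiate and run.
    @Override public Class<? extends Task> taskClass() { return MySourceTask.class; }

    // Split the job into at most maxTasks independent task configurations;
    // each returned map becomes the config of one task running on some worker.
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        List<Map<String, String>> configs = new ArrayList<>();
        configs.add(props); // trivial split: a single task does everything
        return configs;
    }

    @Override public void stop() { }

    // Declares the configuration keys this connector expects.
    @Override public ConfigDef config() {
        return new ConfigDef()
                .define("topic", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                        "Kafka topic to write imported records into");
    }
}

// Hypothetical task (in a real project this would live in its own file).
class MySourceTask extends SourceTask {
    private String topic;

    @Override public String version() { return "0.0.1"; }

    @Override public void start(Map<String, String> props) { this.topic = props.get("topic"); }

    // Called in a loop by the worker; every returned record is written to Kafka
    // and its source offset is committed by the framework afterwards.
    @Override public List<SourceRecord> poll() throws InterruptedException {
        String value = readOneValueFromExternalSystem(); // placeholder for real I/O
        return Collections.singletonList(new SourceRecord(
                Collections.singletonMap("source", "demo"), // source partition
                Collections.singletonMap("position", 0L),   // source offset
                topic, Schema.STRING_SCHEMA, value));
    }

    private String readOneValueFromExternalSystem() { return "hello"; }

    @Override public void stop() { }
}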
KC assumes that at least one side, either the source or the sink, has to be Kafka.
Though KC is part of the Apache Kafka download, to use it we need to start KC daemons on
servers. It is important to understand that a KC cluster is different from the Kafka
message broker cluster; KC has its own dedicated cluster where, on each machine, KC
daemons are running (Java processes called Workers).
A KC worker instance is started on each server with a Kafka broker address, topics for
internal use, and a group id. Each worker instance is stateless and does not share
information with other workers. But workers belonging to the same group id coordinate with
each other via the internal Kafka topics. This is similar to how a Kafka consumer group
works and is implemented underneath in a similar way; an example worker configuration is
sketched below.
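
For context, a distributed worker is typically launched with the connect-distributed.sh script that ships with Kafka, pointing at a properties file along these lines (broker addresses, topic names and the group id below are placeholder values):

bin/connect-distributed.sh config/connect-distributed.properties

# connect-distributed.properties (example values only)
# Kafka brokers the worker talks to
bootstrap.servers=broker1:9092,broker2:9092

# All workers sharing this group.id form one KC cluster
group.id=my-connect-cluster

# Internal Kafka topics used to share connector configs, offsets and status
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

# How record keys/values are (de)serialized to/from Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter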
A Source Connector (with the help of Source Tasks) is responsible for getting data into
Kafka, while a Sink Connector (with the help of Sink Tasks) is responsible for getting data
out of Kafka.
Internally, at regular intervals, data is polled by the Source Tasks from the data source
and written to Kafka, and then offsets are committed. Similarly, on the sink side, data is
pushed at regular intervals by the Sink Tasks from Kafka to the destination system, and
offsets are committed. For example, I looked into the code of the S3 sink connector
[https://github.com/confluentinc/kafka-connect-storage-cloud/tree/master/kafka-connect-s3/src/main/java/io/confluent/connect/s3]
and found that a Sink Task keeps putting data for a specific Kafka topic-partition into a
byte buffer, and then at a configurable interval (by default 30 seconds) the accumulated
data is written to S3 files. Also, it uses a deterministic partitioner which ensures that,
in case of failures, even if the same data is written twice, the operation is idempotent. A
deterministic partitioner means that for the same given set of records, there will be the
same number of files, with each file having exactly the same data. As such, it ensures
exactly-once delivery semantics.
The following diagram shows how a Sink Connector works. More info here
[https://docs.confluent.io/3.2.0/connect/concepts.html#connect-workers].
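
In code terms, the sink-side cycle described above looks roughly like the hypothetical SinkTask sketch below. The per-partition in-memory buffering and the writeToDestination() placeholder are my simplification for illustration, not the actual S3 connector code:

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sink task: buffers records per topic-partition in memory and
// writes them out when the framework asks for a flush.
public class MySinkTask extends SinkTask {
    private final Map<TopicPartition, List<SinkRecord>> buffers = new HashMap<>();

    @Override public String version() { return "0.0.1"; }

    @Override public void start(Map<String, String> props) { }

    // Called repeatedly by the worker with records it has consumed from Kafka.
    @Override public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            TopicPartition tp = new TopicPartition(record.topic(), record.kafkaPartition());
            buffers.computeIfAbsent(tp, k -> new ArrayList<>()).add(record);
        }
    }

    // Called before the framework commits consumer offsets back to Kafka, so
    // data reaches the destination before the offsets are advanced.
    @Override public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        for (Map.Entry<TopicPartition, List<SinkRecord>> entry : buffers.entrySet()) {
            writeToDestination(entry.getKey(), entry.getValue()); // e.g. one file per partition
        }
        buffers.clear();
    }

    // Placeholder for the destination-specific write (S3 object, HDFS file, ...).
    private void writeToDestination(TopicPartition tp, List<SinkRecord> records) { }

    @Override public void stop() { }
}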
Any other streaming ingestion framework like Flume or Logstash (as far as I googled) uses
some internal buffering mechanism for data before writing to the destination. That
buffering might or might not be reliable, and might be local to a node and therefore not
fault tolerant in case of node failures. KC just makes it explicit that it either writes to
a Kafka topic (Source) or reads from a Kafka topic (Sink), and it uses the Kafka message
broker underneath for data buffering. By doing that, it brings to the table all those
features like reliable buffering, scalability, fault tolerance, simple parallelism, auto
recovery, rebalancing, etc. for which Kafka is popular.
Apache Gobblin, a well-known reliable batch ingestion framework, is working on streaming
ingestion support, but it is not available yet.
Also, KC is not only for ingestion but also for data export, provided the source of the
data is Kafka.
Like any other framework, KC is not suitable for all use cases. First of all, it is Kafka
centric, so if your ETL pipeline does not involve Kafka, you cannot use KC. Second, the KC
framework is JVM specific and only supports the Java/Scala languages. Third, and perhaps
most importantly, Kafka Connect does not support heavy Transformation the way Gobblin does.
It is important to understand here that we can always write a custom connector which does
the Transformation as well, but this is intentionally discouraged by the KC development
guidelines. The reason is simple: KC does not want to mix the responsibility of a
processing engine (T) with its own responsibility (EL). Also, by keeping Transformation out
of KC, we keep the connector generic and ensure its ease of reuse by other users having
different transformation needs with the same source/sink.
Conclusion:
Kafka Connect is a great framework to evaluate and invest in for building streaming ETL
pipelines, and a must-have if Kafka is already part of the pipeline. There is a lot of work
happening in the streaming ingestion space these days; we also need to keep a close watch
on Apache Gobblin, where work is going on to support streaming ingestion
[https://engineering.linkedin.com/blog/2018/01/gobblin-enters-apache-incubation].
Happy ETLing!
Posted 28th February 2018 by Unknown
Labels: Apache Kafka, Big data, Ingestion, kafka connect, Real Time Analytics, Streaming