

Chandan Prakash's Blog


28th February 2018

Kafka Connect: WHY (exists) and HOW (works)

Recently, while exploring some ingestion technologies, I got a chance to look into Kafka
Connect (KC) in detail. As a developer, the two things that intrigued me were:
1. WHY it exists: with Kafka already around, why do we need KC, why should we use it, and
what different purpose does it serve?
2. HOW it works: how it does what it does, the internal code flow, the distributed
architecture, and how it guarantees fault tolerance, parallelism, rebalancing, delivery
semantics, etc.
If you too have the above two questions in mind, this post might be helpful to you.
Note: My findings about the HOW part are based on my personal understanding of the KC
source code; I have not actually run it yet, so take it with a pinch of salt.
Corrections are welcome!

Intro about Kafka Connect:

Kafka Connect is an open source framework, built as another layer on core Apache Kafka,
to support large scale streaming data:
1. import from any external system (called a Source), like MySQL, HDFS, etc., into the
Kafka broker cluster
2. export from the Kafka cluster to any external system (called a Sink), like HDFS, S3, etc.
For the above two responsibilities, KC works in two modes:
1. Source Connector: imports data from a Source into Kafka
2. Sink Connector: exports data from Kafka to a Sink

The core code of the KC framework is part of the Apache Kafka code base, as another
module named "connect". To use Kafka Connect with a specific data source/sink, a
corresponding source/sink connector needs to be implemented by overriding the abstract
classes provided by the KC framework (Connector, Source/Sink Task, etc.). Most of the
time this is done by the companies or communities associated with the specific data
source/sink (HDFS, MySQL, S3, etc.) and then reused by application developers.
KC assumes that at least one side, either the source or the sink, has to be Kafka.
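To make that abstraction concrete, here is a minimal, hypothetical sketch of what overriding
those classes can look like, written against the public Connect API (org.apache.kafka.connect.*).
The class names FileSourceConnector/FileSourceTask and the "file" and "topic" settings are
invented for illustration; they are not taken from any real connector.

import java.util.*;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Toy Source Connector: validates config and tells KC which Task class to run.
public class FileSourceConnector extends SourceConnector {
    private Map<String, String> props;

    @Override public void start(Map<String, String> props) { this.props = props; }
    @Override public Class<? extends Task> taskClass() { return FileSourceTask.class; }

    // KC asks for up to maxTasks configs; each map becomes one Task thread on some worker.
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        return Collections.singletonList(new HashMap<>(props));
    }

    @Override public void stop() { }
    @Override public ConfigDef config() {
        return new ConfigDef()
            .define("file", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "File to read from")
            .define("topic", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Topic to write to");
    }
    @Override public String version() { return "0.1"; }
}

// The matching toy Source Task: the worker thread calls poll() in a loop.
class FileSourceTask extends SourceTask {
    private String topic;

    @Override public void start(Map<String, String> props) { this.topic = props.get("topic"); }

    @Override public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000); // a real task would block or wait for new data from the source
        // sourcePartition/sourceOffset let the framework track and commit how far we have read
        Map<String, String> sourcePartition = Collections.singletonMap("file", "demo");
        Map<String, Long> sourceOffset = Collections.singletonMap("position", 0L);
        return Collections.singletonList(new SourceRecord(
            sourcePartition, sourceOffset, topic, Schema.STRING_SCHEMA, "hello from the source"));
    }

    @Override public void stop() { }
    @Override public String version() { return "0.1"; }
}

The worker owns the polling loop and periodically commits the source offsets on our behalf;
the connector code only describes what to copy and how to split the work into tasks.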

WHY Kafka Connect:

KC brings a very important development principle into play in ETL pipeline development:

SoC (Separation of Concerns)

An ETL pipeline involves Extract, Transform and Load. Any stream processing engine like
Spark Streaming, Kafka Streams, etc. is specialized for the T (Transform) part, not for
E (Extract) and L (Load).
This is where KC fills in. Kafka Connect takes care of the E and L, no matter which
processing engine is used, provided Kafka is part of the pipeline. This is great because
the same E and L code for a given source/sink can now be reused across different
processing engines. This ensures decoupling and reusability.
KC reduces the friction and effort of adopting any stream processing framework like
Spark Streaming by taking up the how-to-ingest-data responsibility.
Also, since Kafka is on either the source or the sink side for sure, we get better
guarantees around parallelism, fault tolerance, delivery semantics, ordering, etc.

HOW Kafka Connect works:

Though KC is part of the Apache Kafka download, to use it we need to start KC daemons
on servers. It is important to understand that a KC cluster is different from the Kafka
message broker cluster: KC has its own dedicated cluster where, on each machine, KC
daemons are running (Java processes called Workers).
A KC worker instance is started on each server with a Kafka broker address, topics for
internal use and a group id. Each worker instance is stateless and does not share info
with other workers. But workers belonging to the same group id coordinate with each
other via the internal Kafka topics. This is similar to how a Kafka consumer group works
and is implemented underneath in a similar way.
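For illustration, a distributed worker is started with a properties file along the lines of
the connect-distributed.properties that ships with Kafka; the broker address, group id and
topic names below are placeholders:

# connect-distributed.properties (placeholder values)
bootstrap.servers=kafka-broker1:9092

# workers that share this group.id form one KC cluster
group.id=my-connect-cluster

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# internal Kafka topics the workers use to coordinate and store state
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

Each server then runs bin/connect-distributed.sh with this file to start its worker daemon.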

Each worker instantiates a Connector (Source or Sink Connector), which defines the
configurations of the associated Tasks (Source or Sink Tasks). A Task is actually a
runnable thread which does the actual work of transferring data and committing the
offsets. For example, S3SinkTask is a runnable thread which takes care of transferring
the data of a single topic-partition from Kafka to S3.
A Worker instantiates a pool of Task threads, based on the info provided by the
Connector. Before creating the task threads, the Workers decide among themselves,
through the internal Kafka topics, which part of the data each of them is going
to process.


It is important to note here that there is no central server in a KC cluster, and KC is
not responsible for launching worker instances or restarting them on failure.
If a worker process dies, the cluster is rebalanced to distribute the work fairly over
the remaining workers. If a new worker is started, a rebalance ensures it takes over
some work from the existing workers. This rebalancing is an existing feature of Apache
Kafka.
KC exposes a REST API to create, modify and destroy connectors.
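For example, against a worker's REST port (8083 by default), a connector can be created,
listed and destroyed with plain HTTP calls; the connector name, topic and file path below
are placeholders, and FileStreamSinkConnector is the simple file sink that ships with Kafka:

# create a sink connector that copies a Kafka topic into a local file
curl -X POST http://localhost:8083/connectors \
     -H "Content-Type: application/json" \
     -d '{
           "name": "demo-file-sink",
           "config": {
             "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
             "tasks.max": "1",
             "topics": "demo-topic",
             "file": "/tmp/demo-sink.txt"
           }
         }'

# list and destroy connectors
curl http://localhost:8083/connectors
curl -X DELETE http://localhost:8083/connectors/demo-file-sink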

A Source Connector (with the help of Source Tasks) is responsible for getting data into
Kafka, while a Sink Connector (with the help of Sink Tasks) is responsible for getting
data out of Kafka.
Internally, what happens is that, at regular intervals, data is polled by the source
task from the data source and written to Kafka, and then the offsets are committed.
Similarly, on the sink side, data is pushed at regular intervals by the sink tasks from
Kafka to the destination system, and the offsets are committed. For example, I looked
into the code of the S3 sink connector
[https://github.com/confluentinc/kafka-connect-storage-cloud/tree/master/kafka-connect-s3/src/main/java/io/confluent/connect/s3]
and found that a sink task keeps putting data for a specific Kafka topic-partition into
a byte buffer, and then at a configurable time (by default 30 seconds) the accumulated
data is written to S3 files. Also, it uses a deterministic partitioner which ensures
that, in case of failures, even if the same data is written twice, the operation is
idempotent. A deterministic partitioner means that for a given set of records, there
will be the same number of files, with each file having exactly the same data. As such,
it ensures exactly-once delivery semantics.
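As a rough sketch of that buffer-then-flush pattern (not the actual S3 connector code; the
per-partition buffering is simplified here to an in-memory list and the write is a print
statement), a sink task built on the public SinkTask API looks roughly like this:

import java.util.*;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Toy sink task: put() only buffers records; flush() is where they are written out
// to the destination, after which the framework commits the consumer offsets.
public class BufferingSinkTask extends SinkTask {
    private final Map<TopicPartition, List<SinkRecord>> buffers = new HashMap<>();

    @Override public void start(Map<String, String> props) { }

    // Called repeatedly by the worker thread with batches consumed from Kafka.
    @Override public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            TopicPartition tp = new TopicPartition(record.topic(), record.kafkaPartition());
            buffers.computeIfAbsent(tp, k -> new ArrayList<>()).add(record);
        }
    }

    // Called periodically (see the worker's offset.flush.interval.ms): write buffered
    // data, then the framework commits the offsets passed in here back to Kafka.
    @Override public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        for (Map.Entry<TopicPartition, List<SinkRecord>> entry : buffers.entrySet()) {
            // A real connector would write entry.getValue() to S3/HDFS/... here,
            // ideally with deterministic file names so that retries stay idempotent.
            System.out.printf("writing %d records for %s%n",
                entry.getValue().size(), entry.getKey());
        }
        buffers.clear();
    }

    @Override public void stop() { }
    @Override public String version() { return "0.1"; }
}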

A diagram of how a Sink Connector works, along with more info, can be found here
[https://docs.confluent.io/3.2.0/connect/concepts.html#connect-workers].

Comparison with alternatives:

Any other streaming ingestion framework like Flume or Logstash (as far as I googled)
will use some internal mechanism for buffering data before writing to the destination.
That buffering might or might not be reliable, and might be local to a node and
therefore not fault tolerant in case of node failures. KC just makes it explicit that it
always either reads from a Kafka topic (when Kafka is the source of the transfer) or
writes to a Kafka topic (when Kafka is the sink), and it uses the Kafka message broker
underneath for data buffering. By doing that, it brings to the table all those features
for which Kafka is popular: reliable buffering, scalability, fault tolerance, simple
parallelism, auto recovery, rebalancing, etc.

Apache Gobblin, a well known and reliable batch ingestion framework, is working on
streaming ingestion, but that is not available yet.
Also, KC is not only for ingestion but also for data export, provided the source of the
data is Kafka.

Why not Kafka Connect:

Like any other framework, KC is not suitable for all use cases. First of all, it is
Kafka centric, so if your ETL pipeline does not involve Kafka, you cannot use KC.
Second, the KC framework is JVM specific and only supports the Java/Scala languages.
Third, and perhaps most importantly, Kafka Connect does not support heavy
transformations the way Gobblin does.
It is important to understand here that we can always write a custom connector which
does the transformation as well, but this is intentionally discouraged by the KC
development guidelines. The reason is simple: KC does not want to mix the responsibility
of the processing engine (T) with its own responsibility (EL). Also, by keeping
transformation out of KC, we keep the connector generic and ensure it can easily be
reused by other users who have different transformation needs with the same source/sink.

Conclusion:

Kafka Connect is a great framework to evaluate and invest in for building streaming ETL
pipelines, and a must-have if Kafka is already part of the pipeline. There is a lot of
work happening in the streaming ingestion space these days; we also need to keep a close
watch on Apache Gobblin, where work is going on to support streaming ingestion
[https://engineering.linkedin.com/blog/2018/01/gobblin-enters-apache-incubation].

Happy ETLing!
Posted 28th February 2018 by Unknown
Labels: Apache Kafka, Big data, Ingestion, kafka connect, Real Time Analytics, Streaming

