Understand Azure Event Hubs

[Figure: A glance at the top-level architecture of Azure Event Hubs.]
About The Author
Saravana Kumar (Microsoft Integration MVP)
For the past few months I have been tied up with BizTalk360 and isolated
myself from any deep technology learning. Now that BizTalk360 version 8.0 is
out of the way and the team has settled down, I thought it was time to look
deep into some of the technology areas I had been intending to learn for some
time. As part of that process, I decided to start with Azure Event Hubs.
Azure Event Hubs is not entirely new to me; I've been doing small bits and
pieces here and there as hobby/prototype projects for a while. Now it's
time to understand in more depth how the various pieces fit together. I'll
try to cover my learnings as much as possible in these blogs, so they might
help people get started.
This is exactly the problem Azure Event Hubs solves for us. You
simply provision an event hub using the portal, which gives you the
endpoints, and you can start firing messages at those endpoints via either
HTTPS or AMQP. The collected (ingested) data is stored in the event hub,
and you can later read it using readers at your own pace. The data is
retained for a period of time automatically.
Azure Event Hubs is designed for scale; it can process millions and millions
of messages in both directions – inbound and outbound. Some real-world
use cases include collecting telemetry data from cars, games,
applications and so on; IoT scenarios where millions of devices push data to
the cloud; gaming scenarios where you push user activities at scale to the
cloud; etc. There are tons of use cases for a technology like Azure Event
Hubs.
The key building blocks of Azure Event Hubs we will walk through are:
EventData (message)
Publishers (or producers)
Partitions
Partition Keys / Partition Id
Receivers (or consumers)
Consumer Groups
Event Processor Host
Transport Protocol (HTTP, AMQP)
Throughput Units
Event Data
In the context of Event Hubs, messages are referred to as event data. Event
data contains the body of the event, which is a binary stream (you can place
any binary content, like serialized JSON, XML, etc.), a user-defined property
bag (name-value pairs) and various system metadata about the event, like its
offset in the partition and its number in the stream sequence.
The EventData class is included in the .NET Azure Service Bus client library.
At the protocol level it follows the same sender model (BrokeredMessage)
used in Service Bus queues and topics.
Publishing events with a .NET client using the Azure Service Bus client
library is just 3 lines of code:
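A minimal sketch of those three lines, assuming the classic Microsoft.ServiceBus.Messaging SDK; the hub name "myeventhub" and the connection string are placeholders you would take from the Azure portal:

```csharp
using System.Text;
using Microsoft.ServiceBus.Messaging;

// Connection string from the portal (a Shared Access Policy with Send rights)
var connectionString = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...";

// The 3 core lines: create the client, wrap the payload, send it
var client = EventHubClient.CreateFromConnectionString(connectionString, "myeventhub");
var eventData = new EventData(Encoding.UTF8.GetBytes("{\"deviceId\":42,\"temp\":21.5}"));
await client.SendAsync(eventData);
```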
Partitions
As you can see from the image, there are a number of partitions inside an
event hub. Partitions are one of the key differentiators in the way data is
stored and retrieved compared to other Service Bus technologies like queues
and topics. Traditional queues and topics are designed around the
"competing consumer" pattern, in which each consumer attempts to read
from the same queue (imagine a single-lane road, where vehicles go
one after the other and there is no option to overtake), whereas Event Hubs
is designed around the "partitioned consumer" pattern (think of it like
parallel lanes on a motorway, where traffic flows through multiple lanes; if
one lane is busy, vehicles choose another lane to move faster). The single-lane
queue model ultimately results in scale limits, hence Event Hubs uses the
partitioned consumer pattern to achieve the massive scale required.
The other important difference between a normal queue and Event Hub
partitions is that in a queue, once a message is read it is taken out of the
queue and is no longer available (if there are errors, it moves to the
dead-letter queue), whereas in an Event Hub partition the data stays
there even after it has been read by a consumer. This allows consumers
to come back and read the data again if required (for example, after a lost
connection). Events stored in partitions get deleted only after the retention
period has expired (for example, 1 day); you cannot manually delete data
from partitions. Events are stored in a partition in sequence; as newer
events arrive they are added to the end of the sequence (more or less like
how events get added to queues).
Due to varying distribution patterns, the partitions will all grow to
different sizes from one another. It's also important to know that your event
data will live in only one partition; it will not get duplicated. Underneath,
each partition has private blob storage managed by Azure.
The number of partitions is specified at event hub creation time and
cannot be modified later (so care should be taken). Using the standard
Azure portal you can create between 2 and 32 partitions (default 4); however,
if required, you can create up to 1024 partitions by contacting Azure
support.
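Because the partition count is fixed at creation time, it is worth setting it explicitly when you create the hub. A sketch using the NamespaceManager from the Microsoft.ServiceBus library; the hub name and connection string are placeholder assumptions:

```csharp
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

var connectionString = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...";
var manager = NamespaceManager.CreateFromConnectionString(connectionString);

// PartitionCount can only be chosen here – it cannot be changed later
var description = new EventHubDescription("myeventhub") { PartitionCount = 8 };
await manager.CreateEventHubAsync(description);
```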
Partition Keys / Partition Ids
In the main image, you can see the concept of partition keys (represented
as red and orange dots). The partition key is a value you pass in the event
data for the purpose of event organisation (grouping events and making sure
they live in the same place). For example, you may want the data from all the
servers in a particular data centre to be stored in the same partition. If
you look carefully, the event publishers have no knowledge of the actual
partitions; they simply specify a partition key, and Event Hubs makes sure
events with the same partition key are stored in the same partition. This
decoupling of key and partition insulates the sender from needing to know
too much about the downstream processing and storage of events.
If you do not specify a partition key, the event hub will just store incoming
events in different partitions on a round-robin basis.
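Setting a partition key is just one extra property on the event. A sketch against the classic Microsoft.ServiceBus.Messaging SDK; the key value and hub name are illustrative assumptions:

```csharp
using System.Text;
using Microsoft.ServiceBus.Messaging;

var connectionString = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...";
var client = EventHubClient.CreateFromConnectionString(connectionString, "myeventhub");

var eventData = new EventData(Encoding.UTF8.GetBytes("{\"cpu\":87}"));
eventData.Properties["Source"] = "web-frontend"; // user-defined property bag

// All events carrying the same key land in the same partition;
// the sender never needs to know which partition that is.
eventData.PartitionKey = "datacentre-west-01";
await client.SendAsync(eventData);
```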
You can also send event data directly to a specific partition if required.
This is generally not good practice: you should either leave the placement
of events to the event hub in round-robin mode, or take advantage of
the partition key concept explained above, which is a bit more abstract,
and let the event hub take care of grouping related events together in the
relevant partitions.
However, if you have special needs and want to write events to a
particular partition, you can do so easily by using the partition id, as
shown in the below code snippet.
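A sketch of a partition-specific sender, again assuming the classic Microsoft.ServiceBus.Messaging SDK; partition "0" and the hub name are illustrative:

```csharp
using System.Text;
using Microsoft.ServiceBus.Messaging;

var connectionString = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...";
var client = EventHubClient.CreateFromConnectionString(connectionString, "myeventhub");

// Pin the send to partition "0" – this couples the sender to the
// partition layout, so prefer partition keys unless you really need it.
var partitionSender = client.CreatePartitionedSender("0");
await partitionSender.SendAsync(new EventData(Encoding.UTF8.GetBytes("pinned event")));
```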
Receivers (or consumers)
Any entity (application) that reads event data from an event hub is a
receiver (or consumer). As you can see from the main diagram above, there
can be multiple receivers for the same event hub, and they can read the same
partition data at their own pace. There is also an important thing to note:
event consumers connect only via AMQP (whereas on the publishing side both
HTTPS and AMQP are available). This is because events are pushed to the
consumer from the event hub over the AMQP channel; the client does not need
to poll for data availability. This model is important both for scalability
and to avoid each consumer writing its own logic for
checking new data availability and putting unnecessary load on the
platform.
Consumer Groups
The data stored in your event hub (in all partitions) is not going to change;
once it is stored in a partition it lives there until the
retention period elapses (at which point the event data is deleted).
The data can be consumed (read) by different consumers (applications)
based on their own requirements: some consumers want to read it carefully
only once, some consumers may go back and read historical data again
and again, and so on.
As you can see, there are varying requirements for each consumer; in order to
support this, Event Hubs uses consumer groups. A consumer group is simply
a view of the data in the entire event hub. The data in the event hub can
only be accessed via a consumer group; you cannot access the partitions
directly to fetch the data. When you create an event hub, a default
consumer group is also created. Azure also uses consumer groups as a
differentiating factor between pricing tiers (on the Basic tier you
cannot have more than 1 consumer group, whereas on the Standard tier you
can have up to 20 consumer groups).
If you go down the route of building your own direct receivers, you need to
take care of all the above points manually. Your code will look something
like this:
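A sketch of a direct receiver against the classic Microsoft.ServiceBus.Messaging SDK, reading a single partition from the default consumer group; partition "0" is illustrative and offset/lease management is left out:

```csharp
using System;
using System.Text;
using Microsoft.ServiceBus.Messaging;

var connectionString = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...";
var client = EventHubClient.CreateFromConnectionString(connectionString, "myeventhub");
var consumerGroup = client.GetDefaultConsumerGroup();

// One receiver per partition – you must track offsets, leases and
// failures yourself for every partition you want to read.
var receiver = await consumerGroup.CreateReceiverAsync("0");
while (true)
{
    var eventData = await receiver.ReceiveAsync(TimeSpan.FromSeconds(10));
    if (eventData == null) continue; // timed out, no new events yet
    Console.WriteLine(Encoding.UTF8.GetString(eventData.GetBytes()));
}
```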
This is going to be a repeated task for every consumer, and it requires a
level of knowledge to write the receivers in an efficient way. To address this
challenge, Azure Event Hubs comes with a higher-level, abstracted, intelligent
host/agent called the "Event Processor Host": you simply implement the
IEventProcessor interface and its 3 core methods, OpenAsync,
ProcessEventsAsync and CloseAsync. It takes an Azure blob storage parameter
to manage offsets/checkpoints against the various partitions. This is the
simplest way to consume events from event hubs. The new code will look like
this:
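A sketch of an Event Processor Host consumer (classic Microsoft.ServiceBus.Messaging SDK); the host name, hub name and both connection strings are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

var eventHubConnectionString = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...";
var storageConnectionString = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...";

// The host leases partitions via blob storage and balances them
// automatically across running instances.
var host = new EventProcessorHost(
    "host-1", "myeventhub", EventHubConsumerGroup.DefaultGroupName,
    eventHubConnectionString, storageConnectionString);
await host.RegisterEventProcessorAsync<SimpleEventProcessor>();

public class SimpleEventProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.FromResult(0);

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var eventData in messages)
            Console.WriteLine(Encoding.UTF8.GetString(eventData.GetBytes()));

        // Persist the offset to blob storage so a restart resumes from here
        await context.CheckpointAsync();
    }

    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.FromResult(0);
}
```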
Transport Protocols (HTTPS/AMQP)
From the main picture at the beginning, you can see there are 2 different
transport protocols used for communication with Event Hubs. On
the publisher (sender) side, you can either use HTTPS if you are going to
push a low volume of events to the event hub, or AMQP for high-throughput,
better-performance scenarios. On the consuming side, since Event Hubs uses
a push-based model to deliver events to listeners/receivers, AMQP is the
only option.
Throughput Units
Throughput units are the basis of how you scale the traffic coming into and
going out of Event Hubs, and they are one of the key pricing
parameters. Throughput units are purchased at the event hub namespace
level and apply to all the event hubs in a given namespace. The
other important thing to keep in mind is that a single partition can use at
most one throughput unit, so it makes no sense to have more throughput
units than the total number of partitions you have. Example:
Namespace: IOTRECEIVER
Event Hub #1 : 4 partitions
Event Hub #2 : 2 partitions
In the above case, there is no point in having more than 6 throughput units
for the namespace IOTRECEIVER. You probably need only 1 or 2 throughput
units, unless you have a huge volume of traffic.
A throughput unit handles up to 1 MB or 1,000 events (event data) per second
on the publishing side and 2 MB per second on the consuming side. Think of
throughput units like pipes: if you need more water to flow through, you get
more pipes. The capacity of a pipe is fixed and it can only carry so much
water, so if you need to fill the tank faster or draw more water from it, the
only way to do it is to add more pipes. Every pipe you add is
going to cost you money.
Summary
We have now covered all the building blocks we saw in the original picture
at the beginning of the article. Hopefully, this article has given you all
the bits and pieces required to understand Azure Event Hubs. There are still
a few topics I would like to cover, including the security mechanism,
metrics data, how consumer groups work, etc.; I'll try to cover them in
future articles.