Getting started with real-time
analytics with Kafka and
Spark in Microsoft Azure
Joe Plumb
Cloud Solution Architect – Microsoft UK
@joe_plumb
Alternative title: Everything I
know about real time analytics
in Microsoft Azure
Joe Plumb
Cloud Solution Architect – Microsoft UK
@joe_plumb
Agenda
• Fundamentals of streaming data
• What streaming data can be useful for
• What options are there to use data streams in Microsoft Azure?
• Demo
• Q&A
Streaming 101
What is streaming data?
• “Streaming data is data that is continuously generated by different sources.” https://en.wikipedia.org/wiki/Streaming_data
• Streaming system: a type of data processing engine that is designed with infinite datasets in mind. https://learning.oreilly.com/library/view/streaming-systems/9781491983867/ch01.html
Why bother?
• Batch processing can give great insights into things that happened in
the past, but it lacks the ability to answer the question of "what is
happening right now?”
• “Data is valuable only when there is an easy way to process and get
timely insights from data sources.”
Where is streaming data?
• Clickstream data (web clicks, advertising, application usage tracking)
• Sensors (environment monitoring, GPS)
• Smart machinery (e.g. production lines)
• ….
What is it good for?
• Website monitoring
• Network monitoring
• Fraud detection
• Recommendations
Streaming System architecture
Source: https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/real-time-processing
Event vs Message
• It could be argued this is an issue of semantics, as they ‘look’ the same (e.g. a JSON object, CSV, etc.)
• Message is a catch-all term, as messages are just bundles of data
• Event message is a type of message
“When a subject has an event to announce, it will create an event object, wrap it in a message, and send it on a channel.”
https://www.enterpriseintegrationpatterns.com/patterns/messaging/EventMessage.html
It’s all about time
Cardinality is important because the unbounded nature of infinite datasets imposes additional burdens on data processing frameworks that consume them.
We need ways to reason about time
It’s all about time: Event time vs Processing time
• Event time: the time the event occurs
• Processing time: the time the system becomes aware of the event
• In an ideal world, the processing system receives the event when it happens.
• In reality, the skew between an event happening and the system processing that event can vary wildly
It’s all about time: Event time vs Processing time
• In an ideal world, the processing system receives the event when it happens.
• In reality, the skew between an event happening and the system processing that event can vary wildly
• Processing-time lag is the delay between the time an event occurs and the time the system processes it
• Event-time skew is how far behind the ideal (event time equals processing time) the processing pipeline is at that moment
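The lag idea above can be sketched in a few lines of Python; the event records and time units here are illustrative:

```python
def processing_time_lag(event_time, processing_time):
    # Delay between when the event occurred and when the system
    # actually processed it (same time unit for both, e.g. seconds).
    return processing_time - event_time

def max_lag(events):
    # Worst-case lag across a batch of (event_time, processing_time) pairs.
    return max(processing_time_lag(e, p) for e, p in events)

# A pipeline that sees three events 2s, 5s, and 1s after they occurred:
observed = [(100, 102), (103, 108), (110, 111)]
```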
It’s all about time: Watermarking
• An event time marker that indicates all events up to “a point” have
been fed to the streaming processor. By the nature of streams, the
incoming event data never stops, so watermarks indicate the progress
to a certain point in the stream.
• Watermarks can either be a strict guarantee (perfect watermark) or
an educated guess (heuristic watermark)
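A heuristic watermark can be sketched in plain Python; the allowed-lateness bound is an assumption you would tune per pipeline:

```python
def heuristic_watermark(observed_event_times, allowed_lateness):
    # Educated guess: assume no event will arrive more than
    # `allowed_lateness` time units behind the newest one seen so far.
    return max(observed_event_times) - allowed_lateness

def is_late(event_time, watermark):
    # Events behind the watermark are "late" and may be dropped or
    # routed to special handling by the streaming processor.
    return event_time < watermark
```

A perfect watermark would instead require complete knowledge of the input, which is rarely available in practice.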
It’s all about time: Windowing
Tumbling windows
Hopping windows
Sliding windows
Session windows
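To illustrate the difference between tumbling and hopping windows, here is a minimal Python sketch that assigns integer event times to windows (the window sizes are illustrative):

```python
from collections import defaultdict

def tumbling_counts(event_times, size):
    # Tumbling: fixed, non-overlapping windows; each event falls in
    # exactly one window, keyed by its start time.
    counts = defaultdict(int)
    for t in event_times:
        counts[(t // size) * size] += 1
    return dict(counts)

def hopping_counts(event_times, size, hop):
    # Hopping: windows of length `size` starting every `hop` units;
    # when hop < size the windows overlap, so one event can land in
    # several windows.
    counts = defaultdict(int)
    for t in event_times:
        start = (t // hop) * hop
        while start > t - size:
            if start >= 0:
                counts[start] += 1
            start -= hop
    return dict(counts)
```

A sliding window is the limiting case where a new window starts at every event; a session window instead closes after a gap of inactivity.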
It’s not just about time: Triggers
• Triggers determine when processing of the accumulated data is started.
• Repeated update triggers
• These periodically generate updated panes for a window as its contents
evolve.
• Completeness triggers
• These materialize a pane for a window only after the input for that window is
believed to be complete to some threshold
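A repeated-update trigger can be sketched by re-emitting early panes for a window as events accumulate; the count-based firing condition here is an illustrative choice (time-based firing is equally common):

```python
from collections import defaultdict

def repeated_update_panes(events, window_size, fire_every):
    # Maintain running per-window counts and emit a snapshot ("pane")
    # every `fire_every` events, instead of waiting for the window's
    # input to be believed complete.
    counts = defaultdict(int)
    panes = []
    for i, event_time in enumerate(events, start=1):
        counts[(event_time // window_size) * window_size] += 1
        if i % fire_every == 0:
            panes.append(dict(counts))
    return panes
```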
Delivery Guarantees
• At-most-once
• means that for each message handed to the mechanism, that message is
delivered zero or one times; in more casual terms it means that messages
may be lost.
• At-least-once
• means that for each message handed to the mechanism potentially multiple
attempts are made at delivering it, such that at least one succeeds; again, in
more casual terms this means that messages may be duplicated but not lost.
• Exactly-once
• means that for each message handed to the mechanism exactly one delivery
is made to the recipient; the message can neither be lost nor duplicated.
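In practice, exactly-once processing is often built on top of an at-least-once channel by deduplicating on a message id, as in this sketch (the id scheme and in-memory storage are assumptions):

```python
class DedupConsumer:
    """Make at-least-once delivery effectively-once by remembering
    which message ids have already been handled."""

    def __init__(self):
        self.seen_ids = set()
        self.processed = []

    def handle(self, message_id, payload):
        if message_id in self.seen_ids:
            return False  # redelivered duplicate: ignore it
        self.seen_ids.add(message_id)
        self.processed.append(payload)
        return True
```

A real consumer would keep the seen-id set in durable, bounded storage so deduplication survives restarts.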
Streaming + Batch?
• Lambda architecture
• Increasingly viewed as a
workaround, due to advances in
capabilities and reliability of
streaming data systems
Source: https://en.wikipedia.org/wiki/Lambda_architecture#/media/File:Diagram_of_Lambda_Architecture_(generic).png (by Textractor, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=34963985)
Service Options
in Azure
Event Hubs
• Fully managed platform-as-a-service (PaaS)
• Big data streaming platform and event ingestion service.
• It can receive millions of events per second. Data sent to an event hub can be
transformed and stored by using any real-time analytics provider or
batching/storage adapters.
• Wide range of use cases
• Scalable
• Event Hubs for Apache Kafka (a Kafka-compatible endpoint)
• Data can be captured automatically in either Azure Blob Storage or Azure
Data Lake Store
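Because Event Hubs exposes a Kafka-compatible endpoint, a plain Kafka client can publish to it. The sketch below builds an illustrative JSON event and shows, without running it, the kafka-python producer configuration; the event schema and the namespace/hub names are assumptions:

```python
import json
import time
import uuid

def make_event(device_id, temperature):
    # Illustrative telemetry payload; the field names are assumptions.
    return json.dumps({
        "id": str(uuid.uuid4()),
        "device_id": device_id,
        "temperature": temperature,
        "event_time": time.time(),
    }).encode("utf-8")

def send_to_event_hub(namespace, connection_string, event_hub, payload):
    # Requires the kafka-python package. The Event Hubs Kafka endpoint
    # listens on port 9093 over SASL_SSL, with "$ConnectionString" as
    # the username and the connection string itself as the password.
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=f"{namespace}.servicebus.windows.net:9093",
        security_protocol="SASL_SSL",
        sasl_mechanism="PLAIN",
        sasl_plain_username="$ConnectionString",
        sasl_plain_password=connection_string,
    )
    producer.send(event_hub, payload)  # the hub name acts as the topic
    producer.flush()
```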
Stream Analytics
• Event-processing engine that allows you to examine high volumes of data streaming from devices.
• Supports extracting information from data streams and identifying patterns and relationships.
• These patterns can then trigger other actions downstream, such as creating alerts, feeding information to a reporting tool, or storing it for later use
Integration with Azure Event Hubs and IoT Hub
• Azure Stream Analytics has built-in, first-class integration with Azure Event Hubs and IoT Hub
• Data from Azure Event Hubs and Azure IoT Hub can be sources of streaming data to Azure Stream Analytics
• The connections can be established through the Azure Portal without any coding
• Azure Blob Storage is supported as a source of reference data
• Azure Stream Analytics supports compression across all data stream input sources: Event Hubs, IoT Hub, and Blob Storage
[Diagram: streaming data from Azure Event Hubs and Azure IoT Hub, plus reference data from Azure Blob Storage, flowing into Azure Stream Analytics]
Azure HDInsight
Cloud Spark and Hadoop service for the Enterprise
• Fully managed Hadoop and Spark for the cloud, with a 99.9% SLA
• 100% open-source Hortonworks Data Platform
• Clusters up and running in minutes
• Familiar BI tools, interactive open-source notebooks
• 63% lower TCO than deploying your own Hadoop on-premises*
• Scale clusters on demand
• Secure Hadoop workloads via Active Directory and Ranger
• Compliance for open-source bits
• Best-in-class monitoring and predictive operations via OMS
• Native integration with leading ISVs
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Azure HDInsight
Apache Storm on HDInsight
Apache Storm offered as a managed service on Azure HDInsight
• Scalable: can analyse millions of events per second
• One of seven HDInsight cluster types
• Dynamically scale up and scale down
• Integrates with Event Hubs
• SLA of 99.9 percent uptime
• Develop with Visual Studio using Java or C#
Azure Databricks
• Apache Spark-based analytics platform optimized for Microsoft Azure. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace for analytics.
Spark Structured Streaming overview
A unified system for end-to-end, fault-tolerant, exactly-once, stateful stream processing. The simplest way to perform streaming analytics is not having to think about streaming at all!

Develop
• Unifies streaming, interactive, and batch queries; uses a single API for both static bounded data and streaming unbounded data
• Supports streaming aggregations, event-time windows, windowed grouped aggregation, and stream-to-batch joins
• Features streaming deduplication, multiple output modes, and APIs for managing and monitoring streaming queries
• Also supports interactive and batch queries: aggregate data in a stream, then serve it using JDBC; change queries at runtime; build and apply machine learning models
• Built-in sources: Kafka, file source (JSON, CSV, text, and Parquet)
• App development in Scala, Java, Python, and R

Continuous applications
Applications that need to interact with batch data, interactive analysis, machine learning, and more.
[Diagram: a pure streaming system runs an input stream through a streaming computation into an output sink, with transactions and interactions with other systems left to the user; a continuous application runs the input stream through a continuous application into the output sink, consistent with ad-hoc queries, batch jobs, and static data]
Demo – Event Hubs and Stream Analytics
What we’re looking at
• Python Flask app using kafka-python
• Event Hubs (Kafka-enabled)
• Stream Analytics running a simple tumbling window
• Power BI
All running in Azure
So… what do I use?
INGESTION SERVICES – A COMPARISON
A side-by-side comparison of the capabilities and features

| | HDInsight (Apache Kafka) | Azure Event Hubs | Azure IoT Hub |
|---|---|---|---|
| Open source | Yes | No | No |
| Serverless service | No | Yes | Yes |
| Hybrid (cloud and on-prem) | Yes | No | No |
| Protocols supported | HTTP REST | AMQP, HTTP REST | MQTT, AMQP, HTTPS, and the Azure IoT protocol gateway for custom protocols |
| Replication and reliability | Manually configured with tools like MirrorMaker | Relies on underlying Azure Blob Storage | Relies on underlying Azure Blob Storage |
| SLA | 99.9% | 99.9% | 99.9% |
| Scaling | Limited by number and type of nodes in the HDInsight cluster provisioned | Limited by number of Throughput Units | Limited by number of IoT Hub Units |
| Throttling | No explicit throttling | Yes, when TU limits are reached | Yes, when IoT Hub Unit limits are reached |
| Message size | No limits | 1 MB | 256 KB |
| Message ordering | Yes, ordered within a partition | Yes, ordered within a partition | Yes, within a partition |
| Long-term storage | Can automatically store on Azure Managed Disks; the number of disks has to be explicitly specified during cluster creation | Can automatically store in Azure Blob Storage or Azure Data Lake Store | Can automatically store in Azure Blob Storage (using ABS as an endpoint) |
COMPARING STREAMING ANALYTICS SERVICES (1/2)
A side-by-side comparison of the capabilities and features

| | HDInsight (Apache Storm) | Azure Stream Analytics | Spark Streaming (Azure Databricks) |
|---|---|---|---|
| Open source | Yes | No | Yes |
| Serverless service | No | Yes | No |
| Hybrid (cloud and on-prem) | Yes | No | Yes |
| Exactly-once processing | No (cannot distinguish between new events and replays) | Yes | Yes |
| SQL as query language | No | Yes | Yes |
| Unified programming model | No | No | Yes – combines batch, interactive, machine learning, and streaming |
| Extensibility | Yes, custom code in Java, C#, etc. | No (partial support with JavaScript UDFs) | Yes, custom code in Java, Scala, Python |
| Windowing support | No, needs Trident for tumbling windows | Yes – sliding, hopping, and tumbling windows | Yes – sliding and tumbling windows |
| Azure ML integration | No built-in support; a trained model can be invoked through custom Storm Bolts | Yes; published Azure Machine Learning models can be configured as functions during job creation | No |
| Kafka integration | Yes, Kafka Spout available | No | Yes, Kafka connector available |
COMPARING STREAMING ANALYTICS SERVICES (2/2)
A side-by-side comparison of the capabilities and features

| | Apache Storm on HDInsight | Azure Stream Analytics | Spark Structured Streaming (Azure Databricks) |
|---|---|---|---|
| Pricing model | Pay for number and type of nodes in the HDInsight cluster and duration of use | Pay per Streaming Unit (SU) | Pay for number and type of nodes in the cluster and duration of use |
| Scaling model | Limited by number and type of nodes in the cluster provisioned | Limited by number of Streaming Units (SU); each SU = 1 MB/sec, with a maximum of 50 SUs and an upper limit of 1 GB/sec | Limited by number and type of nodes in the cluster provisioned |
| Input data format | Can be anything; custom code is needed to parse it | Avro, CSV, JSON | Text, CSV, JSON, Parquet |
| Input data sources | Can be anything, but needs custom code | Azure Event Hubs, Azure Blob Storage, Azure IoT Hub | File source, Kafka, socket (for testing) |
| Output data sinks | Azure Event Hubs, Azure Blob Storage, Azure Tables, Azure Cosmos DB, Azure SQL DB, Power BI, Azure Data Lake Store, HBase, custom | Azure Event Hubs, Azure Blob Storage, Azure Tables, Azure Cosmos DB, Azure SQL DB, Power BI, Azure Data Lake Store | Console, Kafka, memory, ForeachSink |
| Reference data | No limits on data size; connectors available for HBase, Azure Cosmos DB, Azure SQL DB, and custom sources | Azure Blob Storage only, with a maximum of 100 MB for the in-memory lookup cache | No limits on data size; can be stored in any source supported by Apache Spark |
| Dev experience | Users using .NET can develop, debug, and monitor through Visual Studio | Users can create, debug, and monitor jobs through the Azure portal, using sample data derived from a live stream | Use Azure Databricks notebooks |
Further reading
Hands on with Event Hubs and Python
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-python
Hands on with streaming ETL with Azure Databricks
https://medium.com/microsoftazure/an-introduction-to-streaming-etl-on-azure-databricks-using-structured-streaming-databricks-16b369d77e34
Choosing the right service(s) for your use case
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing
Further reading
http://shop.oreilly.com/product/0636920073994.do
Questions?
We'd love your feedback!
aka.ms/SQLBits19
Thanks!
Joe Plumb
Cloud Solution Architect – Microsoft UK
@joe_plumb