Building Real-Time
Analytics Applications
Operational Workflows
with Apache Druid
Darin Briskman
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Building Real-Time Analytics Applications, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Imply. See our statement of
editorial independence.
978-1-098-14656-6
Building Real-Time Analytics Applications
Leading organizations like Netflix, Target, Salesforce, and the Wikimedia Foundation1 have recognized the opportunity and value of bringing together analytics and applications in a new way. Instead of infrequent analytics queries on historical data, their developers are building a new type of application that queries everything from terabytes to petabytes of real-time and historical data with subsecond response times under load.
This new approach is the real-time analytics application, and it’s formed at the intersection of the analytics and application paradigms, with technical requirements that bring the scale of data warehouses to the speed of transactional databases (Figure 1).
Netflix
How can we be confident that updates are not harming our users? And
that we’re actually making measurable improvements when we intend
to?
—Ben Sykes, software engineer, Netflix2
2 Ben Sykes, “How Netflix Uses Druid for Real-Time Insights to Ensure a High-Quality
Experience”, Netflix Technology Blog, March 3, 2020.
Netflix collects these measures and feeds them into the real-time
analytics database. Every measure is tagged with anonymized details
about the kind of device being used—for example, whether the
device is a smart TV, an iPad, or an Android phone. This enables
Netflix to classify devices and view the data according to various
aspects. This aggregated data is available immediately for querying,
either via dashboards or ad hoc queries. “We’re currently ingesting at over 2 million events per second,” Sykes says, “and querying over 1.5 trillion rows to get detailed insights into how our users are…”
Walmart
Druid has been designed to...enable exploration of real-time data and
historical data while providing low latencies and high availability.
—Kartik Khare, software engineer, Walmart Labs6
Confluent
We don’t shy away from high-cardinality data, which means we can find
the needle in the haystack. As a result, our teams can detect problems
before they emerge and quickly troubleshoot issues to improve the overall customer experience. The flexibility we have with Druid also means
we can expose the same data we use internally also to our customers,
giving them detailed insights into how their own applications are
behaving.
—Xavier Léauté and Zohreh Karimi, lead engineers, Confluent9
Data Sources
Data needs to come from somewhere. At least some of the data will
be continuously generated, but some data may come from static
sources.
For example, a real-time analytics application that places internet advertising may combine real-time sources (real-time bidding auctions, user clickstreams) with historical data (effectiveness of past advertising) and external data that is periodically updated (census data, demographic data).
11 Zohreh Karimi and Harini Rajendran, “Scaling Apache Druid for Real-Time Cloud
Analytics at Confluent”, Confluent, November 8, 2021.
Visualization
If the real-time analytics application is to be used by humans, it will
probably need some sort of visualization engine to render data into
human-consumable graphics.
While there are many commercial and open source visualization tools (Grafana is probably the most widely used), most of them are designed for the speed of reporting analytics and cannot render visualizations fast enough to keep up with the needs of real-time analytics applications.
There are a few exceptions. Apache Superset is a large-scale data
exploration and visualization toolset created to work with the
Apache Druid database. D3 (Data-Driven Documents) is an open source JavaScript library for visualizing and analyzing data, commonly used to create high-performance visualization tools for real-time analytics applications (Figure 8).
Origins of Druid
In 2011, the data team at a technology company had a problem. It
needed to quickly aggregate and query real-time data coming from
website users across the internet to analyze digital advertising auctions. This created large data sets, with millions or billions of rows.
The data team first implemented its product using relational databases. It worked but needed many machines to scale, and that was too expensive.
The team then tried the NoSQL database HBase, which was populated from Hadoop MapReduce jobs. These jobs took hours to build the aggregations necessary for the product. At one point, adding only three dimensions on a data set that numbered in the low millions of rows took the processing time from 9 hours to 24 hours.
12 Eric Tschetter, “Introducing Druid: Real-Time Analytics at a Billion Rows Per Second”,
Druid, April 30, 2011.
High Performance
The features of Druid combine to enable high performance at high
concurrency by avoiding unneeded work. Pre-aggregated, sorted
data avoids moving data across process boundaries or across servers
and avoids processing data that isn’t needed for a query. Long-
running processes avoid the need to start new processes for each
query. Using indexes avoids costly reading of the full data set for
each query. Acting directly on encoded, compressed data avoids the
need to uncompress and decode. Using only the minimum data needed to answer each query avoids moving data from disk to memory and from memory to CPU.
In a paper published at the 22nd International Conference on Business Information Systems (May 2019), José Correia, Carlos Costa, and Maribel Yasmina Santos benchmarked the performance of Hive, Presto, and Druid using a TPC-H-derived test of 13 queries run against a denormalized star schema on a cluster of 5 servers, each with an Intel i5 quad-core processor and 32 GB memory.13 Druid performance was measured as greater than 98% faster than Hive and greater than 90% faster than Presto in each of six test runs.
13 José Correia, Carlos Costa, and Maribel Yasmina Santos, “Challenging SQL-on-Hadoop
Performance with Apache Druid”, RepositoriUM, June 2019.
Figure 10. Hive, Presto, and Druid results from Correia, Costa, and
Santos (2019)14
16 Danny Ruchman and Itai Yaffe, “Our Journey with Druid: From Initial Research to Full
Production Scale”, SlideShare, February 25, 2018.
Figure 12. Druid works with both stream ingestion and batch ingestion
The ingestion process both creates tables and loads data into them. All tables are always fully indexed, so there is no need to explicitly create or manage indexes.
When data is ingested into Druid, it is automatically indexed, partitioned, and, optionally, partially pre-aggregated, then stored as compressed, columnar segments.
Stream Ingestion
Druid was specifically created to enable real-time analytics of stream
data, which begins with stream ingestion. Druid includes built-in
indexing services for both Apache Kafka and Amazon Kinesis, so
additional stream connectors are not needed.
Stream data is immediately queryable by Druid as each event
arrives.
Supervisor processes run on Druid management nodes to manage
the ingestion processes, coordinating indexing processes on data
nodes that read stream events and guarantee exactly-once ingestion.
The indexing processes use the partition and offset mechanisms
found in Kafka and the shard and sequence mechanisms found in
Kinesis.
If there is a failure during stream ingestion (for example, a server or network outage on a data or management node), Druid automatically recovers and continues to ingest every event exactly once, including events that arrive during the outage. No data is lost.
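To make this concrete, here is a minimal sketch of submitting a Kafka ingestion supervisor to Druid over HTTP from Python. The topic name, broker address, column names, and Router URL are illustrative assumptions rather than values from this report, and the spec shows only a small subset of the available options.

    # A minimal sketch: submit a Kafka ingestion supervisor to Druid.
    # Topic, broker, columns, and URL are illustrative assumptions.
    import requests

    DRUID_URL = "http://localhost:8888"  # Router (or Overlord) endpoint; adjust for your cluster

    supervisor_spec = {
        "type": "kafka",
        "spec": {
            "ioConfig": {
                "type": "kafka",
                "topic": "clickstream",  # hypothetical topic name
                "consumerProperties": {"bootstrap.servers": "kafka:9092"},
                "inputFormat": {"type": "json"},
                "useEarliestOffset": True,
            },
            "dataSchema": {
                "dataSource": "clickstream",  # the Druid table to create and load
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["user_id", "url", "device_type"]},
                "granularitySpec": {"queryGranularity": "NONE", "segmentGranularity": "HOUR"},
            },
            "tuningConfig": {"type": "kafka"},
        },
    }

    resp = requests.post(f"{DRUID_URL}/druid/indexer/v1/supervisor",
                         json=supervisor_spec, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # returns the supervisor id on success

Once the supervisor is accepted, Druid coordinates the indexing tasks on data nodes, and newly arriving events are queryable as they are ingested.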
Batch Ingestion
It’s often useful to combine real-time stream data with historical data; loading that historical data is known as batch ingestion. Some Druid implementations use entirely historical files. Data from any source can take advantage of the interactive data conversations, easy scaling, high performance, high concurrency, and high reliability of Druid.
Druid usually ingests data from object stores—which include HDFS,
Amazon S3, Azure Blob, and Google Cloud Storage—or from local
storage. The datafiles can be in a number of common formats,
including JSON, CSV, TSV, Parquet, ORC, Avro, or Protobuf. Druid
can ingest directly from both standard and compressed files, using
formats including .gz, .bz2, .xz, .zip, .sz, and .zst.
The easiest way to ingest batch data is with a SQL statement, using INSERT INTO. Since this uses SQL, the ingestion can also include transformations and filtering of the data as it is loaded.
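As a sketch of what SQL-based batch ingestion can look like, the following posts an INSERT INTO statement to Druid’s SQL task endpoint (part of the multi-stage query engine in recent Druid releases) from Python. The S3 location, table name, and column names are illustrative assumptions.

    # A minimal sketch: SQL-based batch ingestion from an S3 file.
    # Bucket, table, and columns are illustrative assumptions.
    import requests

    DRUID_URL = "http://localhost:8888"  # Router endpoint; adjust for your cluster

    ingest_sql = """
    INSERT INTO ad_performance
    SELECT
      TIME_PARSE("timestamp") AS __time,
      campaign,
      region,
      clicks,
      cost
    FROM TABLE(
      EXTERN(
        '{"type":"s3","uris":["s3://example-bucket/ads/2023-01-01.json"]}',
        '{"type":"json"}',
        '[{"name":"timestamp","type":"string"},{"name":"campaign","type":"string"},{"name":"region","type":"string"},{"name":"clicks","type":"long"},{"name":"cost","type":"double"}]'
      )
    )
    PARTITIONED BY DAY
    """

    resp = requests.post(f"{DRUID_URL}/druid/v2/sql/task",
                         json={"query": ingest_sql}, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # returns a task id that can be polled until the ingestion completes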
Data Design
The less data you store, the faster your queries will run. How can
you minimize the database size while maintaining the precision you
need?
In Druid, data is ingested from files, tables, and streams and stored in tables. Columns that store the original source data are dimensions. Many data sets have high dimensionality, with hundreds or thousands of dimensions.
A useful option for reducing storage size and improving performance is data rollup: a pre-aggregation that combines rows with identical dimension values within an interval that you choose, such as “minute” or “day,” while adding aggregate columns, such as counts and totals. These aggregated columns are metrics.
For example, a data stream of website activity that gets an average of 400 hits per second could choose to store every event in its own table row, or it could use a rollup to aggregate the total number of page hits each minute, reducing the needed data storage by 24,000x. If desired, the disaggregated data can be kept in inexpensive deep storage for infrequent queries where individual events need to be queried.
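The following sketch shows the same rollup idea expressed as SQL: all events within each minute that share the same dimension values collapse into a single row carrying a count metric. Rollup is normally configured as part of ingestion; here it is written as an INSERT from a hypothetical raw table, and every table and column name is an assumption.

    # A minimal sketch: minute-level rollup expressed as SQL ingestion.
    # Table and column names are illustrative assumptions.
    import requests

    rollup_sql = """
    INSERT INTO site_traffic_by_minute
    SELECT
      FLOOR(__time TO MINUTE) AS __time,  -- the rollup interval
      page,
      device_type,
      COUNT(*) AS hits                    -- the metric created by the rollup
    FROM site_traffic_raw
    GROUP BY 1, 2, 3
    PARTITIONED BY DAY
    """

    resp = requests.post("http://localhost:8888/druid/v2/sql/task",
                         json={"query": rollup_sql}, timeout=30)
    resp.raise_for_status()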
It’s possible to create multiple metrics from the same data set, with
aggregations at different levels of granularity to support multiple
query requirements.
Metrics can also be created that use stochastic approximation techniques, usually called sketches (Figure 13). These enable very fast results, provably within 2% of the precise value, even across very large data sets. Accuracy actually improves as the data set grows! This is often useful for providing subsecond information on the total count of a fast-changing dimension (How many total players are active right now?).
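As an illustration, the query below uses Druid SQL’s APPROX_COUNT_DISTINCT, which is backed by a sketch, to estimate the number of distinct players per minute over the last hour. The table and column names are assumptions.

    # A minimal sketch: a sketch-backed approximate distinct count via Druid SQL.
    # Table and column names are illustrative assumptions.
    import requests

    query = """
    SELECT
      FLOOR(__time TO MINUTE) AS minute,
      APPROX_COUNT_DISTINCT(player_id) AS active_players
    FROM game_events
    WHERE __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR
    GROUP BY 1
    ORDER BY 1
    """

    resp = requests.post("http://localhost:8888/druid/v2/sql",
                         json={"query": query}, timeout=30)
    resp.raise_for_status()
    for row in resp.json():
        print(row)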
Interfaces
Most real-time analytics databases use SQL as the primary interface.
Either humans or machines can submit queries through APIs to
retrieve the data needed.
Druid supports a wide range of programming libraries for development, including Python, R, JavaScript, Clojure, Elixir, Ruby, PHP, Scala, Java, .NET, and Rust. In any of these languages, queries can be executed using either SQL commands (returning text results) or JSON-over-HTTP (returning JSON results).
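For comparison, here is a sketch of the JSON-over-HTTP form: a native timeseries query posted directly to a Druid Broker or Router. The datasource and column names are assumptions carried over from the earlier rollup sketch.

    # A minimal sketch: the native JSON-over-HTTP query API.
    # Datasource, columns, and interval are illustrative assumptions.
    import requests

    native_query = {
        "queryType": "timeseries",
        "dataSource": "site_traffic_by_minute",
        "granularity": "minute",
        "intervals": ["2023-01-01T00:00:00Z/2023-01-01T01:00:00Z"],
        "aggregations": [{"type": "longSum", "name": "hits", "fieldName": "hits"}],
    }

    resp = requests.post("http://localhost:8888/druid/v2/", json=native_query, timeout=30)
    resp.raise_for_status()
    print(resp.json())  # a list of {"timestamp": ..., "result": {"hits": ...}} entries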
Since Druid is both high performance and high concurrency, it’s a common pattern to use a microservices architecture, with many services and many instances of each service able to send queries and receive results without worrying about causing bottlenecks for other services.
Humans, though, also like pictures. It’s common to use visualization
tools with real-time analytics: open source tools like Grafana or
Apache Superset, or commercial tools like Tableau, Looker, Power
BI, or Domo (Figure 14). Sometimes, though, these tools can have
lengthy render times, so queries that return from the database in
under a second can still take 30 seconds or more to appear to the
user.
Security
Druid includes a number of features to ensure that data is always
kept secure. Druid administrators can define users with names and
passwords managed by Druid, or they can use external authorization systems through LDAP or Active Directory. Fine-grained role-
based access control allows each user to query only the data sources
(such as tables) to which they have been granted access, whether
using APIs or SQL queries (Figure 15).
Users belong to roles that are only able to use resources where the
role has been granted permission (Figure 16).
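As a rough sketch of how this can be automated with Druid’s basic security extension, the following creates a role, grants it read access to a single table, and assigns it to a user through the Coordinator’s druid-ext/basic-security APIs. The authorizer name, role, user, and table are assumptions, and exact paths and options can vary by Druid version.

    # A rough sketch: role-based access control via the druid-basic-security
    # extension. Authorizer name, role, user, and table are assumptions.
    import requests

    COORDINATOR = "http://localhost:8081"
    AUTHORIZER = "MyBasicAuthorizer"
    base = f"{COORDINATOR}/druid-ext/basic-security/authorization/db/{AUTHORIZER}"

    # Create a role and grant it read access to one datasource (table).
    requests.post(f"{base}/roles/analytics_reader", timeout=30).raise_for_status()
    permissions = [
        {"resource": {"name": "clickstream", "type": "DATASOURCE"}, "action": "READ"}
    ]
    requests.post(f"{base}/roles/analytics_reader/permissions",
                  json=permissions, timeout=30).raise_for_status()

    # Register the user with the authorizer and attach the role.
    requests.post(f"{base}/users/report_service", timeout=30).raise_for_status()
    requests.post(f"{base}/users/report_service/roles/analytics_reader",
                  timeout=30).raise_for_status()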
Resilience
Druid provides for both very high uptime (reliability) and zero data
loss (durability). It’s designed for continuous operations, with no
need for planned downtime.
A Druid cluster is a collection of processes. The smallest cluster can
run all processes on a single server (even a laptop). Larger clusters
group processes by function, with each server (“node”) focused on
running management, query, or data processes.
When a node becomes unavailable—from a server failure, a network
outage, a human error, or any other cause—the workload continues
to run on other nodes. During a data node outage, any segments
that are uniquely stored on that node are automatically loaded onto
other nodes from deep storage. The cluster continues to operate as
long as any nodes are available.
Replicas
Whenever data is ingested into Druid, a segment is created and a
replica of that segment is created on a different data node. These
replicas are used to both improve query performance and provide
an extra copy of data for recovery. If desired, for a higher level of
performance and resiliency, additional replicas of each segment can
be created.
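If a higher replica count is wanted for a particular table, one way to express it is a load rule posted to the Coordinator, sketched below. The datasource name, tier name, and replica count are assumptions.

    # A minimal sketch: keep two replicas of every segment of one datasource
    # by posting a load rule to the Coordinator. Names are assumptions.
    import requests

    COORDINATOR = "http://localhost:8081"

    rules = [
        {
            "type": "loadForever",
            "tieredReplicants": {"_default_tier": 2},  # two copies on the default tier
        }
    ]

    resp = requests.post(f"{COORDINATOR}/druid/coordinator/v1/rules/clickstream",
                         json=rules, timeout=30)
    resp.raise_for_status()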
Continuous backup
As each data segment is committed, a copy of the data is written to deep storage, a durable object store. Common options for deep storage are cloud object storage or HDFS. This prevents data loss even if all replicas of a data segment are lost, such as in the loss of an entire data center.
It is not necessary to perform traditional backups of a Druid cluster. Deep storage provides an automatic continuous backup of all committed data.
Rolling upgrades
Upgrading Druid does not require downtime. Moving to a new version uses a “rolling” upgrade—one node at a time is taken offline, upgraded, and returned to the cluster. Druid can function normally during the rolling upgrade, even though some nodes will be using a different Druid version from other nodes during the upgrade.
Conclusion
If you’ve read this far, you’ve seen how real-time analytics applications are driving both insights and action from fast-moving data and how Apache Druid is designed to support real-time analytics applications. Perhaps you’ve had some ideas about how you can use it yourself.
How will you build a real-time analytics application to drive success
for your team? Whether you need interactive conversations with
data, subsecond response for queries from large data sets, high-
concurrency solutions for difficult-to-predict user communities, a
combination of real-time streaming data and historical context data,
or high-performance machine learning, the tools you need to make
it happen are now available and already being deployed worldwide.
The only limit is your imagination!
Further Resources
As a global open source project, Druid has a strong community,
with mailing lists, forums, discussion channels, and other resources.
You can also find information on Druid at imply.io, with a number
of blog articles and other useful resources.
If you want to try building your own real-time analytics application, a good place to start is a free trial of Imply Polaris, which provides Apache Druid as a service, plus quickstarts for push ingestion, visualization, and more.
About the Author
Darin Briskman is director of technology at Imply, where he helps
developers create real-time analytics applications. He began his
career at NASA in the 1980s, and has been working with large and
interesting data sets ever since. Most recently, he’s had various technical and leadership roles at Couchbase, Amazon Web Services, and
Snowflake. When he’s not writing code, Darin likes to juggle, blow
glass, and help children on the autism spectrum learn to use their
special abilities to work better with the neuronormative.