Lambda Architecture
Even when faced with the same pressures, people will approach an idea in different ways.
When Jay Kreps was developing Kafka at LinkedIn, he called it The Log. Facebook (being
Facebook) created several independent implementations of “stream-oriented processing”,
including Puma and TailerSwift. Twitter has the adorably named Summingbird. The jargon we
seem to be converging on for these kinds of systems is the Lambda Architecture.
Lambda Origin
In his book, Big Data: Principles and best practices of scalable real-time data systems, Nathan
Marz coined the term ‘Lambda Architecture’ to describe a generic, scalable and fault-tolerant
data processing architecture based on his experience in working on distributed systems at
BackType and Twitter.
Lambda in a Nutshell
The gist of the Lambda Architecture is to model everything that goes on in a complex
computing system as an ordered, immutable log of events. Processing the data (say, totaling
up the number of website visitors) is completed as a series of transformations that output to
new tables or streams.
It is important to keep the input unchanged. By breaking data processing into independent
pieces, each with a defined input and output, you get closer to the ideal of purely functional
programming. Writing and testing each piece is made simpler and parallelization can be
automated. Parts of the dataflow can be replayed (say, when code changes or machines fail)
and tied together with other flows.
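As a minimal sketch of this idea, assume a hypothetical PageView event and a small in-process log (neither comes from any particular system). Each transformation is a pure function that reads the log and returns a new table, so a code change is handled by replaying the same unchanged input.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class PageView(NamedTuple):
    # Hypothetical event shape; the field names are assumptions.
    seq: int      # position of the event in the log
    visitor: str
    page: str


# The log is an ordered, immutable tuple of immutable events.
LOG: Tuple[PageView, ...] = (
    PageView(0, "alice", "/home"),
    PageView(1, "bob", "/home"),
    PageView(2, "alice", "/pricing"),
    PageView(3, "alice", "/home"),
)


def views_per_page(events: Iterable[PageView]) -> Dict[str, int]:
    """First version of the transformation: count every page view."""
    return dict(Counter(e.page for e in events))


def visitors_per_page(events: Iterable[PageView]) -> Dict[str, int]:
    """Revised version: count each visitor once per page."""
    unique = {(e.page, e.visitor) for e in events}
    return dict(Counter(page for page, _ in unique))


# Each call writes to a new table; the log itself is never modified.
table_v1 = views_per_page(LOG)
# After the code change, replay the same log through the new function.
table_v2 = visitors_per_page(LOG)
```

Because neither function touches its input, both tables can be rebuilt at any time from the same log, and each function can be tested in isolation.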
This sequenced approach is a nice property to have as it retains data integrity and simplifies
troubleshooting. A long time ago, people who did 3D modeling would “carve” digital blocks
into the shapes they wanted. If they wanted to undo something 10 steps back, they were
largely out of luck. Then 3DStudio introduced a brilliant feature it called the “transform stack”.
The stack records every change to an object separately, and applies them in real time. This
allows the modeler to modify, add, remove, and even reorder their changes on the fly. A
sequenced approach to data pipelines is similar, providing a nifty solution for data reprocessing
when changes to code occur.
So far, this is simply good data engineering hygiene. Any well-run batch processing or
map/reduce system will follow the same principles. There’s nothing special about stream
processing that makes immutable data flows work better.
Lambda Architecture Diagram - http://lambda-architecture.net/
Lambda is an old and venerable technique. Document search engines of a certain age (e.g.,
Yahoo’s Vespa) often have a “slow” index that is compact but difficult to update. To
compensate they will also have a “fast” index, perhaps in memory, where changes are cached
until the next index rebuild. Under the hood a search will consult both indexes and merge the
results.
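The two-index pattern can be sketched as follows. This is an illustration only, not how Vespa or any other engine is implemented; slow_index, fast_index, add_document, search and rebuild are hypothetical names, and both indexes are plain in-memory dictionaries.

```python
from typing import Dict, List, Set

# Hypothetical inverted indexes mapping a term to a set of document ids.
slow_index: Dict[str, Set[int]] = {"lambda": {1, 2}, "kafka": {3}}  # rebuilt rarely
fast_index: Dict[str, Set[int]] = {}                                 # recent changes only


def add_document(doc_id: int, text: str) -> None:
    """New documents go to the fast index only, so writes stay cheap."""
    for term in text.lower().split():
        fast_index.setdefault(term, set()).add(doc_id)


def search(term: str) -> List[int]:
    """Consult both indexes and merge the results."""
    term = term.lower()
    merged = slow_index.get(term, set()) | fast_index.get(term, set())
    return sorted(merged)


def rebuild() -> None:
    """Periodic rebuild: fold the cached changes into the slow index."""
    for term, doc_ids in fast_index.items():
        slow_index.setdefault(term, set()).update(doc_ids)
    fast_index.clear()
```

A search returns the same answer before and after a rebuild; only the index that serves the recent documents changes.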
The problem is, the Lambda Architecture was an evolution on top of the slower batched index.
It is not certain that you would do it that way if you were building from scratch. Lucene, for
example, uses an incremental index for everything. Jay Kreps, in a thoughtful critique of
Lambda, points out that you need two implementations of the same queries and data flow.
And of course, you need two copies of the data. If you had a better streaming system, one that
could “read a table” simply by replaying a stream, why would you need both kinds of system?
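To illustrate what "reading a table by replaying a stream" could mean, here is a minimal sketch. The change-event format (an operation, a key and a value) and the replay function are assumptions made for the example, not the API of Kafka or any other system.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Hypothetical change event: (operation, key, value).
Change = Tuple[str, str, Optional[Any]]

CHANGES: Tuple[Change, ...] = (
    ("put", "alice", 1),
    ("put", "bob", 1),
    ("put", "alice", 2),
    ("delete", "bob", None),
)


def replay(changes: Iterable[Change]) -> Dict[str, Any]:
    """Rebuild the table by applying every change in stream order."""
    table: Dict[str, Any] = {}
    for op, key, value in changes:
        if op == "put":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table


# The current table is the result of replaying the whole stream.
current = replay(CHANGES)
# Replaying a prefix gives the table as it looked at that point.
earlier = replay(CHANGES[:2])
```

In this model the table is a derived view: the stream is the only copy of the data that has to be kept.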
Rethinking the Lambda Architecture
Most companies have responded to the influx of data by adapting their data management
strategy. However, managing streaming data still poses challenges for many enterprises.
Complicating the matter further, most enterprises need instant access to both historical and
real-time data, which requires specific considerations and solutions. Of the many approaches to
managing real-time and historical data concurrently, the Lambda Architecture is by far the
most talked about and widely accepted today.
Many Internet-scale companies, like Pinterest, Zynga, Akamai, and Comcast, are using a
memory-optimized database to achieve the high-speed data component of the Lambda
Architecture. These companies are splitting the input stream to push data into both an in-
memory database and a data lake, like HDFS, in parallel.
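A minimal sketch of this split, under stated assumptions: HotStore and LakeStore are hypothetical stand-ins for the in-memory database and the data lake, not MemSQL or HDFS clients, and the sketch writes to the two sinks one after the other rather than truly in parallel.

```python
from typing import Dict, Iterable, List


class HotStore:
    """Stand-in for the in-memory database: keeps a running aggregate."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def write(self, event: Dict[str, str]) -> None:
        page = event["page"]
        self.counts[page] = self.counts.get(page, 0) + 1


class LakeStore:
    """Stand-in for the data lake: appends every raw event unchanged."""

    def __init__(self) -> None:
        self.records: List[Dict[str, str]] = []

    def write(self, event: Dict[str, str]) -> None:
        self.records.append(dict(event))  # store a copy of the raw event


def split_stream(events: Iterable[Dict[str, str]], sinks: list) -> int:
    """Send each incoming event to every sink; return the number of events."""
    delivered = 0
    for event in events:
        for sink in sinks:
            sink.write(event)
        delivered += 1
    return delivered
```

For example, passing a list of page-view events and the sinks [HotStore(), LakeStore()] leaves per-page counts in the first store and the full raw history in the second.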
In this era of ubiquitous big data, it is not enough for companies to merely process data.
Analyzing data to detect patterns, which can be immediately applied to maximizing operational
efficiency, is the real driver of business value.
MemSQL offers a complete solution: the ability to handle millions of transactions per second
while performing complex multi-table join queries. Let’s dig into some of the key innovations
that make MemSQL an ideal solution for simplifying the Lambda Architecture.
Scalability
MemSQL uses a distributed shared nothing architecture that scales on commodity hardware
and local storage, supporting petabytes of data. MemSQL is a memory-first, relational database
that also offers a disk-based columnstore. In-memory optimization provides high-speed data
ingestion while simultaneously delivering analytics on the changing data set. The disk-based
columnstore provides historical data management and access to historical data trends to
leverage in combination with the “hot” data to deliver real-time analytics.
Multi-model, Multi-mode
Full ANSI SQL support makes MemSQL readily accessible to data analysts, business analysts
and data scientists, reducing application code requirements. Plugging data visualization and
query tools into the analytics architecture delivers immediate value from data to the business.
MemSQL also extends SQL with JSON support. Traversing a JSON document uses familiar SQL
syntax, with extensions for navigating the key-value pairs.
Lambda In Production
In this section, we will take a look at examples from innovative companies using a Lambda
Architecture built for real-time data processing and exploration.
Diagram: data latency tiers
~ 1 second: analysts query live data; alerts on complex objects
~ 30 minutes: optimize CDN efficiency
This enables Comcast to run real-time analytics on massive, ever-changing datasets, while also
making their analytics infrastructure more performant. Instead of just logging all Xfinity data
and analyzing it hours or days later, Comcast has the power to get both viewership and
infrastructure monitoring metrics the moment they occur.
HDFS provides a quasi-infinite data store where they can run machine learning jobs and other
“offline” analytics.
Watch the Comcast team’s recorded session from Strata+Hadoop World to learn how
Comcast architected their Xfinity platform to work with millions of users, process enormous
volumes of data and, at the same time, perform advanced real-time analytics.
Tapjoy Powers its Mobile Ad Platform
Tapjoy, the mobile app industry’s leading mobile marketing automation and monetization
platform, is processing and analyzing real-time and historical data concurrently to power its ad
platform.
Tapjoy optimizes ad performance by taking advantage of the speed and scalability of in-
memory computing. With the processing power to run 60,000 queries at a response time of
less than ten milliseconds, Tapjoy is able to cross-reference user data and serve higher-
performing ads to more than 500 million global users.
Above is a diagram of Tapjoy’s database architecture. For a more detailed look and
explanation, watch the session by David Abercrombie, Principal Data Analytics Engineer at
Tapjoy, at the In-Memory Computing Summit.
Conclusion
The pace of data is not slowing. Applications of today are built with infinite data sets in mind.
As these real-time applications become the norm, and batch processing becomes a relic of the
past, digital enterprises will implement memory-optimized, distributed data systems to simplify
Lambda Architectures for real-time data processing and exploration.
By answering questions like these, you will have a clear starting point for where to improve
your existing data management system, and how to prepare for the applications you plan to
build. From there, you can narrow which technologies to try for a proof of concept. If you
need help along the way, we would love to hear from you. Send us an email at
[email protected] or give us a call at (855) 463-6775.