Migrating to Microservice Databases
From Relational Monolith to Distributed Data
Edson Yanaga
Editors: Nan Barber and Susan Conant
Production Editor: Melanie Yarbrough
Copyeditor: Octal Publishing, Inc.
Proofreader: Eliahu Sussman
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-97461-2
You can sell your time, but you can never buy it back. So the price of
everything in life is the amount of time you spend on it.
To my family: Edna, my wife, and Felipe and Guilherme, my two dear
sons. This book was very expensive to me, but I hope that it will help
many developers to create better software. And with it, change the
world for the better for all of you.
To my dear late friend: Daniel deOliveira. Daniel was a DFJUG leader
and founding Java Champion. He helped thousands of Java developers
worldwide and was one of those rare people who demonstrated how
passion can truly transform the world in which we live for the better. I
admired him for demonstrating what a Java Champion must be.
To Emmanuel Bernard, Randall Hauch, and Steve Suehring. Thanks
for all the valuable insight provided by your technical feedback. The
content of this book is much better, thanks to you.
Table of Contents

Foreword

1. Introduction
    The Feedback Loop
    DevOps
    Why Microservices?
    Strangler Pattern
    Domain-Driven Design
    Microservices Characteristics

2. Zero Downtime
    Zero Downtime and Microservices
    Deployment Architectures
    Blue/Green Deployment
    Canary Deployment
    A/B Testing
    Application State

4. CRUD and CQRS
    Consistency Models
    CRUD
    CQRS
    Event Sourcing

5. Integration Strategies
    Shared Tables
    Database View
    Database Materialized View
    Database Trigger
    Transactional Code
    Extract, Transform, and Load Tools
    Data Virtualization
    Event Sourcing
    Change Data Capture
Foreword
This is why Edson's book makes me happy. Not only does he discuss data in a microservices architecture, but he also discusses the evolution of this data. And he does all of this in a very pragmatic and practical manner. You'll be ready to use these evolution strategies as soon as you close the book. Whether you fully embrace microservices or just want to bring more agility to your IT system, expect more and more discussions on these subjects within your teams. Be prepared.
Emmanuel Bernard
Hibernate Team and Red Hat Middleware's data platform architect
CHAPTER 1
Introduction
compile our code to check if the syntax was correct. Sometimes the compilation took minutes, and when it was finished we had already lost the context of what we were doing before. The lead time1 in this case was too long. We improved when our IDEs featured on-the-fly syntax highlighting and compilation.
We can say the same thing for testing. We used to have a dedicated
team for manual testing, and the lead time between committing
something and knowing if we broke anything was days or weeks.
Today, we have automated testing tools for unit testing, integration
testing, acceptance testing, and so on. We improved because now we
can simply run a build on our own machines and check if we broke
code somewhere else in the application.
These are some of the numerous examples of how reducing the lead time generated better results in the software development process. In fact, we might consider that all the major improvements we had with respect to process and tools over the past 40 years were targeting the improvement of the feedback loop in one way or another. The current improvement areas that we're discussing for the feedback loop are DevOps and microservices.
DevOps
You can find thousands of different definitions regarding DevOps. Most of them talk about culture, processes, and tools. And they're not wrong. They're all part of this bigger transformation that is DevOps.
The purpose of DevOps is to make software development teams
reclaim the ownership of their work. As we all know, bad things
happen when we separate people from the consequences of their
jobs. The entire team, Dev and Ops, must be responsible for the out
comes of the application.
There's no bigger frustration for developers than watching their code stay idle in a repository for months before entering into production. We need to regain that bright gleam in our eyes from delivering something and seeing the difference that it makes in people's lives.
1 The amount of time between the beginning of a task and its completion.
We need to deliver software faster, and safer. But what are the excuses that we lean on to prevent us from delivering it?
After visiting hundreds of different development teams, from small
to big, and from financial institutions to ecommerce companies, I
can testify that the number one excuse is bugs.
We don't deliver software faster because each one of our software
releases creates a lot of bugs in production.
The next question is: what causes bugs in production?
This one might be easy to answer. The cause of bugs in production
in each one of our releases is change: both changes in code and in
the environment. When we change things, they tend to fall apart.
But we can't use this as an excuse for not changing! Change is part of our lives. In the end, it's the only certainty we have.
Let's try to make a very simple correlation between changes and bugs. The more changes we have in each one of our releases, the more bugs we have in production. Doesn't it make sense? The more we mix things in our codebase, the more likely it is that something gets screwed up somewhere.
The traditional way of trying to solve this problem is to have more
time for testing. If we delivered code every week, now we need two weeks, because we need to test more. If we delivered code every month, now we need two months, and so on. It isn't difficult to imagine that sooner or later some teams are going to deploy software into production only on anniversaries.
This approach sounds anti-economical. The economic approach for
delivering software in order to have fewer bugs in production is the
opposite: we need to deliver more often. And when we deliver more often, we're also reducing the number of things that change between one release and the next. So the fewer things we change between releases, the less likely it is for the new version to cause bugs in production.
And even if we still have bugs in production, if we only changed a
few dozen lines of code, where can the source of these bugs possibly
be? The smaller the changes, the easier it is to spot the source of the
bugs. And it's easier to fix them, too.
The technical term used in DevOps to characterize the amount of change between each release of software is batch size. So, if we had to coin just one principle for DevOps success, it would be this:

    Reduce your batch size to the minimum allowable size you can handle.
To achieve that, you need a fully automated software deployment pipeline. That's where the processes and tools fit together in the big picture. But you're doing all of that in order to reduce your batch size.
2 Just make sure to follow the tools' best practices and do not store sensitive information, such as passwords, in a way that unauthorized users might have access to it.
configurations by packaging application and environment into a single containment unit: the container. More specifically, the result of packaging application and environment in a single unit is called a virtual appliance. You can set up virtual appliances through VMs, but they tend to be big and slow to start. Containers take virtual appliances one level further by minimizing the virtual appliance size and startup time, and by providing an easy way of distributing and consuming container images.
Another popular tool is Vagrant. Vagrant currently does much more than that, but it was created as a provisioning tool with which you can easily set up a development environment that closely mimics your production environment. You literally just need a Vagrantfile, some configuration scripts, and with a simple vagrant up command, you can have a full-featured VM or container with your development dependencies ready to run.
Why Microservices?
Some might think that the discussion around microservices is about
scalability. Most likely it's not. Certainly we always read great things about the microservices architectures implemented by companies like Netflix or Amazon. So let me ask a question: how many companies in the world can be Netflix and Amazon? And following this question, another one: how many companies in the world need to deal with the same scalability requirements as Netflix or Amazon? The answer is that the great majority of developers worldwide are dealing with enterprise application software. Now, I don't want to underestimate Netflix's or Amazon's domain model, but an enterprise domain model is a completely wild beast to deal with.
So, for the majority of us developers, microservices is usually not
about scalability; it's all about, again, improving our lead time and
reducing the batch size of our releases.
But we have DevOps that shares the same goals, so why are we even
discussing microservices to achieve this? Maybe your development
team is so big and your codebase is so huge that it's just too difficult to change anything without messing up a dozen different points in your application. It's difficult to coordinate work between people in
a huge, tightly coupled, and entangled codebase.
With microservices, we're trying to split a piece of this huge monolithic codebase into a smaller, well-defined, cohesive, and loosely coupled artifact. And we'll call this piece a microservice. If we can identify some pieces of our codebase that naturally change together and apart from the rest, we can separate them into another artifact that can be released independently from the other artifacts. We'll improve our lead time and batch size because we won't need to wait for the other pieces to be ready; thus, we can deploy our microservice into production.
Strangler Pattern
Martin Fowler wrote a nice article regarding the monolith-first
approach. Let me quote two interesting points of his article:
    Almost all the cases where I've heard of a system that was built as a microservice system from scratch, it has ended up in serious trouble.
The challenge of choosing which piece of software is a good candidate for a microservice requires a bit of Domain-Driven Design knowledge, which we'll cover in the next section.
Domain-Driven Design
It's interesting how some methodologies and techniques take years to mature or to gain awareness among the general public. And Domain-Driven Design (DDD) is one of these very useful techniques that is becoming almost essential in any discussion about microservices. Why now? Historically we've always been trying to achieve two synergic properties in software design: high cohesion and low coupling. We aim for the ability to create boundaries between entities in our model so that they work well together and don't propagate changes to other entities beyond the boundary. Unfortunately, we're usually especially bad at that.
DDD is an approach to software development that tackles complex systems by mapping activities, tasks, events, and data from a business domain to software artifacts. One of the most important concepts of DDD is the bounded context, which is a cohesive and well-defined unit within the business model in which you define the boundaries of your software artifacts.
From a domain model perspective, microservices are all about boundaries: we're splitting a specific piece of our domain model that can be turned into an independently releasable artifact. With a badly defined boundary, we will create an artifact that depends too much on information confined in another microservice. We will also create another operational pain: whenever we make modifications in one artifact, we will need to synchronize these changes with another artifact.
We advocate for the monolith-first approach because it allows you
to mature your knowledge around your business domain model
first. DDD is such a useful technique for identifying the bounded
contexts of your domain model: things that are grouped together
and achieve high cohesion and low coupling. From the beginning,
it's very difficult to guess which parts of the system change together
and which ones change separately. However, after months, or more
likely years, developers and business analysts should have a better
picture of the evolution cycle of each one of the bounded contexts.
These are the ideal candidates for microservices extraction, and that
will be the starting point for the strangling of our monolith.
Microservices Characteristics
James Lewis and Martin Fowler provided a reasonable common set
of characteristics that fit most of the microservices architectures:
Experience has taught us that relying on remote calls (either some kind of Remote Procedure Call [RPC] or REST over HTTP) is usually not performant enough for data-intensive use cases, both in terms of throughput and latency.
This book is all about strategies for dealing with your relational database. Chapter 2 addresses the architectures associated with deployment. The zero-downtime migrations presented in Chapter 3 are not exclusive to microservices, but they're even more important in the context of distributed systems. Because we're dealing with distributed systems in which information is scattered across different artifacts interconnected via a network, we'll also need to deal with how this information will converge. Chapter 4 describes consistency models and the difference between Create, Read, Update, and Delete (CRUD) and Command and Query Responsibility Segregation (CQRS). The final topic, which is covered in Chapter 5, looks at how we can integrate the information between the nodes of a microservices architecture.
What About NoSQL Databases?
Discussing microservices and database types other than relational ones seems natural. If each microservice must have its own separate database, what prevents you from choosing other types of technology? Perhaps some kinds of data will be better handled through key-value stores, or document stores, or even flat files and git repositories.
There are many different success stories about using NoSQL databases in different contexts, and some of these contexts might fit your current enterprise context as well. But even if they do, we still recommend that you begin your microservices journey on the safe side: using a relational database. First, make it work using your existing relational database. Once you have successfully finished implementing and integrating your first microservice, you can decide whether you or your project will be better served by another type of database technology.
The microservices journey is difficult, and as with any change, you'll have better chances if you struggle with one problem at a time. It doesn't help having to simultaneously deal with a new thing such as microservices and new, unexpected problems caused by a different database technology.
CHAPTER 2
Zero Downtime
Any improvement that you can make toward the reduction of your batch size that consequently leads to a faster feedback loop is important. When you begin this continuous improvement, sooner or later you will reach a point at which you can no longer reduce the time between releases due to your maintenance window: that short time frame during which you are allowed to drop the users from your system and perform a software release.
Maintenance windows are usually scheduled for the hours of the day when you are least concerned about disrupting the users who are accessing your application. This implies that you will mostly need to perform your software releases late at night or on weekends. That's not what we, as the people responsible for owning it in production, would consider sustainable. We want to reclaim our lives, and if we are now supposed to release software even more often, it's certainly not sustainable to do it every night of the week.
Zero downtime is the property of your software deployment pipeline
by which you release a new version of your software to your users
without disrupting their current activities, or at least minimizing
the extent of potential disruptions.
In a deployment pipeline, zero downtime is the feature that will
enable you to eliminate the maintenance window. Instead of having
a strict time frame within which you can deploy your releases, you
might have the freedom to deploy new releases of software at any
time of the day. Most companies have a maintenance window that
occurs once a day (usually at night), making your smallest release
cycle a single day. With zero downtime, you will have the ability to
deploy multiple times per day, possibly with increasingly smaller
batches of change.
Deployment Architectures
Traditional deployment architectures have the clients issuing
requests directly to your server deployment, as pictured in
Figure 2-1.
Unless your platform provides you with some sort of hot deployment, you'll need to undeploy your application's current version and then deploy the new version to your running system. This will result in an undesirable amount of downtime.
Blue/Green Deployment
Blue/green deployment is a very interesting deployment architecture that consists of two different releases of your application running concurrently. This means that you'll require two identical environments, one for the production stage and one for your development platform, each capable of handling 100% of your requests on its own. You will need the current version and the new version running in production during a deployment process. These are represented by the blue deployment and the green deployment, respectively, as depicted in Figure 2-3.
Canary Deployment
The idea of routing 100% of the users to a new version all at once
might scare some developers. If anything goes wrong, 100% of your
users will be affected. Instead, we could try an approach that gradually increases user traffic to a new version and keeps monitoring it
for problems. In the event of a problem, you roll back 100% of the
requests to the current version.
This is known as a canary deployment, the name borrowed from a
technique employed by coal miners many years ago, before the
advent of modern sensor safety equipment. A common issue with
coal mines is the buildup of toxic gases, not all of which even have
an odor. To alert themselves to the presence of dangerous gases,
miners would bring caged canaries with them into the mines. In
addition to their cheerful singing, canaries are highly susceptible to
toxic gases. If the canary died, it was time for the miners to get out
fast, before they ended up like the canary.
Canary deployment draws on this analogy, with the gradual deployment and monitoring playing the role of the canary: if problems with the new version are detected, you have the ability to revert to the previous version and avert potential disaster.
We can make another distinction even within canary deployments. A standard canary deployment can be handled by infrastructure alone, as you route a certain percentage of all the requests to your new version. On the other hand, a smart canary requires the presence of a smart router or a feature-toggle framework.
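To make the routing idea concrete, here is a minimal sketch of a percentage-based canary toggle; the class and method names are illustrative and not tied to any specific feature-toggle framework. A smart canary would replace the random draw with a stable hash of a user attribute so that the same users consistently see the new version.

    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative sketch: route a configurable percentage of requests to the new version.
    public class CanaryRouter {

        public enum Target { CURRENT, NEW }

        private volatile int canaryPercentage; // 0-100; e.g., 5 sends roughly 5% of traffic to the new version

        public CanaryRouter(int canaryPercentage) {
            this.canaryPercentage = canaryPercentage;
        }

        // Standard canary: purely statistical routing.
        public Target route() {
            return ThreadLocalRandom.current().nextInt(100) < canaryPercentage
                    ? Target.NEW
                    : Target.CURRENT;
        }

        // Smart canary: pin a stable subset of users to the new version using a hash of the user ID.
        public Target routeForUser(String userId) {
            int bucket = Math.abs(userId.hashCode() % 100);
            return bucket < canaryPercentage ? Target.NEW : Target.CURRENT;
        }

        // Rolling back is just a matter of dropping the percentage to zero.
        public void rollback() {
            this.canaryPercentage = 0;
        }
    }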
Application State
Any journeyman who follows the DevOps path sooner or later will
come to the conclusion that with all of the tools, techniques, and
culture that are available, creating a software deployment pipeline is
not that difficult when you talk about code, because code is stateless.
The real problem is the application state.
From the state perspective, the application has two types of state: ephemeral and persistent. Ephemeral state is usually stored in memory through the use of HTTP sessions in the application server. In some cases, you might even prefer not to deal with the ephemeral state when releasing a new version. In a worst-case scenario, the user will need to authenticate again and restart the task he was executing. Of course, he won't exactly be happy if he loses that 200-line form he was filling in, but you get the point.
To prevent ephemeral state loss during deployments, we must externalize this state to another datastore. One usual approach is to store the HTTP session state in in-memory key-value solutions such as Infinispan, Memcached, or Redis. This way, even if you restart your application server, you'll have your ephemeral state available in the external datastore.
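As an illustration, the following sketch stores serialized session attributes in a remote Infinispan cache through its HotRod Java client; the host, port, and cache name are placeholders, and equivalent code could target Memcached or Redis with their respective clients.

    import java.util.concurrent.TimeUnit;

    import org.infinispan.client.hotrod.RemoteCache;
    import org.infinispan.client.hotrod.RemoteCacheManager;
    import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

    public class ExternalSessionStore {

        private final RemoteCache<String, byte[]> sessions;

        public ExternalSessionStore() {
            // Placeholder host and port; in production these would come from configuration.
            ConfigurationBuilder builder = new ConfigurationBuilder();
            builder.addServer().host("cache.example.com").port(11222);
            RemoteCacheManager cacheManager = new RemoteCacheManager(builder.build());
            // Placeholder cache name for the externalized HTTP session state.
            this.sessions = cacheManager.getCache("http-sessions");
        }

        public void save(String sessionId, byte[] serializedState) {
            // Expire idle entries after 30 minutes, mirroring a typical HTTP session timeout.
            sessions.put(sessionId, serializedState, 30, TimeUnit.MINUTES);
        }

        public byte[] load(String sessionId) {
            return sessions.get(sessionId);
        }
    }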
It's much more difficult when it comes to persistent state. For enterprise applications, the number one choice for persistent state is undoubtedly a relational database. We're not allowed to lose any information from persistent data, so we need some special techniques to be able to deal with the upgrade of this data. We cover these in Chapter 3.
It's not unusual to have teams applying database migrations manually between releases of software. Nor is it unusual to have someone sending an email to the Database Administrator (DBA) with the migrations to be applied. Unfortunately, it's also not unusual for those instructions to get lost among hundreds of other emails.
Database migrations need to be a part of our software deployment process. Database migrations are code, and they must be treated as such. They need to be committed in the same code repository as your application code. They must be versioned along with your application code. Isn't your database schema tied to a specific application version, and vice versa? There's no better way to assure this match between versions than to keep them in the same code repository.
We also need an automated software deployment pipeline and tools
that automate these database migration steps. We'll cover some of
them in the next section.
Popular Tools
Some of the most popular tools for schema evolution are Liquibase and Flyway. Opinions might vary, but the feature sets that the two tools currently offer almost match each other. Choosing one instead of the other is a matter of preference and familiarity.
Both tools allow you to perform the schema evolution of your relational database during the startup phase of your application. You
will likely want to avoid this, because this strategy is only feasible
when you can guarantee that you will have only a single instance of
your application starting up at a given moment. That might not be
the case if you are running your instances in a Platform as a Service
(PaaS) or container orchestration environment.
Our recommended approach is to tie the execution of the schema evolution to your software deployment pipeline so that you can assure that the tool will be run only once for each deployment, and that your application will have the required schema already upgraded when it starts up.
In their latest versions, both Liquibase and Flyway provide locking mechanisms to prevent multiple concurrent processes from updating the database. We still prefer not to tie database migrations to application startup: we want to stay on the safe side.
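One way to tie schema evolution to the deployment pipeline is to run the migration tool as a discrete step before the new application version starts, for example from a small standalone program invoked by the pipeline. The sketch below uses Flyway's Java API (the fluent configure() style of recent versions); the JDBC URL, credentials, and migration location are placeholders.

    import org.flywaydb.core.Flyway;

    public class MigrateDatabase {

        public static void main(String[] args) {
            // Placeholder connection settings; a real pipeline would inject them as secrets.
            Flyway flyway = Flyway.configure()
                    .dataSource("jdbc:postgresql://db.example.com:5432/app", "app_user", "app_password")
                    .locations("classpath:db/migration") // versioned migration scripts committed with the code
                    .load();

            // Applies only the migrations that have not been applied yet; safe to run on every deployment.
            flyway.migrate();
        }
    }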
2 You might argue that we're just moving information, but if you try to access the old column, it's not there anymore!
3 It might be an oversimplification for the execution time calculation, but it's a fair bet for instructional purposes.
4 It can sound terrifying to suggest disabling the safety net exactly when things are more likely to break, as this defeats the purpose of using a safety net. But then again, it's all about trade-offs. Sometimes you need to break some walls to make room for improvement.
Two of the most popular patterns for dealing with data manipulation are Create, Read, Update, and Delete (CRUD) and Command
and Query Responsibility Segregation (CQRS). Most developers are
familiar with CRUD because the majority of the material and tools
available try to support this pattern in one way or another. Any tool
or framework that is promoted as a fast way to deliver your software
to market provides some sort of scaffolding or dynamic generation
of CRUD operations.
Things start to get blurry when we talk about CQRS. Certainly the
subject of microservices will usually invoke CQRS in many different
discussions between people at conferences and among members of
development teams, but personal experience shows that we still have
plenty of room for clarification. If you look for the term "CQRS" on
a search engine, you will find many good definitions. But even after
reading those, it might be difficult to grasp exactly the why or even
the how of CQRS.
This chapter will try to present a clear distinction between the CRUD and CQRS patterns and the motivation for using each of them. And any discussion about CQRS won't be complete if we do not understand the different consistency models that are involved in distributed systems: how these systems handle read and write operations on the data state in different nodes. We'll start our explanation with these concepts.
Consistency Models
When we're talking about consistency in distributed systems, we are referring to the concept that you will have some data distributed in different nodes of your system, and each one of those might have a copy of your data. If it's a read-only dataset, any client connecting to any of the nodes will always receive the same data, so there is no consistency problem. When it comes to read-write datasets, some conflicts can arise. Each one of the nodes can update its own copy of the data, so if a client connects to different nodes in your system, it might receive different values for the same data.
The way that we deal with updates on different nodes and how we propagate the information between them leads to different consistency models. The description of eventual consistency and strong consistency presented in the next sections is an oversimplification of the concepts, but it should paint a sufficiently complete picture of them within the context of information integration between microservices and relational databases.
Eventual Consistency
Eventual consistency is a model in distributed computing that guarantees that, given an update to a data item in your dataset, eventually, at a point in the future, all access to this data item in any node will return the same value. Because each one of the nodes can update its own copy of the data item, if two or more nodes modify the same data item, you will have a conflict. Conflict resolution algorithms are then required to achieve convergence.1 One example of a conflict resolution algorithm is last write wins: if we are able to add a synchronized timestamp or counter to all of our updates, the last update always wins the conflict.
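A minimal sketch of last write wins, assuming every update carries a timestamp taken from a synchronized clock or a monotonically increasing counter; the class and method names are illustrative.

    // Illustrative last-write-wins conflict resolution between two replicas of the same data item.
    public final class VersionedValue {

        final String value;
        final long timestamp; // from a synchronized clock or a logical counter

        public VersionedValue(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        // When two nodes hold conflicting copies, the most recent update wins.
        public static VersionedValue resolve(VersionedValue a, VersionedValue b) {
            return a.timestamp >= b.timestamp ? a : b;
        }
    }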
One special case of eventual consistency is when you have your data distributed in multiple nodes, but only one of them is allowed to make updates to the data. If one node is the canonical source of information for the updates, you won't have conflicts in the other nodes as long as they are able to apply the updates in the exact same order as the canonical information source.
1 Convergence is the state in which all the nodes of the system have eventually achieved
consistency.
Strong Consistency
Strong consistency is the model most familiar to database developers, given that it resembles the traditional transaction model with its Atomicity, Consistency, Isolation, and Durability (ACID) properties. In this model, any update in any node requires that all nodes agree on the new value before making it visible for client reads. It sounds naively simple, but it also introduces the requirement of blocking all the nodes until they converge, which can be especially problematic depending on network latency and throughput.
Applicability
There are always exceptions to any rule, but eventual consistency tends to be favored for scenarios in which high throughput and availability are more important requirements than immediate consistency. Keep in mind that most real-world business use cases are already eventually consistent. When you read a web page or receive a spreadsheet or report through email, you are already looking at information as it was some seconds, minutes, or even hours ago. Eventually all information converges, and we're already used to this eventuality in our lives. Shouldn't we also be used to it when developing our applications?
CRUD
CRUD architectures are certainly the most common architectures in traditional data manipulation applications. In this scenario, we use the same data model for both read and write operations, as shown in Figure 4-1. One of the key advantages of this architecture is its simplicity: you use the same common set of classes for all operations. Tools can easily generate the scaffolding code to automate its implementation.
Figure 4-1. A traditional CRUD architecture (Source)
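As a minimal illustration of that single shared model, the sketch below uses one Customer class and one repository interface for every operation; the names are hypothetical and not tied to any particular framework.

    import java.util.Optional;

    // The same data model is used for both reads and writes.
    public class Customer {
        public long id;
        public String name;
        public String address;
    }

    // One interface covers Create, Read, Update, and Delete on that single model.
    public interface CustomerRepository {
        Customer create(Customer customer);
        Optional<Customer> read(long id);
        Customer update(Customer customer);
        void delete(long id);
    }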
CQRS
CQRS is a fancy name for an architecture that uses different data models to represent read and write operations.

Let's look again at the scenario of changing the Customer's address. In a CQRS architecture (see Figure 4-2), we could model our write operations as Commands.

2 We're using the DTO term here not in the sense of an anemic domain model, but to address its sole purpose of being the container of information.
Figure 4-3. A CQRS architecture with separate read and write stores
(Source)
Some motivations for using separate data stores for read and write operations are performance and distribution. Your write operations might generate a lot of contention on the data store. Or your read operations might be so intensive that the write operations degrade significantly. You also might need to consolidate the information of your model using information provided by other data stores. This can be time consuming and won't perform well if you try to update the read model together with your write model. You might want to consider doing that asynchronously. Your read operations could be implemented in a separate service (remember microservices?), so you would need to issue the update request to the read model in a remote data store.
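To contrast with the CRUD sketch above, here is a minimal CQRS-style sketch: writes are expressed as intention-revealing commands handled against the write store, while reads are served from a separate, flattened view model. All names are illustrative, and the read-model update shown in the handler could just as well be asynchronous.

    // Write side: a command that captures the user's intent instead of a generic update.
    public final class ChangeCustomerAddressCommand {
        public final long customerId;
        public final String newAddress;

        public ChangeCustomerAddressCommand(long customerId, String newAddress) {
            this.customerId = customerId;
            this.newAddress = newAddress;
        }
    }

    // Read side: a flat DTO shaped for queries, possibly stored in a different data store.
    public final class CustomerAddressView {
        public final long customerId;
        public final String formattedAddress;

        public CustomerAddressView(long customerId, String formattedAddress) {
            this.customerId = customerId;
            this.formattedAddress = formattedAddress;
        }
    }

    public class CustomerCommandHandler {

        public void handle(ChangeCustomerAddressCommand command) {
            // 1. Validate and apply the change to the write store (omitted here).
            // 2. Update the read store, synchronously or asynchronously, so queries see the new address.
        }
    }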
Event Sourcing
Sometimes one thing leads to another, and now that we've raised the concept of events, we will also want to consider the concept of event sourcing.

Event sourcing is commonly used together with CQRS. Even though neither one implies the use of the other, they fit well together and complement each other in interesting ways. Traditional CRUD and CQRS architectures store only the current state of the data in the data stores. That's probably OK for most situations, but this approach also has its limitations:
• Using a single data store for both operations can limit scalability due to performance problems.
• In a concurrent system with multiple users, you might run into data update conflicts.
• Without an additional auditing mechanism, you have neither the history of updates nor their source.
Assuming that all bank accounts start with a zero balance, it's just a matter of sequentially applying all the credit() and debit() operations to compute the current amount of money. You probably already noticed that this is not a computationally intensive operation if the number of operations is small, but the process tends to become very slow as the size of the dataset grows. That's when CQRS comes to the assistance of event sourcing.
With CQRS and event sourcing, you can store the credit() and debit() operations (the write operations) in one data store and then store the consolidated current amount of money in another data store (for the read operations). The canonical source of information will still be the set of credit() and debit() operations, but the read data store is created for performance reasons. You can update the read data store synchronously or asynchronously. In a synchronous update, you can achieve strong or eventual consistency; in an asynchronous update, you will always have eventual consistency. There are many different strategies for populating the read data store, which we cover in Chapter 5.
Notice that when you combine CQRS and event sourcing you get auditing for free: at any given moment in time, you can replay all the credit() and debit() operations to check whether the amount of money in the read data store is correct. You also get a free time machine: you can check the state of your bank account at any given moment in the past.
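A minimal sketch of the bank account example, assuming plain in-memory classes: the write side stores only the credit() and debit() events, and the balance is a projection derived by replaying them. In a full CQRS setup that projection would live in a separate read store instead of being recomputed on every query.

    import java.util.ArrayList;
    import java.util.List;

    public class BankAccount {

        // Write side: an append-only log of events; nothing is ever updated in place.
        public static final class AccountEvent {
            final long amountInCents; // positive for credits, negative for debits
            AccountEvent(long amountInCents) { this.amountInCents = amountInCents; }
        }

        private final List<AccountEvent> events = new ArrayList<>();

        public void credit(long amountInCents) {
            events.add(new AccountEvent(amountInCents));
        }

        public void debit(long amountInCents) {
            events.add(new AccountEvent(-amountInCents));
        }

        // Read side: the current balance is obtained by replaying all events in order,
        // which also gives auditing and point-in-time reconstruction for free.
        public long balanceInCents() {
            long balance = 0;
            for (AccountEvent event : events) {
                balance += event.amountInCents;
            }
            return balance;
        }
    }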
Synchronous or Asynchronous Updates?
Synchronously updating the read model sounds like the obvious choice at first glance, but in fact it turns out that in the real world asynchronous updates are generally more flexible and powerful, and they have added benefits. Let's take another look at the banking example.
Real banking systems update the read models asynchronously because doing so gives them a lot more control over their procedures and policies. Banks record operations as they occur, but they reconcile those operations at night, and often in multiple passes. For example, reconciliation often applies all credit operations first and then all debit operations, which eliminates any improper ordering in the actual recording of the transactions due to technical reasons (such as a store or ATM that couldn't post its transactions for a few hours) or user error (such as depositing money into an account after an expensive purchase to prevent overdrawing the account).
monolithic, tightly coupled, and entangled database to your decoupled microservices database. Later on, when you have already successfully decoupled the data from each one of the endpoints, you will be free to explore and use another technology, which might be a better fit for your specific use case.
The following sections present a brief description of, and a set of considerations for, each one of these integration strategies:

• Shared Tables
• Database View
• Database Materialized View
• Database Trigger
• Transactional Code
• ETL Tools
• Data Virtualization
• Event Sourcing
• Change Data Capture
Shared Tables
Shared tables is a database integration technique that makes two or more artifacts in your system communicate through reading and writing to the same set of tables in a shared database. This certainly sounds like a bad idea at first. And you are probably right: even in the end, it will probably still be a bad idea. We can consider this to be in the quick-and-dirty category of solutions, but we can't discard it completely due to its popularity. It has been used for a long time and is probably also the most common integration strategy used when you need to integrate different applications and artifacts that require the same information.
Sam Newman did a great job explaining the downsides of this
approach in his book Building Microservices. We'll list some of them later in this section.
Database View
Database views are a concept that can be interpreted in at least two different ways. The first interpretation is that a view is just a result set for a stored query. The second interpretation is that a view is a logical representation of one or more tables, which are called base tables. You can use views to provide a subset or superset of the data that is stored in tables.
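As a small illustration, the sketch below creates a view over a hypothetical customer base table so that another artifact can read only the subset of columns it needs; the connection settings, table, column, and view names are placeholders, and support for writable views varies by DBMS.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateCustomerContactView {

        public static void main(String[] args) throws Exception {
            // Placeholder JDBC settings for the shared legacy database.
            try (Connection connection = DriverManager.getConnection(
                    "jdbc:postgresql://db.example.com:5432/legacy", "app_user", "app_password");
                 Statement statement = connection.createStatement()) {

                // Expose only the columns the consuming artifact needs, hiding the rest of the base table.
                statement.execute(
                    "CREATE VIEW customer_contact AS "
                  + "SELECT id, name, email FROM customer");
            }
        }
    }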
Database View Applicability
A database view is a better approach than shared tables for simple cases because it allows you to create another representation of your model suited to your specific artifact and use case. It can help to reduce the coupling between the artifacts so that later you can more easily choose a more robust integration strategy.

You might not be able to use this strategy if you have write requirements and your DBMS implementation doesn't support updatable views, or if the query backing your view is too costly to be run at the frequency required by your artifact.
Transactional Code
Integration can always be implemented in our own source code instead of relying on software provided by third parties. In the same way that a database trigger or a materialized view can update our target tables in response to an event, we can code this logic in our update code.
Sometimes the business logic resides in a database stored procedure: in this case, the code is not much different from the code that would be implemented in a trigger. We just need to ensure that everything runs within the same transaction to guarantee data integrity.

If we are using a platform such as Java to code our business logic, we can achieve similar results using distributed transactions to guarantee data integrity.
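On a Java EE platform, that could look like the sketch below, which wraps writes to two XA-capable data sources in a single JTA transaction so that both succeed or both are rolled back; the JNDI names, table names, and SQL are placeholders, and error handling is simplified.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    import javax.naming.InitialContext;
    import javax.sql.DataSource;
    import javax.transaction.UserTransaction;

    public class AddressIntegrationUpdater {

        public void updateBothStores(long customerId, String newAddress) throws Exception {
            InitialContext context = new InitialContext();
            UserTransaction tx = (UserTransaction) context.lookup("java:comp/UserTransaction");
            // Placeholder JNDI names for two XA data sources managed by the application server.
            DataSource monolithDb = (DataSource) context.lookup("java:jboss/datasources/MonolithXA");
            DataSource microserviceDb = (DataSource) context.lookup("java:jboss/datasources/MicroserviceXA");

            tx.begin();
            try {
                try (Connection source = monolithDb.getConnection();
                     Connection target = microserviceDb.getConnection()) {

                    try (PreparedStatement update = source.prepareStatement(
                            "UPDATE customer SET address = ? WHERE id = ?")) {
                        update.setString(1, newAddress);
                        update.setLong(2, customerId);
                        update.executeUpdate();
                    }
                    try (PreparedStatement replicate = target.prepareStatement(
                            "UPDATE customer_replica SET address = ? WHERE customer_id = ?")) {
                        replicate.setString(1, newAddress);
                        replicate.setLong(2, customerId);
                        replicate.executeUpdate();
                    }
                }
                tx.commit(); // both updates become visible together
            } catch (Exception e) {
                tx.rollback(); // neither update is applied
                throw e;
            }
        }
    }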
Data Virtualization
Data virtualization is a strategy that allows you to create an abstraction layer over different data sources. The data source types can be as heterogeneous as flat files, relational databases, and nonrelational databases.

With a data virtualization platform, you can create a Virtual Database (VDB) that provides real-time access to data in multiple heterogeneous data sources. Unlike ETL tools, which copy the data into a different data structure, VDBs access the original data sources in real time and do not need to make copies of any data.
You can also create multiple VDBs with different data models on top of the same data sources. For example, each client application (which again might be a microservice) might want its own VDB with data structured specifically for what that client needs. Data virtualization is a powerful tool for an integration scenario in which you have multiple artifacts consuming the same data in different ways.
One open source data virtualization platform is Teiid. Figure 5-2 illustrates Teiid's architecture, but it is also a good representation of the general concept of VDBs in a data virtualization platform.

4 Note that real-time access here means that the information is consumed online, not in the sense of real-time systems with strictly defined response times.
Updatable depending on data virtualization platform
Depending on the features of your data virtualization platform,
your VDB might provide a read-write data source for you to
consume.
Event Sourcing
We covered event sourcing in Chapter 4, and it is of special interest in the scope of distributed systems such as microservices architectures, particularly the pattern of using event sourcing and Command Query Responsibility Segregation (CQRS) with different read and write data stores. If we're able to model our write data store as a stream of events, we can use a message bus to propagate them. The message bus clients can then consume these messages and build their own read data store to be used as a local replica.
If you need to modify your existing application to fit most, if not all, of the aforementioned requirements, it's probably safer to choose another integration strategy instead of event sourcing.
Usually combined with a message bus
Events are naturally modeled as messages propagated and consumed through a message bus.

High scalability
The asynchronous nature of a message bus makes this strategy highly scalable. We don't need to handle throttling because the message consumers can handle the messages at their own pace. This eliminates the possibility of a producer overwhelming the consumer by sending a high volume of messages in a short period of time.
CDC Applicability
If your DBMS is supported by the CDC tool, this is the least intrusive integration strategy available. You don't need to change the structure of your existing data or your legacy code. And because the CDC events are already modeled as change events such as Create, Update, and Delete, it's unlikely that you'll need to implement newer types of events later, minimizing coupling. This is our favorite integration strategy when dealing with legacy monolithic applications for nontrivial use cases.
Debezium
Debezium is a new open source project that implements CDC. As of this writing, it supports pluggable connectors for MySQL and MongoDB in version 0.3, with PostgreSQL support coming in version 0.4. It is built on top of well-known and popular technologies such as Apache Kafka to persist and distribute the stream of change events to CDC clients.
Debezium fits very well in data replication scenarios such as those used in microservices architectures. You can plug the Debezium connector into your current database, configure it to listen for changes in a set of tables, and then stream those changes to a Kafka topic.
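On the consuming side, a microservice can maintain its local replica by reading those change events from Kafka with a plain consumer, as in the sketch below; the broker address, the topic name (Debezium derives it from the connector's server, database, and table names), and the updateLocalReadModel helper are placeholders.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CustomerChangeEventConsumer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.com:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "customer-read-model-updater");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Placeholder topic name following Debezium's server.database.table convention.
                consumer.subscribe(Collections.singletonList("dbserver1.inventory.customers"));

                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        // Each value is a JSON change event describing the row before and after the change.
                        updateLocalReadModel(record.value());
                    }
                }
            }
        }

        private static void updateLocalReadModel(String changeEventJson) {
            // Hypothetical helper: apply the change to this service's own read data store.
        }
    }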
Debezium messages have an extensive amount of information,
including the structure of the data, the new state of the data that was