Data Hub Guide for Architects
What is a data hub? How does it simplify data integration?
MarkLogic Corporation
999 Skyway Road, Suite 200
San Carlos, CA 94070
Ken holds a B.S. degree in Computer Science from the State University of New
York, Albany.
Additional Support
This publication is the product of many minds and was made possible with input
from across the MarkLogic team.
And of course, none of this would be possible without the valuable input from our
customers. It is their use of MarkLogic and their important feedback that have led to the
success and refinement of MarkLogic’s data hub software and cloud services.
Contents
Executive Summary
A Look Ahead
Appendix
Bibliography
1
Executive Summary
The data hub first emerged as a pattern due to a technological shift with
databases, specifically NoSQL, multi-model databases. While organizations
value their relational databases for handling structured data and querying with
SQL, they became frustrated by their lack of flexibility and the extensive up-front data
modeling they require.
For most data integration use cases, the time and cost associated with a
“relational database plus ETL” approach is too great. A multi-model database, by
contrast, enables organizations to ingest raw data immediately, lower schema-
design costs, and deliver faster value for more use cases.
The data hub approach has evolved, and now MarkLogic® and other vendors
provide comprehensive data hub solutions. MarkLogic is a leading vendor for this
emerging technology: its Data Hub Platform provides a unified data platform to
ingest, discover, curate (enrich, harmonize, master), and access data. And it is
powered by MarkLogic Server, a leading multi-model database.
Unlike data warehouses or data lakes, a data hub is significantly more agile, able to keep
up with today’s fast-moving business cycles. A data hub provides transactional and
analytical capabilities for both run-the-business and observe-the-business use cases.
And, it has the security and governance required for mission-critical data.
With its Data Hub Platform, MarkLogic provides a best-of-breed foundation for building a
data hub architecture. Along with Data Hub Service, MarkLogic’s cloud service,
organizations can do agile data integration and speed their transition to the cloud.
In practice, the MarkLogic Data Hub abstracts much of the technical complexity
so that most end users don’t have to worry about modeling entities and
harmonizing data. All they have to do is log in and explore the end result: well-
curated data. But, architects need to understand the underlying technical
capabilities on the back-end that make it possible to simplify data integration for
end users on the front-end.
The underlying capabilities of the MarkLogic Data Hub Platform, made possible
by MarkLogic Server, include:
In the chapters that follow, we will delve into how these underlying capabilities
are critical to support a data hub architecture and how that architecture supports
today’s rapidly changing business needs.
2
Data Integration
The data warehouse concept is largely credited to Bill Inmon, who described it as
“a subject-oriented, non-volatile, integrated, time-variant collection of data in
support of management’s decisions.” Ralph Kimball – another
data warehouse pioneer – described it as a “copy of transaction data specifically
structured for query and analysis.” By either definition, its purpose was (and
still is) to integrate multiple upstream line-of-business systems for subsequent
analysis, most often quantitative in nature1. Put another way, the data warehouse
is designed to support the observe-the-business functions of an enterprise.
Though the concept was conceived as early as the 1970s, its adoption began
to accelerate in 1990, creating an entire decision support ecosystem within
technology, consisting of myriad supporting tools and techniques. Today this
segment of the software industry is measured in the billions of dollars annually
and estimated to be worth $20B by 2022 [2].
Just before the turn of the millennium, however, there was recognition that an
additional type of enterprise-scale integration was needed to support operational
activity across business units. As new business challenges and opportunities
emerged, so did a new class of enterprise application integration (EAI) technologies
for connecting operational systems to one another.
1 https://www.1keydata.com/datawarehousing/data-warehouse-definition.html
2 https://www.marketanalysis.com/?p=65
The net result of all of this is that enterprises adopted two distinctly different
ways of integrating data for run-the-business and observe-the-business domains.
Here, on each side of Figure 1, we see the run-the-business and observe-
the-business domains. The run-the-business domain consists of a number of
applications to support the operations of the enterprise. Integrating the data
for these applications is enabled by EAI technologies (SOA, ESB, etc.), while
decision support technologies and processes (data warehouses/marts) are used
to support observe-the-business integration. We’ll also notice a third zone on
the diagram, which we call enterprise data management (EDM). This zone mostly
comprises technologies and implementations that are responsible for
“connecting” the two other zones.
[Figure 1: OLTP sources and applications feeding data distribution, used by business users]
However, as we get closer to the details, we begin to see some issues, particularly
with respect to data integration. It is with this critical task that we start to see the
following characteristics emerge:
On the last point in particular, when we dig more deeply, we can’t help but
see a very slippery slope as data is copied and changed for technical reasons
as opposed to valid business reasons. What happens is that unnecessary data
wrangling leads to problems that accrue over time as an exponentially increasing
volume and variety of data must be processed within an ever-shrinking time
window. Yet most businesses have relied on the same tools and techniques for the
past three decades (or longer)! And when exponentially increasing requirements
are met by only marginally increased resources, the net result is a significant
shortfall that compounds over time.
3
Our Technical (Data Integration) Debt
The term “technical debt”3 is often used to describe decisions made during
software development that result in future rework, often because shortcuts are
taken in lieu of more thoroughly vetted approaches. The term, however, may
also be applied to unintended consequences resulting from “not knowing any
better” and/or as a result of following “accepted wisdom.”
In the case of today’s data integration challenges, it has become clear that much
of IT has participated in accrual of technical debt, simply by using accepted
tools and techniques. To better understand this, let’s go back to our diagram in
Figure 1 and take a closer look at the participants in the data flow. For this, we’ll
modify the diagram with numeric labels for cross-reference purposes.
[Figure 2: The Figure 1 diagram with numeric labels added for cross-reference, spanning back-end OLTP sources, applications, data distribution, operational decision making, and business users]
3 https://en.wikipedia.org/wiki/Technical_debt
While each of the technologies in Figure 2 plays a positive role in the enterprise,
we should also understand and acknowledge their unintended contributions to
technical debt. In the table that follows, we examine each technology category
and assess the positive contributions of each, as well as their respective
contribution to technical debt.
3. Master data management (MDM)
What it does: To achieve a “single source of truth,” or golden copy, for critical business entities through a set of processes and programs.5
Positive contribution: To provide consistency for referenced data across multiple applications.
Contribution to technical debt: An ambiguous quest to have one data standard across silos. Many challenges are compounded by ongoing ETL. The irony is that many solutions call for creating yet another “golden” copy of the data.

4. Data warehouse
What it does: To provide the ability to do cross-line-of-business analysis, typically quantitative in nature.
Positive contribution: To allow for analysis of the enterprise as a whole and/or for a large, complex, multi-business-unit portion of the enterprise that would not otherwise be possible without combining the data.
Contribution to technical debt: Stale information arises from the heavy dependency on ETL. Also, because of the difficulty of integrating disparate data sets, data warehouses contain a small subset of source data entities and attributes, precipitating the need to duplicate more data by way of data marts (see below).

5. Data marts
What it does: To store narrowly focused analytical data (unlike the broader data warehouse), often containing a subset of data warehouse data, combined with other domain-specific data not contained in the data warehouse.
Positive contribution: To provide capabilities similar to the data warehouse for more domain-specific use cases than are typical of an enterprise-wide data warehouse.
Contribution to technical debt: Proliferation of data silos and data copying. As a reaction to the compromises of an enterprise-wide data warehouse (slow pace of delivery, subset of line-of-business [LoB] data), data marts have proliferated.

6. Data distribution and operational decision making
What it does: To deliver pertinent information to data-dependent stakeholders including, but not limited to, internal decision-makers, compliance officers, external customers, B2B partners, and regulators.
Positive contribution: To streamline processes and improve the consistency of data access and sharing.
Contribution to technical debt: Because these functions are the farthest downstream in a traditional data integration architecture, they are the most negatively impacted by the challenges of integrating data. These negative impacts manifest as issues associated with time-to-delivery, data quality, and data comprehensiveness.
4 In a report sponsored by Informatica, analysts at TDWI estimate between 60 percent and 80 percent of the total cost of a data
warehouse project may be taken up by ETL software and processes.
Source: https://tdwi.org/~/media/TDWI/TDWI/WhatWorks/TDWI_WW29.ashx
5 According to TechTarget: “Master data management (MDM) is a comprehensive method of enabling an enterprise to link all of its
critical data to one file, called a master file, that provides a common point of reference.”
Source: http://searchdatamanagement.techtarget.com/definition/master-data-management
The net result of all of the above for nearly every large enterprise is an
accumulation of technical debt, most of which is related to attempts to integrate
data from various silos. This technical debt manifests itself in many ways,
resulting in negative impacts on day-to-day business operations and hindering
IT innovation. For most large enterprises, the problems seem intractable, leaving
many wondering how we got to such a point in the first place.
When it comes to the technical debt associated with integrating data from silos,
there are often myriad seemingly unrelated business problems arising from a
common root cause. Consider the following two examples:
1. Why? They may need to search across as many as 16 different systems to find
the information they’re looking for.
2. Why? The systems all have data about a customer, some of which overlaps,
yet they each have different data models.
3. Why? They serve different operational functions and the data has yet to
be integrated.
4. Why? Combining that data in a cohesive way to allow for a customer 360 has
proven to be difficult and error-prone.
5. Why? A relational database model must account for all data model variances
up front, and the schema must be created before development can begin.
In the above case, the root cause lies with limitations of modeling with a relational
database management system (RDBMS), where schema modeling is a prerequisite
activity to development. Moreover, as we’ll discuss in the subsequent section, because
nearly every model change in a relational database is accompanied by non-trivial
code and back-testing changes, modelers attempt to design schemas to account for
as many scenarios as possible, potentially making the modeling exercise very complex
and time-consuming. In many cases, because of complexity, compromises are made
in the modeling process in an attempt to meet a deadline or otherwise “save time.”
Again, despite the different business use case, the root cause lies with the
limitations of RDBMS schema, where integrating multiple valid models is
difficult, time-consuming, and brittle.
Table: Customers
We create more tables when we need to represent more complex entities or more
complex information about entities. For instance, a customer may have more
than one phone number or address, which would indicate the need to create
other tables to capture such data.
Table: Phones
id | cust_id (foreign key) | type | country_code | area_code | number
1  | 1                      | home | 1            | 212       | 555-1234
The more complex an entity becomes, the more tables and relationships that
are needed to represent the business entity. Likewise, the more business entities
we have, the more complex the overall modeling process becomes. For a large
business unit within an enterprise, the modeling effort can be substantial.
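To make the preceding tables concrete, here is a minimal sketch of the same normalization in SQL (run through SQLite purely for convenience); the table and column names simply mirror the illustration above and are not taken from any particular system:

```python
import sqlite3

# In-memory database purely for illustration; the schema mirrors the
# Customers and Phones tables sketched above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id      INTEGER PRIMARY KEY,
    fname   TEXT,
    lname   TEXT,
    address TEXT
);
CREATE TABLE phones (
    id           INTEGER PRIMARY KEY,
    cust_id      INTEGER REFERENCES customers(id),  -- foreign key back to customers
    type         TEXT,
    country_code TEXT,
    area_code    TEXT,
    number       TEXT
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'John', 'Smith', '123 Main St')")
conn.execute("INSERT INTO phones VALUES (1, 1, 'home', '1', '212', '555-1234')")

# Even this tiny entity already needs a join to reassemble the "customer" as
# the business thinks of it; each new attribute group means another table.
rows = conn.execute("""
    SELECT c.fname, c.lname, p.type, p.area_code, p.number
    FROM customers c
    JOIN phones p ON p.cust_id = c.id
""").fetchall()
print(rows)  # [('John', 'Smith', 'home', '212', '555-1234')]
```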
The people responsible for creating schemas (mostly DBAs) are on the critical
path to delivery, since data cannot be loaded into an RDBMS without a defining
(and constraining) schema. Furthermore, those same people have the profound
task of making sure that they get things right up-front. After all, they are making
or finalizing decisions around what data is welcome (and not welcome) inside
the database.
This not only has the potential to negatively impact delivery times at the
inception of projects, but also every time there is a requirements change
that impacts the database schema. This can be particularly troublesome for
applications where the development has already begun, or for those that have
loaded data and/or have gone into production.
Responding to even a modest schema change typically requires the following steps (sketched in code after the list):
• Taking a snapshot of any data that may have already been loaded
• Creating new models and/or altering existing models to provide new columns
and/or tables to allow for new data
• Changing ETL code and/or other existing application code to account for the
database changes
• Re-creating indexes so that the database may be queried efficiently
• Re-testing all dependent applications and processes
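As a rough sketch of what the model-related portion of these steps looks like in SQL (again using SQLite only for illustration; the column and index names are made up), note how little of the real cost shows up in the DDL itself:

```python
import sqlite3

# Fresh in-memory copy of the illustrative customers table from earlier.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, fname TEXT, lname TEXT, address TEXT)"
)

# 1. Alter the model to make room for newly required data.
conn.execute("ALTER TABLE customers ADD COLUMN address_line2 TEXT")

# 2. Re-create any indexes that should now cover the new column.
conn.execute("CREATE INDEX idx_customers_address ON customers (address, address_line2)")

# 3. Every ETL job and application query that reads or writes this table must
#    still be found, changed, and re-tested -- the part of the price that never
#    appears in the DDL above.
```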
What we’re left with is a very rigid process, where even the smallest of changes
comes at a steep price. Furthermore, because relational databases force data
into rows and columns, business entities are not always modeled in ways that
represent their natural state. Finally, because of the limited ways in which
relationships may be represented in a relational database, a good deal of the
business context about relationships between entities is not explicitly captured.
From within the context of a large enterprise, changes to the business may
arrive in a multitude of ways. Some changes are minor, others more impactful.
The frequency of change may also vary, as well as which business units within
a large organization are impacted. Sometimes changes are (seemingly) isolated
to a single department. Other times, multiple groups within an organization are
affected. And, it is on this last point – the difference between single-department
and enterprise perspectives – that we may best appreciate the impact of change
with respect to modern data integration.
When viewed through the lens of a single department, many business changes may
be viewed as “manageable” with respect to how the changes impact the underlying
data models. Thirty to forty years ago, this was certainly the case. Through the
1980s and even into the 1990s, delivery time expectations for system changes were
much longer. And, the volumes and variety of data managed by systems were
significantly less. Additionally, the interdependence between different departments’ data
was not nearly as great as it is today. They were simpler times indeed.
Today, things are more complicated – even when viewed through the lens of a single
department. Data is greater in volume and variety, and the turnaround time for
responding to change has shrunk. As a result, various technologies and techniques have
sprung up in response to these challenges, oftentimes to aid the RDBMS technologies
on which the various departments’ systems were built. The “middle tier” emerged to
map business entities to the underlying databases, supported by object-relational
mapping (ORM) tools. Various agile delivery methods also sprang up to shrink the
delivery cycle from years to months (or less). By and large, the new tools and techniques
worked (and still do) to address the complications, with one significant caveat: they
work within the perspective of a single set of related business
problems, such as a single department or division within a larger enterprise.
However, when one considers multiple business units or divisions, with different
budgets, leaders, perspectives and people, things go from complicated to complex.
And it’s at this inflection point of complexity that RDBMS technology breaks down.
In his bestselling book Team of Teams, U.S. General Stanley McChrystal references
mathematician Edward Lorenz’s famous “butterfly effect” to make a distinction
between problems that are complicated and ones that are complex.6 He proposes
that complicated problems have a level of predictability about them that makes
it possible to address them systematically, oftentimes by applying efficiencies to
existing processes. For complex problems, however, he speaks of an increasing
number of interdependent interactions that emerge, making the system too
unpredictable for traditional systematic processes to address. With these complex
problems, a more agile approach is required to solve them, one that “embraces
the chaos” so to speak, by recognizing that rigid processes can only go so far.
This is exactly the case with today’s challenges around enterprise data
integration. There are far too many interdependent systems within a large
enterprise to assume that the rote processes associated with RDBMS data
modeling will scale to meet the level of complexity of the overall system. In other
words, we have crossed the complexity threshold.
Departmental integration (complicated):
• Manageable requirements under one set of business owners.
• Relatively stable model that may follow a set process for managing change.
• Solvable complications resulting in many successful implementations.

Enterprise integration (complex):
• Frequently changing requirements across multiple business owners.
• Interdependent models resulting in frequent changes.
• Spotty track record for enterprise integration when relying exclusively on RDBMS-based tools.
The key takeaway here is that modeling business entities for the enterprise (i.e.
modeling for data integration purposes across business domains) requires an
altogether different approach to modeling than the traditional approach that has
been used for decades. In subsequent sections, we explore these new modeling
methods and learn how a data hub makes such modeling possible.
6 McChrystal, Stanley. Team of Teams: New Rules of Engagement for a Complex World. Penguin Random House. 2015.
• Big data. The three Vs of the “big data” trend – volume, velocity, and
variety – cast a spotlight on the notion that traditional data management
approaches were ripe for disruption. As a result, a number of technologies
emerged that created a good deal of hype. Although there has since been
inevitable disillusionment and in some cases outright abandonment (e.g.,
Hadoop), we have benefited from the increased focus on the transformational
power of data and the recognition that enterprises do need a way to handle the
big “V’s” (volume, velocity, variety, and also veracity).
These examples reinforce the fact that technology does not stand still (nor do
business needs), with potential disruption at any turn. They also show that
new opportunities bring unintended consequences of their own.
To illustrate this point, let us look at how enterprises fared with the hype
surrounding data lakes. We choose this example because it speaks to data
integration challenges, and because the places where data lakes are deficient are
areas specifically addressed by a data hub. After this brief assessment of data
lakes, we’ll dive right into data hubs for the remainder of the text.
As a result, large enterprises rushed to create these data lakes using a variety
of technologies from within the Hadoop ecosystem. For the most part, the
basic design of the data lake involved dumping large volumes of raw data into
the Hadoop Distributed File System (HDFS), and then relying on a myriad of
programming tools to make sense of this raw data. There were some gains;
however, as with all hype cycles, there was also a lot of disillusionment.
Following are some of the areas where data lakes came up short:
7 http://whatis.techtarget.com/definition/3Vs
The result for many enterprises has been an exacerbation of many pre-existing
problems, even as other problems get addressed. In most cases, the overall effect
is a net increase in the pain associated with managing data. It is for these and
other reasons that Hadoop and data lakes fall short of expectations.
To answer these questions, we will start by giving a name to one of the primary
deliverables of the modeling approach: the Enterprise Entity.
The term “entity” is well known in data modeling. Relational database modelers
are familiar with it within the context of an entity-relationship diagram (ERD)
where a business entity (or more often part of a business entity) is transcribed
into a table in a relational database. In the object-oriented world, more fully
composed business entities might be referred to as classes. In each case though,
the general concept is similar, namely the things (i.e. nouns) that a business
cares about (e.g. customers, products, pharmaceutical compounds, stock trades,
insurance claims, oil rigs, etc.) are modeled as some kind of entity.
When it comes to data integration, the entities to model must be crafted from a
collection of other previously modeled entities as opposed to “modeling them from
scratch.” In other words, we are creating a “model of models” at the enterprise level.
These diagrams depict how modeling data for a department differs from
modeling data for the enterprise:
[Figure: [n] departmental models, each modeled independently, versus a single enterprise “model of models”]
• The model must also be able to support the creation of a canonical model that
is separate from, and in addition to, the source models. The intention of the
canonical model is to harmonize structural differences, naming differences,
and semantic differences. Additionally, this canonical portion of the model
must be flexible such that it may change over time.
• As the enterprise entity undergoes changes (e.g. new attributes are added,
additional source models are incorporated, new relationships emerge, etc.)
metadata about the changes must be storable at the instance level.
The creation of such a model for a given entity results in a richer representation
that more closely mirrors reality. Instead of trying to merge and flatten multiple
source models into a least-common-denominator model subset, we instead get
a multi-dimensional model superset that is rich with information, and dynamic
enough to handle ongoing changes.
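As a rough sketch of what such an enterprise entity can look like when persisted as a document (the field names below are purely illustrative, not a prescribed MarkLogic schema), consider a customer drawn from two source systems:

```python
# Illustrative only: an enterprise entity that keeps each source model as is,
# adds a canonical section harmonizing structure, names, and semantics, and
# records instance-level metadata about how the entity was assembled.
enterprise_customer = {
    "sources": {
        "crm":     {"CID": "C-100", "fname": "john", "lname": "smith"},
        "billing": {"name": "John Smith", "address": "123 Main St"},
    },
    "canonical": {                 # the harmonized, flexible model superset
        "firstName": "John",
        "lastName": "Smith",
        "address": "123 Main St",
    },
    "metadata": {                  # change and provenance data at the instance level
        "derivedFrom": ["crm", "billing"],
        "harmonizedAt": "2019-06-01T12:00:00Z",
    },
}
```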
[Figure: A customer entity built up in three stages. Three differing source models (one with CID/fname/lname, one with name/address, one with firstname/lastname, plus unstructured customer-success notes) are retained as is, then harmonized into a single enterprise entity, and finally linked into a multi-entity enterprise graph]
What the preceding picture progression shows is three differing source models
for a customer, how they are retained, linked, harmonized and enriched into
an enterprise entity, and then finally how they are linked with other enterprise
entities to create a rich data integration graph.
With this conceptual building block now in place, we will turn our attention to
the Data Hub itself and show how it is a foundational pattern for creating and
maintaining these enterprise entities. In doing so, we will also get a primer on
how to model an enterprise entity by way of a simple example.
4
The Data Hub Pattern Emerges
Here we are at last. Admittedly, it has taken us almost halfway through the
text to accurately describe more than three decades of data integration technical
debt. But, that’s what it takes to truly understand what is involved!
We finished the last section with a brief discussion of the “big data” data lake
hype cycle. And we arrived here at a transition into something that sounds much
more mundane – a pattern.
At MarkLogic, we have since productized the data hub to become our flagship
offering, but it first emerged as a new enterprise architecture pattern to address
data integration. This is important to note because as a pattern, it was based on
best practices and evidence of repeated success with real customers. This is in
contrast to many fads that are contrived from less-proven theories or that have
never been used to solve actual business problems.
“Pattern” Explained
In technology, patterns were first popularized with software development
(i.e., code), and later expanded into more coarse-grained concepts such
as inter-system design and data flows (e.g., enterprise architecture). In all
cases, technology patterns are discovered when looking across a number of
successful implementations for common best practices.
Although our Data Hub Platform is a fully-baked product – database and UI included
– it is still helpful to think of the general data hub architecture as a pattern or
blueprint. As such, it is similar to other patterns such as the data warehouse or data lake.
To understand the rationale behind the definition, let’s break it down into its
components. Then we can talk about the principles behind it, the inner workings,
as well as the type of technology needed to implement it.
“An authoritative multi-model data store and interchange…”
Multi-model within this context is itself layered. Later, we will cover the concept of a multi-model database as technology that provides multiple modeling techniques (e.g., document, RDF triple, tuple, etc.) to represent data. That is true of the larger data hub pattern. Another complementary perspective is that the data hub may also simultaneously represent multiple different models for similar business entities. The key point here is that models are so important for data integration that they must be afforded the same agility as data, instead of treating them as rigid constraints.
The reference to an authoritative data store and interchange refers to the data hub as the best, first place to integrate data. In other words, it functions as an authoritative integration point as opposed to a “system of record” for all data.

“… where data from multiple sources is curated in an agile fashion…”
Data curation in the data hub context refers to the ability to enrich data (add more value with metadata, etc.), harmonize that data (synthesize all source data, without having to decide which data to “throw away”), and master the data (match and merge data like an MDM tool). And, it’s all done using an agile approach in which only some of the data can be curated and made fit-for-purpose at a given time without a major up-front modeling project.

“… to support multiple cross-functional analytical and transactional needs.”
A data hub supports the entire enterprise data lifecycle (e.g., run-the-business, observe-the-business) as opposed to part of it. It is also able to perform transactional business functions that are only possible when combining data from multiple lines-of-business. It typically serves multiple downstream consumers of enterprise data, often for different purposes.
a) Ability to treat schema as data, given that every payload may also
include schema information. This is what gives schema and models the
same level of agility as the data itself.
c) Ability to store metadata and data together, allowing for provenance and lineage
to be captured, and providing a strong foundation for data governance.
8 https://www.techopedia.com/definition/32462/impedance-mismatch
9 https://www.w3.org/2001/sw/
or “Thor sonOf Odin”). In a data hub, RDF triples provide a myriad of capabilities
with respect to managing data and the complexities of data integration. This
will be discussed in a bit more detail in the sections that follow.
4. Indexing to support real-time ad-hoc query and search. Unlike a data lake
that depends on subsequent brute-force processing for data querying, a data hub
indexes all data on ingestion to ensure that data is queryable as soon as it is loaded.
For the most part, you may view the envelope pattern as an implementation
detail; the more important point is that the original data is left
unaltered – a crucial feature for both governance and provenance.
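To make the envelope idea concrete, here is a minimal sketch of such a document, loosely following the commonly described envelope sections (headers, triples, instance, attachments); the field values are illustrative only:

```python
# A simplified envelope for one harmonized customer record: the original
# source payload is preserved untouched under "attachments", while the
# canonical view, metadata, and semantic triples travel alongside it.
envelope = {
    "envelope": {
        "headers": {                 # instance-level metadata, e.g. provenance
            "sources": [{"name": "CRM", "ingestedAt": "2019-06-01T12:00:00Z"}],
        },
        "triples": [                 # facts about this entity as subject/predicate/object
            {"subject": "/customer/1", "predicate": "spouseOf", "object": "/customer/2"},
        ],
        "instance": {                # the harmonized (canonical) view
            "Customer": {"firstName": "John", "lastName": "Smith"},
        },
        "attachments": {             # the raw source document, left unaltered
            "CID": "C-100", "fname": "john", "lname": "smith",
        },
    }
}
```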
To demonstrate this, we revisit our customer records for John and Jane Smith:
1. All original source data has been preserved (crucial for governance) without
having to make up-front and potentially lasting decisions about which source
data is important or not.
2. All original models are also preserved and searchable.
3. We have begun to create a canonical model on an as-needed basis, as
opposed to trying to figure out all possibilities up-front.
Also contained in the sample data is a newly added metadata section. Ignoring
some of the syntactical particulars for a moment, the astute reader will notice
what appears to be information about source systems and timestamps. This
is exactly the case, as that section of the document contains examples of data
provenance being maintained alongside the data itself, using what is called
PROV-XML, which is a serialization format conforming to what is known as the
Provenance Ontology or PROV-O11.
11 https://www.w3.org/TR/prov-o/
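The sample documents themselves are not reproduced here, but as a rough illustration of the kind of facts PROV-O captures (built with the Python rdflib library purely for convenience; the URIs are made up, and this is not MarkLogic’s own tooling):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")   # the W3C Provenance Ontology

g = Graph()
harmonized = URIRef("http://example.org/customer/1")     # illustrative URIs
crm_record = URIRef("http://example.org/crm/record/42")

# The harmonized customer is an entity derived from a specific CRM record
# at a specific time -- the "source system and timestamp" facts noted above.
g.add((harmonized, RDF.type, PROV.Entity))
g.add((harmonized, PROV.wasDerivedFrom, crm_record))
g.add((harmonized, PROV.generatedAtTime,
       Literal("2019-06-01T12:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="xml"))   # serialize the provenance facts as RDF/XML
```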
You’ll see we’ve added another section – named “triples” – to each document. In
these sections, we have inserted RDF triples that assert a relationship between
the documents for John and Jane Smith, namely that they’re spouses. Though
this example is simple, it is already apparent how expressing such a relationship
is more powerful than how relationships are expressed in an RDBMS.
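As a small sketch of what such an assertion and a query over it can look like (expressed with rdflib and SPARQL purely for illustration; the URIs and predicate name are made up, and MarkLogic stores and queries the equivalent triples natively):

```python
from rdflib import Graph, Namespace, URIRef

EX = Namespace("http://example.org/")    # illustrative vocabulary

g = Graph()
john = URIRef("http://example.org/customer/john-smith")
jane = URIRef("http://example.org/customer/jane-smith")

# The facts asserted in the documents' "triples" sections: John and Jane are spouses.
g.add((john, EX.spouseOf, jane))
g.add((jane, EX.spouseOf, john))

# A simple SPARQL query: who is John Smith's spouse?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?spouse
    WHERE { <http://example.org/customer/john-smith> ex:spouseOf ?spouse }
""")
for row in results:
    print(row.spouse)   # -> http://example.org/customer/jane-smith
```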
Additionally, we can begin to imagine how more complex and interesting sets of
facts and relationships may be expressed. For instance, the aforementioned PROV-O
standard is expressed by way of RDF as triples. Expressing data as triples is quite
powerful, so much so that it is the underpinning of the promise of the semantic web12.
In other words, the promise and potential of semantics and RDF triples go far beyond
the domain of data integration and what we’ll discuss with respect to the data hub.
As for the smaller (yet still expansive) scope of what’s possible with respect to
data integration, the following is a brief list of examples:
12 https://www.w3.org/standards/semanticweb/
13 In fact, one well known semantic vocabulary is named “foaf” after “friend of a friend”
(e.g., https://www.xml.com/pub/a/2004/02/04/foaf.html)
And it’s the last point that is perhaps the most compelling. While it’s intrinsically
powerful to be able to represent various facts and relationships with triples
(particularly when combined with documents), it’s the ability to reason over
these data representations – in ways that simply were not before possible – that is
particularly exciting.
Again, the wider topic of semantics is quite expansive, even when scoped specifically
to data integration, and is beyond the scope of this document. However, for the curious
reader, we’ve included some links in the Bibliography at the end of the text.
If we recall from the Principles of a Data Hub section, we need to provide a number
of capabilities. Being able to efficiently manage self-describing schema (without DBA
intervention) suggests a NoSQL document store would be part of the equation.
Similarly, the need to handle RDF calls for a triplestore14, while the deep indexing and
ad-hoc search capability suggest a full-text search engine. Finally, the data enrichment
required for harmonization, alongside the concurrent read/write capability to
support run-the-business functions, strongly indicates a need for ACID transactions15.
All of this suggests that we either need multiple databases to build a data
hub (possible but leaving us with technical silos, separate indexes, etc.), or a
single – albeit special – database.
14 https://en.wikipedia.org/wiki/Triplestore
15 https://en.wikipedia.org/wiki/ACID
of the data hub, this means that the database must seamlessly process documents/
objects, RDF triples and full text, all within a single secure transactional platform.
Looking at the diagram from left-to-right, we start with the sources of data. In this
example, we show data coming from multiple sources such as relational databases,
a message bus, as well as other content feeds (e.g., file systems, system interfaces,
etc.). Data from these sources gets loaded as is, directly into the multi-model
DBMS – in this case, MarkLogic. When data is loaded into the data hub, it is indexed
as part of the ingestion process, as opposed to being a separate, optional step. It is
this capability to index data as is during ingestion that allows data
to be queried and searched without extensive ETL processes.
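As a rough sketch of what loading data as is can look like against MarkLogic’s REST Client API (the host, port, credentials, URI, and collection name below are placeholders for this illustration):

```python
import requests
from requests.auth import HTTPDigestAuth

# A raw source record in its original shape -- no up-front transformation.
doc = {"CID": "C-100", "fname": "john", "lname": "smith"}

# PUT the document into the database via the /v1/documents endpoint; it is
# indexed as part of ingestion, so it is immediately queryable and searchable.
resp = requests.put(
    "http://localhost:8000/v1/documents",
    params={"uri": "/ingest/crm/customer-100.json", "collection": "raw-crm"},
    json=doc,
    auth=HTTPDigestAuth("admin", "admin"),
)
resp.raise_for_status()
```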
After ingestion, data may then undergo any number of curation processes in
support of the many uses of the data. The process of data harmonization, covered
previously, is a foundational curation process associated with the data hub.
On the right side of the diagram, we show how the data hub provides multiple
ways to access the data by way of data-centric APIs that not only provide search
and query capabilities against the data but also provide transactional capability
to support cross-line-of-business operations on harmonized data. This is one
of the key aspects that make it operational, allowing it to take part in critical
run-the-business functions. Another key aspect of the pattern – specifically
associated with the right side of the diagram – is that the APIs that are created are
based on incremental usage of the integrated data, in line with the principle of
harmonizing data on an as-needed basis.
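Continuing the same sketch, a downstream consumer might then hit a search endpoint directly rather than wait for an ETL cycle (again, the host, credentials, and collection are placeholders; /v1/search is part of the same REST Client API):

```python
import requests
from requests.auth import HTTPDigestAuth

# Ad-hoc search over the curated data, returned as JSON for a downstream consumer.
resp = requests.get(
    "http://localhost:8000/v1/search",
    params={"q": "smith", "collection": "raw-crm", "format": "json"},
    auth=HTTPDigestAuth("admin", "admin"),
)
for result in resp.json().get("results", []):
    print(result["uri"])
```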
Finally, the diagram calls out the cross-cutting aspects of tracking and reporting
on data provenance and lineage, a critical capability made possible by performing
incremental harmonization inside the database.
With this picture in mind, we may now turn our attention to fitting a data hub
into existing enterprise architectures, and the associated positive impacts.
5
Fitting Into an Existing Architecture
Though most large enterprises share many common characteristics (as we’ve
posited in The State of Enterprise Architecture section), they nonetheless differ
from one another when we get into the details.
And, while products and tooling have emerged and will continue to emerge around the
pattern16 (as has occurred with the data warehouse), the data hub is best classified as a
pattern given its flexibility. Additionally, since a data hub covers more ground across
the “run-the-business” and “observe-the-business” spectrum, and across more types
of data and metadata than a data warehouse does, where it is placed within the
enterprise may vary. This section is therefore dedicated to providing some guidance
around where to implement a data hub inside an existing enterprise architecture.
[Figure 6: A data hub within an existing architecture, alongside OLTP sources, a SOA bus, Smart Mastering, a data warehouse, and business users]
16 https://www.marklogic.com/product/marklogic-database-overview/database-features/data-hub-framework/
Okay, that looks very different from Figure 1. What happened to the ETL? Where
did the MDM go? Does this imply that the data hub is now the new center of my
entire enterprise?
These and more are among the first questions asked when a diagram similar to
Figure 6 is shown. As is typical of technology discussions, the answers to each
are very much contextual, particularly for large enterprises. Nonetheless, the
high-level view above is a good place to start and, as we’ll see, doesn’t necessarily
imply sweeping statements about the entire enterprise, at least not at the outset.
It may just as easily be referring to a smaller subset of an enterprise architecture
as opposed to the enterprise as a whole. In fact, because the data hub allows for
and encourages incremental implementations, nearly all data hubs start out with
a smaller scope before growing into something enterprise-wide.
Let’s look at ETL. In the above diagram, it is not represented. The reason for this
is that, depending upon the state of the data pre-ingestion, no ETL would need
to be done to force-fit things into any particular canonical model, since all source
models may be staged as is. That said, an ETL tool may still be leveraged to
connect to various sources or perhaps perform some very light format marshaling
(e.g., RDBMS to CSV). However, in many scenarios, transformation beyond such
simple payload translation would not be needed. In scenarios such as this, the
above diagram would be largely accurate within the use case’s context.
That said, along this journey, traditional MDM-based data sources may remain
in place for periods of time, or may even evolve to use the data hub approach.
Additionally, such incremental mastering for one business entity (e.g., trades)
may coexist alongside traditional MDM for a different but related business entity
(e.g., instruments). Again, going back to context, the diagram might be referring
to the data strategy for but one part of an enterprise for a particular point in time.
And, that is a good segue into the last question above about context. When we
draw the diagram, it is not necessarily the case that a data hub has to immediately
be placed at the center of an enterprise for all data sources (though it could be).
Many MarkLogic customers implement enterprise-wide data hubs, but most
actually start out with a smaller scope against a defined business need.
This, of course, raises the question: how do we know when a data hub is
appropriate for a particular place within a very complex enterprise? To answer
that, let’s look at a few simple guidelines:
1. The use case requires multiple different data sources (or input sources) to
meet the requirements.
3. The integrated data will be used in multiple different ways, some of which
may not even be directly related, across a variety of consumers (i.e., multiple
output consumers and/or destinations).
4. The number and type of input sources and consumers are expected to change
over time (i.e., change over time is the norm).
5. There are direct operational requirements against the integrated data (i.e.,
run-the-business use cases depend on the integrated data).
6. (Bonus) The operational requirements are such that more than read-only
access against the data is required.
When represented pictorially, data flows that meet even the first three criteria
above take on a common, high-level look, as follows:
Those high-level data hub diagrams will represent solution states, as opposed to
pre-existing problem states. This is why our above bowtie diagram has a circle
with a question mark in the middle of it, since that is the part of the picture that
may vary the most depending on circumstances. But, even in these cases, we’re
able to draw some simplified pictures to illustrate certain “from state” scenarios
as follows:
[Figure: A typical “from state” with multiple data sources connected through layers of ETL to an OLTP store, a warehouse, data marts, archives, reference data, and unstructured data, which in turn feed multiple consumers and destinations]
Nearly all large enterprises contain a mix of both scenarios, often for overlapping
use cases within various places throughout the enterprise, each accumulating
their own technical debt. However, in every case where this many-in/many-out
pattern exists, chances are a data hub implementation would contribute positively,
not only by reducing the associated technical debt but also by enabling
innovation that was not possible before.
• Data quality and data governance. Because data warehouses are limited
in terms of what they can store at a point in time, they cannot easily store ad-
hoc metadata pertaining to such things as provenance and data lineage, which
then must be handled outside the database (e.g., in ETL layers).
When used in conjunction with a data hub, however, a properly-scoped data
warehouse or mart can be much more effective as a downstream consumer of data hub data.
For instance, with the ability to handle multiple different categories of data
(structured, unstructured, etc.), having multiple modeling techniques available
(e.g., documents, triples), all the while being able to handle multiple different
schemas simultaneously, the data hub removes the ETL dependency from the
observe-the-business function. In addition, by providing mechanisms that allow
for ad-hoc metadata capture (e.g., the envelope pattern), data provenance is kept
close to the integrated data, improving data quality overall.
“In an extreme case, adding a single column to a star schema fact table in a
relational database might result in a complete change in data granularity,
forcing the entire data set to be reloaded.”
17 https://en.wikipedia.org/wiki/Star_schema
18 https://en.wikipedia.org/wiki/Fact_table
Given these advantages, one might naturally ask if a data hub simply replaces
a data warehouse. As a rule of thumb, a data hub is not a drop-in upgrade/
replacement for a data warehouse or data mart. It is a different solution meant to
solve different problems.
That said, there are many cases where a data warehouse/mart was built in an
attempt to solve problems it wasn’t ideally suited to solve. For many years, after
all, these were the only available observe-the-business data integration patterns.
In these cases, where data warehouses and marts were built in an attempt to
provide many-in/many-out integration points, with business requirements that
change frequently, a data hub implementation is a much better replacement.
However, there are also cases where a data warehouse (or mart) is either the
right choice or at least part of the right choice, in which case the warehouse/mart
simply stays in place and is complemented by a data hub, reducing the overhead
associated with ETL, and providing discovery help in other areas. Often a data
hub will feed a downstream data warehouse or data mart. In all cases, however,
the introduction of the data hub will serve to reduce the friction imposed by a
traditional ETL + warehouse approach to data integration by simply removing
the need to have to model all of the data perfectly before getting value from it.
What this means from the perspective of an entire enterprise is that the proper
implementation of the data hub will dramatically reduce the overall number
of instances of data warehouses and data marts, as well as rationalize what is
expected from them.
With a data hub in place, true data-centric integration is now available to the
run-the-business side of the house, allowing real-time applications to be brought
to the data in many cases, as opposed to needing to move the data around to the
applications.
And technical debt is not the only consideration. When a data hub enables
real-time, data-centric integration, it becomes possible to support cross-line-of-
business operations that might otherwise not have a home. For instance, a global
investment banking customer needed to track compliance with a particular regulation
regarding OTC derivative trades, something that could only reasonably be determined
after data from a number of trading desks was integrated. At that point, any trades that needed
remediation would need to be tagged and tracked for further processing.
In this example, the best first place to tag these trades was in the data hub,
recording the fact that they needed certain types of upstream and downstream
remediation, all of it, however, based on an integrated context. In other words,
the operational workflow began in the middle, so to speak, even though the
trades originated elsewhere. Only with a data-centric integration pattern with
operational (i.e., run-the-business) capabilities was this possible.
Integration that depends on prerequisite modeling and data copying is not agile.
Every time a business change forces a change to a schema, a non-trivial amount of
work must be done. Tables must change, ETL code has to be rewritten, and data has
to be reloaded. For these reasons, data modeling in the RDBMS world is a somewhat
paranoid activity, forcing modelers to account for every possible scenario. This makes
many large-scale data integration initiatives either very slow, very brittle, or both.
“…despite the differences across industry and business outcomes, the goals and approaches had a great deal in common, even for seemingly dissimilar industries.”
6
The Data Hub in Use
Earlier, we also said that lasting technology patterns are often discovered as
opposed to contrived. This happens over time as organizations learn from each
other and best practices emerge. To illustrate this point, our case study examples
cover a variety of industries and enterprises that both contributed to and
reinforced core concepts in their own unique ways. But, despite their uniqueness,
they all have one thing in common—they are all large organizations with complex
problems and an urgent need to simplify data integration.
For the technologists who had to make sense of the various sources of data, it
was clear that the traditional approaches to data integration would not meet
their intended goals. So they charted a different course, using MarkLogic
as the underlying technology, and (what was then) a different approach to
data integration.
19 https://www.investopedia.com/terms/c/creditdefaultswap.asp
Table 3: Top Five Global Bank – Data Hub for Derivative Trading
Business goal: Consolidated view for all “over-the-counter” (OTC) derivative trades to support cross-asset-class discovery and post-trade processing.
Challenges: Multiple trade execution systems, with varying asset classes and complex models to support them, presented a challenge to creating an integrated model representative of all trades.
Input sources: More than 20 different trade execution systems across various OTC derivative asset classes.
Result: A single operational data hub for OTC derivative trades, eliminating the need to create multiple bespoke data marts and complex ETL. In the first month of production, the system scaled to 1,600 requests/second supporting discovery-based and operational workloads. Labor savings alone were estimated at US $4M over the first three years.
Figure 10: Top Five Global Bank – Data Hub for Derivative Trades
Like many other data hub implementations, the diagram depicts a many-in/
many-out data flow that is characteristic of the pattern.
Over the past 10 years, the entertainment industry has similarly been upended,
as online and digital delivery of content has radically changed how consumers get
their entertainment content. Streaming and download services in the late 2000s
contributed to an explosion of options for delivery. The number of distribution
partners went from 100s to 1000s almost overnight, while the number of formats
and versions of the content also exploded as release windows narrowed or in
some cases disappeared altogether. As a result, media organizations have had to
reinvent nearly every aspect of how they manage and deliver their content.
All of this data resided in separate systems, and as a result, internal teams looking
to re-use assets would have to call up experts with access to these separate
systems and enlist them to help find the right clips. And teams creating automated
feeds had to connect to these different data sources and execute custom logic to
select the correct assets. Creating a new feed would take 2-3 weeks per feed, on
top of additional time to update existing feeds as specifications changed.
To address this critical issue of access to the asset data, the studio turned to
MarkLogic to create a data hub for digital distribution. This system was able to
integrate critical data from multiple source systems and then make that data
available to internal users, teams creating distribution and other feeds, and even
external partners to enable self-service. This system greatly reduced the manual
and technical effort needed to find the right assets and get them to partners.
Making this data easily accessible is also a key factor in the studio’s success in
taking advantage of the many new opportunities that the new online and digital
entertainment marketplace has created. The following table summarizes the
use case.
Business goal: Create an up-to-date map of digital assets globally, with information from multiple disparate systems, which allows end users to find and leverage content in real time.
Challenges: Every time assets were needed, searches required reaching out to multiple source-system experts. The approach was not scalable to meet real-time needs.
Input sources: Data on the entire asset catalog of the studio. About six million records, including technical metadata from multiple digital asset systems, descriptive metadata and title data from multiple systems, and rights and product information from multiple systems.
Consumers and outputs: The system needed to provide a robust set of APIs that multiple teams could access to create both end-user functionality and custom applications. These include feeds to distribution partners, a custom application for partners to select content, integration with distribution partner systems, and integration with custom in-house asset search applications.
Result: A single data hub that powers every aspect of the studio’s digital supply chain. Using the data hub pattern, the first application was released in just eight weeks. The final system has led to saving “hundreds of hours” of individuals’ time finding assets and helped reduce the overall cost of distribution by 85 percent.
This system shares the many-in/many-out character of other data hub solutions:
multiple inputs and data access for multiple downstream systems. It also directly
powers a critical application in the partner portal.
• Create the platform to power the online presence for the host country
Olympics including distribution feeds to partners and custom views of athletes
• Power a popular online application that leverages not just the asset metadata,
but data describing the characters and eras of the show as well as the fans’
interactions with the content
To accomplish their goals, they needed a data hub to serve as a centralized system
architecture that would integrate various types of information from multiple sources
into a consistent, well-defined view. This was something that simply could not
be solved by their legacy technology stack that relied on an RDBMS as a primary
integration point. The following table summarizes how a data hub met the challenge.
[Figure: Data flowing between disconnected users, forward operating bases (FOBs), theater command, data enrichment, and HQ or external users]
Input sources: Hundreds of different sources of data and metadata, from text to video.
Result: A data hub that managed over 100 million data and metadata documents after the first go-live. Query results were 59x faster than with the previous RDBMS solution, covering more data types, with the ability to distribute data globally in a multi-master environment.
As the deadline approached (a very public deadline that no one could afford to
push back), those at CMS tasked with creating the backbone of the ACA realized
that legacy technology and methods were not going to work. The variety of the
data that needed to be integrated, the operational demands of the enrollment
period, the unknown requirements of something that had never been done before,
and the security demands of data that included PHI, PII, and protected data from
multiple federal agencies all demanded they find a better way, and find it fast.
Goal: Create a user-friendly public marketplace where consumers may shop for health insurance via an e-commerce-style website, all the while linking the application process to all necessary agencies and stakeholders in real time.
Challenges: Complex modeling of content that was in multiple, different formats from multiple federal agencies, state agencies, and insurers. Initial work had begun with assumptions of ETL and RDBMS technology for handling the integration challenges. When that proved not feasible, work had to begin anew, shortening the deadline further still. The marketplace needed to be a real-time operational system, not only interacting with end users but also choreographing workflows across multiple agency systems (e.g., IRS, CIS, DHS, etc.).
The following picture provides a high-level view of the interactions between the
various systems as well as the two hubs created to support the architecture.
[Figure: Health insurers (Aetna, Blue Shield, United Health Group, and other insurance providers) connect to the Health Insurance Marketplace behind HealthCare.gov, while income and eligibility checks against the IRS, SSA, DHS, state exchanges, and other verification systems flow through the Data Services Hub]
Figure 13: HealthCare.gov – Marketplace Data Hub and Data Services Hub
To the credit of the incumbents, the reaction of the industry giants has been swift and broad-reaching. Many large insurers are exploring new technologies, starting innovation spin-offs, launching their own VC funds, or simply buying best-in-breed insurtechs to fuel innovation. Mentions of "technology" and "digital" on insurance company earnings calls almost doubled between 2016 and 2017.²²
In the process, insurers are discovering the significant gaps they need to fill in order to leverage their own massive stores of legacy data. The promised ability of Artificial Intelligence (AI), machine learning, and Natural Language Processing (NLP) to automate slow, mistake-prone manual processes and to increase customer acquisition and retention with a frictionless digital experience relies on being able to fluidly integrate, understand, and operationalize all of this data – all of it stored across multiple source systems, ranging from mainframe to relational.
22 https://www.cbinsights.com/research/six-big-themes-in-2017/
To deliver a modern customer experience, the P&C firm needed to have a unified
view of its customers, including policy details, billing information, as well as any
other customer data captured through various channels and touch points. As
expected, the firm originally relied on massive ETL processes to match and merge
customer data from multiple sources, including a legacy Master Data Management
(MDM) platform. This platform – a multi-million-dollar investment – required
as much as 18 months of lead time to implement any changes that called for the
capture of information for which it was not initially designed.
In other words, the firm dealt with the typical rigid-schema problem of RDBMS-
based solutions. These limitations made it next to impossible for the insurer to
innovate its way into an increasingly digital marketplace. It recognized that if it
were to stay relevant within the insurance digital revolution, the MDM system
needed to be retired, and replaced with a modern infrastructure designed to
power multiple next-generation applications. Thus, a data hub with Smart Mastering™ became the path forward. The solution is summarized below:
Goal: Create a single integrated view of the customer to serve multiple applications and use cases.
Input sources: Multiple systems of record (CRM, agent, policy, claim, billing, quote); multiple formats (CSV, XML, JSON, PDF) and models; a combination of structured and unstructured data.
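Smart Mastering itself is configured within the MarkLogic Data Hub rather than hand-coded, but a rough Python sketch of the underlying match-and-merge idea may help. The record shapes, matching rule, and merge strategy below are invented for illustration and are far simpler than a real mastering configuration.

# Illustrative only: a naive match-and-merge of customer records from three
# systems of record. Real mastering uses configurable, weighted match rules
# and records provenance for every merge.
from difflib import SequenceMatcher

crm = {"name": "Jon Smith", "email": "[email protected]", "phone": "555-0100"}
policy = {"name": "Jonathan Smith", "email": "[email protected]", "policy_no": "P-77"}
billing = {"name": "J. Smith", "email": "[email protected]", "balance": 120.50}

def is_match(a, b, threshold=0.6):
    """Naive rule: identical email, or names that are similar enough."""
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio() >= threshold

def merge(records):
    """Union the attributes of matched records (first value wins on conflicts)."""
    merged = {}
    for record in records:
        for key, value in record.items():
            merged.setdefault(key, value)
    return merged

sources = [crm, policy, billing]
if all(is_match(crm, other) for other in sources[1:]):
    customer_360 = merge(sources)
    print(customer_360)  # one integrated customer view

The point is not the matching heuristic itself but the outcome: a merged view that preserves attributes from every contributing system of record.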
Other Industries
Whether it’s existential crises, opportunities for innovation, or a mix of each, every
industry is impacted by the challenges and promises of data integration. In every
case, the data hub can play a role. Data hubs for manufacturing, transportation,
and logistics, in the energy sector, and in other industries are moving into
production at some of the largest organizations in their respective verticals.
7
A Look Ahead
Thus far, a good deal of the data hub context in this text has been around
rationalizing technical debt accumulated within large enterprises over the last
two to three decades. In other words, a lot of the time has been spent describing
how to remedy the sins of the past. Yet, if technology has taught us anything,
it’s that innovation is always right around the corner promising ever-greener
pastures. As we’ve also learned, these promises sometimes lead to their own
potential for unintended consequences.
The big data hype cycle had its moment, promising to unlock the latent
power trapped inside enterprise data by merely loading huge volumes of it
into giant data lakes. That approach is largely a failure. It’s not that the data
lakes didn’t address some issues, it’s just that in the rush to use new (yet
untested) technologies, critical aspects of solving data integration issues were
ignored – security, governance, operational capabilities – creating many new
problems (which some of the same vendors happily used as an opportunity to sell
more stuff).
As the Hadoop hype cycle fades, there are nonetheless other significant industry
shifts that some may look toward as the potential “next big thing” to help make
some of their data integration pains go away. Some of these trends are well-
established and have been validated within certain contexts (e.g., cloud), while
others are still finding their footing (e.g., blockchain). All are tempting to look at
when confronted with difficult problems. After all, it’s much more fun to look at
new things than to toil with the same old seemingly unsolvable problems.
Through this lens, any IT innovation should have a net positive impact on the overall data mission. Its benefit must outweigh any
technical debt it creates. This is why the data hub is such an important
architectural construct. As a key component of enterprise-wide data
interchange, it provides a stable platform to protect against technical debt
while maximizing the potential for innovation. To demonstrate this point, let’s
consider three recent data-related innovations and assess the role of a data hub
in each case.
Cloud
There is no denying the impact of cloud on IT innovation. It has done nothing short of revolutionizing the way IT resources are consumed and managed. While its existence can make some "traditional" concerns obsolete (e.g., hardware procurement lead time), it makes others even more critical. In other words, while cloud technologies remove a lot of friction associated with provisioning
infrastructure and services, adopting them does not make challenges associated
with data integration simply “disappear into the cloud.”
Blockchain
Though some of its initial hype has recently settled a bit, the use of blockchain in the cryptocurrency space has demonstrated that it has some staying power and perhaps even untapped potential. Like other new and promising technologies, it has a capacity to monopolize focus, leaving open the possibility that less-exciting yet critical functions will be ignored by blockchain implementers, with unintended consequences. Security, for example, is one area that should be top of mind, thanks in no small part to some well-publicized digital currency breaches. As these breaches painfully demonstrated, a blockchain implementation is only as secure as its weakest link. Since all blockchain implementations at some point invariably interact with non-blockchain databases, the results can be disastrous when those databases are not secure or ACID-compliant.
Machine Learning and AI
Any good data scientist knows that good learners require good data – and lots of it. Data scientists also know that they spend most of their time trying to get hold of data and trying to get it into suitable shape for their algorithms to work their respective magic. Even then, should the results of the algorithms be "off" for some reason (e.g., unintended bias), it is incumbent upon the data scientist to go back to the data, understand where it came from and in which context, and adjust things accordingly. In such cases, a promising AI project can devolve into a series of mundane data-wrangling tasks. In other words, the data scientist (i.e., a human being) is very much in the middle of the machine learning and AI processes, oftentimes doing quite unglamorous things. And that is just the person who tries to domesticate the wild data algorithms of the AI jungle. What about the non-data-scientist humans who wish to use the results of these algorithms? How might they put the resultant findings into a broader business context and combine them with the more mundane human-centric findings to make informed business decisions?
This is where a data hub comes into play. Removing the friction associated with wrangling data for use by ML/AI algorithms, serving up governed, quality data, providing a lens into data provenance and lineage, and placing machine learning findings into a broader business context – these are all critically important data-centric functions for successful use of ML and AI. They also happen to be the very things a data hub is good at. In other words, to get the most benefit from your enterprise ML and AI programs, you need a data hub.
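As a hedged illustration of that last point, the Python sketch below imagines pulling already-curated customer documents from a hub over a generic REST search endpoint and flattening them into a feature table for a machine learning model. The URL, credentials, query syntax, and field names are placeholders invented for this example, not a documented API.

# Illustrative only: pulling curated documents from a data hub's REST search
# endpoint and shaping them into an ML feature table. The endpoint, credentials,
# query syntax, and field names are placeholders, not a documented API.
import requests
import pandas as pd

HUB_SEARCH_URL = "https://data-hub.example.com/v1/search"  # placeholder

def fetch_curated(entity_type, page_size=100):
    """Page through harmonized documents of a single entity type."""
    docs, start = [], 1
    while True:
        resp = requests.get(
            HUB_SEARCH_URL,
            params={"q": f"entity:{entity_type}", "format": "json",
                    "start": start, "pageLength": page_size},
            auth=("ml-reader", "change-me"),  # placeholder credentials
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            return docs
        docs.extend(results)
        start += page_size

def to_features(docs):
    """Flatten curated documents into a model-ready table."""
    rows = [{"customer_id": d["extracted"]["customer_id"],
             "policy_count": d["extracted"]["policy_count"],
             "open_claims": d["extracted"]["open_claims"],
             "churned": d["extracted"]["churned"]} for d in docs]
    return pd.DataFrame(rows)

if __name__ == "__main__":
    features = to_features(fetch_curated("Customer"))
    print(features.head())

Everything upstream of the feature-building step – harmonization, governance, lineage – is work the hub has already done, which is precisely the friction described above.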
Cloud, blockchain, machine learning and AI: these are just three examples of recent innovations that point to the need for a fundamental shift in how enterprise data is managed, lest their promised benefits go only partially realized. In this regard, the data hub represents a foundational component of any enterprise data strategy.
Conclusion
Despite the increasingly fast pace of technology advancements, sometimes the
inertia of “accepted wisdom” can be a surprisingly limiting force. For years, the
accepted wisdom of ETL and data warehouses, the dogma of run-the-business
and observe-the-business separation, and the assumption that you must have a
rigid schema designed up front before any data can be loaded into a database have
all resulted in the accumulation of significant technical debt.
Today, a new architecture has emerged as the missing piece of the puzzle – the Data Hub. It is a proven approach, refined in multiple
mission-critical settings across a variety of industries. And, it arrives not a
moment too soon.
The hype cycle of early big data technologies has faded, but many of the
challenges remain and are getting more complex as organizations look to
migrate to the cloud. We hope that all enterprise architects feel empowered
to address those challenges head-on by rethinking how data is integrated and
managed using an approach that brings simplicity to their enterprise – not
more complexity. The success stories of the data hub are only just beginning to
be written.
“We hope that all enterprise architects feel empowered to rethink how data is integrated and managed using an approach that brings simplicity to their enterprise – not more complexity.”
Appendix
A Data Hub Template for Enterprise Architects
The information that follows is consistent with the preceding material in the text.
General Information
Granularity: Blueprint
Short Description: A cross-functional, persistent data interchange, typically in support of an enterprise data integration strategy.
Long Description: An authoritative multi-model data store and interchange, where data from multiple sources is curated as needed to support multiple cross-functional analytical and transactional needs.
Problem Context
The existence of business activities that require data from multiple lines of
business to be integrated into a consolidated representation. The integrated
data is necessary to support cross line-of-business discovery and operations.
Requirements of this integrated data include:
23 http://www.opengroup.org/subjectareas/enterprise/togaf
Pre-conditions
Pre-conditions for consideration of a data hub include:
Solution Details
The solution describes a persistent, integrated, real-time data store that provides the following capabilities:
• Real-time Capable. The data store will have the ability to ingest data in
batch or real time.
The resulting implementation will consist of multiple data sources contributing to the
data hub in either real-time or batch modes. As needed, data across the sources will be
inter-related and/or harmonized into canonical (or partially canonical) formats. There
will also be multiple consumers of the data, some in batch mode and some in real time.
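A minimal Python sketch of that dual-path ingestion follows. The source formats, field names, and the in-memory list standing in for the hub are assumptions made purely for illustration.

# Illustrative only: one harmonization step shared by a real-time (per-record)
# path and a batch (file) path. Record shapes, field names, and the in-memory
# "hub" are invented for this sketch.
import csv
import io
import json

hub = []  # stand-in for the persistent data hub store

def harmonize(source, record):
    """Map a source-specific record into the hub's canonical order format."""
    canonical = {
        "order_id": record.get("order_id") or record.get("OrderNumber"),
        "amount": float(record.get("amount") or record.get("Total", 0)),
        "currency": record.get("currency", "USD"),
    }
    hub.append({"source": source, "canonical": canonical, "raw": record})

def ingest_realtime(source, message):
    """Real-time path: one JSON message at a time, e.g. off a message bus."""
    harmonize(source, json.loads(message))

def ingest_batch(source, csv_text):
    """Batch path: a whole CSV extract in a single call."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        harmonize(source, row)

ingest_realtime("web-orders", '{"order_id": "A-1", "amount": 42.0, "currency": "EUR"}')
ingest_batch("legacy-erp", "OrderNumber,Total\nB-2,99.95\nB-3,10.00\n")
print([doc["canonical"] for doc in hub])

Because both paths converge on the same canonical shape, downstream consumers never need to know whether a record arrived in a nightly batch or off a message bus.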
Resultant Context
Within an existing Enterprise Architecture, a data hub would be the first place
where data is integrated outside of source OLTP systems.
Diagram
The following diagram depicts the data hub according to the ArchiMate standard.²⁴
[Diagram: Batch files (via a Batch Ingest Service), RDBMS sources (via SQL access), and a message bus (via a real-time ingest service) write into a store of harmonized and enriched documents and objects. Harmonize, enrich, security, and governance functions operate on that store, which exposes indexing, CRUD, and messaging services plus a Data Services Layer (DSL) used by cross-LoB functions and business services.]
24 http://www.opengroup.org/subjectareas/enterprise/archimate-overview
Bibliography
Allemang, D. & Hendler, J. (2011). Semantic Web for the Working Ontologist,
Second Edition: Effective Modeling in RDFS and OWL. Waltham, MA:
Morgan Kaufmann.
This book discusses the capabilities of Semantic Web modeling languages, such
as RDFS (Resource Description Framework Schema) and OWL (Web Ontology
Language). Organized into 16 chapters, the book provides examples to illustrate
the use of Semantic Web technologies in solving common modeling problems.
This concise eBook examines how multi-model databases can help you integrate
data storage and access across your organization in a seamless and elegant way.
This presentation from the 2017 MarkLogic World conference features Michael
Bowers, Principal Architect, LDS Church and author of Pro CSS and HTML
Design Patterns published by Apress (1st edition in 2007, 2nd edition in 2011).
The session addresses common questions people have when considering a
move from a relational model to a NoSQL document/graph model.
This report explains how data governance can be established at the data level,
lays out practices and frameworks for regulatory compliance and security,
and explores different data governance technologies, including NoSQL and
relational databases.
Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1994). Design Patterns:
Elements of Reusable Object-Oriented Software. Boston, MA: Addison-Wesley.
Also known as the “GoF book” and now in its 37th printing [March 2009,
Kindle edition], this book captures a wealth of experience about the design of
object-oriented software. Four top-notch designers (collectively referred to
as the “Gang of Four”) present a catalog of simple and succinct solutions to
commonly occurring design problems.
This white paper discusses the top data security concerns CIOs, architects,
and business leaders should focus on at a strategic level – with a particular
focus on data integration – and provides an overview of how MarkLogic
addresses those concerns as a database.
Geoffrey Moore’s classic bestseller, Crossing the Chasm, has sold more than one
million copies by addressing the challenges faced by start-up companies. Now
Zone to Win is set to guide established enterprises through the same journey.
The Open Group (2007). Architecture Patterns. In TOGAF 8.1.1 Online. Retrieved
from http://pubs.opengroup.org/architecture/togaf8-doc/arch/chap28.html.
Ungerer, G. (2018). Cleaning Up the Data Lake with an Operational Data Hub. Retrieved from http://www.oreilly.com/data/free/cleaning-up-the-data-lake-with-an-operational-data-hub.csp.
Data lakes in many organizations have devolved into unusable data swamps.
This eBook shows you how to solve this problem using a data hub to collect,
store, index, cleanse, harmonize and master data of all shapes and formats.
999 Skyway Road, Suite 200 San Carlos, CA 94070
+1 650 655 2300 | +1 877 992 8885
www.marklogic.com | [email protected]