The Ultimate Guide to Data Integration
Mental models for data integration,
analytics and the modern data stack
Introduction
Without data integration, data – and the information, knowledge and insight that can
be gleaned from it – is siloed. Siloed data hampers collaboration by producing partial and
conflicting interpretations of reality, making it difficult to establish a single source of truth
and get everyone in an organization onto the same page.
True to its name, data integration is also key to data integrity – ensuring the accuracy,
completeness and consistency of data. Data integrity is a matter of fully comprehending
data, having confidence in its completeness and ensuring its accessibility where needed.
If you are an analyst, data engineer or data scientist (or manager of the aforementioned)
at an organization that uses operational systems, applications and other tools that produce
digital data, this guide is for you. Your role should allow you to influence or determine the
tools your company uses. More importantly, it should put you in a position to influence how
teams think about solving problems, finding insights and making decisions with data.
This guide is as much about building a modern data mindset as it is about implementing
new tools and processes.
With the modern data mindset, you will be able to turn automation into substantial
savings of time, talent and money, as well as the opportunity to pursue higher-value uses
of data. By contrast, as long as your organization remains tethered to outmoded methods
and mindsets of data integration, you will leave money on the table, waste the efforts of your
data professionals and lose ground to savvier entrants in the marketplace.
We’ll explain and evaluate the various approaches to data integration currently available and
you’ll learn how to bring data integration to your organization. Ultimately, we’ll build a modern
data mindset and equip you to coach others to adopt the same.
Typical business use cases for analytics include mining data for insights to improve the following:
1. Customer experiences
2. Internal processes and operations
3. Products, features and services
As your data integration and analytics capabilities mature, your organization will be able
to promote widespread data literacy and become more nimble, dynamic and innovative.
Traditionally, analytics was used for basic problem solving. This practice continues today, but the
tools, technologies and teams performing this work are increasingly characterized by massive
scale and predictive modeling, as exemplified by artificial intelligence and machine learning.
Automation and computation do more than just increase speed and efficiency or save time,
talent and money. The ability to crunch more data in minutes than any human could in a lifetime
and reach insightful conclusions that would otherwise be impossible marks a qualitative
difference in capability. Thinking about data analytics this way is a shift toward the modern data
mindset, where improvements to process and decision-making capacity are heavily informed by
advances in technology, enabled by a modern data stack.
Before anyone can use data to recognize patterns and identify causal mechanisms, the data
must be in one place and organized in a manageable fashion.
Standardized schema
In some cases, a destination might instead be a data lake. A deeper discussion of the tradeoffs
between data lakes and data warehouses is beyond our scope here, but in short, data lakes
are best used to support use cases that require unstructured data and mass storage of media
files or documents, such as certain forms of machine learning. Newer technologies combine the
functionality of both data warehouses and data lakes under a common cloud data platform, an
integrated solution that supports analytics, machine learning, data replication, data activation
and all other data needs.
You can think of data integration as a specific type of a more general operation called data
movement. Aside from data integration, data movement also includes data replication between
operational systems for the sake of redundancy and performance. Data activation pushes data
that has been modeled for analytics from a destination back into operational systems, enabling
all manner of data-driven business process automation.
Before we dig into the differences in approach, let’s understand the actions involved.
Transformation explained
Transformation plays a central role in both ETL and ELT. We can define transformation as any of
the operations involved in turning raw data into analysis-ready data models. These include:
1. Revising data ensures that values are correct and organized in a manner that supports their
intended use. Examples include fixing spelling mistakes, changing formats, hashing keys,
deduplicating records and all kinds of corrections.
2. Computing involves calculating rates, proportions, summary statistics and other important
figures, as well as turning unstructured data into structured data that can be interpreted by
an algorithm.
3. Separating consists of dividing values into their constituent parts. Data values are often
combined within the same field because of idiosyncrasies in data collection, but may need
to be separated to perform more granular analysis. For instance, you may have an “address”
field from which street address, city and state need to all be separated.
4. Combining records from across different tables and sources is essential to build a full picture
of an organization’s activities.
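Each of these operations maps naturally onto SQL. The sketch below is a minimal illustration only, assuming hypothetical raw_customers and raw_orders tables and a warehouse function such as SPLIT_PART; your own sources and SQL dialect will differ.

-- 1. Revising: trim, standardize and deduplicate
-- 2. Computing: derive counts and totals
-- 3. Separating: split a combined address field
-- 4. Combining: join customers with orders
SELECT
    UPPER(TRIM(c.country_code))         AS country_code,   -- revising
    SPLIT_PART(c.full_address, ',', 1)  AS street_address, -- separating
    SPLIT_PART(c.full_address, ',', 2)  AS city,           -- separating
    COUNT(o.order_id)                   AS order_count,    -- computing
    SUM(o.amount)                       AS lifetime_value  -- computing
FROM (
    SELECT DISTINCT customer_id, full_address, country_code  -- revising: deduplication
    FROM raw_customers
) AS c
JOIN raw_orders AS o
    ON o.customer_id = c.customer_id                          -- combining
GROUP BY 1, 2, 3;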
In ETL, data transformations typically aggregate or summarize data, shrinking its overall
volume. By transforming data before loading it, ETL limits the volume of data that is
warehoused, preserving storage, computation and bandwidth resources. When ETL was
invented in the 1970s, organizations operated under an extreme scarcity of storage,
computation and bandwidth. Constrained by the technology of the time, they had no
choice but to spend valuable engineering time constructing bespoke ETL systems to
move data from source to destination.
[Figure: The ETL workflow – identify sources, scope, define schema, derive insights – in which schema changes, breaking reports and new data requests force the pipeline to be rebuilt.]
Under ETL, extraction and transformation are tightly coupled because both are performed
before any data is loaded to a destination. Moreover, because transformations are dictated
by an organization’s specific analytics needs, every ETL pipeline is a complicated, custom-built
solution. The bespoke nature of ETL pipelines makes it especially challenging to add or revise
data sources and data models.
You’re probably beginning to see the flaws and potential pitfalls in this approach. It was
necessary when computing and storage resources were limited, and acceptable while data
sources remained constant and predictable.
But the world has changed. The ETL workflow – including all the effort to scope, build and test
it – must be repeated whenever upstream schemas or data sources change (specifically, when
fields are added, deleted or changed at the source) or when downstream analytics needs
change, requiring new data models.
In the second case, an analyst might want to create a new attribution model that requires
several data sources to be joined together in a new way. As with the previous case, this means a
slow, labor-intensive process of rebuilding the workflow.
Since extraction and transformation precede loading, all transformation stoppages prevent data
from being loaded to the destination, causing data pipeline downtime.
These difficulties are further aggravated if the ETL setup is on-premise, hosted in data centers
and server farms directly run by an organization. These setups require yet more tuning and
hardware configuration.
The key tradeoff made by ETL is to conserve computation and storage resources at the expense
of labor. This made sense when data needs were much simpler and computation and storage
were extremely scarce, especially in comparison to labor. These circumstances are no longer
true, and ETL in the present day is a hugely expensive and labor-intensive undertaking with
dubious benefits.
This data is typically stored in cloud-based files and operational databases, and then exposed to
end users as:
1. API feeds
2. File systems
3. Database logs and query results
4. Event streams
As the volume and granularity of data continue to grow, so do opportunities for analytics.
More data presents the opportunity to understand and predict phenomena with greater
accuracy than ever.
ETL made sense at a time when computation, storage and bandwidth were extremely scarce
and expensive. These technological constraints have since disappeared. The cost of storage
has plummeted from nearly $1 million to a matter of cents per gigabyte over the course
of four decades.
The cost of computation has fallen by a factor of millions since the 1970s:
[Figure: Cost of computing over time, in dollars per MIPS.]
These trends have made ETL and the costly, labor-intensive efforts necessary to support it
obsolete in two ways. First, the affordability of computation, storage and internet bandwidth
has led to the explosive growth of the cloud and cloud-based services. As the cloud has grown,
the volume, variety and complexity of data have grown as well. A brittle pipeline that integrates
a limited volume and granularity of data is no longer sufficient.
Second, the affordability of computation, storage and internet bandwidth allows modern data
integration technologies to be based in the cloud and store large volumes of untransformed
data in data warehouses. This makes it practical to reorder the data integration workflow, in the
process saving considerable money and headcount.
[Figure: The ETL workflow, in which schema changes, breaking reports and new data requests force the entire pipeline to be reworked.]
This reordering prevents the two failure states of ETL (i.e. changing upstream schemas and downstream
data models) from interfering with extraction and loading, leading to a simpler and more robust
approach to data integration.
[Figure: The ELT workflow – identify sources, scope, automated EL, define schema, define and create data models, derive insights – in which breaking reports only require rework of the data models.]
Under ELT, extracting and loading data are upstream, and therefore independent,
of transformation. Although the transformation layer may still fail as upstream schemas or
downstream data models change, these failures do not prevent data from being loaded into
a destination. An organization can continue to extract and load data even as transformations
are periodically rewritten by analysts. Since the data is warehoused with minimal alteration,
it serves as a comprehensive, up-to-date source of truth.
Moreover, since transformations are performed within the data warehouse environment,
there is no longer any need to arrange transformations through drag-and-drop transformation
interfaces, write transformations as Python scripts or build complex orchestrations between
disparate data sources. Instead, transformations can be written in SQL, the native language of
most analysts. This turns data integration from an IT- or engineer-centric activity featuring long
project cycles and the heavy involvement of engineers to more of a self-service activity directly
owned by analysts.
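For example, an analyst might define a data model as a simple SQL view directly in the destination, with no external orchestration. The schema, table and column names below are hypothetical.

-- A hypothetical analyst-owned model built on raw data that has already been loaded
CREATE OR REPLACE VIEW analytics.monthly_revenue AS
SELECT
    DATE_TRUNC('month', o.created_at) AS revenue_month,
    c.region                          AS region,
    SUM(o.amount)                     AS total_revenue
FROM raw.orders AS o
JOIN raw.customers AS c
    ON c.customer_id = o.customer_id
GROUP BY 1, 2;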
Instead, an external provider such as Fivetran can provide a standardized schema to every
customer. Since there are relatively few ways to normalize a schema, the most sensible way
to standardize a data model is through normalization. Normalization fosters data integrity,
ensuring that data is accurate, complete, consistent and of known provenance. It also makes
data models easier for analysts to interpret. Standardized outputs have the added benefit of
enabling derivative products such as standardized data model templates for analytics.
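To make the idea concrete, a normalized schema stores each entity once, in its own table, and expresses relationships through keys. The two tables below are a hypothetical sketch, not an actual Fivetran output schema.

-- Each entity lives in exactly one table; relationships are expressed through keys
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    email        VARCHAR,
    created_at   TIMESTAMP
);

CREATE TABLE subscription (
    subscription_id  INTEGER PRIMARY KEY,
    customer_id      INTEGER REFERENCES customer (customer_id),  -- one customer, many subscriptions
    plan_name        VARCHAR,
    started_at       TIMESTAMP
);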
In short, ELT turns data models into undifferentiated commodities, enabling outsourcing and
automation. It empowers a data team to shift its focus from building, maintaining and revising
a complicated piece of software to simply using its outputs and building data models directly in
the destination.
ETL vs. ELT:
• Order of operations: ETL is extract, transform, load; ELT is extract, load, transform.
• Flexibility: ETL requires predicting use cases and designing data models beforehand, or else fully revising the data pipeline; ELT lets you create new use cases and design data models at any time.
• Tooling: ETL pipelines are bespoke; ELT solutions are off-the-shelf.
• Ownership: ETL is engineering/IT-centric, an expert system; ELT is analyst-centric and accessible to non-technical users.
There are some cases where ETL may still be preferable over ELT. These specifically include
cases where:
1. The desired data models are well-understood and unlikely to change quickly. This is especially
the case when an organization also builds and maintains the systems it uses as data sources.
2. There are stringent security and regulatory compliance requirements concerning the data
and it cannot be stored in any third-party location, such as cloud storage.
A further benefit of automation over manual data integration is repeatability and consistency of
quality. Analysts and data scientists can finally use their understanding of the business to model and
analyze data instead of wrangling or munging it.
Here is another calculation for a simpler use case. You can follow along below,
or use our calculator.
In order to estimate the cost of building data pipelines, we need the following:
1. Average yearly cost of labor for your data engineers (or analysts or data scientists)
2. How many data sources you have
3. How long it takes to build and maintain a typical data source
With these figures, we can estimate the time and money spent on engineering.
• Let’s assume the cost is $140,000 for each data engineer: $100,000 base salary, with a 1.4
multiplier for benefits
Assume that it takes about 7 weeks to build a connector and about 2 weeks per year to update
and maintain it. Each connector takes about 9 weeks of work per year.
• With seven connectors, that’s 7 * (7 + 2) = 63 weeks of work per year
Use the weeks of work per year to calculate the fraction of a working year this accounts for.
Then, multiply it by the cost of labor to arrive at the total monetary cost. Assume the work year
lasts 48 weeks once you account for vacations, paid leave and other downtime.
• If the cost of labor is $140,000, seven connectors take 63 weeks of work, and there are 48
working weeks in a year, then ($140,000) * (63 / 48) = $183,750
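Expressed as a single query under the same assumptions (a $140,000 fully loaded cost per engineer, seven connectors, nine weeks of work per connector and a 48-week working year), the estimate looks like this:

-- Estimated annual labor cost of building and maintaining pipelines in-house
SELECT 140000 * (7 * (7 + 2) / 48.0) AS estimated_annual_cost;  -- returns 183750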
More importantly, there are serious opportunity costs to using engineering time for building and
maintaining data pipelines. There are many high value data projects that engineers, analysts and
data scientists can build downstream of data pipelines, such as:
1. Data models
2. Visualizations, dashboards and reports
3. Customer-facing production systems
4. Business process automations
5. API feeds and other data products for third parties
6. Custom data pipelines to sources for which off-the-shelf pipelines don’t exist
7. Infrastructure to support data activation
8. Predictive models, artificial intelligence and machine learning, both for internal use
and for customer-facing products
But much like an iceberg, the majority of the work involved is hidden from view, lurking beneath
the surface. Data integration is complicated, requiring a complex system to solve a wide range
of technical problems in the following areas:
• Automation
• Performance
• Reliability
• Scalability
• Security
Building this from top to bottom also means organizing raw data from the source into a
structure that analysts can use to produce dashboards and reports. This requires a deep
understanding of the underlying data.
Finally, you need to create tools to monitor the health of a pipeline, so that your teams are
alerted the instant a problem arises, and troubleshoot any errors and stoppages that emerge,
bringing us to the next points.
Performance
A good data pipeline must perform well and deliver data before it becomes too stale to be
actionable. It must also avoid interfering with critical business operations while running.
One approach is to incrementally capture updates using logs, time stamps or triggers instead
of querying entire production tables or schemas. A commonly-used term for this approach,
especially when reading from databases, is change data capture.
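As a simplified sketch of timestamp-based incremental extraction, the query below reads only rows modified since the last successful sync; the orders table and updated_at column are assumptions, and log-based change data capture would read the database's transaction log rather than querying tables at all.

-- Incremental extraction: read only what changed since the last sync,
-- rather than scanning the entire production table
SELECT *
FROM orders
WHERE updated_at > '2024-01-01 00:00:00'  -- high-water mark recorded after the previous sync
ORDER BY updated_at;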
Reliability
Performance is not very meaningful without consideration for reliability. There are many reasons
a synchronization of data from one location to another may fail. One common reason is simply
that schemas upstream change. Depending on how the data pipeline is architected, new or
deleted tables and columns can completely derail the operation of the data pipeline.
Bugs and hardware failures of various kinds are also common. Your pipeline could experience
memory leaks, network outages, failed queries and more. In order to recover from failed syncs,
you will need to build an idempotent system, in which operations can be repeated to produce the
same outcome after the initial application.
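One common way to achieve idempotency is to load through a merge (upsert) keyed on a primary key, so re-running a sync updates or inserts rows rather than duplicating them. The table names below are hypothetical.

-- Re-running this statement after a failed or repeated sync leaves the destination
-- in the same state: rows are updated or inserted by key, never duplicated
MERGE INTO analytics.orders AS target
USING staging.orders AS source
    ON target.order_id = source.order_id
WHEN MATCHED THEN
    UPDATE SET amount = source.amount, updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (order_id, amount, updated_at)
    VALUES (source.order_id, source.amount, source.updated_at);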
As the number of your connectors and users grow, you may also eventually need to design a
system to programmatically control your data pipeline.
Security
For the sake of regulatory compliance in many jurisdictions, your data pipeline shouldn’t expose
or store personally identifiable information. All traffic across your infrastructure must be
encrypted and your system should use logical and process isolation to ensure sensitive data isn’t
erroneously sent to the wrong destination.
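One pattern, sketched below with hypothetical tables and columns, is to expose only hashed or coarsened versions of personally identifiable fields to downstream users; sensitive fields can also be hashed or blocked before the data ever reaches the destination.

-- Analysts query the masked view, never the raw table containing PII
CREATE OR REPLACE VIEW analytics.users_masked AS
SELECT
    user_id,
    SHA2(email)           AS email_hash,     -- irreversible hash instead of the raw address
    LEFT(postal_code, 3)  AS postal_prefix,  -- coarsened location
    created_at
FROM raw.users;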
These elements may include support for transformations, data activation and machine learning,
as well. Together, the cloud-based elements of a modern data stack make data-driven decisions
radically simpler to implement than with traditional on-premise or legacy technologies.
These are organizational considerations that have more to do with how you plan to grow and
sustain your organization in the future than the technical characteristics of each tool.
Fivetran offers a zero-touch approach to data integration into common data warehouses
like Snowflake.
As a repository for your data, your choice of data warehouse may very well be the most
financially impactful part of your data stack. Upgrades and changes to your data warehouse
involve data migrations and are very costly.
A business intelligence tool might support bar, column and pie charts, trend lines, heat maps
and many other types of visualizations.
The first and most obvious reason is that your organization may be very small or operate with a
very small scale or complexity of data. You might not have data operations at all if you are a tiny
startup still attempting to find product-market fit. The same might be true if you only use one or
two applications, are unlikely to adopt new applications and your integrated analytics tools for
each application are already sufficient.
A second reason not to purchase a modern data stack is that it may not meet certain
performance or regulatory compliance standards. If nanoseconds of latency can make or break
your operations, you might avoid third-party cloud infrastructure and build your own hardware.
A third reason is that your organization is in the business of producing its own specialized
software products, and using or selling the data produced by that software. What if you are
a streaming web service that produces terabytes of user data every day and also surfaces
recommendations for users? Even so, your organization may still outsource data operations for
external data sources.
It may be costly to end existing contracts for products or services. Beyond money, familiarity
with and preference for certain tools and technologies can be an important consideration.
Ensure that prospective solutions are compatible with any products and services you intend
to keep.
Calculate the cost of your current data pipeline. The main factor is likely to be the amount of
engineering time your data team spent building and maintaining the data pipeline. This may
require a careful audit of tools you use for project management.
You’ll need to consider the sticker price of the tools and technologies involved as well. Finally,
you’ll also need to consider any opportunity costs incurred by failures, stoppages and downtime.
Include the costs of your data warehouse and BI tool as well.
On the other side of the ledger, you will want to evaluate the benefits of the potential
replacement. Some may not be very tangible or calculable (i.e., improvements in the morale of
analysts), but others, such as time and money gains, can be readily quantified.
Set up connectors between your data sources and data warehouses, and measure how much
time and effort it takes to sync your data. Perform some basic transformations. Set aside
dedicated trial time for your team and encourage them to stress-test the system in every way
imaginable. Compare the results of your trial against your standards for success.
While you may have ruled out technical barriers, there can be institutional barriers to adopting a
modern data stack. Your data team could lack funding, headcount or expertise. Data engineers
might be protective of the systems they have built. Leaders might not recognize the power
offered by the ability to rapidly scale data integration. It is important to earn buy-in from
someone with the authority to purchase the necessary tools and technologies, and to carefully
cultivate a modern data mindset from the very start of your journey.
A carefully constructed minimum viable product (MVP) demonstration that proves the
worthiness of the modern data stack on a single data source, report or test case can
accomplish exactly that.
As your data needs grow and become more complex, you will need to expand your roster of
analysts. There are arguments in favor of both centralized and decentralized data teams, but a
good compromise is both, in the guise of a hub-and-spoke model.
The best time to make this effort is as you start fully implementing your modern data stack, as
you will need to take inventory of all data assets anyway.
Get data governance under control early in order to build trust. Without a clear provenance for
every data model, it will be difficult for the end users of your data to make sense of how metrics
are determined and resolve conflicting narratives.
In a nutshell, product thinking means understanding your users (i.e. individual contributors
and leadership in your organization) and rapidly, iteratively producing and refining products
in response to changing conditions. In a little more detail, the product process for data
assets involves:
Identify
• Understanding users
• Gathering requirements
Design
• Defining scope
• Managing expectations
You will need to build a roadmap regarding how data assets will improve decision-making at
your organization. List the milestones and end goals, as well as the information and insights
necessary to achieve them. Your roadmap is something you can bring to your leadership with the
possibility that it will be incorporated into your organization’s strategy.
It is important to recognize these dangers in advance and to communicate the benefits of
becoming data-driven in clear terms that appeal to your audience’s priorities, not just your own.
Avoid buzzwords and deep technical details in favor of emphasizing how desired,
data-driven outcomes are achieved. Data-based storytelling is key, and real examples or
proof points are invaluable – you will have to seamlessly combine strategy, visualizations and
narratives. This will inspire people to become data literate and begin using data.
Going forward, making data literacy a key criterion for hiring will make it easier to improve
the data capabilities of your teams. Remember, the baseline isn’t writing SQL or building new
models but a general ability to interpret graphs and tables and make decisions accordingly.
Data scientists combine expertise in applied statistics and linear algebra with enough
engineering chops to prototype machine learning models, bringing them into production
with the help of data engineers.
Many organizations jump the gun and hire data scientists to do the work of analysts or data
engineers before building a fully-fledged data hierarchy. This is a mistake. It is best to hire data
scientists as your data infrastructure matures and after your organization has fulfilled more
basic data needs.
Assign data scientists to design, build, test and tune machine learning models as your
organization pursues predictive modeling and artificial intelligence.
By now you should have a full appreciation of the modern data stack, and the modern data
mindset required to build it into your organization, surrounded by motivated teams and
supportive leadership. Good luck on your journey, and thank you for reading!
Fivetran can help you with the first step of your analytics modernization journey.