WHITEPAPER

The Future of Data Transformation
Raj Bains, CEO and Co-Founder
Prophecy
Self-service is the future of Data Transformation, based on unified visual & code technology.
Data transformation, which encompasses activities like cleaning, combining, filtering, and
aggregating data, is a familiar task for every organization. However, executing these processes
effectively has proven to be an ongoing challenge. Data transformation consistently consumes
a significant portion, around 80-90%, of practitioners’ time, something that has remained
constant over the past decade. A comprehensive solution to address this issue is long overdue.

Over the past five years, we’ve dedicated ourselves to solving this challenge once and for
all, and designed a zero-compromise solution. This has now been deployed across numerous
organizations, including multiple Fortune 50 and Fortune 10 companies, running tens of
thousands of pipelines. This includes top-tier banks, healthcare and pharmaceutical firms, as
well as technology giants, including hyper-scale leaders.

As we now embark on making our solution widely accessible, I’m excited to take this
opportunity to share our learnings from working with many organizations to get quality
solutions in place. We will introduce the new canonical design for data teams that truly delivers
results across the entire organization, and discuss the improvements that are important, but
are yet to be delivered by any product, including ours.

In this white paper, we discuss:

• The evolution of ETL: how various solutions have evolved over the past few decades, what prompted each new approach, and what challenges each has run into

• What's important in designing the best cloud-based solution:

• Supporting the diverse users and scenarios
◦ The three user personas
◦ The three data types
◦ The three pipeline types

• Handling the three scales
◦ The scale of users
◦ The scale of data
◦ The scale of pipelines

• Best-in-class approaches for the entire lifecycle
◦ Development
◦ Deployment
◦ Observability

• Zero lock-in

• We'll introduce Prophecy's unique approach that unifies visual and code development, see how it addresses some of the goals for the best solution, and outline our plans to address the rest

• Finally, we'll touch on modernization: if you have a legacy solution, or have developed ad hoc code for Databricks or Snowflake, how can you accelerate your journey?

Evolution of ETL
Relational operational databases were built in the late 1970’s and got widespread adoption in
the 1980s with Oracle and other databases. However, one cannot run analytics on operational
databases, so this was closely followed by the development of relational analytical databases
or data warehouses with Teradata becoming dominant by the middle 1980s. Since then, the
challenge of moving up to date data into analytical databases, combining and preparing it, and
putting it in the right structure for fast analytics has remained.

This is a mission-critical problem: the impedance mismatch between code and data makes it hard, and the challenges of managing state and maintaining correct results become even harder in the face of failures.

Where is the data transformed?


ETL reads data from source systems, transforms it - including cleaning, combining, filtering, and aggregating - and finally loads it into target systems. The key question is: where does the data transformation occur? If the data fits on one machine, this is simple - you can run a script on that machine. However, if the data does not fit on a single machine, you have two choices: either you use a distributed processing engine like Apache Spark, or you load the data first into a SQL data warehouse such as Snowflake that has distributed processing and transform the data there - now commonly referred to as ELT, although this has been done in Teradata since the mid-1980s.
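To make the distinction concrete, here is a minimal sketch of the two placements of the transform step, assuming PySpark; the paths, table names, and columns (orders, amount, status) are hypothetical placeholders.

# A minimal sketch contrasting ETL (transform in Spark) with ELT (push the
# transform down into the warehouse). Paths and table names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()

# ETL: extract, transform in the distributed engine, then load.
orders = spark.read.parquet("s3://raw/orders")                         # extract
daily = (orders
         .filter(F.col("status") == "complete")                        # clean/filter
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue")))                       # aggregate
daily.write.mode("overwrite").saveAsTable("analytics.daily_revenue")   # load

# ELT: load the raw data first, then run the transform as SQL inside the
# warehouse (e.g. Snowflake), where the distributed processing happens.
elt_sql = """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM raw.orders
WHERE status = 'complete'
GROUP BY order_date
"""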

Evolution of ETL solutions


ETL started as a series of ad hoc scripts that were run on a single machine that extracted
data from the operational databases, transformed it on the machine and loaded it into a data
warehouse. The scripts were fragile and hard to maintain, and there were challenges with
scale - leading to the development of tools.

On-premises Visual ETL started with Informatica's founding in 1993 and was dominant for
the next two decades - coming in as a big improvement in ease of use and productivity.
These products came as complete solutions - supporting development, deployment and
observability. They also had metadata management including data catalogs, master data
management and lineage.

In Informatica, data transformation usually happened in its processing engine on a single server; however, as the data got large, it relied on loading data into SQL data warehouses and doing the transformations as query pushdowns. This is now being introduced as the new ELT approach, but it has been used in the enterprise since the beginning.

The advantages of Visual ETL have been:

• It enables a large number of data users and makes them productive

• The visual pipeline is self-documenting: you can see the flow of data visually and inspect the data at any point to quickly understand and edit a pipeline

• A complete product that includes search, column-level lineage (for impact analysis and debugging data errors) and data pipeline observability (for break-fix, freshness of data, and performance analysis)

With the introduction of Hadoop, and then the early move to the cloud, the existing ETL tools no longer scaled, and their processing engines have been abandoned in favor of high-performance Spark and SQL-based cloud platforms. Again, the challenges of coding have re-emerged, with internal frameworks being developed at many companies as they struggle to deal with scale and try to remain productive.

The advantages of code data engineering have been:

• It brings software engineering best practices to data, with git versioning, unit tests, and continuous integration that make the development and deployment process more robust and enable you to move changes to production faster

• It gives you power and flexibility, and is often used to create reusable frameworks that capture repeated business logic and interact with internal systems for data and metadata

The unified visual & code ETL approach has been pioneered by Prophecy to combine the benefits of visual and code development while removing the deficiencies of each. It combines the usability of visual ETL, enabling all data users and making them productive, with the power of code and software engineering best practices underneath. It runs natively on distributed cloud data processing engines such as Spark and SQL data warehouses, and seeks to provide a solution that can evolve and remain complete for decades.

As we look to evaluate the solutions, we must first agree on what's important to solve; let's explore this next.

What’s important in a solution
What’s important in a solution covers the must-have and the aspirational things that an
organization should expect, and the benefits of each.

Supporting the diverse users and scenarios


There are common types of users, pipelines, and data found in every enterprise, but no product in the market supports them all. Often, different user personas end up requiring different products that do not interact with each other.

The Three Users


Data transformation is done by three primary user personas that must be supported

• The data engineers who build central data pipelines and standards

• The data analysts who build last mile pipelines

• The data scientists who prepare data for machine learning and AI

The Future of Data Transformation | 6


The Three Data Types
Data transformations handle three primary kinds of data

• The structured data coming from operational databases and systems

• The semi-structured data often in JSON coming from APIs and other applications

• The unstructured data such as text and PDFs used in Generative AI

The Three Pipeline Types


Data pipelines are primarily of three types:

• The streaming pipelines built by data engineers

• The batch pipelines, built by data engineers, analysts, and data scientists, that make up 90%+ of ETL and run regularly for production use cases

• The ad hoc transformations run by data analysts for one-time or exploratory analyses

However, supporting these users and use cases also requires handling the scale at which enterprises operate.

The Three Scales


Data transformation solutions need to handle the three primary scales found in most large
organizations.

Scale users with self-service


Enterprises have an ever-increasing number of users who need to access and transform data.
Any solution must enable the vast majority of data users with a self-service solution. Hand-
in-hand with enablement goes productivity - an ideal solution will minimize the time spent on
data preparation, and maximize the time spent on analysis and generating business value.

Visual ETL tools such as Informatica enable users to quickly develop data pipelines visually,
and make them productive, so they do well here. Similarly, data analysts are productive with
self-service data preparation products such as Alteryx.

Code ETL enables only a small subset of users. While Python expertise is limited, more data users are comfortable with SQL for simple tasks. As transforms get more complex, however, the SQL to describe them gets more complex. One needs to tackle the nuanced intricacies of the structure of the data, and all SQL data warehouses add extensions such as scripting with loops and conditionals, which slowly makes SQL more like Python. dbt Core attempts to address this by introducing Jinja templates and macros, which are themselves Python-based. This solution ironically leads to more tangled code, deviating from the goal of simplification. Apart from enablement, code has issues with productivity that we'll discuss in the development section. Stepping back, we notice that SQL was already there in the 1980s and 1990s - if it had worked well, the ETL tools industry would not have started.

Scale data sizes


The size of the data is ever-increasing, and to be widely used, a solution must be able to deal with high volumes of data - defined simply as data that does not fit on a single computer. Many data pipelines will not need scale; however, some always will. It is better to use one solution that handles all use cases rather than take on the complexity of two systems, so scalable systems become the default.

Visual ETL products such as Informatica have a processing engine that runs on a single server and thus cannot process data at scale. As data inevitably gets larger, they have relied on query pushdown - essentially chaining a series of visual components that each run SQL queries in an underlying data warehouse, such as Teradata, that has distributed computing.

AbInitio is the one Visual ETL product that has always had a distributed processing engine.
Now as users move to the cloud, more standard distributed processing is available with Spark
or SQL data warehouses. Visual ETL tools that seek to bring their own proprietary processing
engines are no longer viable.

Code ETL typically can transform data at scale with Spark or SQL data warehouse. This does
however require developing the code carefully and with a high level of skill.

Scale number of pipelines


Maintaining productivity and quality as the number of data pipelines gets into the thousands requires standards. Enterprises create standard mechanisms - for repeated business logic, for governance, and for security and operational best practices. These standards are specific to a particular enterprise, and a mechanism to develop and share them is required.

Visual ETL provides a standard palette of visual operators that is insufficient. The expert users need to add new operators that read and write their proprietary systems, that standardize business logic, and that standardize operational best practices. ETL tools have some half-solutions that usually involve writing a Java or .NET function and providing a bare-bones UI form with boxes - but this does not natively work on Databricks or SQL data warehouses and does not scale.

Code ETL does not natively have standards, and after some experience with code development, teams realize that they are repeating the same patterns over and over, and that this non-standard code is hard to maintain. Developers can write shareable functions, but all the business logic ends up in the arguments - thus saving nothing. Data platform teams therefore develop frameworks that aim to standardize these repeated patterns - but such a framework, often expressed as XML or JSON components, is a terrible interface to develop all your pipelines against. It is non-standard across the industry, making it difficult to onboard new users, and it lacks tooling support, making it expensive to build pipelines and custom integrations. As an example, when you write the business logic in a field in your proprietary JSON file, there is no auto-complete, no error checking, and no debugging - all leading to poor productivity. This framework then has to be maintained by an internal team forever.
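As an illustration of the problem, here is a hypothetical sketch of such a JSON-driven framework, assuming PySpark; the configuration keys, table names, and expressions are invented for this example. The business logic lives in opaque strings, so no IDE can complete, type-check, or debug it.

# Hypothetical example of a JSON-driven internal framework: the business
# logic is embedded in strings, invisible to tooling until runtime.
import json
from pyspark.sql import SparkSession, functions as F

config = json.loads("""
{
  "source": "raw.orders",
  "steps": [
    {"op": "filter",     "expr": "status = 'complete'"},
    {"op": "withColumn", "name": "net", "expr": "amount - discount"}
  ],
  "target": "curated.orders"
}
""")

spark = SparkSession.builder.getOrCreate()
df = spark.table(config["source"])
for step in config["steps"]:
    if step["op"] == "filter":
        # The expression is just a string: no auto-complete, no type checking,
        # and a typo only surfaces at runtime.
        df = df.filter(F.expr(step["expr"]))
    elif step["op"] == "withColumn":
        df = df.withColumn(step["name"], F.expr(step["expr"]))
df.write.mode("overwrite").saveAsTable(config["target"])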

Summary
We see that visual ETL and code ETL each have their advantages; however, neither is up to the task of handling the three basic scales that all enterprises need. As we look to the future, we must also set the bar higher. Let's explore next what best-in-class looks like across the complete lifecycle of development, deployment and observability.

Best in Development
Visual development is the best way to enable a large number of users and make them productive. As we consider our choices for the future, we should demand more than the current state of the art:

Active Metadata and Analyses
Active metadata here means a system, including the analyses powered by it, that proactively assists you in your work rather than being a passive system of record or reporting.

The analyses help greatly with productivity and quality; let's look at three:
• Intelligent auto-complete
• Natural language interface
• Impact and break analysis

Intelligent auto-complete
Let’s start with an example that demonstrates what a working solution will do. As you start
developing a visual pipeline on the canvas and want to pull a source component to read data
such as a dataset called customer, there will be numerous versions of ‘customer’, and the
development IDE should recommend the customer dataset with high-quality, freshness and
one that is used by multiple users. After you have added the first source, the tables you’re
most likely going to join with can be suggested and on picking one, the join conditions can
be auto-filled. Auto-complete can be supported by a knowledge graph built on the metadata
that has a deep understanding of the existing tables, their relationships and the commonly
computed expressions on them.
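A minimal sketch of how such metadata could drive join suggestions; the tables, join keys, and usage counts below are hypothetical, and a real knowledge graph would also carry freshness and quality signals.

# Knowledge graph (simplified): which tables are commonly joined, on which
# keys, and how often that join appears in existing pipelines.
JOIN_GRAPH = {
    "customer": [
        {"table": "orders",    "on": "customer.id = orders.customer_id",    "uses": 412},
        {"table": "addresses", "on": "customer.id = addresses.customer_id", "uses": 57},
    ],
}

def suggest_joins(table: str, top_n: int = 3):
    """Rank candidate joins for a table by how often they occur in metadata."""
    candidates = JOIN_GRAPH.get(table, [])
    return sorted(candidates, key=lambda c: c["uses"], reverse=True)[:top_n]

for suggestion in suggest_joins("customer"):
    print(f'join {suggestion["table"]} on {suggestion["on"]} '
          f'(seen in {suggestion["uses"]} pipelines)')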

Impact analysis and break-analysis


Column-level lineage understands the use of columns across pipelines and has a summary of all the edits made to each column in every pipeline. As you edit a data pipeline and remove a column, it can indicate that this will break a downstream data pipeline and thus prevent the break. Similarly, when a break has happened in production, often in the computation of a particular column, the lineage can indicate where the last few edits were made to that column. For each point of edit, you can then look through the pipeline's version history to see what has recently changed and pinpoint the cause of the break very quickly.
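A minimal sketch of impact analysis over a column-level lineage graph; the pipelines, columns, and edges are hypothetical.

# Edges map a (pipeline, column) to the downstream consumers that read it.
from collections import deque

LINEAGE = {
    ("ingest_orders", "discount"):    [("daily_revenue", "net_amount")],
    ("daily_revenue", "net_amount"):  [("finance_dashboard", "net_amount")],
}

def downstream_impact(pipeline: str, column: str):
    """Breadth-first walk of the lineage graph to find everything that breaks."""
    impacted, queue = [], deque([(pipeline, column)])
    while queue:
        node = queue.popleft()
        for consumer in LINEAGE.get(node, []):
            if consumer not in impacted:
                impacted.append(consumer)
                queue.append(consumer)
    return impacted

# Removing `discount` from ingest_orders flags both downstream consumers.
print(downstream_impact("ingest_orders", "discount"))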

Natural language for business logic


Business needs can be described in natural language, and the system can generate the expression, or an entire pipeline, to compute what the user is asking for. You get good results when the intelligence of the large language model is grounded in the context of the structure of your data, as represented by the knowledge graph. Similarly, the user can highlight any business logic, such as a series of visual components, and ask the system to explain what it is doing in natural language. This can also take over the more mundane tasks, such as generating descriptions of entire pipelines or of changes to pipelines.
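A minimal sketch of what grounding a request in metadata can look like before calling a model; the schema, the prompt format, and the call_llm placeholder are all hypothetical, not a specific product's API.

# Ground the user's request in the known tables and columns so the model
# only works with structures that actually exist.
SCHEMA_CONTEXT = {
    "orders":    ["order_id", "customer_id", "order_date", "amount", "status"],
    "customers": ["customer_id", "name", "region"],
}

def build_prompt(request: str) -> str:
    schema_lines = [f"- {t}({', '.join(cols)})" for t, cols in SCHEMA_CONTEXT.items()]
    return (
        "You generate SQL transformations.\n"
        "Only use these tables and columns:\n"
        + "\n".join(schema_lines)
        + f"\n\nRequest: {request}\nSQL:"
    )

def call_llm(prompt: str) -> str:
    # Placeholder for whatever model API is in use.
    raise NotImplementedError

prompt = build_prompt("total revenue by region for completed orders")
print(prompt)                  # the grounded prompt that would be sent
# sql = call_llm(prompt)       # would return SQL constrained to the schema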

Existing solutions
Almost no solution in the market does very well on these aspects. Visual ETL products provide column-level lineage and therefore some impact analysis. Some code notebooks have started providing natural language interfaces for developing business logic, but the context required to get good results means all queries have to be grounded in active metadata and knowledge graphs, which are often not there.

Best in Deployment
Development is only a part of the story, and any good solution must make users productive through the entire lifecycle, including deployment and observability. A good deployment process can ensure lower error rates, contributing to the trust in the data that is essential given the high business impact of decisions made on that data. DataOps is a term sometimes used to refer to the application of software best practices to data operations; however, it can mean different things to different people, so let us explain the gist of it. Some key elements of a good deployment process are:

Versioning and Collaboration


The ability of multiple users to modify the same data pipelines with version control such as git
is essential to making a large team productive. Software engineers have solved this well, so
we will skip the obvious benefits here. Web native solutions can go a step further and make
simultaneous editing possible as Google does for documents, and Figma does for design -
however, this is not considered essential for data pipelines and remains in the aspirational
category right now.

Easy test development


Pushing changes rapidly from development to production requires confidence that existing working use cases were not broken by the change - and if a use case is broken, the editor must confirm that this change is intended by modifying the test. This is very important since most data pipelines outlive the tenure of the original author. Increased trust requires tests on transforms, which are hard and tedious to develop; consequently, test coverage is abysmal for most deployed pipelines across the industry. The only solution is for the development environment to make test development iteratively simpler and more automated using data profiling, pattern matching, and machine learning.
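For reference, a minimal sketch of what a unit test on a single transform can look like, assuming pytest and a local Spark session; the transform, columns, and expected values are hypothetical.

# Test a single transform in isolation with a small, hand-built input.
import pytest
from pyspark.sql import SparkSession, functions as F

def add_net_amount(df):
    """The transform under test: net = amount - discount."""
    return df.withColumn("net", F.col("amount") - F.col("discount"))

@pytest.fixture(scope="module")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_net_amount(spark):
    input_df = spark.createDataFrame(
        [(100.0, 10.0), (50.0, 0.0)], ["amount", "discount"])
    result = add_net_amount(input_df).select("net").collect()
    assert [row.net for row in result] == [90.0, 50.0]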

Automated deployment and rollback
Deployment must be automated, and if there are any manual steps, something will be
missed - most likely at the most critical moment when stress is high. Data pipelines and their
dependencies (libraries) must be deployed automatically to production by your development
environment, and if something breaks there needs to be a process to rollback to the previous
working version. Once the issues are resolved, the fix should again be deployed rapidly and in
a completely automated manner to minimize data downtime.
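A minimal sketch of the automated deploy-and-rollback flow described here; the test runner, artifact names, and deploy hook are placeholders rather than any specific CI system.

# Deploy only when tests pass; roll back automatically if the deploy fails.
import subprocess

def run_tests() -> bool:
    # Run the pipeline's unit tests before anything is shipped.
    return subprocess.run(["pytest", "tests/"]).returncode == 0

def deploy(version: str) -> bool:
    print(f"deploying pipeline artifacts {version} ...")
    return True  # placeholder for the real deployment step

def release(new_version: str, previous_version: str):
    if not run_tests():
        raise RuntimeError("tests failed; refusing to deploy")
    if not deploy(new_version):
        # Automated rollback to the last known-good version.
        print(f"deploy failed; rolling back to {previous_version}")
        deploy(previous_version)

release(new_version="pipelines-1.4.0", previous_version="pipelines-1.3.2")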

Existing Solutions
Code works well for this, since software best practices have proven to scale. Most customers find that git, tests, and continuous integration - coupled with carefully controlled but automated deployment - work well for them. This is, however, hard to set up and hard to use, and test coverage is always low, which translates to low trust in moving changes rapidly to production.

Visual ETL tools have their own processes, typically do not use software best practices, and are simply outdated and behind the industry.

Best in Observability
Once data pipelines have been deployed, they need to be observed and managed in
production. It is necessary to provide mechanisms to quickly identify errors and fix them,
whether these errors are breaks in the data pipelines or issues with the quality of data
produced.

As users move to the cloud, making data pipeline performance and cost transparent to the end user is necessary to produce data in a timely manner and within reasonable cost budgets.

Monitoring of data pipelines for break-fix


A good monitoring solution should be easy to use and have a fast and reliable process for
users to be alerted to errors and to fix them. The errors should be shown in the pipeline
development environment where the user authored the pipeline. After making the necessary
changes to fix the error, users should be able to quickly deploy the updated pipeline to
production.

Visual ETL tools do a fairly good job of providing this functionality when running on their own execution environment. Code-based solutions are exceptionally poor at this: your code is perhaps in GitHub, deployment artifacts are in Artifactory, the orchestration interface in Airflow is where you find the error, but the business logic runs in Spark or a SQL data warehouse. This system could not be made worse even if it were deliberately designed to waste everyone's time.

Observing cost & performance


Cost and performance dynamics in the cloud are quite different from those of on-premises solutions, where machines were bought for peak workload and then were a sunk cost, with no incremental cost incurred by running slower or taking extra resources unless you were running at the peak workload time.

In the cloud, however, you pay for every pipeline by the minute, and therefore it is especially important to optimize for performance and thus price. Management of operating costs is always a concern, but it is especially scrutinized during periods of economic uncertainty, and so this has rightly become a priority.

Most solutions do not provide an easy way to manage performance and cost.

• As soon as you build a pipeline, you should have the performance and cost displayed in
the development environment - including an analysis that can point you to the resource-
intensive portions of the pipeline.

• For the management team, this should roll up into cost and performance analytics by
projects and organizations so that they can be more deliberate about investments.

Observing data quality


A good data quality solution must provide multiple layers of quality.

The first is the ability to write business rules - for example, the user might know the valid range for a number that represents blood pressure and can write an expression for it. Rules can also be recommended by applying pattern matching (AI) to the data distribution. Business rules are highly accurate and cheap to run, but the development effort is high. Machine learning might be able to help generate checks for many columns, extending coverage.
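A minimal sketch of such a rule-based check, assuming PySpark; the column name and the valid range are hypothetical examples of a user-written rule.

# A business rule written by a user who knows the plausible range.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
readings = spark.createDataFrame(
    [(1, 118), (2, 245), (3, 95)], ["patient_id", "systolic_bp"])

rule = (F.col("systolic_bp") >= 60) & (F.col("systolic_bp") <= 200)

violations = readings.filter(~rule)
violation_rate = violations.count() / readings.count()
print(f"rule violations: {violations.count()} ({violation_rate:.0%})")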

The second big piece of data quality is near zero-effort observability. All important datasets should be monitored actively, and any divergence should be flagged. This usually requires data quality monitoring solutions that use machine learning algorithms to fit the data of each dataset and then track divergence over time. These solutions are still maturing; most have issues with scale and with false positives, and they have not yet reached critical mass.
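A minimal sketch of the divergence-tracking idea on a single metric (daily row counts); the history, threshold, and alerting are hypothetical, and production systems fit much richer models per dataset.

# Flag a metric that drifts too far from its recent history.
from statistics import mean, stdev

history = [10_120, 10_340, 9_980, 10_210, 10_400, 10_050, 10_300]  # last 7 days
today = 6_900

mu, sigma = mean(history), stdev(history)
z_score = (today - mu) / sigma

if abs(z_score) > 3:
    print(f"ALERT: today's row count {today} diverges from history (z={z_score:.1f})")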

Looking at the existing solutions, visual ETL tools have had some basic rule-based data quality, but nothing comprehensive. Code ETL usually does not provide any solution, and the user must buy one or more additional products. A few startups are building point solutions for data quality, but the data platforms or development platforms will start providing this soon.

Zero lock-in
Proprietary format lock-in & walled gardens
If you’re developing thousands of data pipelines that contain your business logic,
it’s important to ask what standard are you developing against? How will you
transition to another platform if you need to? The correct solution must use an
open format that is widely supported.

Visual ETL products lock you into their proprietary format, and you have to continue to pay them to run your business logic. There is also the walled-garden effect, where you are stuck with the features provided and cannot use innovations from the open-source community, such as data quality libraries.

Code ETL is typically in open format. Spark and SQL have emerged as the primary
standards in the cloud with multiple vendors supporting each standard with only
minor changes across vendors.

Prophecy Introduction:
Unify Code & Visual
Prophecy looked at the challenge of building a next-generation solution for data transformation,
one that will evolve gracefully and continue to be an excellent solution for decades.

The complete solution cannot be covered here, however we will cover the core tenets that
provide a better core for data pipelines. We continue to build an ever more complete solution
on top of the core with a focus on what is important to our customers.

Prophecy invented a solution that merges visual and code development. This approach has the
following primary tenets:

• Data pipelines are developed using a visual interface

• Visual data pipelines turn into high quality code instantly

• Code developers can create new packages of visual components

This is it! Good arguments are simple, and so are good solutions. Now let us understand the implications, which are quite substantial.

What does this mean for data transformation?


For data transformation, it is important to understand the impact of the core decisions.

Productivity
For an enterprise to be productive at enabling analytics by providing trusted and timely data, we must enable most data users to self-serve and produce production-grade data pipelines to transform data.

Enable maximum users


A simple visual interface is the only known mechanism to enable a large number of data users to transform data. Whether in an ETL product such as Informatica or a data preparation product for data analysts such as Alteryx, this has proven to work for all data users.

Make every user productive


Visual interfaces are also very productive for development, since a new user can visually see how the data flows from source to target through multiple transformations. Running a pipeline interactively shows the data after each step. Data CoPilot can further lower the barrier by supporting a natural language interface for developing transforms.

Open source code


The visual pipelines are turned into high-quality Spark or SQL code by Prophecy. The Spark code can be in Scala or Python, and the SQL code uses dbt Core as the build system and works with all prominent data warehouses. The code has zero lock-in since it does not depend on Prophecy for execution.

Prophecy generates code that is native to the cloud data platforms and designed to the
highest performance standards. It makes the code match the readability, structure and
performance of the code that the best data engineers develop.
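As a hypothetical illustration only (not actual Prophecy output), readable, native pipeline code of this kind often takes a shape like the following, with one function per component composed into a pipeline.

# One function per visual component, composed into a pipeline.
from pyspark.sql import SparkSession, DataFrame, functions as F

def source_orders(spark: SparkSession) -> DataFrame:
    return spark.table("raw.orders")

def filter_completed(df: DataFrame) -> DataFrame:
    return df.filter(F.col("status") == "complete")

def aggregate_daily_revenue(df: DataFrame) -> DataFrame:
    return df.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

def target_daily_revenue(df: DataFrame) -> None:
    df.write.mode("overwrite").saveAsTable("analytics.daily_revenue")

def pipeline(spark: SparkSession) -> None:
    df = source_orders(spark)
    df = filter_completed(df)
    df = aggregate_daily_revenue(df)
    target_daily_revenue(df)

if __name__ == "__main__":
    pipeline(SparkSession.builder.getOrCreate())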

Software engineering best practices


Prophecy enables the data users to use software best practices to move changes quickly from
development to production with higher quality to produce more trusted data:

• Visual data pipelines turn into code in open formats with Prophecy that is stored in the
user’s git (such as Github) along with the unit tests for these data pipelines.

• Users develop in git branches for collaboration. Every time the user commits their changes,
all the unit tests are run to ensure that the quality of the data pipeline remains high.

• When the users decide to deploy the data pipeline to production, the assets are
automatically created to reduce errors and have no runtime dependencies on Prophecy.

Extensibility
Prophecy enables coding data engineers to create new visual components using templated, open-format Spark or SQL code that runs at data scale. The users can simply define an excellent user interface that comes with the same capabilities as the built-in components, such as auto-complete and format checks.
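A minimal sketch of the kind of reusable, parameterized component a data engineer might package for visual users, assuming PySpark; the component interface shown is hypothetical and not Prophecy's actual API.

# Standardized business logic packaged once, configured by visual users.
from dataclasses import dataclass
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.window import Window

@dataclass
class DeduplicateByKey:
    """Keep the latest record per key."""
    key_column: str
    order_column: str

    def apply(self, df: DataFrame) -> DataFrame:
        window = Window.partitionBy(self.key_column).orderBy(F.col(self.order_column).desc())
        return (df.withColumn("_rank", F.row_number().over(window))
                  .filter(F.col("_rank") == 1)
                  .drop("_rank"))

# A visual user would fill in key_column and order_column in a form;
# from code the same component is used directly, e.g.:
# deduped = DeduplicateByKey("customer_id", "updated_at").apply(customers_df)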

Packages are a concept that Prophecy has brought from code to visual components. Packages simultaneously achieve two goals:

• Packages are developed and used with software best practices. The data engineers
develop them using standard code, as projects on Git, with versioning and proper release
process and evolution built in.

• Packages for the data users are extremely simple. Data users can look at the packages
available to them, pick the one they want to use, and it shows up as an extension to the
palette of visual components already available to them, as a new menu item.

Packages create standards to scale pipelines


This enables data platform teams or expert business data users to create visual components that standardize common operations and make them easy to use for visual data users. This usually includes commonly used business logic, standards for operational logic, and connectors for internal systems. The impact is that:

• Visual data users can self-serve and build data pipelines, and when they need a new or complex task, they can ask for a visual component to be built for it

• Visual pipelines use the same small set of high-quality components making them fast to
develop, easy to understand and high quality. The data pipelines built by business data
users run at scale and are ready for production.

Complete Solution
Prophecy is rapidly building a complete solution around its core technology, with development and deployment at the state of the art and observability increasingly complete.

This complete solution, when put in the context of the ecosystem and of what different users are able to achieve, shines further light on the power of the solution.

Let’s see if this solution meets the requirements we set for ourselves.

Did we meet the requirements we set?


Support Diverse Users and Scenarios

Three Users

• Data Engineers: Data engineers are able to build data pipelines visually and to build reusable standards for all other data users. Streaming pipelines are also available to them.

• Data Scientists: Data scientists are able to visually build data pipelines, including pipelines for precision ML (such as developing features) and pipelines for generative AI that include unstructured data transformation and interactions with LLMs.

• Data Analysts: Data analysts are able to build pipelines as they would in a self-serve data preparation product, and orchestrate them without needing a second system. The low-code interface is made further accessible with a no-code natural language interface.

Three Data Types

• Structured: Structured data has very strong support, including components to ingest data from databases and to run extract queries in the source systems.

• Semi-Structured: Semi-structured data such as JSON or XML can be read directly from APIs. Prophecy has native support for nested schemas in every component, and there are components to flatten these nested schemas into structured data for more efficient transformations.

• Unstructured: Unstructured data such as Slack messages can be read directly from APIs. Specialized components exist to handle PDFs and can extract sections as separate articles. Text can be vectorized using OpenAI or other interfaces and stored in vector databases.

Three Pipeline Types

• Streaming: Prophecy supports Spark streaming data pipelines, with streaming inputs and outputs such as Kafka.

• Regular ETL / Batch: Regular ETL data pipelines are the primary workload.

• Ad hoc: Data analysts usually have some ad hoc pipelines using Excel sheets and data from the warehouse; these are very similar to batch ETL pipelines as long as the interface is intuitive enough for data analysts, which Prophecy has built.

Support Three Scales

• Users: Prophecy enables all the primary data users in the enterprise with a visual interface, and all data users develop with the same standards, in the same system, to build high-quality pipelines.

• Data: Prophecy generates high-quality code that is native to cloud-based distributed (MPP) data processing systems - Spark/Databricks or SQL data warehouses - and can run at very large data scale.

• Pipelines: Prophecy enables building packages that contain visual components, subgraphs, and configurable pipelines. A very large number of pipelines can be built from these basic blocks in a standardized and easy-to-manage way.

Best Practices of Complete Lifecycle

Development

• Low-code with intelligent auto-complete: All data users can build pipelines with the highest productivity, seeing data after every step, with suggestions based on active metadata such as recommendations of the tables the current table is most likely to be combined with, and how.

• Active metadata and analysis: Metadata in Prophecy powers lineage across pipelines and projects, with lineage across data platforms coming soon.

• No-code with natural language support: Prophecy Data Copilot is based on LLMs; as the data users describe what they want, we generate complete pipelines, sections of pipelines, expressions for transformations, and also descriptions and commit messages.

Deployment

• Versioning and Collaboration: Prophecy turns data pipelines into code on git, and all data users use git branches to develop and collaborate. The visual interface provided by Prophecy makes this easy for users not familiar with git.

• Easy test development: Prophecy can generate complete tests that supply input data to individual transforms and check the output against given values or predicates. Instead of starting from scratch, the data users only specify the business logic.

• Automated deployment and rollback: The deployment of data pipelines into production can be manual and error-prone. Prophecy runs all unit tests for each commit, and the data pipelines and any dependencies are automatically packaged and deployed, giving users high trust that manual errors have not occurred. On failure, the previous version can be deployed for rollback.

Observability

• Monitoring of data pipelines: Prophecy monitors the data pipelines that are deployed. Instead of code-based data engineering, where you navigate Airflow logs and Spark logs and sift through code to find the error, Prophecy points to the exact component in the pipeline on an error.

• Observing cost and performance: Prophecy is developing this and it will be coming soon.

• Observing data quality: Prophecy is developing this and it will be coming soon.

Prophecy and modern data stack


The modern data stack is the nascent set of technologies being built on top of the cloud data platforms that provide a Spark or SQL interface.

One development in the cloud is the greater use of EL technologies such as Fivetran and its competitors, which include independent vendors such as Rivery, integrated solutions provided by the major platforms and SaaS applications such as Salesforce, hyperscaler solutions such as AWS ZeroETL, and data platform solutions such as Databricks Arcion. This is the most commoditized part of the stack.

However, ETL products such as Informatica and Prophecy can handle the entire ETL process and are not technology-limited like dbt and Matillion, which do not have the ability to build complete ETL pipelines. The primary focus, though, remains on the transformation of data.

Once the data lands inside the data platform, in what is often called the Bronze or raw layer, it needs to be transformed within the cloud data platform. This is the majority of the work and takes the majority of the time of all data teams. One can choose low-code solutions, code, or Prophecy, which provides the best of both worlds.

Operationalize Prophecy
As we work with enterprises, they are often convinced of the merit of our approach, and the conversation quickly moves to the journey of adopting Prophecy. This usually involves addressing one or more of the following three questions:

• What does a canonical architecture of a team adopting Prophecy look like - what roles are assumed by different users?

• How do I consolidate my legacy ETL pipelines from various on-premises solutions to Prophecy?

• We have already invested in building frameworks, how do we import those into Prophecy
and make them more widely available?

Let’s address each of these.

Data platform and data user teams


We are seeing a canonical pattern across teams that works well in multiple organizations. Let's define three kinds of users that we see (the separation is for clarity only - different kinds of users are split into teams differently across organizations):

• Data platform providers that provide Databricks or SQL Data Warehouse to their
organization

• Data users that use Visual ETL (such as Informatica) or self-serve data prep (such as
Alteryx)

• Programming data engineers that build frameworks or libraries


Prophecy is often provided by the data platform team, in conjunction with Databricks, as a service to multiple lines of business, to ensure that the maximum number of users are able to self-serve and build data preparation pipelines themselves.

Prophecy enables programming data engineers to provide packages of visual components and
other constructs for use across their team and other lines of business teams to rapidly build
standardized, high-quality data pipelines.

The different teams are thus able to work collaboratively, each focusing on what’s important to
them without bottlenecks.

Modernize from Legacy ETL


The Prophecy team has seen that many customers want 'Prophecy on Databricks' or 'Prophecy on a SQL data warehouse' as the target state, but are stuck with thousands of data pipelines built in legacy ETL formats on-premises. The legacy players have made moving out difficult by locking their users into their proprietary formats and requiring manual rebuilding of data pipelines.

To address this concern, Prophecy has a Transpiler team that builds source-to-source compilers that can convert existing pipelines to Prophecy pipelines with 90%+ automation. This typically requires reverse engineering the formats and rewriting them. Currently, we are committed to the following direct Transpiler sources:

• AbInitio
• IBM DataStage
• Informatica
• Alteryx
• SAS DI (through a partner)

Prophecy has moved multiple organizations (including a Fortune 10 company with tens of thousands of pipelines) completely from legacy ETL solutions to Prophecy on Databricks in the cloud. We can typically provide a fixed-cost migration for the first application to be migrated, to set up a process and get a first win in production, and then we are happy to work with your preferred SIs for larger efforts.

Import Frameworks as Packages
Many data platform teams have already built frameworks and libraries based on Spark code in Python or Scala, and some have reusable libraries for SQL built as dbt Core macros. Prophecy has multiple customers where the data team has imported these frameworks into Prophecy, wrapped the functions into visual components, and made the platform available to a large number of teams. This includes traditional enterprise companies and modern technology companies, including a hyperscale player.

Summary
We covered how the ETL industry has evolved since its inception and the various approaches taken, each with its strengths and weaknesses. Then we discussed what a state-of-the-art solution in the cloud should do. With this understanding, we introduced how Prophecy's unique technological approach is tailored to meet these challenges, as we build an ever more complete solution on this core. Finally, we covered how various organizations have moved to Prophecy from their existing solutions, the approaches taken, and their successes.

While we were able to cover the topics at a high level here, there is much more depth to each
topic and real examples of large enterprises that have gone through these journeys. You can
reach us for more details and we’ll be happy to share more and explore how we might be able
to help.

Try Prophecy and realize self-service success
Prophecy helps organizations avoid the common obstacles standing in the way of implementing
self-service analytics. Through a low-code platform, both technical and business data users are
able to easily build high quality data products in a collaborative and transparent way, enabling
faster decision-making, and greater business agility.

Get started with a free trial


You can create a free account and get full access to all platform features for 14 days.

Want more of a guided experience?


Request a demo and we’ll walk you through how Prophecy can empower your entire data
team with low-code ETL today.

Prophecy is a low-code data transformation platform that offers an easy-to-use visual interface to build, deploy, and manage data
pipelines with software engineering best practices. Prophecy is trusted by enterprises including multiple companies in the Fortune
50 where hundreds of engineers run thousands of ETL workloads every day. Prophecy is backed by some of the top VCs including
Insight Partners and SignalFire. Learn how Prophecy can help your data engineering in the cloud at www.prophecy.io.
