Over the past five years, we’ve dedicated ourselves to solving this challenge once and for
all, and designed a zero-compromise solution. This has now been deployed across numerous
organizations, including multiple Fortune 50 and Fortune 10 companies, running tens of thousands of pipelines. These organizations include top-tier banks, healthcare and pharmaceutical firms, and technology giants, including hyper-scale leaders.
As we now embark on making our solution widely accessible, I’m excited to take this
opportunity to share our learnings from working with many organizations to get quality
solutions in place. We will introduce the new canonical design for data teams that truly delivers
results across the entire organization, and discuss the improvements that are important, but
are yet to be delivered by any product, including ours.
• Evolution of ETL - how various solutions have evolved over the past few decades, what prompted the new approaches, and what challenges each has run into
• Zero lock-in - why avoiding proprietary formats and walled gardens matters when choosing a solution
• Prophecy’s unique approach that unifies visual and code development - how it addresses some of the goals for the best solution, and our plans to address the rest
• Finally, we’ll touch on modernization - if you have a legacy solution, or have developed
some ad hoc code for Databricks or Snowflake, how can you accelerate your journey?
This is a mission-critical problem: there is an impedance mismatch between code and data that makes it hard, and the challenges of managing state and maintaining correct results become even harder in the face of failures.
• The visual pipeline is self-documenting: you can see the flow of data visually and inspect the data at any point to quickly understand and edit a pipeline
• Complete product that includes search, column-level lineage (for impact analysis and
debugging data errors) and data pipeline observability (for break-fix, freshness of data,
and performance analysis)
With the introduction of Hadoop, and then the early move to the cloud, the existing ETL tools no longer scaled, and their processing engines have been abandoned in favor of high-performance Spark and SQL based cloud platforms. The challenges of coding have re-emerged, with internal frameworks being developed in many companies as they struggle to deal with scale and try to remain productive.
• It brings software engineering best practices to data with git versioning, unit tests,
continuous integration that make the development and deployment process more robust
and enable you to move changes faster to production
The Unified Visual & Code ETL approach pioneered by Prophecy combines the benefits of visual and code development while removing the deficiencies of each. It pairs the usability of Visual ETL, enabling all data users and making them productive, with the power of code and software engineering best practices underneath. It runs natively on distributed cloud data processing engines such as Spark and SQL data warehouses, and seeks to provide a solution that can evolve and remain complete for decades.
As we look to evaluate the solutions, we must first agree on what’s important to solve. Let’s explore this.
• The data engineers who build central data pipelines and standards
• The data scientists who prepare data for machine learning and AI
• The semi-structured data often in JSON coming from APIs and other applications
• The batch pipelines that do 90%+ of ETL built by data engineers, analysts and data
scientists that run regularly for production use cases
• The ad hoc transformations run by data analysts for one-time or exploratory analyses
However, supporting these users and use cases also requires handling the scale at which enterprises operate.
Code ETL enables only a small subset of users. While Python expertise is limited, more data users are comfortable with SQL for simple tasks. As transforms get more complex, however, the SQL to describe them gets more complex as well. One has to tackle the nuanced intricacies of the structure of the data, and all SQL data warehouses add extensions such as scripting with loops and conditionals, which slowly makes SQL resemble Python. dbt Core attempts to address this by introducing Jinja templates and macros, which are built on Python. Ironically, this leads to more tangled code, deviating from the goal of simplification. Apart from enablement, code has issues with productivity that we’ll discuss in the development section. Stepping back, we note that SQL was already there in the 1980s and 1990s - if it had worked well, the ETL tools industry would not have started.
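To make the "tangled code" point concrete, here is a minimal, hypothetical illustration (using the jinja2 library directly rather than dbt itself) of how templated SQL splits one transformation across two languages; the table and metric names are invented for the example.

```python
# Hypothetical dbt-style model: Jinja control flow interleaved with SQL.
# jinja2 is used directly here only to show the rendered result.
from jinja2 import Template

model = Template("""
select
    order_id,
    {% for m in metrics %}
    sum(case when status = '{{ m }}' then amount else 0 end) as {{ m }}_amount{{ "," if not loop.last }}
    {% endfor %}
from {{ source_table }}
group by order_id
""")

# The reader must mentally execute the template to know what SQL actually runs.
print(model.render(metrics=["complete", "refunded"], source_table="raw.orders"))
```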
Visual ETL products such as Informatica had processing engines that run on a single server and thus cannot process data at scale. As data inevitably gets larger, they have relied on query pushdown - essentially chaining a series of visual components that each run SQL queries in an underlying data warehouse, such as Teradata, that has distributed computing.
AbInitio is the one Visual ETL product that has always had a distributed processing engine.
Now as users move to the cloud, more standard distributed processing is available with Spark
or SQL data warehouses. Visual ETL tools that seek to bring their own proprietary processing
engines are no longer viable.
Code ETL can typically transform data at scale with Spark or a SQL data warehouse. This does, however, require developing the code carefully and with a high level of skill.
Visual ETL provides a standard palette of visual operators that is insufficient. Expert users need to add new operators that read and write their proprietary systems and that standardize repeated transformation patterns.
Code ETL does not natively have standards, and after some experience with code development, teams realize that they are repeating the same patterns over and over, and that this non-standard code is hard to maintain. Developers can write shareable functions, but all the business logic ends up in the arguments - saving nothing. Data platform teams thus develop frameworks that aim to standardize these repeated patterns - but such a framework, often expressed as XML or JSON components, is a terrible one to develop all your pipelines against. It is non-standard in the industry, making it difficult to onboard new users, and it lacks tooling support, making it expensive to build pipelines and custom integrations. For example, when you write business logic in a field of a proprietary JSON file, there is no auto-complete, no error-checking, and no debugging - all leading to poor productivity. This framework then has to be maintained by an internal team forever.
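A minimal sketch of the kind of internal framework described above; the operation names and expressions are hypothetical. The point is that the business logic lives in opaque strings inside JSON, where no editor can auto-complete, type-check, or debug it.

```python
import json

# Hypothetical pipeline definition in a home-grown JSON framework.
pipeline_spec = json.loads("""
{
  "name": "customer_revenue",
  "steps": [
    {"op": "read",   "table": "raw.orders"},
    {"op": "filter", "expr": "order_status = 'COMPLETE' AND amount > 0"},
    {"op": "derive", "column": "net_amount", "expr": "amount - discount"},
    {"op": "write",  "table": "gold.customer_revenue"}
  ]
}
""")

# A typo in an "expr" string (say, a misspelled column) is only discovered
# when the framework's interpreter runs the step against the warehouse.
for step in pipeline_spec["steps"]:
    print(step["op"], "->", step.get("expr", step.get("table")))
```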
Summary
We see that Visual ETL and Code ETL each have their advantages; however, neither is up to the task of handling the three basic scales that all enterprises need. We must also set the bar higher as we look to the future. Let’s explore next what the best looks like across the complete lifecycle of development, deployment and observability.
Best in Development
Visual development is the foundation for enabling a large number of users and making them productive. As we consider our choices for the future, we should demand more than the current state of the art.
Intelligent analyses help greatly with productivity and quality; let’s look at three:
• Intelligent auto complete
• Natural language interface
• Impact and break analysis
Intelligent auto-complete
Let’s start with an example that demonstrates what a working solution should do. As you start developing a visual pipeline on the canvas and want to pull in a source component to read data, such as a dataset called customer, there will be numerous versions of ‘customer’, and the development IDE should recommend the customer dataset that is high quality, fresh, and used by multiple users. After you have added the first source, the tables you’re most likely to join with can be suggested, and on picking one, the join conditions can be auto-filled. Auto-complete can be supported by a knowledge graph built on metadata, with a deep understanding of the existing tables, their relationships, and the expressions commonly computed on them.
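As a rough sketch of the idea (not Prophecy’s actual implementation), join suggestions can be ranked from metadata about how tables have been joined before; the table names, join conditions, and counts below are invented.

```python
from collections import Counter

# Hypothetical join history mined from existing pipelines and query logs.
join_history = [
    ("customer", "orders",   "customer.id = orders.customer_id"),
    ("customer", "orders",   "customer.id = orders.customer_id"),
    ("customer", "accounts", "customer.id = accounts.customer_id"),
]

def suggest_joins(table, history, top_n=2):
    """Rank the tables most often joined with `table` and surface the
    most frequent join condition for each, so the IDE can auto-fill it."""
    partners = Counter(right for left, right, _ in history if left == table)
    suggestions = []
    for other, count in partners.most_common(top_n):
        conditions = Counter(cond for left, right, cond in history
                             if left == table and right == other)
        suggestions.append((other, conditions.most_common(1)[0][0], count))
    return suggestions

print(suggest_joins("customer", join_history))
```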
Best in Deployment
Development is only part of the story, and any good solution must make users productive through the entire lifecycle, including deployment and observability. A good deployment process ensures lower error rates, contributing to the trust in data that is essential given the high business impact of decisions made on it. DataOps is a term sometimes used to refer to the application of software best practices to data operations; however, it can mean different things to different people, so let us explain the gist of it. The key elements of a good deployment process are versioning, automated testing, and controlled, automated deployment to production.
Existing Solutions
Code works well for this, since software best practices have proven to scale. Most customers find that git, tests, and continuous integration - coupled with carefully controlled but automated deployment - work well for them. This is, however, hard to set up and hard to use, and test coverage is always low, which translates to low trust in moving changes rapidly to production.
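As an illustration of the testing piece, here is a minimal sketch of the kind of unit test a CI job might run on every commit; the transform and the test data are hypothetical, and a local Spark session is assumed.

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def dedupe_customers(df):
    """Example transform under test: keep one row per customer_id."""
    return df.dropDuplicates(["customer_id"])

@pytest.fixture(scope="session")
def spark():
    # Local Spark session so the test can run inside CI.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_dedupe_customers(spark):
    input_df = spark.createDataFrame(
        [(1, "alice"), (1, "alice"), (2, "bob")],
        ["customer_id", "name"],
    )
    result = dedupe_customers(input_df)
    assert result.count() == 2
    assert result.filter(F.col("customer_id") == 1).count() == 1
```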
Visual ETL tools have their own processes, typically do not use software best practices, and are simply outdated and behind the industry.
Best in Observability
Once data pipelines have been deployed, they need to be observed and managed in
production. It is necessary to provide mechanisms to quickly identify errors and fix them,
whether these errors are breaks in the data pipelines or issues with the quality of data
produced.
As users move to the cloud, making data pipeline performance and cost transparent to the end user is necessary to produce data in a timely manner and within reasonable cost budgets.
On the cloud, however, you pay for every pipeline by the minute, and it is therefore especially important to optimize for performance and thus price. Management of operating costs is always a concern, but it is especially scrutinized during periods of economic uncertainty, so this has rightly become a priority.
Most solutions do not provide an easy way to manage performance and cost.
• As soon as you build a pipeline, you should have the performance and cost displayed in
the development environment - including an analysis that can point you to the resource-
intensive portions of the pipeline.
• For the management team, this should roll up into cost and performance analytics by
projects and organizations so that they can be more deliberate about investments.
The first element of data quality is the ability to write business rules - for example, a user might know the valid range for a number that represents blood pressure and can write an expression for it. Rules can also be recommended by applying pattern matching (AI) to the data distribution. Business rules are highly accurate and cheap to run, but the development effort is high. Machine learning might be able to help generate checks for many columns, extending coverage.
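A minimal sketch of such a business rule as a PySpark check; the column names, plausible range, and alert threshold are assumptions for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

readings = spark.createDataFrame(
    [("p1", 118), ("p2", 430), ("p3", 76)],
    ["patient_id", "systolic_bp"],
)

# Business rule written by someone who knows the domain:
# systolic blood pressure should fall within a plausible range.
rule = (F.col("systolic_bp") >= 50) & (F.col("systolic_bp") <= 250)

violation_rate = readings.filter(~rule).count() / readings.count()

# A pipeline might alert, quarantine rows, or fail when the rate is too high.
if violation_rate > 0.01:
    print(f"Data quality alert: {violation_rate:.1%} of rows violate the blood pressure rule")
```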
The second big piece of data quality is near zero-effort observability. All important datasets should be monitored actively, and any divergence should be flagged. This usually requires data quality monitoring solutions that use machine learning algorithms to fit the data of each dataset and then track divergence over time. These solutions are still maturing; most have issues with scale and with false positives, and have not yet reached critical mass.
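The fitting-and-divergence idea can be as simple as the following sketch, which profiles one column over time and flags values far from the historical distribution; the numbers and the three-sigma threshold are illustrative, and threshold tuning is exactly where false positives creep in.

```python
import statistics

# Hypothetical daily profile of one column, e.g. mean order amount per day.
history = [102.4, 99.8, 101.1, 100.5, 98.9, 101.7, 100.2]
today = 132.6

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z_score = (today - mean) / stdev

# Flag divergence from the fitted history; real systems fit far richer models.
if abs(z_score) > 3:
    print(f"Possible anomaly: today's value {today} is {z_score:.1f} standard deviations from history")
```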
Zero lock-in
Proprietary format lock-in & walled gardens
If you’re developing thousands of data pipelines that contain your business logic, it’s important to ask what standard you are developing against, and how you will transition to another platform if you need to. The correct solution must use an open format that is widely supported.
Visual ETL products lock you into their proprietary format, and you have to continue to pay them to run your business logic. There is also the walled-garden effect: you are stuck with the features provided and cannot use innovation from the open-source community, such as data quality libraries.
Code ETL is typically in open format. Spark and SQL have emerged as the primary
standards in the cloud with multiple vendors supporting each standard with only
minor changes across vendors.
The complete solution cannot be covered here; however, we will cover the core tenets that provide a better core for data pipelines. We continue to build an ever more complete solution on top of this core, with a focus on what is important to our customers.
Prophecy invented a solution that merges visual and code development. This approach has the
following primary tenets:
This is it! Good arguments are simple, and so are good solutions. Now let us understand its implications, which are quite substantial.
Prophecy generates code that is native to the cloud data platforms and designed to the highest performance standards. The generated code matches the readability, structure and performance of the code that the best data engineers develop.
• Visual data pipelines turn into code in open formats with Prophecy, stored in the user’s git (such as GitHub) along with the unit tests for these data pipelines.
• Users develop in git branches for collaboration. Every time the user commits their changes,
all the unit tests are run to ensure that the quality of the data pipeline remains high.
• When the users decide to deploy the data pipeline to production, the assets are
automatically created to reduce errors and have no runtime dependencies on Prophecy.
Extensibility
Prophecy enables coding data engineers to create new visual components using templated, open-format Spark or SQL code that runs at data scale. They can simply define the user interface, which comes with the same capabilities as the built-in components, such as auto-complete and format checks.
• Packages are developed and used with software best practices. The data engineers
develop them using standard code, as projects on Git, with versioning and proper release
process and evolution built in.
• Packages for the data users are extremely simple. Data users can look at the packages
available to them, pick the one they want to use, and it shows up as an extension to the
palette of visual components already available to them, as a new menu item.
• Visual data users can self-serve and build data pipelines, and when they need a new or complex task, they can ask for a visual component to be built for it.
• Visual pipelines use the same small set of high-quality components making them fast to
develop, easy to understand and high quality. The data pipelines built by business data
users run at scale and are ready for production.
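To give a flavor of what sits behind such a component (this is not Prophecy’s actual component format, just a hypothetical sketch), a package might wrap a templated, parameterized Spark transform like the one below, so that only the configuration varies per pipeline.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def standardize_columns(df: DataFrame, rename_map: dict, required: list) -> DataFrame:
    """Reusable transform a visual component could expose:
    rename columns to the team's standard names and enforce required fields."""
    for source_name, standard_name in rename_map.items():
        df = df.withColumnRenamed(source_name, standard_name)
    for column in required:
        df = df.filter(F.col(column).isNotNull())
    return df

# A visual user would fill in only the mapping and required fields;
# the versioned package keeps the logic identical across pipelines.
# standardized = standardize_columns(raw_df, {"cust_id": "customer_id"}, ["customer_id"])
```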
This complete solution, when put in the context of the ecosystem and what different users are
able to achieve shines further light on the power of the solution.
Three Users
Data Engineers: Data engineers are able to build data pipelines visually and to build reusable standards for all other data users. Streaming pipelines are also available to them.
Data Scientists: Data scientists are able to visually build data pipelines, including pipelines for precision ML (such as developing features) and pipelines for generative AI that involve unstructured data transformation and interactions with LLMs.
Data Analysts: Data analysts are able to build pipelines as they would in a self-serve data preparation product, and orchestrate them without needing a second system. The low-code interface is made even more accessible with a no-code natural language interface.
Three Data Types
Structured: Structured data has very strong support, including components to ingest data from databases and to run extract queries in the source systems.
Semi-Structured: Semi-structured data such as JSON or XML can be read directly from APIs. Prophecy has native support for nested schemas in every component, and there are components to flatten these nested schemas to get structured data for more efficient transformations.
Unstructured: Unstructured data such as Slack messages can be directly read from APIs. Specialized components exist to handle PDFs that can extract sections as separate articles. Text can be vectorized using OpenAI or other interfaces and can be stored in vector databases.
Three Pipeline Types
Streaming: Prophecy supports Spark streaming data pipelines, including streaming inputs and outputs such as Kafka.
Regular ETL / Batch: Regular ETL data pipelines are the primary workload.
Ad hoc: Data analysts usually have some ad hoc pipelines using Excel sheets and data from the warehouse; these are very similar to batch ETL pipelines as long as the interface is intuitive enough for data analysts, which is what Prophecy has built.
Three Scales
Users: Prophecy enables all the primary data users in the enterprise with a visual interface, and all data users develop with the same standards, in the same system, to build high-quality pipelines.
Data: Prophecy generates high-quality code that is native to cloud-based distributed (MPP) data processing systems - Spark/Databricks or SQL data warehouses - and can run at very large data scale.
Pipelines: Prophecy enables building packages that have visual components, subgraphs, and configurable pipelines. A very high number of pipelines can be built with these basic blocks in a very standardized and easy-to-manage way.
Development
Low-code with intelligent auto-complete: All data users can build pipelines with the highest productivity, seeing data after every step, with suggestions based on active metadata such as recommendations of the tables the current table most likely combines with, and how.
Active metadata and analysis: Metadata in Prophecy powers lineage across pipelines and projects, with lineage across data platforms coming soon.
No-code with natural language support: The Prophecy data copilot is based on LLMs; as data users describe what they want, it generates complete pipelines, sections of pipelines, expressions for transformations, and also descriptions and commit messages.
Deployment
Versioning and Collaboration: Prophecy turns data pipelines into code on git, and all data users use git branches to develop and collaborate. The visual interface provided by Prophecy makes this easy for users not familiar with git.
Easy test development: Prophecy can generate complete tests that supply input data to individual transforms and check the output against given values or predicates. Instead of starting from scratch, data users only specify the business logic.
Automated deployment and rollback: Manual deployment of data pipelines into production can be error-prone. Prophecy runs all unit tests for each commit; the data pipelines and any dependencies are automatically packaged and deployed, giving users high trust that manual errors have not occurred. On failure, the previous version can be deployed for rollback.
One development in the cloud is the increased use of EL technologies such as Fivetran and its competitors, which include independent products such as Rivery, integrated solutions provided by the major platforms and SaaS applications such as Salesforce, hyperscaler solutions such as AWS ZeroETL, and data platform solutions such as Databricks Arcion. This is the most commoditized part of the stack.
However, ETL products such as Informatica and Prophecy can handle the entire ETL process and are not technology-limited like dbt and Matillion, which do not have the ability to build complete ETL pipelines. The primary focus, though, remains on the transformation of data.
Operationalize Prophecy
As we work with enterprises, they are often convinced of the merit of our approach, and the conversation quickly moves to the journey of adopting Prophecy. This usually involves addressing one or more of the following questions:
• What does the canonical architecture of a team adopting Prophecy look like - what roles are assumed by different users?
• We have already invested in building frameworks - how do we import those into Prophecy and make them more widely available?
• Data platform providers that provide Databricks or SQL Data Warehouse to their
organization
• Data users that use Visual ETL (such as Informatica) or self-serve data prep (such as
Alteryx)
Prophecy enables programming data engineers to provide packages of visual components and
other constructs for use across their team and other lines of business teams to rapidly build
standardized, high-quality data pipelines.
Prophecy supports migration from legacy ETL and self-serve data preparation products, including:
• AbInitio
• IBM DataStage
• Informatica
• Alteryx
• SAS DI (through a partner)
Prophecy has moved multiple organizations (including Fortune 10 companies with tens of thousands of pipelines) completely from legacy ETL solutions to Prophecy on Databricks in the cloud. We can typically provide a fixed-cost migration for the first application, to set up the process and get a first win in production, and then we are happy to work with your preferred SIs for larger efforts.
While we were able to cover the topics at a high level here, there is much more depth to each
topic and real examples of large enterprises that have gone through these journeys. You can
reach us for more details and we’ll be happy to share more and explore how we might be able
to help.
Prophecy is a low-code data transformation platform that offers an easy-to-use visual interface to build, deploy, and manage data
pipelines with software engineering best practices. Prophecy is trusted by enterprises including multiple companies in the Fortune
50 where hundreds of engineers run thousands of ETL workloads every day. Prophecy is backed by some of the top VCs including
Insight Partners and SignalFire. Learn how Prophecy can help your data engineering in the cloud at www.prophecy.io.