Data Contracts Early Release 042024
Editors: Melissa Potter and Aaron Black Cover Designer: Karen Montgomery
Production Editor: Katherine Tozer Illustrator: Kate Dullea
Interior Designer: David Futato
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Contracts, the cover image, and
related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained in this work is at your
own risk. If any code samples or other technology this work contains or describes is subject to open
source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-098-15763-0
CHAPTER 1
Why the Industry Now Needs Data Contracts
We believe that a data contract, an agreement between data producers and consumers
that is established, updated, and enforced via an API, is necessary for scaling and
maintaining data quality within an organization. Unfortunately, data quality and its
foundations, such as data modeling, have been severely deprioritized with the rise of
big data, cloud computing, and the Modern Data Stack. Though these advancements
enabled the prolific use of data within organizations and codified professions such as
data science and data engineering, their ease of use also came with a lack of constraints,
leading many organizations to take on substantial data debt. With pressure for data
teams to move from R&D to actually driving revenue, as well as the shift from
model-centric to data-centric AI, organizations are once again accepting that data
quality is a must-have instead of a nice-to-have. Before going in depth about what data
contracts are and how they are implemented, this chapter highlights why our industry
has forgone data quality best practices, why we are prioritizing data quality again, and
the unique conditions of the data industry post-2020 that warrant data contracts as the
way to drive data quality.
According to Bill, the data warehouse is more than just a repository; it’s a subject-
oriented structure that aligns with the way an organization thinks about its data
from a semantic perspective. This structure provides a holistic view of the business,
allowing decision-makers to gain a deep understanding of trends and patterns, ultimately
leveraging data for visualizations, machine learning, and operational use cases.
For a data structure to be a warehouse, it must provide three core capabilities:
• First, the data warehouse is designed around the key subjects of an organization,
such as customers, products, sales, and other domain-specific aspects.
• Second, data in a warehouse is sourced from a variety of upstream systems
across the organization. The data is unified into a single common format, resolving
inconsistencies and redundancies. This integration is what creates a single
source of truth and allows data consumers to take reliable upstream dependencies
without worrying about replication.
• Third, data in a warehouse is collected and stored over time, enabling historical
analysis. This is essential for time-bounded analytics, such as understanding
how many customers purchased a product over a 30-day window, or observing
trends in the data that can be leveraged in machine learning or other forms of
predictive analytics. Unlike operational databases, which constantly change as new
transactions arrive, the warehouse preserves a stable historical record.
The creation of a data warehouse usually begins with an Entity Relationship Diagram
(ERD), as illustrated in Figure 1-1. ERDs represent the logical and semantic
structure of a business's core operations and are meant to provide a map that can
be used to guide the development of the warehouse. An entity is a business subject
that can be expressed in a tabular format, with each row corresponding to a unique
subject unit. Each entity is paired with a set of dimensions that contain specific details
about the entity in the form of columns. For example, a customer entity might contain
dimensions such as:
Customer_id
A unique string identifying each new customer registered to the website
Birthday
A datetime that the customer fills out during the registration process
FirstName
The first name of the customer
LastName
The last name of the customer
Another important element of ERD design is the foreign key. Foreign keys are unique
identifiers that allow analysts to combine data across multiple entities in a single
query. As an example, the customers_table might contain the following relevant
foreign keys:
Address_id
A unique address field which maps to the address_table and contains city, county,
and zip code.
Account_id
A unique account identifier which contains details on the customers account
data, such as their total rewards points, account registration date, and login
details.
By leveraging foreign keys, it is possible for a data scientist to easily derive the
number of logins per user, or count the number of website registrations by city or
state.
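To make this concrete, here is a minimal sketch of such a query against the hypothetical schema described above; the exact table and column names (for example, a city column on the address_table) are assumptions for illustration:

-- Count website registrations by city by following the address_id foreign key
-- from the customer entity to the address entity.
SELECT
    address_table.city,
    COUNT(customers_table.customer_id) AS registrations
FROM
    customers_table
JOIN
    address_table
    ON address_table.address_id = customers_table.address_id
GROUP BY
    address_table.city;

A similar join through account_id would yield logins per user.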
The relationship that any entity has to another is called its cardinality. Cardinality
is what allows analysts to understand the nature of the relationship between entities,
which is essential to performing trustworthy analytics at scale. For instance, if the
The 2000s marked a period of incredible change in the business sector. After the
dotcom bubble of the late 90s, new global superpowers had emerged in the form
of high-margin, low-cost internet startups with a mind-boggling pace of technology
innovation and growth. Sergey Brin and Larry Page had grown Google from a search
engine operating out of a Stanford dorm room in 1998, to a global advertising
behemoth with a market capitalization of over $23 billion by the late 2000s. Amazon
had all but replaced Walmart as the dominant force in commerce, Netflix had killed
Blockbuster, and Facebook had grown from a college people-search app to $153
million a year in revenue in only three years.
One of the most important internal changes caused by the rise of software companies
was the propagation of Agile, a software development methodology popularized by
the consultancy Thoughtworks. Up until this point, software releases were managed
sequentially and typically required teams to design, build, test, and deploy entire
products end-to-end. This waterfall model was similar to movie releases, where the
customer gets a complete product that has been thoroughly validated and gone through
rigorous QA. Agile, however, was different. Instead of waiting for an entire product to
be ready to ship, releases were managed far more iteratively, with a heavy focus on
rapid change management and short customer feedback loops.
Rapid Growth
The Modern Data Stack grew rapidly in the ten years between 2012 and 2022. This
was the time teams began transitioning from on-prem-only applications to the
cloud, and it was sensible for their data environments to follow shortly after. After
adopting the core tooling like S3 for data lakes and an analytical data environment
such as Snowflake or Redshift, businesses realized they lacked core functionality in
data movement, data transformation, data governance, and data quality. Companies
needed to replicate their old workflows in a new environment, which led data teams
to rapidly acquire a suite of tools in order to make all the pieces work smoothly.
Other internal factors contributed to the acquisition of new tools as well. IT teams,
which were most commonly responsible for procurement, began to be supplanted.
Problems in Paradise
Despite all the excitement around the Modern Data Stack, noticeable cracks began
to emerge over time. Teams that were beginning to reach scale were complaining:
the amount of tech debt was growing rapidly and pipelines were breaking.
However, over time, changes upstream and downstream impact the evolution of
this query in subtle ways. The software engineering team decides to distinguish
between visits and impressions. An impression is any application open, to a new
or previously viewed screen, whereas a visit is defined as a period of activity lasting
more than 10 seconds. Before, all "visits" were counted towards the active customer
count. Now some percentage of those visits would be logged as impressions. To account
for this, the analyst creates a CASE WHEN statement that defines the new impression
logic, then sums the total number of impressions and visits to effectively get the same
answer as their previous query using the updated data.
WITH impressions_counts AS (
    SELECT
        customer_id,
        SUM(CASE WHEN duration_seconds >= 10 THEN 1 ELSE 0 END) AS visit_count,
        SUM(CASE WHEN duration_seconds < 10 THEN 1 ELSE 0 END) AS impression_count
    FROM
        impressions
    WHERE
        DATE_FORMAT(impression_date, '%Y-%m') = DATE_FORMAT(CURDATE(), '%Y-%m')
    GROUP BY
        customer_id
    HAVING
        (visit_count + impression_count) >= 3
)
SELECT
    COUNT(DISTINCT customers.customer_id) AS active_customers
FROM
    customers
LEFT JOIN
    impressions_counts
    ON impressions_counts.customer_id = customers.customer_id
WHERE
    -- Visits and impressions are summed so the result matches the original
    -- "3 or more visits" definition of an active customer.
    COALESCE(impressions_counts.visit_count, 0)
        + COALESCE(impressions_counts.impression_count, 0) >= 3;
The more the upstream changes, the longer these queries become. All the context
about why CASE statements or WHERE clauses exist is lost. When new data developers
join the company and go looking for existing definitions of common business
concepts, they are often shocked at the complexity of the queries being written
and cannot interpret the layers of tech debt that have crusted over in the analytical
environment. Because these queries are not easily parsed or understood, data teams
Figure 1-3. The impact of the varying levels of data volume and noise on predictability.
Example 1-3. Code to generate data volume and noise graphs in Figure 1-3.
import numpy as np
import matplotlib.pyplot as plt

def generate_exponential_data(min_x, max_x, num_points, noise):
    # Exponential curve with additive Gaussian noise.
    x_data = np.linspace(min_x, max_x, num_points)
    y_data = np.exp(x_data * 2)
    y_noise = np.random.normal(loc=0.0, scale=noise, size=x_data.shape)
    y_data_with_noise = y_data + y_noise
    return x_data, y_data_with_noise

def plot_curved_line_example(min_x, max_x, num_points, noise, plot_title):
    # Fixed seed so each panel is reproducible.
    np.random.seed(10)
    x_data, y_data = generate_exponential_data(min_x, max_x, num_points, noise)
    plt.scatter(x_data, y_data)
    plt.title(plot_title)
    plt.show()

example_params = {
    'small_data_high_noise': {
        'num_points': 100,
        'noise': 25.0,
Furthermore, as the data industry has matured, less emphasis has been placed on
developing ML models and the focus has instead turned to putting ML models in
production. Early data science teams could get by with a few STEM PhDs toiling away
in Jupyter notebooks for R&D purposes, or sticking to traditional statistical learning
methods such as regression or random forest algorithms. Fast forward to today,
and data scientists have a plethora of advanced models they can quickly download
from GitHub, or they can leverage AutoML via dedicated vendors or products within
their cloud provider. Also, there are entire open-source ecosystems such as scikit-learn
or TensorFlow that have made developing ML models easier than ever before. It's
simply not enough for a data team to create ML models to drive value within an
organization.
Conclusion
In this chapter we provided an overview of the historical and market context as to why
data quality has been deprioritized in the data industry for the past two decades.
In addition, we highlighted how data quality is again being deemed integral as
we evolve from the Modern Data Stack era and shift towards data-centric AI. In
summary, this chapter covered:
• Why the industry forwent data quality best practices during the rise of big data,
cloud computing, and the Modern Data Stack.
• Why data quality is being prioritized again as data teams are pressured to drive
revenue and as AI shifts from model-centric to data-centric.
• The unique conditions of the data industry post-2020 that warrant data contracts
as a way to drive data quality.
In Chapter 2, we will define data quality and how it fits within the current state of the
data industry, as well as highlight how current data architecture best practices create
an environment that leads to data quality issues.
Additional Resources
• “The open-source AI boom is built on Big Tech’s handouts. How long will it
last?” by Will Douglas Heaven
• “The State Of Big Data in 2014: a Chart” by Matt Turck
• “The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape”
by Matt Turck
• “A Chat with Andrew on MLOps: From Model-centric to Data-centric AI” by
Andrew Ng
One of the early mistakes Mark made in his data career was trying to internally sell
data quality on the merits of what pristine data could provide the organization. The
harsh reality is that, beyond data practitioners, very few people in the business care
about data; they instead care about what they can do with data. Coupled with data
being an abstract concept (especially among non-technical stakeholders), screaming
into the corporate void about data quality won't get one far, as it's challenging to
connect it to business value, and thus data quality is relegated to being a
"nice-to-have" investment. This dynamic changed dramatically for Mark when he
stopped trying to internally sell pristine data and instead focused on the risk poor data
quality posed to important business workflows (often the ones driving revenue). In this
chapter, we expand on this lesson by defining data quality, highlighting how our
current architecture best practices create an environment for data quality issues, and
examining what the cost of poor data quality is for the business.
Defining Data Quality
"What is data quality?" is a simple question that's deceptively hard to answer, given
the vast reach of the concept, but its definition is core to why data contracts are
needed. The first historically recorded form of data dates all the way back to 19,000
BCE, and data quality has been an important factor in every century thereafter,
spanning agriculture, manufacturing, and computer systems; thus, where does one
draw the line? For this book, our emphasis is on data quality in relation to database
systems, with 1970 as the cutoff, given that's when Edgar F. Codd's seminal paper
A Relational Model of Data for Large Shared Data Banks kicked off the discipline of
relational databases.
During this time, the field of Data Quality Management emerged, with prominent
voices such as Richard Y. Wang from MIT's Total Data Quality Management program
formalizing the discipline. In their most-cited research article, published in 1996,
Dr. Richard Wang and Dr. Diane Strong define data quality as "data that are fit for use
by data consumers…" across the following four dimensions: 1) conformity to the
true values the data represents, 2) pertinence to the data user's task, 3) clarity in the
presentation of the data, and 4) availability of the data.
Throughout the academic works of Wang and colleagues, there is a massive empha‐
sis on the ways in which the field is interdisciplinary and is greatly impacted by
“...new challenges that arise from ever-changing business environments, ...increasing
varieties of data forms/media, and Internet technologies that fundamentally impact
how information is generated, stored, manipulated, and consumed.” Thus, this is
where this book’s definition of data quality diverges from the 1996 definition above.
Specifically, our viewpoint of data quality is greatly shaped by the rise of cloud
infrastructure, big data for data science workflows, and the emergence of the modern
data stack between the 2010s and the present day.
We define data quality as “an organization’s ability to understand the degree of cor‐
rectness of its data assets, and the tradeoffs of operationalizing such data at various
degrees of correctness throughout the data lifecycle, as it pertains to being fit for use
by the data consumer.”
We especially want to emphasize the phrase “... tradeoffs of operationalizing such data
at various degrees of correctness…” as it’s key to a major shift in the data industry.
Specifically, the term NoSQL was coined in 1998 by Carlo Strozzi and popularized
again in 2009 by Johan Oskarsson (source: https://www.quickbase.com/articles/timeline-of-
database-history). Since then, there has been a proliferation of ways data is stored
beyond a relational database, leading to increased complexity and tradeoffs for data
infrastructure. As noted earlier, one popular tradeoff was the rise of the Modern Data
Stack, which opted for ELT and data lakes. In this paradigm, many data teams have
forgone the merits of proper data modeling to instead have vast amounts of data that
can be quickly iterated on for data science workflows. Though it would be easier to
have a standard way of approaching data quality for all data use cases, we must
remember that data quality is as much a people and process problem as a technical
problem. Being cognizant of the tradeoffs being made by data teams, for better or
worse, is key to changing the behavior of individuals operating within the data
lifecycle.
In addition, we also want to emphasize the phrase "...ability to understand the
degree of correctness..." within our definition. A common pitfall is the belief that
perfect data is required for data quality, resulting in unrealistic expectations among
stakeholders. The unfortunate reality is that data is in a constant state of decay,
requiring consistent monitoring and iteration that will never be complete. By shifting
the language from a "desired state of correctness" for data assets to a "desired
process for understanding correctness" among data assets, data teams account for the
ever-shifting nature of data and thus its data quality.
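To illustrate what a recurring process for understanding correctness might look like in practice, here is a minimal sketch of a scheduled check against a hypothetical orders table; the table, its columns, and the checks themselves are assumptions rather than a prescribed standard:

-- Measure, rather than assume, the current degree of correctness of a
-- hypothetical orders table: completeness, validity, and freshness.
SELECT
    COUNT(*) AS row_count,
    SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_customer_id_rate,
    SUM(CASE WHEN order_total < 0 THEN 1 ELSE 0 END) AS negative_order_totals,
    MAX(updated_at) AS most_recent_update
FROM
    orders;

Tracking how these numbers drift over time, and deciding which levels of drift are acceptable for which consumers, is exactly the tradeoff conversation our definition points to.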
This splitting of databases marks an important inflection point in the data maturity of
an organization. The split provides substantial gains in understanding organizational
data at the cost of increased complexity, and it also creates silos in the respective OLTP
and OLAP "data worldviews" that lead to miscommunication. Please note that there
are other database formats, such as NoSQL or data lakes, but we've placed our
emphasis here on relational OLTP and OLAP databases for simplicity.
Under the OLTP worldview, databases focus heavily on the speed of transactions on
user logs, with emphasis on three main attributes:
The above three attributes enable low-latency data retrieval, allowing user interfaces
to quickly and reliably show correct data to a user in sub-second time rather than
minutes.
On the OLTP side of the data flow, illustrated in Figure 2-2, you will see software
engineers as the main persona utilizing these databases; they are often the individuals
implementing such databases before a data engineer is hired. This persona's role
heavily emphasizes the maintainability, scalability, and reliability of the OLTP database
and the related product software; data itself is a means to an end and not their
main focus. Furthermore, while product implementations will vary, requirements and
scoping are often clear, with tangible outcomes.
Though this additional database increases the complexity of the data system, the
tradeoff is that this increased flexibility enables the business to discover new opportu‐
nities not fully apparent in the product’s CRUD data format.
On the OLAP side of the data flow, illustrated in Figure 2-2, you will see data
analysts, data scientists, and ML engineers as the primary roles working solely within
OLAP data systems. Key to these roles is the iterative nature of analytics and ML
workflows, hence the flexibility of the resulting data models compared to OLTP's
third normal form data.
-- Aquarium Table
+-----------+-------------+-----------------+-------------+---------------------+
|aquarium_id|aquarium_name|aquarium_location|aquarium_size|adult_admission_price|
+-----------+-------------+-----------------+-------------+---------------------+
|123456 |'Monteray...'|'Monteray, CA' |'Large' |25.99 |
+-----------+-------------+-----------------+-------------+---------------------+
-- Reviews Table
+-----------+---------+-------+------------+----------------+-----------+
|aquarium_id|review_id|user_id|number_stars|review_timestamp|review_text|
+-----------+---------+-------+------------+----------------+-----------+
|123456 |00001 |1234 |1 |2019-05-24 07...|'Fish we..'|
|123456 |00002 |1234 |5 |2019-05-25 07...|'I revi...'|
|123456 |00003 |5678 |5 |2019-05-29 05...|'Amazin...'|
+-----------+---------+-------+------------+----------------+-----------+
In Example 2-2 below, we have Kelp's review data represented as a denormalized wide
table created by the data analyst in the OLAP database, along with an ad-hoc table
of various ways in which average stars can be represented. Beyond questions around
average stars calculations, the data analyst would also consider other nuances of the
business logic, such as:
• How to calculate website session duration where multiple sessions are near each
other.
• Are there instances of single users with multiple Kelp accounts?
• What session duration change is relevant to the business?
• Is it reviews directly or a combination of attributes leading to changes in session
duration?
Example 2-2. Example of Kelp’s Denormalized Wide Table Used by Data Analyst
Given these differences in data worldviews, how can a data team determine which
perspective for calculating average stars is correct? In reality, both data worldviews
are correct, depending on the constraints the individual, and ultimately the business,
cares about. On the OLTP side, Kelp's software engineers cared about the simplicity
of the feature implementation and wanted to avoid any additional complexity
unless deemed necessary; hence the default to averaging all the star reviews rather
than applying business logic. In other words, it's a product decision rather than a
data decision.
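As a rough sketch of how the two worldviews diverge in practice, the queries below contrast the product default (average every star rating) with an analyst variant that applies a business rule first; the reviews table follows the columns shown earlier, while the specific rule (keep only each user's most recent review per aquarium) is an assumption for illustration:

-- Product/OLTP default: average every review as-is.
SELECT
    aquarium_id,
    AVG(number_stars) AS avg_stars_all_reviews
FROM
    reviews
GROUP BY
    aquarium_id;

-- Analyst/OLAP variant: keep only each user's most recent review per aquarium
-- (an assumed business rule), then average.
SELECT
    r.aquarium_id,
    AVG(r.number_stars) AS avg_stars_latest_per_user
FROM
    reviews r
JOIN (
    SELECT aquarium_id, user_id, MAX(review_timestamp) AS latest_ts
    FROM reviews
    GROUP BY aquarium_id, user_id
) latest
    ON latest.aquarium_id = r.aquarium_id
    AND latest.user_id = r.user_id
    AND latest.latest_ts = r.review_timestamp
GROUP BY
    r.aquarium_id;

With the sample rows shown earlier, the first query averages the ratings 1, 5, and 5, while the second drops user 1234's earlier one-star review; reconciling exactly these kinds of discrepancies is the analyst's job.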
Data debt
Data debt is a measure of how complex your data environment is and of its capacity
to scale. While the debt itself doesn't represent a breaking change, there is a direct
correlation between the amount of data debt and the development velocity of data
teams, the cost of the data environment as a whole, and its ultimate scalability.
There are a few heuristics for measuring data debt.
Trustworthiness
The trustworthiness of the data is an excellent leading indicator because it correlates
strongly with an increase in replication and (as a result) rising data costs. The less
trust data consumers have in the data they are using, the more likely they are to
reinvent the wheel to ultimately arrive at the same answer.
Trustworthiness can be measured through both qualitative and quantitative
methodologies. A quarterly survey of the data team is a strong temperature check;
for example, it might ask:
How much do you agree or disagree with the following statements:
Additionally, the amount of replicated data assets is a fuzzy metric that is correlated
with trust. The more trustworthy a dataset is, the less likely it is to be rebuilt using
slightly different logic to answer the same question. When this scenario does occur,
it usually means that either (a) the logic of the dataset was not transparent to data
developers, which prompted a lengthy amount of discovery that ultimately ended in
replication, or (b) the data asset was simply not discoverable, meaning that it likely
would have been reused if only it had been easier to find. In this author's personal
experience, the latter case is rarer than data teams might assume. A motivated data
developer will find the data they need, but no matter how motivated they are, they
won't take a dependency on it for a business outcome if they can't trust it!
Ownership
Ownership is a predictive metric, as it measures the likelihood and speed with which
errors will be resolved upstream of the quality issue when they happen. Ownership can
be measured at the individual table level, but I recommend measuring the ownership as
Data downtime
Data downtime is becoming one of the most popular metrics for data engineering
teams attempting to quantify data quality. Downtime refers to the amount of time
critical business data can't be accessed due to data quality, reliability, or accessibility
issues. We recommend a three-step process for tracking and acting on data
downtime:
Track and Analyze Downtime Incidents
Keep a log of all data incidents, including their duration, cause, and resolution
process. This data is crucial for understanding how often downtime occurs, its
common causes, and how quickly your team can resolve issues.
Calculate Downtime Metrics
Use the collected data to calculate specific metrics, such as the average downtime
duration, frequency of downtime incidents, mean time to detect (MTTD) a
data issue, and mean time to resolve (MTTR) the issue. These metrics provide
a quantitative measure of your data’s reliability and the effectiveness of your
response strategies.
Assess Impact
Beyond just measuring the downtime itself, assess the impact on business opera‐
tions. This can include the cost of lost opportunities, decreased productivity, or
any financial losses associated with the downtime.
Once completed, data engineering teams should have not only a comprehensive view
of how their critical metrics are changing over time but also a clear picture of the
impact of downtime on the business. This impact can be used to make the case to
the business for additional tooling for managing quality, additional headcount, or
greater upstream ownership to prevent problems before they occur.
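As a hedged sketch of the second step, the query below derives a few of these metrics from a hypothetical incident_log table with occurred_at, detected_at, and resolved_at timestamps; both the table and its schema are assumptions rather than a prescribed design:

-- Downtime metrics over the last 90 days from a hypothetical incident log.
-- occurred_at = when the data issue began, detected_at = when it was noticed,
-- resolved_at = when the data was usable again.
SELECT
    COUNT(*) AS incident_count,
    AVG(TIMESTAMPDIFF(MINUTE, occurred_at, detected_at)) AS mttd_minutes,
    AVG(TIMESTAMPDIFF(MINUTE, detected_at, resolved_at)) AS mttr_minutes,
    AVG(TIMESTAMPDIFF(MINUTE, occurred_at, resolved_at)) AS avg_downtime_minutes
FROM
    incident_log
WHERE
    occurred_at >= DATE_SUB(CURDATE(), INTERVAL 90 DAY);

Trending these numbers quarter over quarter provides the before-and-after evidence that makes the case for additional tooling, headcount, or upstream ownership.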
Data engineers. Data engineers often pull the short end of the stick. Because the
word "data" is in their title, business teams assume that any data issue can be dumped
on their plate and promptly resolved. This is anything but true! Most data engineers
do not have a deep understanding of the business logic leveraged by product and
marketing organizations for analytics and ML. When requests to resolve DQ issues
land on data engineers' backlogs, it takes them days or even weeks to track down the
offending issue through root cause analysis and do something about it.
The time it takes to resolve issues leads to an enormous on-call backlog, and tension
with the analytics and AI teams runs high due to a litany of unresolved or partially
resolved outages. This is doubly bad for data engineers, because when things are
running smoothly they are rarely acknowledged at all! Data engineers unfortunately
fall into the rare category of workers whose skills are essential to the business but whose
visibility in the organization is comparatively diminished, because they can't claim
quick wins in the same way a data scientist, software engineer, or product manager
might. Life is not good when the only time people hear about you is when something
is failing!
Data scientists. Data scientists are scrappy builders that often come from academic
backgrounds. In academia, the emphasis is on ensuring that research is properly
conducted, interesting, and ethically validated. The business world represents a sudden
shift from this approach to research, focusing instead on money-making exercises,
executing quickly, and tackling the low-hanging fruit (read: boring problems) first
and foremost. While data scientists all know how to perform validation, they are much
less used to data suddenly changing, not being able to trust data, or data losing its
quality over time.
This makes the data scientist particularly susceptible to the impacts of data quality.
Machine learning models frequently make poor or incorrect predictions. Datasets
developed to support model training and other rigorous analysis go unmaintained
for long periods of time until they suddenly fail. Expected to deliver tangible business
value, data scientists may find themselves in a bind when the model they have been
reporting as making the business millions of dollars turns out to be off by an order of
magnitude, and they only just found out.
Analysts. "Analyst" is a broad term: it could refer to financial analysts making decisions
on revenue data, or product analysts reviewing web logs, clickstream events,
Software engineers. Software engineers are impacted by data quality in a more
roundabout way, in the sense that their changes are usually the root cause of most
problems. So while they may not be impacted in the same way a data engineer, analyst,
or data scientist might be, they often find themselves being shouted at by data
consumers when an incompatible change made upstream causes issues downstream.
This isn't (usually) for lack of trying. Software engineers will often announce
potentially breaking changes on public channels, hoping that any current or would-be
users will notice and prepare for the migration accordingly. Inevitably, though, after
the engineers receive very little feedback and make the change, data teams immediately
start screaming and asking for a rollback.
Conclusion
In this chapter we defined data quality and contextualized it within the current state
of the data industry, including data architecture implications and the cost of poor data
quality. In summary, this chapter covered:
• Defining data quality in a way that looks back to established practices but
accounts for recent changes in the data industry.
• How OLTP and OLAP data architecture patterns create silos that lead to data
miscommunications.
• The cost of poor data quality is a loss of trust in one of the business’s most
important assets.
In Chapter 3, we will discuss the challenges of scaling data infrastructure within the
era of the Modern Data Stack, how scaling data is not like scaling software, and why
data contracts are necessary to enable the scalability of data.
Additional Resources
• "The open-source AI boom is built on Big Tech's handouts. How long will it last?" by
Will Douglas Heaven
• "The State Of Big Data in 2014: a Chart" by Matt Turck
• "The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape" by
Matt Turck
• "A Chat with Andrew on MLOps: From Model-centric to Data-centric AI" by
Andrew Ng
About the Authors
Chad Sanderson is one of the most well-known and prolific writers and speakers on
data contracts. He is passionate about data quality and fixing the muddy relationship
between data producers and consumers. He is a former head of data at Convoy,
a LinkedIn writer, and a published author. Chad created the first implementation
of data contracts at scale during his time at Convoy, and also created the first
engineering guide to deploying contracts in streaming, batch, and event-oriented
environments. He lives in Seattle, Washington, and operates the Data Quality Camp Slack
group and the “Data Products” newsletter, both of which focus on data contracts and
their technical implementation.
Mark Freeman is a community health advocate turned data engineer interested in
the intersection of social impact, business, and technology. His life’s mission is to
improve the well-being of as many people as possible through data. Mark received
his M.S. from the Stanford School of Medicine and is also certified in entrepreneur‐
ship and innovation from the Stanford Graduate School of Business. In addition,
Mark has worked within numerous startups where he has put machine learning
models into production, integrated data analytics into products, and led migrations to
improve data infrastructure.