04 - Introduction To Synthetic Data

The document discusses synthetic data generation on the watsonx.ai platform. It describes two types of synthetic data (unstructured and structured) and two model types for generating tabular synthetic data (statistical and generative AI), and provides guidance on use cases for each. It also covers privacy techniques like differential privacy and metrics to assess synthetic data quality.

Uploaded by

Dung Nguyen

watsonx.ai
Synthetic data generator


Tom Gaffney
Synthetic Data on watsonx.ai, Product manager

Anshupriya Srivastava
Advisory, Learning Content Development, Data & AI

Instructor:
Farah Auni Hisham
Technical Enablement Specialist | Data & AI
watsonx
The platform for AI and data

watsonx.ai
Train, validate, tune, and deploy AI models
A next generation enterprise studio for AI builders to train, validate, tune, and deploy both traditional machine learning and new generative AI capabilities powered by foundation models. It enables you to build AI applications in a fraction of the time with a fraction of the data.

watsonx.data
Scale AI workloads, for all your data, anywhere
Fit-for-purpose data store, built on an open lakehouse architecture, supported by querying, governance and open data formats to access and share data.

watsonx.governance
Responsible, transparent, explainable AI workflows
End-to-end toolkit for AI governance across the entire model lifecycle to enable responsible, transparent, and explainable AI workflows.

Today’s focus
watsonx.ai
Train, validate, tune and deploy AI models

A next generation enterprise studio for AI builders to build, train, validate, tune and deploy generative AI, foundation models, and machine learning capabilities

• Foundation Model Library with IBM and open-source models
• Prompt Lab to experiment with foundation models and build prompts for various use cases and tasks
• Tuning Studio to tune your foundation models with labeled data
• Data Science and MLOps to build machine learning models automatically with model training, development and visual modeling, and synthetic data generation
Today’s focus
“Real data provides real challenges regarding privacy, security, process, and regulation which…stifles our ability to innovate”
- Telecommunication company

“We developed over 200 models and deployed only 4”
- Financial technology company
The problem
Real data = risk, costs, & delays

Costs²
59% of AI budgets on average are spent on training data.

Heavy penalties¹
$2.3bn in cumulative GDPR fines since 2017, with 50% attributable to non-compliance with general data processing principles.

Transformation delays
4-6 weeks+ for teams to get access to production data, setting back project timelines & delaying progress. This can extend to months for certain industries and data types, like financial institutions and healthcare.
The solution
Real data + synthetic data

Benefits
5 advantages of leveraging synthetic data:
1. Innovation/GTM speed
2. Minimal risk
3. Reduced costs
4. Scale
5. Sharing & monetization

Synthetic adoption¹
60% of all data used for the development of AI and analytics projects will be synthetically generated, by 2024


There are two types of synthetic data
PoC/purchase = Q2’23

Unstructured synthetic data
Data not arranged according to a preset data model or schema, and therefore, cannot be stored in a traditional relational database
Examples are text, images, and videos
watsonx.ai: Large Language foundation models to generate synthetic text (images and video not supported)

Structured synthetic data
Data that has a standardized format, typically tabular with rows and columns that clearly define data attributes
Examples are Excel files or data in relational databases
watsonx.ai: Synthetic Data Generator
Common use cases for synthetic tabular data

Client demo data
Creating synthetic data to tailor demos for particular clients/industries, before real client data becomes available

Employee training assets
Generate data needed to improve the realism of internal training programs

AI model training
Generating more data or edge case data to combine with real data to improve predictive accuracy of AI models

Monetize/share externally
Generating more data or edge case data to combine with real data to improve predictive accuracy of AI model

Extract insights
Create 1-for-1 synthetic copy of sensitive data, to share internally for insight extraction and strategic analysis

Application test data
High-fidelity, synthetic test data to expedite test cases and validation of software functionality, performance, and reliability

What-if assessments
Simulate how synthetic agents’ individual decisions impact macro-level metrics, like fraud, sales, or patient diagnoses
Two model types to generate tabular synthetic data

Statistical
The benefits
• Control
• Cost efficient
• Automated
• User friendly
• Speed
• Privacy
• Custom schema
The shortcomings
• Approximates
• Stats knowledge

Generative-AI
The benefits
• High-fidelity
• Privacy
• Automated
The shortcomings
• Higher cost & longer runtimes
• Less flexibility
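To make the statistical model type concrete, here is a minimal sketch of one very simple statistical generator: fit a mean and standard deviation per numeric column, then sample each column independently from a normal distribution. This is an illustrative assumption, not the method used by the watsonx.ai Synthetic Data Generator. The function names and the toy table are hypothetical.

```python
import random
import statistics


def fit_column_stats(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    columns = list(rows[0].keys())
    return {
        col: (
            statistics.mean([row[col] for row in rows]),
            statistics.stdev([row[col] for row in rows]),
        )
        for col in columns
    }


def sample_rows(stats, n, seed=None):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mean, std) for col, (mean, std) in stats.items()}
        for _ in range(n)
    ]


# Hypothetical toy "real" table, used only for illustration.
real = [
    {"age": 30, "income": 50_000},
    {"age": 40, "income": 70_000},
    {"age": 50, "income": 90_000},
]

stats = fit_column_stats(real)
synthetic = sample_rows(stats, n=1000, seed=42)
```

Because every column is sampled independently, this sketch ignores correlations between columns (here, age and income), which is one way a simple statistical generator only approximates the real data.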


Guidance on which model can enable client use cases

[Table: rows are Statistics-based and Generative AI; columns are Use cases¹, Schema, and Data types²; each cell is rated Good fit, Limited, or Not a good fit]
Privacy protection when leveraging sensitive data

What is differential privacy?
Identifies rare individuals in datasets and adds “noise” to obscure their individually specific information

What are its benefits?
• Can’t identify data specific to an individual
• Protection against 3rd party attackers
• Computational transparency
• Privacy bounds guarantee
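The idea of adding noise can be illustrated with the Laplace mechanism, a standard differential privacy technique. The sketch below releases a count with noise scaled to sensitivity/epsilon. It is a minimal example under stated assumptions and does not describe how watsonx.ai implements privacy protection. The function names, the `ages` data, and the epsilon value are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one individual changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: only one individual is over 90, a "rare" record.
ages = [23, 35, 41, 52, 38, 29, 94, 47]
rng = random.Random(0)
noisy = private_count(ages, lambda age: age > 90, epsilon=1.0, rng=rng)
```

A smaller epsilon means more noise and stronger privacy. The released value no longer reveals with certainty whether the rare individual is in the data set.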
Assessing the quality of the synthetic output

Fidelity
Measures the quality of the synthetic data in terms of its closeness in distribution & correlations to the real data

Privacy
Assesses leakage of real data in synthetic output, as well as membership inference attacks, such as nearest neighbor

Utility
Measures the accuracy and performance of a predictive downstream task where predictive models are trained on the synthetic data

Fairness (Roadmap item)
Assesses bias in the synthetic data and the fairness of the predictions with respect to sensitive and protected communities
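Two of these dimensions can be sketched with simple, commonly used measures. For fidelity, a two-sample Kolmogorov-Smirnov statistic compares the distribution of one column in the real and synthetic data. For privacy, the distance from each synthetic row to its nearest real row flags possible copies. These are illustrative choices, not necessarily the metrics watsonx.ai reports, and the function names and toy values are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(real_col, synth_col):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted, synth_sorted = sorted(real_col), sorted(synth_col)
    points = sorted(set(real_sorted) | set(synth_sorted))
    return max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in points
    )


def distance_to_closest_record(real_rows, synth_rows):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [
        min(math.dist(synth, real) for real in real_rows)
        for synth in synth_rows
    ]


# Hypothetical numeric rows (age, income in thousands).
real_rows = [(30.0, 50.0), (40.0, 70.0), (50.0, 90.0)]
synth_rows = [(31.0, 52.0), (40.0, 70.0)]

fidelity_gap = ks_statistic([r[0] for r in real_rows], [s[0] for s in synth_rows])
distances = distance_to_closest_record(real_rows, synth_rows)
# A distance of 0.0 means a synthetic row exactly duplicates a real row.
leaked = [d == 0.0 for d in distances]
```

In practice the columns would be scaled before computing distances, so that a wide-ranging column such as income does not dominate the result.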
Case study
AI model training
Compare AI model predictive accuracy when trained on synthetic data vs real data

Challenge & approach
Understand the trustworthiness of using synthetic data, by comparing the predictive accuracy of an AI model trained on production data vs synthetic data:
• Data set: Loan application, demographic & historical credit data; 300K rows, 122 columns
• Create 3 different synthetic versions of the data set
• For each synthetic data set, train a downstream neural network on a prediction task & compare predictive performance

Results
IBM’s generative-AI model performed similarly to real data and outperformed other models, with regard to compute time & predictive accuracy

ROC-AUC of Downstream Neural Network Predictions when Trained on Different Synthetic data sets
• Baseline (Real Data): ROC-AUC 0.756
• IBM: ROC-AUC 0.723, compute time ~5 min
• AIM: ROC-AUC 0.69, compute time ~60 min
• MWEM+PGM: ROC-AUC 0.66, compute time ~60 min
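The comparison in this case study rests on ROC-AUC, which can be computed directly from a model's scores on a held-out set of real records. The sketch below uses the pairwise definition: the fraction of (positive, negative) pairs in which the positive example scores higher. The labels and scores are hypothetical toy values, not figures from the case study.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out labels from real data.
holdout_labels = [0, 0, 1, 1]
# Hypothetical scores from a model trained on real data ...
scores_real_trained = [0.1, 0.2, 0.7, 0.9]
# ... and from the same architecture trained on synthetic data.
scores_synth_trained = [0.1, 0.4, 0.35, 0.8]

auc_real = roc_auc(holdout_labels, scores_real_trained)
auc_synth = roc_auc(holdout_labels, scores_synth_trained)
utility_gap = auc_real - auc_synth
```

Scoring both models on the same real held-out set keeps the comparison fair. A small gap suggests the synthetic data preserved the signal the downstream task needs.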
Opportunities to engage and learn more

Product demo
Demo of existing synthetic capabilities & user experience on watsonx.ai

Private preview of technology
Get access to new models (upcoming gen-AI models), for early validation and testing prior to their release

Early feedback and discovery
Be part of the early discovery process where you’ll provide IBM with input on upcoming features to ensure they can tackle your use cases

Proof-of-concept (PoC)
Collaborate with IBM to validate the solution with a small use case over a 6-week engagement
