04 - Introduction To Synthetic Data
Synthetic data generator
Tom Gaffney
Synthetic Data on watsonx.ai, Product Manager
Anshupriya Srivastava
Advisory, Learning Content Development, Data & AI
Instructor: Farah Auni Hisham
Technical Enablement Specialist | Data & AI
watsonx
The platform for AI and data
watsonx.ai: Train, validate, tune, and deploy AI models
watsonx.data: Scale AI workloads, for all your data, anywhere
watsonx.governance: Responsible, transparent, explainable AI workflows
Today’s focus
watsonx.ai
Train, validate, tune and deploy AI models
5 advantages of leveraging synthetic data:
1. Innovation/GTM speed
2. Minimal risk
3. Reduced costs
4. Scale
60% of all data used for the development of AI and analytics projects will be synthetically generated by 2024
Synthetic data
Unstructured data: data not arranged according to a preset data model or schema, which therefore cannot be stored in a traditional relational database
Structured data: data that has a standardized format, typically tabular, with rows and columns that clearly define data attributes
watsonx.ai large language foundation models: generate synthetic text (images and video not supported)
watsonx.ai Synthetic Data Generator
Common use cases for synthetic tabular data
Client demo data: tailor demos for particular clients/industries before real client data becomes available
Employee training assets: improve the realism of internal training programs
AI model training: edge case data to combine with real data to improve predictive accuracy of AI models
Generating more data or edge case data to combine with real data to improve the predictive accuracy of AI models
Create a 1-for-1 synthetic copy of sensitive data to share internally for insight extraction and strategic analysis
High-fidelity synthetic test data to expedite test cases and validation of software functionality, performance, and reliability
Simulate how synthetic agents' individual decisions impact macro-level metrics, like fraud, sales, or patient diagnoses
Two model types to generate tabular data: Statistical and Generative-AI
Control
Cost efficient
Automated
High-fidelity
User friendly
Custom schema
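To make the statistical model type more concrete, here is a minimal sketch of one very simple statistical approach: learn a distribution for each column of a small table, then sample new rows from those distributions. This is an illustration only and is not a description of how the watsonx.ai Synthetic Data Generator works. The function names (`fit_marginals`, `sample_rows`), the column names, and the toy table are all hypothetical. The sketch also treats columns as independent, so it does not preserve correlations between columns.

```python
import random
import statistics
from collections import Counter


def fit_marginals(rows):
    """Learn a simple per-column model from a list of dict rows.

    Numeric columns are summarized by mean and standard deviation;
    all other columns by the observed frequency of each value.
    """
    model = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            model[col] = ("numeric", statistics.mean(values), statistics.pstdev(values))
        else:
            counts = Counter(values)
            model[col] = ("categorical", list(counts.keys()), list(counts.values()))
    return model


def sample_rows(model, n, seed=0):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col, spec in model.items():
            if spec[0] == "numeric":
                row[col] = rng.gauss(spec[1], spec[2])
            else:
                row[col] = rng.choices(spec[1], weights=spec[2], k=1)[0]
        rows.append(row)
    return rows


# Hypothetical toy "real" table, for illustration only.
real = [
    {"age": 34, "income": 52000.0, "segment": "retail"},
    {"age": 45, "income": 61000.0, "segment": "corporate"},
    {"age": 29, "income": 48000.0, "segment": "retail"},
    {"age": 52, "income": 75000.0, "segment": "corporate"},
    {"age": 41, "income": 58000.0, "segment": "retail"},
    {"age": 38, "income": 55000.0, "segment": "retail"},
]

model = fit_marginals(real)
synthetic = sample_rows(model, 1000)
```

A generator that models columns independently is cheap and easy to control, but a realistic generator would also need to capture relationships between columns (for example, between age and income).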
Statistics-based
when leveraging sensitive data
What are its benefits?
Can't identify data specific to an individual
Protection against 3rd-party attackers
Assessing the quality of the synthetic data
Fidelity: Measures the quality of the …
Privacy: Assesses leakage of real …
Utility: Measures the accuracy and performance of a predictive downstream task where predictive models are trained on the synthetic data
Fairness (roadmap item): Assesses bias in the synthetic data and the fairness of the predictions with respect to sensitive and protected communities
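As a rough illustration of what fidelity and privacy checks can look like in practice, the sketch below computes two simple measures. Both are assumptions chosen for illustration, not the metrics that watsonx.ai reports: a two-sample Kolmogorov-Smirnov statistic as a stand-in for fidelity (how closely a synthetic column's distribution matches the real one, where 0 means identical empirical distributions), and an exact-match rate as a stand-in for privacy (the share of synthetic rows that copy a real row verbatim). The function names and toy rows are hypothetical.

```python
import bisect


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic for one numeric column.

    Returns the largest gap between the two empirical CDFs,
    a value between 0.0 (identical) and 1.0 (no overlap).
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def exact_match_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are exact copies of some real row."""
    real_set = {tuple(sorted(r.items())) for r in real_rows}
    hits = sum(tuple(sorted(s.items())) in real_set for s in synthetic_rows)
    return hits / len(synthetic_rows)


# Hypothetical toy rows, for illustration only.
real_rows = [
    {"age": 34, "segment": "retail"},
    {"age": 45, "segment": "corporate"},
]
synth_rows = [
    {"age": 34, "segment": "retail"},  # copies a real row verbatim
    {"age": 40, "segment": "retail"},
]

fidelity_gap = ks_statistic(
    [r["age"] for r in real_rows], [r["age"] for r in synth_rows]
)
leak_rate = exact_match_rate(real_rows, synth_rows)
```

A low KS gap together with a low exact-match rate is the desirable combination: the synthetic data resembles the real data statistically without reproducing individual records.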
Case study
AI model training
Compare AI model predictive accuracy when trained
on synthetic data vs real data
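The comparison described in this case study can be sketched as follows: train one model on real data and another on synthetic data, then score both on the same held-out real data. Everything in this sketch is an assumption made for illustration: the toy one-feature dataset, the nearest-centroid classifier, and the per-class Gaussian sampler standing in for a synthetic data generator. It does not reproduce the case study's data, model, or results.

```python
import random
import statistics


def make_real(n, rng):
    """Toy 'real' data: one feature; label 1 tends to have larger values."""
    data = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        data.append((rng.gauss(2.0 if label else 0.0, 1.0), label))
    return data


def synthesize(data, n, rng):
    """Stand-in generator: sample from a Gaussian fitted to each class."""
    stats = {}
    for lbl in {y for _, y in data}:
        xs = [x for x, y in data if y == lbl]
        stats[lbl] = (statistics.mean(xs), statistics.pstdev(xs))
    labels = [y for _, y in data]
    out = []
    for _ in range(n):
        lbl = rng.choice(labels)  # keeps the real class balance
        out.append((rng.gauss(*stats[lbl]), lbl))
    return out


def fit_centroids(data):
    """Nearest-centroid classifier: store the mean feature value per class."""
    return {
        lbl: statistics.mean(x for x, y in data if y == lbl)
        for lbl in {y for _, y in data}
    }


def accuracy(centroids, data):
    """Share of rows whose nearest centroid matches the true label."""
    correct = sum(
        min(centroids, key=lambda lbl: abs(x - centroids[lbl])) == y
        for x, y in data
    )
    return correct / len(data)


rng = random.Random(42)
real_train = make_real(500, rng)
real_test = make_real(500, rng)  # held-out real data used for both scores
synth_train = synthesize(real_train, 500, rng)

acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synth_train), real_test)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```

Scoring both models on real held-out data is the important design choice: evaluating the synthetic-trained model on synthetic data would only show that the generator is self-consistent, not that the model is useful on real inputs.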