Front cover

IBM Synthetic Data Sets

Erik Altman
Dipali Aphale
Joy Deng
Yadu Nandan B
Saurabh Srivastava
Kelly Xiang

Artificial Intelligence
IBM Redbooks

February 2025

REDP-5748-00
Note: Before using this information and the product it supports, read the information in “Notices”.

First Edition (February 2025)

This edition applies to IBM Synthetic Data Sets.

© Copyright International Business Machines Corporation 2025. All rights reserved.


Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
Contents

Notices
  Trademarks

Preface
  Authors
  Now you can become a published author, too!
  Comments welcome
  Stay connected to IBM Redbooks

Introducing IBM Synthetic Data Sets
  Synthetic data in the AI model lifecycle

Dataset deep dive
  IBM Synthetic Data Sets for Payment Cards
  IBM Synthetic Data Sets for Core Banking and Money Laundering
  IBM Synthetic Data Sets for Homeowners Insurance

Available editions
  Trial Edition
  Pro Edition
  Enterprise Edition

Previewing data schemas

Using real data versus synthetic data
  Speeding up time to value with privacy-compliant data
  Broader and richer data
  Data privacy, security, and compliance
  Saving costs with synthetic training data

Data generation methodology
  Simulating a realistic world
  Creating regular and varied consumer behavior
  Constructing real assets
  Connecting different parts of a simulated world
  Understanding criminal behavior

Artificial intelligence ethics
  Fairness
  Robustness
  Value alignment
  Data laws
  Intellectual property
  Transparency
  Privacy
  Conclusion

Legal usage terms

Getting started
  Artificial intelligence on IBM Z Solution Templates
  IBM Technology Expert Labs Services
  Starting a proof-of-concept with the AI on IBM Z team

Frequently asked questions

Additional resources

Appendix: Data schemes for IBM Synthetic Data Sets
  Payment cards
  Core banking
  Insurance


Notices

This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS”
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.

The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.

Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.

Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.



Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at https://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
IBM®
IBM Cloud®
IBM Security®
IBM Z®
Passport Advantage®
Redbooks®
Redbooks (logo)®
z/OS®

The following terms are trademarks of other companies:

Other company, product, or service names may be trademarks or service marks of others.



Preface

IBM Synthetic Data Sets is a family of artificially generated, enterprise-grade datasets that
enhance predictive artificial intelligence (AI) model training and large language models
(LLMs) to benefit IBM Z® and IBM LinuxONE clients, ecosystems, and independent software
vendors. These pre-built datasets are downloadable and packaged as comma-separated
values (CSV) and data definition language (DDL) files, which makes them familiar to use and
compatible with everything from databases to spreadsheets to hardware platforms to
standard AI tools. These datasets also leverage the IBM® industry expertise and domain
knowledge of the financial services sector without using any real client seed data, which
alleviates security concerns with Personally Identifiable Information (PII). Real data at client
sites is often limited in scope to only their own organization's transactions, and clients do not
always know which transactions are fraudulent. To address this scenario, IBM Synthetic
Data Sets were modified for fraud detection use cases so that clients can download them and
develop predictive AI models and LLMs for financial services or optimize
existing models for improved accuracy and risk mitigation.
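
Because the datasets ship as CSV and DDL files, loading them into a relational database can be sketched in a few lines. The following is a minimal illustration that uses Python's built-in sqlite3 and csv modules. The table name, column names, and sample rows are hypothetical stand-ins and are not taken from the actual IBM files, whose schemas are described in the appendix.

```python
import csv
import io
import sqlite3

# Hypothetical DDL and CSV content. The real files define their own
# table and column names, which are not reproduced here.
DDL = """
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    card_id        TEXT NOT NULL,
    amount         REAL NOT NULL,
    is_fraud       INTEGER NOT NULL
);
"""

CSV_TEXT = """transaction_id,card_id,amount,is_fraud
1,C-001,25.40,0
2,C-001,980.00,1
3,C-002,12.99,0
"""


def load_dataset(ddl: str, csv_text: str) -> sqlite3.Connection:
    """Create the table from the DDL, then bulk-insert the CSV rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(r["transaction_id"]), r["card_id"], float(r["amount"]), int(r["is_fraud"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_dataset(DDL, CSV_TEXT)
total, fraud = conn.execute(
    "SELECT COUNT(*), SUM(is_fraud) FROM transactions"
).fetchone()
print(f"{total} transactions loaded, {fraud} labeled as fraud")
```

In practice, the DDL and CSV text would be read from the downloaded files rather than from inline strings, and the target could be any database that accepts the supplied DDL.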

The IBM Synthetic Data Sets family contains the following features:
• IBM Synthetic Data Sets for Payment Cards
• IBM Synthetic Data Sets for Core Banking and Money Laundering
• IBM Synthetic Data Sets for Homeowners Insurance

This IBM Redbooks® publication introduces IBM Synthetic Data Sets and provides
information about how IBM Synthetic Data Sets can enhance and optimize your predictive AI
model training and LLMs.

Authors
This publication was produced by a team of specialists from around the world working with
the IBM Redbooks team.

Erik Altman is a Research Scientist at the IBM T.J. Watson Research Center. He has worked
across many technical disciplines, such as computer architecture and artificial intelligence
(AI). He has written dozens of scientific papers, and has dozens of issued patents. His works
include five papers on credit card fraud and money laundering that he presented at leading AI
conferences, such as NeurIPS, AAAI, and ICAIF. He has served for more than 10 years on the
investment committee of the Association for Computing Machinery (ACM), where he acts as a
steward for more than $100 million in assets. He received a bachelor’s degree in Computer
Science and in Economics from MIT. He received his master’s degree and PhD in Electrical
Engineering from McGill University.

Dipali Aphale is a Lead AI Design Researcher who is based in San Francisco, California.
She has 7 years of experience in design and technology. She holds a Bachelor of Industrial
Design degree from NC State College of Design, a Master of Art degree in Design
Entrepreneurship from the Royal College of Art, and a Master of Science degree in Design
Engineering from Imperial College London. Her areas of expertise include design research,
speculative design futures, product and industrial design, brand identity, and marketing.
Before she entered tech, she worked extensively in medical product design and care delivery
systems.



Joy Deng is an Enterprise Product Manager for AI on IBM Z and IBM LinuxONE who is
based in Raleigh, North Carolina. She has 6 years of experience in technical product
management, and she has experience in market research, strategy, and operations finance
across Consumer Packaged Goods (CPG) and retail. She holds a bachelor’s degree in
Marketing and Psychology from Washington University in St. Louis, and a Master of
Business Administration degree from the Fuqua School of Business at Duke University, with
concentrations in Strategy and Tech Management. Her areas of expertise include
customer-centered product design, and launching data and AI offerings.

Saurabh Srivastava is an AI Architect for AI on IBM Z and LinuxONE who is based in
Bangalore, India. He has 17 years of experience in data science, AI, and machine learning
(ML). He holds a master’s degree in Statistics from University of Lucknow, Uttar Pradesh,
India, and a post-graduate degree in AI and Machine Learning from the Great Lakes Institute
of Management, Chennai, Tamil Nadu, India. His areas of expertise are building AI use case
architectures, model optimization, and the integration of AI and ML features into enterprise
systems to design scalable and efficient AI solutions.

Kelly Xiang is a Content Designer for AI on IBM Z who is based in Poughkeepsie, New York.
She has 2 years of experience in content development and technical writing. She holds a
degree in English Literature and International Development from McGill University. Her areas
of expertise include content editing, content strategy, technical documentation, and UI and
UX writing. Before joining the AI on IBM Z organization, Kelly wrote extensively for IBM Data
and AI and on various projects that were related to AI ethics.

Yadu Nandan B is a Back-end Developer in the AI on IBM Z team who is based in Bengaluru,
India. He has 6 months of experience, and has been actively contributing to IBM Synthetic
Data Sets since then. He holds a bachelor’s degree in Information Science and Engineering
from the Global Academy of Technology, Bengaluru. His areas of expertise include
programming in C++ and Python, AI, and machine learning.

Thanks to the following people for their contributions to this project:

Lydia Parziale
IBM Redbooks, Poughkeepsie Center

Shin Kelly Yang
IBM, Senior Product Manager for AI on IBM Z and LinuxONE

Now you can become a published author, too!


Here’s an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an IBM Redbooks residency project and help write a book
in your area of expertise, while honing your experience using leading-edge technologies. Your
efforts will help to increase product acceptance and customer satisfaction, as you expand
your network of technical contacts and relationships. Residencies run from two to six weeks
in length, and you can participate either in person or as a remote resident working from your
home base.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html



Comments welcome
Your comments are important to us!

We want our publications to be as helpful as possible. Send us your comments about this or
other IBM Redbooks publications in one of the following ways:
• Use the online “Contact us” review Redbooks form found at:
ibm.com/redbooks
• Send your comments in an email to:
[email protected]
• Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400

Stay connected to IBM Redbooks
• Find us on LinkedIn:
https://www.linkedin.com/groups/2130806
• Explore new Redbooks publications, residencies, and workshops with the IBM Redbooks
weekly newsletter:
https://www.redbooks.ibm.com/subscribe
• Stay current on recent Redbooks publications with RSS Feeds:
https://www.redbooks.ibm.com/rss.html

Introducing IBM Synthetic Data Sets

The tailored datasets in this publication are intended to support real-time artificial intelligence
(AI) use cases on IBM Z and LinuxONE (for example, fraud detection, anti-money laundering,
and insurance claims) and to generate business insights without violating data privacy and
security. IBM Synthetic Data Sets is designed to keep real data secure from threats: models
are trained with artificial data that contains no real Personally Identifiable Information (PII)
and requires no encryption or redaction.

IBM Synthetic Data Sets trains and enhances predictive models and composite AI methods.
Those models can be deployed to IBM Z and LinuxONE with inferencing tools, such as IBM
Machine Learning for IBM z/OS®, AI Toolkit for IBM Z and IBM LinuxONE, and
IBM Cloud® Pak for Data on IBM Z.

This section provides an overview of the typical stages in the AI model lifecycle, with a
description of each stage and how IBM Synthetic Data Sets can provide value to each of the
stages.



Synthetic data in the AI model lifecycle
IBM Synthetic Data Sets can be used for the following typical stages in the AI model lifecycle.
Stages 2 and 3 can be done repeatedly in succession to systematically improve the quality of
models.
1. Building AI models
When a customer does not have an AI model or access to real data, synthetic data serves
as an accessible and reliable alternative for quickly training models from scratch.
Real data can also be challenging to access and might take up to 6 months to obtain. As a
result, realistic synthetic data is a fast alternative for building AI solutions. With
IBM Synthetic Data Sets, clients can accelerate their AI solutions by using pre-built
datasets.
Value: Quick data access, simple use and integration, faster time to value, and data
compliance and privacy.
2. Enhancing AI models
When there is an existing AI model or LLM, synthetic data serves as extra data that is rich,
labeled, and diverse to fine-tune the model. IBM Synthetic Data Sets combines data from
multiple sources and builds large, artificial populations that are composed of fictitious
people participating in overall population behavior. IBM Synthetic Data Sets also simulates
data for businesses, merchants, and both business-to-business and
business-to-consumer activity. The simulated datasets focus on banking and insurance
companies in particular, and extensive analysis is dedicated to providing realistic data for
these two industries. For example, the datasets identify reasons for money movement,
such as salary payment, personal expenses, or contribution to savings, which help
distinguish between legitimate and illegal activity.
Also, synthetic data can establish ground truth for fraud and money laundering. Ground
truth refers to the accurate, verified data that is used to evaluate the performance of a model.
Specifically, IBM Synthetic Data Sets labels all simulated transactions as fraudulent or not
with 100% accuracy. In comparison, real data often lacks such detailed labeling. This
accuracy aims to provide a solid training foundation for AI models and to increase model
quality and reliability. The simulated datasets also contain more instances of fraud than
real data, and a broader scope of scenarios. This increase in frequency and range helps AI
models detect subtle patterns and anomalies that might be overlooked with real data.
Value: Improved data and model quality, and broader data access.
3. Validating AI models
When there is an existing AI model, synthetic data can evaluate the model's predictive
abilities. With its 100% accuracy on ground truth, IBM Synthetic Data Sets serves as an
answer sheet about whether a transaction is fraudulent or not. As a result, a model's
performance can be evaluated by comparing whether its predictions match the datasets'
conclusions.
Value: The ground truth is known.
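
The validation stage described above can be sketched as a comparison between a model's predictions and the ground-truth labels. The following minimal Python example computes precision and recall. The label encoding (1 for fraud, 0 for legitimate) and the sample values are assumptions for illustration and are not taken from the datasets.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    precision: float
    recall: float


def evaluate(labels: list[int], predictions: list[int]) -> Metrics:
    """Compare model predictions with ground-truth labels (1 = fraud)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # fraud caught
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false alarms
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # fraud missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return Metrics(precision, recall)


# Illustrative values only; not taken from any IBM dataset.
ground_truth = [0, 0, 1, 1, 0, 1, 0, 0]
model_output = [0, 1, 1, 0, 0, 1, 0, 0]
metrics = evaluate(ground_truth, model_output)
print(f"precision={metrics.precision:.2f} recall={metrics.recall:.2f}")
```

Because every synthetic transaction carries a known label, both metrics can be computed without the manual review that real data would typically require.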



Dataset deep dive

As listed in “Introducing IBM Synthetic Data Sets”, the IBM Synthetic Data Sets
family contains the following features:
• IBM Synthetic Data Sets for Payment Cards
• IBM Synthetic Data Sets for Core Banking and Money Laundering
• IBM Synthetic Data Sets for Homeowners Insurance

These datasets are available for purchase and are described in this section.



IBM Synthetic Data Sets for Payment Cards
IBM Synthetic Data Sets for Payment Cards can enable rich artificial intelligence (AI) model
training for various financial processes, such as credit card fraud, debit card fraud, and
targeted marketing. This dataset contains information about simulated credit card holders,
lists of cards that are owned by each holder, and transactions on each card. The simulated
payment methods include debit cards, credit cards, and gift cards, as well as cash
transactions. Each transaction is labeled with 100% accuracy in two ways: whether it is
fraudulent, and the ID of the criminal perpetrating the fraud (fraudster ID). The fraudster ID
might appear across many transactions and many stolen cards. This labeling is not available
in real data and might help improve fraud detection accuracy when training AI models.
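
As a rough illustration of how the fraudster ID label might be used, the following Python sketch groups fraudulent transactions by fraudster to show which cards each one touched. The record layout and the identifiers are hypothetical and do not reflect the actual schema.

```python
from collections import defaultdict

# Hypothetical records: (transaction_id, card_id, is_fraud, fraudster_id).
transactions = [
    (1, "C-001", False, None),
    (2, "C-001", True, "F-17"),
    (3, "C-002", True, "F-17"),
    (4, "C-003", True, "F-42"),
    (5, "C-002", False, None),
]


def cards_by_fraudster(rows):
    """Map each fraudster ID to the set of cards it was seen using."""
    cards = defaultdict(set)
    for _txn_id, card_id, is_fraud, fraudster_id in rows:
        if is_fraud and fraudster_id is not None:
            cards[fraudster_id].add(card_id)
    return dict(cards)


linked = cards_by_fraudster(transactions)
for fraudster_id, card_ids in sorted(linked.items()):
    print(fraudster_id, sorted(card_ids))
```

A grouping like this could feed features such as the number of cards per fraudster, which is information that real transaction data rarely exposes.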

Synthetic data can also be used in honeypot operations that attract and capture security
threats. Specifically, companies can place IBM Synthetic Data Sets where they fear hackers
might penetrate. However, because IBM Synthetic Data Sets only contains simulated data,
the loss from stolen synthetic data is smaller for the company than from stolen real data.
Nevertheless, the experience of the data theft helps the company monitor and improve its
cybersecurity. Companies can combine IBM Synthetic Data Sets with their real data to deter
data theft. Even if hackers obtain access to real data, they must spend considerable time
differentiating real data from synthetic data. This increased effort can reduce the incentive to
steal the data.

IBM Synthetic Data Sets for Payment Cards is best suited for the following business use
cases:
• Credit card fraud
• Debit card fraud
• Targeted marketing such as product recommendations
• Honeypot



IBM Synthetic Data Sets for Core Banking and Money Laundering
IBM Synthetic Data Sets for Core Banking and Money Laundering supports AI model training
for essential banking services. This dataset simulates an entire banking ecosystem with lists
of bank transfers, personal accounts for individuals, and corporate accounts for companies. It
is specialized to find and label illegal banking transactions, such as check fraud, money
laundering, and authorized push payment (APP) fraud.

Because money laundering often goes undetected, having a dataset that is specialized in
identifying transactions for fraud and money laundering is highly valuable. The dataset helps
models determine the type of laundering, for example, fan-in, fan-out, or cycle. As a result,
IBM Synthetic Data Sets for Core Banking and Money Laundering can offer key insights for
creating an anti-money laundering solution.
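
To make the laundering types named above more concrete, the following Python sketch flags fan-in, fan-out, and cycle structures in a small list of transfers. The structural definitions used here (a threshold on distinct counterparties, and an account that can reach itself through transfers) are simplifying assumptions for illustration. They are not the labeling logic of the dataset, and the account names are hypothetical.

```python
from collections import defaultdict

# Hypothetical transfers as (sender_account, receiver_account) pairs.
transfers = [
    ("A", "X"), ("B", "X"), ("C", "X"),   # many senders, one receiver
    ("Y", "D"), ("Y", "E"), ("Y", "F"),   # one sender, many receivers
    ("P", "Q"), ("Q", "R"), ("R", "P"),   # funds return to the origin
]


def fan_patterns(edges, threshold=3):
    """Return accounts with at least `threshold` distinct senders or receivers."""
    senders, receivers = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        senders[dst].add(src)
        receivers[src].add(dst)
    fan_in = sorted(a for a, s in senders.items() if len(s) >= threshold)
    fan_out = sorted(a for a, r in receivers.items() if len(r) >= threshold)
    return fan_in, fan_out


def accounts_on_cycles(edges):
    """Return accounts that can reach themselves by following transfers."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)

    def reaches_itself(start):
        # Depth-first search over outgoing transfers from `start`.
        stack, seen = list(graph[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return False

    return sorted(a for a in graph if reaches_itself(a))


fan_in, fan_out = fan_patterns(transfers)
print("fan-in:", fan_in)
print("fan-out:", fan_out)
print("cycle:", accounts_on_cycles(transfers))
```

Rules like these are brittle on their own, which is one reason labeled examples of each laundering type are useful for training a model instead.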

IBM Synthetic Data Sets for Core Banking and Money Laundering is best suited for the
following business use cases:
• Money laundering detection
• Check fraud
• APP fraud
• Loan default prediction
• Honeypot



IBM Synthetic Data Sets for Homeowners Insurance
IBM Synthetic Data Sets for Homeowners Insurance empowers AI model training for core
activities in the insurance industry, for example, pricing and underwriting, fraud detection on
claims, and general verification processes. This dataset contains information about policy
owners and their insured homes, which includes details on claims, insurance policies, and
natural phenomena that affect claims. Each claim describes the reason for the claim and
any associated natural phenomena, for example, hurricanes, hail, and earthquakes.

Although many insurance companies have rich, real data about policy holders and claims,
IBM Synthetic Data Sets for Homeowners Insurance enhances insights by providing a broad
scope of loss scenarios. These extra and diverse scenarios can help detect fraudulent
claims and flag fraud indicators, which might support more accurate pricing and better risk
assessment. The claims data can provide greater transparency when determining fraud
because it provides the type or types of fraud that are committed on the claim and the
monetary amount of each fraud type.

Therefore, IBM Synthetic Data Sets for Homeowners Insurance is a rich tool for training,
enhancing, and validating AI models that detect fraudulent homeowners insurance claims.
This dataset can expand to support other areas, such as loan underwriting and credit scoring.
For example, knowing that a customer has unpaid, outstanding, or pending claims can
provide further insights into their financial behavior and risk profile.

To expedite communication between insurance companies and their customers, IBM
Synthetic Data Sets for Homeowners Insurance offers free text comments with its claims.
Simulated customers describe issues or raise questions about their claim, and the generator
of this text knows the semantic content and delivers various semantic labels describing the
content. With these semantic labels, insurance companies can enhance their customers'
experience by better tailoring their responses to customers' requests and inquiries. In
contrast, analyzing and labeling real data for such semantic information tends to be
error-prone, time-consuming, and expensive.

A notable application of semantic analysis and labeling is determining whether customers
require an automated or human response to their text inquiries. For example, if a customer
notes in a claim that “I was told an agent would be available two hours ago, but no one has
come. When will they be here?”, it is more helpful to direct them to a human agent than an
automated chatbot. Although automation might be able to handle this scenario, insurance
companies can elevate their customer experience by connecting customers who require live
assistance to the correct destination rather than leaving them in an endless loop with a
chatbot or automated call center.

Conversely, some text inquiries might be answered effectively by automated agents. For
example, policy questions such as “What is the deductible on my policy?” can be answered
without real human assistance. By distinguishing these interactions, insurance companies
can leverage their human agents more efficiently and cost-effectively.
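
The routing decision described above can be sketched as a lookup on the semantic labels attached to each comment. The label names in this Python example are hypothetical, because the actual label vocabulary is not listed here, and defaulting to a human agent for unrecognized labels is a design assumption rather than documented behavior.

```python
# Hypothetical semantic labels; the actual label vocabulary is not given here.
HUMAN_LABELS = {"missed_appointment", "complaint", "urgent_damage"}
AUTOMATED_LABELS = {"policy_question", "deductible_inquiry", "claim_status"}


def route_inquiry(labels: set[str]) -> str:
    """Return 'human' or 'automated' for an inquiry's semantic labels."""
    if labels & HUMAN_LABELS:
        return "human"
    if labels and labels <= AUTOMATED_LABELS:
        return "automated"
    # Unknown or empty labels default to a person, the safer choice.
    return "human"


inquiries = [
    ("An agent was supposed to arrive, but no one has come.",
     {"missed_appointment"}),
    ("What is the deductible on my policy?", {"deductible_inquiry"}),
]
for text, labels in inquiries:
    print(route_inquiry(labels), "<-", text)
```

A trained classifier would predict the labels from the free text. The synthetic comments supply the text and label pairs that such a classifier would need for training.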

IBM Synthetic Data Sets for Homeowners Insurance is best suited for the following business
use cases:
• Fraud detection
• Underwriting and pricing
• Loan underwriting
• Credit scoring



Available editions

IBM Synthetic Data Sets are available in three sizes, or editions: Trial, Pro, and Enterprise. In
the agent-based model generation of IBM Synthetic Data Sets (see “Data generation
methodology”), simulated agents, or people, transact over a period, and those
recorded transactions become the data input for IBM Synthetic Data Sets.

This section describes each edition. Review each edition to determine the most suitable
dataset for your artificial intelligence (AI) solutions.



Trial Edition
The Trial Edition is the smallest dataset and is well suited for trials and proofs of concept.
The transaction generation parameters are 500 simulated people transacting over a period of
3 months. At the end of the trial, clients must delete all copies of the datasets.

Pro Edition
The Pro Edition is a medium-sized dataset that is ideal for independent software vendors and
small customers on a budget who need a large, rich dataset for creating their AI solutions.
This edition is roughly 360x the size of the Trial Edition dataset, and its transaction generation
parameters are 15,000 simulated people transacting over a period of 25 months. It is
available for purchase through an IBM Passport Advantage® account or by contacting
[email protected].

Enterprise Edition
The Enterprise Edition is the largest dataset and is recommended for large IBM Z and
LinuxONE enterprises that need the largest, richest data to create their AI solutions. It is
roughly 1950x the size of the Trial Edition dataset, and its transaction generation parameters
are 150,000 simulated people transacting over a period of 37 months. It is available for
purchase through Passport Advantage® or by contacting [email protected].

Table 1 shows the three IBM Synthetic Data Sets editions.

Table 1 Synthetic Data Sets editions

Edition name            Trial                   Pro                     Enterprise
Size                    Small (1x)              Medium (360x)           Large (1950x)
Transaction generation  500 simulated people    15,000 simulated        150,000 simulated
parameters              transacting over a      people transacting      people transacting
                        period of 3 months      over a period of        over a period of
                                                25 months               37 months
Best suited for         Trials and proofs of    Independent software    IBM Z and LinuxONE
                        concept                 vendors and small       enterprises
                                                customers



Previewing data schemas

A data schema describes what data is included in a dataset. It is the blueprint that defines
how the data is structured, organized, and related to other data attributes. Data schemas for
each IBM Synthetic Data Sets edition can be found in “Appendix: Data schemes for IBM
Synthetic Data Sets” on page 28. The schemas list attributes from top to bottom for visual
fit, but the original datasets arrange the attributes as columns from left to right.

Each schema entry gives the column letter that locates the attribute, the attribute name, an
example value, and comments that explain the attribute and its range of options.
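
The relationship between a schema entry and a CSV column can be sketched in a few lines of Python. This is an illustration only: the header names and the sample row below are hypothetical, and the real attribute names are the ones listed in the appendix.

```python
import csv
import io


def column_letter(index: int) -> str:
    """Convert a zero-based column index to a spreadsheet-style letter (0 -> A, 26 -> AA)."""
    letters = ""
    n = index + 1
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# Hypothetical header and one data row standing in for a real dataset file.
sample_csv = io.StringIO("claim_id,loss_cause,loss_amount\nC0001,flood,12500.00\n")
reader = csv.reader(sample_csv)
header = next(reader)
first_row = next(reader)

# One (column letter, attribute name, example value) entry per column, top to bottom.
schema_preview = [
    (column_letter(i), name, example)
    for i, (name, example) in enumerate(zip(header, first_row))
]
for letter, name, example in schema_preview:
    print(f"{letter}: {name} (example: {example})")
```

Replacing the in-memory sample with an opened dataset file would produce the same kind of top-to-bottom preview for a real edition.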



Using real data versus synthetic data

Real data is important for artificial intelligence (AI) model training. However, synthetic
data can often add value to real data or serve as an alternative when real data is not
available. A common question is, “I have real data, why would I need synthetic data?”
IBM Synthetic Data Sets contains no real Personally Identifiable Information (PII), labels
transactions for fraud or money laundering, and is a less expensive alternative to real
data. As a result, enterprises can jump-start their AI projects with rich,
privacy-compliant, and cost-effective synthetic data.

This section describes the following topics:
• Speeding up time to value with privacy-compliant data
• Broader and richer data
• Data privacy, security, and compliance
• Saving costs with synthetic training data



Speeding up time to value with privacy-compliant data
Accessing and organizing real enterprise-grade data is a long, tedious process. Getting
permissions can take up to 6 months, and then the data must be cleansed. All PII must be
identified, redacted, encrypted, or anonymized before AI model training. These steps might
slow down a data scientist's ability to focus on model-building and providing value to the
business.

With IBM Synthetic Data Sets, data scientists can focus on the model sooner. Each dataset is
pre-built, contains no PII, and includes the key attributes for many IBM Z and LinuxONE AI
use cases so that data scientists can immediately begin training models. The datasets come
in comma-separated value (CSV) and data definition language (DDL) formats to make them
compatible across many systems and software. As a result, data scientists can conveniently
use IBM Synthetic Data Sets to create proofs of concept, which illustrate the value and
potential capabilities of AI for a business. For independent software vendors who do not have
access to their IBM Z and LinuxONE customers’ data, these datasets aim to empower AI
solution creation by supplying realistic artificial transactional data.

Broader and richer data


Real data often faces limitations in scope and range, which can hinder an AI model's
accuracy and reliability. Real data is often limited to the organization that owns it. For
example, a bank or insurance company has data only on what their customers do, which is
further limited by demographics and geography. However, IBM Synthetic Data Sets contains
data from many different banks and insurance companies, which provides a large and rich
view of the overall market and population behavior.

Identifying fraud and money laundering in real data can be challenging. Money laundering is
difficult to detect because criminals use complex techniques to disguise illicit funds as
legitimate financial assets. With IBM Synthetic Data Sets, all transactions are
labeled Yes or No to indicate whether they involve money laundering or other criminal
activities, such as check fraud or authorized push payment (APP) fraud. Due to the synthetic
data generation methodology, all labels are assigned with 100% accuracy. No laundering,
check fraud, or scams are missed, and all transactions that are labeled as fraudulent
are instances of the criminal activity.
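
Because the labels are ground truth, a detector's precision and recall can be measured exactly. The following Python is a minimal sketch under stated assumptions: the label values follow the Yes/No convention described above, while the sample lists and the precision_recall helper are hypothetical and stand in for a real label column and a real model's output.

```python
def precision_recall(true_labels, predicted_labels, positive="Yes"):
    """Compute precision and recall of predictions against ground-truth labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground-truth labels and hypothetical detector output.
is_laundering = ["No", "Yes", "No", "No", "Yes", "No", "Yes", "No"]
detector_flags = ["No", "Yes", "Yes", "No", "No", "No", "Yes", "No"]

precision, recall = precision_recall(is_laundering, detector_flags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With real data, missed or mislabeled cases would blur both numbers. With fully labeled synthetic data, every false positive and false negative is attributable to the model rather than to the labels.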

To illustrate, when a criminal forges or alters a check, or deceives victims into sending money,
these transactions are always identified as check and APP fraud. Subsequently, these
transactions lead to money laundering as the criminals try to conceal or legitimize their illegal
funds. Other types of criminal activity can also result in illicit funds, with the laundering of
those funds labeled. By establishing ground truth in its data, IBM Synthetic Data Sets strives
to provide reliable, high-quality training data that improves models' ability to detect money
laundering and other criminal activity.

To help ensure further transparency about transactions, IBM Synthetic Data Sets also offers
labels specifying the reason for money movement. Some of these labels include salary
payment, credit card payment, and transfers to a retirement account. They are also 100%
accurate and give more context about transactions that is not often available in real data.

As a result, AI models that are built by using IBM Synthetic Data Sets have an advantage over
models that are built on real data alone because the synthetic training data is complete,
correctly labeled, and covers a wide scope of information.

Data privacy, security, and compliance
Even with masking, real data often enables sophisticated AI tools to re-identify sensitive PII
and the person to whom that data belongs. By using no real individual’s information and only
statistical representations at a population level to generate the data, IBM Synthetic Data Sets
aims to remove all risk for potential data breaches and to ensure that real data stays private
and secure. Because there is no real individual’s information, IBM Synthetic Data Sets are
designed to make it simpler to meet data compliance and regulations about using sensitive
information.

Saving costs with synthetic training data


When training models, synthetic data is a cost-saving and cost-efficient alternative to real
data. To create a fraud detection model, the training data requires both fraudulent and
legitimate transactions. With real data, real fraud would need to be committed. There would
also need to be multiple occurrences of both fraudulent and legitimate transactions to ensure
that the training data is an acceptable size and scope. As a result, companies potentially lose
millions of dollars to fraud before properly collecting enough real data to train a fraud
detection model. With IBM Synthetic Data Sets, these data points are artificially generated
and come pre-labeled for fraud and money laundering. As a result, AI business leaders can
train their models for fraud and money laundering detection at a lower financial cost.

Data generation methodology

Datasets are created by simulating a world that is filled with artificial people, alongside tens of
millions of merchants and companies, and observing the transactional behaviors within this
virtual world. The merchants and companies span many countries across the world, but the
simulated population lives in the US.

However, the simulated US population travels and does business across the world and in all
the currencies of the world. As a result, there is business activity in many locations and in
many forms: credit and debit card transactions, bank accounts and transfers, and
investments. Some of this activity is criminal, with the simulated individuals and merchants
committing payment card fraud, insurance fraud, and money laundering.

This section describes the following topics:
• Simulating a realistic world
• Creating regular and varied consumer behavior
• Constructing real assets
• Connecting different parts of a simulated world
• Understanding criminal behavior



Simulating a realistic world
A key goal in this simulated virtual world is to create realistic data. To accomplish this goal,
IBM Synthetic Data Sets leverages a broad set of statistical population data. For example, the
US Census Bureau has a wealth of information down to the postal code level, with a typical
postal code containing 10,000 people. This information includes distributions for income,
age, homeowners versus renters, monthly mortgage or rent payments, housing construction
type, housing age, and other information. The US Federal Reserve supplies related
information on the value and types of financial assets and debts, such as checking and
savings accounts, real estate, and home, vehicle, and student loans. The Federal Reserve
also presents statistics on credit and debit card spending. The US Bureau of Labor Statistics
provides a distribution of approximately 800 job types and the pay ranges for those job
types.

With this information, IBM Synthetic Data Sets builds a population whose attributes mimic the
overall US population in terms of income, age, and geographic distribution. To emphasize, the
simulated people that are created by IBM Synthetic Data Sets are not built from anonymized
real individuals. Instead, the simulated people are built by using the previously mentioned
statistical distributions. Although the aggregate behavior of the simulated people matches the
aggregate behavior of real people, data security, privacy, or compliance risks are alleviated
because no simulated individual person is based on any real individual person.
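
The idea of building people from distributions rather than from real records can be sketched as follows. This is not the IBM generator: the income brackets, weights, homeowner rate, and the make_person function are all hypothetical values chosen for illustration, not real Census or Federal Reserve figures.

```python
import random

# Hypothetical brackets and weights for illustration only; not real Census figures.
INCOME_BRACKETS = [(0, 25_000), (25_000, 50_000), (50_000, 100_000), (100_000, 250_000)]
INCOME_WEIGHTS = [0.20, 0.30, 0.35, 0.15]
HOMEOWNER_RATE = 0.65  # assumed share of homeowners versus renters


def make_person(rng: random.Random, person_id: int) -> dict:
    """Draw one simulated person from population-level distributions."""
    low, high = rng.choices(INCOME_BRACKETS, weights=INCOME_WEIGHTS, k=1)[0]
    return {
        "person_id": person_id,
        "income": round(rng.uniform(low, high), 2),
        "age": rng.randint(18, 90),
        "homeowner": rng.random() < HOMEOWNER_RATE,
    }


rng = random.Random(42)  # fixed seed so that the sketch is reproducible
population = [make_person(rng, i) for i in range(500)]
homeowner_share = sum(p["homeowner"] for p in population) / len(population)
print(f"{len(population)} simulated people, homeowner share {homeowner_share:.2f}")
```

No record in the resulting population corresponds to a real individual, yet aggregate measures such as the homeowner share approach the rates that were fed in.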

Similar to real people, every simulated person is unique. People living in the same
neighborhood with similar income might have different spending habits: frugal versus
expansive, high expenditures on clothes versus high expenditures on travel, and other habits.
This behavior generally follows statistical patterns. For example, individuals with a higher
income can afford to do more activities and have a greater tendency to spend on luxury items
than someone with a lower income. However, some high-income people might spend
modestly, and others spend lavishly. Low- and middle-income people also vary in their overall
spending and in their specific tastes.

Creating regular and varied consumer behavior


When the simulated people and companies are created, they must participate in activities. To
support these activities, IBM Synthetic Data Sets assigns other attributes, such as
occupations or family size. Some of the simulated people live alone, and some are
unemployed or retired. Based on their situation, people move through simulated years,
months, days, and hours, and engage in different consumer behavior. For example, some
people stop for coffee on weekday mornings on the way to work. The coffee purchase yields a
transaction at a merchant in a specific locale. This transaction might be with a credit card, a
debit card, or cash. IBM Synthetic Data Sets sees and tracks all transactions and consumer
activity, which includes cash transactions. In contrast, real data often misses cash
transactions. This universal data collection is one of many advantages over real data because
synthetic data captures a broad, full picture of consumer behavior.

Also, IBM Synthetic Data Sets incorporates patterns and variety in consumer behavior. For
example, real people's weekend consumer behavior likely differs from their weekday
consumer behavior. The simulated people in IBM Synthetic Data Sets mimic this change in
behavior. Simulated people take business trips and vacations at varying frequencies and
spend according to the destination. Simulated people also spend more on gifts around
certain months or holidays. Most simulated people are paid at regular intervals, such as
weekly, biweekly, or semi-monthly. Rent, mortgage, and other loan payments are typically paid
once per month, with a skew toward the end of the month. IBM Synthetic Data Sets models all
these details and many others with precision, which generates a realistic record of consumer
behavior and spending activity.

In summary, IBM Synthetic Data Sets simulates realistic people, companies, and activity.
Consumer activity and behavior follow realistic time intervals with purchases that are made on
appropriate days, times, and locations.
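
A toy version of this time-stepped behavior is sketched below. It is an assumption-laden illustration rather than the actual generator: the simulate_person function, the amounts, the 70% coffee probability, and the choice of the 28th as rent day are all hypothetical.

```python
import datetime as dt
import random


def simulate_person(start: dt.date, days: int, rng: random.Random) -> list:
    """Generate (date, purpose, amount, method) tuples for one simulated person."""
    transactions = []
    for offset in range(days):
        day = start + dt.timedelta(days=offset)
        # Weekday-morning coffee on the way to work, bought on most but not all days.
        if day.weekday() < 5 and rng.random() < 0.7:
            method = rng.choice(["credit card", "debit card", "cash"])
            transactions.append((day, "coffee", 4.50, method))
        # Biweekly salary: every 14th day, counted from the start date.
        if offset % 14 == 0:
            transactions.append((day, "salary", 2000.00, "bank transfer"))
        # Rent is paid once per month; the 28th stands in for the end-of-month skew.
        if day.day == 28:
            transactions.append((day, "rent", 1500.00, "bank transfer"))
    return transactions


rng = random.Random(7)
history = simulate_person(dt.date(2025, 1, 6), days=56, rng=rng)  # 6 January 2025 is a Monday
salary_days = [t[0] for t in history if t[1] == "salary"]
rent_days = [t[0] for t in history if t[1] == "rent"]
coffee_days = [t[0] for t in history if t[1] == "coffee"]
print(len(history), "transactions recorded, cash included")
```

Every transaction is recorded regardless of payment method, which mirrors the point that the simulation also observes cash purchases.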

Constructing real assets


In addition to financial transactions, IBM Synthetic Data Sets carefully models homes and
other real assets. Based on census distributions, IBM Synthetic Data Sets assigns a certain
home size, style of construction, and type of roofing to each simulated person. Different
insurance risks are also assigned to each person and home, such as hurricanes,
earthquakes, and volcanoes. These risks are based on appropriate geographical and time
constraints. For example, hurricanes are more likely to hit the US state of Florida than North
Dakota, and earthquakes are more likely to occur in California than in Iowa. IBM Synthetic
Data Sets models the occurrence of these natural disasters for its simulated population
because when a disaster occurs, home damage likely arises and leads to insurance claims.
For each claim, there is a rich set of information about exact losses, such as the home itself or
loss of furniture or jewelry, and the cause of the loss, such as hurricane, fire, or theft. The
claim also details exact dollar amounts in each item category and in aggregate. To enhance
compatibility with databases and spreadsheets, IBM Synthetic Data Sets structures this
information in tabular form and packages it as comma-separated value (CSV) files.

IBM Synthetic Data Sets also attaches free text descriptions to each claim. This text content
is generated based on exact knowledge of the underlying claim, which makes it consistent
with the tabular data. For example, the tabular data might note specific items that are
damaged in a flood and the loss amount for those items. The text might provide a brief
description of the claim, such as “Last week my home was damaged in a flood and there is a
great deal of damage to my furniture and carpets. Can you please get me reimbursed quickly
for these items?”
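
One way to picture how text stays consistent with the tabular data is to derive both the comment and its semantic labels from the same claim record. The sketch below is hypothetical throughout: the field names, the describe_claim function, and the fixed sentence template are illustrative assumptions, and the actual text generation is likely far more varied.

```python
# Hypothetical claim record; real column names are defined in the data schemas.
claim = {
    "claim_id": "C0001",
    "cause": "flood",
    "losses": {"furniture": 4200.00, "carpets": 1800.00},
}


def total_loss(record: dict) -> float:
    """Aggregate loss across all item categories."""
    return sum(record["losses"].values())


def describe_claim(record: dict):
    """Return free text plus semantic labels, both derived from the same record."""
    items = sorted(record["losses"])
    text = (
        f"My home was damaged in a {record['cause']} and there is damage to my "
        f"{' and '.join(items)}. Can you please reimburse me for these items?"
    )
    labels = {
        "cause": record["cause"],
        "items": items,
        "requests_reimbursement": True,
    }
    return text, labels


text, labels = describe_claim(claim)
print(text)
print(labels, total_loss(claim))
```

Because the text and the labels share one source, a label can never contradict the comment it describes, which is the property that makes such labels reliable training targets.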

Connecting different parts of a simulated world


Interdependence is another important aspect of how IBM Synthetic Data Sets constructs its
virtual world and population. IBM Synthetic Data Sets contains a mix of over 300 large,
multi-national real companies and tens of millions of small, fictitious companies. Companies
can serve as both merchants that provide goods to consumers and as employers that provide
salaries to simulated people. Companies can be buyers to some businesses and suppliers to
others. Simulated people also contribute through consumption and investment, with their
purchases increasing revenue and stock for companies. Revenue for large companies is
based on the company's Form 10-K filings, and these large companies add a further element
of realism to the dataset.

Understanding criminal behavior
Criminal activity is an important part of IBM Synthetic Data Sets. Having data around fraud
and money laundering is imperative when training artificial intelligence (AI) models to
recognize similar activity. This criminal activity includes check fraud, insurance fraud,
payment card fraud, authorized push payment (APP) scams, and money laundering. The
criminal activity extends to a broader set of pursuits that yield illicit income, such as
extortion, smuggling, and illegal gambling. Like other aspects of the simulated world,
IBM Synthetic Data Sets treats each criminal entity as unique, with its own amounts
and types of unlawful activity. Nevertheless, only a few companies and people in
IBM Synthetic Data Sets engage in criminal activity: about 1 in 1000 or fewer.

Furthermore, with its knowledge of ground truth and universal data collection, IBM Synthetic
Data Sets offers a key advantage over real data when training models to recognize criminal
activity. The dataset knows who is engaged in criminal activity, when they do it, and the
financial amounts that are involved. As a result, all illegal activities are identified and labeled
with 100% accuracy in the dataset, which includes all scams, credit card fraud, check fraud,
insurance fraud, and money laundering. With real data, this scale of illegal activity is
challenging to detect. Therefore, AI models that are trained with IBM Synthetic Data Sets
have a clear, accurate understanding of criminal behavior.

Artificial intelligence ethics

In today’s rapidly evolving technological landscape, artificial intelligence (AI) systems are
becoming integral to decision-making processes across many industries. Although AI has
tremendous potential to transform business operations and improve efficiency, its
implementation also raises ethical and security concerns. Trust in AI systems can be
established only through a foundation of ethical principles, secure design, and transparent
practices.

IBM’s approach to Security and Trust by Design integrates ethical safeguards from the outset
by focusing on seven key pillars: Fairness, Robustness, Value Alignment, Data Laws,
Intellectual Property (IP), Transparency, and Privacy.

The following sections delve into each of these areas by exploring IBM’s methods for
mitigating risks and fostering trustworthy AI. For more information about each pillar, see
Foundation models: Opportunities, risks, and mitigations.



Fairness
Ensuring fairness in AI is a foundational ethical consideration. AI systems, particularly ones
that are trained on large datasets, are at risk of inheriting biases that are present in the data
itself. These biases might be historical, societal, or representational, and if they are left
unaddressed, they can lead to outcomes that unfairly impact certain groups. For example,
training a model with biased data can result in outputs that unintentionally favor or
discriminate against certain groups. In industries like finance, insurance, or healthcare, the
implications of such biases can be especially harmful.

Therefore, IBM uses the AI Fairness 360 Toolkit, which is a comprehensive suite of tools to
detect and mitigate biases in IBM Synthetic Data Sets. This toolkit can identify areas where
biases might influence outcomes and implement corrections to help ensure that all users are
treated equitably. In specific applications like fraud detection, factors such as race are
intentionally excluded to prevent unintended discriminatory outcomes. By continuously
validating IBM Synthetic Data Sets through fairness testing, IBM upholds a commitment to
equity and helps ensure that AI systems contribute positively and fairly to society.
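
To make the idea of a bias check concrete, the following Python computes a disparate impact ratio, which is one common fairness metric. This sketch deliberately does not use the AI Fairness 360 API: the helper functions, the group data, and the 0.8 threshold (a widely used rule of thumb) are assumptions for illustration and do not describe IBM's actual validation procedure.

```python
def favorable_rate(outcomes):
    """Share of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    return sum(outcomes) / len(outcomes)


def disparate_impact(unprivileged, privileged):
    """Ratio of favorable-outcome rates; 1.0 means both groups fare equally."""
    return favorable_rate(unprivileged) / favorable_rate(privileged)


# Hypothetical model decisions for two groups of simulated people.
group_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 7 of 10 favorable
group_b = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 6 of 10 favorable

ratio = disparate_impact(group_b, group_a)
passes_rule_of_thumb = ratio >= 0.8
print(f"disparate impact ratio: {ratio:.3f}, passes: {passes_rule_of_thumb}")
```

A ratio well below 1.0 would prompt a closer look at the features that drive the difference between the groups.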

Robustness
Robustness in AI systems is essential to help ensure that datasets and AI models remain
resilient in the face of adversarial attacks. One significant threat to AI robustness is data
poisoning, where a malicious actor intentionally introduces corrupted or misleading data into
a training or validation set. Such tampered data can distort model behavior, which can
potentially lead the AI to produce outputs that favor the adversary’s objectives. This situation
poses serious risks because poisoned models might produce harmful or inaccurate
decisions, with implications for organizational reputation and operational stability.

IBM addresses robustness concerns through a Security and Privacy by Design (SPbD) threat
assessment process, which actively monitors and verifies IBM Synthetic Data Sets to prevent
tampering throughout the product supply chain from creation to delivery. The SPbD review
process is an official process that development teams must use to receive approval for their
datasets from the IBM Business Information Security Officer. SPbD involves systematic
checks to help ensure the integrity of the IBM Synthetic Data Sets data that is used to train,
enhance, or validate AI models. These proactive measures enable IBM to maintain high
standards of security and resilience, which makes it more challenging for adversaries to
manipulate AI outputs. By prioritizing robust design and adopting stringent security protocols,
IBM reinforces the trustworthiness of its AI solutions.
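
On the consumer side, one simple integrity check that complements such supply-chain controls is to compare a file's cryptographic digest with a digest recorded when the file was received. The Python below is a generic sketch and not part of the SPbD process: the file name, its contents, and the idea of a recorded digest are assumptions made for the demonstration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a temporary stand-in file instead of a real dataset.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "sample.csv")
    with open(path, "w", newline="") as handle:
        handle.write("claim_id,loss_amount\nC0001,12500.00\n")
    recorded_digest = sha256_of_file(path)  # digest noted when the file was received

    unchanged = sha256_of_file(path) == recorded_digest

    # Simulate tampering by appending a poisoned row.
    with open(path, "a", newline="") as handle:
        handle.write("C9999,1.00\n")
    tampered_detected = sha256_of_file(path) != recorded_digest

print(unchanged, tampered_detected)
```

Any change to the file, including a single injected row, changes the digest, so the comparison flags tampering before the data reaches model training.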

Value alignment
For AI systems to be effective and ethical, they must align with the values and objectives of
the organizations that deploy them. Achieving value alignment requires careful data curation
during the training and tuning phases because improper data generation, collection, and
annotation can lead to models that deviate from ground truth. If AI training data does not
accurately reflect an organization’s ethical standards, the subsequent outputs might not align
with wanted outcomes and lead to unintentional ethical or operational consequences.

IBM helps ensure value alignment by following a robust process that vets and governs data
that is used for AI. This process is set by the IBM Office of Privacy and Responsible
Technology. The process oversees data curation and verifies that only approved datasets are
used for training. Also, the process helps secure third-party data and content by helping
ensure that each data set adheres to organizational standards. With these practices, IBM
builds AI systems that are technologically advanced and deeply aligned with organizational
values, which enhances the trustworthiness and social responsibility of its AI solutions.

Data laws
Compliance with data usage laws is a critical aspect of ethical AI implementation. Different
regions have different regulations on the usage of data, with some laws strictly prohibiting the
usage of specific data types for AI applications. Non-compliance with these regulations can
result in financial penalties, legal repercussions, and damage to an organization’s reputation.
As governments worldwide enact stringent data protection laws, AI developers must help
ensure that their systems adhere to all relevant regulations to avoid these consequences.

To navigate the complexities of data compliance, IBM integrates data governance into its AI
development processes. By registering AI use cases through the Integrated Governance
Registration process, IBM ensures that IBM Synthetic Data Sets comply with applicable laws.
Legal consultation is a standard part of this process, which helps IBM to address compliance
proactively. This approach reinforces IBM’s commitment to ethical data collection and usage,
and strengthens the legal and ethical standing of its AI systems.

Intellectual property
IP rights are a significant consideration when developing AI systems because training models
on proprietary datasets might raise copyright, licensing, and compliance issues. Navigating
these IP challenges is essential to help ensure that AI systems are built within legal
boundaries and do not infringe on the rights of data owners. Moreover, each country has its
own regulatory framework, which adds to the complexity of IP compliance for AI development.

IBM approaches IP issues by coordinating closely with legal teams through regular meetings.
These meetings with IBM Z Brand legal experts help clarify the terms and conditions of
IBM Synthetic Data Sets usage, which helps ensure that service descriptions meet all
relevant legal requirements. By maintaining strict compliance with IP laws, IBM minimizes
risks that are related to data misuse, supports ethical AI practices, and fosters innovation
within legally permissible frameworks.

Transparency
Transparency is key to fostering trust in AI systems. Documenting how data is collected,
processed, and used in model training enables stakeholders to understand and evaluate the
ethical considerations that are involved. Lack of transparency can undermine confidence in AI
systems because users might question the source, quality, or handling of data that informs
AI-driven decisions. Clear, accessible explanations about data processes promote
accountability and facilitate a deeper understanding of AI mechanisms.

IBM addresses transparency concerns by publishing detailed papers on synthetic data
generation methods. For more information, see Synthesizing credit card transactions and
Realistic Synthetic Financial Transactions for Anti-Money Laundering Models.

Also, IBM provides a data schema that lists, for each data set column, the attribute name, an
example of the data, and the options and ranges for that attribute. This level of
transparency demonstrates IBM’s commitment to ethical AI and empowers users to assess
data practices, which enhances trust in IBM’s AI systems.

Privacy
Protecting privacy is a fundamental ethical obligation in AI. With growing concerns about data
re-identification, even datasets that exclude Personally Identifiable Information (PII) pose
privacy risks if patterns can be used to infer individuals’ identities. Privacy breaches
compromise user trust, and can lead to significant legal and reputational damages for
organizations.

To address privacy concerns, IBM Synthetic Data Sets do not contain real PII but instead use
statistical representations of populations. By generating synthetic data that simulates
real-world patterns without identifying individuals, IBM minimizes privacy risks and helps
ensure compliance with privacy regulations. This approach allows IBM to build powerful AI
models without compromising user privacy, which reinforces IBM’s commitment to ethical and
responsible AI.

Conclusion
The IBM Security® and Trust by Design framework is a comprehensive approach to generating
synthetic datasets with ethical practices that support trusted AI development. By focusing on
fairness, robustness, value alignment, compliance with data laws, IP rights, transparency, and
privacy, IBM addresses the complex ethical and security challenges that accompany AI
advancement. These pillars form the foundation of IBM’s commitment to responsible AI,
which helps ensure that AI systems are innovative, fair, secure, and aligned with societal
values. Through these practices, IBM fosters trust in AI, which paves the way for ethical and
secure AI deployment across industries.

Legal usage terms

For the full legal terms for IBM Synthetic Data Sets, which include how to use and redistribute
the datasets, see IBM Terms.



Getting started

This section describes a few different ways to get started with IBM Synthetic Data Sets:
• Artificial intelligence on IBM Z Solution Templates
• IBM Technology Expert Labs Services
• Starting a proof-of-concept with the AI on IBM Z team



Artificial intelligence on IBM Z Solution Templates
AI Solution Templates is a suite of pre-built blueprints that guide you through the full artificial
intelligence (AI) lifecycle on IBM Z with various enterprise use cases while leveraging various
technologies at no charge. Whether you are a senior data scientist or have no previous AI
skills, you can build your own AI model, deploy it on IBM Z, and integrate it into a business
application.

For more information, see AI Solution Templates on GitHub.

IBM Technology Expert Labs Services


IBM Expert Labs is a professional services organization that is powered by an experienced
team of product experts. This knowledgeable team brings deep technical expertise across
software and infrastructure areas. IBM Expert Labs uses proven methodologies, best
practices, and patterns to help IBM Business Partners develop complex solutions and achieve
better business outcomes.

There are three paid services offerings through IBM Technology Expert Labs for using IBM
Synthetic Data Sets for model training and deployment:
• AI Exploration and Model Training: Integrate and blend data from IBM Synthetic Data Sets
and real data, including from IBM Z and LinuxONE. Transform the data and use it for
training machine learning and deep learning models.
• Implement Machine Learning for z/OS: Install and configure Machine Learning for z/OS for
model deployment on IBM Z.
• Model Deployment to IBM Z and LinuxONE: Deploy the model to IBM Z and LinuxONE for
accelerated inferencing with Machine Learning for z/OS or AI Toolkit for IBM Z and
LinuxONE.

For more information, contact [email protected] or your local IBM Technology
Expert Labs team.

Starting a proof-of-concept with the AI on IBM Z team


Interested in a discovery workshop to identify a use case for AI on IBM Z with synthetic
datasets? Want to get started on a proof-of-concept?

If so, engage with the team by reaching out to [email protected].

Frequently asked questions

Here is a list of frequently asked questions (FAQ) about IBM Synthetic Data Sets:
• What are the benefits of IBM Synthetic Data Sets?
For examples about how to leverage IBM Synthetic Data Sets for AI models and large
language models (LLMs), see “Introducing IBM Synthetic Data Sets” on page 1.
• How large are the datasets?
Each dataset comes in three editions or sizes: Trial, Pro, and Enterprise. For more
information, see “Available editions” on page 7.
• What is included in the datasets?
Information about column titles and data attributes, including examples and options, is
described in “Previewing data schemas” on page 9 and “Appendix: Data schemes for IBM
Synthetic Data Sets” on page 28.
• What is the methodology for creating the datasets?
In short, the datasets are created by using the agent-based modeling method. For more
information, see “Data generation methodology” on page 13 and the academic papers that
are referenced in “Artificial intelligence ethics” on page 17.
• What environments or platforms can I use the datasets on?
These datasets are downloadable comma-separated value (CSV) files that are
compatible with the training platform of your choice. The intention is that IBM Synthetic
Data Sets can be used by IBM Z and LinuxONE customers and ISVs to build models on
any platform and deploy those models back to IBM Z and LinuxONE, where the core
enterprise data resides, for accelerated inferencing.
• How realistic are the datasets?
IBM Synthetic Data Sets is realistic because it was created with real statistical
population data from various sources, which include the US Census, Federal Reserve,
Bureau of Labor Statistics, and FBI Crimes Insights, among other sources. Also, a large
US national card provider compared the distribution of the datasets against their real
transactions data and found that it matched well.

Figure 1 displays the distribution of synthetic data compared to real data for payment card
transactions. This data was sourced from a large US national card provider. For more
information, see Synthesizing credit card transactions.

Figure 1 Synthetic Data Realism

򐂰 Will I need to transform the data?
You might need to transform the datasets for model training or to better match your
company’s real data. Typical data transformation processes are permitted. For the data
usage terms, read the Service description, which can be found in “Legal usage terms” on
page 21.
If you need help with transforming data, combining data sources, and training models, you
can use an IBM Technology Expert Labs offering to do these tasks. To learn more about
the Expert Labs offering, see “Getting started” on page 22.
򐂰 How is IBM Synthetic Data Sets different from a synthetic data generator?
Synthetic data generators are great tools when you have access to your real data, and
many of them can redact Personally Identifiable Information (PII). However, many
generator tools do not produce the quality of data and logic from real data that you get
from IBM Synthetic Data Sets. For example, a synthetic data generator can generate
16-digit credit card numbers but might not maintain the logic of what those numbers mean.
In IBM Synthetic Data Sets, a Mastercard number starts with a 2 or a 5 and is aligned
correctly with the card company column, which lists Mastercard.
Another frequent issue with synthetic data generators is that city, state, country, and postal
codes do not match in the generator outputs. For example, the city of New Orleans shows
up in Italy, or Los Angeles is assigned a postal code of 2215 when only 90001 to 90042 are
available. This mismatch occurs because most synthetic generators generate new data
based on statistical representations from each column attribute. However, the generators
do not tie into the underlying logic to produce the quality of data that is needed.

To get the same quality of synthetic data as IBM Synthetic Data Sets, an organization
would need time and money for a data scientist and a subject matter expert to spend years
finding the right source data and potentially writing extra code to maintain the data logic.
However, clients can promptly begin modeling and LLM training with IBM Synthetic Data
Sets.
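The kind of cross-column logic described in this answer can be checked with a few lines of code. The following Python sketch is illustrative only: the function names are hypothetical, and the only rules encoded are the two quoted in the text (a Mastercard number starts with 2 or 5, and the Los Angeles postal code range 90001 to 90042).

```python
def prefix_matches_company(card_number: str, company: str) -> bool:
    """Check the leading digit of a card number against the card company column."""
    rules = {"Mastercard": ("2", "5")}  # rule taken from the text
    allowed = rules.get(company)
    if allowed is None:
        return True  # this sketch has no rule for other companies
    return card_number.startswith(allowed)

def postal_code_plausible(city: str, postal_code: str) -> bool:
    """Check a postal code against the range quoted in the text for a city."""
    ranges = {"Los Angeles": (90001, 90042)}  # range taken from the text
    bounds = ranges.get(city)
    if bounds is None or not postal_code.isdigit():
        return bounds is None
    low, high = bounds
    return low <= int(postal_code) <= high
```

A generator that samples each column independently can fail checks like these, whereas data that is produced with the underlying logic in place passes them by construction.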
򐂰 IBM Synthetic Data Sets offers only US-based data. How does it help me if I am not in the
US?
IBM Synthetic Data Sets is most directly useful for the US. However, it can provide
significant benefits worldwide:
– The core of many AI models is pattern detection and deviations from those patterns.
For example, AI models look for deviations from common or typical behavior to detect
fraud and money laundering. Then, the model flags these deviations as potential fraud
or money laundering. This approach is geographically independent. If a model can find
patterns in US-based data, the model is typically capable of doing so anywhere.
– The patterns are geographically independent. For example, it is always unusual to
have multiple purchases in an hour at brick-and-mortar merchants when the merchants
are separated by hundreds of kilometers. It is always unusual for someone who spends
frugally to suddenly spend large amounts on expensive luxury items. Certain patterns
of transfers between bank accounts are common, such as moving money from
checking to savings. Other patterns might be less common, such as suddenly moving
small amounts of money to a large set of other accounts. As a result, although
IBM Synthetic Data Sets is US-based, the logic behind pattern detection and deviation
can be applied universally.
Patterns might be more subtle than these examples. Use broad, well-labeled data to
create and train AI models to detect such subtleties.
– The data generation that is used for IBM Synthetic Data Sets simulates international
companies and business transactions worldwide. The simulated people and
companies travel and conduct transactions in 223 countries around the simulated
world, and use international currencies and banks to facilitate their activities.
Therefore, although the datasets’ transactions are centered in the US, they cover the world.
IBM Synthetic Data Sets has many attributes that are not available in real data.
IBM Synthetic Data Sets has fully accurate labeling for a broad set of categories.
IBM Synthetic Data Sets also provides data for all banks and insurance companies in the
ecosystem, which includes cash transactions that are frequently overlooked by real data.
Clients can combine IBM Synthetic Data Sets with local data to develop enhanced, robust
capabilities that are beyond what IBM Synthetic Data Sets or local data alone can
independently offer. IBM Synthetic Data Sets can also fine-tune models that are created
from local data.
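One of the geographically independent patterns mentioned in this answer, multiple in-person purchases within an hour at merchants that are hundreds of kilometers apart, can be expressed as a simple rule. The following Python sketch is a hedged illustration: the record layout (`time`, `lat`, `lon`) and the thresholds of 300 km and 1 hour are assumptions chosen for the example, not values from IBM Synthetic Data Sets.

```python
import math
from datetime import datetime, timedelta

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def is_suspicious(tx_a, tx_b, max_km=300.0, window=timedelta(hours=1)):
    """Flag two in-person purchases that are close in time but far apart."""
    close_in_time = abs(tx_a["time"] - tx_b["time"]) <= window
    distance = haversine_km(tx_a["lat"], tx_a["lon"], tx_b["lat"], tx_b["lon"])
    return close_in_time and distance > max_km

# Hypothetical transactions: Milwaukee, then Los Angeles 30 minutes later.
milwaukee = {"time": datetime(2024, 1, 15, 12, 0), "lat": 43.04, "lon": -87.91}
los_angeles = {"time": datetime(2024, 1, 15, 12, 30), "lat": 34.05, "lon": -118.24}
flagged = is_suspicious(milwaukee, los_angeles)
```

The rule does not depend on the country in which the merchants are located, which is why a model that learns it from US-based data can apply it elsewhere. A trained model would learn such thresholds from labeled data rather than use fixed values.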
򐂰 If I have feedback on how to improve the datasets, how do I provide that feedback?
We appreciate your feedback and aim to include relevant suggestions in future updates to
the datasets. Updates are available with the purchase of a subscription service.
To submit new ideas, see ideas.ibm.com.

Additional resources

For more information about synthetic datasets, see the following resources:
򐂰 2021 International Conference on AI in Finance (ICAIF): Synthesizing credit card
transactions
򐂰 2024 ICAIF:
– FraudGT: A Simple, Effective, and Efficient Graph Transformer for Financial Fraud
Detection
– Graph Feature Preprocessor: Real-time Subgraph-based Feature Extraction for
Financial Crime Detection
򐂰 2023 Neural Information Processing Systems (NeurIPS) paper: Realistic Synthetic
Financial Transactions for Anti-Money Laundering Models
򐂰 2024 Association for the Advancement of Artificial Intelligence (AAAI) paper: Provably
Powerful Graph Neural Networks for Directed Multigraphs

Appendix: Data schemes for IBM Synthetic Data Sets

This section describes the data schemas for each of the IBM Synthetic Data Sets:
򐂰 Payment cards
򐂰 Core banking
򐂰 Insurance

Payment cards
Here are the data schemas for payment cards:
򐂰 Payment cards
򐂰 Payment cards users
򐂰 Payment cards transactions

Table 1 is the data schema for payment cards.

Table 1 Data schema for payment cards

Table 2 is the data schema for payment cards users.

Table 2 Data schema for payment card users

Table 3 is the data schema for payment transactions.

Table 3 Data schema for payment transactions

Core banking
Here are the data schemas for core banking:
򐂰 Banks
򐂰 Liquid accounts people
򐂰 Liquid accounts companies
򐂰 Bank transfers
򐂰 Business-to-business (B2B)

Table 4 is the data schema for banks.

Table 4 Data schema for banks

Table 5 is the data schema for liquid accounts people.

Table 5 Data schema for liquid accounts people

Table 6 is the data schema for liquid accounts companies.

Table 6 Data schema for liquid accounts companies

Table 7 is the data schema for bank transfers.

Table 7 Data schema for bank transfers

Table 8 is the data schema for business-to-business (B2B).

Table 8 Data schema for B2B

Insurance
Here are the data schemas for insurance:
򐂰 Insurance Application
򐂰 Insurance Policy
򐂰 Insurance Claims
򐂰 Insurance Freetext
򐂰 Storms
򐂰 Quakes
򐂰 Volcanoes

Table 9 is the data schema for insurance applications.

Table 9 Data schema for insurance applications


Column | Field Name | Sample Value | Comment | In Kaggle
A | Index to Insurance Policy CSV | 235 | Claims use this value to refer to the policy / applicant. The value is the same in "policy.csv", that is, rows of "policy.csv" logically continue the corresponding "applic.csv" row. | No
B | Index - Applicant 1 | 17 | Index to applicant in the users.csv file. Information in the two files and others may be cross-linked. If the value is not in the users.csv file, it indicates that the applicant is not a "primary" person in the simulation, but instead, for example, a spouse of a primary person. The primary can be male or female, as can the spouse. | No
C | Name - Applicant 1 | Cayson Hayes | | No
D | Date of Birth - Applicant 1 | 04/01/1987 | Month / Day / Year format. | No
E | Social Security Number - Applicant 1 | 786-38-7809 | | No
F | Drivers License Number - Applicant 1 | Z979-3439-5902-75 | | No
G | Drivers License State - Applicant 1 | Wisconsin | At the time of writing, only the 50 US states are supported. | No

H | Drivers License Country - Applicant 1 | United States | At the time of writing, only the United States is supported, but this field facilitates adding other countries in the future. | No
I | Marital Status - Applicant 1 | Married | Six values are supported: Married, Separated, Always Single, Divorced, Widowed, and Cohabiting. | No
J | Education Level - Applicant 1 | Associates | Nine values are supported: None, High School, Associates, Some College, Bachelors, Masters, PhD, MD, and JD. | No
K | Personal Phone Number - Applicant 1 | 414-633-6424 | | No
L | Email Address - Applicant 1 | Hayes.7934@googlemail.com | Like other aspects of IBM Synthetic Data Sets, the email addresses are fake, but realistic. | No
M | Street Address - To Be Insured | 945 Federal Street | | No
N | Unit Number - To Be Insured | | | No
O | City - To Be Insured | Milwaukee | | No
P | State - To Be Insured | WI | | No
Q | Postal Code - To Be Insured | 53214 | | No
R | Country - To Be Insured | United States | | No
S | Months at this Address | 61 | | No

T | Previous Address - If less than 36 months at Insured Address | | When no previous address is needed, for example, there is more than 36 months at the current address, no values are provided for previous address. In the comma-separated value (CSV) file, these fields will be consecutive commas, which indicate empty fields. | No
U | Previous Unit Number - Applicant 1 | | | No
V | Previous City - Applicant 1 | | | No
W | Previous State - Applicant 1 | | | No
X | Previous Postal Code - Applicant 1 | | | No
Y | Previous Country - Applicant 1 | | | No
Z | Current Employer - Applicant 1 | Hilton | | No
AA | Street Address - Employer of Applicant 1 | 36532 Eighth Drive | | No
AB | Unit Number - Employer of Applicant 1 | | | No
AC | City - Employer of Applicant 1 | Milwaukee | | No
AD | State - Employer of Applicant 1 | WI | | No
AE | Postal Code - Employer of Applicant 1 | 53214 | | No
AF | Country - Employer of Applicant 1 | United States | | No
AG | Type of Employer - Applicant 1 | Hotels | | No
AH | Position at Employer - Applicant 1 | Interviewer | | No

AI | Are Self-Employed - Applicant 1? | No | | No
AJ | Years on Job - Applicant 1 | 3 | | No
AK | Years in this Profession - Applicant 1 | 18 | | No
AL | Business Phone - Applicant 1 | 414-280-7042 | | No
AM | Index - Applicant 2 | 17 | Index to applicant in the users.csv file. Information in the two files and others may be cross-linked. If the value is not in the users.csv file, it indicates that the applicant is not a "primary" person in the simulation, but instead, for example, a spouse of a primary person. The primary can be male or female, as can the spouse. | No
AN | Name - Applicant 2 | Zoey Hayes | | No
AO | Date of Birth - Applicant 2 | 01/07/1976 | | No
AP | Social Security Number - Applicant 2 | 392-56-6826 | | No
AQ | Drivers License Number - Applicant 2 | G407-2062-5784-12 | | No
AR | Drivers License State - Applicant 2 | Wisconsin | | No
AS | Drivers License Country - Applicant 2 | United States | | No
AT | Marital Status - Applicant 2 | Married | | No
AU | Education Level - Applicant 2 | Bachelor’s | | No
AV | Personal Phone Number - Applicant 2 | 414-312-2984 | | No
AW | Email Address - Applicant 2 | [email protected] | | No
AX | Current Employer - Applicant 2 | Katelyn's Bank | | No

AY | Street Address - Employer of Applicant 2 | 358 Madison Boulevard | | No
AZ | Unit Number - Employer of Applicant 2 | | | No
BA | City - Employer of Applicant 2 | Milwaukee | | No
BB | State - Employer of Applicant 2 | WI | | No
BC | Postal Code - Employer of Applicant 2 | 53215 | | No
BD | Country - Employer of Applicant 2 | United States | | No
BE | Type of Employer - Applicant 2 | Financial Institution | | No
BF | Position at Employer - Applicant 2 | Loan Officer | | No
BG | Are Self-Employed - Applicant 2? | No | | No
BH | Years on Job - Applicant 2 | 1 | | No
BI | Years in this Profession - Applicant 2 | 27 | | No
BJ | Business Phone - Applicant 2 | 414-705-2426 | | No
BK | Any foreclosures; repossessions; or bankruptcies in the last 5 years? | No | | No
BL | Any insurance declined; canceled; or non-renewed in the last 3 years? | No | | No
BM | Has anyone with a financial interest in the property been convicted of arson; fraud; or other crime related to a loss on a property? | No | | No
BN | Residence Type to be Insured | Single Family House | | No
BO | Number of Units | 1 | | No

BP | Year Built | 2022 | | No
BQ | House Value | 251573 | US Dollars | No
BR | Distance to Fire Hydrant (Feet) | 350 | | No
BS | Distance to Fire Station (Miles) | 1 | | No
BT | Distance to Tidal Water (Miles) | 700 | | No
BU | Angle of Slope with House (Degrees) | 0 | | No
BV | Lot Size (Square Feet) | 21903 | | No
BW | Living Area (Square Feet) | 2633 | | No
BX | Basement Area (Square Feet) | 0 | | No
BY | Garage Area (Square Feet) | 420 | | No
BZ | Garage Capacity (Number of Cars) | 2 | | No
CA | Basement Finished (Percentage) | 0 | | No
CB | Number of Stories | 2 | | No
CC | Construction Style | Wood Frame | | No
CD | Number of Bedrooms | 4 | | No
CE | Number of Full Baths | 2 | | No
CF | Number of Half Baths | 1 | | No
CG | Bathroom Quality | High | | No
CH | Kitchen Quality | High | | No
CI | Fireplace Count | 1 | | No
CJ | Wood Stove Count | 0 | | No
CK | Electrical Service (Amps) | 150 | | No
CL | Roof - Last Update Year | 2022 | | No
CM | Roof - Type of Update (Full / Partial / None) | None | | No
CN | Wiring and Electrical - Last Update Year | 2022 | | No

CO | Wiring and Electrical - Type of Update (Full / Partial / None) | None | | No
CP | Heating - Last Update Year | 2022 | | No
CQ | Heating - Type of Update (Full / Partial / None) | None | | No
CR | Plumbing - Last Update Year | 2022 | | No
CS | Plumbing - Type of Update (Full / Partial / None) | None | | No
CT | Foundation Type (Numeric Code) | 1 | 1 = Concrete Slab; 2 = Crawlspace; 3 = Cinderblock Basement; 4 = Poured Concrete Basement; 5 = Stone Basement; 6 = Wood Pilings; 7 = Concrete Pilings | No
CU | Exterior Wall Type (Numeric Code) | 1 | 1 = Brick; 2 = Wood Siding; 3 = Vinyl Siding; 4 = Aluminum Siding; 5 = Stucco; 6 = Concrete Board; 7 = Wood Shingles; 8 = Synthetic Shingles; 9 = Stone; 10 = Poured Concrete; 11 = Logs; 12 = Asbestos Tiles; 13 = EIFSCB: Exterior Insulation Finishing System over Cinder Block; 14 = EIFSS: Exterior Insulation Finishing System over Studs | No
CV | Roof Shape (Numeric Code) | 4 | 1 = A-Frame; 2 = Flat; 3 = Gable with Valley; 4 = Gable with Dormer; 5 = Bonnet; 6 = Butterfly; 7 = Gambrel; 8 = Dome; 9 = Mansard | No
CW | Roof Material (Numeric Code) | 1 | 1 = Asphalt Shingles; 2 = Shake - Wood; 3 = Shake - Cement; 4 = Aluminum / Metal; 5 = Copper; 6 = Clay Tiles; 7 = Slate Tiles; 8 = Polymer Tiles; 9 = Thatch; 10 = T-Lock; 11 = Asbestos | No
CX | Roof Anchor (Numeric Code) | 1 | 1 = Toe Nailing; 2 = Clips; 3 = Single Straps; 4 = Double Straps; 5 = Structural; 6 = Unknown | No
CY | Wind Protection (Numeric Code) | 6 | 1 = Strong Glass; 2 = Wooden Storm Shutters; 3 = Electric Metal Shutters; 4 = Manual Metal Shutters; 5 = None; 6 = Unknown | No
CZ | Protection Class (Numeric Code) | 1 | 1 = Great to 10 = Horrible | No
DA | Is Manufactured Home? | No | | No
DB | Is Historic? | No | | No
DC | Has Historic Tours? | No | | No
DD | Is Garage Attached? | Yes | | No
DE | Is Garage Heated? | No | | No
DF | Has Automated Garage Doors? | Yes | | No
DG | Has Carport? | No | | No
DH | Has Screen Enclosure? | No | | No
DI | Has Walkout Basement? | No | | No
DJ | Has Walkup Attic? | Yes | | No
DK | Has T-Lock Shingles? | No | | No
DL | Has Asbestos Shingles? | No | | No
DM | Is Under Construction? | No | | No
DN | Is Bolted To Foundation? | No | | No
DO | Has Visible Damage? | No | | No
DP | Has Deadbolt Locks? | Yes | | No
DQ | Has Sprinklers? | No | | No

DR | Has Smoke Detectors? | Yes | | No
DS | Has Carbon Monoxide Detectors? | Yes | | No
DT | Has Local Theft Alarm? | No | | No
DU | Has Central Theft Alarm? | No | | No
DV | Has Central Fire Alarm? | No | | No
DW | Has Video Surveillance? | No | | No
DX | Has Video Monitoring? | No | | No
DY | Has Leak Defense System? | Yes | | No
DZ | Has Motion Lighting? | Yes | | No
EA | Is Teardown? | No | | No
EB | Is Gutted and Remodeled? | No | | No
EC | Is Visible from Road? | Yes | | No
ED | Is Visible to Neighbors? | Yes | | No
EE | Occupied Daily? | Yes | | No
EF | Has Flood Insurance? | No | | No
EG | Has Knob and Tube Wiring? | No | | No
EH | Has Fuses? | No | | No
EI | Has FPE Electric Panel? | No | | No
EJ | Has Lead Pipes? | No | | No
EK | Has Iron Pipes? | No | | No
EL | Has Polybutylene Pipes? | No | | No
EM | Has Lead Paint? | No | | No
EN | Has Asbestos? | No | | No
EO | Has Fuel Tank Underground? | No | | No

EP | Has Fuel Tank above Ground? | No | | No
EQ | Has Fuel Tank in Basement? | No | | No
ER | Converted to Private Home from other Use? | No | | No
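The column labels in this table follow spreadsheet lettering (A through ER), and unused fields such as the previous address appear in the CSV file as consecutive commas. The following Python sketch shows one way to work with both conventions. The helper name `column_index` and the three-field sample row are hypothetical and exist only for illustration.

```python
import csv
import io

def column_index(label: str) -> int:
    """Convert a spreadsheet-style column label (A, Z, AA, ER) to a zero-based index."""
    n = 0
    for ch in label.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n - 1

# Hypothetical fragment of a row: months at address, then two empty
# previous-address fields written as consecutive commas.
row = next(csv.reader(io.StringIO("61,,\n")))

# The csv module returns empty fields as empty strings; map them to None.
previous_address = row[1] or None
```

With this helper, `column_index("ER")` evaluates to 147, that is, the last of the 148 fields in an application row.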

Table 10 is the data schema for insurance policies.

Table 10 Data schema for insurance policies


Column | Field Name | Sample Value | Comment | In Kaggle
A | Index to Insurance Application CSV | 235 | Claims use this value to refer to the policy / applicant. The value is the same in "policy.csv", that is, rows of "policy.csv" logically continue the corresponding "applic.csv" row. | No
B | Index to Insurance Agency CSV | 1 | Insurance agency information was not provided in the initial datasets. | No
C | ID for Insurance Company CSV | 10A3CE5D0 | Insurance agency information was not provided in the initial datasets. | No
D | Coverage Class | HO-4 | A Standard Insurance Coverage | No
E | Premium Amount | 439.00 | In "Monetary Currency" | No
F | Monetary Currency | USD | USD = US Dollars. The value typically matches the country where the home is. | No
G | Months Covered by Premium | 3 | Quarterly payments | No
H | Start Date | 01/15/2024 | Month / Day / Year format | No
I | End Date | 04/15/2024 | Month / Day / Year format | No
J | Theft - Physical Goods: Coverage Limit | 40000 | In "Monetary Currency" -- as are all monetary values below | No

K | Theft - Physical Goods: Deductible | 200 | | No
L | Vandalism: Coverage Limit | 61000 | | No
M | Vandalism: Deductible | 200 | | No
N | Riots: Coverage Limit | 51000 | | No
O | Riots: Deductible | 200 | | No
P | Explosion: Coverage Limit | 34000 | | No
Q | Explosion: Deductible | 200 | | No
R | Fire Damage: Coverage Limit | 28000 | | No
S | Fire Damage: Deductible | 200 | | No
T | Hail Damage: Coverage Limit | 49000 | | No
U | Hail Damage: Deductible | 200 | | No
V | Wind Damage: Coverage Limit | 45000 | | No
W | Wind Damage: Deductible | 200 | | No
X | Flood: Coverage Limit | 0 | | No
Y | Flood: Deductible | 0 | | No
Z | Water Damage - Weather: Coverage Limit | 61000 | | No
AA | Water Damage - Weather: Deductible | 200 | | No
AB | Water Damage - Plumbing: Coverage Limit | 29000 | | No
AC | Water Damage - Plumbing: Deductible | 200 | | No
AD | Water Damage - Heating Overflow: Coverage Limit | 60000 | | No
AE | Water Damage - Heating Overflow: Deductible | 200 | | No

AF | Water Damage - AC Overflow: Coverage Limit | 43000 | | No
AG | Water Damage - AC Overflow: Deductible | 200 | | No
AH | Appliance Flood: Coverage Limit | 74000 | | No
AI | Appliance Flood: Deductible | 200 | | No
AJ | Water Heater: Coverage Limit | 59000 | | No
AK | Water Heater: Deductible | 200 | | No
AL | Frozen Pipes: Coverage Limit | 26000 | | No
AM | Frozen Pipes: Deductible | 200 | | No
AN | Snow / Ice Buildup: Coverage Limit | 38000 | | No
AO | Snow / Ice Buildup: Deductible | 200 | | No
AP | Lightning: Coverage Limit | 57000 | | No
AQ | Lightning: Deductible | 200 | | No
AR | Electrical Current: Coverage Limit | 68000 | | No
AS | Electrical Current: Deductible | 200 | | No
AT | Tree: Coverage Limit | 68000 | | No
AU | Tree: Deductible | 200 | | No
AV | Falling Object: Coverage Limit | 39000 | | No
AW | Falling Object: Deductible | 200 | | No
AX | Aircraft Damage: Coverage Limit | 52000 | | No
AY | Aircraft Damage: Deductible | 200 | | No
AZ | Vehicle caused Damage: Coverage Limit | 51000 | | No

BA | Vehicle caused Damage: Deductible | 200 | | No
BB | Sinkhole: Coverage Limit | 0 | | No
BC | Sinkhole: Deductible | 0 | | No
BD | Earthquake: Coverage Limit | 0 | | No
BE | Earthquake: Deductible | 0 | | No
BF | Volcano: Coverage Limit | 43000 | | No
BG | Volcano: Deductible | 200 | | No
BH | Mandatory Evacuation: Coverage Limit | 0 | | No
BI | Mandatory Evacuation: Deductible | 0 | | No
BJ | Ordinance Change: Coverage Limit | 0 | | No
BK | Ordinance Change: Deductible | 0 | | No
BL | Building Codes: Coverage Limit | 0 | | No
BM | Building Codes: Deductible | 0 | | No
BN | Eco Upgrade: Coverage Limit | 0 | | No
BO | Eco Upgrade: Deductible | 0 | | No
BP | Identity Theft: Coverage Limit | 8000 | | No
BQ | Identity Theft: Deductible | 250 | | No
BR | Mold: Coverage Limit | 0 | | No
BS | Mold: Deductible | 0 | | No
BT | Termites: Coverage Limit | 0 | | No
BU | Termites: Deductible | 0 | | No
BV | Decayed Foundation: Coverage Limit | 68000 | | No

BW | Decayed Foundation: Deductible | 200 | | No
BX | Failure to keep safe env: Coverage Limit | 25000 | | No
BY | Failure to keep safe env: Deductible | 200 | | No
BZ | Dwelling: Replacement Cost? | No | | No
CA | Dwelling: Coverage Limit | 215000 | | No
CB | Dwelling: Deductible | 250 | | No
CC | Extended Premises: Replacement Cost? | No | | No
CD | Extended Premises: Coverage Limit | 0 | | No
CE | Extended Premises: Deductible | 0 | | No
CF | Other Structures: Replacement Cost? | No | | No
CG | Other Structures: Coverage Limit | 60000 | | No
CH | Other Structures: Deductible | 200 | | No
CI | Roof Surfaces: Replacement Cost? | No | | No
CJ | Roof Surfaces: Coverage Limit | 0 | | No
CK | Roof Surfaces: Deductible | 0 | | No
CL | Yard and Garden: Replacement Cost? | No | | No
CM | Yard and Garden: Coverage Limit | 0 | | No
CN | Yard and Garden: Deductible | 0 | | No
CO | Data Recovery: Replacement Cost? | No | | No
CP | Data Recovery: Coverage Limit | 0 | | No
CQ | Data Recovery: Deductible | 0 | | No

CR | Credit Cards: Replacement Cost? | No | | No
CS | Credit Cards: Coverage Limit | 0 | | No
CT | Credit Cards: Deductible | 0 | | No
CU | Financial Assets: Replacement Cost? | No | | No
CV | Financial Assets: Coverage Limit | 0 | | No
CW | Financial Assets: Deductible | 0 | | No
CX | Rental Income Loss: Replacement Cost? | No | | No
CY | Rental Income Loss: Coverage Limit | 0 | | No
CZ | Rental Income Loss: Deductible | 0 | | No
DA | Business Property: Replacement Cost? | No | | No
DB | Business Property: Coverage Limit | 0 | | No
DC | Business Property: Deductible | 0 | | No
DD | Home Daycare: Replacement Cost? | No | | No
DE | Home Daycare: Coverage Limit | 0 | | No
DF | Home Daycare: Deductible | 0 | | No
DG | Medical Payments: Replacement Cost? | No | | No
DH | Medical Payments: Coverage Limit | 0 | | No
DI | Medical Payments: Deductible | 0 | | No
DJ | Liability - Bodily Injury: Replacement Cost? | No | | No
DK | Liability - Bodily Injury: Coverage Limit | 145000 | | No

DL | Liability - Bodily Injury: Deductible | 250 | | No
DM | Liability - Property Damage: Replacement Cost? | No | | No
DN | Liability - Property Damage: Coverage Limit | 761000 | | No
DO | Liability - Property Damage: Deductible | 250 | | No
DP | Loss Assessment: Replacement Cost? | No | | No
DQ | Loss Assessment: Coverage Limit | 0 | | No
DR | Loss Assessment: Deductible | 0 | | No
DS | Fire Department Charges: Replacement Cost? | No | | No
DT | Fire Department Charges: Coverage Limit | 0 | | No
DU | Fire Department Charges: Deductible | 0 | | No
DV | Living Expenses: Replacement Cost? | No | | No
DW | Living Expenses: Coverage Limit | 51000 | | No
DX | Living Expenses: Deductible | 250 | | No
DY | Furniture: Replacement Cost? | No | | No
DZ | Furniture: Coverage Limit | 234000 | | No
EA | Furniture: Deductible | 250 | | No
EB | Appliances: Replacement Cost? | No | | No
EC | Appliances: Coverage Limit | 8000 | | No
ED | Appliances: Deductible | 250 | | No
EE | Electronics: Replacement Cost? | No | | No

EF | Electronics: Coverage Limit | 165000 | | No
EG | Electronics: Deductible | 250 | | No
EH | Beds & Mattresses: Replacement Cost? | No | | No
EI | Beds & Mattresses: Coverage Limit | 36000 | | No
EJ | Beds & Mattresses: Deductible | 250 | | No
EK | Apparel: Replacement Cost? | No | | No
EL | Apparel: Coverage Limit | 50000 | | No
EM | Apparel: Deductible | 250 | | No
EN | Jewelry: Replacement Cost? | No | | No
EO | Jewelry: Coverage Limit | 0 | | No
EP | Jewelry: Deductible | 0 | | No
EQ | Silverware: Replacement Cost? | No | | No
ER | Silverware: Coverage Limit | 0 | | No
ES | Silverware: Deductible | 0 | | No
ET | Tools: Replacement Cost? | No | | No
EU | Tools: Coverage Limit | 28000 | | No
EV | Tools: Deductible | 250 | | No
EW | Construction Material: Replacement Cost? | No | | No
EX | Construction Material: Coverage Limit | 0 | | No
EY | Construction Material: Deductible | 0 | | No
EZ | Books & Magazines: Replacement Cost? | No | | No

FA | Books & Magazines: Coverage Limit | 36000 | | No
FB | Books & Magazines: Deductible | 250 | | No
FC | Sporting Goods: Replacement Cost? | No | | No
FD | Sporting Goods: Coverage Limit | 170000 | | No
FE | Sporting Goods: Deductible | 250 | | No
FF | Golf Cart: Replacement Cost? | No | | No
FG | Golf Cart: Coverage Limit | 0 | | No
FH | Golf Cart: Deductible | 0 | | No
FI | Cameras: Replacement Cost? | No | | No
FJ | Cameras: Coverage Limit | 0 | | No
FK | Cameras: Deductible | 0 | | No
FL | Watches: Replacement Cost? | No | | No
FM | Watches: Coverage Limit | 0 | | No
FN | Watches: Deductible | 0 | | No
FO | Furs: Replacement Cost? | No | | No
FP | Furs: Coverage Limit | 0 | | No
FQ | Furs: Deductible | 0 | | No
FR | Medical Instruments: Replacement Cost? | No | | No
FS | Medical Instruments: Coverage Limit | 0 | | No
FT | Medical Instruments: Deductible | 0 | | No
FU | Musical Instruments: Replacement Cost? | No | | No
FV | Musical Instruments: Coverage Limit | 0 | | No
FW | Musical Instruments: Deductible | 0 | | No

FX | Other Personal Property: Replacement Cost? | No | | No
FY | Other Personal Property: Coverage Limit | 142000 | | No
FZ | Other Personal Property: Deductible | 250 | | No
GA | Special Deductibles: Wind - Percentage | 0 | | No
GB | Special Deductibles: Wind - Dollar | 0 | | No
GC | Special Deductibles: Named Storm | 0 | | No
GD | Special Deductibles: Hurricane | 0 | | No
GE | Special Deductibles: Theft | 0 | | No
GF | Special Deductibles: Water | 0 | | No
GG | Special Deductibles: All Other Perils | 0 | | No
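The shared index in column A links each policy row to its application row, and the policy dates use the Month / Day / Year format. The following Python sketch joins two records on that index and checks that the start and end dates span the number of months that the premium covers. The dictionary layout and key names are hypothetical; only the sample values (index 235, 01/15/2024, 04/15/2024, and 3 months) come from the tables above.

```python
from datetime import datetime

def parse_mdy(text):
    """Parse a Month / Day / Year string such as 01/15/2024 into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()

def months_between(start, end):
    """Whole calendar months from start to end, ignoring the day of the month."""
    return (end.year - start.year) * 12 + (end.month - start.month)

# Hypothetical minimal records, keyed by the shared index in column A.
applications = {235: {"name": "Cayson Hayes"}}
policies = {235: {"start": "01/15/2024", "end": "04/15/2024", "months": 3}}

# Join application and policy rows that share an index.
joined = {
    idx: {**applications[idx], **policies[idx]}
    for idx in applications.keys() & policies.keys()
}

policy = joined[235]
span = months_between(parse_mdy(policy["start"]), parse_mdy(policy["end"]))
```

For the sample policy, the span between the start and end dates is three months, which agrees with the quarterly payment value in column G.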

Table 11 is the data schema for insurance claims.

Table 11 Data schema for insurance claims


Column | Field Name | Sample Value | Comment | In Kaggle
A | Index to Insurance Application/Policy CSVs | 235 | The value in the first column of "applic.csv" or "policy.csv". | No
B | Policy ID | 6A9B0C8D0 | Alternative value for "index" in previous column. | No
C | Home ID | 6A9B05DF0 | | No
D | Monetary Currency | USD | USD = US Dollars. A typical value matches the country of the owner and home. |
E | Date | 07/18/2023 | Month / Day / Year Format. | No

F | Cause of Claim | Wind Damage | IBM Synthetic Data Sets supports over 30 causes for claims. Among these causes are Physical Theft, Vandalism, Riots, Explosion, Fire Damage, Hail Damage, Wind Damage, and Flood. | No
G | Assoc w Hurricane | 0 | Is this claim associated with a hurricane? FALSE -> No | No
H | Assoc w Earthquake | 0 | Is this claim associated with an earthquake? FALSE -> No | No
I | Assoc w Volcano | 0 | Is this claim associated with a volcano? FALSE -> No | No
J | Total $Claimed | 12380 | | No
K | Total $Paid | 1723 | | No
L | Deductible $on Claim | 5000 | | No
M | Is Claim Cause Covered | 1 | | No
N | Is Fraud on Claim | 0 | | No
O | Is Detected Fraud on Claim | 0 | | No
P | Item 1 - Dwelling: $Loss Claimed | 6723 | US Dollars | No

Q | Item 1: $Loss Allowed | 6723 | Detailed breakdowns are provided for 35 types of items: Item 1: House; Item 2: Extended Premises; Item 3: Other Structures; Item 4: Roof Surfaces; Item 5: Yard And Garden; Item 6: Data Recovery; Item 7: Credit Card; Item 8: Financial Assets; Item 9: Rental Income Loss; Item 10: Business Property; Item 11: Home Daycare; Item 12: Medical Payments; Item 13: Liability Bodily Injury; Item 14: Liability Property Damage; Item 15: Loss Assessment; Item 16: Fire Department Charges; Item 17: Living Expenses; Item 18: Furniture; Item 19: Appliances; Item 20: Electronics; Item 21: Beds Mattresses; Item 22: Apparel; Item 23: Jewelry; Item 24: Silverware; Item 25: Tools; Item 26: Construction Material; Item 27: Books Magazines; Item 28: Sporting Goods; Item 29: Golf Cart; Item 30: Cameras; Item 31: Watches; Item 32: Furs; Item 33: Medical Instruments; Item 34: Musical Instruments; Item 35: Other Personal Property | No
R | Item 1 - Fraud: Overstated Value | 0 | | No
S | Item 1 - Fraud: Intentional Damage | 0 | | No

T | Item 1 - Fraud: Fake Theft | 0 | | No
U | Item 1 - Fraud: Fake Repair Bills | 0 | | No
V | Item 1 - Fraud: Inflated Repair Bills | 0 | | No
W | Item 1 - Fraud: Non-Covered Use | 0 | | No
X | Item 1 - Fraud: Non-Covered Damage | 0 | | No
Y | Item 1 - Disallowed: Fraud | 0 | | No
Z | Item 1 - Disallowed: Under Deductible | 0 | | No
AA | Item 1 - Disallowed: Not Covered | 0 | | No
AB | Item 1 - Non-Full: Over Limit | 1 | | No
AC | Item 1 - Non-Full: Depreciation | 0 | | No
AD | Item 1 - Non-Full: Over Market Price | 0 | | No
AE | Item 2 - Extended Premises: $Loss Claimed | 0 | | No
AF | Item 2: $Loss Allowed | 0 | | No
AG | Item 2 - Fraud: Overstated Value | 0 | | No
AH | Item 2 - Fraud: Intentional Damage | 0 | | No
AI | Item 2 - Fraud: Fake Theft | 0 | | No
AJ | Item 2 - Fraud: Fake Repair Bills | 0 | | No
AK | Item 2 - Fraud: Inflated Repair Bills | 0 | | No
AL | Item 2 - Fraud: Non-Covered Use | 0 | | No
AM | Item 2 - Fraud: Non-Covered Damage | 0 | | No

AN | Item 2 - Disallowed: Fraud | 0 | | No
AO | Item 2 - Disallowed: Under Deductible | 0 | | No
AP | Item 2 - Disallowed: Not Covered | 0 | | No
AQ | Item 2 - Non-Full: Over Limit | 0 | | No
AR | Item 2 - Non-Full: Depreciation | 0 | | No
AS | Item 2 - Non-Full: Over Market Price | 0 | | No
AT | Item 3 - Other Structures: $Loss Claimed | 0 | | No
AU | Item 3: $Loss Allowed | 0 | | No
AV | Item 3 - Fraud: Overstated Value | 0 | | No
AW | Item 3 - Fraud: Intentional Damage | 0 | | No
AX | Item 3 - Fraud: Fake Theft | 0 | | No
AY | Item 3 - Fraud: Fake Repair Bills | 0 | | No
AZ | Item 3 - Fraud: Inflated Repair Bills | 0 | | No
BA | Item 3 - Fraud: Non-Covered Use | 0 | | No
BB | Item 3 - Fraud: Non-Covered Damage | 0 | | No
BC | Item 3 - Disallowed: Fraud | 0 | | No
BD | Item 3 - Disallowed: Under Deductible | 0 | | No
BE | Item 3 - Disallowed: Not Covered | 0 | | No
BF | Item 3 - Non-Full: Over Limit | 0 | | No
BG | Item 3 - Non-Full: Depreciation | 0 | | No

BH | Item 3 - Non-Full: Over Market Price | 0 | | No
BI | Item 4 - Roof Surfaces: $Loss Claimed | 5656 | | No
BJ | Item 4: $Loss Allowed | 0 | | No
BK | Item 4 - Fraud: Overstated Value | 0 | | No
BL | Item 4 - Fraud: Intentional Damage | 0 | | No
BM | Item 4 - Fraud: Fake Theft | 0 | | No
BN | Item 4 - Fraud: Fake Repair Bills | 0 | | No
BO | Item 4 - Fraud: Inflated Repair Bills | 0 | | No
BP | Item 4 - Fraud: Non-Covered Use | 0 | | No
BQ | Item 4 - Fraud: Non-Covered Damage | 0 | | No
BR | Item 4 - Disallowed: Fraud | 0 | | No
BS | Item 4 - Disallowed: Under Deductible | 0 | | No
BT | Item 4 - Disallowed: Not Covered | 1 | | No
BU | Item 4 - Non-Full: Over Limit | 0 | | No
BV | Item 4 - Non-Full: Depreciation | 0 | | No
BW | Item 4 - Non-Full: Over Market Price | 0 | | No
BX | Item 5 - Yard and Garden: $Loss Claimed | 0 | | No
BY | Item 5: $Loss Allowed | 0 | | No
BZ | Item 5 - Fraud: Overstated Value | 0 | | No
CA | Item 5 - Fraud: Intentional Damage | 0 | | No

CB | Item 5 - Fraud: Fake Theft | 0 | | No
CC | Item 5 - Fraud: Fake Repair Bills | 0 | | No
CD | Item 5 - Fraud: Inflated Repair Bills | 0 | | No
CE | Item 5 - Fraud: Non-Covered Use | 0 | | No
CF | Item 5 - Fraud: Non-Covered Damage | 0 | | No
CG | Item 5 - Disallowed: Fraud | 0 | | No
CH | Item 5 - Disallowed: Under Deductible | 0 | | No
CI | Item 5 - Disallowed: Not Covered | 0 | | No
CJ | Item 5 - Non-Full: Over Limit | 0 | | No
CK | Item 5 - Non-Full: Depreciation | 0 | | No
CL | Item 5 - Non-Full: Over Market Price | 0 | | No
CM | Item 6 - Data Recovery: $Loss Claimed | 0 | | No
CN | Item 6: $Loss Allowed | 0 | | No
CO | Item 6 - Fraud: Overstated Value | 0 | | No
CP | Item 6 - Fraud: Intentional Damage | 0 | | No
CQ | Item 6 - Fraud: Fake Theft | 0 | | No
CR | Item 6 - Fraud: Fake Repair Bills | 0 | | No
CS | Item 6 - Fraud: Inflated Repair Bills | 0 | | No
CT | Item 6 - Fraud: Non-Covered Use | 0 | | No
CU | Item 6 - Fraud: Non-Covered Damage | 0 | | No

CV Item 6 - Disallowed: Fraud 0 No
CW Item 6 - Disallowed: Under Deductible 0 No
CX Item 6 - Disallowed: Not Covered 0 No
CY Item 6 - Non-Full: Over Limit 0 No
CZ Item 6 - Non-Full: Depreciation 0 No
DA Item 6 - Non-Full: Over Market Price 0 No
DB Item 7 - Credit Cards: $Loss Claimed 0 No
DC Item 7: $Loss Allowed 0 No
DD Item 7 - Fraud: Overstated Value 0 No
DE Item 7 - Fraud: Intentional Damage 0 No
DF Item 7 - Fraud: Fake Theft 0 No
DG Item 7 - Fraud: Fake Repair Bills 0 No
DH Item 7 - Fraud: Inflated Repair Bills 0 No
DI Item 7 - Fraud: Non-Covered Use 0 No
DJ Item 7 - Fraud: Non-Covered Damage 0 No
DK Item 7 - Disallowed: Fraud 0 No
DL Item 7 - Disallowed: Under Deductible 0 No
DM Item 7 - Disallowed: Not Covered 0 No
DN Item 7 - Non-Full: Over Limit 0 No
DO Item 7 - Non-Full: Depreciation 0 No
DP Item 7 - Non-Full: Over Market Price 0 No

DQ Item 8 - Financial Assets: $Loss Claimed 0 No
DR Item 8: $Loss Allowed 0 No
DS Item 8 - Fraud: Overstated Value 0 No
DT Item 8 - Fraud: Intentional Damage 0 No
DU Item 8 - Fraud: Fake Theft 0 No
DV Item 8 - Fraud: Fake Repair Bills 0 No
DW Item 8 - Fraud: Inflated Repair Bills 0 No
DX Item 8 - Fraud: Non-Covered Use 0 No
DY Item 8 - Fraud: Non-Covered Damage 0 No
DZ Item 8 - Disallowed: Fraud 0 No
EA Item 8 - Disallowed: Under Deductible 0 No
EB Item 8 - Disallowed: Not Covered 0 No
EC Item 8 - Non-Full: Over Limit 0 No
ED Item 8 - Non-Full: Depreciation 0 No
EE Item 8 - Non-Full: Over Market Price 0 No
EF Item 9 - Rental Income Loss: $Loss Claimed 0 No
EG Item 9: $Loss Allowed 0 No
EH Item 9 - Fraud: Overstated Value 0 No
EI Item 9 - Fraud: Intentional Damage 0 No
EJ Item 9 - Fraud: Fake Theft 0 No

EK Item 9 - Fraud: Fake Repair Bills 0 No
EL Item 9 - Fraud: Inflated Repair Bills 0 No
EM Item 9 - Fraud: Non-Covered Use 0 No
EN Item 9 - Fraud: Non-Covered Damage 0 No
EO Item 9 - Disallowed: Fraud 0 No
EP Item 9 - Disallowed: Under Deductible 0 No
EQ Item 9 - Disallowed: Not Covered 0 No
ER Item 9 - Non-Full: Over Limit 0 No
ES Item 9 - Non-Full: Depreciation 0 No
ET Item 9 - Non-Full: Over Market Price 0 No
EU Item 10 - Business Property: $Loss Claimed 0 No
EV Item 10: $Loss Allowed 0 No
EW Item 10 - Fraud: Overstated Value 0 No
EX Item 10 - Fraud: Intentional Damage 0 No
EY Item 10 - Fraud: Fake Theft 0 No
EZ Item 10 - Fraud: Fake Repair Bills 0 No
FA Item 10 - Fraud: Inflated Repair Bills 0 No
FB Item 10 - Fraud: Non-Covered Use 0 No
FC Item 10 - Fraud: Non-Covered Damage 0 No
FD Item 10 - Disallowed: Fraud 0 No

FE Item 10 - Disallowed: Under Deductible 0 No
FF Item 10 - Disallowed: Not Covered 0 No
FG Item 10 - Non-Full: Over Limit 0 No
FH Item 10 - Non-Full: Depreciation 0 No
FI Item 10 - Non-Full: Over Market Price 0 No
FJ Item 11 - Home Daycare: $Loss Claimed 0 No
FK Item 11: $Loss Allowed 0 No
FL Item 11 - Fraud: Overstated Value 0 No
FM Item 11 - Fraud: Intentional Damage 0 No
FN Item 11 - Fraud: Fake Theft 0 No
FO Item 11 - Fraud: Fake Repair Bills 0 No
FP Item 11 - Fraud: Inflated Repair Bills 0 No
FQ Item 11 - Fraud: Non-Covered Use 0 No
FR Item 11 - Fraud: Non-Covered Damage 0 No
FS Item 11 - Disallowed: Fraud 0 No
FT Item 11 - Disallowed: Under Deductible 0 No
FU Item 11 - Disallowed: Not Covered 0 No
FV Item 11 - Non-Full: Over Limit 0 No
FW Item 11 - Non-Full: Depreciation 0 No
FX Item 11 - Non-Full: Over Market Price 0 No

FY Item 12 - Medical Payments: $Loss Claimed 0 No
FZ Item 12: $Loss Allowed 0 No
GA Item 12 - Fraud: Overstated Value 0 No
GB Item 12 - Fraud: Intentional Damage 0 No
GC Item 12 - Fraud: Fake Theft 0 No
GD Item 12 - Fraud: Fake Repair Bills 0 No
GE Item 12 - Fraud: Inflated Repair Bills 0 No
GF Item 12 - Fraud: Non-Covered Use 0 No
GG Item 12 - Fraud: Non-Covered Damage 0 No
GH Item 12 - Disallowed: Fraud 0 No
GI Item 12 - Disallowed: Under Deductible 0 No
GJ Item 12 - Disallowed: Not Covered 0 No
GK Item 12 - Non-Full: Over Limit 0 No
GL Item 12 - Non-Full: Depreciation 0 No
GM Item 12 - Non-Full: Over Market Price 0 No
GN Item 13 - Liability - Bodily Injury: $Loss Claimed 0 No
GO Item 13: $Loss Allowed 0 No
GP Item 13 - Fraud: Overstated Value 0 No
GQ Item 13 - Fraud: Intentional Damage 0 No
GR Item 13 - Fraud: Fake Theft 0 No

GS Item 13 - Fraud: Fake Repair Bills 0 No
GT Item 13 - Fraud: Inflated Repair Bills 0 No
GU Item 13 - Fraud: Non-Covered Use 0 No
GV Item 13 - Fraud: Non-Covered Damage 0 No
GW Item 13 - Disallowed: Fraud 0 No
GX Item 13 - Disallowed: Under Deductible 0 No
GY Item 13 - Disallowed: Not Covered 0 No
GZ Item 13 - Non-Full: Over Limit 0 No
HA Item 13 - Non-Full: Depreciation 0 No
HB Item 13 - Non-Full: Over Market Price 0 No
HC Item 14 - Liability - Property Damage: $Loss Claimed 0 No
HD Item 14: $Loss Allowed 0 No
HE Item 14 - Fraud: Overstated Value 0 No
HF Item 14 - Fraud: Intentional Damage 0 No
HG Item 14 - Fraud: Fake Theft 0 No
HH Item 14 - Fraud: Fake Repair Bills 0 No
HI Item 14 - Fraud: Inflated Repair Bills 0 No
HJ Item 14 - Fraud: Non-Covered Use 0 No
HK Item 14 - Fraud: Non-Covered Damage 0 No
HL Item 14 - Disallowed: Fraud 0 No

HM Item 14 - Disallowed: Under Deductible 0 No
HN Item 14 - Disallowed: Not Covered 0 No
HO Item 14 - Non-Full: Over Limit 0 No
HP Item 14 - Non-Full: Depreciation 0 No
HQ Item 14 - Non-Full: Over Market Price 0 No
HR Item 15 - Loss Assessment: $Loss Claimed 0 No
HS Item 15: $Loss Allowed 0 No
HT Item 15 - Fraud: Overstated Value 0 No
HU Item 15 - Fraud: Intentional Damage 0 No
HV Item 15 - Fraud: Fake Theft 0 No
HW Item 15 - Fraud: Fake Repair Bills 0 No
HX Item 15 - Fraud: Inflated Repair Bills 0 No
HY Item 15 - Fraud: Non-Covered Use 0 No
HZ Item 15 - Fraud: Non-Covered Damage 0 No
IA Item 15 - Disallowed: Fraud 0 No
IB Item 15 - Disallowed: Under Deductible 0 No
IC Item 15 - Disallowed: Not Covered 0 No
ID Item 15 - Non-Full: Over Limit 0 No
IE Item 15 - Non-Full: Depreciation 0 No
IF Item 15 - Non-Full: Over Market Price 0 No

IG Item 16 - Fire Department Charges: $Loss Claimed 0 No
IH Item 16: $Loss Allowed 0 No
II Item 16 - Fraud: Overstated Value 0 No
IJ Item 16 - Fraud: Intentional Damage 0 No
IK Item 16 - Fraud: Fake Theft 0 No
IL Item 16 - Fraud: Fake Repair Bills 0 No
IM Item 16 - Fraud: Inflated Repair Bills 0 No
IN Item 16 - Fraud: Non-Covered Use 0 No
IO Item 16 - Fraud: Non-Covered Damage 0 No
IP Item 16 - Disallowed: Fraud 0 No
IQ Item 16 - Disallowed: Under Deductible 0 No
IR Item 16 - Disallowed: Not Covered 0 No
IS Item 16 - Non-Full: Over Limit 0 No
IT Item 16 - Non-Full: Depreciation 0 No
IU Item 16 - Non-Full: Over Market Price 0 No
IV Item 17 - Living Expenses: $Loss Claimed 0 No
IW Item 17: $Loss Allowed 0 No
IX Item 17 - Fraud: Overstated Value 0 No
IY Item 17 - Fraud: Intentional Damage 0 No
IZ Item 17 - Fraud: Fake Theft 0 No

JA Item 17 - Fraud: Fake Repair Bills 0 No
JB Item 17 - Fraud: Inflated Repair Bills 0 No
JC Item 17 - Fraud: Non-Covered Use 0 No
JD Item 17 - Fraud: Non-Covered Damage 0 No
JE Item 17 - Disallowed: Fraud 0 No
JF Item 17 - Disallowed: Under Deductible 0 No
JG Item 17 - Disallowed: Not Covered 0 No
JH Item 17 - Non-Full: Over Limit 0 No
JI Item 17 - Non-Full: Depreciation 0 No
JJ Item 17 - Non-Full: Over Market Price 0 No
JK Item 18 - Furniture: $Loss Claimed 0 No
JL Item 18: $Loss Allowed 0 No
JM Item 18 - Fraud: Overstated Value 0 No
JN Item 18 - Fraud: Intentional Damage 0 No
JO Item 18 - Fraud: Fake Theft 0 No
JP Item 18 - Fraud: Fake Repair Bills 0 No
JQ Item 18 - Fraud: Inflated Repair Bills 0 No
JR Item 18 - Fraud: Non-Covered Use 0 No
JS Item 18 - Fraud: Non-Covered Damage 0 No
JT Item 18 - Disallowed: Fraud 0 No

JU Item 18 - Disallowed: Under Deductible 0 No
JV Item 18 - Disallowed: Not Covered 0 No
JW Item 18 - Non-Full: Over Limit 0 No
JX Item 18 - Non-Full: Depreciation 0 No
JY Item 18 - Non-Full: Over Market Price 0 No
JZ Item 19 - Appliances: $Loss Claimed 0 No
KA Item 19: $Loss Allowed 0 No
KB Item 19 - Fraud: Overstated Value 0 No
KC Item 19 - Fraud: Intentional Damage 0 No
KD Item 19 - Fraud: Fake Theft 0 No
KE Item 19 - Fraud: Fake Repair Bills 0 No
KF Item 19 - Fraud: Inflated Repair Bills 0 No
KG Item 19 - Fraud: Non-Covered Use 0 No
KH Item 19 - Fraud: Non-Covered Damage 0 No
KI Item 19 - Disallowed: Fraud 0 No
KJ Item 19 - Disallowed: Under Deductible 0 No
KK Item 19 - Disallowed: Not Covered 0 No
KL Item 19 - Non-Full: Over Limit 0 No
KM Item 19 - Non-Full: Depreciation 0 No
KN Item 19 - Non-Full: Over Market Price 0 No
KO Item 20 - Electronics: $Loss Claimed 0 No

KP Item 20: $Loss Allowed 0 No
KQ Item 20 - Fraud: Overstated Value 0 No
KR Item 20 - Fraud: Intentional Damage 0 No
KS Item 20 - Fraud: Fake Theft 0 No
KT Item 20 - Fraud: Fake Repair Bills 0 No
KU Item 20 - Fraud: Inflated Repair Bills 0 No
KV Item 20 - Fraud: Non-Covered Use 0 No
KW Item 20 - Fraud: Non-Covered Damage 0 No
KX Item 20 - Disallowed: Fraud 0 No
KY Item 20 - Disallowed: Under Deductible 0 No
KZ Item 20 - Disallowed: Not Covered 0 No
LA Item 20 - Non-Full: Over Limit 0 No
LB Item 20 - Non-Full: Depreciation 0 No
LC Item 20 - Non-Full: Over Market Price 0 No
LD Item 21 - Beds & Mattresses: $Loss Claimed 0 No
LE Item 21: $Loss Allowed 0 No
LF Item 21 - Fraud: Overstated Value 0 No
LG Item 21 - Fraud: Intentional Damage 0 No
LH Item 21 - Fraud: Fake Theft 0 No
LI Item 21 - Fraud: Fake Repair Bills 0 No

LJ Item 21 - Fraud: Inflated Repair Bills 0 No
LK Item 21 - Fraud: Non-Covered Use 0 No
LL Item 21 - Fraud: Non-Covered Damage 0 No
LM Item 21 - Disallowed: Fraud 0 No
LN Item 21 - Disallowed: Under Deductible 0 No
LO Item 21 - Disallowed: Not Covered 0 No
LP Item 21 - Non-Full: Over Limit 0 No
LQ Item 21 - Non-Full: Depreciation 0 No
LR Item 21 - Non-Full: Over Market Price 0 No
LS Item 22 - Apparel: $Loss Claimed 0 No
LT Item 22: $Loss Allowed 0 No
LU Item 22 - Fraud: Overstated Value 0 No
LV Item 22 - Fraud: Intentional Damage 0 No
LW Item 22 - Fraud: Fake Theft 0 No
LX Item 22 - Fraud: Fake Repair Bills 0 No
LY Item 22 - Fraud: Inflated Repair Bills 0 No
LZ Item 22 - Fraud: Non-Covered Use 0 No
MA Item 22 - Fraud: Non-Covered Damage 0 No
MB Item 22 - Disallowed: Fraud 0 No
MC Item 22 - Disallowed: Under Deductible 0 No

MD Item 22 - Disallowed: Not Covered 0 No
ME Item 22 - Non-Full: Over Limit 0 No
MF Item 22 - Non-Full: Depreciation 0 No
MG Item 22 - Non-Full: Over Market Price 0 No
MH Item 23 - Jewelry: $Loss Claimed 0 No
MI Item 23: $Loss Allowed 0 No
MJ Item 23 - Fraud: Overstated Value 0 No
MK Item 23 - Fraud: Intentional Damage 0 No
ML Item 23 - Fraud: Fake Theft 0 No
MM Item 23 - Fraud: Fake Repair Bills 0 No
MN Item 23 - Fraud: Inflated Repair Bills 0 No
MO Item 23 - Fraud: Non-Covered Use 0 No
MP Item 23 - Fraud: Non-Covered Damage 0 No
MQ Item 23 - Disallowed: Fraud 0 No
MR Item 23 - Disallowed: Under Deductible 0 No
MS Item 23 - Disallowed: Not Covered 0 No
MT Item 23 - Non-Full: Over Limit 0 No
MU Item 23 - Non-Full: Depreciation 0 No
MV Item 23 - Non-Full: Over Market Price 0 No
MW Item 24 - Silverware: $Loss Claimed 0 No
MX Item 24: $Loss Allowed 0 No

MY Item 24 - Fraud: Overstated Value 0 No
MZ Item 24 - Fraud: Intentional Damage 0 No
NA Item 24 - Fraud: Fake Theft 0 No
NB Item 24 - Fraud: Fake Repair Bills 0 No
NC Item 24 - Fraud: Inflated Repair Bills 0 No
ND Item 24 - Fraud: Non-Covered Use 0 No
NE Item 24 - Fraud: Non-Covered Damage 0 No
NF Item 24 - Disallowed: Fraud 0 No
NG Item 24 - Disallowed: Under Deductible 0 No
NH Item 24 - Disallowed: Not Covered 0 No
NI Item 24 - Non-Full: Over Limit 0 No
NJ Item 24 - Non-Full: Depreciation 0 No
NK Item 24 - Non-Full: Over Market Price 0 No
NL Item 25 - Tools: $Loss Claimed 0 No
NM Item 25: $Loss Allowed 0 No
NN Item 25 - Fraud: Overstated Value 0 No
NO Item 25 - Fraud: Intentional Damage 0 No
NP Item 25 - Fraud: Fake Theft 0 No
NQ Item 25 - Fraud: Fake Repair Bills 0 No
NR Item 25 - Fraud: Inflated Repair Bills 0 No
NS Item 25 - Fraud: Non-Covered Use 0 No

NT Item 25 - Fraud: Non-Covered Damage 0 No
NU Item 25 - Disallowed: Fraud 0 No
NV Item 25 - Disallowed: Under Deductible 0 No
NW Item 25 - Disallowed: Not Covered 0 No
NX Item 25 - Non-Full: Over Limit 0 No
NY Item 25 - Non-Full: Depreciation 0 No
NZ Item 25 - Non-Full: Over Market Price 0 No
OA Item 26 - Construction Material: $Loss Claimed 0 No
OB Item 26: $Loss Allowed 0 No
OC Item 26 - Fraud: Overstated Value 0 No
OD Item 26 - Fraud: Intentional Damage 0 No
OE Item 26 - Fraud: Fake Theft 0 No
OF Item 26 - Fraud: Fake Repair Bills 0 No
OG Item 26 - Fraud: Inflated Repair Bills 0 No
OH Item 26 - Fraud: Non-Covered Use 0 No
OI Item 26 - Fraud: Non-Covered Damage 0 No
OJ Item 26 - Disallowed: Fraud 0 No
OK Item 26 - Disallowed: Under Deductible 0 No
OL Item 26 - Disallowed: Not Covered 0 No
OM Item 26 - Non-Full: Over Limit 0 No

ON Item 26 - Non-Full: Depreciation 0 No
OO Item 26 - Non-Full: Over Market Price 0 No
OP Item 27 - Books & Magazines: $Loss Claimed 0 No
OQ Item 27: $Loss Allowed 0 No
OR Item 27 - Fraud: Overstated Value 0 No
OS Item 27 - Fraud: Intentional Damage 0 No
OT Item 27 - Fraud: Fake Theft 0 No
OU Item 27 - Fraud: Fake Repair Bills 0 No
OV Item 27 - Fraud: Inflated Repair Bills 0 No
OW Item 27 - Fraud: Non-Covered Use 0 No
OX Item 27 - Fraud: Non-Covered Damage 0 No
OY Item 27 - Disallowed: Fraud 0 No
OZ Item 27 - Disallowed: Under Deductible 0 No
PA Item 27 - Disallowed: Not Covered 0 No
PB Item 27 - Non-Full: Over Limit 0 No
PC Item 27 - Non-Full: Depreciation 0 No
PD Item 27 - Non-Full: Over Market Price 0 No
PE Item 28 - Sporting Goods: $Loss Claimed 0 No
PF Item 28: $Loss Allowed 0 No
PG Item 28 - Fraud: Overstated Value 0 No

PH Item 28 - Fraud: Intentional Damage 0 No
PI Item 28 - Fraud: Fake Theft 0 No
PJ Item 28 - Fraud: Fake Repair Bills 0 No
PK Item 28 - Fraud: Inflated Repair Bills 0 No
PL Item 28 - Fraud: Non-Covered Use 0 No
PM Item 28 - Fraud: Non-Covered Damage 0 No
PN Item 28 - Disallowed: Fraud 0 No
PO Item 28 - Disallowed: Under Deductible 0 No
PP Item 28 - Disallowed: Not Covered 0 No
PQ Item 28 - Non-Full: Over Limit 0 No
PR Item 28 - Non-Full: Depreciation 0 No
PS Item 28 - Non-Full: Over Market Price 0 No
PT Item 29 - Golf Cart: $Loss Claimed 0 No
PU Item 29: $Loss Allowed 0 No
PV Item 29 - Fraud: Overstated Value 0 No
PW Item 29 - Fraud: Intentional Damage 0 No
PX Item 29 - Fraud: Fake Theft 0 No
PY Item 29 - Fraud: Fake Repair Bills 0 No
PZ Item 29 - Fraud: Inflated Repair Bills 0 No
QA Item 29 - Fraud: Non-Covered Use 0 No

QB Item 29 - Fraud: Non-Covered Damage 0 No
QC Item 29 - Disallowed: Fraud 0 No
QD Item 29 - Disallowed: Under Deductible 0 No
QE Item 29 - Disallowed: Not Covered 0 No
QF Item 29 - Non-Full: Over Limit 0 No
QG Item 29 - Non-Full: Depreciation 0 No
QH Item 29 - Non-Full: Over Market Price 0 No
QI Item 30 - Cameras: $Loss Claimed 0 No
QJ Item 30: $Loss Allowed 0 No
QK Item 30 - Fraud: Overstated Value 0 No
QL Item 30 - Fraud: Intentional Damage 0 No
QM Item 30 - Fraud: Fake Theft 0 No
QN Item 30 - Fraud: Fake Repair Bills 0 No
QO Item 30 - Fraud: Inflated Repair Bills 0 No
QP Item 30 - Fraud: Non-Covered Use 0 No
QQ Item 30 - Fraud: Non-Covered Damage 0 No
QR Item 30 - Disallowed: Fraud 0 No
QS Item 30 - Disallowed: Under Deductible 0 No
QT Item 30 - Disallowed: Not Covered 0 No
QU Item 30 - Non-Full: Over Limit 0 No

QV Item 30 - Non-Full: Depreciation 0 No
QW Item 30 - Non-Full: Over Market Price 0 No
QX Item 31 - Watches: $Loss Claimed 0 No
QY Item 31: $Loss Allowed 0 No
QZ Item 31 - Fraud: Overstated Value 0 No
RA Item 31 - Fraud: Intentional Damage 0 No
RB Item 31 - Fraud: Fake Theft 0 No
RC Item 31 - Fraud: Fake Repair Bills 0 No
RD Item 31 - Fraud: Inflated Repair Bills 0 No
RE Item 31 - Fraud: Non-Covered Use 0 No
RF Item 31 - Fraud: Non-Covered Damage 0 No
RG Item 31 - Disallowed: Fraud 0 No
RH Item 31 - Disallowed: Under Deductible 0 No
RI Item 31 - Disallowed: Not Covered 0 No
RJ Item 31 - Non-Full: Over Limit 0 No
RK Item 31 - Non-Full: Depreciation 0 No
RL Item 31 - Non-Full: Over Market Price 0 No
RM Item 32 - Furs: $Loss Claimed 0 No
RN Item 32: $Loss Allowed 0 No
RO Item 32 - Fraud: Overstated Value 0 No
RP Item 32 - Fraud: Intentional Damage 0 No

RQ Item 32 - Fraud: Fake Theft 0 No
RR Item 32 - Fraud: Fake Repair Bills 0 No
RS Item 32 - Fraud: Inflated Repair Bills 0 No
RT Item 32 - Fraud: Non-Covered Use 0 No
RU Item 32 - Fraud: Non-Covered Damage 0 No
RV Item 32 - Disallowed: Fraud 0 No
RW Item 32 - Disallowed: Under Deductible 0 No
RX Item 32 - Disallowed: Not Covered 0 No
RY Item 32 - Non-Full: Over Limit 0 No
RZ Item 32 - Non-Full: Depreciation 0 No
SA Item 32 - Non-Full: Over Market Price 0 No
SB Item 33 - Medical Instruments: $Loss Claimed 0 No
SC Item 33: $Loss Allowed 0 No
SD Item 33 - Fraud: Overstated Value 0 No
SE Item 33 - Fraud: Intentional Damage 0 No
SF Item 33 - Fraud: Fake Theft 0 No
SG Item 33 - Fraud: Fake Repair Bills 0 No
SH Item 33 - Fraud: Inflated Repair Bills 0 No
SI Item 33 - Fraud: Non-Covered Use 0 No
SJ Item 33 - Fraud: Non-Covered Damage 0 No

SK Item 33 - Disallowed: Fraud 0 No
SL Item 33 - Disallowed: Under Deductible 0 No
SM Item 33 - Disallowed: Not Covered 0 No
SN Item 33 - Non-Full: Over Limit 0 No
SO Item 33 - Non-Full: Depreciation 0 No
SP Item 33 - Non-Full: Over Market Price 0 No
SQ Item 34 - Musical Instruments: $Loss Claimed 0 No
SR Item 34: $Loss Allowed 0 No
SS Item 34 - Fraud: Overstated Value 0 No
ST Item 34 - Fraud: Intentional Damage 0 No
SU Item 34 - Fraud: Fake Theft 0 No
SV Item 34 - Fraud: Fake Repair Bills 0 No
SW Item 34 - Fraud: Inflated Repair Bills 0 No
SX Item 34 - Fraud: Non-Covered Use 0 No
SY Item 34 - Fraud: Non-Covered Damage 0 No
SZ Item 34 - Disallowed: Fraud 0 No
TA Item 34 - Disallowed: Under Deductible 0 No
TB Item 34 - Disallowed: Not Covered 0 No
TC Item 34 - Non-Full: Over Limit 0 No
TD Item 34 - Non-Full: Depreciation 0 No

TE Item 34 - Non-Full: Over Market Price 0 No
TF Item 35 - Other Personal Property: $Loss Claimed 0 No
TG Item 35: $Loss Allowed 0 No
TH Item 35 - Fraud: Overstated Value 0 No
TI Item 35 - Fraud: Intentional Damage 0 No
TJ Item 35 - Fraud: Fake Theft 0 No
TK Item 35 - Fraud: Fake Repair Bills 0 No
TL Item 35 - Fraud: Inflated Repair Bills 0 No
TM Item 35 - Fraud: Non-Covered Use 0 No
TN Item 35 - Fraud: Non-Covered Damage 0 No
TO Item 35 - Disallowed: Fraud 0 No
TP Item 35 - Disallowed: Under Deductible 0 No
TQ Item 35 - Disallowed: Not Covered 0 No
TR Item 35 - Non-Full: Over Limit 0 No
TS Item 35 - Non-Full: Depreciation 0 No
TT Item 35 - Non-Full: Over Market Price 0 No
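
The per-item columns in the table above repeat in a fixed group of 15 fields, so a spreadsheet column letter can be mapped back to an item number and field. The following is a minimal sketch, assuming that the 15-field pattern holds for every item and anchoring on column BI (Item 4, $Loss Claimed) as listed in the table. The names `ITEM_FIELDS`, `column_number`, and `describe_column` are illustrative and are not part of the data set.

```python
# Field suffixes in the order they repeat for each item in the table above.
ITEM_FIELDS = [
    "$Loss Claimed",
    "$Loss Allowed",
    "Fraud: Overstated Value",
    "Fraud: Intentional Damage",
    "Fraud: Fake Theft",
    "Fraud: Fake Repair Bills",
    "Fraud: Inflated Repair Bills",
    "Fraud: Non-Covered Use",
    "Fraud: Non-Covered Damage",
    "Disallowed: Fraud",
    "Disallowed: Under Deductible",
    "Disallowed: Not Covered",
    "Non-Full: Over Limit",
    "Non-Full: Depreciation",
    "Non-Full: Over Market Price",
]


def column_number(letters):
    """Convert a spreadsheet column label (A, ..., Z, AA, ...) to a 1-based number."""
    number = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"not a column label: {letters!r}")
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


# Assumption: Item 4 starts at column BI, as shown in the table.
ANCHOR_ITEM = 4
ANCHOR_COLUMN = column_number("BI")


def describe_column(letters):
    """Return (item number, field suffix) for a per-item column label."""
    offset = column_number(letters) - ANCHOR_COLUMN
    # Floor division and modulo also handle columns before the anchor.
    item = ANCHOR_ITEM + offset // len(ITEM_FIELDS)
    if item < 1:
        raise ValueError(f"{letters!r} is before the per-item columns")
    return item, ITEM_FIELDS[offset % len(ITEM_FIELDS)]


print(describe_column("BT"))  # the only per-item flag set to 1 in the sample row
```

In the sample row, this lookup ties the claimed amount in BI to the flag in BT, both of which belong to Item 4.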

Table 12 is the data schema for insurance freetext.

Table 12 Data schema for insurance freetext


Column Field Name Sample Value Comment In Kaggle

A Insured Claim Narrative
Sample Value: I hope this damage is covered. Please have somebody come to my house. Suddenly I heard something. There were howling winds on Jul 18. I still have to check things, but here is what I think is lost or damaged: lots of the house and the roof pieces. The losses totaled $12380. How much is my deductible? My house needs help. My house has 6 bedrooms. My house has 2350 square feet. This is urgent.
Comment: These columns in freetext can be viewed as extensions to the columns in "insur_claims.csv", that is, there is a 1:1 mapping of rows here to rows in the claims file.
In Kaggle: No

B Generic Request for a Person
Sample Value: 0
Comment: Labels about the attributes of the narrative. 1 means that the entry is an instance of the specified type, for example, "Request for a Person". 0 means that the entry is not an instance. More than one field can be 1, and more than one field can be 0. Indeed, the fields can be all 1s or all 0s.
In Kaggle: No

C Request only a Person can Answer
Sample Value: 1
Comment: Labels about the attributes of the narrative. 1 means that the entry is an instance of the specified type, for example, "Request only a Person can Answer". 0 means that the entry is not an instance. More than one field can be 1, and more than one field can be 0. Indeed, the fields can be all 1s or all 0s.
In Kaggle: No

D Request for a Fact
Sample Value: 1
Comment: Labels about the attributes of the narrative. 1 means that the entry is an instance of the specified type, for example, "Request for a Fact". 0 means that the entry is not an instance. More than one field can be 1, and more than one field can be 0. Indeed, the fields can be all 1s or all 0s.
In Kaggle: No

E Request for Advice
Sample Value: 0
Comment: Labels about the attributes of the narrative. 1 means that the entry is an instance of the specified type, for example, "Request for Advice". 0 means that the entry is not an instance. More than one field can be 1, and more than one field can be 0. Indeed, the fields can be all 1s or all 0s.
In Kaggle: No

F Request for a Prediction
Sample Value: 0
Comment: Labels about the attributes of the narrative. 1 means that the entry is an instance of the specified type, for example, "Request for a Prediction". 0 means that the entry is not an instance. More than one field can be 1, and more than one field can be 0. Indeed, the fields can be all 1s or all 0s.
In Kaggle: No
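
Because the freetext rows map 1:1 to the rows of "insur_claims.csv", the two files can be paired by row position, and the five 0/1 columns can be read as a multi-label set. The following is a minimal sketch that uses only the Python standard library. It assumes that the CSV headers match the field names in Table 12; the in-memory text stands in for the real files, and the `Claim ID` column and the helper names are hypothetical.

```python
import csv
import io

# Label columns B through F from Table 12 (header spelling is an assumption).
LABEL_FIELDS = [
    "Generic Request for a Person",
    "Request only a Person can Answer",
    "Request for a Fact",
    "Request for Advice",
    "Request for a Prediction",
]


def active_labels(row):
    """Return the label names set to 1; the result may be empty or contain all five."""
    return [name for name in LABEL_FIELDS if row[name].strip() == "1"]


def pair_rows(claims_file, freetext_file):
    """Pair claims rows with freetext rows by position (1:1 mapping)."""
    claims = list(csv.DictReader(claims_file))
    notes = list(csv.DictReader(freetext_file))
    if len(claims) != len(notes):
        raise ValueError("expected the same number of rows in both files")
    return list(zip(claims, notes))


# Stand-in data; "Claim ID" is a hypothetical claims column.
claims_csv = io.StringIO("Claim ID\n1001\n")
freetext_csv = io.StringIO(
    "Insured Claim Narrative," + ",".join(LABEL_FIELDS) + "\n"
    '"How much is my deductible?",0,1,1,0,0\n'
)

pairs = pair_rows(claims_csv, freetext_csv)
claim, note = pairs[0]
print(claim["Claim ID"], active_labels(note))
```

Pairing by position rather than by a key follows the 1:1 row mapping described in the comment for column A. The length check guards against files that are out of sync.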

Table 13 is the data schema for storms.

Table 13 Data schema for storms

Table 14 is the data schema for quakes.

Table 14 Data schema for quakes

Table 15 is the data schema for volcanoes.

Table 15 Data schema for volcanoes

Back cover

REDP-5748-00

ISBN 0738461997

Printed in U.S.A.

ibm.com/redbooks
