redp5748
Erik Altman
Dipali Aphale
Joy Deng
Yadu Nandan B
Saurabh Srivastava
Kelly Xiang
Artificial Intelligence
IBM Redbooks
February 2025
REDP-5748-00
Note: Before using this information and the product it supports, read the information in “Notices” on page v.
Notices
Trademarks
Preface
Authors
Now you can become a published author, too!
Comments welcome
Stay connected to IBM Redbooks
Available editions
Trial Edition
Pro Edition
Enterprise Edition
Getting started
Additional resources
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
IBM®
IBM Cloud®
IBM Security®
IBM Z®
Passport Advantage®
Redbooks®
Redbooks (logo)®
z/OS®
Other company, product, or service names may be trademarks or service marks of others.
IBM Synthetic Data Sets is a family of artificially generated, enterprise-grade datasets that
enhance predictive artificial intelligence (AI) model training and large language models
(LLMs) to benefit IBM Z® and IBM LinuxONE clients, ecosystems, and independent software
vendors. These pre-built datasets are downloadable and packaged as comma-separated
values (CSVs) and data definition language (DDL) files, making them familiar to use, and
compatible with everything from databases to spreadsheets to hardware platforms to
standard AI tools. These datasets also leverage the IBM® industry expertise and domain
knowledge of the financial services sector without using any real client seed data, which
alleviates security concerns with Personally Identifiable Information (PII). Real data at client
sites is often limited in scope to only their own organization's transactions, and clients do not
always know which transactions are fraudulent. To address this scenario, IBM Synthetic
Data Sets were modified for fraud detection use cases so that clients can download them to
develop predictive AI models and LLMs for financial services, or to optimize
existing models for improved accuracy and risk mitigation.
The IBM Synthetic Data Sets family contains the following features:
IBM Synthetic Data Sets for Payment Cards
IBM Synthetic Data Sets for Core Banking and Money Laundering
IBM Synthetic Data Sets for Homeowners Insurance
This IBM Redbooks® publication introduces IBM Synthetic Data Sets and provides
information about how IBM Synthetic Data Sets can enhance and optimize your predictive AI
model training and LLMs.
Authors
This publication was produced by a team of specialists from around the world working with
the IBM Redbooks team.
Erik Altman is a Research Scientist at the IBM T.J. Watson Research Center. He has worked
across many technical disciplines, such as computer architecture and artificial intelligence
(AI). He has written dozens of scientific papers, and has dozens of issued patents. His works
include five papers on credit card fraud and money laundering that he presented at leading AI
conferences, such as Neurips, AAAI, and ICAIF. He has served for more than 10 years on the
investment committee of the Association for Computing Machinery (ACM), where he acts as a
steward for more than $100 million in assets. He received a bachelor’s degree in Computer
Science and in Economics from MIT. He received his master’s degree and PhD in Electrical
Engineering from McGill University.
Dipali Aphale is a Lead AI Design Researcher who is based in San Francisco, California.
She has 7 years of experience in design and technology. She holds a Bachelor of Industrial
Design degree from NC State College of Design, a Master of Art degree in Design
Entrepreneurship from the Royal College of Art, and a Master of Science degree in Design
Engineering from Imperial College London. Her areas of expertise include design research,
speculative design futures, product and industrial design, brand identity, and marketing.
Before she entered tech, she worked extensively in medical product design and care delivery
systems.
Kelly Xiang is a Content Designer for AI on IBM Z who is based in Poughkeepsie, New York.
She has 2 years of experience in content development and technical writing. She holds a
degree in English Literature and International Development from McGill University. Her areas
of expertise include content editing, content strategy, technical documentation, and UI and
UX writing. Before joining the AI on IBM Z organization, Kelly wrote extensively for IBM Data
and AI and on various projects that were related to AI ethics.
Yadu Nandan B is a Back-end Developer in the AI on IBM Z team who is based in Bengaluru,
India. He has 6 months of experience, and has been actively contributing to IBM Synthetic
Data Sets since then. He holds a bachelor’s degree in Information Science and Engineering
from the Global Academy of Technology, Bengaluru. His expertise is in the areas of
programming in C++, Python, and AI and Machine Learning.
Lydia Parziale
IBM Redbooks, Poughkeepsie Center
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
We want our publications to be as helpful as possible. Send us your comments about this or other IBM
Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Introducing IBM Synthetic Data Sets
The goal of the tailored datasets in this publication is to produce real-time artificial intelligence
(AI) use cases on IBM Z and LinuxONE (for example, fraud detection, anti-money laundering,
and insurance datasets) and generate business insights without violating data privacy and
security. The IBM Synthetic Data Sets feature is designed to keep real data secure from
threats by training models with artificial data and leveraging data that uses no real Personally
Identifiable Information (PII) and requires no encryption or redaction.
IBM Synthetic Data Sets trains and enhances predictive models and composite AI methods.
Those models can be deployed to IBM Z and LinuxONE with inferencing tools, such as IBM
Machine Learning for IBM z/OS®, AI Toolkit for IBM Z and IBM LinuxONE, and
IBM Cloud® Pak for Data on IBM Z.
This section provides an overview of the typical stages in the AI model lifecycle, with a
description of each stage and how IBM Synthetic Data Sets can provide value to each of the
stages.
As listed in “Introducing IBM Synthetic Data Sets” on page 1, the IBM Synthetic Data Sets
family contains the following features:
IBM Synthetic Data Sets for Payment Cards
IBM Synthetic Data Sets for Core Banking and Money Laundering
IBM Synthetic Data Sets for Homeowners Insurance
These datasets are available for purchase and are described in this section.
Synthetic data can also be used in honeypot operations that attract and capture security
threats. Specifically, companies can place IBM Synthetic Data Sets where they fear hackers
might penetrate. However, because IBM Synthetic Data Sets only contains simulated data,
the loss from stolen synthetic data is smaller for the company than from stolen real data.
Nevertheless, the experience of the data theft helps the company monitor and improve its
cybersecurity. Companies can combine IBM Synthetic Data Sets with their real data to deter
data theft. Even if hackers obtain access to real data, they must spend considerable time
differentiating real data from synthetic data. This increased effort can reduce the incentive to
steal the data.
IBM Synthetic Data Sets for Payment Cards is best suited for the following business use
cases:
Credit card fraud
Debit card fraud
Targeted marketing, such as product recommendations
Honeypot
Because money laundering often goes undetected, having a dataset that is specialized in
identifying transactions for fraud and money laundering is highly valuable. The dataset helps
models determine the type of laundering, for example, fan-in, fan-out, or cycle. As a result,
IBM Synthetic Data Sets for Core Banking and Money Laundering can offer key insights for
creating an anti-money laundering solution.
IBM Synthetic Data Sets for Core Banking and Money Laundering is best suited for the
following business use cases:
Money laundering detection
Check fraud
APP fraud
Loan default prediction
Honeypot
Although many insurance companies have rich, real data about policy holders and claims,
IBM Synthetic Data Sets for Homeowners Insurance enhances insights by providing a broad
scope of loss scenarios. These extra and diverse scenarios can help detect fraudulent
claims and flag fraud indicators, which might help establish accurate pricing and better risk
assessment. The claims data can provide greater transparency when determining fraud
because it provides the type or types of fraud that are committed on the claim and the
monetary amount of each fraud type.
Therefore, IBM Synthetic Data Sets for Homeowners Insurance is a rich tool for training,
enhancing, and validating AI models that detect fraudulent homeowners insurance claims.
This dataset can expand to support other areas, such as loan underwriting and credit scoring.
For example, knowing that a customer has unpaid, outstanding, or pending claims can
provide further insights into their financial behavior and risk profile.
Conversely, some text inquiries might be answered effectively by automated agents. For
example, policy questions such as “What is the deductible on my policy?” can be answered
without real human assistance. By distinguishing these interactions, insurance companies
can leverage their human agents more efficiently and cost-effectively.
IBM Synthetic Data Sets for Homeowners Insurance is best suited for the following business
use cases:
Fraud detection
Underwriting and pricing
Loan underwriting
Credit scoring
IBM Synthetic Data Sets are available in three sizes or editions: Trial, Pro, and Enterprise. In
the agent-based model generation of IBM Synthetic Data Sets (see “Data generation
methodology” on page 13), simulated agents or people transact over a period, and those
recorded transactions become the data input for IBM Synthetic Data Sets.
This section describes each edition. Review each edition to determine the most suitable
dataset for your artificial intelligence (AI) solutions.
Pro Edition
The Pro Edition is a medium-sized dataset and ideal for independent software vendors and
small customers on a budget that need a large, rich data set for creating their AI solutions.
This edition is roughly 360x the size of the Trial Edition dataset, and its transaction generation
parameters are 15,000 simulated people transacting over a period of 25 months. It is
available for purchase through an IBM Passport Advantage® account or by contacting
[email protected].
Enterprise Edition
The Enterprise Edition is the largest dataset and is recommended for large IBM Z and
LinuxONE enterprises that need the largest, richest data to create their AI solutions. It is
roughly 1950x the size of the Trial Edition dataset, and its transaction generation parameters
are 150,000 simulated people transacting over a period of 37 months. It is available for
purchase through Passport Advantage® or by contacting [email protected].
Best suited for:
Trial Edition: Trials and proofs of concept
Pro Edition: Independent software vendors and small customers
Enterprise Edition: IBM Z and LinuxONE enterprises
A data schema describes what data is included in a dataset. It is the blueprint that defines
how the data is structured, organized, and related to other data attributes. Data schemas for
each IBM Synthetic Data Sets edition can be found in “Appendix: Data schemes for IBM
Synthetic Data Sets” on page 28. The schemas are formatted to display data from top to
bottom for visual fit, but the original datasets display data from left to right.
Each data schema lists the column letter that indicates where the attribute is located, the
name of the attribute, an example of the attribute, and comments that explain the attribute
and its range of options.
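As a minimal sketch of how a column letter in a schema maps to a position in a CSV file, the following Python snippet converts spreadsheet-style letters to zero-based column indexes and looks up one attribute. The sample header and row (`transaction_id`, `amount`, `is_fraud`) are hypothetical and are not taken from the actual schemas.

```python
import csv
import io


def column_letter_to_index(letter: str) -> int:
    """Convert a spreadsheet-style column letter (A, B, ..., Z, AA, ...) to a zero-based index."""
    if not letter:
        raise ValueError("Column letter must not be empty")
    index = 0
    for char in letter.upper():
        if not "A" <= char <= "Z":
            raise ValueError(f"Invalid column letter: {letter!r}")
        # Treat the letters as digits of a base-26 number where A=1 ... Z=26.
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


# Hypothetical two-line CSV sample; the real attribute names are listed in the schemas.
SAMPLE_CSV = "transaction_id,amount,is_fraud\n1001,25.40,No\n"


def read_attribute(csv_text: str, column_letter: str) -> tuple[str, str]:
    """Return (attribute name, first-row value) for the given column letter."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, first_row = rows[0], rows[1]
    position = column_letter_to_index(column_letter)
    return header[position], first_row[position]
```

For example, `read_attribute(SAMPLE_CSV, "C")` looks up the third column of the sample, which mirrors how a schema row for column C describes the attribute stored there.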
Real data is important for artificial intelligence (AI) model training. However, there are many
times where synthetic data can add value to real data or serve as an alternative when real
data is not available. To answer the question, “I have real data, why would I need synthetic
data?”, IBM Synthetic Data Sets does not contain any real Personally Identifiable Information
(PII) data; labels transactions for fraud or money laundering; and is a less expensive
alternative to real data. As a result, enterprises can jump-start their AI projects with rich,
privacy-compliant, and cost-effective synthetic data.
With IBM Synthetic Data Sets, data scientists can focus on the model sooner. Each dataset is
pre-built, contains no PII, and includes the key attributes for many IBM Z and LinuxONE AI
use cases so that data scientists can immediately begin training models. The datasets come
in comma-separated value (CSV) and data definition language (DDL) formats to make them
compatible across many systems and software. As a result, data scientists can conveniently
use IBM Synthetic Data Sets to create proof-of-concepts, which illustrate the value and
potential capabilities of AI on a business. For independent software vendors who do not have
access to their IBM Z and LinuxONE customers’ data, these datasets aim to empower AI
solution creation by supplying artificial transactional data that is realistic.
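To illustrate how the CSV and DDL formats work together, the following sketch creates a table from a DDL statement and loads CSV rows into it by using only the Python standard library. The table definition, column names, and sample rows are assumptions made for illustration; the DDL that ships with the product might target a different database and need dialect adjustments.

```python
import csv
import io
import sqlite3

# Hypothetical DDL and CSV content; the real files ship with the product.
DDL = """
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    amount REAL NOT NULL,
    is_fraud TEXT NOT NULL
);
"""
CSV_TEXT = "transaction_id,amount,is_fraud\n1,12.50,No\n2,980.00,Yes\n3,43.10,No\n"


def load_dataset(ddl: str, csv_text: str) -> sqlite3.Connection:
    """Create the table from the DDL and insert every CSV row into it."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(ddl)
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(row["transaction_id"]), float(row["amount"]), row["is_fraud"])
        for row in reader
    ]
    connection.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    connection.commit()
    return connection


conn = load_dataset(DDL, CSV_TEXT)
# Count the transactions that carry a fraud label.
fraud_count = conn.execute(
    "SELECT COUNT(*) FROM transactions WHERE is_fraud = 'Yes'"
).fetchone()[0]
```

An in-memory SQLite database keeps the sketch self-contained; the same pattern applies to other databases that accept the DDL.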
Identifying fraud and money laundering in real data can be challenging. Money laundering is
difficult to detect because criminals use complex techniques to disguise illicit funds as
legitimate financial assets. With IBM Synthetic Data Sets, all transactions are
labeled Yes or No to indicate whether they involve money laundering or other criminal
activities, such as check fraud or authorized push payment (APP) fraud. Due to the synthetic
data generation methodology, all labels are assigned with 100% accuracy. No laundering,
check fraud, or scams are missed, and all transactions that are determined to be fraudulent
are instances of the criminal activity.
To illustrate, when a criminal forges or alters a check, or deceives victims into sending money,
these transactions are always identified as check and APP fraud. Subsequently, these
transactions lead to money laundering as the criminals try to conceal or legitimize their illegal
funds. Other types of criminal activity can also result in illicit funds, with the laundering of
those funds labeled. By establishing ground truth in its data, IBM Synthetic Data Sets strives
to provide reliable, high-quality training data that improves models' ability to detect money
laundering and other criminal activity.
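Because every transaction carries a ground-truth Yes or No label, a model's predictions can be scored directly against those labels. The following sketch computes precision and recall for the Yes class. The label and prediction lists are made-up sample values, not data from the product.

```python
def precision_recall(labels: list[str], predictions: list[str]) -> tuple[float, float]:
    """Compute precision and recall for the positive class "Yes"."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == "Yes" and p == "Yes")
    false_pos = sum(1 for y, p in pairs if y == "No" and p == "Yes")
    false_neg = sum(1 for y, p in pairs if y == "Yes" and p == "No")
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical ground-truth labels and model predictions for six transactions.
labels = ["No", "Yes", "No", "Yes", "No", "Yes"]
predictions = ["No", "Yes", "Yes", "No", "No", "Yes"]
precision, recall = precision_recall(labels, predictions)
```

With real data, missed or mislabeled fraud would distort both numbers; with fully labeled synthetic data, the scores reflect only the model's behavior.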
To help ensure further transparency about transactions, IBM Synthetic Data Sets also offers
labels specifying the reason for money movement. Some of these labels include salary
payment, credit card payment, and transfers to a retirement account. They are also 100%
accurate and give more context about transactions that is not often available in real data.
As a result, AI models that are built by using IBM Synthetic Data Sets have an advantage over
models that are built with real data alone because the synthetic training data is complete,
correctly labeled, and covers a wide scope of information.
Data privacy, security, and compliance
Even with masking, real data often enables sophisticated AI tools to re-identify sensitive PII
and the person to whom that data belongs. By using no real individual’s information and only
statistical representations at a population level to generate the data, IBM Synthetic Data Sets
aims to remove all risk for potential data breaches and to ensure that real data stays private
and secure. Because there is no real individual’s information, IBM Synthetic Data Sets are
designed to make it simpler to meet data compliance and regulations about using sensitive
information.
Data generation methodology
Datasets are created by simulating a world that is filled with artificial people, alongside tens of
millions of merchants and companies, and observing the transactional behaviors within this
virtual world. The merchants and companies span many countries across the world, but the
simulated population lives in the US.
However, the simulated US population travels and does business across the world and in all
the currencies of the world. As a result, there is business activity in many locations and in
many forms: credit and debit card transactions, bank accounts and transfers, and
investments. Some of this activity is criminal, with the simulated individuals and merchants
committing payment card fraud, insurance fraud, and money laundering.
With this information, IBM Synthetic Data Sets builds a population whose attributes mimic the
overall US population in terms of income, age, and geographic distribution. To emphasize, the
simulated people that are created by IBM Synthetic Data Sets are not built from anonymized
real individuals. Instead, the simulated people are built by using the previously mentioned
statistical distributions. Although the aggregate behavior of the simulated people matches the
aggregate behavior of real people, data security, privacy, or compliance risks are alleviated
because no simulated individual person is based on any real individual person.
Similar to real people, every simulated person is unique. People living in the same
neighborhood with similar income might have different spending habits: frugal versus
expansive, high expenditures on clothes versus high expenditures on travel, and other habits.
This behavior generally follows statistical patterns. For example, individuals with a higher
income can afford to do more activities and have a greater tendency to spend on luxury items
than someone with a lower income. However, some high-income people might spend
modestly, and others spend lavishly. Low- and middle-income people also vary in their overall
spending and in their specific tastes.
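The idea of building unique simulated people from population-level statistics can be sketched as follows. This is not the IBM generator: the log-normal income parameters, the `luxury_share` attribute, and the noise level are illustrative assumptions that only show how an aggregate trend and individual variation can coexist.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatedPerson:
    person_id: int
    annual_income: float
    luxury_share: float  # fraction of spending that goes to luxury items


def build_population(size: int, seed: int = 0) -> list[SimulatedPerson]:
    """Sample simulated people from assumed statistical distributions."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    population = []
    for person_id in range(size):
        # Assumed log-normal income distribution; the parameters are illustrative only.
        income = rng.lognormvariate(11.0, 0.6)
        # Aggregate trend: higher income raises the expected luxury share.
        expected_share = 0.05 + 0.10 * min(income / 100_000, 2.0)
        # Individual variation: some high earners spend modestly, others lavishly.
        share = expected_share + rng.gauss(0.0, 0.05)
        luxury_share = min(1.0, max(0.0, share))
        population.append(SimulatedPerson(person_id, income, luxury_share))
    return population


population = build_population(1000, seed=42)
```

No record in `population` is derived from a real individual; every value is drawn from the assumed distributions.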
Also, IBM Synthetic Data Sets incorporates patterns and variety in consumer behavior. For
example, real people's weekend consumer behavior likely differs from their weekday
consumer behavior. The simulated people in IBM Synthetic Data Sets mimic this change in
behavior. Simulated people take business trips and vacations at varying frequencies and
spend in ways that suit the destination. Simulated people also spend more on gifts around
certain months or holidays. Most simulated people are paid at regular intervals, such as weekly,
biweekly, or semi-monthly. Rent, mortgage, and other loan payments are typically paid once per
month, with a skew toward the end of the month. IBM Synthetic Data Sets models all these
details and many others with precision, which generates a realistic record of consumer
behavior and spending activity.
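The regular pay intervals described above can be sketched with a small date generator. The schedule names and the choice of the 15th and the last day of the month for semi-monthly pay are assumptions for illustration, not the rules that the product uses.

```python
import calendar
from datetime import date, timedelta


def pay_dates(start: date, end: date, schedule: str) -> list[date]:
    """Return pay dates in [start, end] for 'weekly', 'biweekly', or 'semi-monthly'."""
    dates = []
    if schedule in ("weekly", "biweekly"):
        # The first payday is assumed to fall on the start date.
        step = timedelta(days=7 if schedule == "weekly" else 14)
        current = start
        while current <= end:
            dates.append(current)
            current += step
        return dates
    if schedule == "semi-monthly":
        year, month = start.year, start.month
        while date(year, month, 1) <= end:
            last_day = calendar.monthrange(year, month)[1]
            # Assumed convention: pay on the 15th and on the last day of each month.
            for day in (15, last_day):
                candidate = date(year, month, day)
                if start <= candidate <= end:
                    dates.append(candidate)
            month += 1
            if month > 12:
                year, month = year + 1, 1
        return dates
    raise ValueError(f"Unknown schedule: {schedule!r}")


weekly = pay_dates(date(2024, 1, 1), date(2024, 1, 31), "weekly")
semi_monthly = pay_dates(date(2024, 1, 1), date(2024, 2, 29), "semi-monthly")
```

A simulator could attach such a schedule to each simulated person and emit a salary transaction on every generated date.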
In summary, IBM Synthetic Data Sets simulates realistic people, companies, and activity.
Consumer activity and behavior follow realistic time intervals with purchases that are made on
appropriate days, times, and locations.
IBM Synthetic Data Sets also attaches free text descriptions to each claim. This text content
is generated based on exact knowledge of the underlying claim, which makes it consistent
with the tabular data. For example, the tabular data might note specific items that are
damaged in a flood and the loss amount for those items. The text might provide a brief
description of the claim, such as “Last week my home was damaged in a flood and there is a
great deal of damage to my furniture and carpets. Can you please get me reimbursed quickly
for these items?”
Understanding criminal behavior
Criminal activity is an important part of IBM Synthetic Data Sets. Having data around fraud
and money laundering is imperative when training artificial intelligence (AI) models to
recognize similar activity. This criminal activity includes check fraud, insurance fraud,
payment card fraud, authorized push payment (APP) scams, and money laundering. The
criminal activity extends to a broader set of pursuits that yield illicit income, such as
extortion, smuggling, and illegal gambling. Like other aspects of the simulated world,
IBM Synthetic Data Sets treats each criminal entity as unique, with its own amounts
and types of unlawful activity. Nevertheless, in IBM Synthetic Data Sets
only a few companies and people engage in criminal activity, that is, about 1 in 1000 or fewer.
Furthermore, with its knowledge of ground truth and universal data collection, IBM Synthetic
Data Sets offers a key advantage over real data when training models to recognize criminal
activity. The dataset knows who is engaged in criminal activity, when they do it, and the
financial amounts that are involved. As a result, all illegal activities are identified and labeled
with 100% accuracy in the dataset, which includes all scams, credit card fraud, check fraud,
insurance fraud, and money laundering. With real data, this scale of illegal activity is
challenging to detect. Therefore, AI models that are trained with IBM Synthetic Data Sets
have a clear, accurate understanding of criminal behavior.
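A criminal rate of about 1 in 1000 means that the positive class is rare, so training code often reweights the classes. The following sketch applies a common inverse-frequency heuristic (each class weight is the total count divided by the number of classes times the class count). The label list is a made-up example, and the weighting scheme is an assumption about how a data scientist might proceed, not a step that the product prescribes.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Weight each class inversely to its frequency: total / (classes * class count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Hypothetical labels with one criminal entity per 1000, as described in the text.
labels = ["Yes"] * 1 + ["No"] * 999
weights = balanced_class_weights(labels)
```

With these weights, one rare Yes example counts as much during training as roughly a thousand No examples, which keeps a model from ignoring the minority class.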
Artificial intelligence ethics
In today’s rapidly evolving technological landscape, artificial intelligence (AI) systems are
becoming integral to decision-making processes across many industries. Although AI has
tremendous potential to transform business operations and improve efficiency, its
implementation also raises ethical and security concerns. Trust in AI systems can be
established only through a foundation of ethical principles, secure design, and transparent
practices.
IBM’s approach to Security and Trust by Design integrates ethical safeguards from the outset
by focusing on seven key pillars: Fairness, Robustness, Value Alignment, Data Laws, Intellectual
Property (IP), Transparency, and Privacy.
The following sections delve into each of these areas by exploring IBM’s methods for
mitigating risks and fostering trustworthy AI. For more information about each pillar, see
Foundation models: Opportunities, risks, and mitigations.
Therefore, IBM uses the AI Fairness 360 Toolkit, which is a comprehensive suite of tools to
detect and mitigate biases in IBM Synthetic Data Sets. This toolkit can identify areas where
biases might influence outcomes and implement corrections to help ensure that all users are
treated equitably. In specific applications like fraud detection, factors such as race are
intentionally excluded to prevent unintended discriminatory outcomes. By continuously
validating IBM Synthetic Data Sets through fairness testing, IBM upholds a commitment to
equity and helps ensure that AI systems contribute positively and fairly to society.
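One widely used fairness check is the disparate impact ratio, which compares the rate of favorable outcomes between two groups. The following sketch computes it in plain Python. It does not use the AI Fairness 360 API and does not represent the IBM test procedure; the group names and outcome values are hypothetical.

```python
def disparate_impact(
    outcomes: list[int], groups: list[str], unprivileged: str, privileged: str
) -> float:
    """Ratio of favorable-outcome rates: unprivileged group over privileged group."""

    def favorable_rate(group: str) -> float:
        selected = [o for o, g in zip(outcomes, groups) if g == group]
        if not selected:
            raise ValueError(f"No records for group {group!r}")
        return sum(selected) / len(selected)

    privileged_rate = favorable_rate(privileged)
    if privileged_rate == 0:
        raise ValueError("Privileged group has no favorable outcomes")
    return favorable_rate(unprivileged) / privileged_rate


# Hypothetical outcomes (1 = favorable) for two made-up groups, "a" and "b".
outcomes = [1, 1, 0, 1, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio = disparate_impact(outcomes, groups, unprivileged="b", privileged="a")
```

A ratio close to 1.0 indicates similar treatment of both groups; values far below 1.0 suggest that the unprivileged group receives favorable outcomes less often.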
Robustness
Robustness in AI systems is essential to help ensure that datasets and AI models remain
resilient in the face of adversarial attacks. One significant threat to AI robustness is data
poisoning, where a malicious actor intentionally introduces corrupted or misleading data into
a training or validation set. Such tampered data can distort model behavior, which can
potentially lead the AI to produce outputs that favor the adversary’s objectives. This situation
poses serious risks because poisoned models might produce harmful or inaccurate
decisions, with implications for organizational reputation and operational stability.
IBM addresses robustness concerns through a Security and Privacy by Design (SPbD) threat
assessment process, which actively monitors and verifies IBM Synthetic Data Sets to prevent
tampering throughout the product supply chain from creation to delivery. The SPbD review
process is an official process that development teams must use to receive approval for their
datasets from the IBM Business Information Security Officer. SPbD involves systematic
checks to help ensure the integrity of the IBM Synthetic Data Sets data that is used to train,
enhance, or validate AI models. These proactive measures enable IBM to maintain high
standards of security and resilience, which makes it more challenging for adversaries to
manipulate AI outputs. By prioritizing robust design and adopting stringent security protocols,
IBM reinforces the trustworthiness of its AI solutions.
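On the consumer side, one simple integrity check is to compare a cryptographic digest of a downloaded file with a digest that the supplier publishes. The following sketch shows the idea with SHA-256. It illustrates tamper detection in general and is not a description of the SPbD process; whether a digest is published for these datasets is an assumption.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def is_untampered(data: bytes, expected_digest: str) -> bool:
    """Check the data against an expected digest by using a constant-time comparison."""
    return hmac.compare_digest(sha256_digest(data), expected_digest.lower())


# Hypothetical file content; in practice, read the bytes of the downloaded CSV file.
original = b"transaction_id,amount,is_fraud\n1,12.50,No\n"
published_digest = sha256_digest(original)
tampered = original.replace(b"No", b"Yes")
```

Here `is_untampered(original, published_digest)` succeeds, while the same check on `tampered` fails because a single changed label produces a different digest.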
Value alignment
For AI systems to be effective and ethical, they must align with the values and objectives of
the organizations that deploy them. Achieving value alignment requires careful data curation
during the training and tuning phases because improper data generation, collection, and
annotation can lead to models that deviate from ground truth. If AI training data does not
accurately reflect an organization’s ethical standards, the subsequent outputs might not align
with the desired outcomes, which can lead to unintended ethical or operational consequences.
IBM helps ensure value alignment by following a robust process that vets and governs data
that is used for AI. This process is set by the IBM Office of Privacy and Responsible
Technology. The process oversees data curation and verifies that only approved datasets are
used for training. Also, the process helps secure third-party data and content by helping
ensure that each data set adheres to organizational standards. With these practices, IBM
builds AI systems that are technologically advanced and deeply aligned with organizational
values, which enhances the trustworthiness and social responsibility of its AI solutions.
Data laws
Compliance with data usage laws is a critical aspect of ethical AI implementation. Different
regions have different regulations on the usage of data, with some laws strictly prohibiting the
usage of specific data types for AI applications. Non-compliance with these regulations can
result in financial penalties, legal repercussions, and damage to an organization’s reputation.
As governments worldwide enact stringent data protection laws, AI developers must help
ensure that their systems adhere to all relevant regulations to avoid these consequences.
To navigate the complexities of data compliance, IBM integrates data governance into its AI
development processes. By registering AI use cases through the Integrated Governance
Registration process, IBM ensures that IBM Synthetic Data Sets comply with applicable laws.
Legal consultation is a standard part of this process, which helps IBM to address compliance
proactively. This approach reinforces IBM’s commitment to ethical data collection and usage,
and strengthens the legal and ethical standing of its AI systems.
Intellectual property
IP rights are a significant consideration when developing AI systems because training models
on proprietary datasets might raise copyright, licensing, and compliance issues. Navigating
these IP challenges is essential to help ensure that AI systems are built within legal
boundaries and do not infringe on the rights of data owners. Moreover, each country has its
own regulatory framework, which adds to the complexity of IP compliance for AI development.
IBM approaches IP issues by coordinating closely with legal teams through regular meetings.
These meetings with IBM Z Brand legal experts help clarify the terms and conditions of
IBM Synthetic Data Sets usage, which helps ensure that service descriptions meet all
relevant legal requirements. By maintaining strict compliance with IP laws, IBM minimizes
risks that are related to data misuse, supports ethical AI practices, and fosters innovation
within legally permissible frameworks.
Transparency
Transparency is key to fostering trust in AI systems. Documenting how data is collected,
processed, and used in model training enables stakeholders to understand and evaluate the
ethical considerations that are involved. Lack of transparency can undermine confidence in AI
systems because users might question the source, quality, or handling of data that informs
AI-driven decisions. Clear, accessible explanations about data processes promote
accountability and facilitate a deeper understanding of AI mechanisms.
IBM addresses transparency concerns by publishing detailed papers on synthetic data
generation methods. For more information, see Synthesizing credit card transactions and
Realistic Synthetic Financial Transactions for Anti-Money Laundering Models.
Also, IBM provides a data schema that describes each dataset component: the name of the
attribute in each column, an example of the data, and the options and ranges for that attribute.
This level of transparency demonstrates IBM’s commitment to ethical AI and empowers users to
assess data practices, which enhances trust in IBM’s AI systems.
Privacy
Protecting privacy is a fundamental ethical obligation in AI. With growing concerns about data
re-identification, even datasets that exclude Personally Identifiable Information (PII) pose
privacy risks if patterns can be used to infer individuals’ identities. Privacy breaches
compromise user trust, and can lead to significant legal and reputational damages for
organizations.
To address privacy concerns, IBM Synthetic Data Sets do not contain real PII but instead use
statistical representations of populations. By generating synthetic data that simulates
real-world patterns without identifying individuals, IBM minimizes privacy risks and helps
ensure compliance with privacy regulations. This approach allows IBM to build powerful AI
models without compromising user privacy, which reinforces IBM’s commitment to ethical and
responsible AI.
Conclusion
The IBM Security® and Trust by Design framework is a comprehensive approach to generating
synthetic datasets with ethical practices that support trusted AI development. By focusing on
fairness, robustness, value alignment, compliance with data laws, IP rights, transparency, and
privacy, IBM addresses the complex ethical and security challenges that accompany AI
advancement. These pillars form the foundation of IBM’s commitment to responsible AI,
which helps ensure that AI systems are innovative, fair, secure, and aligned with societal
values. Through these practices, IBM fosters trust in AI, which paves the way for ethical and
secure AI deployment across industries.
Legal usage terms
For the full legal terms for IBM Synthetic Data Sets, which include how to use and redistribute
the datasets, see IBM Terms.
This section describes a few different ways to get started with IBM Synthetic Data Sets:
Artificial intelligence on IBM Z Solution Templates
IBM Technology Expert Labs Services
Starting a proof-of-concept with the AI on IBM Z team
There are three paid service offerings through IBM Technology Expert Labs for using IBM
Synthetic Data Sets for model training and deployment:
AI Exploration and Model Training: Integrate and blend data from IBM Synthetic Data Sets
and real data, including from IBM Z and LinuxONE. Transform the data and use it for
training machine learning and deep learning models.
Implement Machine Learning for z/OS: Install and configure Machine Learning for z/OS for
model deployment on IBM Z.
Model Deployment to IBM Z and LinuxONE: Deploy the model to IBM Z and LinuxONE for
accelerated inferencing with Machine Learning for z/OS or AI Toolkit for IBM Z and
LinuxONE.
Frequently asked questions
Here is a list of frequently asked questions (FAQ) about IBM Synthetic Data Sets:
What are the benefits of IBM Synthetic Data Sets?
For examples about how to leverage IBM Synthetic Data Sets for AI models and large
language models (LLMs), see “Introducing IBM Synthetic Data Sets” on page 1.
How large are the datasets?
Each dataset comes in three editions or sizes: Trial, Pro, and Enterprise. For more
information, see “Available editions” on page 7.
What is included in the datasets?
Information about column titles and data attributes, including examples and options, is
described in “Previewing data schemas” on page 9 and “Appendix: Data schemes for IBM
Synthetic Data Sets” on page 28.
What is the methodology for creating the datasets?
In short, the datasets are created by using the agent-based modeling method. For more
information, see “Data generation methodology” on page 13 and the academic papers that
are referenced in “Artificial intelligence ethics” on page 17.
What environment or platforms can I download the datasets on?
These datasets are downloadable, comma-separated value (CSV) files that are
compatible with the training platform of your choice. The intention is that IBM Synthetic
Data Sets can be used by IBM Z and LinuxONE customers and ISVs to build models on
any platform and deploy those models back to IBM Z and LinuxONE, where the core
enterprise data is for accelerated inferencing.
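Because the datasets are plain CSV files, they can be read with standard tooling on any platform. The following is a minimal sketch that uses only the Python standard library. The column names (`transaction_id`, `amount`, `is_fraud`) and the inline sample are hypothetical placeholders, not the actual schema; take the real column names from the data schema of the dataset that you use.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded file. The column names
# are assumptions for illustration, not the real dataset schema.
SAMPLE_CSV = """transaction_id,amount,is_fraud
t1,12.50,0
t2,980.00,1
t3,43.10,0
"""


def load_rows(handle):
    """Read CSV rows into dictionaries keyed by the header line."""
    return list(csv.DictReader(handle))


def fraud_rate(rows, label_column="is_fraud"):
    """Return the fraction of rows whose label column equals '1'."""
    if not rows:
        return 0.0
    flagged = sum(1 for row in rows if row[label_column] == "1")
    return flagged / len(rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
rate = fraud_rate(rows)
print(f"{len(rows)} rows, fraud rate {rate:.2f}")  # prints "3 rows, fraud rate 0.33"
```

To read a real file instead of the inline sample, pass a file handle that is opened with `open(path, newline="")` to `load_rows`.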
How realistic are the datasets?
IBM Synthetic Data Sets are realistic because they were created with real statistical
population data from various sources, which include the US Census, Federal Reserve,
Bureau of Labor Statistics, and FBI Crimes Insights, among other sources. Also, a large
US national card provider compared the distribution of the datasets against its real
transaction data and found that they matched well.
To get the same quality of synthetic data as IBM Synthetic Data Sets, an organization
would need time and money for a data scientist and a subject matter expert to spend years
finding the right source data and potentially writing extra code to maintain the data logic.
However, clients can promptly begin modeling and LLM training with IBM Synthetic Data
Sets.
IBM Synthetic Data Sets offers only US-based data. How does it help me if I am not in the
US?
IBM Synthetic Data Sets are most directly useful for the US. However, they can provide
significant benefits worldwide:
– The core of many AI models is detecting patterns and deviations from those patterns.
For example, AI models look for deviations from common or typical behavior to detect
fraud and money laundering. Then, the model flags these deviations as potential fraud
or money laundering. This approach is geographically independent. If a model can find
patterns in US-based data, the model is typically capable of doing so anywhere.
– The patterns are geographically independent. For example, it is always unusual to
have multiple purchases in an hour at brick-and-mortar merchants when the merchants
are separated by hundreds of kilometers. It is always unusual for someone who spends
frugally to suddenly spend large amounts on expensive luxury items. Certain patterns
of transfers between bank accounts are common, such as moving money from
checking to savings. Other patterns might be less common, such as suddenly moving
small amounts of money to a large set of other accounts. As a result, although
IBM Synthetic Data Sets is US-based, the logic behind pattern detection and deviation
can be applied universally.
Patterns might be more subtle than these examples. Use broad, well-labeled data to
create and train AI models to detect such subtleties.
– The data generation that is used for IBM Synthetic Data Sets simulates international
companies and business transactions worldwide. The simulated people and
companies travel and conduct transactions in 223 countries around the simulated
world, and use international currencies and banks to facilitate their activities.
Therefore, although the datasets’ transactions are centered in the US, they cover the world.
IBM Synthetic Data Sets has many attributes that are not available in real data.
IBM Synthetic Data Sets has fully accurate labeling for a broad set of categories.
IBM Synthetic Data Sets also provides data for all banks and insurance companies in the
ecosystem, which includes cash transactions that are frequently missing from real data.
Clients can combine IBM Synthetic Data Sets with local data to develop enhanced, robust
capabilities that are beyond what either IBM Synthetic Data Sets or local data can offer
alone. IBM Synthetic Data Sets can also be used to fine-tune models that are created
from local data.
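The brick-and-mortar example in this answer (several purchases within an hour at merchants that are hundreds of kilometers apart) can be expressed as a simple rule. The following is an illustrative sketch only, not the method that IBM or any model trained on the datasets uses. The record fields (`time`, `lat`, `lon`), the one-hour window, and the 200 km threshold are all assumptions that were chosen for the example.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def flag_distant_purchases(purchases, window=timedelta(hours=1), min_km=200.0):
    """Return index pairs of time-adjacent purchases that are made within
    `window` of each other but more than `min_km` apart."""
    ordered = sorted(range(len(purchases)), key=lambda i: purchases[i]["time"])
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        a, b = purchases[prev], purchases[cur]
        close_in_time = b["time"] - a["time"] <= window
        far_apart = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) > min_km
        if close_in_time and far_apart:
            flagged.append((prev, cur))
    return flagged


# Hypothetical in-person purchases for one card (field names are assumptions).
purchases = [
    {"time": datetime(2025, 1, 6, 9, 0), "lat": 40.7128, "lon": -74.0060},   # New York
    {"time": datetime(2025, 1, 6, 9, 20), "lat": 40.7306, "lon": -73.9352},  # nearby
    {"time": datetime(2025, 1, 6, 9, 50), "lat": 41.8781, "lon": -87.6298},  # Chicago
]
flagged = flag_distant_purchases(purchases)
print(flagged)  # prints [(1, 2)]
```

Nothing in the rule depends on the country in which the purchases occur, which is the point of the example: the same deviation is unusual anywhere. A trained model would learn such patterns from labeled data instead of relying on fixed thresholds.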
If I have feedback on how to improve the datasets, how do I provide that feedback?
We appreciate your feedback and aim to include relevant suggestions in future updates to
the datasets. Updates are available with the purchase of a subscription service.
To submit new ideas, see ideas.ibm.com.
Additional resources
For more information about synthetic datasets, see the following resources:
2021 International Conference on AI in Finance (ICAIF): Synthesizing credit card transactions
2024 ICAIF:
– FraudGT: A Simple, Effective, and Efficient Graph Transformer for Financial Fraud
Detection
– Graph Feature Preprocessor: Real-time Subgraph-based Feature Extraction for
Financial Crime Detection
2023 Neural Information Processing Systems (NeurIPS) paper: Realistic Synthetic
Financial Transactions for Anti-Money Laundering Models
2024 Association for the Advancement of Artificial Intelligence (AAAI) paper: Provably
Powerful Graph Neural Networks for Directed Multigraphs
This section describes the data schemas for each of the IBM Synthetic Data Sets:
Payment cards
Core banking
Insurance
Table 2 is the data schema for payment cards users.
Table 3 is the data schema for payment transactions.
Core banking
Here are the data schemas for core banking:
Banks
Liquid accounts people
Liquid accounts companies
Bank transfers
Business-to-business (B2B)
Table 5 is the data schema for liquid accounts people.
Table 6 is the data schema for liquid accounts companies.
Table 7 is the data schema for bank transfers.
Table 8 is the data schema for business-to-business (B2B).
Insurance
Here are the data schemas for insurance:
Insurance Application
Insurance Policy
Insurance Claims
Insurance Freetext
Storms
Quakes
Volcanoes
Column Field Name Sample Value Comment In Kaggle
N Unit Number - To Be No
Insured
P State - To Be Insured WI No
S Months at this 61 No
Address
U Previous Unit No
Number - Applicant 1
V Previous City - No
Applicant 1
W Previous State - No
Applicant 1
Y Previous Country - No
Applicant 1
AB Unit Number - No
Employer of
Applicant 1
AD State - Employer of WI No
Applicant 1
AI Are Self-Employed - No No
Applicant 1?
AJ Years on Job - 3 No
Applicant 1
AK Years in this 18 No
Profession -
Applicant 1
AZ Unit Number - No
Employer of
Applicant 2
BB State - Employer of WI No
Applicant 2
BG Are Self-Employed - No No
Applicant 2?
BH Years on Job - 1 No
Applicant 2
BI Years in this 27 No
Profession -
Applicant 2
BK Any foreclosures; No No
repossessions; or
bankruptcies in the
last 5 years?
BL Any insurance No No
declined; canceled;
or non-renewed in
the last 3 years?
BO Number of Units 1 No
BS Distance to Fire 1 No
Station (Miles)
BX Basement Area 0 No
(Square Feet)
BZ Garage Capacity 2 No
(Number of Cars)
CA Basement Finished 0 No
(Percentage)
CB Number of Stories 2 No
CD Number of Bedrooms 4 No
CI Fireplace Count 1 No
CZ Protection Class 1 No
(Numeric Code)
DA Is Manufactured No No
Home?
DB Is Historic? No No
DE Is Garage Heated? No No
DG Has Carport? No No
DH Has Screen No No
Enclosure?
DI Has Walkout No No
Basement?
DK Has T-Lock No No
Shingles?
DL Has Asbestos No No
Shingles?
DM Is Under No No
Construction?
DN Is Bolted To No No
Foundation?
DO Has Visible No No
Damage?
DQ Has Sprinklers? No No
DW Has Video No No
Surveillance?
DX Has Video No No
Monitoring?
EA Is Teardown? No No
EB Is Gutted and No No
Remodeled?
ED Is Visible to Yes No
Neighbors?
EF Has Flood No No
Insurance?
EH Has Fuses? No No
EL Has Polybutylene No No
Pipes?
EN Has Asbestos? No No
ER Converted to Private No No
Home from other
Use?
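The schema tables identify each field by a spreadsheet-style column letter (for example, N, AB, or ER). When a CSV file is read programmatically, it can help to translate those letters into positional indexes. The following sketch assumes that the letters follow the usual spreadsheet convention (A is the first column, Z the 26th, AA the 27th) and that the CSV columns appear in that order; both points are assumptions, and the helper names are hypothetical.

```python
def column_letter_to_index(letters):
    """Convert a spreadsheet column label (A, Z, AA, ...) to a zero-based index."""
    if not letters or not letters.isascii() or not letters.isalpha():
        raise ValueError(f"not a column label: {letters!r}")
    index = 0
    for ch in letters.upper():
        # Treat the label as a base-26 number whose digits run from 1 (A) to 26 (Z).
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def index_to_column_letter(index):
    """Convert a zero-based index back to a spreadsheet column label."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = ""
    n = index + 1
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


for label in ("A", "N", "AB", "ER"):
    print(label, column_letter_to_index(label))
# prints:
# A 0
# N 13
# AB 27
# ER 147
```

With a row that is read as a list (for example, by `csv.reader`), `row[column_letter_to_index("AB")]` would then select the field that a schema table lists under column AB, provided that the ordering assumption holds for the file.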
Column Field Name Sample Value Comment In Kaggle
M Vandalism: 200 No
Deductible
Q Explosion: 200 No
Deductible
X Flood: Coverage 0 No
Limit
Y Flood: Deductible 0 No
BB Sinkhole: Coverage 0 No
Limit
BC Sinkhole: Deductible 0 No
BD Earthquake: 0 No
Coverage Limit
BE Earthquake: 0 No
Deductible
BH Mandatory 0 No
Evacuation:
Coverage Limit
BI Mandatory 0 No
Evacuation:
Deductible
BJ Ordinance Change: 0 No
Coverage Limit
BK Ordinance Change: 0 No
Deductible
BL Building Codes: 0 No
Coverage Limit
BM Building Codes: 0 No
Deductible
BN Eco Upgrade: 0 No
Coverage Limit
BO Eco Upgrade: 0 No
Deductible
BS Mold: Deductible 0 No
BT Termites: Coverage 0 No
Limit
BU Termites: Deductible 0 No
BZ Dwelling: No No
Replacement Cost?
CC Extended Premises: No No
Replacement Cost?
CD Extended Premises: 0 No
Coverage Limit
CE Extended Premises: 0 No
Deductible
CF Other Structures: No No
Replacement Cost?
CI Roof Surfaces: No No
Replacement Cost?
CJ Roof Surfaces: 0 No
Coverage Limit
CK Roof Surfaces: 0 No
Deductible
CO Data Recovery: No No
Replacement Cost?
CP Data Recovery: 0 No
Coverage Limit
CQ Data Recovery: 0 No
Deductible
CR Credit Cards: No No
Replacement Cost?
CS Credit Cards: 0 No
Coverage Limit
CT Credit Cards: 0 No
Deductible
CU Financial Assets: No No
Replacement Cost?
CV Financial Assets: 0 No
Coverage Limit
CW Financial Assets: 0 No
Deductible
DA Business Property: No No
Replacement Cost?
DB Business Property: 0 No
Coverage Limit
DC Business Property: 0 No
Deductible
DD Home Daycare: No No
Replacement Cost?
DE Home Daycare: 0 No
Coverage Limit
DF Home Daycare: 0 No
Deductible
DG Medical Payments: No No
Replacement Cost?
DH Medical Payments: 0 No
Coverage Limit
DI Medical Payments: 0 No
Deductible
DJ Liability - Bodily No No
Injury: Replacement
Cost?
DM Liability - Property No No
Damage:
Replacement Cost?
DP Loss Assessment: No No
Replacement Cost?
DQ Loss Assessment: 0 No
Coverage Limit
DR Loss Assessment: 0 No
Deductible
DS Fire Department No No
Charges:
Replacement Cost?
DT Fire Department 0 No
Charges: Coverage
Limit
DU Fire Department 0 No
Charges: Deductible
DV Living Expenses: No No
Replacement Cost?
DY Furniture: No No
Replacement Cost?
EB Appliances: No No
Replacement Cost?
EC Appliances: 8000 No
Coverage Limit
ED Appliances: 250 No
Deductible
EE Electronics: No No
Replacement Cost?
EF Electronics: 165000 No
Coverage Limit
EG Electronics: 250 No
Deductible
EK Apparel: No No
Replacement Cost?
EN Jewelry: No No
Replacement Cost?
EO Jewelry: Coverage 0 No
Limit
EP Jewelry: Deductible 0 No
EQ Silverware: No No
Replacement Cost?
ER Silverware: Coverage 0 No
Limit
ES Silverware: 0 No
Deductible
ET Tools: Replacement No No
Cost?
EW Construction No No
Material:
Replacement Cost?
EX Construction 0 No
Material: Coverage
Limit
EY Construction 0 No
Material: Deductible
FC Sporting Goods: No No
Replacement Cost?
FF Golf Cart: No No
Replacement Cost?
FI Cameras: No No
Replacement Cost?
FJ Cameras: Coverage 0 No
Limit
FK Cameras: Deductible 0 No
FL Watches: No No
Replacement Cost?
FM Watches: Coverage 0 No
Limit
FN Watches: Deductible 0 No
FO Furs: Replacement No No
Cost?
FQ Furs: Deductible 0 No
FR Medical Instruments: No No
Replacement Cost?
FS Medical Instruments: 0 No
Coverage Limit
FT Medical Instruments: 0 No
Deductible
FU Musical Instruments: No No
Replacement Cost?
FV Musical Instruments: 0 No
Coverage Limit
FW Musical Instruments: 0 No
Deductible
FX Other Personal No No
Property:
Replacement Cost?
GA Special Deductibles: 0 No
Wind - Percentage
GB Special Deductibles: 0 No
Wind - Dollar
GC Special Deductibles: 0 No
Named Storm
GD Special Deductibles: 0 No
Hurricane
GE Special Deductibles: 0 No
Theft
GF Special Deductibles: 0 No
Water
GG Special Deductibles: 0 No
All Other Perils
C Home ID 6A9B05DF0 No
Column Field Name Sample Value Comment 1 In Kaggle
M Is Claim Cause 1 No
Covered
N Is Fraud on Claim 0 No
O Is Detected Fraud on 0 No
Claim
R Item 1 - Fraud: 0 No
Overstated Value
S Item 1 - Fraud: 0 No
Intentional Damage
V Item 1 - Fraud: 0 No
Inflated Repair Bills
W Item 1 - Fraud: 0 No
Non-Covered Use
X Item 1 - Fraud: 0 No
Non-Covered
Damage
Y Item 1 - Disallowed: 0 No
Fraud
Z Item 1 - Disallowed: 0 No
Under Deductible
AA Item 1 - Disallowed: 0 No
Not Covered
AB Item 1 - Non-Full: 1 No
Over Limit
AC Item 1 - Non-Full: 0 No
Depreciation
AD Item 1 - Non-Full: 0 No
Over Market Price
AE Item 2 - Extended 0 No
Premises: $Loss
Claimed
AF Item 2: $Loss 0 No
Allowed
AG Item 2 - Fraud: 0 No
Overstated Value
AH Item 2 - Fraud: 0 No
Intentional Damage
AK Item 2 - Fraud: 0 No
Inflated Repair Bills
AL Item 2 - Fraud: 0 No
Non-Covered Use
AM Item 2 - Fraud: 0 No
Non-Covered
Damage
AN Item 2 - Disallowed: 0 No
Fraud
AO Item 2 - Disallowed: 0 No
Under Deductible
AP Item 2 - Disallowed: 0 No
Not Covered
AQ Item 2 - Non-Full: 0 No
Over Limit
AR Item 2 - Non-Full: 0 No
Depreciation
AS Item 2 - Non-Full: 0 No
Over Market Price
AT Item 3 - Other 0 No
Structures: $Loss
Claimed
AU Item 3: $Loss 0 No
Allowed
AV Item 3 - Fraud: 0 No
Overstated Value
AW Item 3 - Fraud: 0 No
Intentional Damage
AZ Item 3 - Fraud: 0 No
Inflated Repair Bills
BA Item 3 - Fraud: 0 No
Non-Covered Use
BB Item 3 - Fraud: 0 No
Non-Covered
Damage
BC Item 3 - Disallowed: 0 No
Fraud
BD Item 3 - Disallowed: 0 No
Under Deductible
BE Item 3 - Disallowed: 0 No
Not Covered
BF Item 3 - Non-Full: 0 No
Over Limit
BG Item 3 - Non-Full: 0 No
Depreciation
BH Item 3 - Non-Full: 0 No
Over Market Price
BJ Item 4: $Loss 0 No
Allowed
BK Item 4 - Fraud: 0 No
Overstated Value
BL Item 4 - Fraud: 0 No
Intentional Damage
BO Item 4 - Fraud: 0 No
Inflated Repair Bills
BP Item 4 - Fraud: 0 No
Non-Covered Use
BQ Item 4 - Fraud: 0 No
Non-Covered
Damage
BR Item 4 - Disallowed: 0 No
Fraud
BS Item 4 - Disallowed: 0 No
Under Deductible
BT Item 4 - Disallowed: 1 No
Not Covered
BU Item 4 - Non-Full: 0 No
Over Limit
BV Item 4 - Non-Full: 0 No
Depreciation
BW Item 4 - Non-Full: 0 No
Over Market Price
BY Item 5: $Loss 0 No
Allowed
BZ Item 5 - Fraud: 0 No
Overstated Value
CA Item 5 - Fraud: 0 No
Intentional Damage
CD Item 5 - Fraud: 0 No
Inflated Repair Bills
CE Item 5 - Fraud: 0 No
Non-Covered Use
CF Item 5 - Fraud: 0 No
Non-Covered
Damage
CG Item 5 - Disallowed: 0 No
Fraud
CH Item 5 - Disallowed: 0 No
Under Deductible
CI Item 5 - Disallowed: 0 No
Not Covered
CJ Item 5 - Non-Full: 0 No
Over Limit
CK Item 5 - Non-Full: 0 No
Depreciation
CL Item 5 - Non-Full: 0 No
Over Market Price
CM Item 6 - Data 0 No
Recovery: $Loss
Claimed
CN Item 6: $Loss 0 No
Allowed
CO Item 6 - Fraud: 0 No
Overstated Value
CP Item 6 - Fraud: 0 No
Intentional Damage
CS Item 6 - Fraud: 0 No
Inflated Repair Bills
CT Item 6 - Fraud: 0 No
Non-Covered Use
CU Item 6 - Fraud: 0 No
Non-Covered
Damage
CV Item 6 - Disallowed: 0 No
Fraud
CW Item 6 - Disallowed: 0 No
Under Deductible
CX Item 6 - Disallowed: 0 No
Not Covered
CY Item 6 - Non-Full: 0 No
Over Limit
CZ Item 6 - Non-Full: 0 No
Depreciation
DA Item 6 - Non-Full: 0 No
Over Market Price
DC Item 7: $Loss 0 No
Allowed
DD Item 7 - Fraud: 0 No
Overstated Value
DE Item 7 - Fraud: 0 No
Intentional Damage
DH Item 7 - Fraud: 0 No
Inflated Repair Bills
DI Item 7 - Fraud: 0 No
Non-Covered Use
DJ Item 7 - Fraud: 0 No
Non-Covered
Damage
DK Item 7 - Disallowed: 0 No
Fraud
DL Item 7 - Disallowed: 0 No
Under Deductible
DM Item 7 - Disallowed: 0 No
Not Covered
DN Item 7 - Non-Full: 0 No
Over Limit
DO Item 7 - Non-Full: 0 No
Depreciation
DP Item 7 - Non-Full: 0 No
Over Market Price
DQ Item 8 - Financial 0 No
Assets: $Loss
Claimed
DR Item 8: $Loss 0 No
Allowed
DS Item 8 - Fraud: 0 No
Overstated Value
DT Item 8 - Fraud: 0 No
Intentional Damage
DW Item 8 - Fraud: 0 No
Inflated Repair Bills
DX Item 8 - Fraud: 0 No
Non-Covered Use
DY Item 8 - Fraud: 0 No
Non-Covered
Damage
DZ Item 8 - Disallowed: 0 No
Fraud
EA Item 8 - Disallowed: 0 No
Under Deductible
EB Item 8 - Disallowed: 0 No
Not Covered
EC Item 8 - Non-Full: 0 No
Over Limit
ED Item 8 - Non-Full: 0 No
Depreciation
EE Item 8 - Non-Full: 0 No
Over Market Price
EF Item 9 - Rental 0 No
Income Loss: $Loss
Claimed
EG Item 9: $Loss 0 No
Allowed
EH Item 9 - Fraud: 0 No
Overstated Value
EI Item 9 - Fraud: 0 No
Intentional Damage
EL Item 9 - Fraud: 0 No
Inflated Repair Bills
EM Item 9 - Fraud: 0 No
Non-Covered Use
EN Item 9 - Fraud: 0 No
Non-Covered
Damage
EO Item 9 - Disallowed: 0 No
Fraud
EP Item 9 - Disallowed: 0 No
Under Deductible
EQ Item 9 - Disallowed: 0 No
Not Covered
ER Item 9 - Non-Full: 0 No
Over Limit
ES Item 9 - Non-Full: 0 No
Depreciation
ET Item 9 - Non-Full: 0 No
Over Market Price
EU Item 10 - Business 0 No
Property: $Loss
Claimed
EW Item 10 - Fraud: 0 No
Overstated Value
EX Item 10 - Fraud: 0 No
Intentional Damage
FA Item 10 - Fraud: 0 No
Inflated Repair Bills
FB Item 10 - Fraud: 0 No
Non-Covered Use
FC Item 10 - Fraud: 0 No
Non-Covered
Damage
FD Item 10 - Disallowed: 0 No
Fraud
FE Item 10 - Disallowed: 0 No
Under Deductible
FF Item 10 - Disallowed: 0 No
Not Covered
FG Item 10 - Non-Full: 0 No
Over Limit
FH Item 10 - Non-Full: 0 No
Depreciation
FI Item 10 - Non-Full: 0 No
Over Market Price
FJ Item 11 - Home 0 No
Daycare: $Loss
Claimed
FL Item 11 - Fraud: 0 No
Overstated Value
FM Item 11 - Fraud: 0 No
Intentional Damage
FP Item 11 - Fraud: 0 No
Inflated Repair Bills
FQ Item 11 - Fraud: 0 No
Non-Covered Use
FR Item 11 - Fraud: 0 No
Non-Covered
Damage
FS Item 11 - Disallowed: 0 No
Fraud
FT Item 11 - Disallowed: 0 No
Under Deductible
FU Item 11 - Disallowed: 0 No
Not Covered
FV Item 11 - Non-Full: 0 No
Over Limit
FW Item 11 - Non-Full: 0 No
Depreciation
FX Item 11 - Non-Full: 0 No
Over Market Price
FY Item 12 - Medical 0 No
Payments: $Loss
Claimed
GA Item 12 - Fraud: 0 No
Overstated Value
GB Item 12 - Fraud: 0 No
Intentional Damage
GE Item 12 - Fraud: 0 No
Inflated Repair Bills
GF Item 12 - Fraud: 0 No
Non-Covered Use
GG Item 12 - Fraud: 0 No
Non-Covered
Damage
GH Item 12 - Disallowed: 0 No
Fraud
GI Item 12 - Disallowed: 0 No
Under Deductible
GJ Item 12 - Disallowed: 0 No
Not Covered
GK Item 12 - Non-Full: 0 No
Over Limit
GL Item 12 - Non-Full: 0 No
Depreciation
GM Item 12 - Non-Full: 0 No
Over Market Price
GN Item 13 - Liability - 0 No
Bodily Injury: $Loss
Claimed
GP Item 13 - Fraud: 0 No
Overstated Value
GQ Item 13 - Fraud: 0 No
Intentional Damage
GT Item 13 - Fraud: 0 No
Inflated Repair Bills
GU Item 13 - Fraud: 0 No
Non-Covered Use
GV Item 13 - Fraud: 0 No
Non-Covered
Damage
GW Item 13 - Disallowed: 0 No
Fraud
GX Item 13 - Disallowed: 0 No
Under Deductible
GY Item 13 - Disallowed: 0 No
Not Covered
GZ Item 13 - Non-Full: 0 No
Over Limit
HA Item 13 - Non-Full: 0 No
Depreciation
HB Item 13 - Non-Full: 0 No
Over Market Price
HC Item 14 - Liability - 0 No
Property Damage:
$Loss Claimed
HE Item 14 - Fraud: 0 No
Overstated Value
HF Item 14 - Fraud: 0 No
Intentional Damage
HI Item 14 - Fraud: 0 No
Inflated Repair Bills
HJ Item 14 - Fraud: 0 No
Non-Covered Use
HK Item 14 - Fraud: 0 No
Non-Covered
Damage
HL Item 14 - Disallowed: 0 No
Fraud
HM Item 14 - Disallowed: 0 No
Under Deductible
HN Item 14 - Disallowed: 0 No
Not Covered
HO Item 14 - Non-Full: 0 No
Over Limit
HP Item 14 - Non-Full: 0 No
Depreciation
HQ Item 14 - Non-Full: 0 No
Over Market Price
HR Item 15 - Loss 0 No
Assessment: $Loss
Claimed
HT Item 15 - Fraud: 0 No
Overstated Value
HU Item 15 - Fraud: 0 No
Intentional Damage
HX Item 15 - Fraud: 0 No
Inflated Repair Bills
HY Item 15 - Fraud: 0 No
Non-Covered Use
HZ Item 15 - Fraud: 0 No
Non-Covered
Damage
IA Item 15 - Disallowed: 0 No
Fraud
IB Item 15 - Disallowed: 0 No
Under Deductible
IC Item 15 - Disallowed: 0 No
Not Covered
ID Item 15 - Non-Full: 0 No
Over Limit
IE Item 15 - Non-Full: 0 No
Depreciation
IF Item 15 - Non-Full: 0 No
Over Market Price
IG Item 16 - Fire 0 No
Department
Charges: $Loss
Claimed
II Item 16 - Fraud: 0 No
Overstated Value
IJ Item 16 - Fraud: 0 No
Intentional Damage
IM Item 16 - Fraud: 0 No
Inflated Repair Bills
IN Item 16 - Fraud: 0 No
Non-Covered Use
IO Item 16 - Fraud: 0 No
Non-Covered
Damage
IP Item 16 - Disallowed: 0 No
Fraud
IQ Item 16 - Disallowed: 0 No
Under Deductible
IR Item 16 - Disallowed: 0 No
Not Covered
IS Item 16 - Non-Full: 0 No
Over Limit
IT Item 16 - Non-Full: 0 No
Depreciation
IU Item 16 - Non-Full: 0 No
Over Market Price
IV Item 17 - Living 0 No
Expenses: $Loss
Claimed
IX Item 17 - Fraud: 0 No
Overstated Value
IY Item 17 - Fraud: 0 No
Intentional Damage
JB Item 17 - Fraud: 0 No
Inflated Repair Bills
JC Item 17 - Fraud: 0 No
Non-Covered Use
JD Item 17 - Fraud: 0 No
Non-Covered
Damage
JE Item 17 - Disallowed: 0 No
Fraud
JF Item 17 - Disallowed: 0 No
Under Deductible
JG Item 17 - Disallowed: 0 No
Not Covered
JH Item 17 - Non-Full: 0 No
Over Limit
JI Item 17 - Non-Full: 0 No
Depreciation
JJ Item 17 - Non-Full: 0 No
Over Market Price
JK Item 18 - Furniture: 0 No
$Loss Claimed
JM Item 18 - Fraud: 0 No
Overstated Value
JN Item 18 - Fraud: 0 No
Intentional Damage
JQ Item 18 - Fraud: 0 No
Inflated Repair Bills
JR Item 18 - Fraud: 0 No
Non-Covered Use
JS Item 18 - Fraud: 0 No
Non-Covered
Damage
JT Item 18 - Disallowed: 0 No
Fraud
JU Item 18 - Disallowed: 0 No
Under Deductible
JV Item 18 - Disallowed: 0 No
Not Covered
JW Item 18 - Non-Full: 0 No
Over Limit
JX Item 18 - Non-Full: 0 No
Depreciation
JY Item 18 - Non-Full: 0 No
Over Market Price
JZ Item 19 - Appliances: 0 No
$Loss Claimed
KB Item 19 - Fraud: 0 No
Overstated Value
KC Item 19 - Fraud: 0 No
Intentional Damage
KF Item 19 - Fraud: 0 No
Inflated Repair Bills
KG Item 19 - Fraud: 0 No
Non-Covered Use
KH Item 19 - Fraud: 0 No
Non-Covered
Damage
KI Item 19 - Disallowed: 0 No
Fraud
KJ Item 19 - Disallowed: 0 No
Under Deductible
KK Item 19 - Disallowed: 0 No
Not Covered
KL Item 19 - Non-Full: 0 No
Over Limit
KM Item 19 - Non-Full: 0 No
Depreciation
KN Item 19 - Non-Full: 0 No
Over Market Price
KO Item 20 - Electronics: 0 No
$Loss Claimed
KQ Item 20 - Fraud: 0 No
Overstated Value
KR Item 20 - Fraud: 0 No
Intentional Damage
KU Item 20 - Fraud: 0 No
Inflated Repair Bills
KV Item 20 - Fraud: 0 No
Non-Covered Use
KW Item 20 - Fraud: 0 No
Non-Covered
Damage
KX Item 20 - Disallowed: 0 No
Fraud
KY Item 20 - Disallowed: 0 No
Under Deductible
KZ Item 20 - Disallowed: 0 No
Not Covered
LA Item 20 - Non-Full: 0 No
Over Limit
LB Item 20 - Non-Full: 0 No
Depreciation
LC Item 20 - Non-Full: 0 No
Over Market Price
LF Item 21 - Fraud: 0 No
Overstated Value
LG Item 21 - Fraud: 0 No
Intentional Damage
LJ Item 21 - Fraud: 0 No
Inflated Repair Bills
LK Item 21 - Fraud: 0 No
Non-Covered Use
LL Item 21 - Fraud: 0 No
Non-Covered
Damage
LM Item 21 - Disallowed: 0 No
Fraud
LN Item 21 - Disallowed: 0 No
Under Deductible
LO Item 21 - Disallowed: 0 No
Not Covered
LP Item 21 - Non-Full: 0 No
Over Limit
LQ Item 21 - Non-Full: 0 No
Depreciation
LR Item 21 - Non-Full: 0 No
Over Market Price
LS Item 22 - Apparel: 0 No
$Loss Claimed
LU Item 22 - Fraud: 0 No
Overstated Value
LV Item 22 - Fraud: 0 No
Intentional Damage
LY Item 22 - Fraud: 0 No
Inflated Repair Bills
LZ Item 22 - Fraud: 0 No
Non-Covered Use
MA Item 22 - Fraud: 0 No
Non-Covered
Damage
MB Item 22 - Disallowed: 0 No
Fraud
MC Item 22 - Disallowed: 0 No
Under Deductible
MD Item 22 - Disallowed: 0 No
Not Covered
ME Item 22 - Non-Full: 0 No
Over Limit
MF Item 22 - Non-Full: 0 No
Depreciation
MG Item 22 - Non-Full: 0 No
Over Market Price
MH Item 23 - Jewelry: 0 No
$Loss Claimed
MJ Item 23 - Fraud: 0 No
Overstated Value
MK Item 23 - Fraud: 0 No
Intentional Damage
MN Item 23 - Fraud: 0 No
Inflated Repair Bills
MO Item 23 - Fraud: 0 No
Non-Covered Use
MP Item 23 - Fraud: 0 No
Non-Covered
Damage
MQ Item 23 - Disallowed: 0 No
Fraud
MR Item 23 - Disallowed: 0 No
Under Deductible
MS Item 23 - Disallowed: 0 No
Not Covered
MT Item 23 - Non-Full: 0 No
Over Limit
MU Item 23 - Non-Full: 0 No
Depreciation
MV Item 23 - Non-Full: 0 No
Over Market Price
MW Item 24 - Silverware: 0 No
$Loss Claimed
MY Item 24 - Fraud: 0 No
Overstated Value
MZ Item 24 - Fraud: 0 No
Intentional Damage
NC Item 24 - Fraud: 0 No
Inflated Repair Bills
ND Item 24 - Fraud: 0 No
Non-Covered Use
NE Item 24 - Fraud: 0 No
Non-Covered
Damage
NF Item 24 - Disallowed: 0 No
Fraud
NG Item 24 - Disallowed: 0 No
Under Deductible
NH Item 24 - Disallowed: 0 No
Not Covered
NI Item 24 - Non-Full: 0 No
Over Limit
NJ Item 24 - Non-Full: 0 No
Depreciation
NK Item 24 - Non-Full: 0 No
Over Market Price
NL Item 25 - Tools: 0 No
$Loss Claimed
NN Item 25 - Fraud: 0 No
Overstated Value
NO Item 25 - Fraud: 0 No
Intentional Damage
NR Item 25 - Fraud: 0 No
Inflated Repair Bills
NS Item 25 - Fraud: 0 No
Non-Covered Use
NT Item 25 - Fraud: Non-Covered Damage 0 No
NU Item 25 - Disallowed: Fraud 0 No
NV Item 25 - Disallowed: Under Deductible 0 No
NW Item 25 - Disallowed: Not Covered 0 No
NX Item 25 - Non-Full: Over Limit 0 No
NY Item 25 - Non-Full: Depreciation 0 No
NZ Item 25 - Non-Full: Over Market Price 0 No
OA Item 26 - Construction Material: $Loss Claimed 0 No
OC Item 26 - Fraud: Overstated Value 0 No
OD Item 26 - Fraud: Intentional Damage 0 No
OG Item 26 - Fraud: Inflated Repair Bills 0 No
OH Item 26 - Fraud: Non-Covered Use 0 No
OI Item 26 - Fraud: Non-Covered Damage 0 No
OJ Item 26 - Disallowed: Fraud 0 No
OK Item 26 - Disallowed: Under Deductible 0 No
OL Item 26 - Disallowed: Not Covered 0 No
OM Item 26 - Non-Full: Over Limit 0 No
ON Item 26 - Non-Full: Depreciation 0 No
OO Item 26 - Non-Full: Over Market Price 0 No
OR Item 27 - Fraud: Overstated Value 0 No
OS Item 27 - Fraud: Intentional Damage 0 No
OV Item 27 - Fraud: Inflated Repair Bills 0 No
OW Item 27 - Fraud: Non-Covered Use 0 No
OX Item 27 - Fraud: Non-Covered Damage 0 No
OY Item 27 - Disallowed: Fraud 0 No
OZ Item 27 - Disallowed: Under Deductible 0 No
PA Item 27 - Disallowed: Not Covered 0 No
PB Item 27 - Non-Full: Over Limit 0 No
PC Item 27 - Non-Full: Depreciation 0 No
PD Item 27 - Non-Full: Over Market Price 0 No
PE Item 28 - Sporting Goods: $Loss Claimed 0 No
PG Item 28 - Fraud: Overstated Value 0 No
PH Item 28 - Fraud: Intentional Damage 0 No
PK Item 28 - Fraud: Inflated Repair Bills 0 No
PL Item 28 - Fraud: Non-Covered Use 0 No
PM Item 28 - Fraud: Non-Covered Damage 0 No
PN Item 28 - Disallowed: Fraud 0 No
PO Item 28 - Disallowed: Under Deductible 0 No
PP Item 28 - Disallowed: Not Covered 0 No
PQ Item 28 - Non-Full: Over Limit 0 No
PR Item 28 - Non-Full: Depreciation 0 No
PS Item 28 - Non-Full: Over Market Price 0 No
PV Item 29 - Fraud: Overstated Value 0 No
PW Item 29 - Fraud: Intentional Damage 0 No
PZ Item 29 - Fraud: Inflated Repair Bills 0 No
QA Item 29 - Fraud: Non-Covered Use 0 No
QB Item 29 - Fraud: Non-Covered Damage 0 No
QC Item 29 - Disallowed: Fraud 0 No
QD Item 29 - Disallowed: Under Deductible 0 No
QE Item 29 - Disallowed: Not Covered 0 No
QF Item 29 - Non-Full: Over Limit 0 No
QG Item 29 - Non-Full: Depreciation 0 No
QH Item 29 - Non-Full: Over Market Price 0 No
QI Item 30 - Cameras: $Loss Claimed 0 No
QK Item 30 - Fraud: Overstated Value 0 No
QL Item 30 - Fraud: Intentional Damage 0 No
QO Item 30 - Fraud: Inflated Repair Bills 0 No
QP Item 30 - Fraud: Non-Covered Use 0 No
QQ Item 30 - Fraud: Non-Covered Damage 0 No
QR Item 30 - Disallowed: Fraud 0 No
QS Item 30 - Disallowed: Under Deductible 0 No
QT Item 30 - Disallowed: Not Covered 0 No
QU Item 30 - Non-Full: Over Limit 0 No
QV Item 30 - Non-Full: Depreciation 0 No
QW Item 30 - Non-Full: Over Market Price 0 No
QX Item 31 - Watches: $Loss Claimed 0 No
QZ Item 31 - Fraud: Overstated Value 0 No
RA Item 31 - Fraud: Intentional Damage 0 No
RD Item 31 - Fraud: Inflated Repair Bills 0 No
RE Item 31 - Fraud: Non-Covered Use 0 No
RF Item 31 - Fraud: Non-Covered Damage 0 No
RG Item 31 - Disallowed: Fraud 0 No
RH Item 31 - Disallowed: Under Deductible 0 No
RI Item 31 - Disallowed: Not Covered 0 No
RJ Item 31 - Non-Full: Over Limit 0 No
RK Item 31 - Non-Full: Depreciation 0 No
RL Item 31 - Non-Full: Over Market Price 0 No
RO Item 32 - Fraud: Overstated Value 0 No
RP Item 32 - Fraud: Intentional Damage 0 No
RS Item 32 - Fraud: Inflated Repair Bills 0 No
RT Item 32 - Fraud: Non-Covered Use 0 No
RU Item 32 - Fraud: Non-Covered Damage 0 No
RV Item 32 - Disallowed: Fraud 0 No
RW Item 32 - Disallowed: Under Deductible 0 No
RX Item 32 - Disallowed: Not Covered 0 No
RY Item 32 - Non-Full: Over Limit 0 No
RZ Item 32 - Non-Full: Depreciation 0 No
SA Item 32 - Non-Full: Over Market Price 0 No
SB Item 33 - Medical Instruments: $Loss Claimed 0 No
SD Item 33 - Fraud: Overstated Value 0 No
SE Item 33 - Fraud: Intentional Damage 0 No
SH Item 33 - Fraud: Inflated Repair Bills 0 No
SI Item 33 - Fraud: Non-Covered Use 0 No
SJ Item 33 - Fraud: Non-Covered Damage 0 No
SK Item 33 - Disallowed: Fraud 0 No
SL Item 33 - Disallowed: Under Deductible 0 No
SM Item 33 - Disallowed: Not Covered 0 No
SN Item 33 - Non-Full: Over Limit 0 No
SO Item 33 - Non-Full: Depreciation 0 No
SP Item 33 - Non-Full: Over Market Price 0 No
SQ Item 34 - Musical Instruments: $Loss Claimed 0 No
SS Item 34 - Fraud: Overstated Value 0 No
ST Item 34 - Fraud: Intentional Damage 0 No
SW Item 34 - Fraud: Inflated Repair Bills 0 No
SX Item 34 - Fraud: Non-Covered Use 0 No
SY Item 34 - Fraud: Non-Covered Damage 0 No
SZ Item 34 - Disallowed: Fraud 0 No
TA Item 34 - Disallowed: Under Deductible 0 No
TB Item 34 - Disallowed: Not Covered 0 No
TC Item 34 - Non-Full: Over Limit 0 No
TD Item 34 - Non-Full: Depreciation 0 No
TE Item 34 - Non-Full: Over Market Price 0 No
TF Item 35 - Other Personal Property: $Loss Claimed 0 No
TH Item 35 - Fraud: Overstated Value 0 No
TI Item 35 - Fraud: Intentional Damage 0 No
TL Item 35 - Fraud: Inflated Repair Bills 0 No
TM Item 35 - Fraud: Non-Covered Use 0 No
TN Item 35 - Fraud: Non-Covered Damage 0 No
TO Item 35 - Disallowed: Fraud 0 No
TP Item 35 - Disallowed: Under Deductible 0 No
TQ Item 35 - Disallowed: Not Covered 0 No
TR Item 35 - Non-Full: Over Limit 0 No
TS Item 35 - Non-Full: Depreciation 0 No
TT Item 35 - Non-Full: Over Market Price 0 No
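The per-item fields in the table above follow a regular naming pattern: each item has one "$Loss Claimed" field plus eleven flag fields grouped under Fraud, Disallowed, and Non-Full. The following Python sketch shows one way to generate those field names rather than typing them by hand. The function names `item_flag_fields` and `loss_claimed_field` are hypothetical helpers introduced for illustration, and the sketch assumes that the column headers in the data file match the Field Name values in the table exactly, which should be verified against the actual file.

```python
from typing import List

# Reason groups and reasons, copied from the Field Name column of the table.
# Assumption: the data file uses these exact strings as column headers.
REASONS = {
    "Fraud": [
        "Overstated Value",
        "Intentional Damage",
        "Inflated Repair Bills",
        "Non-Covered Use",
        "Non-Covered Damage",
    ],
    "Disallowed": ["Fraud", "Under Deductible", "Not Covered"],
    "Non-Full": ["Over Limit", "Depreciation", "Over Market Price"],
}


def item_flag_fields(item_number: int) -> List[str]:
    """Return the eleven flag field names for one item, in table order."""
    return [
        f"Item {item_number} - {group}: {reason}"
        for group, reasons in REASONS.items()
        for reason in reasons
    ]


def loss_claimed_field(item_number: int, category: str) -> str:
    """Return the '$Loss Claimed' field name for an item and its category."""
    return f"Item {item_number} - {category}: $Loss Claimed"


if __name__ == "__main__":
    # Example: the fields for item 30 (Cameras), as listed in the table.
    print(loss_claimed_field(30, "Cameras"))
    for name in item_flag_fields(30):
        print(name)
```

The item category (for example, Cameras for item 30) is passed in explicitly because it differs per item and must be taken from the table.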
Table 12 is the data schema for insurance freetext.
Column Field Name Sample Value Comment In Kaggle
Table 13 is the data schema for storms.
Table 15 is the data schema for volcanoes.
Back cover
REDP-5748-00
ISBN 0738461997
Printed in U.S.A.
ibm.com/redbooks