
DataPerf:

Benchmarks for Data-Centric AI Development

Mark Mazumder1 , Colby Banbury1 , Xiaozhe Yao2 , Bojan Karlaš2 , William Gaviria Rojas3 ,
Sudnya Diamos3 , Greg Diamos4 , Lynn He5 , Alicia Parrish 8 , Hannah Rose Kirk17 , Jessica Quaye1 ,
Charvi Rastogi11 , Douwe Kiela9,21 , David Jurado6,20 , David Kanter6 , Rafael Mosquera6,20 ,
Juan Ciro6,20 , Lora Aroyo8 , Bilge Acun7 , Lingjiao Chen9 , Mehul Smriti Raje3 , Max Bartolo16,19 ,
Sabri Eyuboglu9 , Amirata Ghorbani9 , Emmett Goodman9 , Oana Inel18 , Tariq Kane3,8 ,
Christine R. Kirkpatrick10 , Tzu-Sheng Kuo11 , Jonas Mueller12 , Tristan Thrush9 ,
Joaquin Vanschoren13 , Margaret Warren14 , Adina Williams7 , Serena Yeung9 , Newsha Ardalani7 ,
Praveen Paritosh6 , Ce Zhang2 , James Zou9 , Carole-Jean Wu7 , Cody Coleman3 , Andrew Ng4,5,9 ,
Peter Mattson8 , and Vijay Janapa Reddi1
1 Harvard University, 2 ETH Zurich, 3 Coactive.AI, 4 Landing AI, 5 DeepLearning.AI, 6 MLCommons,
7 Meta, 8 Google, 9 Stanford University, 10 San Diego Supercomputer Center, UC San Diego,
11 Carnegie Mellon University, 12 Cleanlab, 13 Eindhoven University of Technology,
14 Institute for Human and Machine Cognition, 15 Kaggle, 16 Cohere, 17 University of Oxford,
18 University of Zurich, 19 University College London, 20 Factored, 21 Contextual AI

Abstract
Machine learning research has long focused on models rather than datasets, and
prominent datasets are used for common ML tasks without regard to the breadth,
difficulty, and faithfulness of the underlying problems. Neglecting the fundamen-
tal importance of data has given rise to inaccuracy, bias, and fragility in real-world
applications, and research is hindered by saturation across existing dataset bench-
marks. In response, we present DataPerf, a community-led benchmark suite for
evaluating ML datasets and data-centric algorithms. We aim to foster innovation
in data-centric AI through competition, comparability, and reproducibility. We
enable the ML community to iterate on datasets, instead of just architectures, and
we provide an open, online platform with multiple rounds of challenges to support
this iterative development. The first iteration of DataPerf contains five benchmarks
covering a wide spectrum of data-centric techniques, tasks, and modalities in vi-
sion, speech, acquisition, debugging, and diffusion prompting, and we support
hosting new contributed benchmarks from the community. The benchmarks, on-
line evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry.

1 Introduction
Machine learning research has overwhelmingly focused on improving models rather than on im-
proving datasets. Large public datasets such as ImageNet [14], Freebase [7], Switchboard [22], and
SQuAD [44] serve as compasses for benchmarking model performance. Consequently, researchers
eagerly adopt the largest existing dataset without fully considering its breadth, difficulty and fidelity
to the underlying problem. Critically, better data quality [2] is increasingly necessary to improve
generalization, avoid bias, and aid safety in data cascades [48]. Without high-quality training data,
models can exhibit performance discrepancies leading to reduced accuracy and persistent fairness

37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
[Figure 1: pipeline diagram contrasting the model-centric training stages with data-centric operations (data quality assessment, data parsing, data augmentation, data acquisition, data cleaning, data selection, data debugging) that iteratively produce new train and test sets.]
Figure 1: Typical benchmarks are model-centric, and therefore focus on the model design and train-
ing stages of the ML pipeline (shown in orange). However, to develop high-quality ML applications,
users often employ a collection of data-centric operations to improve data quality and repeated data-
centric iterations to refine these operations. DataPerf aims to benchmark all major stages of such a
data-centric pipeline (shown in green) to improve ML data quality.

issues [9, 15, 37] once they leave the lab to enter service. In conventional model-centric ML, the
term benchmark often means a standard, fixed dataset for model accuracy comparisons and per-
formance measurements. While this paradigm has been useful for advancing model design, these
benchmarks are now saturating (attaining perfect or above “human-level” performance) [26]. This
raises two questions: First, is ML research making real progress on the underlying capabilities, or
is it just overfitting to existing benchmark datasets or suffering from data artifacts? A growing body
of literature explores the evidence supporting benchmark limitations [57, 24, 43, 53, 47, 5, 21, 55].
Second, how should benchmarks evolve to push the frontier of ML research?
In response to these concerning trends, we introduce DataPerf, a data-centric benchmark suite that
introduces competition to the field of dataset improvement. We survey a suite of complex data-
centric development pipelines across multiple ML domains and isolate a subset of concrete tasks
that we believe are representative of current bottlenecks, as illustrated in Figure 1. We freeze model
architectures, training hyperparameters, and task metrics to compare solutions strictly via relative
improvements from changes to the datasets themselves.
Our contributions are as follows:

• We have developed a comprehensive suite of novel data-centric benchmarks covering a
wide range of tasks. These tasks encompass training set selection for speech and vision,
data cleaning and debugging, data acquisition, and diffusion model prompting.
• Each benchmark specifies a data-centric task based on a real-world use case rationale. We
provide rules for submissions, along with evaluation scripts, and a baseline submission for
each benchmark task.
• We provide an extensible and open-source platform for hosting data-centric benchmarks,
allowing other organizations and researchers to propose new benchmarks for inclusion in
the DataPerf suite, and to host data challenges themselves.

Critically, DataPerf is not a one-off competition. We have established the DataPerf Working Group,
which operates under the MLCommons Association. This working group is responsible for the
ongoing maintenance of the benchmarks and platform, as well as for fostering the development of
data-centric research and methodologies in both academic and industrial domains. The aim is to
ensure the long-term sustainability and growth of DataPerf beyond a single competition.
The remainder of the paper is organized as follows. In Section 2.1, we review lessons learned from
an exploratory data-centric challenge. Section 2.2 details the hosting platform we developed in
response and Section 2.3 presents the DataPerf suite of five novel benchmarks and challenges. We
conclude with a survey of related efforts (Section 3) and future directions (Section 5).

2 DataPerf Benchmarking Suite
We describe the initial challenge which inspired the suite of DataPerf benchmarks and identified
which features are needed for hosting data-centric challenges online. We then describe the plat-
form that enables flexible data-centric benchmarking at scale. Finally, we share the initial DataPerf
benchmark definitions in vision, speech, acquisition, debugging, and text-to-image prompting.

2.1 The Data-Centric AI Challenge

The DataPerf effort began with an early benchmark which served to validate feasibility and pro-
vide real-world insights into the concept of dataset benchmarking. In traditional ML challenges,
contestants must train a high-accuracy model given a fixed dataset. This model-centric approach is
ubiquitous and has accelerated ML research, but it has neglected the surrounding systems and in-
frastructure requirements of ML in production [50]. To draw more attention to other areas of the ML
pipeline, we created the Data-Centric AI (DCAI) competition [39], inviting competitors to focus on
optimizing accuracy by improving a dataset given a fixed model architecture, thus flipping the con-
ventional challenge format of submitting different models which are evaluated on a fixed dataset.
The limiting element was the size of the submitted dataset; therefore, submitters received an ini-
tial training dataset to improve through data-centric strategies such as removing inaccurate labels,
adding instances that illustrate edge cases, and using data augmentation. The competition, inspired
by MNIST, focused on classification of Roman-numeral digits. Just by iterating on the dataset, par-
ticipants increased the baseline accuracy from 64.4% to 85.8%; human-level performance (HLP)
was 90.2%. We learned several lessons from the 2,500 submissions and applied them to DataPerf:

1. Common data pipelines. Successful entries followed a similar procedure: picking seed
photos, augmenting them, training a new model, assessing model errors and slicing groups
of images with comparable mistakes from the seed photos. We believe more competitions
will further establish and refine generalizable and effective practices.
2. Automated methods won. We expected participants would discover and remedy labeling
problems, but data-selection and data-augmentation strategies performed best.
3. Novel dataset optimizations. Examples of successful tactics include automated methods
for recognizing noisy images and labels, identifying mislabeled images, defining explicit
labeling rules for confusing images, correcting class imbalance, and selecting and enhanc-
ing images from the long tail of classes. We believe the right set of challenges and ML
tasks will yield other novel data-centric optimizations.
4. New methods emerged. In addition to conventional evaluation criteria (the highest perfor-
mance on common metrics), we created a separate category that evaluated a technique’s
innovativeness. This approach encouraged participants to explore and introduce novel sys-
tematic techniques with potential impact beyond the leaderboard.
5. New supporting infrastructure is necessary. The unconventional competition format neces-
sitated a technology that simultaneously supports a custom competition pipeline as well as
ample storage and training time. We quickly discovered that platforms and competitions
need complementary functions to support the unique needs of data-centric AI development.
Moreover, the competition was computationally expensive. Therefore, we require a more
efficient way to train the models on user-submitted data. Computational power, memory
and bandwidth are all major limitations.

These five lessons influenced our online platform design and initial suite of DataPerf challenges, as
described in the following sections.

2.2 Evaluation Platform

DataPerf provides an online platform where challenge participants can submit their solutions for
evaluation, and a working group that invites members of academia and industry to propose new
data-centric benchmarks for inclusion in the DataPerf suite. The DataPerf benchmarks, evaluation
tools, leaderboards, and documentation are hosted in an online platform called Dynabench1 [26],
1 https://dynabench.org/

which allows challenge participants to submit, evaluate, and compare solutions for all data-centric
benchmarks defined in Section 2.3. The DataPerf benchmarks and the Dynabench platform are open-
source, and are hosted and maintained by the MLCommons Association2 , a nonprofit organization
supported by more than 50 member companies and academics, ensuring long-term availability and
benefit to the community.
We believe DataPerf can serve as a unified benchmark suite for the majority of data-centric use
cases, and we welcome proposals from the creators of new and existing data-centric benchmarks.
Our five current benchmarks are also intended to serve as representative examples for future authors
to host their own challenges on DataPerf, with customized modular submission pipelines for dif-
ferent data modalities and submission artifact types. DataPerf introduces three key extensions to
the Dynabench codebase to support data-centric benchmarks: (1) We add support for a wide variety
of submission artifacts, such as training subsets, priority values/orderings, and purchase strategies.
Users can also submit fully containerized systems as artifacts, such as in the debugging challenge.
(2) To support a diverse set of evaluation algorithms and scoring metrics, we develop modular soft-
ware adaptors to allow for running custom benchmark evaluation tools and displaying or querying
scores in Dynabench’s online leaderboards. (3) DataPerf utilizes serverless [4] deployment which
dynamically scales resources based on demand, ensuring optimal performance and efficient resource
allocation, and allowing the platform to automatically scale with the growth of the benchmark suite
and the number of participants. DataPerf additionally offers offline evaluation scripts, enabling lo-
cal iteration on solutions before submitting for verification, further reducing load on the Dynabench
platform. These improvements to Dynabench ensure DataPerf can accommodate a large suite of
community-contributed data-centric challenges in the future.
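To make the notion of a submission artifact concrete, the sketch below shows the three broad artifact shapes mentioned above as plain Python structures; the field names and values are hypothetical illustrations, not the exact schemas used by the hosted challenges.

```python
# Hypothetical submission artifacts (field names and values are illustrative only).

# Training subset: selected training IDs per classification task.
selection_artifact = {"cupcake": [101, 205, 742], "hawk": [88, 913, 1204]}

# Priority ordering: training-set indices ranked by estimated harmfulness,
# as in the debugging challenge.
priority_artifact = {"order": [5, 4, 3, 2, 1]}

# Purchase strategy: number of samples to buy from each data seller.
acquisition_artifact = {"seller_0": 120, "seller_1": 0, "seller_2": 380}
```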

2.3 Challenges, Benchmarks, and Leaderboards

DataPerf uses leaderboards and challenges to encourage constructive competition and inspire ad-
vances in building and optimizing datasets. In this section, we clarify DataPerf’s terminology. A
leaderboard is a public summary of benchmark results; it helps to quickly identify state-of-the-art
approaches. A challenge is a public contest to achieve the best result on a leaderboard in a fixed
timeframe. Challenges motivate rapid progress through recognition and awards. Our leaderboards
and challenges are hosted on the online platform Dynabench (Section 2.2) developed and supported
by MLCommons. Benchmarks are fixed specifications for comparative evaluation on a static task,
and the key leave-behind of each challenge. MLCommons will provide long-term support for each
benchmark through leaderboards which remain open for submission and comparison once a chal-
lenge concludes. Each challenge also provides a baseline implementation to set a minimum bar for
each leaderboard metric and to discourage uninformative or random submissions.
DataPerf’s initial suite consists of tasks in training set selection for speech and vision, data cleaning
and debugging, data acquisition, and generative model prompting. Figure 1 depicts underserved
components in benchmarking machine learning pipelines, and these five tasks were selected by the
DataPerf working group among the initial proposals for challenges in order to cover as many of these
components as possible while also exercising the infrastructure requirements for our online platform.
The following sections describe the benchmarks that compose the first iteration of the DataPerf
benchmark suite. Documentation for each benchmark’s definition, metrics, submission rules, and
introductory tutorials are available on dataperf.org and reproduced in our Appendix, and our open-
source baseline implementations are available at https://github.com/MLCommons/dataperf.

2.3.1 Selection for Speech


DataPerf includes a dataset-selection-algorithm challenge with an emphasis on low-resource speech.
The objective of the speech selection task is to develop a selection algorithm that chooses the most
effective training samples from a large and noisy multilingual corpus of spoken words, expanding
sample quality estimation techniques to low-resource language settings. The provided training set is
used to train and evaluate an ensemble of fixed keyword-detection models.

Use-Case Rationale Keyword spotting (KWS) is a ubiquitous speech classification task present
on billions of devices. A KWS model detects a limited vocabulary of spoken words. Production
2 https://www.mlcommons.org/

[Figure 2: workflow diagram. MLCommons provides the train vectors, eval vectors, eval IDs, and allowed training IDs; the submitter's selection algorithm outputs the selected training IDs; model training on the selection generates the eval score.]

Figure 2: System design and component ownership for the speech selection benchmark.

examples include the wakeword interfaces for Google Voice Assistant, Siri and Alexa. However,
public KWS datasets traditionally cover very few words in only widely-spoken languages. In
contrast, the Multilingual Spoken Words Corpus (MSWC) [35] is a large dataset of over 340,000
spoken words in 50 languages (collectively, these languages represent more than five billion people).
MSWC automates word-length audio clip extraction from crowdsourced data. Due to errors in
the generation process and source data, some samples are incorrect. For instance, they may miss
part of the target sample (e.g., “weathe-” instead of “weather”) or may contain part of an adjacent
word (e.g., “time to” instead of “time”). This benchmark focuses on estimating the quality of each
automatically-generated sample in KWS training pipelines intended for low-resource languages.
Additionally, this benchmark establishes the DataPerf platform’s capabilities for hosting speech
challenges in multiple languages.
Benchmark Design Participants design a training-set-selection algorithm to propose the fewest
possible data samples for training three keyword-spotting models for five target words each across
three languages: English, Portuguese, and Indonesian, representing high, medium, and low-resource
languages. The benchmark evaluates the algorithm on the mean F1 score of each evaluation set
(additional details in Appendix A.3). The model is an ensemble of SVC and logistic-regression
classifiers, which output one of six categories (five target classes and one “unknown” class). The
inputs to the classifier are 1,024-dimensional vectors of embedding representations from a pretrained
keyword feature extractor [34]. Participants may only specify which training samples are used by the
model; all other configuration parameters are fixed, thereby emphasizing the importance of selecting
the most informative samples. For each language there are separate leaderboards for submissions
with ≤ 25 samples or ≤ 60 samples, evaluating the algorithm’s sensitivity to the training set size.
Participants are given a tutorial baseline which uses cross-fold validation in a Google Colab note-
book and an offline copy of the evaluation pipeline, for ease of setup and rapid experimentation.
This system design addresses a problem identified in the data-centric AI challenge (Section 2.1):
enabling offline development reduces the computational requirements for online evaluation, though
participants must agree to challenge rules on not inspecting the evaluation set. The DataPerf server
evaluates and verifies submitted training sets automatically (Sec. 2.2) for inclusion in the live leader-
board. Figure 2 illustrates the speech-selection benchmark workflow.
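As a rough illustration of the fixed evaluation loop, the sketch below trains an SVC-plus-logistic-regression ensemble on a selected subset of precomputed 1,024-dimensional embeddings and reports macro F1; the data loading, ensemble composition, and hyperparameters are assumptions for illustration, not the exact DataPerf evaluation code.

```python
# Illustrative offline scoring loop for the speech selection task (not the exact
# DataPerf evaluation code). Assumes `train_X` (N x 1024 embeddings), `train_y`,
# `eval_X`, and `eval_y` have already been loaded as NumPy arrays.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score

def score_selection(selected_ids, train_X, train_y, eval_X, eval_y):
    """Train the fixed ensemble on the selected subset and return macro F1."""
    model = VotingClassifier(
        estimators=[
            ("svc", SVC(probability=True)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    )
    model.fit(train_X[selected_ids], train_y[selected_ids])
    return f1_score(eval_y, model.predict(eval_X), average="macro")

# A submission is just a list of training IDs, e.g. at most 60 per language:
# print(score_selection([3, 17, 42], train_X, train_y, eval_X, eval_y))
```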

Baseline Results We provide two baseline implementations, nested cross-fold selection and a
data-cleaning approach using the Cleanlab framework [40]. The cross-fold selection method uses
nested cross-validation where the outer loop selects different subsets of the target samples and the
inner loop selects different subsets of the nontarget samples, and the best performing subsets are
reported back as the selected training set. The Cleanlab method rejects outliers using out-of-sample
predicted probability estimates for each candidate sample (also computed via cross-validated mod-
els). All baseline scores are averaged across 10 random seeds.
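The sketch below conveys the idea behind the second baseline (confidence-based filtering of noisy samples) without reproducing its exact implementation: it ranks candidates by the out-of-sample probability assigned to their own label, computed with cross-validation, and keeps the most confident samples per class. The model choice, number of folds, and per-class budget are assumptions.

```python
# Rough sketch of confidence-based filtering in the spirit of the Cleanlab
# baseline: samples whose own label receives a low out-of-sample predicted
# probability are treated as likely noise and excluded.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def select_by_confidence(train_X, train_y, per_class_budget):
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), train_X, train_y,
        cv=5, method="predict_proba",
    )
    # Probability each sample's own label received from the held-out folds.
    classes = np.unique(train_y)
    cols = np.searchsorted(classes, train_y)
    self_conf = pred_probs[np.arange(len(train_y)), cols]
    selected = []
    for label in classes:
        idx = np.where(train_y == label)[0]
        ranked = idx[np.argsort(-self_conf[idx])]       # most confident first
        selected.extend(ranked[:per_class_budget].tolist())
    return selected
```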

Table 1: Baseline results (macro F1 scores) for the Selection for Speech challenge.

                        English        Portuguese     Indonesian
Training set size       25     60      25     60      25     60
Nested cross-fold       0.32   0.41    0.42   0.52    0.36   0.42
Cleanlab                0.49   0.49    0.47   0.57    0.37   0.43

2.3.2 Selection for Vision
DataPerf includes a data selection algorithm challenge with a vision-centric focus. The objective
of this task is to develop a data selection algorithm that chooses the most effective training samples
from a large candidate pool of images. This resulting training sets will then be used to train a
collection of binary classifiers for various visual concepts. The benchmark evaluates the algorithm
on the basis of the resulting models’ classification performance on the evaluation set.

Use-Case Rationale Large datasets have been critical to many ML achievements, but they impose
significant challenges. Massive datasets are cumbersome and expensive, in particular unstructured
data such as web-scraped or weakly-labeled images, videos, and speech. Careful data selection
can mitigate some of the difficulties by focusing computational and labeling resources on the most
valuable examples and emphasizing quality over quantity, reducing training cost and time.
The vision-selection-algorithm benchmark evaluates binary classification of visual concepts (e.g.,
“monster truck” or “jean jacket”) in unlabeled images. Familiar production examples of similar
models include automatic labeling services by Amazon Rekognition, Google Cloud Vision API and
Azure Cognitive Services. Successful approaches to this challenge will enable image classification
of long-tail concepts where discovery of high-value data is critical, and advance the democratiza-
tion of computer vision [20]. This benchmark demonstrates DataPerf’s support for challenges with
unlabeled image data and is a template for future benchmarks that target automatic labeling.

Benchmark Design The task is to design a data-selection strategy that chooses the best training
examples from a large pool of training images. We evaluate submissions on their ability to algorith-
mically propose a subset of the Open Images Dataset V6 training set [29] that maximizes the mean
F1-score over a set of fixed concepts (“cupcake,” “hawk” and “sushi”). We provide a set of positive
examples for each classification task that participants can use to search for images containing the
target concepts. Participants must submit a training set for each classification task in addition to a
description of the data selection method by which they generated the training sets. The challenge
platform (Sec. 2.2) automates evaluation of submissions.
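One plausible way to use the provided positive examples, sketched below under the assumption that image embeddings have been precomputed with an off-the-shelf encoder, is to rank the unlabeled candidate pool by cosine similarity to the positive centroid; this is an illustrative starting point, not part of the benchmark specification.

```python
# Hypothetical candidate search for the vision selection task: rank the
# unlabeled pool by cosine similarity to the provided positive examples.
# Embeddings are assumed to come from any off-the-shelf image encoder.
import numpy as np

def rank_candidates(positive_emb, candidate_emb, top_k=500):
    """Return indices of the top_k candidates closest to the positive centroid."""
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    centroid = l2_normalize(positive_emb).mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    similarities = l2_normalize(candidate_emb) @ centroid
    return np.argsort(-similarities)[:top_k]
```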

Baseline Results We provide three baseline results, namely, farthest point sampling, pseudo-label
generation, and modified uncertainty sampling. Farthest point sampling selects negative examples
by sampling the feature space through iterative maximum L2 distances, then returns the best coreset
under nested cross-validation. Pseudo-label generation trains multiple neural networks and classical
models on a subset of the data to classify the remaining points and uses the best-performing model
for coreset proposal under multiple sampling experiments. Modified uncertainty sampling trains a
binary classifier on noisy positive labels from Open Images and uses this classifier to assign positive
and negative image pools, with the coreset randomly sampled from both pools. For each baseline,
F1 scores on the three test concepts are provided in Table 2.
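For reference, here is a compact sketch of the greedy step at the heart of the first baseline, farthest point sampling over image embeddings; the seed point and budget are illustrative, and the actual baseline additionally scores candidate coresets with nested cross-validation.

```python
# Greedy farthest point sampling over image embeddings: repeatedly add the
# point farthest from everything selected so far. Seed index and budget are
# illustrative; the actual baseline also cross-validates candidate coresets.
import numpy as np

def farthest_point_sampling(emb, budget, seed_idx=0):
    selected = [seed_idx]
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(emb - emb[seed_idx], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected
```

The max-min criterion spreads the chosen points across the feature space, which is what makes the sampled negatives diverse before the cross-validation step refines the coreset.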

Table 2: Baseline results (F1 scores) for the Selection for Vision challenge.

                                 Cupcake   Hawk   Sushi   Mean F1-score
Farthest point sampling          0.75      0.87   0.82    0.81
Pseudo-label generation          0.70      0.86   0.81    0.79
Modified uncertainty sampling    0.71      0.83   0.80    0.78

2.3.3 Debugging for Vision


The debugging challenge is to detect candidate data errors in the training set that cause a model to
have inferior quality. The aim is to assist a user in prioritizing which samples to inspect, correct,
and clean. A debugging method’s purpose is to identify the most detrimental data points from a
potentially noisy training set. After inspecting and correcting the selected data points, the cleaned
dataset is used to train a new classification model. Evaluation is based on the number of data points
the debugging approach must correct to attain a certain accuracy.

Use-Case Rationale Datasets are rapidly growing in size. For instance, Open Images V6 has 59
million image-level labels. Such datasets are annotated either manually or using ML. Unfortunately,
noise is unavoidable and can originate from both human annotators and algorithms. Models trained
on noisy annotations suffer in accuracy and carry risks of bias and unfairness. Dataset cleaning is a
common approach to dealing with noisy labels. However, it is a costly and time-consuming process
that typically involves human review. Consequently, examining and sanitizing the entire dataset is
often impractical. A data-centric method that focuses human attention and cleaning efforts on the
most important data elements can significantly reduce the time, cost, and labor of dataset debugging.
This challenge demonstrates the DataPerf platform’s ability to simulate human-in-the-loop data-
centric tasks, in this case label cleaning, while remaining scalable.

Benchmark Design The debugging task is based on binary image classification. For each task,
participants receive a noisy training set (i.e., some labels are inaccurate) and a validation set with
correct labels. They must provide a debugging approach that assigns a priority value (harmfulness)
to each training set item. During evaluation, the training items are examined and rectified in the
submitted priority order; each time a new item is examined, a classification model is trained on the
dataset cleaned so far, its accuracy on a hidden test set is computed, and a score is returned.
The image sets are from the Open Images Dataset [29], with two important considerations: (1) The
number of data points should be sufficient to permit random selection of samples for the training,
validation and test sets. (2) The number of discrepancies between the machine-generated label and
the human-verified label varies by task; the challenges thus reflect varying classification complex-
ity. We introduce two types of noise into the training set’s human-verified labels: some labels are
arbitrarily inverted, and machine-generated labels are substituted for some human-verified labels to
imitate the noise from algorithmic labeling.
We use a 2,048-dimensional vector of embedding representations extracted from a pretrained
ResNet50 model [32] as the classifier’s input data. Participants may only prioritize the training
samples used by the classifier; all other configurations are fixed for all submissions. By precomputing
all embeddings, participants are encouraged to propose data-centric debugging methods for arbitrary
features rather than approaches specific to the image domain. This also removes the need for GPU
acceleration during submission evaluation.
We use a concealed test set to evaluate the trained classification model’s performance on each task.
Since the objective of the debugging challenge is to determine which method produces sufficient
accuracy while analyzing the fewest data points, the assessment metric in the debugging challenge
is the proportion of inspections necessary to achieve 95% of the accuracy that the classifier trained
on the cleaned training set achieves. We verify submissions by incrementally cleaning the data
and training a model on each step. Each submission contains a list of indices in the order that
the submitter wishes to clean. We incrementally prepare a new dataset for each cleaned sample.
For instance, assuming the submission is [5,4,3,2,1], we will prepare 5 datasets that are [5-cleaned,
4,3,2,1], [5-cleaned, 4-cleaned, 3, 2, 1], and so forth. We then train an XGBoost classifier on each
dataset, and report back the first step at which the test-set accuracy reaches the threshold (95% of
the accuracy obtained with the fully cleaned training set).
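The incremental evaluation described above can be sketched as follows, assuming the precomputed features, the noisy and corrected label vectors, and a held-out test set are available as arrays; the XGBoost settings are left at their defaults and the details may differ slightly from the hosted evaluator.

```python
# Sketch of the incremental-cleaning evaluation: follow the submitted priority
# order, correct one label at a time, retrain, and report the fraction of
# inspections needed to reach 95% of the fully-cleaned accuracy. Data loading
# and XGBoost settings are illustrative.
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

def evaluate_priority_order(order, X, noisy_y, clean_y, X_test, y_test):
    def test_accuracy(labels):
        model = XGBClassifier().fit(X, labels)
        return accuracy_score(y_test, model.predict(X_test))

    target = 0.95 * test_accuracy(clean_y)   # accuracy of the fully cleaned set
    labels = noisy_y.copy()
    for step, idx in enumerate(order, start=1):
        labels[idx] = clean_y[idx]           # inspect and correct one sample
        if test_accuracy(labels) >= target:
            return step / len(order)         # fraction of inspections required
    return 1.0
```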
Participants in this challenge develop and validate their algorithms on their own machines using
the dataset and evaluation framework provided by DataPerf. Once they are satisfied with their
implementation, they submit a containerized version to the server (Sec. 2.2). The server then reruns
the uploaded implementation on several hidden tasks and posts the average score to a leaderboard.

Baseline Results The benchmark system provides three baseline implementations: consecutive,
random, and DataScope [25], which achieve scores of 53.50, 51.75, and 15.54, respectively. In
other words, DataScope needs to fix only 15.54% of data samples to achieve the threshold, whereas
consecutive needs 53.50% and random needs 51.75%. DataScope is a fast approximation of Shapley
values [31] that estimates the importance of each included sample and the effect of noise. As exact
Shapley values require calculating the payoff of every subset (O(2^N) evaluations), approximation
techniques such as DataScope are necessary.
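For intuition on why approximation is required, the sketch below estimates data Shapley values with generic Monte Carlo permutation sampling instead of enumerating all 2^N subsets; this is not the DataScope algorithm itself, and the utility function (validation accuracy of a small classifier) is an assumed stand-in.

```python
# Generic Monte Carlo (permutation) approximation of data Shapley values,
# shown only to illustrate why exhaustive O(2^N) enumeration is avoided.
# This is not DataScope itself; the utility function is an assumed stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def utility(idx, X, y, X_val, y_val):
    if len(set(y[idx])) < 2:                 # cannot fit on a single class
        return 0.0
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    return accuracy_score(y_val, model.predict(X_val))

def monte_carlo_shapley(X, y, X_val, y_val, n_permutations=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    values = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev_utility = 0.0
        for k in range(1, n + 1):
            current = utility(perm[:k], X, y, X_val, y_val)
            values[perm[k - 1]] += current - prev_utility   # marginal gain
            prev_utility = current
    return values / n_permutations
```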

2.3.4 Data Acquisition


The data acquisition challenge explores which dataset or combination of datasets to purchase in a
multi-source data marketplace for specific ML tasks.

[Figure 3: diagram of observations (pricing mechanisms, dataset summaries, a few samples of the sellers' datasets, and the evaluation dataset), submissions (description of the strategy/algorithm, purchased fractions), and executions (budget verifier, ML models, evaluation script).]

Figure 3: Data acquisition benchmark design. The participants observe the pricing mechanisms,
the dataset summaries, and the evaluation datasets. They then need to develop and submit the data
acquisition strategies. The evaluation is executed automatically on the DataPerf server.

Use-Case Rationale Rich data is increasingly sold and purchased either directly via companies
(e.g., Twitter [54] and Bloomberg [6]) or data marketplaces (e.g., Amazon AWS Data Exchange [1],
Databricks Marketplace [13], and TAUS Data Marketplace [52]) to train a high-quality ML model
customized for specific applications. Such purchased datasets are often necessary because they (i)
cover underrepresented populations, (ii) offer high-quality annotations, and (iii) come in easy-to-use
formats. On the other hand, these datasets are also expensive because of the tremendous effort spent
curating and cleaning data samples. Content opacity is therefore ubiquitous: data sellers are usually
disinclined to release the full content of their datasets to buyers, which makes it challenging for
data users to decide whether a dataset is useful for their downstream ML tasks. Based on our
conversations with practitioners, existing data acquisition methods for ML are ad hoc: one has to
manually identify data sellers, articulate one's needs, estimate each dataset's utility, and then purchase
the data. The process is also iterative in nature: a purchased dataset may yield limited improvement
on the downstream ML task, forcing the buyer to search for a new dataset again. With this in
mind, the goal of this challenge is to mitigate a data buyer’s burden by automating and optimizing
the data acquisition strategies. This challenge demonstrates the platform’s ability to handle data-
valuations and demonstrates a unique metric based on a pricing function and a budget, which is a
useful template for future challenges that wish to capture the nuance of resource expenditure.

Benchmark Design Participants in this challenge must submit a data acquisition strategy. The
data acquisition strategy specifies the number of samples to purchase from each available data seller
in a data marketplace. Then the benchmark suite generates a training dataset based on the acquisi-
tion strategy to train an ML classifier. To mimic data acquisition in a real-world data marketplace,
participants do not have access to sellers’ data. Instead, the participants are offered (1) a few samples
(five) from each data seller, (2) summary statistics about each dataset, (3) the pricing functions that
quantify how much to pay when a particular number of samples is purchased from one seller, and
(4) a budget constraint. The participant’s goal is to identify a data acquisition strategy within the
budget constraint that maximizes the trained classifier’s performance on an evaluation dataset. As
the focus is on training data acquisition, the evaluation dataset is also available to all participants.
The overall system design can be found in Figure 3.
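Here is a minimal sketch of what an acquisition strategy might look like, assuming each seller exposes a pricing function that maps a sample count to a price; the interface names are hypothetical, and the allocation shown mirrors the simple UNIFORM-style idea of spreading the budget evenly across sellers.

```python
# Hypothetical acquisition strategy: spread the budget roughly evenly across
# sellers, respecting each seller's pricing function. The interface
# (pricing_fns, max_samples) is illustrative, not the benchmark's actual API.
def uniform_acquisition(pricing_fns, max_samples, budget):
    """Return the number of samples to purchase from each seller."""
    share = budget / len(pricing_fns)
    purchase = []
    for price_of, available in zip(pricing_fns, max_samples):
        # Largest sample count from this seller that fits its budget share.
        count = 0
        for k in range(available, -1, -1):
            if price_of(k) <= share:
                count = k
                break
        purchase.append(count)
    return purchase

# Example with linear pricing: each sample from seller i costs (i + 1) units.
strategy = uniform_acquisition(
    pricing_fns=[lambda k, c=i + 1: c * k for i in range(5)],
    max_samples=[1000] * 5,
    budget=500.0,
)
```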

Table 3: We measure three baselines’ performance on all five data market instances. A large perfor-
mance heterogeneity is observed, calling for carefully designed data acquisition approaches.

Market Instance     0       1       2       3       4
UNIFORM             0.732   0.757   0.771   0.754   0.742
RSS                 0.705   0.732   0.730   0.721   0.679
FSS                 0.727   0.719   0.735   0.699   0.678

Baseline Results We offer three baseline methods, namely, UNIFORM, RSS (random single
seller), and FSS (fixed single seller). UNIFORM purchases data points uniformly at random from
every seller. RSS spends the entire budget to buy as many data points as possible from one uniformly
randomly chosen seller, while FSS does the same from a fixed seller. The baseline performance

can be found in Table 3. Overall, there is a large performance heterogeneity among the considered
baselines. This underscores the necessity of carefully designed data acquisition strategies.

2.3.5 Adversarial Nibbler


The goal of the Adversarial Nibbler challenge is to engage the research community in jointly discov-
ering a diverse set of insightful long-tail problems for text-to-image models and thus help identify
current blindspots in harmful image production (i.e., unknown unknowns). We focus on prompt-
image pairs that currently slip through the cracks of safety filters – either via intentful and subversive
prompts that circumvent the text-based filters or through seemingly benign requests that neverthe-
less trigger unsafe outputs. By focusing on unsafe generations paired with seemingly safe prompts,
our challenge zeros in on cases that (1) are most challenging to catch via text-prompt filtering and
(2) have the potential to be harmful to non-adversarial end users.

Use-Case Rationale Building on recent successes for data fairness [23], quality [12], limita-
tions [28, 58], and documentation and replication [42] of adversarial and data-centric challenges
for classification models, we identify a new challenge for discovering failure modes in generative
text-to-image models. Models such as DALL-E 2, Stable Diffusion, and Midjourney have reached
large audiences in the past year owing to their impressive and flexible capabilities. While most mod-
els have text-based filters in place to catch explicitly harmful generation requests, these filters are
inadequate to protect against the full landscape of possible harms. For instance, [45] recently re-
vealed that Stable Diffusion’s obfuscated safety filter only catches sexually explicit content but fails
to address violence, gore, and other problematic content. Our objective is to identify and mitigate
safety concerns in a structured and systematic manner, covering both the discovery of new failure
modes and the confirmation of existing ones. Adversarial Nibbler exercises DataPerf’s ability to
host challenges focused on evaluating generative AI and AI safety, and demonstrates DataPerf’s
support for high-demand GPU inference tasks and integration with external APIs. Additionally, this
challenge demonstrates new benchmark criteria targeted at generative models.

Benchmark Definition This competition is aimed at researchers, developers, and practitioners in
the field of fairness and development of text-to-image generative AI. We intentionally design the
competition to be simple enough that researchers from non-AI/ML communities can participate,
though the incentive structure is aimed at researchers. Participants must write a benign or subver-
sive prompt which is expected to correspond to an unsafe image. Our evaluation server returns
several generated images using DataPerf-managed API licenses, and the participant selects an im-
age (or none) that falls into one of our failure-mode categories, which include stereotypes and
culturally or ethically inappropriate generations, among others.
We aim to collect prompts that can serve as a “backdoor” for unsafe generation. We focus
on two different types of prompt-generation pairs, each reflecting a different user-model interac-
tion mode. (1) Benign prompts with unexpected unsafe outputs. A benign prompt in most cases is
expected to generate safe images. However, in some instances even a benign prompt may unexpect-
edly trigger unsafe or harmful generations. (2) Subversive prompts with expected unsafe outputs.
While text filters catch unambiguously harmful requests, users can adversarially bypass the filters
via subversive prompts which trigger the model to produce unsafe or harmful generations. The data
gathered from the first round is then sent to humans for validation before results are released to a
leaderboard. Participants are rewarded based on two criteria: validated attack success – the number
of unsafe images generated, and submission creativity – assessing coverage in terms of attack mode
across lexical, semantic, syntactic, and pragmatic dimensions.

Baseline Results As the Adversarial Nibbler challenge focuses on crowdsourced data and deviates
from the other benchmarks, there is no starter code or baseline result. Instead, the goal is to
analyze the data from the challenge submissions and create a publicly available dataset consisting of
prompt-image pairs. The validated pairs will be used to establish data ratings
and will serve as a valuable resource for drawing conclusions and insights from the submissions
received. Adversarial Nibbler has already collected several hundred unique prompts. Results from
this challenge, consisting of a public dataset and insights to red teaming approaches from challenge
participants, will be disseminated at the IJCNLP-AACL 2023 ART of Safety Workshop3 .
3 https://sites.google.com/view/art-of-safety/home

3 Related Work
To ensure academic innovations have real-world impact, systems research in the machine learning
industry has relied on benchmarking, including MLPerf [33, 46], DawnBench [10] and related ef-
forts [19, 60, 51]. Data-centric benchmarking has similarly received increased focus. Zha et al. [59]
surveys recent efforts, including benchmarks in AutoML [61], semi-supervised strategies [56], data
selection [16], and data cleaning approaches [30]. Benchmark competitions have also emerged as
a valuable comparative method in data-centric AI. DataComp [18] is a recent competition focused
on filtering multimodal training data for language-image pairs, with a focus on improving accu-
racies under different fixed compute budgets. The Crowdsourcing Adverse Test Sets for Machine
Learning (CATS4ML) Data Challenge [3] asked participants to find examples that are confusing
or otherwise problematic for image classification algorithms to process, in which participants sub-
mitted misclassified samples from the Google Open Images dataset, identifying 15,000 adversarial
examples. Drawing inspiration from these efforts, DataPerf solicits user-contributed benchmarks
by providing an extensible platform for hosted public challenges and leaderboards, with long-term,
industry-guided support for benchmarks through the DataPerf Working Group and MLCommons.
Several existing benchmarks evaluate state-of-the-art methods in selection. For instance, prior work
in benchmarking high-dimensional feature selection [8] and augmentation strategies [38] are con-
ceptually similar to the vision selection and roman numeral tasks. DCBench [16] is a benchmark
and Python API for fixed-budget cleaning, slice discovery [17], and coreset selection[11], which
are applicable to our speech selection, vision selection, and data debugging tasks. The baselines in
DataPerf do not exhaustively compare all state-of-the-art data-centric methods, but instead encour-
age students and new practitioners to apply existing methods from the literature, while still enabling
academic researchers to propose novel methods. Persistent online leaderboards for each challenge
enable new solutions to be compared to all prior submissions. The DataPerf Working Group en-
deavors to solicit new challenges from the data-centric research community, and to integrate exist-
ing benchmarks (ideally in partnership with their respective authors) in additional domains, such as
active learning for tabular data [36], label uncertainty [41], and noisy annotations [49].

4 Statement of Ethics
Dynabench collects self-declared usernames and email addresses during registration, and these user-
names may correspond to personally identifiable information. Dynabench also collects uploaded arti-
facts during submission which can optionally be viewed by other users as open benchmark results.
Adversarial Nibbler requires additional guidelines for participants as it collects potentially sensitive
content of harmful and disturbing depictions which may negatively impact participants and raters.
These guidelines follow best practices for protecting well-being [27] and provide communication
with challenge organizers, preparation for working with potentially unsafe imagery, and external
resources for psychological support (detailed in Appendix A.7).

5 Conclusion and Future Work


The purpose of DataPerf is to improve machine learning by expanding AI research from just models
to models and datasets. The benchmarks aim to improve standard practices for dataset development,
and add rigor to assessing the quality of training and test sets, across a wide variety of ML applica-
tions. Systematic dataset benchmarking is vital, per the adage “what gets measured gets improved.”
The initial version of DataPerf comprises five benchmarks, each with unique rules, evaluation meth-
ods, and baseline implementations, and an open-source, extensible evaluation platform.
DataPerf will continue to expand by adding additional benchmarks to the suite, with input and con-
tributions from the community. Additionally, in order to increase the reproducibility of challenges
and expand the scope of the evaluation, we plan to add a ’Closed Division’ where participants
must submit an algorithm that is then evaluated on a ’hidden training set’, meaning it is tested on
data that the submitter has never seen. This evaluates if the algorithm can generalize beyond the
original dataset’s distribution. We urge interested parties to join the DataPerf Working Group, and
to participate in and contribute to current benchmarking challenges or propose new challenges at
https://dataperf.org.

References
[1] Amazon. Amazon aws data exchange, 2023. (Accessed on 05/22/2023).
[2] L. Aroyo, M. Lease, P. Paritosh, and M. Schaekermann. Data excellence for ai: why should
you care? Interactions, 29(2):66–69, 2022.
[3] L. Aroyo, P. Paritosh, S. Ibtasam, D. Bansal, K. Rong, and K. Wong. Adversarial test set for
image classification: Lessons learned from cats4ml data challenge. Under review, 2021.
[4] I. Baldini, P. Castro, K. Chang, P. Cheng, S. Fink, V. Ishakian, N. Mitchell, V. Muthusamy,
R. Rabbah, A. Slominski, et al. Serverless computing: Current trends and open problems.
Research advances in cloud computing, pages 1–20, 2017.
[5] Y. Belinkov, A. Poliak, S. M. Shieber, B. Van Durme, and A. M. Rush. Don’t take the premise
for granted: Mitigating artifacts in natural language inference. Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, 2019.
[6] Bloomberg. Bloomberg api, 2023. (Accessed on 05/22/2023).
[7] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created
graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD
international conference on Management of data, pages 1247–1250, 2008.
[8] A. Bommert, T. Welchowski, M. Schmid, and J. Rahnenführer. Benchmark of filter meth-
ods for feature selection in high-dimensional gene expression survival data. Briefings in
Bioinformatics, 23(1):bbab354, 2022.
[9] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial
gender classification. In Conference on fairness, accountability and transparency, pages 77–91.
Proceedings of Machine Learning Research, 2018.
[10] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun,
C. Ré, and M. Zaharia. Dawnbench: An end-to-end deep learning benchmark and competition.
Training, 100(101):102, 2017.
[11] C. Coleman, C. Yeh, S. Mussmann, B. Mirzasoleiman, P. Bailis, P. Liang, J. Leskovec, and
M. Zaharia. Selection via proxy: Efficient data selection for deep learning. arXiv preprint
arXiv:1906.11829, 2019.
[12] K. Crawford and T. Paglen. Excavating ai: The politics of training sets for machine learning,
September 2019.
[13] Databricks. Databricks data marketplace, 2023. (Accessed on 05/22/2023).
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierar-
chical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pages 248–255. Ieee, 2009.
[15] E. Denton, A. Hanna, R. Amironesei, A. Smart, H. Nicole, and M. K. Scheuerman. Bring-
ing the people back in: Contesting benchmark machine learning datasets. arXiv preprint
arXiv:2007.07399, 2020.
[16] S. Eyuboglu, B. Karlaš, C. Ré, C. Zhang, and J. Zou. Dcbench: A benchmark for data-centric
ai systems. New York, NY, USA, 2022. Association for Computing Machinery.
[17] S. Eyuboglu, M. Varma, K. K. Saab, J.-B. Delbrouck, C. Lee-Messer, J. Dunnmon, J. Zou, and
C. Re. Domino: Discovering systematic errors with cross-modal embeddings. In International
Conference on Learning Representations, 2022.
[18] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman,
D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.
arXiv preprint arXiv:2304.14108, 2023.

[19] W. Gao, C. Luo, L. Wang, X. Xiong, J. Chen, T. Hao, Z. Jiang, F. Fan, M. Du, Y. Huang, et al.
Aibench: towards scalable and comprehensive datacenter ai benchmarking. In International
Symposium on Benchmarking, Measuring and Optimization, pages 3–9. Springer, 2018.
[20] W. Gaviria Rojas, S. Diamos, K. Kini, D. Kanter, V. Janapa Reddi, and C. Coleman. The dollar
street dataset: Images representing the geographic and socioeconomic diversity of the world.
Advances in Neural Information Processing Systems, 35:12979–12990, 2022.
[21] M. Geva, Y. Goldberg, and J. Berant. Are we modeling the task or the annotator? an
investigation of annotator bias in natural language understanding datasets. arXiv preprint
arXiv:1908.07898, 2019.
[22] J. Godfrey, E. Holliman, and J. McDaniel. Switchboard: telephone speech corpus for re-
search and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference
on Acoustics, Speech, and Signal Processing, pages 517–520, 1992.
[23] N. Goel and B. Faltings. Crowdsourcing with fairness, diversity and budget constraints. In
Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 297–304.
Association for Computing Machinery, 2019.
[24] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith.
Annotation artifacts in natural language inference data. Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, 2018.
[25] B. Karlaš, D. Dao, M. Interlandi, B. Li, S. Schelter, W. Wu, and C. Zhang. Data debug-
ging with shapley importance over end-to-end machine learning pipelines. arXiv preprint
arXiv:2204.11131, 2022.
[26] D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh,
P. Ringshia, et al. Dynabench: Rethinking benchmarking in nlp. Proceedings of the 2021
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, 2021.
[27] H. Kirk, A. Birhane, B. Vidgen, and L. Derczynski. Handling and presenting harmful text in
nlp research. In Findings of the Association for Computational Linguistics: EMNLP 2022,
pages 497–510, 2022.
[28] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky. Revealing the dark secrets of bert,
2019.
[29] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov,
M. Malloci, A. Kolesnikov, et al. The open images dataset v4. International Journal of
Computer Vision, 128(7):1956–1981, 2020.
[30] P. Li, X. Rao, J. Blase, Y. Zhang, X. Chu, and C. Zhang. Cleanml: A study for evaluating the
impact of data cleaning on ml classification tasks. In 2021 IEEE 37th International Conference
on Data Engineering (ICDE), pages 13–24. IEEE, 2021.
[31] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. Advances
in neural information processing systems, 30, 2017.
[32] TorchVision maintainers and contributors. Torchvision: PyTorch’s computer vision library.
https://github.com/pytorch/vision, 2016.
[33] P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y.
Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock,
X. Huang, D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi, G. Pekhimenko,
L. Pentecost, V. Janapa Reddi, T. Robie, T. St John, C.-J. Wu, L. Xu, C. Young, and M. Zaharia.
Mlperf training benchmark. In Proceedings of Machine Learning and Systems, volume 2,
2020.
[34] M. Mazumder, C. Banbury, J. Meyer, P. Warden, and V. J. Reddi. Few-shot keyword spotting
in any language. arXiv preprint arXiv:2104.01454, 2021.

[35] M. Mazumder, S. Chitlangia, C. Banbury, Y. Kang, J. M. Ciro, K. Achorn, D. Galvez,
M. Sabini, P. Mattson, D. Kanter, et al. Multilingual spoken words corpus. In Thirty-fifth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track
(Round 2), 2021.

[36] V. V. Meduri, L. Popa, P. Sen, and M. Sarwat. A comprehensive benchmark framework
for active learning methods in entity matching. In Proceedings of the 2020 ACM SIGMOD
International Conference on Management of Data, pages 1133–1147, 2020.

[37] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and
fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1–35, 2021.

[38] L. Nanni, M. Paci, S. Brahnam, and A. Lumini. Comparison of different image data augmen-
tation approaches. Journal of imaging, 7(12):254, 2021.

[39] A. Ng, L. He, and D. Laird. Data-Centric AI Competition, 2021.

[40] C. G. Northcutt, L. Jiang, and I. L. Chuang. Confident learning: Estimating uncertainty in
dataset labels. Journal of Artificial Intelligence Research (JAIR), 70:1373–1411, 2021.

[41] J. C. Peterson, R. M. Battleday, T. L. Griffiths, and O. Russakovsky. Human uncertainty
makes classification more robust. In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 9617–9626, 2019.

[42] J. Pineau. Reproducible, reusable, and robust reinforcement learning, 2018.

[43] A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme. Hypothesis only base-
lines in natural language inference. Proceedings of the Seventh Joint Conference on Lexical
and Computational Semantics, 2018.

[44] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine
comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

[45] J. Rando, D. Paleka, D. Lindner, L. Heim, and F. Tramèr. Red-teaming the stable diffusion
safety filter. arXiv preprint arXiv:2210.04610, 2022.

[46] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson,
M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng, G. Di-
amos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John,
P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne,
G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang, M. Thomson, F. Wei,
E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong, P. Zhang, and Y. Zhou. Mlperf inference
benchmark. In Proceedings of the ACM/IEEE Annual International Symposium on Computer
Architecture, 2020.

[47] M. T. Ribeiro, S. Singh, and C. Guestrin. Semantically equivalent adversarial rules for
debugging nlp models. In Proceedings of the 56th annual meeting of the association for
computational linguistics (volume 1: long papers), pages 856–865, 2018.

[48] N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, and L. M. Aroyo. “Everyone
wants to do the model work, not the data work”: Data cascades in high-stakes AI. In Proceedings
of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2021.

[49] L. Schmarje, V. Grossmann, C. Zelenka, S. Dippel, R. Kiko, M. Oszust, M. Pastell, J. Stracke,
A. Valros, N. Volkmann, et al. Is one annotation enough? A data-centric image classifica-
tion benchmark for noisy and ambiguous label estimation. Advances in Neural Information
Processing Systems, 35:33215–33232, 2022.

[50] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young,
J.-F. Crespo, and D. Dennison. Hidden technical debt in machine learning systems. Advances
in neural information processing systems, 28, 2015.

[51] F. Tang, W. Gao, J. Zhan, C. Lan, X. Wen, L. Wang, C. Luo, Z. Cao, X. Xiong, Z. Jiang,
et al. Aibench training: Balanced industry-standard ai training benchmarking. In 2021 IEEE
International Symposium on Performance Analysis of Systems and Software (ISPASS), pages
24–35. IEEE, 2021.
[52] TAUS. TAUS data marketplace. (Accessed on 05/22/2023).
[53] M. Tsuchiya. Performance impact caused by hidden bias of training data for recognizing tex-
tual entailment. Proceedings of the Eleventh International Conference on Language Resources
and Evaluation, 2018.
[54] Twitter. Twitter api, 2023. (Accessed on 05/22/2023).
[55] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh. Universal adversarial triggers for
attacking and analyzing nlp. Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), 2019.
[56] Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo,
et al. Usb: A unified semi-supervised learning benchmark for classification. Advances in
Neural Information Processing Systems, 35:3938–3961, 2022.
[57] D. Weissenborn, G. Wiese, and L. Seiffe. Making neural QA as simple as possible but not sim-
pler. In R. Levy and L. Specia, editors, Proceedings of the 21st Conference on Computational
Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 271–
280. Association for Computational Linguistics, 2017.
[58] C. Welty, P. Paritosh, and L. Aroyo. Metrology for ai: From benchmarks to instruments. arXiv
preprint arXiv:1911.01875, 2019.
[59] D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, Z. Jiang, S. Zhong, and X. Hu. Data-centric artificial
intelligence: A survey. arXiv preprint arXiv:2303.10158, 2023.
[60] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhi-
menko. Tbd: Benchmarking and analyzing deep neural network training. arXiv preprint
arXiv:1803.06905, 2018.
[61] M.-A. Zöller and M. F. Huber. Benchmark and survey of automated machine learning frame-
works. Journal of artificial intelligence research, 70:409–472, 2021.

A Appendix

A.1 Terminology for Training Sample Selection

In this section, for convenience, we clarify the terminology related to training sample selection
used in our challenges, where (in accordance with widely-used terminology) a training sample is an
individual data point in a dataset. Sec. 2.3 clarifies our distinction between challenges, benchmarks,
and leaderboards.
• Training set selection: this task refers to choosing a small set of samples for training a
model from a larger pool of potentially noisy training data. This task is also commonly
referred to as coreset selection.
• Training IDs: integer enumerations of training data samples ([1,2,3,...]), or unique strings each corresponding to a file containing data for an individual sample ([audio1.wav, audio2.wav, ...]).
• Allowed training IDs: the list of potential samples which can be included in a proposed coreset by a challenge participant. In other words, this is the full list of training IDs, of which participants can form subsets.
• Selected training IDs: a concretized coreset, submitted to the DataPerf online platform for evaluation. In other words, selected training IDs are a subset drawn from the full list of allowed training IDs. This is indicated as "New Train Set" in Figure 1 and illustrated in the sketch below.
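As a minimal illustration of how these terms relate (the ID values below are placeholders, not data from any challenge):

# Placeholder IDs for illustration only; real challenges use dataset-specific IDs.
allowed_training_ids = ["audio1.wav", "audio2.wav", "audio3.wav", "audio4.wav"]

# A selection algorithm proposes a coreset: a subset of the allowed training IDs.
selected_training_ids = ["audio2.wav", "audio4.wav"]

assert set(selected_training_ids) <= set(allowed_training_ids)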

A.2 Reproducibility

Source code for the inaugural DataPerf challenges is hosted at github.com/mlcommons/dataperf. We use git submodules to reference a fixed commit hash of the respective parent repositories for each challenge. This preserves flexibility for a diverse set of challenges and allows challenge owners to maintain control of their benchmarks and promote community visibility within their own GitHub organizations, while simultaneously ensuring that the challenges remain static during the competition and are archived as-is with respect to each round of challenges.
We additionally provide links to each benchmark’s repository here, containing code and documen-
tation for reproducibility.

1. Selection for Speech: The baseline for the speech training set selection benchmark is available at https://github.com/harvard-edge/dataperf-speech-example
2. Selection for Vision: The baseline for the vision training set selection benchmark will be available at https://github.com/CoactiveAI/dataperf-vision-selection; the code release is in progress.
3. Debugging for Vision: The vision debugging baseline is available at https://github.com/DS3Lab/dataperf-vision-debugging
4. Data Acquisition: The data acquisition baseline is available at https://github.com/facebookresearch/Data_Acquisition_for_ML_Benchmark
5. Adversarial Nibbler: As the Adversarial Nibbler challenge focuses on crowdsourced data, there is no starter code or baseline for participants. The server code for the challenge is available as part of Dynabench (Sec. 2.2) at https://github.com/mlcommons/dynabench

In the following sections, to provide a fixed reference, we include extended documentation for each challenge, reproduced from each challenge's source-code repository as of August 2023, which reflects the challenge requirements and evaluation structure for all inaugural challenges in the DataPerf suite. Though future training set selection and debugging challenges in DataPerf may diverge from some of the technical specifications provided here, we emphasize that these challenges as described can also serve as fixed benchmarks for the data-centric AI community, and future solutions can be submitted to the leaderboards for these rounds of challenges in adherence to these specifications and rules.

A.3 Selection for Speech

In Fig. 4, we provide the training and evaluation sample counts available for each target keyword, and for the nontarget data, for the three languages in the benchmark. All target evaluation samples were verified for correctness via manual listening. For each language, a participant trains a six-category (five target words and one nontarget category) model, using a maximum of 25 or 60 samples drawn from the training pool. Evaluation proceeds by training ten models using ten random seeds and, for each model, reporting the macro F1 score on all evaluation samples for target and nontarget words for each language.

Figure 4: Target keywords and sample counts for speech selection.

We reproduce documentation from https://github.com/harvard-edge/dataperf-speech-example/ as a centralized reference here.
Dataperf-Selection-Speech4 is a challenge hosted by DataPerf.org5 that measures the performance
of dataset selection algorithms. The model training component is frozen and participants can only
improve the accuracy by selecting the best training set. The benchmark is intended to encompass the
tasks of dataset cleaning and coreset selection for a keyword spotting application. As a participant,
you will submit your proposed list of training samples to the leaderboard on DynaBench6 where the
model is trained, evaluated, and scored.
Evaluation Metric
You are given a training dataset for spoken word classification, and your goal is to produce an algorithm that selects a subset of M examples (a coreset) from this dataset. Evaluation proceeds by subsequently training a fixed model (sklearn.ensemble.VotingClassifier with various constituent classifiers) on your chosen subset, and then scoring the model's predictions on fixed test data via the sklearn.metrics.f1_score metric with average = macro. We average the score over 10 random seeds (located in workspace/dataperf_speech_config.yaml) to produce the final score. M is user defined, but Dynabench will host two leaderboards per language with coreset size caps of 25 and 60.
For each language, the challenge includes two leaderboards on Dynabench (six leaderboards in total). Each leaderboard corresponds to a language and a fixed maximum number of training samples (your submission can specify fewer samples than the maximum coreset size).
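To make the scoring procedure concrete, the following minimal sketch reproduces the evaluation loop described above; the constituent classifiers, variable names, and seed list are illustrative assumptions rather than the exact configuration in workspace/dataperf_speech_config.yaml.

import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def score_coreset(X_train, y_train, X_eval, y_eval, seeds=range(10)):
    """Train the frozen classifier on a selected coreset and report the mean macro F1."""
    scores = []
    for seed in seeds:
        clf = VotingClassifier(
            estimators=[
                ("lr", LogisticRegression(max_iter=1000, random_state=seed)),
                ("svm", SVC(probability=True, random_state=seed)),
            ],
            voting="soft",
        )
        clf.fit(X_train, y_train)
        scores.append(f1_score(y_eval, clf.predict(X_eval), average="macro"))
    return float(np.mean(scores))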
The training dataset consists of embedding vectors produced by a pretrained keyword spotting model7 (model checkpoint weights8) for five target words in each of three languages (English, Portuguese, and Indonesian), taken from the Multilingual Spoken Words Corpus9.
4 <https://www.dataperf.org/training-set-selection-speech>
5 <https://dataperf.org>
6 <https://dynabench.org/tasks/speech-selection>
7 <https://arxiv.org/abs/2104.01454>
8 <https://github.com/harvard-edge/multilingual_kws/releases/download/v0.1-alpha/multilingual_context_73_0.8011.tar.gz>
9 <https://mlcommons.org/words>

The classifier also includes a nontarget category representing unknown words, which are distinct from the five target words. To train and evaluate the classifier's ability to recognize nontarget words, we include a large set of embedding vectors drawn from each respective language.
Solutions should be algorithmic in nature (i.e., they should not involve human-in-the-loop audio
sample listening and selection). We warmly encourage open-source submissions. If a participant
team does not wish to open-source their solution, we ask that they allow the DataPerf organization
to independently verify their solution and approach to ensure it is within the challenge rules.
Getting Started
Our introductory notebook on Google Colab is available at https://colab.research.google.com/github/harvard-edge/dataperf-speech-example/blob/main/dataperf_speech_colab.ipynb10
This colab walks through performing coreset selection with our baseline algorithm11 and running
our evaluation script12 on the coresets for English, Portuguese, and Indonesian.
Below, we provide additional documentation for each step of the above colab (downloading the data, coreset selection, and evaluation).
Please see the challenge rules on dataperf.org13 for more details - in particular, we ask you not to
optimize your result using any of the challenge evaluation data. Optimization (e.g., cross-validation)
should be performed on the samples in allowed_training_set.yaml for each language, and
solutions should not be optimized against any of the samples listed in eval.yaml for any of the
languages.
Since this speech challenge is fully open, there is no hidden test set. A locally computed evaluation score is unofficial, but should match the results on DynaBench; it is included here solely to allow double-checking of DynaBench-computed results if necessary. Official evaluations will only be performed on DynaBench. The following command performs local (offline) evaluation:

python eval.py --language en --train_size 25

This will output the macro f1 score of a model trained on the selected training set, against the official
evaluation samples.
Algorithm Development
To develop their own selection algorithm, participants should (see the sketch after this list):
• Create a new selection.py algorithm in selection/implementations which subclasses TrainingSetSelection14
• Implement select() in your class to use your selection algorithm
• Change selection_algorithm_module and selection_algorithm_class in workspace/dataperf_speech_config.yaml to match the name of your selection implementation
• Optionally, add experiment configs to workspace/dataperf_speech_config.yaml (these can be accessed via self.config)
• Run your selection strategy and submit your results to DynaBench
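For reference, a minimal (and heavily simplified) selection implementation might look as follows; the attribute and config names (self.embeddings, train_size) and the returned structure are assumptions for this sketch, so consult selection/selection.py for the actual interface.

import numpy as np
from selection.selection import TrainingSetSelection  # assumed import path


class CentroidSelection(TrainingSetSelection):
    def select(self):
        # Assumed layout: self.embeddings["targets"] maps each keyword to a list of
        # dicts with "clip_id" and "mswc_embedding_vector" entries.
        budget_per_class = self.config.get("train_size", 25) // 6  # assumed config key
        selected = {"targets": {}, "nontargets": []}
        for word, samples in self.embeddings["targets"].items():
            vectors = np.stack([s["mswc_embedding_vector"] for s in samples])
            centroid = vectors.mean(axis=0)
            # Keep the samples closest to the class centroid.
            order = np.argsort(np.linalg.norm(vectors - centroid, axis=1))
            selected["targets"][word] = [samples[i]["clip_id"] for i in order[:budget_per_class]]
        return selected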
Submission
Once participants are satisfied with their selection algorithm, they should submit their {lang}_{size}_train.json files to DynaBench15. A separate file is required for each language and training set size combination (6 total).
Each supported language has the following files:
10 <https://colab.research.google.com/github/harvard-edge/dataperf-speech-example/blob/main/dataperf_speech_colab.ipynb>
11 <https://github.com/harvard-edge/dataperf-speech-example/blob/main/selection/implementations/baseline_selection.py>
12 <https://github.com/harvard-edge/dataperf-speech-example/blob/main/eval.py>
13 <https://dataperf.org>
14 <https://github.com/harvard-edge/dataperf-speech-example/blob/main/selection/selection.py#L16>
15 <https://dynabench.org/tasks/speech-selection>

• train_vectors : The directory that contains the embedding vectors
that can be selected for training. The file structure follows the pattern
train_vectors/en/left.parquet. Each parquet file contains a clip_id
column and a mswc_embedding_vector column.
• eval_vectors : The directory that contains the embedding vectors that are used for
evaluation. The structure is identical to train_vectors
• allowed_train_set.yaml : A file that specifies which sample IDs are valid training samples. The file contains the following structure: {"targets": {"left":[list]}, "nontargets": [list]}
• eval.yaml : The evaluation set for eval.py. It follows the same structure as
allowed_train_set.yaml. Participants should never use this data for training set
selection algorithm development.
• {lang}_{size}_train.json : The file produced by selection:main that specifies the language-specific training set for eval.py.

All languages share the following files:
• dataperf_speech_config.yaml : This file contains the configuration for the dataperf-speech-example workflow. Participants can extend this configuration file as needed.
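As an illustration of how these files fit together locally, the following sketch loads one keyword's candidate pool, restricts it to the allowed IDs, and writes a toy selected training set; the JSON schema shown here simply mirrors the YAML structure and is an assumption, since the official file is produced by selection:main.

import json
import pandas as pd
import yaml

# Load the candidate pool of embeddings for one keyword (default workspace layout).
train_vectors = pd.read_parquet("train_vectors/en/left.parquet")  # clip_id, mswc_embedding_vector

# Load the allowed training IDs and restrict the pool to them.
with open("allowed_train_set.yaml") as f:
    allowed = yaml.safe_load(f)  # {"targets": {"left": [...]}, "nontargets": [...]}
pool = train_vectors[train_vectors["clip_id"].isin(set(allowed["targets"]["left"]))]

# Write a toy selected training set (assumed schema mirroring the YAML file).
submission = {"targets": {"left": pool["clip_id"].head(4).tolist()}, "nontargets": []}
with open("en_25_train.json", "w") as f:
    json.dump(submission, f)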
Rules
• We ask you to please not look at or use the provided evaluation sets in any way other
than for offline evaluation of your submissions to Dynabench (e.g., do not optimize on the
evaluation data).
• Each training set in the final submission will be capped at either 25 or 60 samples, depend-
ing on the leaderboard. Training sets with more than the maximum number of selected
samples for that leaderboard will be rejected.
• For this challenge, the submitted train.json file can be unbalanced; therefore, an optimal solution may leverage an unbalanced training set.
• The provided candidate pool of training samples is a custom subset of the Multilingual
Spoken Words Corpus (MSWC). You may analyze other languages in MSWC, but please
do not use English, Portuguese, or Indonesian MSWC data outside of the samples specified
in allowed_training_set.yaml for each respective language.

A.4 Selection for Vision

We reproduce documentation from https://github.com/CoactiveAI/dataperf-vision-selection as a centralized reference here.
Our github repo serves as the starting point for offline evaluation of submissions for the training data selection visual benchmark. The offline evaluation can be run both in your local environment and in a containerized image for reproducibility of score results.
For a detailed summary of the benchmark, refer to the provided documentation16.
Creating a submission
A valid submission for the open division includes the following:
• A description of the data selection algorithm/strategy used
• A training set for each classification task as specified below
• (Optional) A script of the algorithm/strategy used
Each training set file must be a .csv file containing two columns: ImageID (the unique identifier
for the image) and Confidence (the binary label, either a 0 or 1). The ImageIDs in the training
set files must be limited to the provided candidate pool of training images (i.e. ImageIDs in the
downloaded embeddings file).
The included training set file serves as a template of a single training set:
16 <https://www.dataperf.org/training-set-selection-vision>

cat dataperf-vision-selection/data/train_sets/random_500.csv

ImageID,Confidence
0002643773a76876,0
0016a0f096337445,0
0036043ce525479b,1
00526f123f84db2f,1
0080db2599d54447,1
00978577e9fdd967,1
...

Offline evaluation
The configuration for the offline evaluation is specified in the task_setup.yaml file. For simplicity, the repo comes pre-configured such that for offline evaluation you can simply:
1. Copy your training sets to the template filesystem
2. Modify the config file to specify the training set for each task
3. Run offline evaluation
4. See results in stdout and results file in data/results/
For example:

# 1. Copy training sets for each task
cd dataperf-vision-selection
cp /path/to/your/training/sets/Cupcake.csv data/train_sets/
cp /path/to/your/training/sets/Hawk.csv data/train_sets/
cp /path/to/your/training/sets/Sushi.csv data/train_sets/

# 2. task_setup.yaml: modify the training set relative path for each classification task
Cupcake: ['train_sets/Cupcake.csv',
          'test_sets/alpha_test_set_Cupcake_256.parquet']
Hawk: ['train_sets/Hawk.csv',
       'test_sets/alpha_test_set_Hawk_256.parquet']
Sushi: ['train_sets/Sushi.csv',
        'test_sets/alpha_test_set_Sushi_256.parquet']

# 3a. Run offline evaluation (docker)
docker-compose up --build --force-recreate

# 3b. Run offline evaluation (local python)
python3 main.py

# 4. See results (file will have save timestamp in name)
cat data/results/result_UTC-2022-03-31-20-19-24.json

{
"Cupcake": {
"accuracy": 0.5401459854014599,
"recall": 0.463768115942029,
"precision": 0.5517241379310345,
"f1": 0.5039370078740157
},
"Hawk": {
"accuracy": 0.296551724137931,
"recall": 0.16831683168316833,
"precision": 0.4857142857142857,
"f1": 0.25000000000000006
},
"Sushi": {

19
"accuracy": 0.5185185185185185,
"recall": 0.6261682242990654,
"precision": 0.638095238095238,
"f1": 0.6320754716981132
}
}

Evaluation Criteria
In this challenge, your task will be to design a data selection strategy that chooses the best training examples from a candidate pool of training images (a custom subset of the Open Images Dataset V6 training set) and maximizes the F1 score across a set of binary classification tasks for different visual concepts (e.g., "Cupcake", "Hawk", "Sushi"). Your submission will be a training set for each of the classification tasks in this challenge.
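One simple (hypothetical) strategy consistent with these criteria is to rank the unlabeled candidate embeddings by similarity to a few seed positives identified from Open Images metadata, then label the most and least similar candidates; the file names and seed IDs below are illustrative assumptions, not part of the official baseline.

import numpy as np
import pandas as pd

candidates = pd.read_parquet("candidate_embeddings.parquet")   # assumed columns: ImageID, embedding
emb = np.stack(candidates["embedding"].to_numpy())
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

seed_ids = {"0036043ce525479b", "00526f123f84db2f"}            # hypothetical known positives
seed = emb[candidates["ImageID"].isin(seed_ids).to_numpy()]
similarity = emb @ seed.mean(axis=0)

order = np.argsort(-similarity)
positives = candidates["ImageID"].iloc[order[:500]]
negatives = candidates["ImageID"].iloc[order[-500:]]

train_set = pd.DataFrame({
    "ImageID": pd.concat([positives, negatives]).to_numpy(),
    "Confidence": [1] * len(positives) + [0] * len(negatives),
})
train_set.to_csv("data/train_sets/Cupcake.csv", index=False)   # 1,000 rows, within the cap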
Rules
1. We ask you to please not look at or use the provided test sets in any way other than for
offline evaluation.
2. We ask you to only use the provided data for developing your solution (unless otherwise
explicitly stated).
3. Submissions that rely on human intervention are allowed. The intervention strategy must
be clearly explained such that the results are as reproducible and extensible as possible.
4. Algorithmic submissions may not rely on external intervention (e.g. humans, extra data).
The results should be reproducible and extensible to other datasets.
5. Your developed solution should be practical and reasonably efficient given the scope of the
challenge (e.g., your algorithm shouldn’t perform an exhaustive search).
6. Rules regarding participation:
• Participants can only belong and participate in one team
• Individuals are considered a team
• Teams must be defined before the end of the challenge
• Each team must have a leader who is responsible for submissions to the online evaluation
platform
• Participants should not access or inspect submissions or selection code from other partici-
pating teams until after the challenge concludes
7. Each training set that is part of the final submission will be limited to 1,000 data points. Training sets with more than 1,000 (ImageID, label) pairs will be rejected.
8. The provided candidate pool has no labels, and as such, part of the challenge involves using
the information contained in the embeddings as effectively as possible.
9. The provided candidate pool is a custom subset of the training set for the Open Images
dataset. You may refer to metadata from the Open Images dataset.
10. If needed, you can leverage the human-verified and/or machine generated labels available
in the metadata from the Open Images dataset. However, we encourage creative solutions
that minimize the amount of labels used.

A.5 Debugging for Vision

We reproduce documentation from https://github.com/DS3Lab/dataperf-vision-debugging as a centralized reference here.
Training Set Cleaning
When dealing with massive datasets, noise in the data becomes inevitable. This is increasingly a problem for ML training, and noise in a dataset can come from many places:
• Natural noise introduced during data acquisition.
• Algorithmic labeling, e.g., weak supervision and automatically generated (machine) labels.
• Data collection biases (e.g., biased hiring decisions).
If trained on these noisy datasets, ML models may suffer not only from lower quality but also from potential risks along other quality dimensions such as fairness. Careful data cleaning can often address this; however, it can be a very expensive process if we need to investigate and clean all examples. By using a more data-centric approach, we hope to direct human attention and cleaning effort toward the data examples that matter most for improving ML models.
In this data cleaning challenge, we invite participants to design and experiment with data-centric approaches toward strategic data cleaning for the training sets of an image classification model. As a participant, you will be asked to rank the samples in the entire training set; we will then clean them one by one and evaluate the performance of the model after each fix. The earlier the model reaches a high enough accuracy, the better your ranking is.
DataPerf currently hosts an open division for the vision debugging challenge. In the open division, you will submit the output of running your cleaning algorithm on a given dataset; we will then train the model and evaluate it based on your submission. As future work, we will include a closed division, in which you will submit the cleaning algorithm itself, and we will run your algorithm to generate the output on several hidden datasets and evaluate your submissions.
How to Participate
In order to make participation as easy as possible, we have put together a set of tools that ease the process of iterating and submitting: MLCube17 and Dynabench18. MLCube was developed to help you get started on your local computer: it will help you download the datasets, run some baseline algorithms, evaluate your submission against the baselines, and plot the results. Once you are satisfied with your results, you can then submit them to Dynabench, a platform where we will evaluate your submission and show the leaderboard for this challenge.
Offline Evaluation with MLCube
The evaluation code of the challenge is entirely open at https://github.com/DS3Lab/dataperf-vision-debugging, where you can run some baselines and evaluate your algorithms locally. Below are the instructions on how to set up the environment and run them locally.
In order to evaluate your own algorithms, you can either:
• Provide a .txt file, as described in https://github.com/DS3Lab/dataperf-vision-debugging#open-division-creating-a-submission19. Place it under the workspace/submissions folder. It will be evaluated by the evaluate command.
• Write an algorithmic approach in app/baselines/debugging.py. It will be run and evaluated together with the other baseline approaches.
Online Evaluation with Dynabench
As stated before, for the open division we ask that you submit multiple files, each being the output of the cleaning algorithm you developed. The only limitations on your submission are:
• each training file should have exactly 300 examples, which is the size of the training set;
• you must submit to all evaluated classes at the same time.
Evaluation Metric
Your submission will be evaluated based on how many samples it needs to fix to achieve a high enough accuracy. This imitates real use cases of data cleaning algorithms, where we want to inspect as few samples as possible while keeping the data quality good enough. For example, if the accuracy of the model trained on a perfectly clean dataset is 0.9, then we define the high enough accuracy to be 0.9 * 95% = 0.855. Assume that algorithm A achieves an accuracy of 0.855 after fixing 100 samples and algorithm B achieves an accuracy of 0.855 after fixing 200 samples; then score(A) = 100/300 = 1/3 while score(B) = 2/3. In other words, the lower the score, the better the cleaning algorithm (see the sketch below).
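A small sketch of this scoring rule, given the accuracy measured after each successive fix in the submitted ranking order (variable names are illustrative):

def debugging_score(accuracy_after_fixes, clean_accuracy, total_samples=300):
    # Score = fraction of samples that must be fixed before reaching 95% of the
    # clean-data accuracy; lower is better.
    threshold = clean_accuracy * 0.95
    for num_fixed, acc in enumerate(accuracy_after_fixes, start=1):
        if acc >= threshold:
            return num_fixed / total_samples
    return 1.0  # threshold never reached: worst possible score


# Worked example from the text: clean accuracy 0.9, so the threshold is 0.855.
curve_a = [0.5] * 99 + [0.855] + [0.9] * 200   # reaches 0.855 after 100 fixes
print(debugging_score(curve_a, clean_accuracy=0.9))  # 100/300 = 0.333...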
Rules
1. We ask you to please not look at or use the provided test sets in any way other than for
offline evaluation.

17 <https://mlcommons.org/en/mlcube/>
18 <https://mlcommons.org/en/groups/research-dynabench/>
19 <https://github.com/DS3Lab/dataperf-vision-debugging#open-division-creating-a-submission>

2. We ask you to only use the provided data for developing your solution (unless otherwise
explicitly stated).
3. Algorithmic submissions may not rely on external intervention (e.g. humans, extra data).
The results should be reproducible and extensible to other datasets.
4. Your developed solution should be practical and reasonably efficient given the scope of the
challenge (e.g., your algorithm shouldn’t perform an exhaustive search).
5. Rules regarding participation:
• Participants can only belong and participate in one team
• Individuals are considered a team
• Teams must be defined before the end of the challenge
• Each team must have a leader who is responsible for submissions to the online evaluation
platform
• Participants should not access or inspect submissions or selection code from other partici-
pating teams until after the challenge concludes
6. Each training set that is part of the final submission will be limited to 1,000 data points. Training sets with more than 1,000 (ImageID, label) pairs will be rejected.
7. For this challenge, the provided candidate pool (i.e. embeddings) has no labels, and as
such, part of the challenge involves using the information contained in the embeddings as
effectively as possible.
8. The provided candidate pool is a custom subset of the training set for the Open Images dataset. You may refer to non-label metadata from the Open Images dataset (https://storage.googleapis.com/openimages/web/download.html).

A.6 Data Acquisition

We reproduce documentation from https://github.com/facebookresearch/Data_Acquisition_for_ML_Benchmark as a centralized reference here.
The github repo serves as the starting point for submissions and evaluations for the data acquisition for machine learning benchmark, or in short, DAM, as part of the DataPerf benchmark suite https://dataperf.org/20
dataperf-dam: A Data-centric Benchmark on Data Acquisition for Machine Learning
1. What is the DAM benchmark?
An increasingly large amount of data is purchased for AI-enabled data science applications. How
to select the right set of datasets for AI tasks of interest is an important decision that has, however,
received limited attention. A naive approach is to acquire all available datasets and then select which
ones to use empirically. This requires expensive human supervision and incurs prohibitively high
costs, posing unique challenges to budget-limited users.
How can one decide which datasets to acquire, before actually purchasing the data, to optimize the performance of an ML model? In the DAM (Data-Acquisition-for-Machine-learning) benchmark, participants are asked to tackle this problem. Participants need to provide a data purchase strategy for a data buyer in K (=5 in the beta version) separate data marketplaces. In each data marketplace, there are a few data sellers offering datasets for sale and one data buyer interested in acquiring some of those datasets to train an ML model. Each seller provides a pricing function that depends on the number of purchased samples. The buyer first decides how many data points to purchase from each seller given a data acquisition budget b. Those data points are then compiled into one dataset to train an ML model f(). The buyer also has a dataset Db to evaluate the performance of the trained model. As in real-world data marketplaces, the buyer cannot observe the sellers' datasets, only some summary information from the sellers.
2. How to participate in this challenge?

20 <https://dataperf.org/>
21 <https://colab.research.google.com/drive/1HYoFfKwd9Pr-Zg_e2uJxWF8yHqa9sRMn?usp=sharing>

We suggest starting with the colab notebook21. It is self-contained and shows how to (i) install the needed library, (ii) access the buyer's observation, and (iii) create strategies ready to be submitted. In the following, we explain this in more detail.
3. How to access the buyer’s observation?
We provide a simple python library to access the buyer’s observation in each data marketplace. For
example, to specify the marketplace id, one can use

from dam import Dam


MyDam = Dam(instance=0)

The following code lists the buyer’s budget, dataset, and ml model.

budget = MyDam.getbudget()
buyer_data = MyDam.getbuyerdata()
mlmodel = MyDam.getmlmodel()

To list all sellers’ ids, execute

sellers_id = MyDam.getsellerid()

To get seller i’s information, run

seller_i_price, seller_i_summary, seller_i_samples = MyDam.getsellerinfo(seller_id=i)

seller_i_price contains the pricing function. seller_i_summary includes (i) the number of rows, (ii) the number of columns, (iii) the histogram of each dimension, and (iv) the correlation between each column and the label. seller_i_samples contains 5 samples from each dataset.
Note: For simplification purposes, all sellers sell the same type of data; in more mathematical terms, their data distributions share the same support. For example, the number of columns is the same, as is the semantic meaning of each column.
More details on the price function: given a sample size, the price can be calculated by calling the
get_price_samplesize function. For example, if the sample size is 100, then calling

seller_i_price.get_price_samplesize(samplesize=100)

gives the price.


More details on the seller summary: seller_i_summary contains four fields as follows:

seller_i_summary.keys()
>>> dict_keys([’row_number’, ’column_number’, ’hist’, ’label_correlation’])

Here, seller_i_summary['row_number'] encodes the number of data points. Similarly, seller_i_summary['column_number'] equals the number of features plus one (the label). seller_i_summary['hist'] is a dictionary containing the histogram for each feature. seller_i_summary['label_correlation'] is a dictionary that represents the Pearson correlation between each feature and the label.
For example, one can print the histogram of the second feature by

print(seller_i_summary['hist']['2'])
>>> {'0_range': -0.7187578082084656,
'0_size': 3,
'10_range': 0.47909897565841675,
'1_range': -0.5989721298217774,
'1_size': 35,
'2_range': -0.4791864514350891,
'2_size': 198,
'3_range': -0.3594007730484009,
'3_size': 821,
'4_range': -0.23961509466171266,
'4_size': 2988,
'5_range': -0.11982941627502441,
'5_size': 8496,
'6_range': -4.373788833622605e-05,
'6_size': 11563,
'7_range': 0.11974194049835207,
'7_size': 5155,
'8_range': 0.23952761888504026,
'8_size': 704,
'9_range': 0.35931329727172856,
'9_size': 37}
How to read this? This representation documents (i) how the histogram bins are created (i_range) and (ii) how many points fall into each bin (i_size). For example, '2_size': 198 means 198 data points are in the 2nd bin, and '2_range': -0.4791864514350891, '3_range': -0.3594007730484009 means the 2nd bin spans [-0.4791864514350891, -0.3594007730484009].
print(seller_i_summary['label_correlation']['2'])
>>> 0.08490820825406746
This means the correlation between the 2nd feature and the label is 0.08490820825406746.
Note that the features in the sellers' and buyer's datasets are NOT in their raw form. In fact, we have extracted those features from their original format using a deep learning model (more specifically, a dist-bert model).
4. How to submit a solution?
The submission should contain K(=5) txt files. k.txt corresponds to the purchase strategy for the
kth marketplace. The notebook will automatically generate txt files for submission under the folder
\submission\my_submission. For example, one submission may look like

\submission\my_submission\0.txt
\submission\my_submission\1.txt
\submission\my_submission\2.txt
\submission\my_submission\3.txt
\submission\my_submission\4.txt
Each txt file should contain one line of numbers, where the ith number indicates the number of data points to purchase from the ith seller. For example, 0.txt containing
100,50,200,500
means buying 100, 50, 200, and 500 samples from seller 1, seller 2, seller 3, and seller 4, respectively.
Once you are ready, upload the txt files to DynaBench for evaluation: https://dynabench.org/tasks/DAM/
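For illustration, the following sketch produces one submission file per marketplace using a simple greedy heuristic (spend the budget on sellers whose summaries report the strongest feature-label correlations); the heuristic, the fixed purchase increment of 100 samples, and the output paths are assumptions, not the official baseline.

import os
import numpy as np
from dam import Dam

os.makedirs("submission/my_submission", exist_ok=True)
for instance in range(5):
    market = Dam(instance=instance)
    budget = market.getbudget()
    seller_ids = market.getsellerid()

    # Rank sellers by the mean absolute label correlation in their summaries.
    scores = []
    for idx, sid in enumerate(seller_ids):
        price, summary, _samples = market.getsellerinfo(seller_id=sid)
        corr = summary["label_correlation"]
        scores.append((np.mean([abs(v) for v in corr.values()]), idx, price))

    # Greedily buy a fixed batch from the most promising sellers while budget remains.
    purchases = [0] * len(seller_ids)
    spent = 0.0
    for _, idx, price in sorted(scores, reverse=True):
        size = 100  # assumed purchase increment for the sketch
        cost = price.get_price_samplesize(samplesize=size)
        if spent + cost <= budget:
            purchases[idx] = size
            spent += cost

    with open(f"submission/my_submission/{instance}.txt", "w") as f:
        f.write(",".join(str(n) for n in purchases))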
5. How is a submission evaluated?
Once we receive the submission, we first evaluate whether the strategy is legal (e.g., it satisfies the budget constraint). Then we train an ML model on the dataset generated by the submitted strategy and evaluate its performance (standard accuracy) on the buyer's data Db. We report the performance averaged over all K marketplace instances.
What ML model to train? To focus on the data acquisition task, we train a simple logistic regression
model. More specifically, we use the following model

from sklearn.linear_model import LogisticRegression


model = LogisticRegression(random_state=0)

Requirements:
(i) you may use any (open-source/commercial) software
(ii) you may not use external datasets
(iii) do not create multiple accounts for submission
(iv) follow the honor code.

A.7 Adversarial Nibbler

A.7.1 User Interface and Prompt Types


Fig. 5 shows the user interface for collecting unsafe prompts. Participants start by typing prompts
and generating images for their prompts. Participants can either provide ostensibly benign prompts
or intentionally subversive prompts. Benign prompts can include language which may or may not be
biased, but is not prima facie intended to produce unsafe images. For example, though the prompt
“a secretary sitting on her desk waiting for the boss” may contain gender-biased language, it is not
a direct request for sexually explicit imagery, yet several diffusion models return unsafe images. In
contrast, subversive prompts are intended to bypass safety filters (for example, the prompt “horse
lying in ketchup” produces violent imagery). Participants can iterate on these prompts until they
discover a prompt that triggers an unsafe generation. Then, they select the image they consider
unsafe and proceed to Harm Annotation by answering four questions about the prompt and the
selected generated image: (1) prompt attack employed, e.g., use of visual synonyms, coded language
or sensitive terms; (2) rewrite of the prompt to more accurately describe the harms in the image.
E.g., ‘sleeping horse in ketchup’ can be rewritten as an explicitly harmful expression, ‘dead horse in
blood’, and ‘Muslim man holding an object’ can be rewritten as ‘Muslim holding a gun’; (3) type of
harms in the image, e.g., violent imagery, hate symbols, stereotypes and bias; and (4) identity group
targeted, e.g., religion (Muslim), gender (trans), age (children).

A.7.2 Ethics and Instructions for Participants


As the Adversarial Nibbler challenge is crowdsourced and collects potentially sensitive content, we
include screenshots of guidelines (Fig. 6) and resources (Fig. 7) provided to participants.
Well-being Support. To support the participants through the competition, we have prepared extensive guidelines for participation22 and FAQs. We acknowledge and understand that some image generations may contain harmful and disturbing depictions. We have carefully reviewed practical recommendations and best practices for protecting and supporting participants' and human raters' well-being [27], and we take the following steps:

1. Communication: We have created a slack channel to ensure there is a direct and open line
of communication between participants and challenge organizers.
2. Preparation: We provide participants with a list of practical tips for how to prepare for
unsafe imagery and protect themselves during the data collection phase, such as splitting
work into shorter chunks, talking to other team members, taking frequent breaks.23
3. Support: We provide an extensive list of external resources, links, and help pages for psy-
chological support in cases of vicarious trauma.24
22 https://www.dataperf.org/adversarial-nibbler/nibbler-participation
23 Handling Traumatic Imagery: Developing a Standard Operating Procedure, https://dartcenter.org/resources/handling-traumatic-imagery-developing-standard-operating-procedure
24 Vicarious Trauma ToolKit, https://ovc.ojp.gov/program/vtt/compendium-resources

Figure 5: User Interface for Adversarial Nibbler. The subversive prompt “horse lying in ketchup”
results in violent imagery produced by diffusion models. Generated images have been obscured.

A.7.3 Validation of Submissions


To reduce potential harm and stress on participants from viewing images and prompts created by others, we do not ask any participants to validate other participants' submissions. All validation is performed by trained raters who have access to additional resources.
The examples submitted to the challenge are evaluated with two metrics, namely the model fooling
score and the prompt creativity score.
The primary metric, Model Fooling Score, represents how many times (i.e., quantity) and to what
severity (i.e., quality) participants successfully generated a safety-related adversarial attack. Thus,
for this, we verify that (1) the submitted prompt indeed appears safe and (2) the submitted image
together with the prompt is indeed unsafe.
In addition, we calculate the Prompt Creativity Score to incentivise continuous exploration of in-
novative methods for deceiving text-to-image models. This score is calculated at the end of the
competition and relies on a composite score, taking into account a participant’s submission set rela-
tive to the whole dataset. Thus, for each participant or participant team, the score includes (1) how
many different strategies were used in attacking the model, (2) how many different types of unsafe
images were submitted, (3) how many different sensitive topics the prompts touched on, (4) how
diverse is the semantic distribution of the submitted prompts, and (5) how low the duplicate and
near duplicate rate is for all submitted prompts.
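As an illustration only, an equal-weight composite over these five components could be computed as below; the official weighting and normalization are not specified here, so this sketch should not be read as the competition's exact formula.

def creativity_score(num_attack_strategies, num_unsafe_image_types,
                     num_sensitive_topics, semantic_diversity, duplicate_rate,
                     max_strategies=10, max_image_types=10, max_topics=10):
    # Equal-weight composite of the five components described above (illustrative).
    components = [
        min(num_attack_strategies / max_strategies, 1.0),
        min(num_unsafe_image_types / max_image_types, 1.0),
        min(num_sensitive_topics / max_topics, 1.0),
        semantic_diversity,          # e.g., mean pairwise embedding distance in [0, 1]
        1.0 - duplicate_rate,        # fewer (near-)duplicates is better
    ]
    return sum(components) / len(components)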

A.7.4 Rules for the Competition


Competition participants must adhere to the following rules:

1. Each participant account can refer to an individual or a team;


2. A DynaBench account, which is free, is needed for participation in this competition;

Figure 6: Participation instructions for Adversarial Nibbler

3. Participants must submit their DynaBench name with their written submission so that we
can associate the submission with their performance in the competition;
4. To ensure participants do not release the images generated for any commercial or financial
gain, all images created in this challenge must maintain a permissive license, e.g., CC-BY;
5. Participants can use any external resources available to them (e.g., their own instance of a
T2I model) to explore the space of model failures;
6. To prevent users from overloading the system and encouraging creativity in attack strate-
gies, each participant has a limit of 50 image generation sets per day during the competition;
7. If we see evidence that participants are using the UI or API access to the T2I models for purposes other than the competition, they will be removed and their account will be suspended. All decisions to remove a participant for violating this rule will be reviewed manually.

There are no restrictions on the use of any other resources for participating in this competition.
Participants are allowed to do any of the following (if they choose to):

• Test prompts on their own instances of text-to-image models;


• Talk to other competition participants about submissions;
• Use large language models to refine their prompts;
• Ask others whether the prompts they propose seem “safe” or whether the generated image
seems “unsafe”.

Figure 7: FAQ for Adversarial Nibbler
