DataPerf: Benchmarks for Data-Centric AI Development
Mark Mazumder1 , Colby Banbury1 , Xiaozhe Yao2 , Bojan Karlaš2 , William Gaviria Rojas3 ,
Sudnya Diamos3 , Greg Diamos4 , Lynn He5 , Alicia Parrish 8 , Hannah Rose Kirk17 , Jessica Quaye1 ,
Charvi Rastogi11 , Douwe Kiela9,21 , David Jurado6,20 , David Kanter6 , Rafael Mosquera6,20 ,
Juan Ciro6,20 , Lora Aroyo8 , Bilge Acun7 , Lingjiao Chen9 , Mehul Smriti Raje3 , Max Bartolo16,19 ,
Sabri Eyuboglu9 , Amirata Ghorbani9 , Emmett Goodman9 , Oana Inel18 , Tariq Kane3,8 ,
Christine R. Kirkpatrick10 , Tzu-Sheng Kuo11 , Jonas Mueller12 , Tristan Thrush9 ,
Joaquin Vanschoren13 , Margaret Warren14 , Adina Williams7 , Serena Yeung9 , Newsha Ardalani7 ,
Praveen Paritosh6 , Ce Zhang2 , James Zou9 , Carole-Jean Wu7 , Cody Coleman3 , Andrew Ng4,5,9 ,
Peter Mattson8 , and Vijay Janapa Reddi1
1 Harvard University, 2 ETH Zurich, 3 Coactive.AI, 4 Landing AI, 5 DeepLearning.AI, 6 MLCommons, 7 Meta, 8 Google, 9 Stanford University, 10 San Diego Supercomputer Center, UC San Diego, 11 Carnegie Mellon University, 12 Cleanlab, 13 Eindhoven University of Technology, 14 Institute for Human and Machine Cognition, 15 Kaggle, 16 Cohere, 17 University of Oxford, 18 University of Zurich, 19 University College London, 20 Factored, 21 Contextual AI
Abstract
Machine learning research has long focused on models rather than datasets, and
prominent datasets are used for common ML tasks without regard to the breadth,
difficulty, and faithfulness of the underlying problems. Neglecting the fundamen-
tal importance of data has given rise to inaccuracy, bias, and fragility in real-world
applications, and research is hindered by saturation across existing dataset bench-
marks. In response, we present DataPerf, a community-led benchmark suite for
evaluating ML datasets and data-centric algorithms. We aim to foster innovation
in data-centric AI through competition, comparability, and reproducibility. We
enable the ML community to iterate on datasets, instead of just architectures, and
we provide an open, online platform with multiple rounds of challenges to support
this iterative development. The first iteration of DataPerf contains five benchmarks
covering a wide spectrum of data-centric techniques, tasks, and modalities in vi-
sion, speech, acquisition, debugging, and diffusion prompting, and we support
hosting new contributed benchmarks from the community. The benchmarks, on-
line evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry.
1 Introduction
Machine learning research has overwhelmingly focused on improving models rather than on im-
proving datasets. Large public datasets such as ImageNet [14], Freebase [7], Switchboard [22], and
SQuAD [44] serve as compasses for benchmarking model performance. Consequently, researchers
eagerly adopt the largest existing dataset without fully considering its breadth, difficulty and fidelity
to the underlying problem. Critically, better data quality [2] is increasingly necessary to improve
generalization, avoid bias, and aid safety in data cascades [48]. Without high-quality training data,
models can exhibit performance discrepancies, leading to reduced accuracy and persistent fairness
[Figure 1 diagram: data-centric operations (data parsing, data quality assessment, data augmentation, data acquisition, data cleaning, data debugging, and data selection) produce a new train set and new test set from the training data, testing data, and other data sources feeding ML model training, with repeated data-centric iterations.]
Figure 1: Typical benchmarks are model-centric, and therefore focus on the model design and train-
ing stages of the ML pipeline (shown in orange). However, to develop high-quality ML applications,
users often employ a collection of data-centric operations to improve data quality and repeated data-
centric iterations to refine these operations. DataPerf aims to benchmark all major stages of such a
data-centric pipeline (shown in green) to improve ML data quality.
issues [9, 15, 37] once they leave the lab to enter service. In conventional model-centric ML, the
term benchmark often means a standard, fixed dataset for model accuracy comparisons and per-
formance measurements. While this paradigm has been useful for advancing model design, these
benchmarks are now saturating (attaining perfect or above “human-level” performance) [26]. This
raises two questions: First, is ML research making real progress on the underlying capabilities, or
is it just overfitting to existing benchmark datasets or suffering from data artifacts? A growing body
of literature explores the evidence supporting benchmark limitations [57, 24, 43, 53, 47, 5, 21, 55].
Second, how should benchmarks evolve to push the frontier of ML research?
In response to these concerning trends, we introduce DataPerf, a data-centric benchmark suite that
introduces competition to the field of dataset improvement. We survey a suite of complex data-
centric development pipelines across multiple ML domains and isolate a subset of concrete tasks
that we believe are representative of current bottlenecks, as illustrated in Figure 1. We freeze model
architectures, training hyperparameters, and task metrics to compare solutions strictly via relative
improvements from changes to the datasets themselves.
Our contributions are a suite of five data-centric benchmarks spanning training-set selection for vision and speech, data acquisition, data debugging, and text-to-image prompting; an open online platform for hosting the associated leaderboards and challenges; and open-source baseline implementations for each task.
Critically, DataPerf is not a one-off competition. We have established the DataPerf Working Group,
which operates under the MLCommons Association. This working group is responsible for the
ongoing maintenance of the benchmarks and platform, as well as for fostering the development of
data-centric research and methodologies in both academic and industrial domains. The aim is to
ensure the long-term sustainability and growth of DataPerf beyond a single competition.
The remainder of the paper is organized as follows. In Section 2.1, we review lessons learned from
an exploratory data-centric challenge. Section 2.2 details the hosting platform we developed in
response and Section 2.3 presents the DataPerf suite of five novel benchmarks and challenges. We
conclude with a survey of related efforts (Section 3) and future directions (Section 5).
2 DataPerf Benchmarking Suite
We describe the initial challenge which inspired the suite of DataPerf benchmarks and identified
which features are needed for hosting data-centric challenges online. We then describe the plat-
form that enables flexible data-centric benchmarking at scale. Finally, we share the initial DataPerf
benchmark definitions in vision, speech, acquisition, debugging, and text-to-image prompting.
2.1 Lessons from an Exploratory Data-Centric Challenge
The DataPerf effort began with an early benchmark which served to validate feasibility and provide
real-world insights into the concept of dataset benchmarking. In traditional ML challenges,
contestants must train a high-accuracy model given a fixed dataset. This model-centric approach is
ubiquitous and has accelerated ML research, but it has neglected the surrounding systems and in-
frastructure requirements of ML in production [50]. To draw more attention to other areas of the ML
pipeline, we created the Data-Centric AI (DCAI) competition [39], inviting competitors to focus on
optimizing accuracy by improving a dataset given a fixed model architecture, thus flipping the con-
ventional challenge format of submitting different models which are evaluated on a fixed dataset.
The only constraint was the size of the submitted dataset: submitters received an initial
training dataset to improve through data-centric strategies such as removing inaccurate labels,
adding instances that illustrate edge cases, and using data augmentation. The competition, inspired
by MNIST, focused on classification of Roman-numeral digits. Just by iterating on the dataset, par-
ticipants increased the baseline accuracy from 64.4% to 85.8%; human-level performance (HLP)
was 90.2%. We learned several lessons from the 2,500 submissions and applied them to DataPerf:
1. Common data pipelines. Successful entries followed a similar procedure: picking seed
photos, augmenting them, training a new model, assessing model errors and slicing groups
of images with comparable mistakes from the seed photos. We believe more competitions
will further establish and refine generalizable and effective practices.
2. Automated methods won. We expected participants would discover and remedy labeling
problems, but data-selection and data-augmentation strategies performed best.
3. Novel dataset optimizations. Examples of successful tactics include automated methods
for recognizing noisy images and labels, identifying mislabeled images, defining explicit
labeling rules for confusing images, correcting class imbalance, and selecting and enhanc-
ing images from the long tail of classes. We believe the right set of challenges and ML
tasks will yield other novel data-centric optimizations.
4. New methods emerged. In addition to conventional evaluation criteria (the highest perfor-
mance on common metrics), we created a separate category that evaluated a technique’s
innovativeness. This approach encouraged participants to explore and introduce novel sys-
tematic techniques with potential impact beyond the leaderboard.
5. New supporting infrastructure is necessary. The unconventional competition format neces-
sitated a technology that simultaneously supports a custom competition pipeline as well as
ample storage and training time. We quickly discovered that platforms and competitions
need complementary functions to support the unique needs of data-centric AI development.
Moreover, the competition was computationally expensive. Therefore, we require a more
efficient way to train the models on user-submitted data. Computational power, memory
and bandwidth are all major limitations.
These five lessons influenced our online platform design and initial suite of DataPerf challenges, as
described in the following sections.
2.2 Online Evaluation Platform
DataPerf provides an online platform where challenge participants can submit their solutions for
evaluation, and a working group which invites members in academia and industry to propose new
data-centric benchmarks for inclusion in the DataPerf suite. The DataPerf benchmarks, evaluation
tools, leaderboards, and documentation are hosted on an online platform called Dynabench [26]
(https://dynabench.org), which allows challenge participants to submit, evaluate, and compare solutions for all data-centric
benchmarks defined in Section 2.3. The DataPerf benchmarks and the Dynabench platform are open-
source, and are hosted and maintained by the MLCommons Association (https://www.mlcommons.org), a nonprofit organization
supported by more than 50 member companies and academics, ensuring long-term availability and
benefit to the community.
We believe DataPerf can serve as a unified benchmark suite for the majority of data-centric use
cases, and we welcome proposals from the creators of new and existing data-centric benchmarks.
Our five current benchmarks are also intended to serve as representative examples for future authors
to host their own challenges on DataPerf, with customized modular submission pipelines for dif-
ferent data modalities and submission artifact types. DataPerf introduces three key extensions to
the Dynabench codebase to support data-centric benchmarks: (1) We add support for a wide variety
of submission artifacts, such as training subsets, priority values/orderings, and purchase strategies.
Users can also submit fully containerized systems as artifacts, such as in the debugging challenge.
(2) To support a diverse set of evaluation algorithms and scoring metrics, we develop modular soft-
ware adaptors to allow for running custom benchmark evaluation tools and displaying or querying
scores in Dynabench’s online leaderboards. (3) DataPerf utilizes serverless [4] deployment which
dynamically scales resources based on demand, ensuring optimal performance and efficient resource
allocation, and allowing the platform to automatically scale with the growth of the benchmark suite
and the number of participants. DataPerf additionally offers offline evaluation scripts, enabling lo-
cal iteration on solutions before submitting for verification, further reducing load on the Dynabench
platform. These improvements to Dynabench ensure DataPerf can accommodate a large suite of
community-contributed data-centric challenges in the future.
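To illustrate extension (2), the sketch below shows the kind of minimal interface a per-benchmark scoring adaptor could expose; the class and method names are hypothetical illustrations and do not correspond to the actual Dynabench code.

# Hypothetical sketch of a per-benchmark scoring adaptor (illustrative names
# only; not Dynabench's actual API). Each benchmark supplies its own
# evaluation logic and returns metrics that the platform can display on or
# query from its leaderboards.
from abc import ABC, abstractmethod
from typing import Dict

class ScoringAdaptor(ABC):
    @abstractmethod
    def evaluate(self, submission_path: str) -> Dict[str, float]:
        """Run this benchmark's evaluation tool on a submitted artifact
        (e.g., a training subset, priority ordering, purchase strategy, or
        container) and return leaderboard metrics such as {"macro_f1": 0.72}."""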
2.3 Benchmarks and Challenges
DataPerf uses leaderboards and challenges to encourage constructive competition and inspire ad-
vances in building and optimizing datasets. In this section, we clarify DataPerf’s terminology. A
leaderboard is a public summary of benchmark results; it helps to quickly identify state-of-the-art
approaches. A challenge is a public contest to achieve the best result on a leaderboard in a fixed
timeframe. Challenges motivate rapid progress through recognition and awards. Our leaderboards
and challenges are hosted on the online platform Dynabench (Section 2.2) developed and supported
by MLCommons. Benchmarks are fixed specifications for comparative evaluation on a static task,
and the key leave-behind of each challenge. MLCommons will provide long-term support for each
benchmark through leaderboards which remain open for submission and comparison once a chal-
lenge concludes. Each challenge also provides a baseline implementation to set a minimum bar for
each leaderboard metric and to discourage uninformative or random submissions.
DataPerf’s initial suite consists of tasks in training set selection for speech and vision, data cleaning
and debugging, data acquisition, and generative model prompting. Figure 1 depicts underserved
components in benchmarking machine learning pipelines, and these five tasks were selected by the
DataPerf working group among the initial proposals for challenges in order to cover as many of these
components as possible while also exercising the infrastructure requirements for our online platform.
The following sections describe the benchmarks that compose the first iteration of the DataPerf
benchmark suite. Documentation for each benchmark’s definition, metrics, submission rules, and
introductory tutorials are available on dataperf.org and reproduced in our Appendix, and our open-
source baseline implementations are available at https://github.com/MLCommons/dataperf.
2.3.1 Selection for Speech
Figure 2: System design and component ownership for the speech selection benchmark.
Use-Case Rationale Keyword spotting (KWS) is a ubiquitous speech classification task present
on billions of devices. A KWS model detects a limited vocabulary of spoken words. Production
examples include the wakeword interfaces for Google Voice Assistant, Siri and Alexa. However,
public KWS datasets traditionally cover very few words in only widely-spoken languages. In
contrast, the Multilingual Spoken Words Corpus [35] (MSWC), is a large dataset of over 340,000
spoken words in 50 languages (collectively, these languages represent more than five billion people).
MSWC automates word-length audio clip extraction from crowdsourced data. Due to errors in
the generation process and source data, some samples are incorrect. For instance, they may miss
part of the target sample (e.g., “weathe-” instead of “weather”) or may contain part of an adjacent
word (e.g., “time to” instead of “time”). This benchmark focuses on estimating the quality of each
automatically-generated sample in KWS training pipelines intended for low-resource languages.
Additionally, this benchmark establishes the DataPerf platform’s capabilities for hosting speech
challenges in multiple languages.
Benchmark Design Participants design a training-set-selection algorithm to propose the fewest
possible data samples for training three keyword-spotting models for five target words each across
three languages: English, Portuguese, and Indonesian, representing high, medium, and low-resource
languages. The benchmark evaluates the algorithm on the mean F1 score of each evaluation set
(additional details in Appendix A.3). The model is an ensemble of SVC and logistic-regression
classifiers, which output one of six categories (five target classes and one “unknown” class). The
inputs to the classifier are 1,024-dimensional vectors of embedding representations from a pretrained
keyword feature extractor [34]. Participants may only specify which training samples are used by the
model; all other configuration parameters are fixed, thereby emphasizing the importance of selecting
the most informative samples. For each language there are separate leaderboards for submissions
with ≤ 25 samples or ≤ 60 samples, evaluating the algorithm’s sensitivity to the training set size.
Participants are given a tutorial baseline which uses cross-fold validation in a Google Colab note-
book and an offline copy of the evaluation pipeline, for ease of setup and rapid experimentation.
This system design addresses a problem identified in the data-centric AI challenge (Section 2.1):
enabling offline development reduces the computational requirements for online evaluation, though
participants must agree to challenge rules on not inspecting the evaluation set. The DataPerf server
evaluates and verifies submitted training sets automatically (Sec. 2.2) for inclusion in the live leader-
board. Figure 2 illustrates the speech-selection benchmark workflow.
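For concreteness, the sketch below shows how a selected coreset of embedding vectors can be scored with a frozen ensemble of an SVC and a logistic-regression classifier using macro F1. It is a simplified stand-in for the official evaluation pipeline; the ensemble configuration and variable names are illustrative assumptions rather than the exact settings in eval.py.

# Simplified stand-in for the fixed speech-selection evaluation: train the
# frozen SVC + logistic-regression ensemble on a selected coreset of
# 1,024-dimensional MSWC embeddings (six classes: five targets + "unknown")
# and report macro F1. Exact ensemble settings in eval.py may differ.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def evaluate_selection(X_train, y_train, X_eval, y_eval, seed=0):
    model = VotingClassifier(
        estimators=[
            ("svc", SVC(probability=True, random_state=seed)),
            ("lr", LogisticRegression(max_iter=1000, random_state=seed)),
        ],
        voting="soft",
    )
    model.fit(X_train, y_train)
    return f1_score(y_eval, model.predict(X_eval), average="macro")

As in the benchmark, the reported score would be averaged over several random seeds, e.g. over evaluate_selection(...) for seeds 0 through 9.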
Baseline Results We provide two baseline implementations, nested cross-fold selection and a
data-cleaning approach using the Cleanlab framework [40]. The cross-fold selection method uses
nested cross-validation where the outer loop selects different subsets of the target samples and the
inner loop selects different subsets of the nontarget samples, and the best performing subsets are
reported back as the selected training set. The Cleanlab method rejects outliers using out-of-sample
predicted probability estimates for each candidate sample (also computed via cross-validated mod-
els). All baseline scores are averaged across 10 random seeds.
Table 1: Baseline results (macro F1 scores) for the Selection for Speech challenge.
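The idea behind the Cleanlab-style baseline can be illustrated with plain scikit-learn: estimate an out-of-sample probability for each candidate's own label via cross-validation and drop low-probability candidates before selection. This is a simplified illustration of the approach described above, not the Cleanlab implementation; the model and threshold are illustrative.

# Illustrative out-of-sample-probability filter in the spirit of the Cleanlab
# baseline (not the Cleanlab library itself). X holds candidate embeddings,
# y their labels; candidates whose own label receives low held-out
# probability are treated as likely outliers and excluded from selection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def filter_candidates(X, y, threshold=0.5):
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )
    classes = list(np.unique(y))  # column order used by the classifier
    self_prob = pred_probs[np.arange(len(y)), [classes.index(c) for c in y]]
    return np.flatnonzero(self_prob >= threshold)  # indices of retained candidates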
2.3.2 Selection for Vision
DataPerf includes a data selection algorithm challenge with a vision-centric focus. The objective
of this task is to develop a data selection algorithm that chooses the most effective training samples
from a large candidate pool of images. The resulting training sets will then be used to train a
collection of binary classifiers for various visual concepts. The benchmark evaluates the algorithm
on the basis of the resulting models’ classification performance on the evaluation set.
Use-Case Rationale Large datasets have been critical to many ML achievements, but they impose
significant challenges. Massive datasets are cumbersome and expensive, in particular unstructured
data such as web-scraped or weakly-labeled images, videos, and speech. Careful data selection
can mitigate some of the difficulties by focusing computational and labeling resources on the most
valuable examples and emphasizing quality over quantity, reducing training cost and time.
The vision-selection-algorithm benchmark evaluates binary classification of visual concepts (e.g.,
“monster truck” or “jean jacket”) in unlabeled images. Familiar production examples of similar
models include automatic labeling services by Amazon Rekognition, Google Cloud Vision API and
Azure Cognitive Services. Successful approaches to this challenge will enable image classification
of long-tail concepts where discovery of high-value data is critical, and advance the democratiza-
tion of computer vision [20]. This benchmark demonstrates DataPerf’s support for challenges with
unlabeled image data and is a template for future benchmarks that target automatic labeling.
Benchmark Design The task is to design a data-selection strategy that chooses the best training
examples from a large pool of training images. We evaluate submissions on their ability to algorith-
mically propose a subset of the Open Images Dataset V6 training set [29] that maximizes the mean
F1-score over a set of fixed concepts (“cupcake,” “hawk” and “sushi”). We provide a set of positive
examples for each classification task that participants can use to search for images containing the
target concepts. Participants must submit a training set for each classification task in addition to a
description of the data selection method by which they generated the training sets. The challenge
platform (Sec. 2.2) automates evaluation of submissions.
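As one illustration of what a submitted training set looks like (a CSV of ImageID and binary Confidence values, matching the example file shown in the appendix), the sketch below ranks the unlabeled pool by cosine similarity to the provided positive examples in embedding space and labels the most similar images as positives and the least similar as negatives. The embedding-loading step and the 500/500 split are assumptions for illustration.

# Illustrative selection strategy for the vision challenge: score pool images
# by cosine similarity to the centroid of the provided positive examples,
# label the top matches 1 and the bottom matches 0, and write the
# (ImageID, Confidence) CSV expected by the evaluation pipeline.
import numpy as np
import pandas as pd

def build_training_set(pool_ids, pool_emb, positive_emb, out_csv,
                       n_pos=500, n_neg=500):
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    pos = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    sim = pool @ pos.mean(axis=0)          # similarity to the concept centroid
    order = np.argsort(-sim)               # most similar first
    rows = [(pool_ids[i], 1) for i in order[:n_pos]]     # pseudo-positives
    rows += [(pool_ids[i], 0) for i in order[-n_neg:]]   # pseudo-negatives
    pd.DataFrame(rows, columns=["ImageID", "Confidence"]).to_csv(out_csv, index=False)

The 1,000-point total matches the per-task submission cap described in the challenge rules.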
Baseline Results We provide three baseline results, namely, farthest point sampling, pseudo-label
generation, and modified uncertainty sampling. Farthest point sampling selects negative examples
by attempting to sample the feature search space through iterative maximum l2 distances, afterwards
returning the best coreset under nested cross-validation. Pseudo label generation trains multiple
neural networks and classical models on a subset of data to classify the remainder of points and uses
the best-performing model for coreset proposal under multiple sampling experiments. Modified
uncertainty sampling trains a binary classifier on noisy positive labels from OpenImages and uses
this classifier to assign positive and negative image pools, with the coreset randomly sampled from
both pools. For each baseline, F1 scores on the three test concepts are provided in Table 2.
Table 2: Baseline results (F1 scores) for the Selection for Vision challenge.
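A minimal version of the sampling step in the farthest-point baseline is sketched below; the full baseline additionally scores candidate coresets with nested cross-validation, which is omitted here.

# Greedy farthest point sampling over image embeddings: each step adds the
# candidate whose l2 distance to its nearest already-selected point is
# largest, spreading the selected examples across the feature space.
import numpy as np

def farthest_point_sampling(embeddings, k, seed=0):
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(embeddings.shape[0]))]   # arbitrary start
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))                    # farthest from current set
        selected.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected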
2.3.3 Debugging for Vision
Use-Case Rationale Datasets are rapidly growing in size. For instance, Open Images V6 has 59
million image-level labels. Such datasets are annotated either manually or using ML. Unfortunately,
noise is unavoidable and can originate from both human annotators and algorithms. Models trained
on noisy annotations suffer in accuracy and carry risks of bias and unfairness. Dataset cleaning is a
common approach to dealing with noisy labels. However, it is a costly and time-consuming process
that typically involves human review. Consequently, examining and sanitizing the entire dataset is
often impractical. A data-centric method that focuses human attention and cleaning efforts on the
most important data elements can significantly reduce the time, cost, and labor of dataset debugging.
This challenge demonstrates the DataPerf platform’s ability to simulate human-in-the-loop data-
centric tasks, in this case label cleaning, while remaining scalable.
Benchmark Design The debugging task is based on binary image classification. For each task,
participants receive a noisy training set (i.e., some labels are inaccurate) and a validation set with
correct labels. They must provide a debugging approach that assigns a priority value (harmfulness)
to each training-set item. During evaluation, training items are examined and rectified one at a time
in the submitted priority order; each time a new item is examined, a classification model is trained
on the dataset cleaned so far, its accuracy on a hidden test set is computed, and a score is returned.
The image sets are from the Open Images Dataset [29], with two important considerations: (1) The
number of data points should be sufficient to permit random selection of samples for the training,
validation and test sets. (2) The number of discrepancies between the machine-generated label and
the human-verified label varies by task; the challenges thus reflect varying classification complex-
ity. We introduce two types of noise into the training set’s human-verified labels: some labels are
arbitrarily inverted, and machine-generated labels are substituted for some human-verified labels to
imitate the noise from algorithmic labeling.
We use a 2,048-dimensional vector of embedding representations extracted from a pretrained
ResNet50 model [32] as the classifier’s input data. Participants may only prioritize the training
samples used by the classifier; all other configurations are fixed for all submissions. By precomputing
all embeddings, participants are encouraged to propose data-centric debugging methods for arbitrary
features rather than approaches specific to the image domain. This also removes the need for GPU
acceleration during submission evaluation.
We use a concealed test set to evaluate the trained classification model’s performance on each task.
Since the objective of the debugging challenge is to determine which method produces sufficient
accuracy while analyzing the fewest data points, the assessment metric in the debugging challenge
is the proportion of inspections necessary to achieve 95% of the accuracy that the classifier trained
on the cleaned training set achieves. We verify submissions by incrementally cleaning the data
and training a model on each step. Each submission contains a list of indices in the order that
the submitter wishes to clean. We incrementally prepare a new dataset for each cleaned sample.
For instance, assuming the submission is [5,4,3,2,1], we will prepare 5 datasets that are [5-cleaned,
4,3,2,1], [5-cleaned, 4-cleaned, 3, 2, 1], and so forth. We then train an XGBoost classifier on each
dataset and report back the step at which the test-set accuracy reaches the threshold (95% of the accuracy achieved with the fully cleaned training set).
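The evaluation loop can be summarized by the sketch below, which assumes precomputed feature vectors, the noisy and corrected label arrays, and a hidden test split; the XGBoost hyperparameters are illustrative rather than the exact settings used by the benchmark.

# Sketch of the incremental-cleaning evaluation: labels are fixed one at a
# time in the submitted priority order, a classifier is retrained after each
# fix, and the score is the fraction of inspections needed to reach 95% of
# the accuracy obtained with fully clean labels.
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

def test_accuracy(X_tr, y_tr, X_te, y_te):
    clf = XGBClassifier(n_estimators=100, max_depth=4, verbosity=0)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

def debugging_score(priority, X_train, noisy_y, clean_y, X_test, y_test):
    target = 0.95 * test_accuracy(X_train, clean_y, X_test, y_test)
    labels = noisy_y.copy()
    for step, idx in enumerate(priority, start=1):   # submitted inspection order
        labels[idx] = clean_y[idx]                   # simulate one human fix
        if test_accuracy(X_train, labels, X_test, y_test) >= target:
            return step / len(priority)              # fraction of inspections needed
    return 1.0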
Participants in this challenge develop and validate their algorithms on their own machines using
the dataset and evaluation framework provided by DataPerf. Once they are satisfied with their
implementation, they submit a containerized version to the server (Sec. 2.2). The server then reruns
the uploaded implementation on several hidden tasks and posts the average score to a leaderboard.
Baseline Results The benchmark system provides three baseline implementations: consecutive,
random, and DataScope [25], which achieve scores of 53.50, 51.75, and 15.54 respectively. In
other words, DataScope needs to fix 15.54% of data samples to achieve the threshold, consecutive
needs 53.50%, and random needs 51.75%. DataScope is a fast approximation of Shapley
values [31], which estimate the importance of each sample and the effect of noise. Because exact Shapley
values require calculating the payoff of every subset (O(2^N) evaluations), approximation techniques
such as DataScope are necessary.
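Because a submission is simply an ordering of training indices, even a naive heuristic can serve as a reference point. The sketch below (which is not DataScope) ranks samples by how little probability a model trained on the clean validation set assigns to their current labels; binary 0/1 labels are assumed.

# Naive debugging heuristic (illustrative; not the DataScope method): train a
# reference model on the clean validation set and inspect first the training
# samples whose current, possibly noisy, labels it finds least probable.
# Assumes binary labels encoded as 0/1.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prioritize(X_train, noisy_y, X_val, y_val):
    ref = LogisticRegression(max_iter=1000).fit(X_val, y_val)
    probs = ref.predict_proba(X_train)
    p_current = probs[np.arange(len(noisy_y)), noisy_y]   # prob. of the current label
    return list(np.argsort(p_current))                    # most suspicious first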
2.3.4 Data Acquisition
Figure 3: Data acquisition benchmark design. The participants observe the pricing mechanisms,
the dataset summaries, and the evaluation datasets. They then need to develop and submit the data
acquisition strategies. The evaluation is executed automatically on the DataPerf server.
Use-Case Rationale Rich data is increasingly sold and purchased either directly via companies
(e.g., Twitter [54] and Bloomberg [6]) or data marketplaces (e.g., Amazon AWS Data Exchange [1],
Databricks Marketplace [13], and TAUS Data Marketplace [52]) to train a high-quality ML model
customized for specific applications. Such datasets are often necessary because they (i)
cover underrepresented populations, (ii) offer high-quality annotations, and (iii) exhibit easy-to-use
formats. On the other hand, the datasets are also expensive due to the tremendous efforts spent
to curate and clean data samples. Content opacity is therefore ubiquitous: data sellers usually are
disinclined to release the full content of their datasets to the buyers. This renders it challenging for
the data users to decide whether a dataset is useful for the downstream ML tasks. Based on our
conversations with practitioners, existing data acquisition methods for ML are ad-hoc: one has to
manually identify data sellers, articulate their needs, estimate the data utilities, and then purchase
them. It is also iterative in nature: the datasets may show limited improvements on a downstream
ML task after being purchased, and then one has to search for a new dataset again. With this in
mind, the goal of this challenge is to mitigate a data buyer’s burden by automating and optimizing
the data acquisition strategies. This challenge demonstrates the platform’s ability to handle data-
valuations and demonstrates a unique metric based on a pricing function and a budget, which is a
useful template for future challenges that wish to capture the nuance of resource expenditure.
Benchmark Design Participants in this challenge must submit a data acquisition strategy. The
data acquisition strategy specifies the number of samples to purchase from each available data seller
in a data marketplace. Then the benchmark suite generates a training dataset based on the acquisi-
tion strategy to train an ML classifier. To mimic data acquisition in a real-world data marketplace,
participants do not have access to sellers’ data. Instead, the participants are offered (1) a few samples
(five) from each data seller, (2) summary statistics about each dataset, (3) the pricing functions that
quantify how much to pay when a particular number of samples is purchased from one seller, and
(4) a budget constraint. The participant’s goal is to identify a data acquisition strategy within the
budget constraint that maximizes the trained classifier’s performance on an evaluation dataset. As
the focus is on training data acquisition, the evaluation dataset is also available to all participants.
The overall system design can be found in Figure 3.
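A purchase strategy is an allocation of sample counts across sellers that respects the pricing functions and the budget. The greedy sketch below assumes, purely for illustration, that each pricing function maps a number of samples to a total price; the actual challenge code defines its own interface and submission format.

# Illustrative greedy acquisition strategy (interface is assumed, not the
# challenge's actual API): pricing[s](n) gives the total price of n samples
# from seller s. Repeatedly buy one more sample wherever the marginal cost
# is lowest until the budget is exhausted.
def greedy_acquisition(pricing, max_samples, budget):
    counts = {s: 0 for s in pricing}
    spent = 0.0
    while True:
        best, best_cost = None, None
        for s, price_fn in pricing.items():
            if counts[s] >= max_samples[s]:
                continue
            marginal = price_fn(counts[s] + 1) - price_fn(counts[s])
            if best_cost is None or marginal < best_cost:
                best, best_cost = s, marginal
        if best is None or spent + best_cost > budget:
            break
        counts[best] += 1
        spent += best_cost
    return counts   # number of samples to purchase from each seller

# Example with (assumed) linear per-sample prices:
# greedy_acquisition({"A": lambda n: 2.0 * n, "B": lambda n: 3.5 * n},
#                    {"A": 1000, "B": 1000}, budget=100.0)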
Table 3: We measure three baselines’ performance on all five data market instances. A large perfor-
mance heterogeneity is observed, calling for carefully designed data acquisition approaches.
Market Instance 0 1 2 3 4
Baseline Results We offer three baseline methods, namely, UNIFORM, RSS (random single
seller), and FSS (fixed single seller). UNIFORM purchases data points uniformly at random from
every seller. RSS spends the entire budget to buy as many data points as possible from one uniformly
randomly chosen seller, while FSS does the same from a fixed seller. The baseline performance
can be found in Table 3. Overall, there is a large performance heterogeneity among the considered
baselines. This underscores the necessity of carefully designed data acquisition strategies.
2.3.5 Adversarial Nibbler
Use-Case Rationale Building on recent successes for data fairness [23], quality [12], limitations [28, 58], and documentation and replication [42] of adversarial and data-centric challenges
for classification models, we identify a new challenge for discovering failure modes in generative
text-to-image models. Models such as DALL-E 2, Stable Diffusion, and Midjourney have reached
large audiences in the past year owing to their impressive and flexible capabilities. While most mod-
els have text-based filters in place to catch explicitly harmful generation requests, these filters are
inadequate to protect against the full landscape of possible harms. For instance, [45] recently re-
vealed that Stable Diffusion’s obfuscated safety filter only catches sexually explicit content but fails
to address violence, gore, and other problematic content. Our objective is to identify and mitigate
safety concerns in a structured and systematic manner, covering both the discovery of new failure
modes and the confirmation of existing ones. Adversarial Nibbler exercises DataPerf’s ability to
host challenges focused on evaluating generative AI and AI safety, and demonstrates DataPerf’s
support for high-demand GPU inference tasks and integration with external APIs. Additionally, this
challenge demonstrates new benchmark criteria targeted at generative models.
Baseline Results As the Adversarial Nibbler challenge focuses on crowdsourced data and deviates
from the other benchmarks, there is no starter code or baseline result. Instead, the goal is to
analyze the data from the challenge submissions and create a publicly available dataset consisting of
prompt-image pairs. These pairs will undergo validation, will be used to establish data ratings,
and will serve as a valuable resource for drawing conclusions and insights from the submissions
received. Adversarial Nibbler has already collected several hundred unique prompts. Results from
this challenge, consisting of a public dataset and insights into red-teaming approaches from challenge
participants, will be disseminated at the IJCNLP-AACL 2023 ART of Safety Workshop
(https://sites.google.com/view/art-of-safety/home).
3 Related Work
To ensure academic innovations have real-world impact, systems research in the machine learning
industry has relied on benchmarking, including MLPerf [33, 46], DawnBench [10] and related ef-
forts [19, 60, 51]. Data-centric benchmarking has similarly received increased focus. Zha et al. [59]
surveys recent efforts, including benchmarks in AutoML [61], semi-supervised strategies [56], data
selection [16], and data cleaning approaches [30]. Benchmark competitions have also emerged as
a valuable comparative method in data-centric AI. DataComp [18] is a recent competition focused
on filtering multimodal training data for language-image pairs, with a focus on improving accu-
racies under different fixed compute budgets. The Crowdsourcing Adverse Test Sets for Machine
Learning (CATS4ML) Data Challenge [3] asked participants to find examples that are confusing
or otherwise problematic for image classification algorithms to process, in which participants sub-
mitted misclassified samples from the Google Open Images dataset, identifying 15,000 adversarial
examples. Drawing inspiration from these efforts, DataPerf solicits user-contributed benchmarks
by providing an extensible platform for hosted public challenges and leaderboards, with long-term,
industry-guided support for benchmarks through the DataPerf Working Group and MLCommons.
Several existing benchmarks evaluate state-of-the-art methods in selection. For instance, prior work
in benchmarking high-dimensional feature selection [8] and augmentation strategies [38] are con-
ceptually similar to the vision selection and roman numeral tasks. DCBench [16] is a benchmark
and Python API for fixed-budget cleaning, slice discovery [17], and coreset selection [11], which
are applicable to our speech selection, vision selection, and data debugging tasks. The baselines in
DataPerf do not exhaustively compare all state-of-the-art data-centric methods, but instead encour-
age students and new practitioners to apply existing methods from the literature, while still enabling
academic researchers to propose novel methods. Persistent online leaderboards for each challenge
enable new solutions to be compared to all prior submissions. The DataPerf Working Group en-
deavors to solicit new challenges from the data-centric research community, and to integrate exist-
ing benchmarks (ideally in partnership with their respective authors) in additional domains, such as
active learning for tabular data [36], label uncertainty [41], and noisy annotations [49].
4 Statement of Ethics
Dynabench collects self-declared usernames and email addresses during registration, and these user-
names may correspond to personally identifiable information. Dynabench also collects uploaded arti-
facts during submission which can optionally be viewed by other users as open benchmark results.
Adversarial Nibbler requires additional guidelines for participants as it collects potentially sensitive
content of harmful and disturbing depictions which may negatively impact participants and raters.
These guidelines follow best practices for protecting well-being [27] and provide channels for
communication with challenge organizers, preparation for working with potentially unsafe imagery,
and external resources for psychological support (detailed in Appendix A.7).
References
[1] Amazon. Amazon aws data exchange, 2023. (Accessed on 05/22/2023).
[2] L. Aroyo, M. Lease, P. Paritosh, and M. Schaekermann. Data excellence for ai: why should
you care? Interactions, 29(2):66–69, 2022.
[3] L. Aroyo, P. Paritosh, S. Ibtasam, D. Bansal, K. Rong, and K. Wong. Adversarial test set for
image classification: Lessons learned from cats4ml data challenge. Under review, 2021.
[4] I. Baldini, P. Castro, K. Chang, P. Cheng, S. Fink, V. Ishakian, N. Mitchell, V. Muthusamy,
R. Rabbah, A. Slominski, et al. Serverless computing: Current trends and open problems.
Research advances in cloud computing, pages 1–20, 2017.
[5] Y. Belinkov, A. Poliak, S. M. Shieber, B. Van Durme, and A. M. Rush. Don’t take the premise
for granted: Mitigating artifacts in natural language inference. Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, 2019.
[6] Bloomberg. Bloomberg api, 2023. (Accessed on 05/22/2023).
[7] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created
graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD
international conference on Management of data, pages 1247–1250, 2008.
[8] A. Bommert, T. Welchowski, M. Schmid, and J. Rahnenführer. Benchmark of filter meth-
ods for feature selection in high-dimensional gene expression survival data. Briefings in
Bioinformatics, 23(1):bbab354, 2022.
[9] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial
gender classification. In Conference on fairness, accountability and transparency, pages 77–91.
Proceedings of Machine Learning Research, 2018.
[10] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun,
C. Ré, and M. Zaharia. Dawnbench: An end-to-end deep learning benchmark and competition.
Training, 100(101):102, 2017.
[11] C. Coleman, C. Yeh, S. Mussmann, B. Mirzasoleiman, P. Bailis, P. Liang, J. Leskovec, and
M. Zaharia. Selection via proxy: Efficient data selection for deep learning. arXiv preprint
arXiv:1906.11829, 2019.
[12] K. Crawford and T. Paglen. Excavating ai: The politics of training sets for machine learning,
September 2019.
[13] Databricks. Databricks data marketplace, 2023. (Accessed on 05/22/2023).
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierar-
chical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pages 248–255. Ieee, 2009.
[15] E. Denton, A. Hanna, R. Amironesei, A. Smart, H. Nicole, and M. K. Scheuerman. Bring-
ing the people back in: Contesting benchmark machine learning datasets. arXiv preprint
arXiv:2007.07399, 2020.
[16] S. Eyuboglu, B. Karlaš, C. Ré, C. Zhang, and J. Zou. Dcbench: A benchmark for data-centric
ai systems. New York, NY, USA, 2022. Association for Computing Machinery.
[17] S. Eyuboglu, M. Varma, K. K. Saab, J.-B. Delbrouck, C. Lee-Messer, J. Dunnmon, J. Zou, and
C. Re. Domino: Discovering systematic errors with cross-modal embeddings. In International
Conference on Learning Representations, 2022.
[18] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman,
D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.
arXiv preprint arXiv:2304.14108, 2023.
[19] W. Gao, C. Luo, L. Wang, X. Xiong, J. Chen, T. Hao, Z. Jiang, F. Fan, M. Du, Y. Huang, et al.
Aibench: towards scalable and comprehensive datacenter ai benchmarking. In International
Symposium on Benchmarking, Measuring and Optimization, pages 3–9. Springer, 2018.
[20] W. Gaviria Rojas, S. Diamos, K. Kini, D. Kanter, V. Janapa Reddi, and C. Coleman. The dollar
street dataset: Images representing the geographic and socioeconomic diversity of the world.
Advances in Neural Information Processing Systems, 35:12979–12990, 2022.
[21] M. Geva, Y. Goldberg, and J. Berant. Are we modeling the task or the annotator? an
investigation of annotator bias in natural language understanding datasets. arXiv preprint
arXiv:1908.07898, 2019.
[22] J. Godfrey, E. Holliman, and J. McDaniel. Switchboard: telephone speech corpus for re-
search and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference
on Acoustics, Speech, and Signal Processing, pages 517–520, 1992.
[23] N. Goel and B. Faltings. Crowdsourcing with fairness, diversity and budget constraints. In
Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 297–304.
Association for Computing Machinery, 2019.
[24] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith.
Annotation artifacts in natural language inference data. Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, 2018.
[25] B. Karlaš, D. Dao, M. Interlandi, B. Li, S. Schelter, W. Wu, and C. Zhang. Data debug-
ging with shapley importance over end-to-end machine learning pipelines. arXiv preprint
arXiv:2204.11131, 2022.
[26] D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh,
P. Ringshia, et al. Dynabench: Rethinking benchmarking in nlp. Proceedings of the 2021
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, 2021.
[27] H. Kirk, A. Birhane, B. Vidgen, and L. Derczynski. Handling and presenting harmful text in
nlp research. In Findings of the Association for Computational Linguistics: EMNLP 2022,
pages 497–510, 2022.
[28] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky. Revealing the dark secrets of bert,
2019.
[29] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov,
M. Malloci, A. Kolesnikov, et al. The open images dataset v4. International Journal of
Computer Vision, 128(7):1956–1981, 2020.
[30] P. Li, X. Rao, J. Blase, Y. Zhang, X. Chu, and C. Zhang. Cleanml: A study for evaluating the
impact of data cleaning on ml classification tasks. In 2021 IEEE 37th International Conference
on Data Engineering (ICDE), pages 13–24. IEEE, 2021.
[31] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. Advances
in neural information processing systems, 30, 2017.
[32] TorchVision maintainers and contributors. Torchvision: Pytorch’s computer vision library.
https://github.com/pytorch/vision, 2016.
[33] P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y.
Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock,
X. Huang, D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi, G. Pekhimenko,
L. Pentecost, V. Janapa Reddi, T. Robie, T. St John, C.-J. Wu, L. Xu, C. Young, and M. Zaharia.
Mlperf training benchmark. In Proceedings of Machine Learning and Systems, volume 2,
2020.
[34] M. Mazumder, C. Banbury, J. Meyer, P. Warden, and V. J. Reddi. Few-shot keyword spotting
in any language. arXiv preprint arXiv:2104.01454, 2021.
[35] M. Mazumder, S. Chitlangia, C. Banbury, Y. Kang, J. M. Ciro, K. Achorn, D. Galvez,
M. Sabini, P. Mattson, D. Kanter, et al. Multilingual spoken words corpus. In Thirty-fifth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track
(Round 2), 2021.
[37] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and
fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1–35, 2021.
[38] L. Nanni, M. Paci, S. Brahnam, and A. Lumini. Comparison of different image data augmen-
tation approaches. Journal of imaging, 7(12):254, 2021.
[43] A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme. Hypothesis only base-
lines in natural language inference. Proceedings of the Seventh Joint Conference on Lexical
and Computational Semantics, 2018.
[44] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine
comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[45] J. Rando, D. Paleka, D. Lindner, L. Heim, and F. Tramèr. Red-teaming the stable diffusion
safety filter. arXiv preprint arXiv:2210.04610, 2022.
[47] M. T. Ribeiro, S. Singh, and C. Guestrin. Semantically equivalent adversarial rules for
debugging nlp models. In Proceedings of the 56th annual meeting of the association for
computational linguistics (volume 1: long papers), pages 856–865, 2018.
[51] F. Tang, W. Gao, J. Zhan, C. Lan, X. Wen, L. Wang, C. Luo, Z. Cao, X. Xiong, Z. Jiang,
et al. Aibench training: Balanced industry-standard ai training benchmarking. In 2021 IEEE
International Symposium on Performance Analysis of Systems and Software (ISPASS), pages
24–35. IEEE, 2021.
[52] TAUS. Taus data marketplace, 2023. (Accessed on 05/22/2023).
[53] M. Tsuchiya. Performance impact caused by hidden bias of training data for recognizing tex-
tual entailment. Proceedings of the Eleventh International Conference on Language Resources
and Evaluation, 2018.
[54] Twitter. Twitter api, 2023. (Accessed on 05/22/2023).
[55] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh. Universal adversarial triggers for
attacking and analyzing nlp. Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), 2019.
[56] Y. Wang, H. Chen, Y. Fan, W. Sun, R. Tao, W. Hou, R. Wang, L. Yang, Z. Zhou, L.-Z. Guo,
et al. Usb: A unified semi-supervised learning benchmark for classification. Advances in
Neural Information Processing Systems, 35:3938–3961, 2022.
[57] D. Weissenborn, G. Wiese, and L. Seiffe. Making neural QA as simple as possible but not sim-
pler. In R. Levy and L. Specia, editors, Proceedings of the 21st Conference on Computational
Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 271–
280. Association for Computational Linguistics, 2017.
[58] C. Welty, P. Paritosh, and L. Aroyo. Metrology for ai: From benchmarks to instruments. arXiv
preprint arXiv:1911.01875, 2019.
[59] D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, Z. Jiang, S. Zhong, and X. Hu. Data-centric artificial
intelligence: A survey. arXiv preprint arXiv:2303.10158, 2023.
[60] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhi-
menko. Tbd: Benchmarking and analyzing deep neural network training. arXiv preprint
arXiv:1803.06905, 2018.
[61] M.-A. Zöller and M. F. Huber. Benchmark and survey of automated machine learning frame-
works. Journal of artificial intelligence research, 70:409–472, 2021.
A Appendix
A.1 Terminology
In this section, for convenience, we clarify the terminology related to training sample selection
used in our challenges, where (in accordance with widely-used terminology) a training sample is an
individual data point in a dataset. Sec. 2.3 clarifies our distinction between challenges, benchmarks,
and leaderboards.
• Training set selection: this task refers to choosing a small set of samples for training a
model from a larger pool of potentially noisy training data. This task is also commonly
referred to as coreset selection.
• Training IDs are integer enumerations of training data samples ([1,2,3,...]), or
unique strings each corresponding to a file containing data for an individual sample
([audio1.wav, audio2.wav, ...])
• Allowed training IDs: This term refers to the list of potential samples which can be included
in a proposed coreset by a challenge participant. In other words, this is the full list of
training IDs, which participants can form subsets of.
• Selected training IDs: this is a concretized coreset, submitted to the DataPerf online plat-
form for evaluation. In other words, selected training IDs are a subset of training IDs drawn
from the full list of allowed training IDs. This is indicated as "New Train Set" in Figure 1.
A.2 Reproducibility
1. Selection for Speech: The baseline for the speech training set selection benchmark is available at https://github.com/harvard-edge/dataperf-speech-example
2. Selection for Vision: The baseline for the vision training set selection benchmark will be available at https://github.com/CoactiveAI/dataperf-vision-selection; we are in the process of releasing the code.
3. Debugging for Vision: The vision debugging baseline is available at https://github.com/DS3Lab/dataperf-vision-debugging
4. Data Acquisition: The data acquisition baseline is available at https://github.com/facebookresearch/Data_Acquisition_for_ML_Benchmark
5. Adversarial Nibbler: As the Adversarial Nibbler challenge focuses on crowdsourced data, there is no starter code or baseline result for participants. The server code for the challenge is available as part of Dynabench (Sec. 2.2) at https://github.com/mlcommons/dynabench
In the following sections, to provide a fixed reference, we include extended documentation for each
challenge reproduced from each of their respective source-code repositories, as of August 2023,
which reflects the challenge requirements and evaluation structure for all inaugural challenges in
the DataPerf suite. Though future training set selection and debugging challenges in DataPerf may
diverge from some of the technical specifications provided here, we emphasize that these challenges
as described can also serve as fixed benchmarks by the data-centric AI community, and future so-
lutions can be submitted to the leaderboards for these rounds of challenges in adherence to these
specifications and rules.
A.3 Selection for Speech
In Fig. 4, we provide the training and evaluation sample counts available for each target
keyword, and the nontarget data, for the three languages in the benchmark. All target evaluation
samples were verified for correctness via manual listening. For each language, a participant trains
a six category (five target words and one nontarget category) model, using a maximum of 25 or 60
samples drawn from the training pool. Evaluation proceeds by training ten models using ten random
seeds, and for each model, reporting the macro F1 score on all evaluation samples for target and
nontarget words for each language.
The classification task also
includes a nontarget category representing unknown words which are distinct from the
five target words. To train and evaluate the classifier’s ability to recognize nontarget words, we
include a large set of embedding vectors drawn from each respective language.
Solutions should be algorithmic in nature (i.e., they should not involve human-in-the-loop audio
sample listening and selection). We warmly encourage open-source submissions. If a participant
team does not wish to open-source their solution, we ask that they allow the DataPerf organization
to independently verify their solution and approach to ensure it is within the challenge rules.
Getting Started
Our introductory notebook on Google Colab is available at
https://colab.research.google.com/github/harvard-edge/dataperf-speech-example/blob/main/dataperf_speech_colab.ipynb
This colab walks through performing coreset selection with our baseline algorithm11 and running
our evaluation script12 on the coresets for English, Portuguese, and Indonesian.
Below, we provide additional documentation for each step of the above colab (downloading, training
coreset selection, and evaluation).
Please see the challenge rules on dataperf.org13 for more details - in particular, we ask you not to
optimize your result using any of the challenge evaluation data. Optimization (e.g., cross-validation)
should be performed on the samples in allowed_training_set.yaml for each language, and
solutions should not be optimized against any of the samples listed in eval.yaml for any of the
languages.
Since this speech challenge is fully open, there is no hidden test set. A locally-computed evaluation
score is unofficial, but should match the results on DynaBench; it is included solely to allow for
double-checking of DynaBench-computed results if necessary. Official evaluations will only
be performed on DynaBench. Local (offline) evaluation is performed by running the provided
evaluation script (eval.py) on a selected training set. This outputs the macro F1 score of a model
trained on the selected training set, against the official evaluation samples.
Algorithm Development
To develop their own selection algorithm, participants should:
• Create a new selection.py algorithm in selection/implementations which subclasses TrainingSetSelection14
• Implement select() in your class to use your selection algorithm
• Change selection_algorithm_module and selection_algorithm_class in workspace/dataperf_speech_config.yaml to match the name of your selection implementation
• Optionally, add experiment configs to workspace/dataperf_speech_config.yaml (these can be accessed via self.config in your selection class)
• Run your selection strategy and submit your results to DynaBench
Submission
Once participants are satisfied with their selection algorithm, they should submit their
{lang}_{size}_train.json files to DynaBench15 . A separate file is required for each language and training set size combination (6 total).
Each supported language has the following files:
10 https://colab.research.google.com/github/harvard-edge/dataperf-speech-example/blob/main/dataperf_speech_colab.ipynb
11 https://github.com/harvard-edge/dataperf-speech-example/blob/main/selection/implementations/baseline_selection.py
12 https://github.com/harvard-edge/dataperf-speech-example/blob/main/eval.py
13 https://dataperf.org
14 https://github.com/harvard-edge/dataperf-speech-example/blob/main/selection/selection.py#L16
15 https://dynabench.org/tasks/speech-selection
• train_vectors : The directory that contains the embedding vectors
that can be selected for training. The file structure follows the pattern
train_vectors/en/left.parquet. Each parquet file contains a clip_id
column and a mswc_embedding_vector column.
• eval_vectors : The directory that contains the embedding vectors that are used for
evaluation. The structure is identical to train_vectors
• allowed_train_set.yaml : A file that specifies which sample IDs are
valid training samples. The file contains the following structure: {"targets":
{"left":[list]}, "nontargets": [list]} (see the selection sketch after this list)
• eval.yaml : The evaluation set for eval.py. It follows the same structure as
allowed_train_set.yaml. Participants should never use this data for training set
selection algorithm development.
• {lang}_{size}_train.json : The file produced by selection:main that spec-
ifies the language specific training set for eval.py.
All languages share the following files: * dataperf_speech_config.yaml : This file con-
tains the configuration for the dataperf-speech-example workflow. Participants can extend this con-
figuration file as needed.
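As a purely illustrative example of producing a training-set file from allowed_train_set.yaml, the sketch below draws a small balanced random selection and writes it out as JSON in the same targets/nontargets structure; the authoritative output schema is whatever selection:main in the baseline repository produces.

# Illustrative random coreset: sample a balanced subset of IDs from
# allowed_train_set.yaml and write a {lang}_{size}_train.json file. The JSON
# structure mirrors allowed_train_set.yaml here; the authoritative schema is
# defined by selection:main in the baseline repository.
import json
import random
import yaml

def random_train_set(allowed_yaml, out_json, size=25, seed=0):
    rng = random.Random(seed)
    with open(allowed_yaml) as f:
        allowed = yaml.safe_load(f)   # {"targets": {word: [ids]}, "nontargets": [ids]}
    per_class = size // (len(allowed["targets"]) + 1)
    selected = {
        "targets": {word: rng.sample(ids, min(per_class, len(ids)))
                    for word, ids in allowed["targets"].items()},
        "nontargets": rng.sample(allowed["nontargets"],
                                 min(per_class, len(allowed["nontargets"]))),
    }
    with open(out_json, "w") as f:
        json.dump(selected, f)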
Rules
• We ask you to please not look at or use the provided evaluation sets in any way other
than for offline evaluation of your submissions to Dynabench (e.g., do not optimize on the
evaluation data).
• Each training set in the final submission will be capped at either 25 or 60 samples, depend-
ing on the leaderboard. Training sets with more than the maximum number of selected
samples for that leaderboard will be rejected.
• For this challenge, the submitted train.json file can be unbalanced; therefore, an optimal
solution may leverage an unbalanced training set.
• The provided candidate pool of training samples is a custom subset of the Multilingual
Spoken Words Corpus (MSWC). You may analyze other languages in MSWC, but please
do not use English, Portuguese, or Indonesian MSWC data outside of the samples specified
in allowed_training_set.yaml for each respective language.
A.4 Selection for Vision
An example training set file lists (ImageID, Confidence) pairs, for example:
cat dataperf-vision-selection/data/train_sets/random_500.csv
ImageID,Confidence
0002643773a76876,0
0016a0f096337445,0
0036043ce525479b,1
00526f123f84db2f,1
0080db2599d54447,1
00978577e9fdd967,1
...
Offline evaluation
The configuration for the offline evaluation is specified in the task_setup.yaml file. For simplicity,
the repo comes pre-configured such that for offline evaluation you can simply:
1. Copy your training sets to the template filesystem
2. Modify the config file to specify the training set for each task
3. Run offline evaluation
4. See results in stdout and results file in data/results/
For example:
{
"Cupcake": {
"accuracy": 0.5401459854014599,
"recall": 0.463768115942029,
"precision": 0.5517241379310345,
"f1": 0.5039370078740157
},
"Hawk": {
"accuracy": 0.296551724137931,
"recall": 0.16831683168316833,
"precision": 0.4857142857142857,
"f1": 0.25000000000000006
},
"Sushi": {
"accuracy": 0.5185185185185185,
"recall": 0.6261682242990654,
"precision": 0.638095238095238,
"f1": 0.6320754716981132
}
}
Evaluation Criteria In this challenge, your task is to design a data selection strategy that chooses, from a candidate pool of training images (a custom subset of the Open Images Dataset V6 training set), the training examples that maximize the F1 score across a set of binary classification tasks for different visual concepts (e.g., “Cupcake”, “Hawk”, “Sushi”). Your submission will be a training set for each of the classification tasks in this challenge.
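As a rough sketch of how the per-concept results shown above could be aggregated locally (the results filename is an assumption, and the official leaderboard aggregation may differ):

# Aggregate per-concept F1 scores from an offline results file (format shown
# above) into a single mean-F1 number. The filename is an assumption.
import json

with open("data/results/results.json") as f:
    results = json.load(f)

mean_f1 = sum(task["f1"] for task in results.values()) / len(results)
print(f"mean F1 across concepts: {mean_f1:.4f}")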
Rules
1. We ask you to please not look at or use the provided test sets in any way other than for
offline evaluation.
2. We ask you to only use the provided data for developing your solution (unless otherwise
explicitly stated).
3. Submissions that rely on human intervention are allowed. The intervention strategy must
be clearly explained such that the results are as reproducible and extensible as possible.
4. Algorithmic submissions may not rely on external intervention (e.g. humans, extra data).
The results should be reproducible and extensible to other datasets.
5. Your developed solution should be practical and reasonably efficient given the scope of the
challenge (e.g., your algorithm shouldn’t perform an exhaustive search).
6. Rules regarding participation:
• Participants can only belong and participate in one team
• Individuals are considered a team
• Teams must be defined before the end of the challenge
• Each team must have a leader who is responsible for submissions to the online evaluation
platform
• Participants should not access or inspect submissions or selection code from other partici-
pating teams until after the challenge concludes
7. Each training set that is part of the final submission will be limited to 1,000 data points. Training sets with more than 1,000 (ImageID, label) pairs will be rejected.
8. The provided candidate pool has no labels, and as such, part of the challenge involves using
the information contained in the embeddings as effectively as possible.
9. The provided candidate pool is a custom subset of the training set for the Open Images
dataset. You may refer to metadata from the Open Images dataset.
10. If needed, you can leverage the human-verified and/or machine generated labels available
in the metadata from the Open Images dataset. However, we encourage creative solutions
that minimize the amount of labels used.
all examples. By using a more data-centric approach, we hope to direct human attention and cleaning effort toward the data examples that matter most for improving ML models.
In this data cleaning challenge, we invite participants to design and experiment with data-centric approaches to strategic data cleaning for the training set of an image classification model. As a participant, you will be asked to rank the samples in the entire training set; we will then clean them one by one and evaluate the performance of the model after each fix. The earlier the model reaches a high enough accuracy, the better your ranking.
DataPerf currently hosts an open division for the vision debugging challenge. In the open division, you submit the output of running your cleaning algorithm on a given dataset, and we then train the model and evaluate it based on your submission. As future work, we will add a closed division, in which you submit the cleaning algorithm itself and we run it to generate outputs on several hidden datasets before evaluating your submissions.
How to Participate
To make participation as easy as possible, we provide a set of tools that ease the process of iterating and submitting: MLCube17 and Dynabench18 . MLCube helps you get started on your local computer: it downloads the datasets, runs baseline algorithms, evaluates your submission and the baselines, and plots the results. Once you are satisfied with your results, you can submit them to Dynabench, the platform where we evaluate your submission and show the leaderboard for this challenge.
Offline Evaluation with MLCube
The evaluation code for the challenge is fully open at https://github.com/DS3Lab/dataperf-vision-debugging, where you can run baselines and evaluate your algorithms locally. Below are the instructions on how to set up the environment and run them locally.
To evaluate your own algorithm, you can either:
• Provide a .txt file, as described in https://github.com/DS3Lab/dataperf-vision-debugging#open-division-creating-a-submission19 , and place it under the workspace/submissions folder. It will be evaluated by the evaluate command (a minimal sketch follows below).
• Write an algorithmic approach in app/baselines/debugging.py . It will be run and evaluated together with the other baseline approaches.
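As a purely illustrative sketch of the first option (the one-index-per-line layout and the filename are assumptions for this sketch; the repository README linked above defines the authoritative submission format):

# Write a random cleaning-priority ranking to a .txt file. The layout below
# (one training-example index per line) is an assumption for this sketch; see
# the dataperf-vision-debugging README for the official format.
import random

NUM_TRAINING_EXAMPLES = 300  # size of the training set

order = list(range(NUM_TRAINING_EXAMPLES))
random.shuffle(order)  # replace with your cleaning-priority ranking

with open("workspace/submissions/my_submission.txt", "w") as f:
    f.write("\n".join(str(idx) for idx in order))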
Online Evaluation with Dynabench
As stated before, for the open division we ask that you submit multiple files, each being the output of the cleaning algorithm you developed. The only limitations on your submission are:
• each training file must have exactly 300 examples, which is the size of the training set;
• you must submit for all evaluated classes at the same time.
Evaluation Metric Your submission will be evaluated on how many samples it needs to fix in order to achieve a high enough accuracy. This imitates real use cases of data cleaning algorithms, where we want to inspect as few samples as possible while keeping the data quality good enough. For example, if the accuracy of the model trained on a perfectly clean dataset is 0.9, then we define the high enough accuracy to be 0.9 × 95% = 0.855. If algorithm A achieves an accuracy of 0.855 after fixing 100 samples and algorithm B achieves an accuracy of 0.855 after fixing 200 samples, then score(A) = 100/300 = 1/3 while score(B) = 200/300 = 2/3. In other words, the lower the score, the better the cleaning algorithm.
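As a minimal sketch of this scoring rule (variable and function names are illustrative, not taken from the challenge code):

# Fraction of samples that must be fixed before the model reaches 95% of the
# accuracy obtained with a perfectly clean training set; lower is better.
def cleaning_score(accuracy_after_fixes, clean_accuracy, total_samples=300):
    threshold = 0.95 * clean_accuracy
    for num_fixed, acc in enumerate(accuracy_after_fixes, start=1):
        if acc >= threshold:
            return num_fixed / total_samples
    return 1.0  # threshold never reached: worst possible score

# Example from the text: reaching 0.855 after 100 of 300 fixes gives 1/3.
print(cleaning_score([0.5] * 99 + [0.855], clean_accuracy=0.9))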
Rules
1. We ask you to please not look at or use the provided test sets in any way other than for
offline evaluation.
17 <https://mlcommons.org/en/mlcube/>
18 <https://mlcommons.org/en/groups/research-dynabench/>
19 <https://github.com/DS3Lab/dataperf-vision-debugging#open-division-creating-a-submission>
2. We ask you to only use the provided data for developing your solution (unless otherwise
explicitly stated).
3. Algorithmic submissions may not rely on external intervention (e.g. humans, extra data).
The results should be reproducible and extensible to other datasets.
4. Your developed solution should be practical and reasonably efficient given the scope of the
challenge (e.g., your algorithm shouldn’t perform an exhaustive search).
5. Rules regarding participation:
• Participants can only belong and participate in one team
• Individuals are considered a team
• Teams must be defined before the end of the challenge
• Each team must have a leader who is responsible for submissions to the online evaluation
platform
• Participants should not access or inspect submissions or selection code from other partici-
pating teams until after the challenge concludes
6. Each training set that is part of the final submission will be limited to 1,000 data points. Training sets with more than 1,000 (ImageID, label) pairs will be rejected.
7. For this challenge, the provided candidate pool (i.e. embeddings) has no labels, and as
such, part of the challenge involves using the information contained in the embeddings as
effectively as possible.
8. The provided candidate pool is a custom subset of the training set for the Open Images dataset. You may refer to non-label metadata from the Open Images dataset (https://storage.googleapis.com/openimages/web/download.html).
20 <https://dataperf.org/>
We suggest starting by using the Colab notebook21 . It is self-contained and shows how to (i) install the needed library, (ii) access the buyer’s observation, and (iii) create strategies ready to be submitted. In the following, we explain these steps in more detail.
3. How to access the buyer’s observation?
We provide a simple Python library to access the buyer’s observation in each data marketplace. After the marketplace object (MyDam below) has been instantiated for a given marketplace id, the following code lists the buyer’s budget, dataset, ML model, and the seller ids:
budget = MyDam.getbudget()         # the buyer's purchase budget
buyer_data = MyDam.getbuyerdata()  # the buyer's dataset
mlmodel = MyDam.getmlmodel()       # the buyer's ML model
sellers_id = MyDam.getsellerid()   # the ids of the sellers in this marketplace
seller_i_price contains the pricing function. seller_i_summary includes (i) the number of rows, (ii) the number of columns, (iii) the histogram of each dimension, and (iv) the correlation between each column and the label. seller_i_samples contains 5 samples from each dataset.
Note: For simplicity, all sellers sell the same type of data; put more mathematically, their data distributions share the same support. For example, the number of columns is the same, as is their semantic meaning.
More details on the price function: given a sample size, the price can be calculated by calling the get_price_samplesize function. For example, if the sample size is 100, the price is obtained by calling
seller_i_price.get_price_samplesize(samplesize=100)
The seller summaries can be inspected as follows:
seller_i_summary.keys()
>>> dict_keys(['row_number', 'column_number', 'hist', 'label_correlation'])
print(seller_i_summary['hist']['2'])
>>> {'0_range': -0.7187578082084656,
'0_size': 3,
'10_range': 0.47909897565841675,
'1_range': -0.5989721298217774,
'1_size': 35,
'2_range': -0.4791864514350891,
'2_size': 198,
'3_range': -0.3594007730484009,
'3_size': 821,
'4_range': -0.23961509466171266,
'4_size': 2988,
'5_range': -0.11982941627502441,
'5_size': 8496,
'6_range': -4.373788833622605e-05,
'6_size': 11563,
'7_range': 0.11974194049835207,
'7_size': 5155,
'8_range': 0.23952761888504026,
'8_size': 704,
'9_range': 0.35931329727172856,
'9_size': 37}
21 <https://colab.research.google.com/drive/1HYoFfKwd9Pr-Zg_e2uJxWF8yHqa9sRMn?usp=sharing>
How to read this? This representation documents (i) how the histogram bins are created (i_range) and (ii) how many points fall into each bin (i_size). For example, '2_size': 198 means 198 data points are in the 2nd bin, and '2_range': -0.4791864514350891, '3_range': -0.3594007730484009 means the 2nd bin spans [-0.4791864514350891, -0.3594007730484009].
print(seller_i_summary['label_correlation']['2'])
>>> 0.08490820825406746
This means the correlation between the 2nd feature and the label is 0.08490820825406746.
Note that the features in the sellers’ and buyer’s datasets are NOT in their raw form: we extracted them from the original format using a deep learning model (specifically, a DistilBERT model).
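As one illustrative way of using these summaries (seller_summaries below is an assumed dict mapping a seller id to the summary object described above; it is not part of the provided library):

# Rank sellers by their strongest absolute feature-label correlation.
def rank_sellers_by_correlation(seller_summaries):
    scores = {}
    for sid, summary in seller_summaries.items():
        corrs = summary["label_correlation"].values()
        scores[sid] = max(abs(c) for c in corrs)
    # Sellers with the strongest correlation first.
    return sorted(scores, key=scores.get, reverse=True)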
4. How to submit a solution?
The submission should contain K (=5) txt files, where k.txt corresponds to the purchase strategy for the kth marketplace. The notebook automatically generates the txt files for submission under the folder \submission\my_submission. For example, one submission may look like
\submission\my_submission\0.txt
\submission\my_submission\1.txt
\submission\my_submission\2.txt
\submission\my_submission\3.txt
\submission\my_submission\4.txt
Each txt file should contain one line of numbers, where the ith number indicates the number of samples to purchase from the ith seller. For example, a 0.txt containing
100,50,200,500
means buying 100, 50, 200, and 500 samples from seller 1, seller 2, seller 3, and seller 4, respectively.
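As a minimal sketch of writing these files (the purchase counts below are placeholders, and the forward-slash paths are an assumption about the working-directory layout):

# Write one k.txt per marketplace, each containing a comma-separated purchase
# count per seller. The counts below are placeholders for a real strategy.
import os

purchases = [
    [100, 50, 200, 500],
    [120, 120, 120, 120],
    [0, 300, 300, 0],
    [50, 50, 50, 50],
    [200, 0, 0, 400],
]

out_dir = os.path.join("submission", "my_submission")
os.makedirs(out_dir, exist_ok=True)
for k, counts in enumerate(purchases):
    with open(os.path.join(out_dir, f"{k}.txt"), "w") as f:
        f.write(",".join(str(c) for c in counts))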
Once you are ready, upload the txt files to DynaBench for evaluation:
https://dynabench.org/tasks/DAM/
5. How is a submission evaluated?
Once we receive a submission, we first check whether the strategy is legal (e.g., whether it satisfies the budget constraint). Then we train an ML model on the dataset generated by the submitted strategy and evaluate its performance (standard accuracy) on the buyer’s data D_b. We report the performance averaged over all K marketplace instances.
What ML model to train? To focus on the data acquisition task, we train a simple logistic regression model, i.e., a model of the form Pr(y = 1 | x) = σ(w⊤x + b), where σ is the sigmoid function.
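As a rough illustration of this evaluation step (the data loading, variable names, and hyperparameters below are assumptions; the official evaluation pipeline may differ):

# Train a logistic regression model on the acquired samples and report
# standard accuracy on the buyer's data. Inputs are assumed to be numpy-style
# feature matrices and label vectors.
from sklearn.linear_model import LogisticRegression

def evaluate_strategy(X_acquired, y_acquired, X_buyer, y_buyer):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_acquired, y_acquired)
    return model.score(X_buyer, y_buyer)  # standard accuracy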
Requirements:
(i) you may use any (open-source/commercial) software
(ii) you may not use external datasets
(iii) do not create multiple accounts for submission
(iv) follow the honor code.
1. Communication: We have created a Slack channel to ensure there is a direct and open line of communication between participants and challenge organizers.
2. Preparation: We provide participants with a list of practical tips on how to prepare for unsafe imagery and protect themselves during the data collection phase, such as splitting work into shorter chunks, talking to other team members, and taking frequent breaks.23
3. Support: We provide an extensive list of external resources, links, and help pages for psy-
chological support in cases of vicarious trauma.24
22 https://www.dataperf.org/adversarial-nibbler/nibbler-participation
23 Handling Traumatic Imagery: Developing a Standard Operating Procedure, https://dartcenter.org/resources/handling-traumatic-imagery-developing-standard-operating-procedure
24 Vicarious Trauma Toolkit, https://ovc.ojp.gov/program/vtt/compendium-resources
Figure 5: User Interface for Adversarial Nibbler. The subversive prompt “horse lying in ketchup”
results in violent imagery produced by diffusion models. Generated images have been obscured.
Figure 6: Participation instructions for Adversarial Nibbler
3. Participants must submit their DynaBench name with their written submission so that we
can associate the submission with their performance in the competition;
4. To ensure participants do not release the images generated for any commercial or financial
gain, all images created in this challenge must maintain a permissive license, e.g., CC-BY;
5. Participants can use any external resources available to them (e.g., their own instance of a
T2I model) to explore the space of model failures;
6. To prevent users from overloading the system and to encourage creativity in attack strategies, each participant has a limit of 50 image-generation sets per day during the competition;
7. If we see evidence that participants are using the UI or the API access to the T2I models for purposes other than the competition, they will be removed and their account will be suspended. All decisions to remove a participant for violating this rule will be reviewed manually.
There are no restrictions on the use of any other resources for participating in this competition.
Participants are allowed to do any of the following (if they choose to):
Figure 7: FAQ for Adversarial Nibbler