Digital Discovery

PERSPECTIVE

Accelerated chemical science with AI

Seoin Back,*a Alán Aspuru-Guzik,†bc Michele Ceriotti,d Ganna Gryn'ova,ef Bartosz Grzybowski,ghi Geun Ho Gu,j Jason Hein,k Kedar Hippalgaonkar,lm Rodrigo Hormázabal,n Yousung Jung,†op Seonah Kim,q Woo Youn Kim,r Seyed Mohamad Moosavi,s Juhwan Noh,t Changyoung Park,n Joshua Schrier,u Philippe Schwaller,v Koji Tsuda,wxy Tejs Vegge,z O. Anatole von Lilienfeld†c,aa,ab and Aron Walsh ac,ad

Cite this: Digital Discovery, 2024, 3, 23
Received 25th October 2023
Accepted 6th December 2023
DOI: 10.1039/d3dd00213f
rsc.li/digitaldiscovery

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence.
Open Access Article. Published on 06 December 2023. Downloaded on 5/31/2024 9:44:07 AM.

In light of the pressing need for practical materials and molecular solutions to renewable energy and health problems, to name just two examples, one wonders how to accelerate research and development in the chemical sciences, so as to address the time it takes to bring materials from initial discovery to commercialization. Artificial intelligence (AI)-based techniques, in particular, are having a transformative and accelerating impact on many, if not most, technological domains. To shed light on these questions, the authors and participants gathered in person for the ASLLA Symposium on the theme of ‘Accelerated Chemical Science with AI’ at Gangneung, Republic of Korea. We present the findings, ideas, comments, and often contentious opinions expressed during four panel discussions related to the respective general topics: ‘Data’, ‘New applications’, ‘Machine learning algorithms’, and ‘Education’. All discussions were recorded, transcribed into text using OpenAI's Whisper, and summarized using LG AI Research's EXAONE LLM, followed by revision by all authors. For the broader benefit of current researchers, educators in higher education, and academic bodies such as associations, publishers, librarians, and companies, we provide chemistry-specific recommendations and summarize the resulting conclusions.

a Department of Chemical and Biomolecular Engineering, Institute of Emergent Materials, Sogang University, Seoul, Republic of Korea. E-mail: [email protected]
b Departments of Chemistry, Computer Science, University of Toronto, St. George Campus, Toronto, ON, Canada
c Acceleration Consortium and Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada
d Laboratory of Computational Science and Modeling (COSMO), École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
e Heidelberg Institute for Theoretical Studies (HITS gGmbH), 69118, Heidelberg, Germany
f Interdisciplinary Center for Scientific Computing, Heidelberg University, 69120, Heidelberg, Germany
g Center for Algorithmic and Robotized Synthesis (CARS), Institute for Basic Science (IBS), Ulsan, Republic of Korea
h Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
i Department of Chemistry, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
j Department of Energy Engineering, Korea Institute of Energy Technology (KENTECH), Naju, 58330, Republic of Korea
k Department of Chemistry, University of British Columbia, Vancouver, BC, V6T 1Z1, Canada
l School of Materials Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
m Institute of Materials Research and Engineering, Agency for Science Technology and Research, 2 Fusionopolis Way, 08-03, Singapore 138634, Singapore
n LG AI Research, Seoul, Republic of Korea
o Department of Chemical and Biomolecular Engineering, KAIST, Daejeon, Republic of Korea
p School of Chemical and Biological Engineering, Interdisciplinary Program in Artificial Intelligence, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
q Department of Chemistry, Colorado State University, 1301 Center Avenue, Fort Collins, CO 80523, USA
r Department of Chemistry, KAIST, Daejeon, Republic of Korea
s Chemical Engineering & Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
t Chemical Data-Driven Research Center, Korea Research Institute of Chemical Technology, Daejeon, 34114, Republic of Korea
u Department of Chemistry, Fordham University, The Bronx, NY 10458, USA
v Laboratory of Artificial Chemical Intelligence (LIAC) & National Centre of Competence in Research (NCCR) Catalysis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
w Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan
x Center for Basic Research on Materials, National Institute for Materials Science, Tsukuba, Ibaraki 305-0044, Japan
y RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan
z Department of Energy Conversion and Storage, Technical University of Denmark, 301 Anker Engelunds vej, Kongens Lyngby, Copenhagen, 2800, Denmark
aa Departments of Chemistry, Materials Science and Engineering, and Physics, University of Toronto, St George Campus, Toronto, ON, Canada
ab Machine Learning Group, Technische Universität Berlin and Berlin Institute for the Foundations of Learning and Data, 10587, Berlin, Germany
ac Department of Materials, Imperial College London, London SW7 2AZ, UK
ad Department of Physics, Ewha Women's University, Seoul, Republic of Korea

† The symposium was organized by Yousung Jung, Alán Aspuru-Guzik, and O. Anatole von Lilienfeld. The authors are listed in alphabetical order, except for the first author who took charge of organizing the initial draft written by all co-authors who contributed to different sections.

© 2024 The Author(s). Published by the Royal Society of Chemistry Digital Discovery, 2024, 3, 23–33 | 23
I. Introduction

With the unprecedented developments of artificial intelligence (AI) technology, chemical science is now entering a radically new era. High-performance computing and virtual screening techniques identify compounds to synthesize for target applications, while automated robotics perform synthesis and characterizations. Additionally, AI suggests new experiments based on the data collected by robotic platforms. In this autonomous laboratory workflow, data science plays a central role in accelerating discoveries in chemical science. The 15th ASLLA Symposium on ‘Accelerated Chemical Science with AI’ was held at the Korea Institute of Science and Technology (KIST) on 25–28 September 2022 in Gangneung, Republic of Korea. The workshop brought together 45 participants from around the world to discuss machine learning and automation for the chemical sciences.

In addition to brief talks from the attendees, the conference placed emphasis on panel discussions on the themes of Data, New applications, Machine learning (ML) algorithms, and Education. This Perspective aims to effectively communicate the insights and discussions from these panels to the broader research community.

Numerous recent review and perspective articles have extensively explored the role of data science, ML and AI in various domains of experimental chemistry, including general chemistry,1 synthetic chemistry and chemical reactions,2–5 as well as theoretical topics such as chemical compound space exploration6 and force-field development.7,8 Additionally, recent reviews have addressed the application of autonomous research systems in materials science,9–16 organic chemistry,17–19 inorganic chemistry,20 porous materials,21 nanoscience,22,23 drug formulation24,25 and biomaterials.26 Reviews also exist on the topic of self-driving laboratories27,28 and their low-cost incarnations.29 While previous recommendations have covered ‘best practices' in machine learning for chemistry,30 including uncertainty quantification,31 our focus in this Perspective is to present specific recommendations derived from a very rich set of panel discussions by many active researchers in the field rather than reiterating those already discussed themes. We refer the reader to those articles for more in-depth discussions.

Continuing with the focus on AI, the Whisper program32 was used to transcribe the panel discussions, and EXAONE33 was used to generate automated summaries. These algorithmically generated summaries served as the initial drafts of the following works, which we subsequently edited and annotated to ensure clarity. Through this process, it became clear that the panel discussions encompassed overlapping topics, highlighting the shared challenges in the field of AI in chemical science. To underscore these critical challenges, we have reorganized the discussions into common themes: data, new applications, ML algorithms, and education.

II. Data

The quality and scale of data play a pivotal role in developing high-performance ML models. Thus, it is unsurprising that data consistently emerged as a focal point of discussion in all panel sessions. This section aims to offer a concise summary of the insightful discourse on database building, to facilitate the creation of robust and effective machine learning models.

Building better databases

Comprehending the diversity and richness of datasets is vital for developing generalizable ML models.34 Employing metrics to assess novelty and methods to down-select datasets to eliminate redundant data can serve as remedies in certain cases. When confronted with limited data, hand-crafted descriptors, e.g., coarse-grained descriptors, can be a pragmatic approach in low-data materials discovery tasks.

Furthermore, the availability of high/multi-fidelity benchmark datasets is essential.35 The benefits of improved training data efficiency when using multi-level learning in chemical compound space have been demonstrated on multiple occasions.36,37

When dealing with high-cost, high-fidelity data acquisition, the development of automated workflows that incorporate uncertainty quantification,38 encompassing both epistemic (model's inability to fit the data distribution) and aleatoric (noise in the data) uncertainties, along with active learning, can be beneficial. Moreover, delta learning methods and incorporation of physical rules as inductive bias within the machine learning algorithms have been shown to reduce the size of required data.39 Furthermore, sampling techniques such as entropic sampling and self-learning population annealing can serve as effective data acquisition strategies. These techniques enable effective weighting of the density of states of the final property in relation to input descriptors, facilitating a comprehensive understanding of different regions of the chemical space. In addition to forward models, observations have suggested that machine learning can also contribute to knowledge-augmented data generation within a discrete and sparse chemical space, particularly in the context of inverse generative design.12,40,41

Despite the significant emphasis on developing theoretical strategies for efficiently constructing databases with high-fidelity data, there is a need for additional efforts to ensure that these databases are also user-friendly for interdisciplinary research, i.e., permit even non-domain-expert AI practitioners to interact with the data with minimal intervention. This accessibility is essential for facilitating the test of new algorithmic developments. For instance, when the first large quantum dataset with coordinates and multiple molecular properties for more than 100 000 small organic molecules, QM9, was published in 2014, the total energy was included alongside the free atomic energy. While experts can easily calculate derived properties from this information, such as reaction energies or atomization energies by respectively subtracting the total energies of constitutional isomers or subtracting the free atomic energy from the total energy for any given stoichiometry, this process can pose an unnecessary barrier for non-experts, requiring them to invest time and effort in understanding the underlying definitions of basic chemical properties. Hence, the
development of easy-to-use, web browser-based interfaces for predictive models is of great importance.42 At the same time, the systematic management of meta-information remains important to ensure the reliability of the constructed database. For example, tools such as AiiDA43 and NoMaD44,45 record comprehensive data provenance for ‘static’ materials simulations.

Finally, it is important to distinguish between multiple dataset categories: smaller, more accurate, and computationally challenging ones that serve specific practical purposes, and datasets specifically designed for benchmarking ML models. This differentiation helps avoid situations where research solely focuses on improving model performance to surpass benchmarks without effectively translating those advancements into practical applications (overfitting). In this context, dynamic management of databases within the relevant research community proves to be fruitful, as discussed below.

Dynamic community database

For ML algorithms to effectively capture the true complexity of the chemical and materials compound space, it is crucial to overcome biases present in existing databases. This requires a collaborative effort within the community to enable true discovery. To facilitate this goal, the successful implementation of the Common Task Framework (CTF) in the protein community, in conjunction with the Protein Data Bank, has served as a model. The following list outlines key components in datasets that could help to facilitate and foster collaborations between non-experts and experts in solving such problems:

(1) Tasks: clearly defined tasks with precise mathematical interpretation, physical meaning, and chemical purpose.
(2) Accessibility: availability of easily accessible gold-standard datasets in a standardized format, publicly accessible and ready for use.
(3) Metrics: specification of one or more proposed quantitative metrics for each task to measure success.
(4) Evaluation: continuously updated leaderboards that rank state-of-the-art methods and/or data-splits that allow us to better track the model improvements and generalization to out-of-domain (OOD) data.
(5) Discovery: ability to generate new data as needed, by “Augmenting with chemical knowledge.”

Discussions specific to organic reactions databases

While significant progress has been made in the past decade with the emergence of deep learning, the effectiveness of purely data-driven approaches in organic synthesis planning remains to be determined.46–51

Large databases of reactions, such as USPTO,52 Pistachio,53 Reaxys,54 and SciFinder,55 do exist. However, the knowledge contained within these databases falls short regarding quality, diversity, and accessibility. For instance, while USPTO offers open access, its quality may be lower compared to the limited, paid access but higher-quality Reaxys. Reproducibility has also become a point of concern. Additionally, despite the vast amount of experimental data available in these reaction databases, only a limited number of reaction types have sufficiently large numbers of examples, typically a few hundred or more, which hinders the development of practical/useful/general AI models.56 Efforts such as the Open Reaction Database are notable for trying to address these limitations,57 but remain populated with data from USPTO, with only a few hundred brand-new entries; this poses a question of how to best incentivize synthetic chemists to deposit their results (both positive and negative) into such databases.

Correspondingly, purely data-driven approaches in organic synthesis planning would greatly benefit from maximal training data efficiency when learning. Potential solutions to enhance efficiency include Delta-learning and transfer learning,58 multi-level learning,36,37 and few-shot learning techniques.59 However, the challenge of sparse data becomes particularly pronounced when attempting to identify the scope of “impossible” reactions. If a certain reaction is not listed in a database, one often assumes it cannot happen. But this assumption is mostly true for the types of reactions that happen often. As mentioned earlier, such classes are relatively limited in number and occurrence.60

When high-quality datasets are lacking, an alternative, albeit more labor-intensive, approach is expert coding within programs like Chematica or AllChemy. These programs can perform advanced-level synthesis planning, even for complex natural products.61

One conclusion reached with broad consensus is the ever-increasing need for improved quality and open databases in all AI-related efforts, not only for reaction data but also for describing rules of chemical reactivity, or the properties of experimentally-available and virtual ligands to find new catalysts.62,63 Moreover, new featurization schemes may be necessary, particularly ones that consider stereochemical, steric hindrance, and long-range interaction aspects of reactions on complex scaffolds.

Publisher's role

The consensus among many participants was that funding bodies and scientific journals should adopt stringent requirements to foster the open availability, completeness, curation, and standardized formatting of published data. However, determining the specific standards and formats for data remains an ongoing question.

Similarly, it was emphasized that the codes utilized to generate the data should be accessible, unless licensed, and well-documented. Such practices align with the increasing adoption of FAIR (Findable, Accessible, Interoperable, and Reusable) policies in the scientific community.64,65 Another related challenge is facilitating broader access to proprietary data and/or establishing new repositories where researchers can deposit results of both successful and, importantly, unsuccessful experiments they have conducted.

On the former issue, the panelists agreed that professional non-profit organizations, such as the American Chemical Society (ACS), should consider opening up their extensive repositories or, at the very least, enabling broader academic access. Currently, the SciFinder dataset contains approximately 100 million reactions, yet it remains completely inaccessible for
downloads, severely limiting systematic data analyses. Given its status as a non-profit organization, the ACS is seen to have an ethical obligation to share the datasets it accumulates. While the CAS Common Registry initiative66 is appreciated, restricted licensing hinders research progress. Thinking more broadly, policies that require disseminating a complete set of data and code as a requirement for publication will help accelerate progress in this field. ACS has started defining research data policy recommendations to achieve this goal.67 An excellent example of this is RSC's new journal Digital Discovery,68 which has a dedicated data and code reviewer to assess submitted materials for documentation and reproducibility.

III. New applications

Non-equilibrium states

Particular emphasis should be given to developing benchmark training sets that extend beyond equilibrium structures.69 Such sets, e.g., Transition1x, should enable advancing methods capable of describing dynamics, activated processes, and chemical reaction networks/pathways.69,70

Utilizing experimental data

Computational data has played a significant role in AI-driven materials discovery. However, specific critical properties remain inaccessible to these computational approaches regarding real-world applications. To enhance the impact of computational discoveries, it becomes crucial to develop AI methods that can predict the synthesizability of materials.71 The panel emphasized the importance of establishing an efficient two-way communication channel between theoreticians and experimentalists, as well as the need for integrated autonomous workflows that bridge both domains.72–75

Simultaneously, the experimental literature tends to exhibit bias towards “success stories” while failed experiments often go unreported.76,77 This bias can arise from various factors, such as the superior performance or ease of synthesis and characterization of certain materials for unrelated applications. Consequently, the available data on chemical space for exploration with AI becomes limited, impeding the discovery of genuinely novel systems. From a modeling perspective, a data point perceived as a “failure” in experimental terms can be just as valuable for training models as a data point from a “successful” experiment. Although the concept of a “Journal of Failed Research” remains elusive, the panel suggested that well-documented and openly available metadata from experiments, regardless of outcomes, could address this limitation by providing theoreticians with more extensive and diverse training sets in terms of structure and composition. Moreover, it was highlighted that the context of an experiment matters in defining what constitutes a “failed experiment”. For instance, a seemingly failed experiment in one context may actually lead to successful outcomes or the discovery of new compounds in a different context.

During the discussions, the topic of how AI empowers creativity in chemistry was addressed. It was acknowledged that AI is ultimately a tool that accelerates technological advancements and scientific discoveries. The progress made in this field has undeniably expedited the pace of invention. It can also be argued that AI enhances the occurrence of “eureka moments” by facilitating new insights and understanding. This aspect is intricately linked to the exploration of new concepts and the perception of reality. As a creative discipline, chemistry is driven by scientists motivated to uncover novel phenomena, unencumbered by pre-established physical laws. For example, this could involve stabilizing challenging structures, creating unconventional solvation environments, or discovering previously unknown and aesthetically pleasing spin states. Therefore, by leveraging AI to comprehend the existing knowledge and venture into unexplored territories, creative pursuits in chemistry can be truly enhanced. In particular, the question of what it entails for AI to gain scientific understanding based on data is a very relevant question due to the advent of large language models (LLMs) and their applications to chemistry.78–81 In this context, philosophical and conceptual frameworks like the one proposed by Krenn et al. are needed.82

Addressing the multi-scale nature of materials

An example discussed was the need to provide detailed descriptions of the operating conditions of functional materials at their relevant scales,83–85 and under intended operating conditions.86 This information is crucial for facilitating inverse design. Much of the work in the field currently follows a bottom-up approach, focusing on the development of machine learning potentials to extend the accessible time- and length scales in atomic-scale simulations. This is necessary to ensure sufficient statistical sampling for retaining predictive accuracy.87,88 Different materials exhibit limiting processes and reactions at various scales. For instance, catalysts' activity and selectivity89 and the performance of thermoelectric materials90,91 are governed at the atomic scale, while durability and reliability involve processes at the meso- to micro-scale or beyond.

The concept of self-driving labs was also discussed,9 with considerations given to the expenses associated with building, maintaining, and operating such facilities, especially when tailored for testing various optimization algorithms. The idea of “virtual labs” emerged as an alternative, where multi-level modeling is utilized to mimic real-world experiments. For example, in the context of batteries, simulations running on materials could be linked to single-cell and battery-pack configurations to understand the key influences from microstructure to system performance.

There is also a need to approach data dynamically. Building data in a multi-modal capacity to capture different scales or incorporating new experiments and calculations is critical for aiding chemical discovery. It is crucial to emphasize the importance of top-down approaches, starting from the meso/micro-scale phase-field60 and seamlessly coupling them with ML potentials92 for autonomous parameterization. Additionally, to enable more meaningful AI-driven discoveries, it is highly desirable to restrict the search to compounds that are easy to synthesize and provide synthesis recipes.
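
The recommendation to restrict a search to compounds that are easy to synthesize can be read as a constrained screening step: filter by synthesizability first, then rank by the target property. The following is a minimal sketch under stated assumptions, not a method from the panel discussions. The `Candidate` fields, the `synthesizability` score in [0, 1], the 0.7 cutoff, and all numbers are hypothetical placeholders; the sketch assumes that separate upstream models have already produced both values for each candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                  # hypothetical identifier
    predicted_property: float  # model-predicted figure of merit (higher is better)
    synthesizability: float    # assumed score in [0, 1] from a separate model


def screen(candidates, min_synthesizability=0.7, top_k=3):
    """Keep candidates at or above a synthesizability cutoff, then rank by property."""
    feasible = [c for c in candidates if c.synthesizability >= min_synthesizability]
    ranked = sorted(feasible, key=lambda c: c.predicted_property, reverse=True)
    return ranked[:top_k]


# Toy, made-up data for illustration only.
pool = [
    Candidate("A", predicted_property=0.95, synthesizability=0.20),
    Candidate("B", predicted_property=0.80, synthesizability=0.90),
    Candidate("C", predicted_property=0.60, synthesizability=0.75),
    Candidate("D", predicted_property=0.85, synthesizability=0.65),
]

shortlist = screen(pool)
print([c.name for c in shortlist])
```

In this toy pool the best-scoring candidate, A, is dropped because its assumed synthesizability falls below the cutoff. A hard cutoff is only one possible design; the score could instead enter as a second objective in a multi-objective search.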


IV. Machine learning algorithms simulations and experimental data.100 Such models have the
potential to learn by effectively integrating diverse sources of
Given these considerations, the natural question also arises: information.
what other foundational AI advancements, explicitly addressing
This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence.

the needs of science datasets, are yet to be developed? What are


the current and future needs? The following non-exhaustive list Going beyond the interpolative nature of machine learning
represents the open challenges discussed as areas of focus for In the pursuit of discovering crystals or molecules with new
the AI community when interacting with the sciences. functionalities or improved properties, enhancing the extrapo-
lative performance of machine learning models becomes
crucial. However, due to the interpolative nature of ML models,
Open Access Article. Published on 06 December 2023. Downloaded on 5/31/2024 9:44:07 AM.

Encoding algorithms for science accurately predicting data from domains outside the training
Identifying AI algorithm development specic to the sciences data distribution remains a challenge.101 One intriguing and
(chemistry, physics, materials) that has been driven by clearly challenging topic discussed was the development of AI tech-
dened needs is an important consideration. One notable niques that consider the minimum amount of information
example is the effect of differentiation and the loss function in necessary to learn everything from the system. Additionally,
the case of organic molecules, as observed in the QM9 dataset. there was a signicant focus on the necessity and development
In this dataset, which provides quantum chemical properties of multi-objective optimizations for new materials
for a comprehensive chemical space of small organic molecules, discovery.102,103
the use of different loss functions for the training and testing Considering these fundamental AI advancements for
sets was necessary to discover new motifs with desired func- enabling chemical discovery, it was noted that most multi-
tionality.35 This requirement arises due to the unique challenge objective, multi-delity constrained problems addressed in
of extrapolating from known molecules to identifying motifs self-driving labs today tend to prioritize higher performance
and properties that differ from the original set encountered by based on predened objectives. However, to advance chemistry
the algorithm. This specic example highlights the demand for knowledge, algorithms need to be further tailored for inter-
novel machine learning techniques tailored to the eld of pretability, extrapolation to learn new science, and hypothesis
chemistry. testing, which fundamentally require different approaches. A
An additional example of algorithmic developments, recent example involves dedicated exploration of the Pareto
partially inspired by chemical applications, involves the front, allowing the extraction of local correlations with near-
construction of models that incorporate physical symmetries optimal performance to aid in result understanding.104,105
into their structure. In the case of interatomic potentials, since The subsequent topic of discussion revolved around using
the early stages of this eld the crucial insight has been the the acceleration and discovery of new molecules/materials
requirement for models to be exactly invariant to rotations, successfully validated in the lab as metrics of success in
translations, and atom index permutations.93 More recently, these ideas have been expanded to create physics-inspired models that build upon covariant features/representations, an extension motivated by the widespread presence of vectorial and tensorial targets in quantum chemistry.94 It is noteworthy that these developments have progressed independently and in parallel with similar efforts in computer science,95 albeit formulated using different terminology and with less mathematical generality.

During the panel discussions, intriguing questions were raised regarding the potential integration of data-centered and expert methods and the extent to which this integration could be achieved.96,97 Hybrid approaches were proposed as a means to leverage the encoded knowledge of experts while maintaining the flexibility and adaptability of data-driven approaches. It was also observed that the raw reaction rules derived from either of these approaches can be significantly enhanced through further refinement using quantum mechanical (QM) or molecular mechanical (MM) calculations. For instance, MM methods can be employed to calculate strains and estimate the applicability of reaction rules to cyclization reactions.98

Another notable example of a hybrid approach involves breaking down the barriers between different methodologies. This includes merging electronic structure theory and machine learning99 or creating a unified framework that combines

applying machine learning in chemistry. However, going beyond the speed of material development, true discovery of new concepts,82 such as topological materials, remains elusive. This led to the question of exploring deeper paths in AI to unlock such possibilities.106 One potential avenue is considering an automatic system that generates novel questions, although formulating the problems is typically within the domain of human experts. In scientific discovery, anomalies or outliers often lead to new findings. Optimization algorithms are already designed to find regions of high uncertainty in the parameter space, which are often unexplored. Rewarding data points in those regions, even if only a small percentage results in actual discoveries, can lead to the real discovery of new phenomena. Additionally, digitizing existing knowledge in chemistry and creating a comprehensive corpus of our current understanding can help define a concept of “known unknowns” for AI, making the idea less vague and facilitating exploration beyond what is already known. An example was shared regarding an automated robotic system developed by David MacMillan's group at Princeton University, which achieved “accelerated serendipity” by assembling molecules with no known history of interactions and rewarding accidental reactivity.107 This approach resulted in discovering new reactions or improved methods for existing reactions. Furthermore, emphasizing the uncertainty quantification of AI models was


highlighted as a critical step, as rewarding areas of large uncertainty in active learning frameworks necessitates the quantification and understanding of the epistemic and aleatoric uncertainty of the models,38,108,109 and the errors at each step.

V. Education

All participants unanimously agreed on the importance of introducing machine learning, AI, and autonomous research throughout the chemistry curriculum, starting at the undergraduate level and potentially even earlier. While the significance of specialized graduate education was acknowledged, these skills are deemed essential for all chemists. Both academia and industry increasingly seek applicants with a solid programming background. As a case study, Novo Nordisk, a major company in Northern Europe, is heavily investing in digital transformation and envisions a future where half of its chemists are computational (“non-wet”) chemists. Universities play a vital role in developing such a workforce. Therefore, our discussions primarily focused on undergraduate education unless specified otherwise.

Various educational strategies were explored during the discussions. One extreme example is Nanyang Technological University in Singapore, where university students are mandated to have coursework in computational thinking, data science, and machine learning. Similarly, Imperial College London and the Technical University of Denmark have university-wide initiatives to incorporate data and machine learning competencies within the undergraduate curriculum. Another approach involves offering dedicated single courses such as “Data Science for Chemistry” or “Autonomous Discovery” as upper-level electives.110,111 Some participants shared experiences of incorporating aspects of ML/AI/data science into existing courses or pedagogical laboratory experiences.112–116 Some of these adaptations were driven by the restrictions imposed by the COVID-19 pandemic. For instance, alternative machine-learning-oriented “computational labs” were developed as substitutes for traditional wet labs. Additionally, remote-control access to laboratory equipment117 and mailing students Lego kits to build and operate their autonomous systems were also explored. A recent review of low-cost self-driving laboratories collects many of the above efforts in comprehensive categorizations.118

Curriculum

What should comprise this curriculum? At a minimum, this coursework should train all chemistry students in (i) elementary programming, (ii) data management best practices, (iii) statistics, and (iv) elementary machine learning model construction and evaluation.

Many science and engineering degree programs already require computer programming or numerical computing courses. Historically, these courses were taught in FORTRAN or MATLAB, although the recent trend is to move towards Python, which has become the standard language for machine learning. There are both advantages and disadvantages to having this course taught by a computer science department, considering university politics and topical relevance to students. On the one hand, departments may be protective of their specific areas of study, and other departments may lack the staffing necessary to support the teaching of new classes. On the other hand, students often benefit from direct applications of programming to their primary coursework, which may be lacking in broader service courses. Regardless of how it is offered, it is crucial that students learn elementary programming as early as possible, as it serves as a foundational skill for the other topics covered in the curriculum. It also enables students to undertake projects in their final year focusing on automation or modeling. By adopting this approach, we can create a new generation of students proficient in coding.

Data management encompasses various aspects, including importing, visualization, and adhering to scientific practices such as FAIR data principles. It also involves the development of ontologies, schema, and understanding of intellectual property rights. Incorporating data management into the education of all chemists is crucial, as data generation is inherent to the field, and funding agencies as well as publishers require data management policies. One approach to instilling these practices is to have students create data management plans for projects or upload data from teaching labs to actual repositories. Emphasizing the importance of reporting every repetition of an experiment is essential. Comprehensive data management practices will greatly benefit students when preparing papers, and reinforcing these practices throughout their undergraduate and graduate education is highly valuable.

Statistics is a well-established field and requires no introduction. However, an ideal curriculum would place greater emphasis on computational approaches to statistics.119

New forms of education

A side conversation discussed the potential role of virtual reality (VR) in education.120,121 One panelist highlighted the use of VR in classes to enhance students' understanding of internal structures and processes within battery cells, as well as assist in building crystal structures. The discussion also touched upon the application of VR in outreach programs. For instance, in 2021, chemistry was the theme of the “Explore Science” fair organized by the Klaus Tschira Foundation (Germany) to inspire enthusiasm for natural sciences in schoolchildren. Activities involved realistic and interactive VR explorations with underlying simulations to foster early intuitions about chemistry even before university. It was noted that students often lack these intuitions due to chemistry kits becoming less engaging over time, depriving them of experiences that previous generations had at their age. To address this, the panel suggested the dissemination of VR methods and low-cost laboratory automation kits111,122–125 to high schools, which could improve student experiences while adhering to modern safety and liability restrictions. At the university level, digital twin simulations of laboratory processes could serve as pre-lab training opportunities, familiarizing students with equipment and lab procedures.


The literature also offers examples of mixed-reality enhancements in teaching microfluidics.47

Another potential application for training is using “body cam” footage or similar technologies to provide mentorship in the laboratory. The COVID-19 pandemic, with its need for remote work and limited laboratory occupancy, presented opportunities for pilot projects exploring augmented reality. In these projects, a trainer could supervise trainees from a remote location and provide relevant information directly into the trainee's field of view.

Challenges

The essential ideas in machine learning models for chemistry are continually evolving, with rapid advancements in important models and the rise of deep learning. However, certain core skills and best practices related to model construction, data leakage prevention, data augmentation, and model evaluation remain consistent. The rapid development and accessibility of machine learning software present their own challenges. It is easy to become overwhelmed and attempt to learn everything at once, leading to suboptimal understanding and application of concepts. Participants were cautious about suggesting specific topic selections due to the rapidly changing nature of the field.

When incorporating new computational material into coursework, trade-offs need to be made. Constructive overlaps can be found by substituting programming exercises for lengthy symbolic derivations or incorporating data analysis and sharing exercises instead of traditional laboratory report writing assignments. However, it is inevitable that some content will need to be removed. For instance, some institutions have chosen to reduce math components or replace manual experimental laboratory work with computer-based assignments, which has been well-received by students but has also caused tension within departments. Another approach could involve creating summer coding “bootcamps” that provide anywhere from 1–12 weeks of intensive coding experiences for undergraduate and graduate students, leveraging theory faculty members and inviting guest speakers. However, it is important to recognize that these extracurricular experiences may not engage all students and require faculty to donate their time. More case studies are needed to further explore these trade-offs, and the evolution of curriculum is expected to progress slowly.

Barriers and challenges exist in promoting the incorporation of machine learning into chemistry education. Many chemists outside of the subfield may not perceive it as essential and may lack the necessary skills to teach the material. However, there is value in providing a rigorous education based on fundamentals, and statistical data analysis may serve as a starting point that can act as a gateway to statistical learning methods.

Public perception

Chemistry faces an image problem compared to computer science. Enrollments in computer science programs are increasing while enrollments in physical sciences, including chemistry, are generally decreasing. One possible reason for this trend is the perception that software jobs are more prestigious than careers in science. Factors such as higher salaries, early exposure to computers compared to chemistry sets, and negative perceptions of chemistry as ‘polluting’ or ‘bad’ may contribute to this disparity. Despite being the architects of matter, chemists often remain in the background in many applications. The general public may need to be made aware of the significant role chemists and materials scientists play in scientific advancements, such as space exploration, where chemical expertise is essential for activities like analyzing samples and developing chemical processes. Promoting green chemistry can also enhance the appeal of chemistry by highlighting its potential to provide solutions rather than being perceived as a source of problems.126,127

The prospect of increased productivity through AI and autonomous research presents an opportunity to elevate the career value of chemists. However, changing perceptions about the role of AI in chemistry and securing investments in autonomous laboratories remain challenges. To attract attention and support, it is not enough to have robots in laboratories; the robots should engage in groundbreaking chemistry and contribute to discoveries that would otherwise be impossible.

On the brighter side

The rise of data-driven approaches in chemistry education may alleviate challenges by reducing the emphasis on memorization and increasing focus on generally applicable concepts and approaches. With the availability of databases and computational tools, students no longer need to rely solely on memorizing vast amounts of information.128 Instead, they can learn to access and utilize information effectively. This aligns with the changing perspectives of today's students, who view knowledge as the ability to access and apply information rather than simply remembering facts. The evolving nature of assessment also supports this shift. Many instructors adopted “open book” examinations during the COVID-19 pandemic, realizing that online resources easily overcome traditional memorization-based assessments.129,130 As a result, assessments now require higher-order thinking and problem-solving skills. By incorporating AI-related topics and emphasizing data analysis and decision-making, the curriculum can foster long-term learning and focus on critical concepts while equipping students with practical problem-solving skills.

VI. Conclusions

In conclusion, the 15th ASLLA Symposium offered valuable insights into the use of AI in accelerating chemical science. The discussions on Data, New Applications, Machine Learning Algorithms, and Education highlighted the pivotal role of AI-based techniques in driving the rapid advancement of research and development in the field, as well as the importance of incorporating ML and data science into the curriculum to educate future generations. Key future directions included fostering data transparency, exploring novel applications of AI in chemistry, refining machine learning algorithms for more accurate predictions, and integrating AI-based learning into


chemical education. The importance of cooperation among researchers, educators, associations, publishers, and companies was emphasized in all panel discussions to facilitate AI in chemical science. The authors anticipate the continuation of efforts from various fields, expecting that such endeavors will eventually lead to critical innovations in the field of chemistry.

Data availability

This Perspective is derived from the ASLLA symposium panel discussion, and as such, no data or codes are available for sharing.

Author contributions

Writing − original draft, Alán Aspuru-Guzik, Seoin Back, Michele Ceriotti, Ganna Gryn'ova, Bartosz Grzybowski, Geun Ho Gu, Jason Hein, Kedar Hippalgaonkar, Rodrigo Hormázabal, Yousung Jung, Seonah Kim, Woo Youn Kim, Seyed Mohamad Moosavi, Juhwan Noh, Changyoung Park, Joshua Schrier, Philippe Schwaller, Koji Tsuda, Tejs Vegge, O. Anatole von Lilienfeld, and Aron Walsh. Writing − review & editing, Seoin Back. Funding acquisitions, Yousung Jung, Alán Aspuru-Guzik, and O. Anatole von Lilienfeld.

Conflicts of interest

There are no conflicts to declare.

Acknowledgements

The symposium organizers (YJ, AAG, and AVL) are grateful to KIST for generous financial support to organize the symposium. YJ acknowledges support from IITP Korea (No. 2021-0-01343, Artificial Intelligence Graduate School Program for Seoul National University & No. 2021-0-02068, Artificial Intelligence Innovation Hub) and NRF of Korea funded by Ministry of Science and ICT (RS-2023-00283902). PS acknowledges support from the NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation. A. A.-G. acknowledges support from the Acceleration Consortium, a Canada First Research Excellence Fund at the University of Toronto as well as Anders G. Frøseth.

References

1 J. Yano, K. J. Gaffney, J. Gregoire, L. Hung, A. Ourmazd, J. Schrier, J. A. Sethian and F. M. Toma, Nat. Rev. Chem., 2022, 6, 357–370.
2 K. Jorner, A. Tomberg, C. Bauer, C. Sköld and P.-O. Norrby, Nat. Rev. Chem., 2021, 5, 240–255.
3 M. Meuwly, Chem. Rev., 2021, 121, 10218–10239.
4 P. Schwaller, A. C. Vaucher, R. Laplaza, C. Bunne, A. Krause, C. Corminboeuf and T. Laino, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2022, 12, e1604.
5 F. Strieth-Kalthoff, F. Sandfort, M. H. Segler and F. Glorius, Chem. Soc. Rev., 2020, 49, 6154–6168.
6 B. Huang and O. A. von Lilienfeld, Chem. Rev., 2021, 121, 10001–10036.
7 I. Poltavsky and A. Tkatchenko, J. Phys. Chem. Lett., 2021, 12, 6551–6564.
8 T. Zubatiuk and O. Isayev, Acc. Chem. Res., 2021, 54, 1575–1585.
9 E. Stach, B. DeCost, A. G. Kusne, J. Hattrick-Simpers, K. A. Brown, K. G. Reyes, J. Schrier, S. Billinge, T. Buonassisi and I. Foster, Matter, 2021, 4, 2702–2726.
10 J. H. Montoya, M. Aykol, A. Anapolsky, C. B. Gopal, P. K. Herring, J. S. Hummelshøj, L. Hung, H.-K. Kwon, D. Schweigert and S. Sun, Appl. Phys. Rev., 2022, 9, 011405.
11 K. Hippalgaonkar, Q. Li, X. Wang, J. W. Fisher III, J. Kirkpatrick and T. Buonassisi, Nat. Rev. Mater., 2023, 8, 241–260.
12 B. Sanchez-Lengeling and A. Aspuru-Guzik, Science, 2018, 361, 360–365.
13 D. P. Tabor, L. M. Roch, S. K. Saikin, C. Kreisbeck, D. Sheberla, J. H. Montoya, S. Dwaraknath, M. Aykol, C. Ortiz and H. Tribukait, Nat. Rev. Mater., 2018, 3, 5–20.
14 Z. Yao, Y. Lum, A. Johnston, L. M. Mejia-Mendoza, X. Zhou, Y. Wen, A. Aspuru-Guzik, E. H. Sargent and Z. W. Seh, Nat. Rev. Mater., 2023, 8, 202–215.
15 R. Pollice, G. dos Passos Gomes, M. Aldeghi, R. J. Hickman, M. Krenn, C. Lavigne, M. Lindner-D’Addario, A. Nigam, C. T. Ser and Z. Yao, Acc. Chem. Res., 2021, 54, 849–860.
16 Z. Yao, B. Sánchez-Lengeling, N. S. Bobbitt, B. J. Bucior, S. G. H. Kumar, S. P. Collins, T. Burns, T. K. Woo, O. K. Farha and R. Q. Snurr, Nat. Mach. Intell., 2021, 3, 76–86.
17 N. S. Eyke, B. A. Koscher and K. F. Jensen, Trends Chem., 2021, 3, 120–132.
18 B. A. Grzybowski, T. Badowski, K. Molga and S. Szymkuć, Wiley Interdiscip. Rev.: Comput. Mol. Sci., 2023, 13, e1630.
19 M. Seifrid, R. Pollice, A. Aguilar-Granda, Z. Morgan Chan, K. Hotta, C. T. Ser, J. Vestfrid, T. C. Wu and A. Aspuru-Guzik, Acc. Chem. Res., 2022, 55, 2454–2466.
20 N. J. Szymanski, Y. Zeng, H. Huo, C. J. Bartel, H. Kim and G. Ceder, Mater. Horiz., 2021, 8, 2169–2198.
21 S. M. Moosavi, K. M. Jablonka and B. Smit, J. Am. Chem. Soc., 2020, 142, 20273–20287.
22 J. A. Bennett and M. Abolhasani, Curr. Opin. Chem. Eng., 2022, 36, 100831.
23 H. Tao, T. Wu, M. Aldeghi, T. C. Wu, A. Aspuru-Guzik and E. Kumacheva, Nat. Rev. Mater., 2021, 6, 701–716.
24 Z. Bao, J. Bufton, R. J. Hickman, A. Aspuru-Guzik, P. Bannigan and C. Allen, Adv. Drug Delivery Rev., 2023, 115108.
25 Y. Ivanenkov, B. Zagribelnyy, A. Malyshev, S. Evteev, V. Terentiev, P. Kamya, D. Bezrukov, A. Aliper, F. Ren and A. Zhavoronkov, ACS Med. Chem. Lett., 2023, 14, 901–915.
26 A. L. Ferguson and K. A. Brown, Annu. Rev. Chem. Biomol. Eng., 2022, 13, 25–44.
27 F. Häse, L. M. Roch and A. Aspuru-Guzik, Trends Chem., 2019, 1, 282–291.
28 R. J. Hickman, P. Bannigan, Z. Bao, A. Aspuru-Guzik and C. Allen, Matter, 2023, 6, 1071–1081.


29 S. Lo, S. Baird, J. Schrier, B. Blaiszik, S. Kalinin, H. Tran, T. Sparks and A. Aspuru-Guzik, ChemRxiv, 2023, DOI: 10.26434/chemrxiv-2023-6z9mq.
30 N. Artrith, K. T. Butler, F.-X. Coudert, S. Han, O. Isayev, A. Jain and A. Walsh, Nat. Chem., 2021, 13, 505–508.
31 G. Vishwakarma, A. Sonpal and J. Hachmann, Trends Chem., 2021, 3, 146–156.
32 A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey and I. Sutskever, arXiv preprint arXiv:2212.04356, 2022.
33 EXAONE, https://www.lgresearch.ai/exaone, accessed 19th Sep, 2023.
34 S. M. Moosavi, A. Nandy, K. M. Jablonka, D. Ongari, J. P. Janet, P. G. Boyd, Y. Lee, B. Smit and H. J. Kulik, Nat. Commun., 2020, 11, 1–10.
35 R. Ramakrishnan, P. O. Dral, M. Rupp and O. A. von Lilienfeld, Sci. Data, 2014, 1, 1–7.
36 P. Zaspel, B. Huang, H. Harbrecht and O. A. von Lilienfeld, J. Chem. Theory Comput., 2018, 15, 1546–1559.
37 R. Batra, G. Pilania, B. P. Uberuaga and R. Ramprasad, ACS Appl. Mater. Interfaces, 2019, 11, 24906–24918.
38 J. Busk, P. B. Jørgensen, A. Bhowmik, M. N. Schmidt, O. Winther and T. Vegge, Mach. Learn.: Sci. Technol., 2021, 3, 015012.
39 S. M. Moosavi, B. Á. Novotny, D. Ongari, E. Moubarak, M. Asgari, Ö. Kadioglu, C. Charalambous, A. Ortega-Guerrero, A. H. Farmahini and L. Sarkisov, Nat. Mater., 2022, 21, 1419–1425.
40 S. Kim, J. Noh, G. H. Gu, A. Aspuru-Guzik and Y. Jung, ACS Cent. Sci., 2020, 6, 1412–1420.
41 J. Noh, J. Kim, H. S. Stein, B. Sanchez-Lengeling, J. M. Gregoire, A. Aspuru-Guzik and Y. Jung, Matter, 2019, 1, 1370–1384.
42 P. C. St. John, Y. Guan, Y. Kim, S. Kim and R. S. Paton, Nat. Commun., 2020, 11, 2328.
43 M. Uhrin, S. P. Huber, J. Yu, N. Marzari and G. Pizzi, Comput. Mater. Sci., 2021, 187, 110086.
44 C. Draxl and M. Scheffler, MRS Bull., 2018, 43, 676–682.
45 C. Draxl and M. Scheffler, J. Phys.: Mater., 2019, 2, 036001.
46 M. H. Segler, M. Preuss and M. P. Waller, Nature, 2018, 555, 604–610.
47 C. W. Coley, D. A. Thomas III, J. A. Lummiss, J. N. Jaworski, C. P. Breen, V. Schultz, T. Hart, J. S. Fishman, L. Rogers and H. Gao, Science, 2019, 365, eaax1566.
48 P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas and A. A. Lee, ACS Cent. Sci., 2019, 5, 1572–1583.
49 P. Schwaller, R. Petraglia, V. Zullo, V. H. Nair, R. A. Haeuselmann, R. Pisoni, C. Bekas, A. Iuliano and T. Laino, Chem. Sci., 2020, 11, 3316–3325.
50 S. Genheden, A. Thakkar, V. Chadimová, J.-L. Reymond, O. Engkvist and E. Bjerrum, J. Cheminf., 2020, 12, 70.
51 S. Chen and Y. Jung, JACS Au, 2021, 1, 1612–1620.
52 D. Lowe, Chemical reactions from US patents (1976-Sep2016), https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873, accessed 19th Sep, 2023.
53 J. Mayfield, D. Lowe and R. Sayle, Pistachio, https://www.nextmovesoftware.com/pistachio.html, accessed 19th Sep, 2023.
54 Reaxys, https://www.reaxys.com, accessed 19th Sep, 2023.
55 SciFinder, https://scifinder.cas.org, accessed 19th Sep, 2023.
56 S. Szymkuć, T. Badowski and B. A. Grzybowski, Angew. Chem., 2021, 133, 26430–26436.
57 S. M. Kearnes, M. R. Maser, M. Wleklinski, A. Kast, A. G. Doyle, S. D. Dreher, J. M. Hawkins, K. F. Jensen and C. W. Coley, J. Am. Chem. Soc., 2021, 143, 18820–18826.
58 G. Pesciullesi, P. Schwaller, T. Laino and J.-L. Reymond, Nat. Commun., 2020, 11, 4874.
59 P. Seidl, P. Renz, N. Dyubankova, P. Neves, J. Verhoeven, J. K. Wegner, M. Segler, S. Hochreiter and G. Klambauer, J. Chem. Inf. Model., 2022, 62, 2111–2120.
60 D. P. Kovács, W. McCorkindale and A. A. Lee, Nat. Commun., 2021, 12, 1695.
61 B. Mikulak-Klucznik, P. Gołębiowska, A. A. Bayly, O. Popik, T. Klucznik, S. Szymkuć, E. P. Gajewska, P. Dittwald, O. Staszewska-Krajewska and W. Beker, Nature, 2020, 588, 83–88.
62 M. Busch, M. D. Wodrich and C. Corminboeuf, Chem. Sci., 2015, 6, 6754–6761.
63 T. Gensch, G. dos Passos Gomes, P. Friederich, E. Peters, T. Gaudin, R. Pollice, K. Jorner, A. Nigam, M. Lindner-D’Addario and M. S. Sigman, J. Am. Chem. Soc., 2022, 144, 1205–1217.
64 M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos and P. E. Bourne, Sci. Data, 2016, 3, 1–9.
65 Open Research Data and Data Management Plans, 2022, https://erc.europa.eu/sites/default/files/document/file/ERC_info_document-Open_Research_Data_and_Data_Management_Plans.pdf.
66 CAS Common Chemistry, https://commonchemistry.cas.org/, accessed 22nd July, 2023.
67 ACS Research Data Guidelines, https://publish.acs.org/publish/data_guidelines, accessed 22nd July, 2023.
68 Digital Discovery, https://www.rsc.org/journals-books-databases/about-journals/digital-discovery/, accessed 22nd July, 2023.
69 M. Schreiner, A. Bhowmik, T. Vegge, J. Busk and O. Winther, Sci. Data, 2022, 9, 779.
70 M. Schreiner, A. Bhowmik, T. Vegge, P. B. Jørgensen and O. Winther, Mach. Learn.: Sci. Technol., 2022, 3, 045022.
71 J. Jang, G. H. Gu, J. Noh, J. Kim and Y. Jung, J. Am. Chem. Soc., 2020, 142, 18836–18843.
72 L. M. Roch, F. Häse, C. Kreisbeck, T. Tamayo-Mendoza, L. P. Yunker, J. E. Hein and A. Aspuru-Guzik, PLoS One, 2020, 15, e0229862.
73 L. M. Roch, F. Häse, C. Kreisbeck, T. Tamayo-Mendoza, L. P. Yunker, J. E. Hein and A. Aspuru-Guzik, Sci. Robot., 2018, 3, eaat5559.
74 M. Sim, M. G. Vakili, F. Strieth-Kalthoff, H. Hao, R. Hickman, S. Miret, S. Pablo-García and A. Aspuru-Guzik, ChemRxiv, 2023, DOI: 10.26434/chemrxiv-2023-v2khf.
75 M. Vogler, J. Busk, H. Hajiyani, P. B. Jørgensen, N. Safaei, I. E. Castelli, F. F. Ramirez, J. Carlsson, G. Pizzi and S. Clark, Matter, 2023, 6, 2647–2665.
76 S. M. Moosavi, A. Chidambaram, L. Talirz, M. Haranczyk, K. C. Stylianou and B. Smit, Nat. Commun., 2019, 10, 539.
77 X. Jia, A. Lynch, Y. Huang, M. Danielson, I. Lang’at, A. Milder, A. E. Ruby, H. Wang, S. A. Friedler and A. J. Norquist, Nature, 2019, 573, 251–255.
78 M. Skreta, N. Yoshikawa, S. Arellano-Rubach, Z. Ji, L. B. Kristensen, K. Darvish, A. Aspuru-Guzik, F. Shkurti and A. Garg, arXiv, 2023, preprint arXiv:2303.14100, DOI: 10.48550/arXiv.2303.14100.
79 A. M. Bran, S. Cox, A. D. White and P. Schwaller, arXiv, 2023, preprint arXiv:2304.05376, DOI: 10.48550/arXiv.2304.05376.
80 D. A. Boiko, R. MacKnight and G. Gomes, arXiv, 2023, preprint arXiv:2304.05332, DOI: 10.48550/arXiv.2304.05332.
81 G. M. Hocky and A. D. White, Digital Discovery, 2022, 1, 79–83.
82 M. Krenn, R. Pollice, S. Y. Guo, M. Aldeghi, A. Cervera-Lierta, P. Friederich, G. dos Passos Gomes, F. Häse, A. Jinich and A. Nigam, Nat. Rev. Phys., 2022, 4, 761–769.
83 V. L. Deringer, N. Bernstein, G. Csányi, C. Ben Mahmoud, M. Ceriotti, M. Wilson, D. A. Drabold and S. R. Elliott, Nature, 2021, 589, 59–64.
84 S. Han, G. Barcaro, A. Fortunelli, S. Lysgaard, T. Vegge and H. A. Hansen, npj Comput. Mater., 2022, 8, 121.
85 G. H. Gu, J. Lim, C. Wan, T. Cheng, H. Pu, S. Kim, J. Noh, C. Choi, J. Kim and W. A. Goddard III, J. Am. Chem. Soc., 2021, 143, 5355–5363.
86 S. Han, S. Lysgaard, T. Vegge and H. A. Hansen, npj Comput. Mater., 2023, 9, 139.
87 M. Ceriotti, C. Clementi and O. Anatole von Lilienfeld, Chem. Rev., 2021, 121, 9719–9721.
88 A. E. Mikkelsen, H. H. Kristoffersen, J. Schiøtz, T. Vegge, H. A. Hansen and K. W. Jacobsen, Phys. Chem. Chem. Phys., 2022, 24, 9885–9890.
89 Q. Wang, J. Pan, J. Guo, H. A. Hansen, H. Xie, L. Jiang, L. Hua, H. Li, Y. Guan and P. Wang, Nat. Catal., 2021, 4, 959–967.
90 Y. Zhang, Y. Zheng, K. Rui, H. H. Hng, K. Hippalgaonkar, J. Xu, W. Sun, J. Zhu, Q. Yan and W. Huang, Small, 2017, 13, 1700661.
91 D. Bash, Y. Cai, V. Chellappan, S. L. Wong, X. Yang, P. Kumar, J. D. Tan, A. Abutaha, J. J. Cheng and Y. F. Lim, Adv. Funct. Mater., 2021, 31, 2102606.
92 P. Friederich, F. Häse, J. Proppe and A. Aspuru-Guzik, Nat. Mater., 2021, 20, 750–761.
93 F. Musil, A. Grisafi, A. P. Bartók, C. Ortner, G. Csányi and M. Ceriotti, Chem. Rev., 2021, 121, 9759–9815.
94 A. Grisafi, D. M. Wilkins, G. Csányi and M. Ceriotti, Phys. Rev. Lett., 2018, 120, 036002.
95 T. Cohen and M. Welling, Group Equivariant Convolutional Networks, Proceedings of The 33rd International Conference on Machine Learning, 2016, https://proceedings.mlr.press/v48/cohenc16.html.
96 R. Gómez-Bombarelli, J. Aguilera-Iparraguirre, T. D. Hirzel, D. Duvenaud, D. Maclaurin, M. A. Blood-Forsythe, H. S. Chae, M. Einzinger, D.-G. Ha and T. Wu, Nat. Mater., 2016, 15, 1120–1127.
97 S. Nagasawa, E. Al-Naamani and A. Saeki, J. Phys. Chem. Lett., 2018, 9, 2639–2646.
98 K. Molga, E. P. Gajewska, S. Szymkuć and B. A. Grzybowski, React. Chem. Eng., 2019, 4, 1506–1521.
99 M. Ceriotti, MRS Bull., 2022, 47, 1045–1053.
100 J. Weinreich, N. J. Browning and O. A. von Lilienfeld, J. Chem. Phys., 2021, 154, 134113.
101 J. Schrier, A. J. Norquist, T. Buonassisi and J. Brgoch, J. Am. Chem. Soc., 2023, 145, 21699–21716.
102 F. Häse, L. M. Roch and A. Aspuru-Guzik, Chem. Sci., 2018, 9, 7642–7655.
103 R. Hickman, M. Sim, S. Pablo-García, I. Woolhouse, H. Hao, Z. Bao, P. Bannigan, C. Allen, M. Aldeghi and A. Aspuru-Guzik, ChemRxiv, 2023, DOI: 10.26434/chemrxiv-2023-8nrxx.
104 C. J. Taylor, A. Pomberger, K. C. Felton, R. Grainger, M. Barecka, T. W. Chamberlain, R. A. Bourne, C. N. Johnson and A. A. Lapkin, Chem. Rev., 2023, 123, 3089–3126.
105 J. C. Fromer and C. W. Coley, Patterns, 2023, 4, 100678.
106 J. Schrier, A. J. Norquist, T. Buonassisi and J. Brgoch, J. Am. Chem. Soc., 2023, 145, 21699–21716.
107 A. McNally, C. K. Prier and D. W. MacMillan, Science, 2011, 334, 1114–1117.
108 J. Busk, M. Schmidt, O. Winther, T. Vegge and P. B. Jørgensen, Phys. Chem. Chem. Phys., 2023, 25, 25828–25837.
109 S. Chen and Y. Jung, Nat. Mach. Intell., 2022, 4, 772–780.
110 S. Vargas, S. Zamirpour, S. Menon, A. Rothman, F. Häse, T. Tamayo-Mendoza, J. Romero, S. Sim, T. Menke and A. Aspuru-Guzik, J. Chem. Educ., 2020, 97, 689–694.
111 L. Saar, H. Liang, A. Wang, A. McDannald, E. Rodriguez, I. Takeuchi and A. G. Kusne, MRS Bull., 2022, 47, 881–885.
112 A. K. Sharma, J. Comput. Sci. Educ., 2021, 12, 8–15.
113 E. S. Thrall, S. E. Lee, J. Schrier and Y. Zhao, J. Chem. Educ., 2021, 98, 3269–3276.
114 D. Revignas and V. Amendola, J. Chem. Educ., 2022, 99, 2112–2120.
115 D. Lafuente, B. Cohen, G. Fiorini, A. A. García, M. Bringas, E. Morzan and D. Onna, J. Chem. Educ., 2021, 98, 2892–2898.
116 A. G. St James, L. Hand, T. Mills, L. Song, A. S. J. Brunt, P. E. Bergstrom Mann, A. F. Worrall, M. I. Stewart and C. Vallance, J. Chem. Educ., 2023, 100, 1343–1350.
117 R. C. Cachichi, G. Girotto Junior, E. Galembeck, J. A. M. Schewinsky Junior, D. Ferreira Gomes and J. d. A. Simoni, J. Chem. Educ., 2020, 97, 3667–3672.
118 S. Lo, S. Baird, J. Schrier, B. Blaiszik, S. Kalinin, H. Tran, T. Sparks and A. Aspuru-Guzik, 2023, DOI: 10.26434/chemrxiv-2023-6z9mq-v2.
119 J. Vanderplas, Statistics for hackers, Portland, Oregon, 2016.


120 M. Abdinejad, B. Talaie, H. S. Qorbani and S. Dalili, J. Sci. Educ. Technol., 2021, 30, 87–96.
121 R. van Dinther, L. de Putter and B. Pepin, J. Chem. Educ., 2023, 100, 1537–1546.
122 S. G. Baird and T. D. Sparks, Matter, 2022, 5, 4170–4178.
123 R. Keesey, R. LeSuer and J. Schrier, HardwareX, 2022, 12, e00319.
124 L. C. Gerber, A. Calasanz-Kaiser, L. Hyman, K. Voitiuk, U. Patil and I. H. Riedel-Kruse, PLoS Biol., 2017, 15, e2001413.
125 E. Li, A. T. Lam, T. Fuhrmann, L. Erikson, M. Wirth, M. L. Miller, P. Blikstein and I. H. Riedel-Kruse, PLoS One, 2022, 17, e0275688.
126 L. B. Armstrong, M. C. Rivas, Z. Zhou, L. M. Irie, G. A. Kerstiens, M. T. Robak, M. C. Douskey and A. M. Baranger, J. Chem. Educ., 2019, 96, 2410–2419.
127 Y. Liu, J. Chem. Educ., 2022, 99, 2588–2596.
128 G. N. Quam, J. Chem. Educ., 1940, 17, 363.
129 R. M. Baker, M. E. Leonard and B. H. Milosavljevic, J. Chem. Educ., 2020, 97, 3097–3101.
130 J. G. Nguyen, K. J. Keuseman and J. J. Humston, J. Chem. Educ., 2020, 97, 3429–3435.
