
SUSIE Pharmaceutical CMC Ontology-Based Information Extraction For Drug Development Using Machine Learning

The document describes a framework called SUSIE that was developed to automatically extract information from unstructured pharmaceutical documents using machine learning. SUSIE uses a custom-built pharmaceutical drug development ontology, a weak supervision framework, contextualization algorithms, and a fine-tuned BioBERT model. On entity identification tasks, SUSIE achieves a test accuracy and F1-score of 96% and 88%, respectively, on out-of-sample documents. The framework generates knowledge graphs representing crucial information in a structured format to accelerate pharmaceutical product development.

Uploaded by

Rakesh Sharbidre

Computers and Chemical Engineering 179 (2023) 108446

Contents lists available at ScienceDirect

Computers and Chemical Engineering


journal homepage: www.elsevier.com/locate/cace

SUSIE: Pharmaceutical CMC ontology-based information extraction for drug development using machine learning
Vipul Mann a , Shekhar Viswanath b , Shankar Vaidyaraman b , Jeya Balakrishnan b ,
Venkat Venkatasubramanian a ,∗
a Department of Chemical Engineering, Columbia University, New York, NY, United States of America
b Eli Lilly and Company, Lilly Corporate Center, Indianapolis, IN, United States of America

ARTICLE INFO

Keywords: Ontology; Pharmaceutical drug development; Information extraction; Hybrid machine learning; Chemistry manufacturing and control

ABSTRACT

Automatically extracting information from unstructured text in pharmaceutical documents is important for drug discovery and development. This information can be integrated with structured datasets to ultimately accelerate pharmaceutical product development. To this end, we report an end-to-end information extraction framework based on a custom-built pharmaceutical drug development ontology, a weak supervision framework, contextualization algorithms, and a fine-tuned BioBERT model (adaptation of BERT or Bidirectional Encoder Representations from Transformers for biomedical text). The proposed framework, SUSIE (Schema-based Unsupervised Semantic Information Extraction), was trained on ICH (International Conference on Harmonization) documents to identify important entities and relations from unstructured text and auto-generate knowledge graphs representing crucial information in a structured format. On the entity identification task, the framework achieves a test accuracy and F1-score of 96% and 88%, respectively, on out-of-sample documents. A major contribution of this work is to build an automated, unsupervised information extraction framework around a domain-specific, custom-built pharmaceutical drug development ontology without the need for manual curation of training datasets for specific tasks. The efficacy of the approach was tested on out-of-sample documents including an internal Eli Lilly technical document.

1. Introduction

As the biomedical literature is growing by over a million publications each year (Simon et al., 2019), an efficient, automated information extraction framework is needed. An automated information extraction framework that uncovers rich information buried in thousands of unstructured documents sitting idle in data repositories would accelerate the drug development cycle. This would lead to quicker, more efficient, and more objective analysis of new drug applications, better historical data search capabilities, efficient trend extraction from unstructured pharmaceutical text, and easier regulatory monitoring by integrating compliance documents in FDA's risk-based selection model for prioritizing inspections (U.S. Food and Drug Administration, 2018). Such a framework is in line with the U.S. Food and Drug Administration's (FDA) new pharmaceutical quality initiative, knowledge-aided assessment and structured application (KASA) (Lawrence et al., 2019), based on formalizing drug product knowledge-base and knowledge-assessment using a structured approach.

Information extraction primarily involves automatically identifying specific domain-related information using the semantics and grammatical structure of text. This broadly comprises two steps: identifying entities signifying specific information of interest (known as named entity recognition or NER), and inferring the relationships between identified entities (known as relation extraction or RE). Approaches underlying these tasks include using grammatical structures to extract semantic frames (Beuls et al., 2021; Schmitz et al., 2012), using knowledge bases for relation inference (Zhang et al., 2019), syntactic dependency parsing-based information extraction (Gamallo et al., 2012), semantic role labeling (Christensen et al., 2011), coreference resolution (Versley et al., 2008), and so on. There have been prior attempts in the pharmaceutical domain to use natural language-based methods for information extraction with various applications. Viswanath et al. (2021a) developed a data-driven tool called IMDP for rapidly developing a new drug briefing document. The methods underlying their approach are calibrated quantum mesh (Kulkarni et al., 2018), image processing, and search capability combined together to address various needs required for efficient document creation. Kang et al. (2017) developed an

∗ Corresponding author.
E-mail address: [email protected] (V. Venkatasubramanian).

https://doi.org/10.1016/j.compchemeng.2023.108446
Received 5 June 2023; Received in revised form 12 September 2023; Accepted 30 September 2023
Available online 7 October 2023
0098-1354/© 2023 Elsevier Ltd. All rights reserved.

open-source information extraction framework called EliIE for parsing and formalizing free-text clinical research eligibility criteria. Yuan et al. (2019) proposed an information extraction pipeline to transform unstructured text-based eligibility criteria into structured representation as shareable clinical data queries. Xu et al. (2010) developed MedEx, a system for medication information extraction from clinical notes, and utilized NLP tools like a semantic tagger and parser for generating structured information/data. Harmata et al. (2017) reported a semi-automatic approach for information extraction for legal and regulatory applications. Gentile et al. (2019) developed a knowledge graph-based pipeline for semantic information extraction from medical package inserts. Such graph-based representations capture semantic information and are ubiquitous for representing proteins (Saidi et al., 2009), chemical reactions (Gothard et al., 2012; Mann and Venkatasubramanian, 2023), process flowsheets (Mann et al., 2023b; Hirtreiter et al., 2022), and many more applications of graph-based data mining as presented in Washio and Motoda (2003). Skeppstedt et al. (2014) reported a conditional random field to recognize clinical entities in health records. Moreover, given the challenges posed by biomedical text containing chemical names, symbols, tables, equations, technical terms, and so on, several adapted named entity recognition approaches have been reported for identifying biomedical terms including proteins, genes, and disease names (Leser and Hakenberg, 2005; Gaizauskas et al., 2003; Collier et al., 2000; Shen et al., 2003; Sasaki et al., 2008). A detailed description of prior works across various applications is provided in the excellent review by Bhatnagar et al. (2022).

Despite previous works, the challenges in extracting end-to-end relevant important information from pharmaceutical documents still remain largely unaddressed. First, there is a need to systematically define the important information from pharmaceutical documents before building a system that recognizes and extracts it automatically. For instance, the crucial information would not just be all the named entities (drugs, chemicals, etc.) but also the necessary conditions, dosage, level of impurities, packaging type, risk factors involved, manufacturing processes, additional contextual information, and so on. Moreover, this information should be custom-defined by the user based on their downstream applications, which might vary significantly at various stages of the drug development process. Due to this, the standard annotated datasets with entities of a given type (disease names, chemicals, phenotypes) and standard defined relations (Huang et al., 2020; Pilehvar et al., 2022; Luo et al., 2022) could not be used for such generalized information extraction. Second, as a result, general-purpose labeled datasets that could be used for training ML-based classifiers to identify important information and for benchmarking their performance are not available for a customized and flexible information extraction problem. Hence, an unsupervised learning framework is necessary to mitigate these issues and extract the relevant information from pharmaceutical documents automatically in a dataset-agnostic and domain-specific manner. Moreover, due to the lack of availability of such labeled datasets and the need for a flexible, customizable information extraction framework, large language models (LLMs) like GPT-3 (Brown et al., 2020) are not suitable for a domain-informed pharmaceutical information extraction. LLMs are primarily trained on non-technical English language text, are characterized by a lack of transparency, and are not flexible in capturing domain knowledge. Further, the associated challenges with generative AI-based models such as 'hallucinations' due to their auto-regressive nature, along with other limitations highlighted in Lee et al. (2023), make them unsuitable for pharmaceutical information extraction.

In this work, we propose a Schema (i.e., ontology)-based Unsupervised Semantic Information Extraction (or SUSIE), a custom pharmaceutical ontology-based weak supervision framework combined with a BioBERT language model for generalized, end-to-end information extraction from unstructured pharmaceutical documents. The underlying framework for SUSIE is based on (i) a custom-built pharmaceutical drug development and manufacturing ontology that organizes information underlying the drug development process; (ii) a weak supervision framework with labeling functions and unified medical language system (UMLS) database (Bodenreider, 2004) inspired by clinical entity classification work of Fries et al. (2021); (iii) natural language dependency structure-based contextual information extraction combined with custom rules for processing various data formats; (iv) a fine-tuned BioBERT language model (Lee et al., 2020) trained on relevant pharmaceutical documents; and (v) a relation extraction module based on linguistic structure (Angeli et al., 2015) to extract semantic triples (subject, relation, object) from text. We combine these triples with extracted information to generate knowledge graph representations of important information. The developed framework could be used for performing automated relevant information extraction from pharmaceutical documents in a domain-specific and context-aware manner and is adaptable to other domains.

The rest of the paper is organized as follows: in Section 2, we formally define the problem statement and objectives along with presentation of an example that demonstrates unstructured text and structured knowledge graph representation; in Section 3, we provide background information on the methodology underlying the developed SUSIE framework, namely the custom-built drug development ontology in Section 3.1, overview of standard ontologies in Section 3.2, text preprocessing steps performed in Section 3.3, weak supervision framework for programmatically generating annotations for unlabeled data in Section 3.4, details on machine learning classifier built using BioBERT model in Section 3.5, extracting relevant context in Section 3.6, and relation extraction to generate knowledge graphs in Section 3.7; details on the dataset used and associated model training are provided in Section 4; the model performance statistics and examples of relevant information extraction along with generated knowledge graphs are presented in Section 5; finally, the conclusions and future directions of this work appear in Section 6.

2. Problem statement and objectives

Given an unstructured text document which possibly contains information relevant for drug discovery, development, and manufacturing at any stage (from research on the drug molecule to the final drug product packaging), the objective is to automatically identify important information across the document, identify associated relationships between them, and represent them in a structured manner as a knowledge graph. Thus, the input is a document, $D_i$, containing a sequence of sentences, $\{S_j\}$, where each sentence is a sequence of words, $(w_{j,t})$, and the objective is to identify the collection of information chunks relevant for drug development from each sentence in the document, $\epsilon_j$, and their associated relations. Formally, consider a document $D_i$ comprising $n$ sentences,

$$D_i = \{S_j\}_{j=1}^{n}; \quad S_j = (w_{j,1}, w_{j,2}, \ldots, w_{j,t}) \tag{1}$$

where $S_j$ is the $j^{th}$ sentence comprising a sequence of $t$ words. The relevant information contained in the document, $\Gamma(D_i)$, is given as,

$$\Gamma(D_i) = \{\epsilon_j\}_{j=1}^{n} = \{\{\delta_{j,k}\}_{k=1}^{p}\}_{j=1}^{n}; \quad \epsilon_j = \{\delta_{j,1}, \delta_{j,2}, \ldots, \delta_{j,p}\}; \quad \delta_{j,k} = (w_{j,m}, w_{j,m+1}, \ldots, w_{j,n}) \tag{2}$$

where $\epsilon_j$ is the collection of $p$ information chunks in the $j^{th}$ sentence and $\delta_{j,k}$ is a contiguous sequence of important words between the $m^{th}$ and $n^{th}$ positions in the sentence. Therefore, each sentence $S_j$ is characterized by $\epsilon_j$, the ordered collection of sets of important words $\delta_{j,k}$. Our objective is to identify such chunks of information from each document $D_i$ that contains unlabeled sentences with no prior information on $\delta_{j,k}$. This is equivalent to learning the transformation function $\Gamma(.)$ that operates on a document $D_i$ to give $\{\epsilon_j\}_{j=1}^{n}$ as shown in Eq. (2).
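As a minimal sketch of the notation in Eqs. (1) and (2), the snippet below represents a document as a list of tokenized sentences and each information chunk $\delta_{j,k}$ as a pair of 1-indexed word positions $(m, n)$. The names `Chunk`, `chunk_text`, and `gamma` are illustrative and do not come from SUSIE, whitespace tokenization is a simplifying assumption, and the spans are supplied by hand rather than learned. The sentence, spans, and relations are those of the tetrazolium-salt example used in this section.

```python
from typing import List, Tuple

Sentence = Tuple[str, ...]   # S_j = (w_{j,1}, ..., w_{j,t})
Document = List[Sentence]    # D_i = {S_j}, j = 1..n
Chunk = Tuple[int, int]      # delta_{j,k} as a 1-indexed, inclusive span (m, n)


def chunk_text(sentence: Sentence, chunk: Chunk) -> str:
    """Return the contiguous words w_{j,m} ... w_{j,n} of one chunk."""
    m, n = chunk
    if not (1 <= m <= n <= len(sentence)):
        raise ValueError(f"chunk {chunk} lies outside the sentence")
    return " ".join(sentence[m - 1:n])


def gamma(document: Document, spans: List[List[Chunk]]) -> List[List[str]]:
    """Toy stand-in for Gamma(D_i): map hand-given spans to chunk texts.

    In SUSIE the spans are what the framework has to identify; here they
    are passed in purely to illustrate the data layout.
    """
    if len(document) != len(spans):
        raise ValueError("need one list of spans (epsilon_j) per sentence")
    return [[chunk_text(s, c) for c in eps] for s, eps in zip(document, spans)]


text = ("The cell growth is assessed by a staining method using a "
        "tetrazolium salt which is converted by cellular dehydrogenases "
        "to a colored formazan product")
document: Document = [tuple(text.split())]  # whitespace tokens (assumption)
epsilon_1: List[Chunk] = [(2, 3), (8, 9), (12, 13), (18, 19), (22, 24)]

chunks = gamma(document, [epsilon_1])[0]

# Relations as (subject k, relation, object k) triples over chunk indices.
triples = [(1, "assessed by", 2), (2, "using", 3),
           (3, "converted by", 4), (3, "converted to", 5)]
edges = [(chunks[s - 1], rel, chunks[o - 1]) for s, rel, o in triples]

for edge in edges:
    print(edge)
```

Storing chunks as index spans rather than strings keeps the link back to word positions in $S_j$, which is what the definition of $\delta_{j,k}$ relies on.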


For instance, consider the sentence,

'The cell growth is assessed by a staining method using a tetrazolium salt which is converted by cellular dehydrogenases to a colored formazan product'

The important information chunks (or groups of words) and relations in this sentence would be,

$$\delta_{1,1} = (w_{1,2}, w_{1,3}) = \text{`cell growth'}$$
$$\delta_{1,2} = (w_{1,8}, w_{1,9}) = \text{`staining method'}$$
$$\delta_{1,3} = (w_{1,12}, w_{1,13}) = \text{`tetrazolium salt'}$$
$$\delta_{1,4} = (w_{1,18}, w_{1,19}) = \text{`cellular dehydrogenases'}$$
$$\delta_{1,5} = (w_{1,22}, w_{1,23}, w_{1,24}) = \text{`colored formazan product'}$$

$$\delta_{1,1} \xrightarrow{\text{assessed by}} \delta_{1,2}, \quad \delta_{1,2} \xrightarrow{\text{using}} \delta_{1,3}, \quad \delta_{1,3} \xrightarrow{\text{converted by}} \delta_{1,4}, \quad \delta_{1,3} \xrightarrow{\text{converted to}} \delta_{1,5}$$

Thus, the relevant information contained in the sentence $S_1$ is given by $\epsilon_1 = \{\delta_{1,1}, \delta_{1,2}, \delta_{1,3}, \delta_{1,4}, \delta_{1,5}\}$ and the associated relations shown as a knowledge graph in Fig. 1.

Fig. 1. Important information representation as knowledge graph with words/phrases as nodes and relations between them as directed edges.

The underlying methodology characterizing the various modules in SUSIE required to automatically generate such structured representations of important information from unstructured text is described in detail in the next section.

3. Methodology

The transformation function $\Gamma(.)$ in Eq. (2) is characterized by a combination of strategies that underlies our information extraction framework as shown in Fig. 2. The developed framework underlying SUSIE consists of four major components:

• first, ontologies, structured data sources, and custom rules for weak supervision, an approach in machine learning that sidesteps the requirement of having manually labeled datasets by using structured (but often noisier) data sources for assigning labels to unlabeled datasets in an automated manner. This requires a custom-built drug development ontology developed as part of our work.
• second, a BioBERT-based binary classifier that learns a functional mapping to identify important information from text and improves generalization of the approach by learning patterns characterizing important information during the training stage. This model could then be used to identify all important words given an input text.
• third, post-processors that capture semantics and contextual information about the identified entities by utilizing the grammatical structure of sentences. The contextual information complements the important words identified in the previous step.
• and fourth, a linguistics-based relations extractor that extracts relationships between identified entities as semantic triples which are then used to generate a knowledge graph representing important information. The knowledge graph combines all important information along with relations extracted from unstructured text.

Further details on these sub-frameworks characterizing our approach are provided in the subsequent sections.

3.1. Pharmaceutical CMC ontology development

The goal of pharmaceutical CMC (Chemistry Manufacturing and Control) development is to design a drug product meeting patient needs and develop a safe, robust, environmentally friendly drug substance and drug product manufacturing process that delivers the drug product with desired quality attributes. CMC development involves a sequence of decisions on selecting activities and their timing to manage and mitigate risk to meeting design requirements through the development cycle (Viswanath et al., 2021b). The activities that are selected to be performed in turn generate information on materials, processes, and properties that reduce risks over time. The high level interplay of decisions, activities, risks and their relation to information generated has been discussed in detail in literature (Viswanath et al., 2021b).

The CMC ontology that is the focus in this work is a more detailed version of the middle level ontology (Viswanath et al., 2021b) that seeks to capture the key information generated during development. We organized the information hierarchically in this ontology using class-subclass relationships combined with object and data properties (Hailemariam and Venkatasubramanian, 2010a,b; Remolona et al., 2017). The ontology was developed through various discussions with experts in the pharmaceutical industry across various functions. A snapshot of the developed ontology is shown in Fig. 3.

The final ontology comprised 1277 classes, 3268 axioms, 34 object properties, and 113 data properties and was developed using the Protégé software (Musen, 2015) in .owl format, a standard ontology development and sharing format. The central class in the ontology is named Drug Manufacturing, which represents information of various types and at different stages associated with drug manufacturing and development. The other classes corresponding to each information type are associated with the central class through relations (or properties). While there are several different classes in the ontology, details on the major classes are provided below.

Design specifications and risk grids. There are separate risk grids for different components of the manufacturing process and drug product. Fig. 4 shows risk grids for API synthesis route, API crystallization, Drug Product, and API solid state as sub-classes of the risk grid class. The risk grids capture risks associated with various stages and conditions of the drug development process including risk associated with API route, crystallization, solid state, or drug product. The level of risk (low, medium, high) is defined based on several categorical and numerical factors defined in the design specifications and design requirements. To capture additional factors contributing to drug development risk, the ontology could be expanded by creating further subclasses that capture the relevant risk factors.

The individual risk grids have a list of design requirements with each design requirement having a set of design attributes whose levels determine the level of risk. For example, for API solid state, a design requirement is to have desirable API properties. A few examples of design attributes that determine desirable API properties are crystallinity, moisture sensitivity, and stability under different conditions. In the ontology, the design requirement and design attribute are captured as separate sub-classes of the class design specification as this makes it more amenable to instantiation (see Fig. 5).

Materials. There are different types of material that make up the final packaged drug. A drug that a patient takes is referred to as drug product. The drug product comprises the main active ingredient (referred to as API or Active Pharmaceutical Ingredient) along with excipients.
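The class-subclass organization described in Section 3.1 can be pictured with a small sketch. The dictionary below is a hand-written toy fragment that uses only class names mentioned in the text; the actual ontology is an .owl file with 1277 classes built in Protégé. The `SUBCLASSES` mapping and the helpers `descendants` and `parent_of` are hypothetical names introduced for illustration and are not part of SUSIE.

```python
from typing import Dict, List, Optional

# Toy fragment of the class-subclass hierarchy. Class names follow the text;
# the plain-dictionary representation is an assumption made for illustration.
SUBCLASSES: Dict[str, List[str]] = {
    "Risk grid": [
        "API synthesis route",
        "API crystallization",
        "Drug Product",
        "API solid state",
    ],
    "Design specification": ["Design requirement", "Design attribute"],
}


def descendants(cls: str, tree: Dict[str, List[str]]) -> List[str]:
    """Return all direct and indirect subclasses of `cls`, depth-first."""
    found: List[str] = []
    for child in tree.get(cls, []):
        found.append(child)
        found.extend(descendants(child, tree))
    return found


def parent_of(cls: str, tree: Dict[str, List[str]]) -> Optional[str]:
    """Return the direct superclass of `cls`, or None for a top-level class."""
    for parent, children in tree.items():
        if cls in children:
            return parent
    return None


print(descendants("Risk grid", SUBCLASSES))
print(parent_of("Design attribute", SUBCLASSES))
```

Extending the ontology with further risk factors, as the text suggests, would correspond here to adding entries under the relevant class.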


Fig. 2. An overview of the end-to-end information extraction framework based on weak supervision, BioBERT model, contextualization, and relation extraction to generate
knowledge graphs. The BioBERT schematic is adapted from Lee et al. (2020).

Fig. 3. A representative snapshot of the custom-built drug development ontology.

While API is made with desired quality, there are always low levels of impurities (well below acceptable limits for patient consumption) present in the API. The manufacturing process of API involves multiple synthesis steps, each of which employs different reagents and solvents in the manufacture. These different material types are captured as sub-classes of the material class as shown in Fig. 6.

The ontology also lists the solvents and reagents that are commonly used in most small molecule synthesis. The specific solvents and specific reagents are sub-classes of the class solvents and reagents, respectively.

Material properties. Each material that is made through the synthesis, whether it is an intermediate or is part of the final drug product, has material properties associated with it that determine its quality and suitability for use. The properties are either physical or chemical properties, which are sub-classes of the material properties class. Fig. 7 shows a schematic with a few illustrative material properties.

3.2. Standard ontologies and additional terms

To improve the coverage of the weak supervision task, we augment the developed pharmaceutical drug development ontology (Section 3.1) with other sources. These include the unified medical language system (or UMLS), standard ontologies that are not included in UMLS but are relevant for drug manufacturing, and custom functions


Fig. 4. Risk grid class and its subclasses.

Fig. 5. Design specifications class and its subclasses.

based on regular expressions-based patterns to capture information not contained in either of the former two.

Unified medical language system. The unified medical language system (or UMLS) is a repository of biomedical vocabularies and integrates millions of names, relations, and concepts from several different sources (Bodenreider, 2004). The three main components comprising the UMLS are a metathesaurus containing a repository of interrelated biomedical concepts, a Semantic Network mapping metathesaurus concepts to various high level categories, and lexical resources for generating lexical variants of biomedical terms. Though the UMLS ecosystem is vast, we use the metathesaurus extensively and utilize the ontologies that are relevant for our purpose. Therefore, we included the following subdomains from UMLS in our framework: the UMLS MeSH ontology containing medical subject headings for biomedical and health related documents; the comparative toxicogenomics database (or CTD) chemical ontology containing information on environmental chemicals affecting human health and the CTD disease ontology with corresponding information on diseases; National Cancer Institute (NCI) terminologies; and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) providing the core general terminology for the electronic health record (EHR). We worked with the 2021AB version of the UMLS metathesaurus in this work.

Chemical names and reaction ontologies. The other standard ontologies that are relevant for our purpose include CHEBI (chemical entities of biological interest) (Degtyarenko et al., 2007), RXNO (reaction ontology), and MOP (molecular processes). These ontologies capture various important modules relevant for our work: CHEBI contains information on chemical entities that are of biological interest and thus have high drug-likeness, RXNO contains a list of named chemical reactions that often appear in text containing reaction mechanisms, and MOP contains molecular processes that underlie named reactions and are thus a more formal way of referring to steps in chemical reactions.

Additional custom terms. In addition, there are certain patterns and nomenclature that are not captured by the various ontologies or knowledge bases given the peculiar nature of their nomenclature. For instance, various pharmaceutical companies (or organizations) would have an internally different way of referencing chemicals based on a unique key or an internally understood nomenclature. To handle such cases, we use regular expressions that capture a certain pattern in textual data based on a set of predefined rules. For appropriate abbreviation detection, which again could be missing from the above discussed structured information sources, we used the Schwartz Hearst algorithm, a popular algorithm for identifying abbreviations in biomedical text (Schwartz and Hearst, 2002).

These sources of structured information in the form of UMLS, standard ontologies, and regular expressions combined with our pharmaceutical drug development and manufacturing ontology provide a rich source of structured information that could be used to perform weak supervision and assign labels to unlabeled datasets.

3.3. Text processing

Before performing information extraction, a given unstructured text has to be preprocessed. Such preprocessing usually involves several steps aimed at both text processing as well as extraction of additional meta information that could be useful for fine-tuning the NER task later. The following sections provide details on these steps.

Document cleaning. Pharmaceutical documents often contain not just text but other types of content such as tables, figures, references, footnotes, page numbers, and reaction mechanisms as equations. We therefore use several NLP techniques, both custom built as well as those offered by standard NLP processing Python libraries such as SpaCy (Honnibal and Montani, 2017) and NLTK (Bird et al., 2009), to perform document preprocessing. The standard document cleaning tasks involve extracting only the document body text and section/subsection headings based on the XML map and associated meta-information from the document; identifying section headings/subheadings based on regular expressions-based pattern identification on text meta-information; and identifying table/figure captions using expressions-based pattern identification. While we have developed algorithms assuming input documents as .docx files, they apply to other document types such as PDF since they could easily be converted to .docx format using off-the-shelf software, with the exception of PDFs that are scanned images of text documents where optical character recognition (OCR) techniques might be required.

Natural language processing. After standard text-based cleaning, the document is processed using NLP-based approaches that include, for each word in each sentence in the document: identifying the position of the word in the sentence, stored as an integer indicating the location of the start of the word; lemmatizing the word and storing its lemma, for instance, 'mix' is the lemma for 'mixing'; assigning the part of speech (POS) tag for the word, for instance verb, pronoun, symbol, etc.; and getting the dependency parsing structure for words in a sentence, for instance, subject, modifier, etc. Such grammatical information is used later in the framework for fine-tuning the approach. More details on these are provided in the later sections.

Document hierarchy and tables extraction. In addition to the standard NLP-based document preprocessing, we developed several custom functions. First, we developed a custom function including a document hierarchy extractor that extracts the section and subsection headings by inferring the level of depth for each section heading and associates the corresponding paragraph(s) text with each level. This document hierarchy is useful for processing and analyzing the identified entities, if required, during the downstream tasks such as ontology instantiation.

Second, we extracted all the tables from the document using document processors based on the XML maps. Tables needed to be processed separately from the regular text processing due to the inherent structure present in tables. Since tables are often rich sources of information conveying some very specific information, we process them separately during the NER stage by treating the table column names as entities. This assumption of table columns containing entities is based


Fig. 6. Materials class and its subclasses.

Fig. 7. Material properties class and its subclasses.

on the fact that pharmaceutical documents usually contain information on drug concentration profiles, material composition information, impurity information, and so on.

Chemical names tokenizer. Pharmaceutical documents containing information on drug development and manufacturing often refer to drug names using many different nomenclatures. The drugs and underlying chemicals or impurities are usually referred to using their common names, IUPAC nomenclature, chemical formula, or unique identifiers corresponding to standard or internal databases. These nomenclatures are often inherently complex in that they are characterized by commas, hyphens, numbers, and spaces. A standard natural language-based text tokenizer would therefore make errors while tokenizing words containing such chemical names. Thus, we use a rules-based chemical names tokenizer, ChemTok (Akkasi et al., 2016), to tokenize and identify chemical names in the text.

3.4. Weak supervision

The hierarchical structure of an ontology could be used as a skeleton for assigning structure to unstructured text in a systematic manner. Ontologies, owing to their underlying hierarchy, offer an efficient use-case in weak supervision. Weak supervision is an approach in machine learning that sidesteps the requirement of having manually labeled datasets by using structured (but often noisier) data sources such as ontologies, knowledge-bases, and so on for assigning labels to unlabeled datasets in an automated manner (Zhou, 2018), and thus is useful for labeling and generating (synthetic) datasets. Such synthetically labeled datasets could then be used for training powerful ML models that are dependent on labeled datasets. We combined the custom-built ontology with standard structured knowledge-bases and NLP algorithms for providing structure to text in a context-aware manner. The following sections provide details on the various modules underlying our ontology-based weak supervision approach.

In the first step, we match words in each sentence to various terms in the structured knowledge base comprising the UMLS knowledge base, ontologies, and custom functions. To do this efficiently, we use Snorkel, a Python-based package for programmatically performing weak supervision and assigning labels to data (Ratner et al., 2017). Performing weak supervision using Snorkel involves writing labeling functions that encode domain knowledge or supervision sources to assign labels to data programmatically. We adapt the labeling functions (LFs) used in Fries et al. (2021), including ontology-based LFs, dictionary-based LFs, task-specific LFs containing regular expressions, and synset-based LFs. Ontology-based LFs include UMLS-based ontologies such as MeSH, and Schwartz Hearst for abbreviation detection; dictionary-based LFs include separate LFs for CHEBI, CTD Chemical, CTD Disease, RXNO, MOP, and the custom-built pharmaceutical drug development ontology; task-specific LFs include regular expressions to capture internally used chemical nomenclature; synset LFs include subject-predicate-object relationships based on class-subclass relationships extracted from the custom-built pharmaceutical drug development ontology. Each labeling function assigns a label of 1 or 0 according to its defined rule, based on term overlap or text-based matching results. Thus, the input to a labeling function is a word and the output is its assigned label of 0 or 1, depending on whether the input word matches the patterns or database of symbols defined in the labeling function.

For instance, given the sentence ‘The cell growth is assessed by a staining method using a tetrazolium salt which is converted by cellular dehydrogenases to a colored formazan product’, each labeling function would assign a label to each word as shown in Table 1, with the various labeling functions listed across columns and the corresponding labels for each word listed across rows. The labeling functions are $LF_{\text{UMLS-MSH}}$ (based on the MeSH ontology in UMLS), $LF_{\text{CMC-ontology}}$ (based on the custom-built CMC ontology), $LF_{\text{CHEBI}}$ (based on the CHEBI ontology), $LF_{\text{CTD-chemical}}$ (based on the CTD chemical names), and $LF_{\text{RXNO}}$ (based on the RXNO ontology of named chemical reactions). The final label shown in the column ‘assigned label’ is assigned as

\[ \text{label} = \max\left(LF_{\text{UMLS-MSH}},\ LF_{\text{CMC-ontology}},\ LF_{\text{CHEBI}},\ LF_{\text{CTD-chemical}},\ LF_{\text{RXNO}}\right) \]

implying that a word is assigned a label of ‘1’ if any of the labeling functions assigns it a label of ‘1’. The labels in the ‘assigned label’ column are used for training the BioBERT model-based classifier to learn generalized patterns for identifying words. The last column ‘after contextualization’ indicates word labels assigned after performing contextualization by identifying noun phrases (after identifying labels


Table 1
Example word labeling with various labeling functions.
Words $LF_{\text{UMLS-MSH}}$ $LF_{\text{CMC-ontology}}$ $LF_{\text{CHEBI}}$ $LF_{\text{CTD-chemical}}$ $LF_{\text{RXNO}}$ Assigned label After contextualization
The1
cell2 1 1 1
growth3 1 1 1
is4
assessed5
by6
a7
staining8 1 1 1
method9 1 1 1 1 1
using10
a11
tetrazolium12 1 1 1 1
salt13 1 1 1 1 1 1
which14
is15
converted16
by17
cellular18 1
dehydrogenases19 1 1 1
to20
a21
colored22 1
formazan23 1 1 1 1
product24 1 1 1 1

The underlined labels in the last column indicate additional words labeled as important after the contextualization step.
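
The max-aggregation rule behind the ‘assigned label’ column of Table 1 can be sketched in a few lines of Python. This is a minimal illustration and not the Snorkel-based implementation used in the paper: the toy vocabularies, the regular-expression pattern, and the helper names (`make_dictionary_lf`, `make_regex_lf`, `assign_labels`) are all hypothetical, and the vocabularies were chosen only so that the aggregated labels agree with the ‘assigned label’ column, not to reproduce the individual labeling-function columns.

```python
import re
from typing import Callable, Dict, List, Set

LabelingFunction = Callable[[str], int]


def make_dictionary_lf(vocabulary: Set[str]) -> LabelingFunction:
    """Return an LF that outputs 1 when the word is in the vocabulary, else 0."""
    def lf(word: str) -> int:
        return 1 if word.lower() in vocabulary else 0
    return lf


def make_regex_lf(pattern: str) -> LabelingFunction:
    """Return a task-specific LF that outputs 1 when the whole word matches."""
    compiled = re.compile(pattern)

    def lf(word: str) -> int:
        return 1 if compiled.fullmatch(word) else 0
    return lf


# Hypothetical stand-ins for the real ontologies and dictionaries.
LABELING_FUNCTIONS: Dict[str, LabelingFunction] = {
    "LF_UMLS-MSH": make_dictionary_lf(
        {"cell", "growth", "staining", "method", "salt", "dehydrogenases"}),
    "LF_CHEBI": make_dictionary_lf({"tetrazolium", "salt", "formazan"}),
    "LF_CMC-ontology": make_dictionary_lf({"method", "salt", "product"}),
    # Assumed pattern for an internal compound identifier.
    "LF_internal-regex": make_regex_lf(r"Compound_\d+"),
}


def assign_labels(words: List[str],
                  lfs: Dict[str, LabelingFunction]) -> List[int]:
    """Label a word 1 if any labeling function fires on it (max over LFs)."""
    return [max(lf(word) for lf in lfs.values()) for word in words]


sentence = ("The cell growth is assessed by a staining method using a "
            "tetrazolium salt which is converted by cellular dehydrogenases "
            "to a colored formazan product").split()
labels = assign_labels(sentence, LABELING_FUNCTIONS)
for word, label in zip(sentence, labels):
    print(word, label)
```

Taking the maximum treats every labeling function as equally trustworthy, which favours recall over precision; the words ‘cellular’ and ‘colored’ stay at 0 here and are only picked up later by contextualization.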

using the BioBERT model on test documents) as described in the next section. The reader is referred to Ratner et al. (2017) and Fries et al. (2021) for further details on the various types of labeling functions and how they work.

3.5. Learning generalized patterns using fine-tuned BioBERT

After the weak supervision-based information extraction, we have a labeled dataset with terms that are of interest (or are important) labeled as 1 and the rest labeled as 0. Since these labels are based on a weak supervision strategy, they are inherently noisy and the labeling strategy is not completely generalizable. Thus, to improve the generalization of the transformation function $\Gamma(\cdot)$, to label words based on their context in a sentence and not just in isolation, and to improve the framework’s ability to handle and classify words unseen during the training stage, we use a pre-trained BioBERT model and fine-tune it on our dataset on the classification task. Fine-tuning involves further training a pre-trained model on customized and often much smaller datasets to achieve optimal performance. The dataset used for such fine-tuning and further details on the fine-tuning strategy are presented in Section 4.

3.6. Contextualization

It is often necessary to capture the neighboring information around entities since it conveys important contextual information. However, the ontology-based weak supervision labeling approach only identifies individual entities but not the neighboring contextual information. To this end, we use noun phrases in documents to capture such information. Noun phrases are words or groups of words that function like nouns and contain a noun accompanied by modifiers. To identify noun phrases, the dependency structure of a sentence is first identified based on the grammatical structure of the sentence and the modifiers associated with nouns. We use spaCy’s pre-trained language model, ‘en_core_web_sm’ (Honnibal and Montani, 2017), to perform dependency parsing and identify the noun phrases. For instance, for the example sentence shown in Table 1, a partial dependency parsing diagram between the words is shown in Fig. 8. Here, the identified entities ‘formazan’ and ‘product’ are both part of the noun phrase ‘colored formazan product’. Hence, even though the word ‘colored’ was not identified as an entity, all words in the phrase ‘colored formazan product’ are tagged as important information and assigned a label of 1. Thus, the entities are contextualized using such noun phrase-based matching, and the labels are updated to reflect the contextual information as shown in the ‘after contextualization’ column in Table 1.

3.7. Relation extraction and auto-generating knowledge graphs

After the important words and their associated context have been identified, an important next step for information extraction is identifying relations between the words. This allows the meaning of the text to be captured as a knowledge graph in which the entities (or important words) are represented as nodes and the relations between them are represented as directed edges. To extract such relations, we utilize a linguistic structure-based approach for extracting relations as semantic triples (subject, predicate, object), where the predicate is the relation of interest connecting the subject and the object. For instance, in the sentence,

‘Tetrazolium salt is converted by cellular dehydrogenases’

the extracted semantic triple would be

subject: Tetrazolium salt
predicate: is converted by
object: dehydrogenases, cellular dehydrogenases

Though the idea is straightforward, extracting such triples is a challenging task that involves handling sentences with complex grammatical structures, splitting sentences with multiple clauses into simpler ones, and performing coreference resolution, which involves finding all mentions in the text that refer to the same entity. A framework that mitigates these challenges was proposed in Angeli et al. (2015) and is now part of the Stanford OpenIE framework. We use this framework to extract relations from text as semantic triples. We then map the important words identified to the extracted relations to auto-generate knowledge graphs like the one shown in Fig. 1.

4. Dataset and model training

4.1. Dataset

Since our framework does not rely on labeled datasets for model training (due to the presence of the weak supervision stage), we could


Fig. 8. A partial dependency parsing diagram for the example sentence in Table 1 along with identified noun phrases highlighted in blue. The contextual information captured
in each noun phrase is also highlighted.
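
The noun-phrase contextualization illustrated in Fig. 8 amounts to propagating a label of 1 to every word of a noun phrase that contains at least one labeled entity. The sketch below assumes the noun-phrase spans have already been obtained from a dependency parser (for example from spaCy's noun chunks); the function name `contextualize` and the hard-coded spans are hypothetical and only illustrate the label update, not the parsing step.

```python
from typing import List, Tuple


def contextualize(labels: List[int],
                  noun_phrase_spans: List[Tuple[int, int]]) -> List[int]:
    """Set every word in a noun phrase to 1 if the phrase contains an entity.

    labels            -- word-level labels (0 or 1) before contextualization
    noun_phrase_spans -- (start, end) word indices, end exclusive
    """
    updated = list(labels)
    for start, end in noun_phrase_spans:
        if any(labels[start:end]):          # phrase holds at least one entity
            for i in range(start, end):
                updated[i] = 1
    return updated


# Tail of the example sentence from Table 1.
words = ["by", "cellular", "dehydrogenases", "to", "a",
         "colored", "formazan", "product"]
before = [0, 0, 1, 0, 0, 0, 1, 1]
# Assumed spans for 'cellular dehydrogenases' and 'colored formazan product'.
spans = [(1, 3), (5, 8)]

after = contextualize(before, spans)
for word, old, new in zip(words, before, after):
    print(word, old, new)
```

In this toy run ‘cellular’ and ‘colored’ switch from 0 to 1, mirroring the two words that Table 1 marks as added by the contextualization step.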

use any data source that contains concepts relevant for drug discovery and development. Thus, we primarily worked with the publicly available (International Council for Harmonisation, 2023) international conference on harmonization (ICH) guideline documents on quality, safety, and efficacy, which contain a wide spectrum of relevant concepts. In total, our dataset comprised 64 documents, and we performed a random 80-10-10 train-valid-test split with 51 documents in the training set, 6 documents in the validation set, and 7 documents in the test set. In addition, we use an internal technical report obtained from Eli Lilly as an additional test document for evaluating model performance. The documents used for training, validation, and testing are listed in Table 2. In total, there were 21571 examples (sentences) in the training set, 2338 in the validation set, and 2392 in the test set.

The training and validation datasets were first processed using the weak supervision framework to assign a label to each word in the sentence (1 if the word was identified as important, 0 otherwise), thus transforming the unlabeled ICH documents into a labeled training dataset. This labeled dataset was then passed on to the supervised learning stage involving the BioBERT fine-tuning. The BioBERT model input during the training stage consists of sentences with tokenized words along with their target labels (0 or 1). For tokenizing words, we use the WordPiece tokenizer (Sennrich et al., 2015), which splits a given word into possibly multiple subwords (or pieces) based on their frequency of occurrence in the training corpus and uses the special token ‘##’ to indicate word pieces that are not the first part of a given word. For instance, consider the same sentence as in Table 1. At the training stage, the input to the model would be:

([CLS], 0), (the, 0), (cell, 1), (growth, 1), (is, 0), (assessed, 0), (by, 0), (a, 0), (stain, 1), (##ing, 1), (method, 1), (using, 0), (a, 0), (te, 1), (##tra, 1), (##zo, 1), (##lium, 1), (salt, 1), (which, 0), (is, 0), (converted, 0), (by, 0), (cellular, 0), (de, 1), (##hy, 1), (##dr, 1), (##ogen, 1), (##ases, 1), (to, 0), (a, 0), (colored, 0), (form, 1), (##az, 1), (##an, 1), (product, 1), ([SEP], 0)

where ‘[CLS]’ and ‘[SEP]’ are special tokens that indicate the start and end of a sentence, respectively. During the model evaluation or test stage, the input to the model is similar to that in the training stage, with the words in the sentence tokenized using the WordPiece tokenizer but without their target labels. The target labels are predicted using the fine-tuned BioBERT model, which classifies each word as not important (label 0) or important (label 1).

4.2. BioBERT model architecture and training

The BioBERT model is a pre-trained language representation model trained on biomedical corpora containing PubMed abstracts (4.5B words), PubMed Central full-text articles (13.5B words), and a standard English language corpus (3B words) (Lee et al., 2020). BioBERT is based on the BERT model (Devlin et al., 2018), a state-of-the-art sequence model for learning word representations of the English language that is trained on general-domain texts. A transformer model (Vaswani et al., 2017) underlies the BERT architecture, which has been shown to be successful in various engineering applications including reaction prediction (Mann and Venkatasubramanian, 2021a; Schwaller et al., 2019), retrosynthesis (Tetko et al., 2020; Mann and Venkatasubramanian, 2021b; Venkatasubramanian and Mann, 2022), electrocatalyst discovery (Muthukkumaran et al., 2023), property prediction (Mann et al., 2022), chemical product design (Mann et al., 2023a; Trinh et al., 2021), and so on. BioBERT overcomes the limitations of using BERT on biomedical text by using a domain-focused training set and learning word distributions more focused on the biomedical domain. Hence, BioBERT is based on the BERT model architecture and is further pre-trained on a large corpus of biomedical text. A schematic of the BioBERT model architecture is shown in Fig. 9.

Large models such as BioBERT can also be fine-tuned on a downstream task without retraining the entire model from scratch. This involves initializing the BioBERT model architecture using the pre-trained model weights, and then training the model from this stage on a custom dataset or task for a few epochs. In our case, the custom dataset is the set of ICH documents in the training set and the task is to use the labeled dataset obtained after the weak supervision stage to train a BioBERT-based classifier. Since this is a binary classification task with two labels, we minimize a cross-entropy loss function defined by

\[ L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log\big(p(y_i)\big) + (1 - y_i)\log\big(1 - p(y_i)\big) \right] \qquad (3) \]

where $p(y_i)$ is the predicted probability of data point $i$ being in class 1 (and hence $1 - p(y_i)$ is the probability of it being in class 0), $y_i$ is the true class label, and $N$ is the total number of data points. We fine-tune the BioBERT model on this task for 3 epochs for two reasons: first, fine-tuning for just a few epochs is usually sufficient to achieve optimal performance, and second, it avoids overfitting on the small custom training dataset with noisy labels from weak supervision. Moreover, as seen in Fig. 10, the validation loss plateaus around 3 epochs.
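
As a worked illustration of the binary cross-entropy loss in Eq. (3), the following plain-Python sketch evaluates it on a handful of token-level predictions. It is not the training code used for BioBERT; the function name `binary_cross_entropy` and the clipping constant `eps` (added only to avoid taking the logarithm of zero) are assumptions made for this example.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[int],
                         p_pred: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy, following Eq. (3).

    y_true -- true labels y_i (0 or 1)
    p_pred -- predicted probabilities p(y_i) of class 1
    eps    -- assumed clipping constant to keep log() finite
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Toy predictions for the tokens (cell, 1), (is, 0), (salt, 1), (by, 0).
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]
uninformed = [0.5, 0.5, 0.5, 0.5]

print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, uninformed))
```

The confident, correct predictions give a loss of $-\log(0.9)$, while predicting 0.5 everywhere gives $\log 2$; fine-tuning drives the loss from the latter regime towards the former.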


Table 2
List of documents used for training, validation, and evaluation of the model.
Source: All data (except the technical report from Eli Lilly) is publicly available at International Council for Harmonisation (2023).
S.No Document Description/Title Type
1 E2A guideline Clinical safety data management: definitions and standards for expedited reporting E2A train
2 E2D guideline Post-approval safety data management: definitions and standards for expedited reporting E2D train
3 E2E guideline Pharmacovigilance planning E2E train
4 E2F guideline Development safety update report E2F train
5 E3 guideline Structure and content of clinical study reports E3 train
6 E4 guideline Dose–response information to support drug registration E4 train
7 E5 R1 guideline Ethnic factors in the acceptability of foreign clinical data E5(R1) train
8 E7 guideline Studies in support of special populations: geriatrics E7 train
9 E8-R1 guideline step 4 General considerations for clinical studies E8(R1) train
10 E9 guideline Statistical principles for clinical trials E9 train
11 E10 guideline Choice of control group and related issues in clinical trials E10 train
12 E11 R1 guideline Addendum to ICH E11: clinical investigation of medicinal products in the pediatric population E11(R1) train
13 E14 guideline The clinical evaluation of QT/QTc interval prolongation and proarrhythmic potential for non-antiarrhythmic drugs E14 train
14 E15 guideline Definitions for genomic biomarkers, pharmacogenomics, pharmacogenetics, genomic data and sample coding categories E15 train
15 E16 guideline Biomarkers related to drug or biotechnology product development: context, structure and format of qualification submissions E16 train
16 E17 EWG step 4 General principles for planning and design of multi-regional clinical trials E17 train
17 E18 guideline Guideline on genomic sampling and management of genomic data E18 train
18 E19 EWG guideline Optimization of safety data collection E19 train
19 E11A step 2 guideline Pediatric extrapolation E11A train
20 Q3C-R8 step 4 guideline Impurities: guideline for residual solvents Q3C(R8) train
21 Q14 step 2 guideline Analytical procedure development Q14 train
22 Q1B guideline Stability testing: photostability testing of new drug substances and products Q1B train
23 Q1C guideline Stability testing for new dosage forms Q1C train
24 Q1D guideline Bracketing and matrixing designs for stability testing of new drug substances and products Q1D train
25 Q1E guideline Evaluation for stability data Q1E train
26 Q1F stability guideline Stability testing of active pharmaceutical ingredients and finished pharmaceutical products train
27 Q3A(R2) guideline Impurities in new drug substances Q3A(R2) train
28 Q3B(R2) guideline Impurities in new drug products Q3B(R2) train
29 Q3D(R2) step 4 guideline Guideline for elemental impurities Q3D(R2) train
30 Q4B guideline Evaluation and recommendation of pharmacopoeial texts for use in the ICH regions Q4B train
31 Q5B guideline Quality of biotechnological products: Analysis of the expression construct in cells used for production of r-DNA derived protein products Q5B train
32 Q5C guideline Quality of biotechnological products: Stability testing of biotechnological/biological products Q5C train
33 Q5E guideline Comparability of biotechnological/biological products subject to changes in their manufacturing process Q5E train
34 Q6B guideline Specifications: test procedures and acceptance criteria for biotechnological/biological products Q6B train
35 Q7 guideline Good manufacturing practice guide for active pharmaceutical ingredients Q7 train
36 Q8(R2) guideline Pharmaceutical development Q8(R2) train
37 Q9 guideline Quality risk management Q9 train
38 Q10 guideline Pharmaceutical quality system Q10 train
39 Q11 guideline Development and manufacture of drug substances (chemical entities and biotechnological/biological entities) Q11 train
40 Q12 guideline step 4 Technical and regulatory considerations for pharmaceutical product lifecycle management Q12 train
41 S1A guideline Guideline on the need for carcinogenicity studies of pharmaceuticals S1A train
42 S1B-R1 guideline Testing for carcinogenicity of pharmaceuticals S1B(R1) train
43 S2(R1) guideline Guidance on genotoxicity testing and data interpretation for pharmaceuticals intended for human use S2(R1) train
44 S3A guideline Note for guidance on toxicokinetics: the assessment of systemic exposure in toxicity studies S3A train
45 S3B guideline Pharmacokinetics: guidance for repeated dose tissue distribution studies S3B train
46 S4 guideline Duration of chronic toxicity testing in animals (rodent and non rodent toxicity testing) S4 train
47 S5-R3 step 4 guideline Detection of reproductive and developmental toxicity for human pharmaceuticals S5(R3) train
48 S6 R1 guideline Preclinical safety evaluation of biotechnology-derived pharmaceuticals S6(R1) train
49 S7B guideline The non-clinical evaluation of the potential for delayed ventricular repolarization (QT interval prolongation) by human pharmaceuticals S7B train
50 S10 guideline Photosafety evaluation of pharmaceuticals S10 train
51 S11 step 4 guideline Nonclinical safety testing in support of development of pediatric pharmaceuticals S11 train
52 E2C R2 guideline Periodic benefit-risk evaluation report (PBRER) E2C(R2) valid
53 Q2-R2 step 2 guideline Validation of analytical procedures Q2(R2) valid
54 S12 step 2 guideline Nonclinical biodistribution considerations for gene therapy products S12 valid
55 Q6A guideline Specifications: test procedures and acceptance criteria for new drug substances and new drug products: chemical substances Q6A valid
56 S7A guideline Safety pharmacology studies for human pharmaceuticals S7A valid
57 S9 guideline Nonclinical evaluation for anticancer pharmaceuticals S9 valid
58 E1 guideline The extent of population exposure to assess clinical safety for drugs intended for long-term treatment of non-life-threatening conditions E1 test
59 Q1A(R2) guideline Stability testing of new drug substances and products Q1A(R2) test
60 Q2(R1) guideline Validation of analytical procedures: text and methodology Q2(R1) test
61 Q5A (R1) guideline Viral safety evaluation of biotechnology products derived from cell lines of human or animal origin Q5A(R1) test
62 Q5D guideline Derivation and characterization of cell substrates used for production of biotechnological/biological products Q5D test
63 S1C(R2) guideline Dose selection for carcinogenicity studies of pharmaceuticals S1C(R2) test
64 S8 guideline Immunotoxicity studies for human pharmaceuticals S8 test
65 Eli Lilly document Eli Lilly internal technical report test

Fig. 9. A schematic of the BioBERT architecture and pre-training strategy based on BERT.
Source: Schematic adapted from Lee et al. (2020).

Since we are only fine-tuning the model on a smaller dataset, we do not perform extensive hyperparameter optimization except for the learning rate, weight decay, and the number of epochs, which are tuned using the validation set. The model is characterized by 107M trainable parameters. Note that since we are only fine-tuning the model for a few epochs, the computational cost associated with the model is small compared with pre-training it from scratch.

Remark 1. Note that the model takes the sequential position of the words in a sentence into account through its positional embeddings, and its self-attention layers condition each word on the rest of the sentence. Thus, contextual information about the words is used inherently while predicting the target labels.

5. Results

5.1. Model performance evaluation and statistics

We present the model training results by showing the training loss and validation loss against training steps during the model fine-tuning stage. The training and validation losses as training progresses are shown in Fig. 10, and we observe that the validation loss plateaus around 1750 training steps, which corresponds to 3 epochs. The predictions from the model are in the form of logits (raw, unnormalized scores), which are mapped to class predictions by selecting the class with the higher logit value. To further understand the performance of the model, we compute the standard metrics for classifiers: precision, recall, F1-score, and accuracy. Precision indicates the fraction of true positives across all the positive class predictions made by the classifier (TP/(TP+FP)); recall indicates the fraction of true positives across all the actual positives in the dataset (TP/(TP+FN)); F1-score is the harmonic mean of precision and recall and captures the trade-off between them as a single measure; and accuracy is the fraction of correct predictions across all the predictions ((TP+TN)/(TP+TN+FP+FN)). The precision, recall, F1-score, and accuracy on the training, validation, and test datasets are shown in Table 3.

Table 3
Evaluation metrics on the train, validation, and test set. For the test set, numbers in parentheses indicate performance metrics on the ICH test set documents (i.e. excluding the Eli Lilly internal technical report).
Metric Train Validation Test
Precision 0.989 0.954 0.943 (0.955)
Recall 0.991 0.973 0.824 (0.975)
F1-score 0.990 0.964 0.879 (0.965)
Accuracy 0.996 0.988 0.954 (0.987)

It should be noted that the ground truth labels are auto-generated based on the weak supervision strategy which are inherently noisy


and thus the labels may not be completely accurate. Although these metrics give a good indication of the model’s abilities, they may not be completely representative of the true model performance given the noisy nature of the assigned labels. We have therefore adapted some of the metrics for evaluating NER performance and created other custom metrics to provide further insights into the performance of our approach. First, we define the entity detection rate, indicating the fraction of words labeled as entities by our framework, as

\[ \text{detection rate} = \frac{\#\ \text{of unique tagged words}}{\#\ \text{of unique words in the entire document}} \qquad (4) \]

Second, to assess the accuracy of these labels, we use an indirect route using tf-idf (term frequency-inverse document frequency)-based word importance scores (Ramos et al., 2003) to identify the most important words across documents, given as

\[ \text{tf}(t,d) = f_{t,d}, \qquad \text{idf}(t,D) = \log\frac{N}{\left|\{d \in D : t \in d\}\right|}, \qquad \text{tf-idf}(t,d,D) = \text{tf}(t,d) \times \text{idf}(t,D) \qquad (5) \]

where $f_{t,d}$ is the raw term frequency of term $t$ in document $d$, $\text{idf}(t,D)$ is the inverse document frequency of term $t$ across a corpus of documents $D$, and $\left|\{d \in D : t \in d\}\right|$ is the number of documents, out of $N = |D|$ total documents, in which the term $t$ appears. For each document, we identify the 500 words with the highest tf-idf scores to get a list of important words that need to be captured. We then compute the fraction of these 500 words identified by the NER model as important, to indirectly assess the accuracy of the model by estimating its ability to capture these important words. The assumption here is that words with high tf-idf scores are likely to be important and hence should be captured, but there is no differentiation between general words and drug discovery and development related words. The actual performance of the model is therefore possibly better than these scores, which could be treated as a lower bound on the model performance. The entity detection rate and tf-idf score-based coverage fraction of the framework estimated using the test set are presented in Table 4.

Fig. 10. Training and evaluation loss with training steps.

Table 4
Test set statistics of the NER approach using SUSIE.
Document Detection rate (%) tf-idf coverage (%)
E1 guideline 34.5 26.4
Q1A(R2) guideline 39.6 71.2
Q2(R1) guideline 38.2 59.4
Q5D guideline 48.4 89.6
S1C(R2) guideline 40.0 63.4
Q5A(R1) guideline 42.7 84.8
S8 guideline 48.9 85.2
Eli Lilly internal report 47.2 86.4
Average 42.4 70.8

5.2. Test set information chunks identified and auto-generated knowledge graphs

Recall that apart from the ICH documents, we also had an internal Eli Lilly report as part of the test set. This document possibly contained several new concepts and terms that were likely not part of the ICH documents and hence were not seen by the model during the training stage. This document therefore serves as an ideal test bed to evaluate the model performance on truly out-of-sample documents. A few examples of important information chunks identified from the document using SUSIE are highlighted below.

Remark 2. Note that certain chemical/drug names that are confidential in nature have been masked out using terms such as Compound_1/2/3, Chemical_1/2/3, IUPAC_1/2/3, Drug_1/2/3, and so on.

Input 1: Compound_1 is not a bacterial mutagen in the Ames assay. The same assessment criteria would be used to evaluate any future route modifications to determine if additional controls for genotoxic species will be needed in proposed starting material Compound_1.

Input 2: Compound_1 is introduced in the first step of the proposed 3-step commercial manufacturing process for Drug_1 and is incorporated into the drug substance as a significant structural fragment. It is a chemically-stable, highly-purified crystalline material that has more desirable properties for a starting material than compounds earlier in the synthetic route. Compound_1 can be readily characterized by commonly known analytical techniques, which separate and quantify known impurities, as well as potential isomeric impurities and degradation products.

Input 3: The structures for both impurities listed in Table 3.2-2 are shown in Figure 3.2-1. The structure of Compound_2 has been confirmed by independent synthesis of an authentic sample and characterization by sensitive spectroscopic methods. Benzophenone (Compound_3) is a commercially available material. Both of these impurities, and their downstream process derivatives, are either fully purged or are present in the final drug substance at levels below the ICH reporting threshold (0.05%). Structures of the downstream process derivatives are provided in Background Information.

Input 4: The fate of Compound_3 in Step 1 of the proposed commercial manufacturing process was determined by reacting Compound_2 with starting material compound_1, as shown in Scheme 6.5.2.1-2. One main component was observed in both Step 1a and Step 1b, which were determined to be Compound_4 and Compound_5, respectively, by LC-MS characterization.
For evaluation of the fate and purge of Compound_2, Step 1 of the commercial manufacturing process was performed with 1% of Compound_2 added to the proposed starting material Compound_8. The isolated product Compound_6, contained 0.09% of Compound_5, with neither Compound_2 nor Compound_4 being observed at a reporting threshold of 0.05%. Thus, a rejection efficiency of 91% was demonstrated for Compound_2 in Step 1.
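
The two custom metrics reported in Table 4, the detection rate of Eq. (4) and the tf-idf coverage built on Eq. (5), can be sketched in plain Python as follows. This is an illustrative reconstruction under stated assumptions rather than the authors' evaluation script: the function names are hypothetical, documents are represented as simple word lists, the scored document is assumed to be a member of the corpus (so every term has a document frequency of at least one), and ties in tf-idf score are broken alphabetically.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def detection_rate(doc_words: Sequence[str], tagged_words: Iterable[str]) -> float:
    """Eq. (4): unique tagged words divided by unique words in the document."""
    unique_words = set(doc_words)
    return len(unique_words & set(tagged_words)) / len(unique_words)


def tfidf_scores(doc_words: Sequence[str],
                 corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Eq. (5): raw term frequency times log(N / document frequency)."""
    n_docs = len(corpus)
    doc_sets = [set(d) for d in corpus]
    scores = {}
    for term, freq in Counter(doc_words).items():
        doc_freq = sum(1 for d in doc_sets if term in d)
        scores[term] = freq * math.log(n_docs / doc_freq)
    return scores


def tfidf_coverage(doc_words: Sequence[str],
                   corpus: Sequence[Sequence[str]],
                   tagged_words: Iterable[str],
                   top_k: int = 500) -> float:
    """Fraction of the top_k highest tf-idf words that were tagged."""
    scores = tfidf_scores(doc_words, corpus)
    top_terms: List[str] = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
    tagged = set(tagged_words)
    return sum(1 for t in top_terms if t in tagged) / len(top_terms)


# Toy corpus of three tiny "documents" (hypothetical data).
corpus = [
    ["the", "drug", "substance", "stability"],
    ["the", "impurity", "drug"],
    ["the", "assay"],
]
doc = corpus[0]
tagged = {"drug", "substance"}

print(detection_rate(doc, tagged))
print(tfidf_coverage(doc, corpus, tagged, top_k=2))
```

Because a word such as ‘the’ appears in every document, its idf and hence its tf-idf score are zero, so it never enters the top-ranked list; generic but rarer words can still enter it, which is why the coverage is best read as a lower bound.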


Fig. 11. Automatically generated knowledge graph for example Input 2 above.

Fig. 12. Auto-generated knowledge graph for example Input 4 above.
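
Knowledge graphs such as those in Figs. 11 and 12 store entities as nodes and relations as directed, labeled edges. The sketch below shows one simple way to assemble such a graph from (subject, predicate, object) triples. It assumes the triples have already been produced by an open information extraction tool and does not call Stanford OpenIE itself; the function name `build_knowledge_graph`, the filtering rule (keep a triple when its subject or object contains an important word), and the second example triple are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_knowledge_graph(triples: Iterable[Triple],
                          important_words: Iterable[str]
                          ) -> Dict[str, List[Tuple[str, str]]]:
    """Return an adjacency list: subject -> [(predicate, object), ...].

    A triple is kept only if its subject or object contains at least one
    word that was labeled as important (assumed filtering rule).
    """
    important = {w.lower() for w in important_words}
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        words = set(subject.lower().split()) | set(obj.lower().split())
        if words & important:
            graph[subject].append((predicate, obj))
    return dict(graph)


triples = [
    ("Tetrazolium salt", "is converted by", "cellular dehydrogenases"),
    ("The reader", "is referred to", "next section"),  # generic, filtered out
]
important = ["tetrazolium", "salt", "cellular", "dehydrogenases"]

graph = build_knowledge_graph(triples, important)
for node, edges in graph.items():
    for predicate, target in edges:
        print(f"{node} --[{predicate}]--> {target}")
```

An adjacency list keyed by subject keeps the edge direction explicit and is straightforward to hand to a graph library or visualization tool later.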

Input 5: Drug_1 is being developed as 80 mg and 150 mg immediate release tablets. The following discussion summarizes the development of the dissolution conditions including justification of the testing conditions chosen; apparatus and rotation speed, media conditions selected; and demonstration of the discriminating capability of the conditions. The complete method and validation summary are provided in Appendices 3.4.A1 and 3.4.A2, respectively, at the end of this section.

For inputs 2 and 4 above, the knowledge graphs representing the extracted information in a structured manner are shown in Figs. 11 and 12. Additional knowledge graphs for examples from the technical report are shown in the Appendix.

Based on the above, we observe that our framework (SUSIE) automatically generates structured knowledge graphs from unstructured pharmaceutical text inputs. The knowledge graphs capture the important information (from a drug development or CMC standpoint) as defined by the custom-built and standard ontologies used in our framework. SUSIE also captures additional contextual information characterizing the information chunks. For instance, in input 1, ‘Compound_1’ along with its associated information that it is a ‘proposed starting material’ has been captured. In input 2, drug names that were likely not present in the training database were automatically captured based on their neighboring context, thus pointing towards the model’s ability to handle new and relevant concepts not seen during the training stage. However, we also observe certain misses that are difficult for the model to capture at this stage. For instance, consider input 5, where the model correctly identified the 150 mg dosage information for the drug tablets but missed the 80 mg dosage information associated with the same drug. Moreover, subject matter expert (SME) input is still needed to filter out the important relationships from generic ones. The knowledge graph is nevertheless useful since it surfaces output in a way that makes it easier for an SME to spot the important relationships. Improving the capabilities of the model to capture such complex but relevant contextual information could be addressed using relation extraction and is part of our future work on this framework. Further examples from the report are provided in the Appendix.

6. Conclusions

We have developed the first part of an end-to-end pharmaceutically relevant information extraction framework called SUSIE by combining domain ontologies for representing important concepts, a weak supervision framework for transforming unlabeled datasets to labeled datasets, a fine-tuned BioBERT model that represents a generalized model able to handle new documents, contextualization modules to capture relevant context associated with important entities, and a relation extraction approach for auto-generating knowledge graphs. Our framework was trained on publicly available (unlabeled) ICH documents and achieves a test accuracy and F1-score of 96% and 88%, respectively, on out-of-sample documents including an internal

technical report from Eli Lilly. A major contribution of this work is to develop a custom pharmaceutical drug development ontology and build an information extraction framework upon it that eliminates the need for manual data curation for model training. The underlying framework characterizing SUSIE is generalizable and adaptable to other domains. As an extension of this work, we plan to demonstrate downstream applications such as performing semantic search, topic modeling, and historical data mining on pharmaceutical documents.

CRediT authorship contribution statement

Vipul Mann: Performed the computations and analysis, Wrote the paper. Shekhar Viswanath: Provided pharmaceutical industry-related ideas, suggestions, and guidelines, Wrote the paper. Shankar Vaidyaraman: Provided pharmaceutical industry-related ideas, suggestions, and guidelines, Wrote the paper. Jeya Balakrishnan: Provided pharmaceutical industry-related ideas, suggestions, and guidelines, Wrote the paper. Venkat Venkatasubramanian: Formulated the problem, Designed the approach, Wrote the paper.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported in part by Eli Lilly and Company, United States of America through the Lilly Research Award Program (LRAP), and the Center for the Management of Systemic Risk (CMSR) at Columbia University.

Appendix. Additional examples from the Eli Lilly internal report

Input 1: Compound_1, IUPAC_1, and Compound_8, IUPAC_2, are proposed starting materials for the commercial manufacture of Drug_1, as shown in Scheme 3.0-1. The drug substance, Drug_1, is manufactured as a monohydrate and is also referred to using the reference numbers Compound_9 and Compound_10 monohydrate. The non-proprietary name Drug_1 can refer to the anhydrous or monohydrate form interchangeably. Reference to the water of hydration is retained in the chemical information (chemical names, formulas, weight). In the CMC sections of this document, Drug_1 refers to the monohydrate form. When referring specifically to the anhydrous form, the text will state either Compound_10 or Drug_1 (anhydrous).

Input 2: The two proposed starting materials provide all the non-hydrogen atoms that form Drug_1, except the amide oxygen atom and the water of hydration, and therefore, are both considered significant structural fragments of the drug substance. These starting materials are readily synthesized, have been well-characterized, and demonstrate good stability.

Input 3: Supporting data are provided herein to indicate that the two proposed starting materials, compound_1 and Compound_8, are ideal candidates for the control of impurities entering the synthetic process. The proposed starting materials enter the process with adequate purity, guided by a rigorous set of proposed specifications, that ensure the drug substance does not contain significant (greater than 0.10%) impurities originating from the starting materials. Additionally, the downstream processes required to convert these proposed starting materials to drug substance also provide significant impurity reduction and control capability as demonstrated through batch history data and impurity spiking studies. This combination of factors is the basis for the proposal that the two proposed starting materials are appropriate for the commercial manufacture of Drug_1.

Input 4: The drug substance manufacturing process uses two starting materials, compound_1 and Compound_8. It consists of three synthetic steps and employs 3 covalent bond making/breaking processes. In Step 1a, a mixture of the two starting materials (compound_1 and Compound_8), IUPAC_2 and toluene are heated at reflux under a partial vacuum, with water removal, to form Compound_11 by a condensation reaction. In Step 1b, IUPAC_3 and additional IUPAC_2 are added, nitrogen is used to reduce the oxygen level, and most of the toluene is removed by vacuum distillation. Nitrogen is again used to reduce the oxygen level in the reaction mixture, before it is heated to produce Compound_6 by a cyclocondensation reaction. As the reaction is cooled, seed crystals may be added to aid crystallization of Compound_6. To complete the crystallization, water is added, followed by an aqueous solution of potassium carbonate. The product is then isolated by filtration, washed, and dried.

Input 5: In Step 2, aqueous sodium hydroxide is added to a heated solution of Compound_6 in IUPAC_4 to hydrolyze the nitrile to an amide. When the reaction is complete, the partially-cooled solution may be seeded to initiate crystallization. The crystallization is completed by the addition of water to the warm slurry followed by cooling. The product is then isolated by filtration, washed and dried. In Step 3, technical-grade Compound_10 monohydrate, ethanol, which may be denatured with methanol, and water are heated and the resulting solution is polish filtered. Following concentration, the solution may be seeded with Drug_1 to initiate crystallization. The temperature is reduced slightly and water is added to the warm mixture. The slurry is then cooled to complete the crystallization. The particle size distribution is adjusted by slurry-milling followed by thermal-cycling. The product is isolated by filtration, washed and dried.
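
To illustrate the kind of structured output discussed for these inputs, the following is a minimal sketch, not the SUSIE implementation, of how entities and their context from Input 1 could be stored as knowledge-graph triples. The `Triple` type, the helper functions, the relation names (`has_role`, `used_in_manufacture_of`, `has_form`), and the hand-written extraction results are all assumptions made for illustration; in the framework they would come from the fine-tuned model and the contextualization modules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One edge of the knowledge graph: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


def build_graph(triples):
    """Group triples into an adjacency map: subject -> list of (relation, object)."""
    graph = defaultdict(list)
    for t in triples:
        edge = (t.relation, t.obj)
        if edge not in graph[t.subject]:  # skip duplicate edges
            graph[t.subject].append(edge)
    return dict(graph)


def neighbors(graph, subject, relation):
    """Return all objects linked to `subject` by `relation`."""
    return [obj for rel, obj in graph.get(subject, []) if rel == relation]


# Hand-written (hypothetical) extraction results for Input 1.
input_1_triples = [
    Triple("Compound_1", "has_role", "proposed starting material"),
    Triple("Compound_8", "has_role", "proposed starting material"),
    Triple("Compound_1", "used_in_manufacture_of", "Drug_1"),
    Triple("Compound_8", "used_in_manufacture_of", "Drug_1"),
    Triple("Drug_1", "has_form", "monohydrate"),
]

graph = build_graph(input_1_triples)
print(neighbors(graph, "Compound_1", "has_role"))
```

Storing the context ('proposed starting material') as an explicit edge rather than as free text is one way to let an SME query the graph by relation type when filtering important relationships from generic ones.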


Bodenreider, O., 2004. The unified medical language system (UMLS): integrating Mann, V., Gani, R., Venkatasubramanian, V., 2023b. Intelligent process flowsheet
biomedical terminology. Nucleic Acids Res. 32 (suppl_1), D267–D270. synthesis and design using extended SFILES representation. In: Computer Aided
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakan- Chemical Engineering. Vol. 52, Elsevier, pp. 221–226.
tan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot Mann, V., Venkatasubramanian, V., 2021a. Predicting chemical reaction outcomes: A
learners. In: Advances in Neural Information Processing Systems. Vol. 33, pp. grammar ontology-based transformer framework. AIChE J. 67 (3), e17190.
1877–1901. Mann, V., Venkatasubramanian, V., 2021b. Retrosynthesis prediction using grammar-
Christensen, J., Mausam, Soderland, S., Etzioni, O., 2011. An analysis of open based neural machine translation: An information-theoretic approach. Comput.
information extraction based on semantic role labeling. In: Proceedings of the Sixth Chem. Eng. 155, 107533.
International Conference on Knowledge Capture. pp. 113–120. Mann, V., Venkatasubramanian, V., 2023. AI-driven hypergraph network of organic
Collier, N., Nobata, C., Tsujii, J., 2000. Extracting the names of genes and gene products chemistry: network statistics and applications in reaction classification. React.
with a hidden Markov model. In: COLING 2000 Volume 1: The 18th International Chem. Eng. 8 (3), 619–635.
Conference on Computational Linguistics. Musen, M.A., 2015. The protégé project: a look back and a look forward. AI Matters
Degtyarenko, K., De Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., 1 (4), 4–12.
Alcántara, R., Darsow, M., Guedj, M., Ashburner, M., 2007. ChEBI: a database and Muthukkumaran, A., Raghunathan, S., Ravichandran, A., Rengaswamy, R., 2023.
ontology for chemical entities of biological interest. Nucleic Acids Res. 36 (suppl_1), Perovskite-based electrocatalyst discovery and design using word embeddings from
D344–D350. retrained scibert language model. AIChE J. e18068.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. Bert: Pre-training of deep
Pilehvar, M.T., Bernard, A., Smedley, D., Collier, N., 2022. PheneBank: a
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.
literature-based database of phenotypes. Bioinformatics 38 (4), 1179–1180.
04805.
Ramos, J., et al., 2003. Using tf-idf to determine word relevance in document queries.
Fries, J.A., Steinberg, E., Khattar, S., Fleming, S.L., Posada, J., Callahan, A.,
In: Proceedings of the First Instructional Conference on Machine Learning. Vol.
Shah, N.H., 2021. Ontology-driven weak supervision for clinical entity classification
242, (1), Citeseer, pp. 29–48.
in electronic health records. Nat. Commun. 12 (1), 1–11.
Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C., 2017. Snorkel:
Gaizauskas, R., Demetriou, G., Artymiuk, P.J., Willett, P., 2003. Protein structures and
Rapid training data creation with weak supervision. In: Proceedings of the VLDB
information extraction from biological texts: the PASTA system. Bioinformatics 19
Endowment. International Conference on Very Large Data Bases. Vol. 11, (3), NIH
(1), 135–143.
Public Access, p. 269.
Gamallo, P., Garcia, M., Fernández-Lanza, S., 2012. Dependency-based open infor-
mation extraction. In: Proceedings of the Joint Workshop on Unsupervised and Remolona, M.F.M., Conway, M.F., Balasubramanian, S., Fan, L., Feng, Z., Gu, T.,
Semi-Supervised Learning in NLP. pp. 10–18. Kim, H., Nirantar, P.M., Panda, S., Ranabothu, N.R., et al., 2017. Hybrid ontology-
Gentile, A.L., Gruhl, D., Ristoski, P., Welch, S., 2019. Personalized knowledge graphs learning materials engineering system for pharmaceutical products: Multi-label
for the pharmaceutical domain. In: The Semantic Web–ISWC 2019: 18th Interna- entity recognition and concept detection. Comput. Chem. Eng. 107, 49–60.
tional Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Saidi, R., Maddouri, M., Nguifo, E.M., 2009. Comparing graph-based representations of
Proceedings, Part II 18. Springer, pp. 400–417. protein for mining purposes. In: Proceedings of the KDD-09 Workshop on Statistical
Gothard, C.M., Soh, S., Gothard, N.A., Kowalczyk, B., Wei, Y., Baytekin, B., Grzy- and Relational Learning in Bioinformatics. pp. 35–38.
bowski, B.A., 2012. Rewiring chemistry: algorithmic discovery and experimental Sasaki, Y., Tsuruoka, Y., McNaught, J., Ananiadou, S., 2008. How to make the most of
validation of one-pot reactions in the network of organic chemistry. Angew. Chem. NE dictionaries in statistical NER. BMC Bioinform. 9 (11), 1–9.
124 (32), 8046–8051. Schmitz, M., Soderland, S., Bart, R., Etzioni, O., et al., 2012. Open language learning for
Hailemariam, L., Venkatasubramanian, V., 2010a. Purdue ontology for pharmaceutical information extraction. In: Proceedings of the 2012 Joint Conference on Empirical
engineering: part I. Conceptual framework. J. Pharmaceut. Innov. 5, 88–99. Methods in Natural Language Processing and Computational Natural Language
Hailemariam, L., Venkatasubramanian, V., 2010b. Purdue ontology for pharmaceutical Learning. pp. 523–534.
engineering: Part II. Applications. J. Pharmaceut. Innov. 5, 139–146. Schwaller, P., Laino, T., Gaudin, T., Bolgar, P., Hunter, C.A., Bekas, C., Lee, A.A.,
Harmata, S., Hofer-Schmitz, K., Nguyen, P.-H., Quix, C., Bakiu, B., 2017. Layout- 2019. Molecular transformer: a model for uncertainty-calibrated chemical reaction
aware semi-automatic information extraction for pharmaceutical documents. In: prediction. ACS Central Sci. 5 (9), 1572–1583.
Data Integration in the Life Sciences: 12th International Conference, DILS 2017, Schwartz, A.S., Hearst, M.A., 2002. A simple algorithm for identifying abbreviation
Luxembourg, Luxembourg, November 14-15, 2017, Proceedings 12. Springer, pp. definitions in biomedical text. In: Biocomputing 2003. World Scientific, pp.
71–85. 451–462.
Hirtreiter, E., Balhorn, L.S., Schweidtmann, A.M., 2022. Towards automatic generation Sennrich, R., Haddow, B., Birch, A., 2015. Neural machine translation of rare words
of piping and instrumentation diagrams (p&ids) with artificial intelligence. arXiv with subword units. arXiv preprint arXiv:1508.07909.
preprint arXiv:2211.05583. Shen, D., Zhang, J., Zhou, G., Su, J., Tan, C.L., 2003. Effective adaptation of hidden
Honnibal, M., Montani, I., 2017. Spacy 2: natural language understanding with bloom markov model-based named entity recognizer for biomedical domain. In: Proceed-
embeddings, convolutional neural networks and incremental parsing. 7, (1), pp. ings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.
411–420, in press. pp. 49–56.
Huang, M.-S., Lai, P.-T., Lin, P.-Y., You, Y.-T., Tsai, R.T.-H., Hsu, W.-L., 2020. Simon, C., Davidsen, K., Hansen, C., Seymour, E., Barnkob, M.B., Olsen, L.R., 2019.
Biomedical named entity recognition and linking datasets: survey and our recent BioReader: a text mining tool for performing classification of biomedical literature.
development. Brief. Bioinform. 21 (6), 2219–2238. BMC Bioinform. 19, 165–170.
International Council for Harmonisation, 2023. Efficacy Guidelines - International Coun- Skeppstedt, M., Kvist, M., Nilsson, G.H., Dalianis, H., 2014. Automatic recognition
cil for Harmonisation. https://www.ich.org/page/efficacy-guidelines, Accessed: of disorders, findings, pharmaceuticals and body structures from clinical text: An
June, 2023. annotation and machine learning study. J. Biomed. Inform. 49, 148–158.
Kang, T., Zhang, S., Tang, Y., Hruby, G.W., Rusanov, A., Elhadad, N., Weng, C., 2017.
Tetko, I.V., Karpov, P., Van Deursen, R., Godin, G., 2020. State-of-the-art augmented
EliIE: An open-source information extraction system for clinical trial eligibility
NLP transformer models for direct and single-step retrosynthesis. Nat. Commun. 11
criteria. J. Am. Med. Inf. Assoc. 24 (6), 1062–1071.
(1), 5575.
Kulkarni, R., Kulkarni, H., Balar, K., Krishna, P., 2018. Cognitive natural language
Trinh, C., Meimaroglou, D., Hoppe, S., 2021. Machine learning in chemical product
search using calibrated quantum mesh. In: 2018 IEEE 17th International Conference
engineering: The state of the art and a guide for newcomers. Processes 9 (8),
on Cognitive Informatics & Cognitive Computing (ICCI* CC). IEEE, pp. 174–178.
1456.
Lawrence, X.Y., Raw, A., Wu, L., Capacci-Daniel, C., Zhang, Y., Rosencrance, S.,
U.S. Food and Drug Administration, 2018. Understanding CDER’s risk-based selec-
2019. Fda’s new pharmaceutical quality initiative: Knowledge-aided assessment &
tion model. https://www.fda.gov/media/116004/download, (Accessed October 28,
structured applications. Int. J. Pharmaceut.: X 1, 100010.
2022).
Lee, P., Bubeck, S., Petro, J., 2023. Benefits, limits, and risks of GPT-4 as an AI chatbot
for medicine. N. Engl. J. Med. 388 (13), 1233–1239. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J., 2020. BioBERT: a Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information
pre-trained biomedical language representation model for biomedical text mining. Processing Systems. Vol. 30.
Bioinformatics 36 (4), 1234–1240. Venkatasubramanian, V., Mann, V., 2022. Artificial intelligence in reaction prediction
Leser, U., Hakenberg, J., 2005. What makes a gene name? Named entity recognition and chemical synthesis. Curr. Opin. Chem. Eng. 36, 100749.
in the biomedical literature. Brief. Bioinform. 6 (4), 357–369. Versley, Y., Ponzetto, S.P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X.,
Luo, L., Lai, P.-T., Wei, C.-H., Arighi, C.N., Lu, Z., 2022. BioRED: a rich biomedical Moschitti, A., 2008. BART: A modular toolkit for coreference resolution. In:
relation extraction dataset. Brief. Bioinform. 23 (5), bbac282. Proceedings of the ACL-08: HLT Demo Session. pp. 9–12.
Mann, V., Brito, K., Gani, R., Venkatasubramanian, V., 2022. Hybrid, interpretable Viswanath, S., Fennell, J.W., Balar, K., Krishna, P., 2021a. An industrial approach
machine learning for thermodynamic property estimation using grammar2vec for to using artificial intelligence and natural language processing for accelerated
molecular representation. Fluid Phase Equilib. 561, 113531. document preparation in drug development. J. Pharmaceut. Innov. 16, 302–316.
Mann, V., Gani, R., Venkatasubramanian, V., 2023a. Group contribution-based property Viswanath, S., Guntz, S., Dieringer, J., Vaidyaraman, S., Wang, H., Gounaris, C., 2021b.
modeling for chemical product design: A perspective in the AI era. Fluid Phase An ontology to describe small molecule pharmaceutical product development and
Equilib. 113734. methodology for optimal activity scheduling. J. Pharmaceut. Innov. 1–15.

14
V. Mann et al. Computers and Chemical Engineering 179 (2023) 108446

Washio, T., Motoda, H., 2003. State of the art of graph-based data mining. Acm Sigkdd Zhang, D., Mukherjee, S., Lockard, C., Dong, X.L., McCallum, A., 2019. Openki: Inte-
Explor. Newslett. 5 (1), 59–68. grating open information extraction and knowledge bases with relation inference.
Xu, H., Stenner, S.P., Doan, S., Johnson, K.B., Waitman, L.R., Denny, J.C., 2010. MedEx: arXiv preprint arXiv:1904.12606.
a medication information extraction system for clinical narratives. J. Am. Med. Inf. Zhou, Z.-H., 2018. A brief introduction to weakly supervised learning. Natl. Sci. Rev.
Assoc. 17 (1), 19–24. 5 (1), 44–53.
Yuan, C., Ryan, P.B., Ta, C., Guo, Y., Li, Z., Hardin, J., Makadia, R., Jin, P., Shang, N.,
Kang, T., et al., 2019. Criteria2Query: a natural language interface to clinical
databases for cohort definition. J. Am. Med. Inf. Assoc. 26 (4), 294–305.
