ICAART_2024_-paper (1)

This study investigates the use of machine learning techniques to predict drug-drug interactions (DDIs) and their severity by analyzing various drug features from multiple databases. It evaluates the performance of five machine learning models across 16 similarity measures, identifying Jaccard similarity as the most effective for DDI prediction, with Random Forest achieving the highest accuracy of 72%. The research emphasizes the importance of feature selection in improving prediction efficiency and clinical relevance for drug safety.

A Study on Drug Similarity Measures for Predicting Drug-Drug Interactions and Severity Using Machine Learning Techniques

Deepa Kumari [a], Antony Joseph K, Pranay Tarigopula, Rohith Kumar Gattu, Maithili Seemakurthi, Subhrakanta Panda and Jabez Christopher
CSIS Department, BITS Pilani, Hyderabad Campus, Shameerpet, Hyderabad, India
{p20190020, f20190620, f20180237, f20190049, f20190377, spanda, jabezc}@hyderabad.bits-pilani.ac.in

Keywords: Drug-Drug Interaction, Side-Effects, Similarity Measures, Machine Learning.

Abstract: Drug-drug interactions (DDIs) can lead to adverse reactions by decreasing the absorption rate in a patient's body. The existing literature has paid limited attention to the impact of different similarity measures on DDI effects. This paper analyzes seven drug features (chemical substructures, targets, transporters, enzymes, side-effects, offsides, and carriers) obtained from the DrugBank, SIDER, TWOSIDES, and OFFSIDES databases. It examines five Machine Learning models (Logistic Regression, Random Forest, Decision Tree, KNN, ANN) on 16 different similarity measures and evaluates their predictions through accuracy and AUC-curve analysis. The Jaccard similarity is chosen for further DDI prediction as it gives the best similarity score. Feature selection (using Chi-Square) further reduces the time and space complexity. The study then compares combinations of the selected features (chemical substructures, side-effects, offsides, enzymes) on Logistic Regression, Random Forest, and XGB classifiers. The results show that the Random Forest classifier predicts DDIs with the best accuracy of 72%. The framework also uniquely categorizes the severity level of side effects (minor, moderate, and major) due to DDI events through multi-class classification. Thus, it offers greater clinical significance and can help fast-track clinical trials.

1 INTRODUCTION
Drugs are critical in treating diseases and sustaining healthy lifestyles (Huang et al., 2021). However, drugs can interfere with other drugs during treatment, a phenomenon called Drug-Drug Interaction (DDI), and cause serious health complications (Seo et al., 2020). The occurrence of DDIs may lead to various Adverse Drug Reactions (ADRs) that cause unavoidable detrimental consequences and high costs for health service providers and hospitals (Liu et al., 2012) (Galeano et al., 2020). The processes involved in detecting DDIs are costly and time-consuming, yet crucial for drug research and development (Han et al., 2022) (Ferdousi et al., 2017). The complex nature of DDIs makes them extremely difficult to predict, while ADRs are expensive to diagnose and practically hard to treat. Several computational approaches have been used successfully in drug development and in the identification of DDIs (Wu et al., 2022).

Figure 1: Drug-Drug Interaction graph with its severity levels.

Table 1: Performance of similarity measures.

Similarity Measure | LR   | DT   | RF   | KNN  | NN
Bray               | 0.62 | 0.63 | 0.63 | 0.64 | 0.62
Dice               | 0.64 | 0.63 | 0.63 | 0.64 | 0.64
Jaccard            | 0.64 | 0.64 | 0.63 | 0.64 | 0.64
Hamming            | 0.56 | 0.60 | 0.61 | 0.58 | 0.55
Russel Rao         | 0.63 | 0.64 | 0.64 | 0.64 | 0.64
Faith              | 0.55 | 0.60 | 0.61 | 0.59 | 0.52
Gower              | 0.56 | 0.60 | 0.61 | 0.59 | 0.53
Sokal Michener     | 0.56 | 0.60 | 0.61 | 0.59 | 0.52
Ample              | 0.58 | 0.60 | 0.60 | 0.61 | 0.60
Anderberg          | 0.60 | 0.61 | 0.60 | 0.61 | 0.61
Baroni             | 0.63 | 0.63 | 0.63 | 0.64 | 0.63
Kulczynski         | 0.64 | 0.63 | 0.64 | 0.64 | 0.64
Goodman            | 0.61 | 0.61 | 0.61 | 0.61 | 0.61
Rogers Tanimoto    | 0.64 | 0.63 | 0.63 | 0.64 | 0.64
Yule               | 0.58 | 0.63 | 0.64 | 0.64 | 0.58
Inner Product      | 0.54 | 0.60 | 0.61 | 0.58 | 0.52

[a] https://orcid.org/0000-0002-0696-9790

Kumari, D.
A Study on Drug Similarity Measures for Predicting Drug-Drug Interactions and Severity Using Machine Learning Techniques.
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024) - Volume 3, pages 72-79
ISBN: 978-989-758-680-4; ISSN: 2184-433X
Proceedings Copyright © 2024 by SCITEPRESS – Science and Technology Publications, Lda.

The proposed approach is framed as a Drug-Drug Interaction (DDI) prediction problem, where DDI refers to the featured matrix network, $M = \{D, E, F\}$. Here, $D = \{d_l\}_{l=1}^{N}$ is the set of drugs, where $l$ indexes the $N$ drugs (nodes). $E \in \{0, 1\}^{N \times N}$ encodes the existence of drug interactions: $a_{mn}$ is the entry of matrix $E$ at the $m$th row and $n$th column and describes the interaction between drugs $d_m$ and $d_n$, so $a_{mn} = 1$ indicates the existence of an interaction and $a_{mn} = 0$ denotes its absence. $F \in \mathbb{R}^{N \times P}$ represents the drug feature matrix, where $P$ is the dimension of the features, and $f_m \in \mathbb{R}^{1 \times P}$, the $m$th row of matrix $F$, is the feature vector of drug $d_m$. With the node feature matrix $F$ and adjacency matrix $E$, this research studies the following DDI prediction problems.

• Binary DDI Prediction: Binary DDI prediction quickly ascertains whether an interaction between a pair of drugs $(d_m, d_n)$ exists. It is economical in computational resources and time, especially when dealing with a large number of drug pairs. Formally, the task is to learn a mapping $f_{in}: (d_m, d_n) \to \mathrm{Interact}_{mn} \in [0, 1]$, where $\mathrm{Interact}_{mn}$ is the interaction probability of $(d_m, d_n)$.
• Multi-class DDI Side-effects Prediction: The task is to predict the specific interaction type of a drug pair $(d_m, d_n)$ based on drug interactions. Computationally, it is to learn a mapping $F: D \times D \to S_D$, where $S_D$ represents the degree of severity of side effects.

For a more transparent visual representation, Figure 1 presents a subset of 15 drugs depicted as nodes, with their interactions shown as edges. Interaction severity is shown in three colors: blue for low, green for moderate, and red for high severity. For example, Figure 1 shows that Drug IDs DB00783 and DB00316 interact with moderate severity, whereas Drug IDs DB00361 and DB01263 interact with high severity. Thus, this paper offers valuable insights into potential risks and their implications.

The remainder of the paper is organized as follows: Section 2 presents the methodology. Section 3 explains the comparative analysis. Section 4 concludes the work and outlines future work.

2 METHODOLOGY AND RESULTS

The proposed framework is implemented on a server with 64 GB of RAM and an Intel(R) Core(TM) i9-7980XE CPU @ 2.60 GHz (18 Cores, 36 Threads). The code is deployed on PyCharm version 3.5 and uses Power BI packages. Figure 2 shows the workflow of the proposed framework.

Figure 2: Workflow of Proposed Approach.

2.1 Dataset

The proposed framework uses drug datasets from DRUGBANK (version 5.1.9) (Wu et al., 2022), SIDER (Seo et al., 2020), OFFSIDES, and TWOSIDES (Tatonetti et al., 2012). It uses only approved drugs containing biological, chemical, and phenotypic data.

Biological data include lists of drug-carrier pairs, drug-target pairs, drug-enzyme pairs, and drug-transporter pairs. These lists are the basis for constructing a feature space corresponding to four types of binary fingerprints of the biological elements: carrier, target, enzyme, and transporter (Liu et al., 2012). The lengths of the bit vectors for the carrier, target, enzyme, and transporter features are 78, 2856, 434, and 273, respectively.

Chemical data consist of the 2D chemical structures of the same drug list, used as a drug feature. Chemical substructure information is retrieved from the PubChem database in SMILES (Simplified Molecular Input Line Entry System) format using MOE 2010.10 software. Then, MACCS (Molecular ACCess System) substructures are calculated with 166 key descriptor bits. This work uses MACCS because of its availability in cheminformatics software libraries and databases, and its promising performance in capturing the substructure information required for predicting DDIs (Ibrahim et al., 2021).

Phenotypic data of drugs are also essential in predicting DDIs. Drug indications, side effects, and offside effects constitute the phenotypic data. The framework extracts drug indications and side effects from SIDER and offside effects from OFFSIDES. It creates a comprehensive drug dataset by merging and intersecting these diverse datasets.

2.2 Similarity Measures

Similarity measures are numerical quantities that quantify the degree of association between pairs of drugs. A quantity $sim_{ij}$ is considered a measure of similarity if, for every $d_i, d_j \in D$, it satisfies the following properties: $0 \le sim_{ij} \le 1$ if $i \ne j$; $sim_{ii} = 1$; and $sim_{ij} = sim_{ji}$. Even though numerous binary similarity measures exist in the literature, only a few are in use (Ibrahim et al., 2021) (Huang et al., 2021). Different similarity-based ML methods help predict DDIs through binary classification (Wu et al., 2022) (Vilar et al., 2014).
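The binary fingerprints of Section 2.1 and the similarity properties stated in Section 2.2 can be illustrated with a small sketch. It is not the authors' code: the pair list, the drug and enzyme identifiers, and the function names are hypothetical, and Jaccard is used only as one example of a binary similarity measure.

```python
from itertools import product

# Hypothetical drug-enzyme pair list (illustrative IDs, not taken from DrugBank).
PAIRS = [
    ("drugA", "enz1"), ("drugA", "enz2"),
    ("drugB", "enz2"), ("drugB", "enz3"),
    ("drugC", "enz1"), ("drugC", "enz2"),
]


def build_fingerprints(pairs):
    """Turn (drug, element) pairs into one binary bit vector per drug."""
    elements = sorted({e for _, e in pairs})
    drugs = sorted({d for d, _ in pairs})
    present = set(pairs)
    return {d: [1 if (d, e) in present else 0 for e in elements] for d in drugs}


def jaccard(u, v):
    """Jaccard similarity of two equal-length binary vectors."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    # Assumption: two all-zero vectors are treated as identical.
    return 1.0 if either == 0 else both / either


def similarity_matrix(fps):
    """Pairwise similarity for every ordered pair of drugs."""
    return {(i, j): jaccard(fps[i], fps[j]) for i, j in product(fps, repeat=2)}


fingerprints = build_fingerprints(PAIRS)
sim = similarity_matrix(fingerprints)
```

On this toy input every value lies in [0, 1], each drug has similarity 1 with itself, and the matrix is symmetric, which are the three properties listed above.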
This paper implements 16 binary similarity measures and analyzes their performance on different classifiers, as shown in Table 2. Here, v_i and v_j are two row vectors of binary variables, each taking the value 1 (present) or 0 (absent), and:
p ← number of features where v_i = 1 and v_j = 1
q ← number of features where v_i = 0 and v_j = 1
r ← number of features where v_i = 1 and v_j = 0
s ← number of features where v_i = 0 and v_j = 0
σ ← observed agreement or similarity between two sets of drug interactions
σ′ ← expected agreement to occur randomly between two sets of drug interactions
p+s ← total number of matches between v_i and v_j
q+r ← total number of mismatches between v_i and v_j
M ← similarity value

The binary similarity determines the analysis properties of the similarity and dissimilarity coefficients (de Albuquerque et al., 2022). The choice of the correct coefficients and variables depends on the best performance of the similarity measures on different classifiers, as shown in Table 1. Out of these, the classifier that results in the highest performance is chosen to predict the candidate side effects of drugs, as shown in Table 3.

Figure 3: Chi-square test on Features.

2.3 Drug-Drug Interactions (DDI)

This paper uses the Chi-square test, a simple tool for univariate feature selection for classification. The selection threshold is the mean of the chi-squared values over all features, i.e. $\frac{\text{sum of chi-squared values}}{\text{total number of features}} = \frac{1}{7} \approx 0.14$. Figure 3 shows that only four binary similarity features, namely offside (Off Sim), side effect (SE Sim), chemical substructure (Chemsub Sim), and enzyme (Enzyme Sim), are above the threshold.

Each drug is coded into a binary vector in which every bit indicates whether the drug is associated with another drug: if a drug is associated with another drug, the corresponding bit is 1; otherwise, it is 0. Drug similarities are evaluated based on DDI information from DrugBank and the interaction information with standard similarity calculation methods.
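The mean-threshold selection rule of Section 2.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the chi-squared statistic uses a common count-based formulation for non-negative features, the scores are normalised to sum to 1 (so that the mean threshold equals 1 divided by the number of features, matching the 1/7 quoted above), and the feature names and toy values are hypothetical.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic between a non-negative feature and class labels."""
    total = sum(feature)
    n = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        observed = sum(x for x, y in zip(feature, labels) if y == cls)
        expected = total * labels.count(cls) / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score


def select_above_mean(features, labels):
    """Keep the features whose normalised chi-squared score exceeds the mean."""
    raw = {name: chi2_score(col, labels) for name, col in features.items()}
    total = sum(raw.values())
    threshold = 1.0 / len(raw)  # mean of scores that sum to 1
    if total == 0:
        return {name: 0.0 for name in raw}, threshold, []
    # Assumption: scores are normalised so that they sum to 1.
    norm = {name: s / total for name, s in raw.items()}
    selected = [name for name, s in norm.items() if s > threshold]
    return norm, threshold, selected


# Hypothetical similarity features for six drug pairs (1 = interacting pair).
LABELS = [1, 1, 1, 0, 0, 0]
FEATURES = {
    "Off_Sim": [0.9, 0.8, 0.7, 0.1, 0.2, 0.1],
    "SE_Sim": [0.8, 0.9, 0.6, 0.2, 0.3, 0.2],
    "Target_Sim": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "Carrier_Sim": [0.4, 0.5, 0.6, 0.5, 0.4, 0.6],
}

scores, threshold, selected = select_above_mean(FEATURES, LABELS)
```

With four toy features the threshold is 0.25, and only the two features whose values differ between the classes are retained.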


Table 2: Definitions of different Similarity Measures.

S.no. | Similarity Measure | Formula | Description
1 | Bray (Huang et al., 2021) | $M = \frac{q+r}{2p+q+r}$ | Computes the compositional dissimilarity between the two sites based on counts at each site.
2 | Dice (de Albuquerque et al., 2022) | $M = \frac{p}{2p+q+r}$ | Measures the similarity between two sets of data.
3 | Jaccard (de Albuquerque et al., 2022) | $M = \frac{p}{p+q+r}$ | Checks the members of two sets to see which are shared and which are distinct. Computes similarity for the two sets of data, with a range from 0% to 100%.
4 | Hamming (Huang et al., 2021) | $\text{Distance} = q + r$ | Measures the number of equal components, divided by the length of the vectors. Defines the minimum number of substitutions needed to modify one string into the other, or the minimum number of errors that could have converted one string into the other.
5 | Russel Rao (de Albuquerque et al., 2022) | $M = \frac{p}{p+q+r+s}$ | Dot-product-based similarity measure in a range between 0 (minimum similarity) and 1 (maximum similarity). Measures the similarity between drug interactions as it is a specific and appropriate similarity measure with a 0 to 1 similarity range.
6 | Faith (Huang et al., 2021) | $M = \frac{p+0.5s}{p+q+r+s}$ | Feature and information theoretic measure; parameterized ratio model of similarity.
7 | Gower (Huang et al., 2021) | $M = \frac{p+s}{\sqrt{(p+q)(p+r)(q+s)(r+s)}}$ | Measures how different two records (including logical, categorical, numerical or text data) are. The distance is always a number between 0 (identical) and 1 (maximally dissimilar).
8 | Sokal Michener (de Albuquerque et al., 2022) | $M = \frac{p+q}{p+q+r+s}$ | Measures the negative matches that do not necessarily mean any similarity between two objects.
9 | Ample (Huang et al., 2021) | $M = \frac{|p(r+s)|}{|r(p+q)|}$ | Similar to the absolute value of the Tarantula measure, which has a high correlation with chi-square based measures.
10 | Anderberg (Huang et al., 2021) | $M = \frac{p}{p+2(q+r)}$ | Handles similarity between categorical attributes. Assigns higher similarity to rare matches, and lower similarity to rare mismatches.
11 | Baroni (Huang et al., 2021) | $M = \frac{\sqrt{ps}+p}{\sqrt{ps}+p+q+r}$ | Selects compounds which exhibit a similar size distribution to the database. Suitable for compound selection to identify a wide structural variety of compounds but with a similar distribution to the full database.
12 | Kulczynski (de Albuquerque et al., 2022) | $M = \frac{p}{q+r}$ | Measures the correlation between occurrences of two items, which is a fundamental concept in the analysis of presence-absence data. Solves many pattern recognition problems such as classification, clustering, and retrieval problems.
13 | Goodman (Huang et al., 2021) | $M = \frac{\sigma-\sigma'}{2k-\sigma}$ | Measures the similarity of the orderings of the data when ranked by each of the quantities and the strength of association of the cross-tabulated data.
14 | Roger Tanimoto (de Albuquerque et al., 2022) | $M = \frac{p+s}{p+2(q+r)+s}$ | Emphasizes the weight of the count of the four states.
15 | Yule (Huang et al., 2021) | $M = \frac{ps-qr}{ps+qr}$ | Defined as the coefficient of colligation. Measures association between two binary variables.
16 | Cosine of Inner product (Huang et al., 2021) | $M = p + s$ | Measures the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction.
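As a worked illustration of the counts p, q, r and s, the sketch below computes four of the Table 2 coefficients (Jaccard, Russel Rao, Roger Tanimoto and Yule) for a pair of binary vectors. The function names and the example vectors are hypothetical, and returning 0.0 for a zero denominator is an assumption made here for robustness rather than something the paper specifies.

```python
def contingency(vi, vj):
    """Return the counts (p, q, r, s) for two equal-length binary vectors."""
    p = sum(1 for a, b in zip(vi, vj) if a == 1 and b == 1)
    q = sum(1 for a, b in zip(vi, vj) if a == 0 and b == 1)
    r = sum(1 for a, b in zip(vi, vj) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(vi, vj) if a == 0 and b == 0)
    return p, q, r, s


def _ratio(num, den):
    # Assumption: a degenerate (zero) denominator yields 0.0.
    return num / den if den else 0.0


def jaccard(p, q, r, s):
    return _ratio(p, p + q + r)


def russell_rao(p, q, r, s):
    return _ratio(p, p + q + r + s)


def rogers_tanimoto(p, q, r, s):
    return _ratio(p + s, p + 2 * (q + r) + s)


def yule(p, q, r, s):
    return _ratio(p * s - q * r, p * s + q * r)


# Hypothetical 8-bit fingerprints of two drugs.
v_i = [1, 1, 0, 0, 1, 0, 1, 0]
v_j = [1, 0, 0, 1, 1, 0, 1, 1]
counts = contingency(v_i, v_j)  # p=3, q=2, r=1, s=2
```

For these vectors the Jaccard value is 3/6 = 0.5, Russel Rao is 3/8, Roger Tanimoto is 5/11 and Yule is 4/8 = 0.5. Note that Jaccard ignores the joint absences s, whereas the other three coefficients use them.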

Table 3: Performance analysis of different similarity measures on different classifiers.

Similarity Measure | Classifier | Accuracy | Precision | Recall | F1-score | AUC
Jaccard            | LR         | 0.64     | 0.64      | 0.65   | 0.64     | 0.68
Jaccard            | DT         | 0.64     | 0.66      | 0.64   | 0.63     | 0.69
Jaccard            | RF         | 0.63     | 0.65      | 0.64   | 0.64     | 0.70
Jaccard            | KNN        | 0.64     | 0.65      | 0.65   | 0.64     | 0.69
Jaccard            | NN         | 0.64     | 0.65      | 0.65   | 0.64     | 0.65
Russel Rao         | LR         | 0.63     | 0.62      | 0.62   | 0.62     | 0.68
Russel Rao         | DT         | 0.64     | 0.66      | 0.65   | 0.64     | 0.69
Russel Rao         | RF         | 0.64     | 0.65      | 0.65   | 0.64     | 0.69
Russel Rao         | KNN        | 0.64     | 0.65      | 0.65   | 0.64     | 0.69
Russel Rao         | NN         | 0.64     | 0.65      | 0.65   | 0.64     | 0.64
Kulczynski         | LR         | 0.64     | 0.66      | 0.65   | 0.64     | 0.69
Kulczynski         | DT         | 0.63     | 0.65      | 0.64   | 0.63     | 0.69
Kulczynski         | RF         | 0.64     | 0.65      | 0.65   | 0.64     | 0.69
Kulczynski         | KNN        | 0.64     | 0.65      | 0.65   | 0.64     | 0.69
Kulczynski         | NN         | 0.64     | 0.66      | 0.65   | 0.64     | 0.65
Rogers Tanimoto    | LR         | 0.64     | 0.65      | 0.64   | 0.63     | 0.68
Rogers Tanimoto    | DT         | 0.63     | 0.65      | 0.64   | 0.63     | 0.69
Rogers Tanimoto    | RF         | 0.63     | 0.65      | 0.64   | 0.64     | 0.65
Rogers Tanimoto    | KNN        | 0.64     | 0.65      | 0.65   | 0.64     | 0.69
Rogers Tanimoto    | NN         | 0.64     | 0.65      | 0.64   | 0.64     | 0.51

Figure 4: AUC for similarity measures on different classifiers.

Figure 4 infers that Jaccard gives better accuracy with Random Forest than other similarity measures. DrugBank gives 13910 extracted drugs with a total of 2682157 interactions. Of these, only 4107 are approved drugs, which reduces the total number of interactions to 1889983. After filtering for duplicate interactions (such as $d_{ij}$ and $d_{ji}$), the total number of interactions becomes 1341086. The common drugs with all four features (Off Sim, SE Sim, Chemsub Sim, Enzyme Sim) come down to 816 drugs, and the total number of interactions becomes 260301.

The positive samples for the 816 drugs are calculated using the Jaccard similarity measure, giving 144282 positive interactions and 116019 unlabeled interactions. Unlabeled drug interactions are labeled by mapping drugs from the SIDER database. Thus, unlabeled interactions are converted into negative and positive interactions. Here, 89835 interactions are considered negative samples, and 26184 are considered positive. Hence, the total number of positive interactions increases to 170466.

2.4 Predictive Models

Predictive models require little computation time and supervision (Wu et al., 2022) (Kumari et al., 2023). The performance of the models is compared using metrics such as accuracy, precision, recall, F1 score, AUC score and Matthews Correlation Coefficient (MCC). The proposed experiment follows 5-fold cross-validation for a more robust evaluation of model performance than a single train-test split. Each fold contains an equal number of samples. In each iteration, one fold is held out as the test set, while the remaining four folds are combined to form the training set. This mitigates the impact of the data's initial distribution and provides a more representative estimate of the model's ability to generalize to unseen data. Then, the aggregated similarity matrices (Off Sim, SE Sim, Chemsub Sim, Enzyme Sim) are applied to train the machine learning (ML) models. ML models such as Logistic Regression, Random Forest, and XGB are tuned over their hyperparameter values, as shown in Table 4. Here, the optimal parameters make the learning process faster, and the learning rate helps minimize the loss function and avoid underfitting. Tuning continues until the model converges. Thus, ML models achieve their best accuracy when their hyperparameters are tuned to the best set of parameter values (Rajita et al., 2023). Random Forest outperforms the other models with an accuracy of 0.72 and an AUC score of 0.78 with a minimal set of four features: offside, side-effect, chemical substructure, and enzyme, as shown in Table 5.
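The 5-fold procedure described in Section 2.4 can be sketched in plain Python. This is an illustrative outline rather than the authors' pipeline: the `fit` and `predict` callables stand in for any classifier, the majority-class "model" is a placeholder used only to make the example runnable, and the fixed shuffling seed is an assumption.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def cross_validate(samples, labels, fit, predict, k=5):
    """Mean accuracy over k folds for user-supplied fit/predict callables."""
    fold_scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(1 for i in test if predict(model, samples[i]) == labels[i])
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / k


# Placeholder classifier: always predicts the majority class of the training set.
def fit_majority(train_samples, train_labels):
    return max(sorted(set(train_labels)), key=train_labels.count)


def predict_majority(model, sample):
    return model


splits = list(k_fold_indices(20, k=5))
```

With 20 samples each of the five folds holds 4 test samples and 16 training samples, and every sample appears in exactly one test fold.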
2.5 Severity Level of Drugs Side-Effects

The multi-class classification process classifies the severity of drug-drug interactions (DDIs) into three classes: minor, moderate, and major. By focusing on severity, healthcare professionals can prioritize their actions and interventions, leading to improved patient outcomes and better management of potential drug interactions. There are two approaches for multi-class classification: One-vs-Rest and One-vs-One. This paper chooses the One-vs-Rest strategy because it classifies data more efficiently and faster. It splits a multi-class classification into one binary classification problem per class using heuristic methods, where each classification model predicts a class membership probability or a probability-like score.

The frequency values and their corresponding probability-like scores are collected from the TWOSIDES database. The argmax of these scores (the class index with the largest score) is then used to predict a class. Thus, a mean reporting frequency in three percentage classes [33%, 66%, 100%] is fitted to the predicted scores for each frequency class, yielding a probability density function (pdf) per class. The pdfs built for each frequency class are the defined boundaries for the classification decision with maximum likelihood. The thresholds obtained are 0.00315, 0.0128, and 1. Thus, given a predicted score $x$, a frequency class is chosen using the thresholds given in Equation 1:

$$
pdf(x) =
\begin{cases}
\text{minor} & \text{if } 0 \le x \le 0.00315 \\
\text{moderate} & \text{if } 0.00315 \le x \le 0.0128 \\
\text{major} & \text{if } 0.0128 \le x \le 1
\end{cases}
\qquad (1)
$$

Table 6 infers that Drug IDs DB00231 and DB00203 have an interaction frequency of 1 (minor) for three symptoms (Arthralgia, Diarrhea, and Headache). Similarly, there are 170466 positive drug-drug interactions and a total of 11676 symptoms due to different interactions in the TWOSIDES database. The DDI events obtained from the constructed dataset are mapped with the TWOSIDES database. So, there are 18263659 drug-drug interactions for all provided symptoms in the constructed dataset. Table 7 presents the total number of drug interactions in three classes: minor, moderate, and major side-effect (symptom) frequencies.

2.6 Result Analysis

This section presents a comparison of Logistic Regression, Random Forest, and XGB (Extreme Gradient Boosting) classifiers to assess the impact of algorithmic diversity on the predictive performance for DDI. Logistic Regression is a commonly used baseline model due to its simplicity, whereas Random Forest and XGB are more complex models known for their ability to capture intricate patterns and relationships. The comparison helps benchmark the performance of more sophisticated models against a simpler one to evaluate the trade-off between model complexity and predictive accuracy. In this work, the experiments are conducted up to three times, and the reported values for each classifier are based on the average of these repetitions. This approach ensures a more accurate representation of the classifiers' performance by minimizing the impact of random fluctuations.

Thus, the paper presents the performance of different Machine Learning Binary Classifiers employing four similarity measures using the AUC curve. A higher AUC score indicates better discriminative power and overall classifier performance.
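To make the AUC criterion used in the result analysis concrete, the sketch below computes the AUC from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. It is a generic illustration with hypothetical scores, not the evaluation code behind the reported figures.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical interaction probabilities for four drug pairs.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.1]
auc = auc_score(y_true, y_score)
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.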


Table 4: Hyperparameters tuned with their initial and final values for different classifiers.

Classifier | Hyperparameter (initial values) | Final value | Epochs | Description
LR | C = [0.1, 1, 10] | C = 0.1 | 100 | C is the regularization parameter. For a higher value of C, the regularization strength decreases.
LR | Penalty = [l1, l2] | Penalty = l2 | | Penalty determines the type of regularization applied to the logistic regression model. Regularization helps prevent overfitting by adding a penalty term to the loss function.
LR | Solver = [newton-cg, lbfgs, liblinear] | Solver = liblinear | | Solver determines the algorithm used to optimize the LR model.
RF | max_depth = [30, 50] | max_depth = 30 | 50 | max_depth controls the maximum depth of each decision tree in the Random Forest.
RF | max_features = [1, 2, 3, 4] | max_features = [1] | | max_features determines the maximum number of features to consider when looking for the best split at each node of the decision tree.
RF | n_estimators = [100, 250, 500] | n_estimators = 500 | | n_estimators represents the number of decision trees to be included in the RF ensemble.
XGB | booster = [gbtree, dart] | booster = gbtree | 100 | booster as gbtree provides strong predictive power and handles non-linear relationships well.
XGB | max_depth = [30, 50] | max_depth = 50 | | max_depth determines the maximum depth of each decision tree in the boosting process.
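The search space implied by the initial values in Table 4 can be enumerated with a few lines of standard-library Python. The sketch only expands the grids and picks the best setting under a caller-supplied scoring function; the `evaluate` callable (for instance, mean cross-validated accuracy) and the toy scoring rule below are hypothetical, and the paper does not state which search routine was actually used.

```python
from itertools import product

# Candidate values copied from the "initial values" column of Table 4.
GRIDS = {
    "LR": {
        "C": [0.1, 1, 10],
        "penalty": ["l1", "l2"],
        "solver": ["newton-cg", "lbfgs", "liblinear"],
    },
    "RF": {
        "max_depth": [30, 50],
        "max_features": [1, 2, 3, 4],
        "n_estimators": [100, 250, 500],
    },
    "XGB": {
        "booster": ["gbtree", "dart"],
        "max_depth": [30, 50],
    },
}


def expand_grid(grid):
    """List every combination of the candidate values as a parameter dict."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def best_setting(grid, evaluate):
    """Return the combination with the highest score under `evaluate`."""
    return max(expand_grid(grid), key=evaluate)


# Toy scoring rule (an assumption): favour many trees and shallow depth.
def toy_score(params):
    return params["n_estimators"] - params["max_depth"]


best_rf = best_setting(GRIDS["RF"], toy_score)
```

The grids contain 18, 24 and 4 combinations for LR, RF and XGB respectively. If scikit-learn's LogisticRegression were the implementation, some of the 18 LR combinations would have to be skipped, since its newton-cg and lbfgs solvers do not accept an l1 penalty.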

Table 5: Performance analysis of combinational features for different classifiers.

Classifier | Features                                    | Accuracy | Precision | Recall | F1-score | AUC  | MCC
LR         | off sim                                     | 0.66     | 0.61      | 0.56   | 0.55     | 0.66 | 0.37
LR         | SE sim                                      | 0.67     | 0.64      | 0.55   | 0.52     | 0.65 | 0.36
LR         | Chemsub sim                                 | 0.65     | 0.65      | 0.50   | 0.51     | 0.56 | 0.36
LR         | Enzyme sim                                  | 0.65     | 0.65      | 0.51   | 0.51     | 0.63 | 0.35
LR         | [off sim, SE sim]                           | 0.67     | 0.63      | 0.59   | 0.58     | 0.69 | 0.40
LR         | [off sim, SE sim, Chemsub sim]              | 0.68     | 0.63      | 0.59   | 0.59     | 0.69 | 0.41
LR         | [off sim, SE sim, Chemsub sim, Enzyme sim]  | 0.68     | 0.65      | 0.63   | 0.64     | 0.72 | 0.44
RF         | off sim                                     | 0.65     | 0.58      | 0.54   | 0.52     | 0.63 | 0.34
RF         | SE sim                                      | 0.64     | 0.57      | 0.54   | 0.52     | 0.58 | 0.34
RF         | Chemsub sim                                 | 0.62     | 0.52      | 0.51   | 0.48     | 0.52 | 0.33
RF         | Enzyme sim                                  | 0.65     | 0.50      | 0.50   | 0.49     | 0.63 | 0.32
RF         | [off sim, SE sim]                           | 0.67     | 0.62      | 0.58   | 0.57     | 0.68 | 0.40
RF         | [off sim, SE sim, Chemsub sim]              | 0.70     | 0.67      | 0.61   | 0.61     | 0.73 | 0.46
RF         | [off sim, SE sim, Chemsub sim, Enzyme sim]  | 0.72     | 0.69      | 0.66   | 0.66     | 0.78 | 0.52
XGBoost    | off sim                                     | 0.64     | 0.56      | 0.54   | 0.53     | 0.63 | 0.36
XGBoost    | SE sim                                      | 0.66     | 0.60      | 0.54   | 0.52     | 0.60 | 0.35
XGBoost    | Chemsub sim                                 | 0.64     | 0.53      | 0.51   | 45.5     | 0.52 | 0.33
XGBoost    | Enzyme sim                                  | 0.65     | 0.50      | 0.50   | 0.46     | 0.63 | 0.32
XGBoost    | [off sim, SE sim]                           | 0.66     | 0.60      | 0.58   | 0.58     | 0.67 | 0.37
XGBoost    | [off sim, SE sim, Chemsub sim]              | 0.68     | 0.64      | 0.62   | 0.63     | 0.70 | 0.43
XGBoost    | [off sim, SE sim, Chemsub sim, Enzyme sim]  | 0.70     | 0.66      | 0.65   | 0.66     | 0.75 | 0.50
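The threshold-based columns of Table 5 (accuracy, precision, recall, F1-score and MCC) all derive from the four confusion-matrix counts of a binary classifier. The sketch below shows those standard definitions on hypothetical labels; whether the reported values were computed per class or averaged across classes is not stated in the paper, so this should be read as the plain positive-class formulation only.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MCC from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption: undefined ratios (zero denominators) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical ground truth and predictions for eight drug pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

In this example tp = tn = 3 and fp = fn = 1, so accuracy, precision, recall and F1 are all 0.75 while the MCC is (9 - 1) / 16 = 0.5, which illustrates why the MCC column in Table 5 is consistently lower than the accuracy column.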

Table 6: Frequency of Side-Effects (symptoms) Induced by mitigate overfitting contribute to its superior perfor-
Drug-Drug Interaction (DDI) Events. mance when employing the Jaccard Similarity mea-
Symptoms mean re- drug id1 drug id2 Predicted fre- sure. Based on the overall performance in the given
porting quency &
frequency severity levels machine learning models, the Jaccard coefficient is
Arthralgia 0.044872 DB00231 DB00203 1 (minor) taken to calculate the scoring function for drug sim-
Arthralgia 0.071429 DB00887 DB00107 1 (minor)
Diarrhea 0.012821 DB00231 DB00203 1 (minor) ilarity. Consequently, datasets comprising positive
Diarrhoea
Headache
0.214286
0.102564
DB00887
DB00231
DB00107
DB00203
1 (minor)
1 (minor)
samples labeled as 1 and negative samples labeled as
0 are constructed using the Jaccard similarity matrix
Table 7: Number of drug-drug interaction for different side- as discussed in Section 2.3.
effects based frequencies. Random forest gives better accuracy of 72% with
Severity Levels Number of interactions selected 4 features (offside, side-effect, chemical sub-
Major (High Frequent=3) 5965100 structure, and enzyme) as shown in Table 5. This
Moderate (Moderately Frequent=2) 6053430
Minor (Less Frequent=1) 6245129 combination of relevant features makes the model
more efficient than individual features alone. These
Jaccard similarity measure achieves a higher AUC findings highlight the importance of feature selection
score than other similarity measures such as Kulczyn- and its impact on model performance and resource
ski, Rogers Tanimoto, and Russell Rao. This supe- utilization. Figure 5 also shows an increase in the per-
riority is attributed to Jaccard’s capability to handle formance of predictive models after feature selection.
binary data and capture the presence or absence of The proposed framework also predicts the sever-
shared features between instances. Consequently, it ity levels for the given DDI events. The approach
effectively captures the similarities and differences, collects probability-like scores for frequency classes
improving classification performance. Moreover, the from the TWOSIDES database, fits probability den-
ensemble nature of Random Forest and its ability to sity functions to the scores, and uses thresholds to pre-

77
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence

tations of docking-based approaches. The network-


based approach visualizes the drug features and their
interactions in a network and helps identify more in-
teractions and their side effects (Huang et al., 2013)
(Zhang et al., 2016). Also, the Machine learning ap-
proach (Liu et al., 2012) employs different classifiers
to address the prediction problem as in this work. It
is an automated intelligent approach that requires lit-
tle supervision and comparatively less comprehensive
data (Chen and Li, 2018). There is significance in
Figure 5: Performance analysis of different classifiers be- examining different similarity measures on machine
fore and after feature selection. learning (ML) models instead of deep learning mod-
els. ML model lies in exploring and understanding
dict the severity level for given DDI events based on the effectiveness and applicability of other techniques
the highest scoring class. The severity levels are cate- in solving the specific problem of drug-drug interac-
gorized into different classes, and the predictions for tion (DDI) analysis (Liu et al., 2012). ML models,
these classes are summarized in Table 6. Additionally, notably simpler algorithms such as decision trees or
Table 7 presents the total number of drug interactions random forests, often exhibit good generalization per-
for each of the three frequency classes, providing fur- formance and are easier to implement and deploy in
ther insights into the drug data. real-world applications. This practical applicability
makes them suitable for DDI analysis tasks where in-
terpretability and efficiency are crucial. They can han-
3 COMPARATIVE ANALYSIS dle high-dimensional data efficiently, essential when
dealing with multiple similarity measures and fea-
This section presents the comparison between the pro- tures.
posed framework and other existing methods. Table
8 provides the details of the different types of tech-
niques along with the repositories for Biological (pro- 4 CONCLUSIONS
tein) data, Chemical data, and phenotypic (side effect)
data. It is observed that different methods use dif- This paper proposed an effective and robust frame-
ferent datasets for the prediction process. No stan- work to predict the potential DDIs by utilizing the
dard data set is available to compare the results of drug properties (i.e., chemical, biological, and phe-
various techniques. Thus, this paper compares the notype properties). This research compared 16 differ-
overall efficiency of different methods, the issues ad- ent similarity measures on various machine learning
dressed by them, and their limitations. The docking- models, and the results show that the Jaccard sim-
based approach predicts the side effects based on ilarity measure performed better. Feature selection
the analysis of the alignment of the drugs with the further aided in DDI prediction with minimal fea-
protein structures (Luo et al., 2011) (LaBute et al., tures. Jaccard similarity measure helped analyze pos-
2014). However, these methods do not depend on itive and negative interactions for training the mod-
experimental data that help identify novel and un- els. Thus, it detected unexpected side effects and
expected interactions. But, network-based and ma- guided drug combinations. The proposed approach
chine learning-based approaches overcome the limi- is at relatively early stage to showcase the need for

Table 8: Comparative analysis of existing methods with the proposed approach.

| References | Type of technique | Phenotypic | Protein | Drug | Limitations |
|---|---|---|---|---|---|
| (Luo et al., 2011) | Docking based | FDA and AERS information | UniProt | DrugBank | Complex task, as it involves the iterative molecular simulation of 3D structures of drugs and proteins. |
| (LaBute et al., 2014) | Docking based | SIDER | PDB | DrugBank | Insufficient validation to infer the binding strength based on the docking affinity score. |
| (Huang et al., 2013) | Network based | SIDER | PubChem | DrugBank | Pathway-based models are dependent on gene expression information. |
| (Zhang et al., 2016) | Network based | SIDER | PubChem | KEGG and DrugBank | Dependence on experimental data prevents the identification of unexpected drug target bindings. |
| (Liu et al., 2012) | Machine learning based | SIDER | KEGG and DrugBank | PubChem | The performance of the methods is limited by the diversity of compounds in the dataset, the quality of descriptors, etc. |
| (Zheng et al., 2019) | Miscellaneous | SIDER | Gene Ontology | DrugBank | The various parameters need to be specified every time. |
| Proposed approach | Machine learning based | SIDER, TWOSIDES and OFFSIDES | MACCS and DrugBank | DrugBank | The frequency of side effects is constrained to the constructed dataset. |


additional refinement in similarity measures. This paper also proposed predicting the severity levels of side effects through a multi-class classification approach. It classified drug interactions into minor (low-frequency), moderate (medium-frequency), and major (high-frequency) levels.

The authors aspire to develop more effective predictive models using deep learning methods, such as Recurrent Neural Networks (RNNs) and their variations, which could significantly contribute to the evolution of the research work. Future work could also compare the proposed approach with other existing research on the same dataset for a more comprehensive evaluation of model performance.

REFERENCES

Chen, H. and Li, J. (2018). Drugcom: Synergistic discovery of drug combinations using tensor decomposition. In 2018 IEEE International Conference on Data Mining (ICDM), pages 899–904. IEEE.
de Albuquerque, M. A., do Nascimento, E. R., de Oliveira Barros, K. N. N., and Barros, P. S. N. (2022). Comparison between similarity coefficients with application in forest sciences. Research, Society and Development, 11(2):e48511226046–e48511226046.
Ferdousi, R., Safdari, R., and Omidi, Y. (2017). Computational prediction of drug-drug interactions based on drugs functional similarities. Journal of biomedical informatics, 70:54–64.
Galeano, D., Li, S., Gerstein, M., and Paccanaro, A. (2020). Predicting the frequencies of drug side effects. Nature communications, 11(1):4575.
Han, K., Cao, P., Wang, Y., Xie, F., Ma, J., Yu, M., Wang, J., Xu, Y., Zhang, Y., and Wan, J. (2022). A review of approaches for predicting drug–drug interactions based on machine learning. Frontiers in Pharmacology, 12:3966.
Huang, L., Luo, H., Li, S., Wu, F.-X., and Wang, J. (2021). Drug–drug similarity measure and its applications. Briefings in Bioinformatics, 22(4):bbaa265.
Huang, L.-C., Wu, X., and Chen, J. Y. (2013). Predicting adverse drug reaction profiles by integrating protein interaction networks with drug structures. Proteomics, 13(2):313–324.
Ibrahim, H., El Kerdawy, A. M., Abdo, A., and Eldin, A. S. (2021). Similarity-based machine learning framework for predicting safety signals of adverse drug–drug interactions. Informatics in Medicine Unlocked, 26:100699.
Kumari, D., Yannam, P. K. R., Gohel, I. N., Naidu, M. V. S. S., Arora, Y., Rajita, B., Panda, S., and Christopher, J. (2023). Computational model for breast cancer diagnosis using hfse framework. Biomedical Signal Processing and Control, 86:105121.
LaBute, M. X., Zhang, X., Lenderman, J., Bennion, B. J., Wong, S. E., and Lightstone, F. C. (2014). Adverse drug reaction prediction using scores produced by large-scale drug-protein target docking on high-performance computing machines. PloS one, 9(9):e106298.
Liu, M., Wu, Y., Chen, Y., Sun, J., Zhao, Z., Chen, X.-w., Matheny, M. E., and Xu, H. (2012). Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. Journal of the American Medical Informatics Association, 19(e1):e28–e35.
Luo, H., Chen, J., Shi, L., Mikailov, M., Zhu, H., Wang, K., He, L., and Yang, L. (2011). Drar-cpi: a server for identifying drug repositioning potential and adverse drug reactions via the chemical–protein interactome. Nucleic acids research, 39(suppl 2):W492–W498.
Rajita, B., Tarigopula, P., Ramineni, P., Sharma, A., and Panda, S. (2023). Application of evolutionary algorithms in social networks: A comparative machine learning perspective. New Generation Computing, 41(2):401–444.
Seo, S., Lee, T., Kim, M.-h., and Yoon, Y. (2020). Prediction of side effects using comprehensive similarity measures. BioMed research international, 2020.
Tatonetti, N. P., Ye, P. P., Daneshjou, R., and Altman, R. B. (2012). Data-driven prediction of drug effects and interactions. Science translational medicine, 4(125):125ra31–125ra31.
Vilar, S., Uriarte, E., Santana, L., Lorberbaum, T., Hripcsak, G., Friedman, C., and Tatonetti, N. P. (2014). Similarity-based modeling in large-scale prediction of drug-drug interactions. Nature protocols, 9(9):2147–2163.
Wu, L., Wen, Y., Leng, D., Zhang, Q., Dai, C., Wang, Z., Liu, Z., Yan, B., Zhang, Y., Wang, J., et al. (2022). Machine learning methods, databases and tools for drug combination prediction. Briefings in bioinformatics, 23(1):bbab355.
Zhang, W., Zou, H., Luo, L., Liu, Q., Wu, W., and Xiao, W. (2016). Predicting potential side effects of drugs by recommender methods and ensemble learning. Neurocomputing, 173:979–987.
Zheng, Y., Peng, H., Ghosh, S., Lan, C., and Li, J. (2019). Inverse similarity and reliable negative samples for drug side-effect prediction. BMC bioinformatics, 19(13):91–104.
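
As a supplementary illustration of the Jaccard similarity measure that the conclusions single out, the following minimal Python sketch computes the similarity between two drugs represented as sets of active binary features (for example, fingerprint bits or side-effect codes). The function name, the example feature indices, and the convention of returning 0.0 for two empty profiles are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
from typing import Iterable, Set


def jaccard_similarity(a: Iterable[int], b: Iterable[int]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a: Set[int] = set(a)
    set_b: Set[int] = set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty feature profiles score 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical drugs, each described by the indices of the binary
# features that are switched on. The values are illustrative only.
drug_a = {1, 4, 7, 9}
drug_b = {1, 4, 8}

similarity = jaccard_similarity(drug_a, drug_b)
print(f"Jaccard similarity: {similarity:.2f}")
```

Here the two hypothetical drugs share 2 of 5 distinct features, so the score is 0.40. In a similarity-based pipeline, such pairwise scores would typically serve as input features for the classifiers.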
