A Study on Drug Similarity Measures for Predicting Drug-Drug Interactions and Severity Using Machine Learning Techniques
Deepa Kumari^a, Antony Joseph K, Pranay Tarigopula, Rohith Kumar Gattu, Maithili Seemakurthi,
Subhrakanta Panda and Jabez Christopher
CSIS Department, BITS Pilani, Hyderabad Campus, Shameerpet, Hyderabad, India
{p20190020, f20190620, f20180237, f20190049, f20190377, spanda, jabezc}@hyderabad.bits-pilani.ac.in
Abstract: Drug-drug interactions (DDIs) can lead to adverse reactions by decreasing the absorption rate in a patient's body. The existing literature has paid limited attention to the impact of different similarity measures on DDI effects. This paper analyzes seven drug features (chemical substructures, targets, transporters, enzymes, side-effects, offsides, and carriers) obtained from the DrugBank, SIDER, TWOSIDES, and OFFSIDES databases to analyze DDIs. It examines five Machine Learning models (Logistic Regression, Random Forest, Decision Tree, KNN, ANN) on 16 different similarity measures and compares their predictive performance through accuracy and AUC analysis. The Jaccard similarity is chosen for further DDI prediction as it gives the best similarity score. Feature selection (using Chi-Square) further reduces the time and space complexity. The study compares combinations of the selected features (chemical substructures, side-effects, offsides, enzymes) on Logistic Regression, Random Forest, and XGB classifiers. The results show that the Random Forest classifier predicts DDIs with the best accuracy of 72%. The approach also uniquely categorizes the severity level of side effects (minor, moderate, and major) due to DDI events through multi-class classification. Thus, it offers greater clinical significance and can help fast-track clinical trials.
^a https://orcid.org/0000-0002-0696-9790

Kumari, D.
A Study on Drug Similarity Measures for Predicting Drug-Drug Interactions and Severity Using Machine Learning Techniques.
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024) - Volume 3, pages 72-79
ISBN: 978-989-758-680-4; ISSN: 2184-433X
Proceedings Copyright © 2024 by SCITEPRESS – Science and Technology Publications, Lda.

1 INTRODUCTION
Drugs are critical in treating diseases and sustaining healthy lifestyles (Huang et al., 2021). But drugs can interfere with other drugs (called Drug-Drug Interactions (DDIs)) during treatment and cause serious health complications (Seo et al., 2020). The occurrence of DDIs may lead to various Adverse Drug Reactions (ADRs) that cause unavoidable detrimental consequences and high costs for health service providers and hospitals (Liu et al., 2012) (Galeano et al., 2020). However, the processes involved in drug-drug interaction detection are costly and time-consuming but crucial for drug research and development (Han et al., 2022) (Ferdousi et al., 2017). The complex nature of DDIs makes them extremely difficult to predict, while ADRs are expensive to diagnose and practically hard to treat. In drug development and identification of DDIs, several computational approaches have successfully been used (Wu et al., 2022).
The proposed approach is framed as a Drug-Drug Interaction (DDI) prediction problem, where DDI refers to the featured matrix network $M = \{D, E, F\}$. Here, $D = \{d_l\}_{l=1}^{N}$ is the set of drugs, where $l$ indexes the $N$ nodes. $E \in \{0, 1\}^{N \times N}$ encodes the existence of drug interactions, where $a_{mn}$ is the entry of matrix $E$ at the $m$th row and $n$th column and shows an interaction between drugs $d_m$ and $d_n$. So, $a_{mn} = 1$ indicates the existence of an interaction, and $a_{mn} = 0$ denotes the absence of an interaction. $F \in \mathbb{R}^{N \times P}$ represents the drug features matrix, where $P$ is the dimension of the features. $f_m \in \mathbb{R}^{1 \times P}$ corresponds to the $m$th row of matrix $F$, which is the feature vector of drug $d_m$. With the node feature matrix $F$ and adjacency matrix $E$, this research aims to study the following DDI prediction problems.
• Binary DDI Prediction: The binary DDI prediction is crucial to quickly ascertain whether an interaction between a pair of drugs $(d_m, d_n)$ exists or not. It is useful in terms of computational resources and time, especially when dealing with a large number of drug pairs. Formally, it is to learn a mapping from $f_{in}(d_m, d_n)$ to $Interact_{mn} \in [0, 1]$. Here, $Interact_{mn}$ indicates the interaction probability of $(d_m, d_n)$.
• Multi-class DDI side-effects Prediction: It is to predict the specific interaction type of the drug pair $(d_m, d_n)$ based on drug interactions. Computationally, it is to learn a mapping $\mathcal{F} : D \times D \to S_D$, where $S_D$ represents the degree of severity of side effects.
For a more transparent visual representation, Figure 1 presents a subset of 15 drugs depicted as nodes, along with their corresponding interactions showcased as edges. Their interaction severity is shown in 3 colors: blue for low, green for moderate, and red for high severity levels. For example, Figure 1 shows that Drug ID DB00783 and DB00316 interact with moderate severity, whereas Drug ID DB00361 and DB01263 interact with high severity risks. Thus, this paper offers valuable insights into potential risks and their implications.

Figure 1: Drug-Drug Interaction with its severity levels graph.

The organization of the remaining paper is as follows: Section 2 presents the Methodology. Section 3 explains the comparative analysis. Section 4 concludes the work along with the future work.

2 METHODOLOGY AND RESULTS
The proposed framework is implemented on a server with 64 GB of RAM and an Intel(R) Core(TM) i9-7980XE CPU @ 2.60 GHz (18 Cores, 36 Threads). The code is deployed on PyCharm version 3.5 and uses Power BI packages. Figure 2 shows the workflow of the proposed framework.

2.1 Dataset
The proposed framework uses drug datasets from DRUGBANK (version 5.1.9) (Wu et al., 2022), SIDER (Seo et al., 2020), OFFSIDES, & TWOSIDES (Tatonetti et al., 2012). It uses only approved drugs containing biological, chemical, and phenotypic data.
Biological data includes lists of drug-carrier pairs, drug-target pairs, drug-enzyme pairs, and drug-transporter pairs. These lists are the basis for constructing a feature space corresponding to the four types of binary fingerprints of the biological elements: carrier, target, enzyme and transporter (Liu et al., 2012). The length of the bit vectors for carrier, target, enzyme, and transporter features is 78, 2856, 434, and 273, respectively.
Chemical data consists of the 2D chemical structures of the same drug list, which are considered a drug feature. Chemical substructure information is retrieved from the PubChem database in SMILES (Simplified Molecular Input Line Entry System) format using MOE 2010.10 software. Then, MACCS (Molecular ACCess System) substructures are calculated with 166 key descriptor bits. This work used MACCS because of its availability in cheminformatics software libraries or databases, and promising performance in capturing relevant substructure information required for predicting DDI (Ibrahim et al., 2021).
The phenotypic data of drugs are also essential in predicting DDIs. Drug indications, side effects, and offside effects constitute the phenotypic data of drugs. The framework extracts drug indications and side effects from SIDER and offside effects from OFFSIDES. It creates a comprehensive drug dataset by merging and intersecting these diverse datasets.

Table 1: Performance of similarity measures.
Similarity Measures   LR     DT     RF     KNN    NN
Bray                  0.62   0.63   0.63   0.64   0.62
Dice                  0.64   0.63   0.63   0.64   0.64
Jaccard               0.64   0.64   0.63   0.64   0.64
Hamming               0.56   0.60   0.61   0.58   0.55
Russell Rao           0.63   0.64   0.64   0.64   0.64
Faith                 0.55   0.60   0.61   0.59   0.52
Gower                 0.56   0.60   0.61   0.59   0.53
Sokal Michener        0.56   0.60   0.61   0.59   0.52
Ample                 0.58   0.60   0.60   0.61   0.60
Anderberg             0.60   0.61   0.60   0.61   0.61
Baroni                0.63   0.63   0.63   0.64   0.63
Kulczynski            0.64   0.63   0.64   0.64   0.64
Goodman               0.61   0.61   0.61   0.61   0.61
Rogers Tanimoto       0.64   0.63   0.63   0.64   0.64
Yule                  0.58   0.63   0.64   0.64   0.58
Inner Product         0.54   0.60   0.61   0.58   0.52

2.2 Similarity Measures
Similarity measures are numerical quantities that quantify the degree of association between pairs of drugs. A quantity $sim_{ij}$ is considered a measure of similarity if, for every $d_i, d_j \in D$, it satisfies the following properties: $0 \le sim_{ij} \le 1$ for $i \ne j$, $sim_{ii} = 1$, and $sim_{ij} = sim_{ji}$. Even though numerous binary similarity measures exist in the literature, only a few similarity measures are in use (Ibrahim et al., 2021) (Huang et al., 2021). Different similarity-based ML methods help predict DDI through binary classification (Wu et al., 2022) (Vilar et al., 2014).
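The defining properties of a similarity measure (values bounded in [0, 1], a value of 1 for a drug compared with itself, and symmetry) can be checked numerically. The sketch below is a minimal illustration rather than the authors' implementation. It assumes the standard Jaccard coefficient p / (p + q + r), where p counts features present in both fingerprints and q and r count features present in only one of them. The drug names, the toy fingerprints, and the helper names `jaccard` and `similarity_matrix` are hypothetical.

```python
def jaccard(v_i, v_j):
    """Jaccard similarity p / (p + q + r) for two equal-length 0/1 vectors."""
    if len(v_i) != len(v_j):
        raise ValueError("fingerprints must have the same length")
    pairs = list(zip(v_i, v_j))
    p = sum(1 for a, b in pairs if a == 1 and b == 1)  # present in both
    q = sum(1 for a, b in pairs if a == 0 and b == 1)  # present only in v_j
    r = sum(1 for a, b in pairs if a == 1 and b == 0)  # present only in v_i
    # Assumption of this sketch: two all-zero fingerprints count as identical.
    if p + q + r == 0:
        return 1.0
    return p / (p + q + r)


def similarity_matrix(fps):
    """Return {(drug_a, drug_b): similarity} for every ordered pair of drugs."""
    names = sorted(fps)
    return {(a, b): jaccard(fps[a], fps[b]) for a in names for b in names}


# Hypothetical toy fingerprints (not taken from DrugBank or SIDER).
fingerprints = {
    "drug_A": [1, 1, 0, 0, 1],
    "drug_B": [1, 0, 0, 1, 1],
    "drug_C": [0, 0, 1, 0, 0],
}

sim = similarity_matrix(fingerprints)
for (a, b), value in sorted(sim.items()):
    print(f"{a} vs {b}: {value:.2f}")
```

In this toy example drug_A and drug_B share two of the four features that occur in either fingerprint, which gives a similarity of 0.5, while drug_A and drug_C share none. A full pairwise matrix of this kind is what a classifier would consume as input; for the bit-vector lengths reported in Section 2.1 a vectorized implementation would likely be preferable.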
This paper implements 16 binary similarity measures for analyzing their performance on different classifiers, as shown in Table 2, where the following notation is used:
$v_i$, $v_j$ ← two row-vectors, indexed by $i$ and $j$, whose variables take the value 1 (present) or 0 (absent)
p ← number of features where $v_i = 1$ and $v_j = 1$
q ← number of features where $v_i = 0$ and $v_j = 1$
r ← number of features where $v_i = 1$ and $v_j = 0$
s ← number of features where $v_i = 0$ and $v_j = 0$
σ ← observed agreement or similarity between two sets of drug interactions
σ′ ← expected agreement to occur randomly between two sets of drug interactions
p+s ← total number of matches between $v_i$ and $v_j$
q+r ← total number of mismatches between $v_i$ and $v_j$
M ← Similarity values
The binary similarity determines the analysis properties of the similarity and dissimilarity coefficients (de Albuquerque et al., 2022). The choice of the correct coefficients and the variables depends on the best performance of similarity measures on different classifiers, as shown in Table 1. Out of these, the classifier that results in the highest performance is chosen to predict the candidate side effects of drugs as shown in Table 3.

Table 3: Performance analysis of different similarity measures on different classifiers.
Similarity Measures   Classifiers   Accuracy   Precision   Recall   F1-score   AUC
Jaccard               LR            0.64       0.64        0.65     0.64       0.68
Jaccard               DT            0.64       0.66        0.64     0.63       0.69
Jaccard               RF            0.63       0.65        0.64     0.64       0.70
Jaccard               KNN           0.64       0.65        0.65     0.64       0.69
Jaccard               NN            0.64       0.65        0.65     0.64       0.65
Russell Rao           LR            0.63       0.62        0.62     0.62       0.68
Russell Rao           DT            0.64       0.66        0.65     0.64       0.69
Russell Rao           RF            0.64       0.65        0.65     0.64       0.69
Russell Rao           KNN           0.64       0.65        0.65     0.64       0.69
Russell Rao           NN            0.64       0.65        0.65     0.64       0.64
Kulczynski            LR            0.64       0.66        0.65     0.64       0.69
Kulczynski            DT            0.63       0.65        0.64     0.63       0.69
Kulczynski            RF            0.64       0.65        0.65     0.64       0.69
Kulczynski            KNN           0.64       0.65        0.65     0.64       0.69
Kulczynski            NN            0.64       0.66        0.65     0.64       0.65
Rogers Tanimoto       LR            0.64       0.65        0.64     0.63       0.68
Rogers Tanimoto       DT            0.63       0.65        0.64     0.63       0.69
Rogers Tanimoto       RF            0.63       0.65        0.64     0.64       0.65
Rogers Tanimoto       KNN           0.64       0.65        0.65     0.64       0.69
Rogers Tanimoto       NN            0.64       0.65        0.64     0.64       0.51

2.3 Drug-Drug Interactions (DDI)
This paper uses the Chi-square test, a simple tool for univariate feature selection for classification. The threshold for feature selection is the mean of the summed chi-squared values, i.e. $\text{threshold} = \frac{\text{sum of chi-squared values}}{\text{total number of features}} = \frac{1}{7} \approx 0.14$. Figure 3 shows that only four binary similarity values, namely offside (Off Sim), side effect (SE Sim), chemical substructure (Chemsub Sim), and enzyme (Enzyme Sim), are above the threshold.

Figure 3: Chi-square test on Features.

Each drug is coded into a binary vector in which every bit indicates whether or not the drug is associated with another drug. If a drug is associated with another drug, the corresponding bit becomes 1; otherwise, it is 0. Drug similarities are evaluated based on DDI information from DrugBank and the interaction information with standard similarity calculation methods. Figure 4 infers that Jaccard gives better accuracy with Random Forest than other similarity measures. DrugBank gives 13910 extracted drugs with a total of 2682157 interactions. Of these, only 4107 drugs are approved, which reduces the total number of interactions to 1889983. After filtering for duplicate interactions (such as $d_{ij}$ and $d_{ji}$), the total number of interactions becomes 1341086. The common drugs with all four features (Off Sim, SE Sim, Chemsub Sim, Enzyme Sim) come down to 816 drugs, and the total number of interactions becomes 260301.
The positive samples for 816 drugs are calculated using the Jaccard similarity measure, where positive interactions are 144282 and unlabeled interactions are 116019. Unlabeled drug interactions are labeled by mapping drugs from the SIDER database. Thus, unlabeled interactions are converted into negative and positive interactions. Here, 89835 interactions are considered negative samples, and 26184 are considered positive. Hence, the total number of positive interactions increased to 170466.

2.4 Predictive Models
Predictive models require little computation time and supervision (Wu et al., 2022) (Kumari et al., 2023). The performance of the models is compared using metrics such as accuracy, precision, recall, F1 score, AUC score and Matthews Correlation Coefficient (MCC). The proposed experiment follows 5-fold cross-validation, which gives a more robust evaluation of the model's performance than a single train-test split. Each fold contains an equal number of samples. In each iteration, one fold is held out as the test set, while the remaining four folds are combined to form the training set. This mitigates the impact of the data's initial distribution and provides a more representative estimate of the model's ability to generalize to unseen data. Then, aggregated similarity matrices (Off Sim,
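The 5-fold protocol described in Section 2.4 can be sketched in a few lines. The code below is only an illustration under stated assumptions: the sample count of 20, the seed, and the helper names `k_fold_indices` and `accuracy_and_mcc` are hypothetical, and no classifier is trained here. It shows how equal-sized folds are formed, and how accuracy and the Matthews Correlation Coefficient follow from the confusion counts of a binary prediction.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    if n_samples % k != 0:
        raise ValueError("this sketch assumes n_samples is divisible by k")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds ignore input order
    fold_size = n_samples // k
    for f in range(k):
        start, stop = f * fold_size, (f + 1) * fold_size
        test_idx = indices[start:stop]                 # held-out fold
        train_idx = indices[:start] + indices[stop:]   # remaining k - 1 folds
        yield train_idx, test_idx


def accuracy_and_mcc(y_true, y_pred):
    """Accuracy and Matthews Correlation Coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 when undefined
    return accuracy, mcc


# Hypothetical example: 20 drug pairs split into 5 folds of 4 pairs each.
folds = list(k_fold_indices(20, k=5))
for fold_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_no}: train={len(train_idx)} test={len(test_idx)}")
```

In an actual experiment, a classifier such as Random Forest would be fitted on each training split and scored on the matching held-out fold, with the per-fold metrics averaged at the end.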
Table 4: Hyperparameters tuned with their initial and final values for different classifiers.
Classifier   Hyperparameters   Epochs   Descriptions

Table 6: Frequency of Side-Effects (symptoms) Induced by Drug-Drug Interaction (DDI) Events.
Symptoms     Mean reporting frequency   drug id1   drug id2   Predicted frequency & severity levels
Arthralgia   0.044872                   DB00231    DB00203    1 (minor)
Arthralgia   0.071429                   DB00887    DB00107    1 (minor)
Diarrhea     0.012821                   DB00231    DB00203    1 (minor)
Diarrhoea    0.214286                   DB00887    DB00107    1 (minor)
Headache     0.102564                   DB00231    DB00203    1 (minor)

Table 7: Number of drug-drug interactions for different side-effect-based frequencies.
Severity Levels                       Number of interactions
Major (High Frequent=3)               5965100
Moderate (Moderately Frequent=2)      6053430
Minor (Less Frequent=1)               6245129

Jaccard similarity measure achieves a higher AUC score than other similarity measures such as Kulczynski, Rogers Tanimoto, and Russell Rao. This superiority is attributed to Jaccard's capability to handle binary data and capture the presence or absence of shared features between instances. Consequently, it effectively captures the similarities and differences, improving classification performance. Moreover, the ensemble nature of Random Forest and its ability to mitigate overfitting contribute to its superior performance when employing the Jaccard similarity measure. Based on the overall performance in the given machine learning models, the Jaccard coefficient is taken to calculate the scoring function for drug similarity. Consequently, datasets comprising positive samples labeled as 1 and negative samples labeled as 0 are constructed using the Jaccard similarity matrix as discussed in Section 2.3.
Random Forest gives the best accuracy of 72% with the 4 selected features (offside, side-effect, chemical substructure, and enzyme), as shown in Table 5. This combination of relevant features makes the model more efficient than individual features alone. These findings highlight the importance of feature selection and its impact on model performance and resource utilization. Figure 5 also shows an increase in the performance of predictive models after feature selection.
The proposed framework also predicts the severity levels for the given DDI events. The approach collects probability-like scores for frequency classes from the TWOSIDES database, fits probability density functions to the scores, and uses thresholds to pre-
additional refinement in similarity measures. This paper also proposed predicting the severity levels of side effects through a multi-class classification approach. It classified drug interactions into minor (low-frequency), moderate (medium-frequency), and major (high-frequency) levels.
The authors aspire to develop more effective predictive models using Deep Learning methods, such as Recurrent Neural Networks (RNNs) and their variations, which could significantly contribute to the evolution of the research work. Future work could also compare the approach with other existing research on the same dataset for a more comprehensive evaluation of model performance.

REFERENCES

Chen, H. and Li, J. (2018). Drugcom: Synergistic discovery of drug combinations using tensor decomposition. In 2018 IEEE International Conference on Data Mining (ICDM), pages 899–904. IEEE.
de Albuquerque, M. A., do Nascimento, E. R., de Oliveira Barros, K. N. N., and Barros, P. S. N. (2022). Comparison between similarity coefficients with application in forest sciences. Research, Society and Development, 11(2):e48511226046–e48511226046.
Ferdousi, R., Safdari, R., and Omidi, Y. (2017). Computational prediction of drug-drug interactions based on drugs functional similarities. Journal of biomedical informatics, 70:54–64.
Galeano, D., Li, S., Gerstein, M., and Paccanaro, A. (2020). Predicting the frequencies of drug side effects. Nature communications, 11(1):4575.
Han, K., Cao, P., Wang, Y., Xie, F., Ma, J., Yu, M., Wang, J., Xu, Y., Zhang, Y., and Wan, J. (2022). A review of approaches for predicting drug–drug interactions based on machine learning. Frontiers in Pharmacology, 12:3966.
Huang, L., Luo, H., Li, S., Wu, F.-X., and Wang, J. (2021). Drug–drug similarity measure and its applications. Briefings in Bioinformatics, 22(4):bbaa265.
Huang, L.-C., Wu, X., and Chen, J. Y. (2013). Predicting adverse drug reaction profiles by integrating protein interaction networks with drug structures. Proteomics, 13(2):313–324.
Ibrahim, H., El Kerdawy, A. M., Abdo, A., and Eldin, A. S. (2021). Similarity-based machine learning framework for predicting safety signals of adverse drug–drug interactions. Informatics in Medicine Unlocked, 26:100699.
Kumari, D., Yannam, P. K. R., Gohel, I. N., Naidu, M. V. S. S., Arora, Y., Rajita, B., Panda, S., and Christopher, J. (2023). Computational model for breast cancer diagnosis using hfse framework. Biomedical Signal Processing and Control, 86:105121.
LaBute, M. X., Zhang, X., Lenderman, J., Bennion, B. J., Wong, S. E., and Lightstone, F. C. (2014). Adverse drug reaction prediction using scores produced by large-scale drug-protein target docking on high-performance computing machines. PloS one, 9(9):e106298.
Liu, M., Wu, Y., Chen, Y., Sun, J., Zhao, Z., Chen, X.-w., Matheny, M. E., and Xu, H. (2012). Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. Journal of the American Medical Informatics Association, 19(e1):e28–e35.
Luo, H., Chen, J., Shi, L., Mikailov, M., Zhu, H., Wang, K., He, L., and Yang, L. (2011). Drar-cpi: a server for identifying drug repositioning potential and adverse drug reactions via the chemical–protein interactome. Nucleic acids research, 39(suppl 2):W492–W498.
Rajita, B., Tarigopula, P., Ramineni, P., Sharma, A., and Panda, S. (2023). Application of evolutionary algorithms in social networks: A comparative machine learning perspective. New Generation Computing, 41(2):401–444.
Seo, S., Lee, T., Kim, M.-h., and Yoon, Y. (2020). Prediction of side effects using comprehensive similarity measures. BioMed research international, 2020.
Tatonetti, N. P., Ye, P. P., Daneshjou, R., and Altman, R. B. (2012). Data-driven prediction of drug effects and interactions. Science translational medicine, 4(125):125ra31–125ra31.
Vilar, S., Uriarte, E., Santana, L., Lorberbaum, T., Hripcsak, G., Friedman, C., and Tatonetti, N. P. (2014). Similarity-based modeling in large-scale prediction of drug-drug interactions. Nature protocols, 9(9):2147–2163.
Wu, L., Wen, Y., Leng, D., Zhang, Q., Dai, C., Wang, Z., Liu, Z., Yan, B., Zhang, Y., Wang, J., et al. (2022). Machine learning methods, databases and tools for drug combination prediction. Briefings in bioinformatics, 23(1):bbab355.
Zhang, W., Zou, H., Luo, L., Liu, Q., Wu, W., and Xiao, W. (2016). Predicting potential side effects of drugs by recommender methods and ensemble learning. Neurocomputing, 173:979–987.
Zheng, Y., Peng, H., Ghosh, S., Lan, C., and Li, J. (2019). Inverse similarity and reliable negative samples for drug side-effect prediction. BMC bioinformatics, 19(13):91–104.