Semantic Similarity Between Medium-Sized Texts
1 Introduction
Semantically comparing texts involves analyzing their meaning in context and is useful for a variety of applications, such as sentiment analysis, plagiarism detection, or content analysis. Education is no stranger to this field: exams and other activities must be graded, a necessary task to validate the knowledge of the examinees.
Current natural language processing techniques allow us to capture the semantic meaning of sentences, so that we can detect similarity between sentences that are written differently but share similar semantics. While English may be more advanced in this regard, work on other languages such as Spanish remains comparatively scarce.
2 State of the Art
This section reviews recent progress in the computation of semantic similarity between texts, from classical approaches, such as those based on the similarity of vectors produced by probabilistic latent semantic analysis, generally using cosine distance as the similarity metric [13], or variants of word-embedding models [3], to newer approaches enabled by machine learning. With machine learning, much work has taken the context of the text into account when performing NLP tasks, and the computation of semantic similarity is no exception to this trend. The first such approaches applied recurrent networks, such as long short-term memory (LSTM) networks [13], which were able to learn long-term dependencies in text, until the introduction of the Transformer architecture [11]. The Transformer marked a turning point in NLP: its novelty is the replacement of recurrent layers with so-called attention layers. These layers remove the recursion that LSTMs rely on, so sentences are processed as a whole (with positional encoding) instead of word by word. This reduces complexity and allows parallelization, improving computational efficiency.
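As an illustration of the cosine metric mentioned above, the following minimal Python sketch (generic, not tied to any particular embedding model) computes the cosine similarity between two embedding vectors:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for sentence embeddings.
u = np.array([0.2, 0.7, 0.1])
v = np.array([0.25, 0.6, 0.2])
print(cosine_similarity(u, v))  # values close to 1 indicate high similarity
```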
Among the Transformer-based models most relevant to this work are:
– BERT [5]: a model that represented a turning point in NLP [3], using the Transformer architecture to obtain a better semantic representation of the whole text.
– mBERT [5]: an extension of the original BERT with multilingual support for a total of 104 languages.
– BETO [4]: an NLP model trained exclusively for Spanish. In the study by Cañete et al. [4], it obtained better results than mBERT [5].
Fig. 1. SBERT architectures for classification tasks (left) and semantic comparison tasks (right). Source: [9]
Several commercial APIs also offer semantic-similarity services:
– Retina API [2]: an API from the company Cortical.io with support for 50 languages, which provides different similarity measures, such as cosine, Euclidean, and Jaccard distances, as well as some proprietary metrics.
– Dandelion API [1]: a product of an Italian startup offering several natural language processing APIs, including language detection and semantic similarity.
3 Methodology
1. Analysis of the data source used in the project. This is a set of questions and answers from two subjects taught at Universidad Internacional de La Rioja (UNIR). The dataset contains 240 records, with an average teacher response length of 130 words and an average student response length of 150 words; in some cases the student's response reaches 405 words.
2. Identification of the NLP Transformer models to use. Although the architecture used to specialize the models has always been that of Sentence Transformers, we used four existing Siamese Transformer models and two models whose Siamese architecture was built from scratch, as shown in Fig. 2. A maximum input size of 256 words was defined.
These models were chosen based on their support for Spanish or multiple languages and/or the number of tokens they allow. Table 1 shows the selected models, where type S-ST denotes a new Siamese Transformer built from scratch and E-ST denotes an existing Siamese Transformer model. Although the average student response in our dataset exceeds 128 words, models with this token limit were still chosen, since they are multilingual and the teacher's average response is 130 words. It is also worth mentioning that when a sentence exceeds the maximum input size allowed by the model, it is truncated, which affects the training of the models. In this study we prioritized the average size of the teacher's response, on the understanding that a correct response should have approximately that size.
3. Specialization of the NLP models through a fine-tuning phase. For the training process, the open-source library Hugging Face (https://huggingface.com) was used. Each of the Siamese models described above was trained for 1, 5, 10, 30, 50, and 100 epochs. In addition to the number of epochs, the other parameters used are:
– train loss: as the loss function we selected cosine similarity. A sketch of this setup is shown after this list.
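The following is a minimal sketch, assuming the sentence-transformers library and the public BETO checkpoint dccuchile/bert-base-spanish-wwm-cased, of how a from-scratch Siamese model with an extra dense layer can be assembled and fine-tuned with a cosine-similarity loss. Paths, example texts, and labels are illustrative, not the authors' actual code:

```python
# A minimal sketch (not the authors' exact code) of assembling a Siamese
# sentence-transformer from scratch and fine-tuning it with a cosine loss.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Base encoder with the paper's maximum input size of 256 tokens.
word_embedding = models.Transformer("dccuchile/bert-base-spanish-wwm-cased",
                                    max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
# Extra dense layer, as in the from-scratch Siamese models (see Fig. 2).
dense = models.Dense(in_features=pooling.get_sentence_embedding_dimension(),
                     out_features=256)
model = SentenceTransformer(modules=[word_embedding, pooling, dense])

# Hypothetical training pair: (teacher answer, student answer) with a
# similarity label in [0, 1] derived from the teacher's grade.
train_examples = [
    InputExample(texts=["respuesta del profesor", "respuesta del alumno"],
                 label=0.9),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss realizes the Siamese setup: both texts pass through
# the same network and their cosine similarity is regressed onto the label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=50)
```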
4 Results
Evaluation 1. Considering the Pearson correlation (see Table 2), better results are obtained with the models pretrained for the Spanish language, and better still with the BETO-based model, which was designed specifically for Spanish rather than being multilingual.
Observing the Spearman correlation (see Table 3), although its coefficients are not as close to 1, the behavior is similar to that seen with the Pearson coefficient: results improve when we use multilingual models and are best for the BETO-based model.
Table 2. Pearson correlation coefficients for the six models (columns A–F) by number of training epochs.
Epochs  A     B     C     D     E     F
1       0.76  0.72  0.67  0.76  0.77  0.80
5       0.75  0.81  0.66  0.76  0.77  0.81
10      0.76  0.79  0.69  0.81  0.77  0.78
30      0.77  0.81  0.72  0.78  0.80  0.80
50      0.76  0.78  0.73  0.75  0.82  0.81
100     0.77  0.81  0.74  0.77  0.77  0.82
Table 3. Spearman correlation coefficients for the six models (columns A–F) by number of training epochs.
Epochs  A     B     C     D     E     F
1       0.39  0.32  0.13  0.31  0.44  0.52
5       0.35  0.56  0.15  0.45  0.43  0.50
10      0.36  0.49  0.15  0.59  0.51  0.43
30      0.43  0.63  0.32  0.52  0.59  0.51
50      0.37  0.48  0.33  0.44  0.63  0.56
100     0.38  0.66  0.37  0.48  0.43  0.54
Considering the models trained with 50 epochs as those offering the best balance between results and computational cost, the correlation plots including the Pearson, Spearman, and Kendall coefficients for these models are shown in Fig. 3.
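The three coefficients can be computed with SciPy as in the following sketch (the grade and similarity arrays are illustrative placeholders, not the paper's data):

```python
# Sketch: computing the three correlation coefficients with SciPy.
# teacher_grades and model_similarities stand in for the normalized
# teacher grades and the model's similarity scores.
from scipy.stats import pearsonr, spearmanr, kendalltau

teacher_grades = [0.9, 0.5, 0.7, 1.0, 0.3]
model_similarities = [0.85, 0.55, 0.60, 0.95, 0.40]

print("Pearson: ", pearsonr(teacher_grades, model_similarities)[0])
print("Spearman:", spearmanr(teacher_grades, model_similarities)[0])
print("Kendall: ", kendalltau(teacher_grades, model_similarities)[0])
```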
Evaluation 3. Negations
We studied semantic similarity in those cases where one of the texts to be compared is the negation of the other. Since the goal is to grade an answer, a negation can completely change the grade. For example, for the question "Is Spain on the European continent?", the student's answer could be "Spain is on the European continent" or "Spain is not on the European continent". Both sentences are very similar in form but mean the complete opposite. Analyzing their semantic similarity with the trained BETO-based models returns a score of 0.783, a value suggesting that these texts have much in common in terms of semantic meaning.
As an extension of this point, an affirmation and a negation can also appear within the same sentence.
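A sketch of this negation check, assuming a locally saved fine-tuned model (the path and the use of the example sentences in Spanish are hypothetical):

```python
# Sketch of the negation experiment; the model path is hypothetical and
# stands for one of the fine-tuned BETO-based Siamese models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("./beto-siamese-finetuned")  # hypothetical path

affirmation = "España está en el continente europeo"
negation = "España no está en el continente europeo"

embeddings = model.encode([affirmation, negation], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.3f}")  # the paper reports approx. 0.783
```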
Fig. 3. Pearson, Spearman, and Kendall correlation for models trained with 50 Epochs
Table 4. MAE, MSE, RMSE, RMSLE, and R2 metrics of the models obtained.
all-distilroberta-v1
epochs MAE MSE RMSE RMSLE R2
1 0.149 0.033 0.182 0.012 0.575
5 0.156 0.036 0.190 0.012 0.537
10 0.151 0.035 0.186 - 0.555
30 0.142 0.032 0.178 0.011 0.592
50 0.149 0.033 0.182 - 0.573
100 0.145 0.032 0.180 0.011 0.585
distiluse-base-multilingual-cased-v1
epochs MAE MSE RMSE RMSLE R2
1 0.177 0.044 0.209 - 0.438
5 0.137 0.028 0.167 - 0.642
10 0.144 0.030 0.173 0.011 0.616
30 0.129 0.027 0.165 - 0.652
50 0.141 0.030 0.175 0.011 0.609
100 0.129 0.027 0.164 - 0.653
paraphrase-multilingual-MiniLM-L12-v2
epochs MAE MSE RMSE RMSLE R2
1 0.170 0.045 0.213 0.016 0.418
5 0.172 0.046 0.213 0.016 0.415
10 0.165 0.043 0.207 0.015 0.450
30 0.153 0.039 0.197 - 0.503
50 0.149 0.037 0.193 0.013 0.524
100 0.152 0.037 0.191 - 0.529
paraphrase-multilingual-mpnet-base-v2
epochs MAE MSE RMSE RMSLE R2
1 0.153 0.037 0.191 0.014 0.531
5 0.158 0.035 0.186 0.012 0.556
10 0.141 0.029 0.169 - 0.633
30 0.148 0.033 0.182 - 0.575
50 0.154 0.037 0.193 0.013 0.520
100 0.143 0.033 0.182 - 0.576
mbert
epochs MAE MSE RMSE RMSLE R2
1 0.183 0.060 0.244 - 0.235
5 0.188 0.064 0.252 - 0.182
10 0.191 0.062 0.249 - 0.201
30 0.134 0.029 0.171 0.010 0.626
50 0.133 0.028 0.169 - 0.635
100 0.186 0.063 0.251 0.021 0.190
beto
epochs MAE MSE RMSE RMSLE R2
1 0.154 0.033 0.182 0.012 0.574
5 0.139 0.027 0.165 0.009 0.652
10 0.139 0.032 0.179 0.012 0.586
30 0.139 0.030 0.173 - 0.614
50 0.130 0.027 0.164 0.010 0.654
100 0.154 0.034 0.185 0.011 0.562
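For reference, the regression metrics reported in Table 4 can be computed with scikit-learn as in the following sketch (y_true and y_pred are illustrative placeholders for teacher grades and model similarities):

```python
# Sketch: the regression metrics of Table 4 via scikit-learn. Both arrays
# are assumed to lie in [0, 1], so the log-based RMSLE is well defined.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)

y_true = np.array([0.9, 0.5, 0.7, 1.0, 0.3])    # illustrative grades
y_pred = np.array([0.85, 0.55, 0.60, 0.95, 0.40])  # illustrative similarities

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  "
      f"RMSLE={rmsle:.3f}  R2={r2:.3f}")
```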
5 Discussion
The models trained with 50 epochs are the ones with the best metrics. Among these, the best are the Siamese models built from scratch, first the one based on BETO [4], followed by mBERT [5]. This may be because they were built with a more complex architecture that adds an extra dense layer (see Fig. 2). Fig. 3 shows the Pearson, Spearman, and Kendall coefficients for these models. For the Pearson coefficient, a linear dependence of at least 0.81 between the teacher's grade and the semantic similarity is obtained. Considering rank correlation, values of at least 0.56 are obtained for Spearman and 0.49 for Kendall. This Kendall value is obtained for the Siamese model based on mBERT; it means that, if we order the students' answers by the model's similarity score, concordant pairs outnumber discordant ones by 49% of all pairs, i.e., roughly three quarters of answer pairs are ranked consistently with the teacher's grading (see the identity below).
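For a sample without ties, Kendall's τ relates the numbers of concordant (C) and discordant (D) pairs as follows; this is a standard identity stated here for clarity, not taken from the paper:

$$\tau = \frac{C - D}{C + D}, \qquad \frac{C}{C + D} = \frac{1 + \tau}{2} = \frac{1 + 0.49}{2} \approx 0.745,$$

so a Kendall coefficient of 0.49 corresponds to roughly 74.5% of answer pairs being ordered consistently with the teacher's grading.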
6 Conclusion
Starting from a Siamese Transformer architecture, a relatively modern architecture that is well suited to the case at hand, where we want to measure the similarity of two medium-sized text inputs, this study delves into:
– Dealing with medium-sized texts, which leads to greater complexity and dimensionality of the models; the adopted architecture is therefore very important, directly impacting the performance of the models and, above all, their training.
– Working with texts in Spanish, since most research work is in English.
Placing the emphasis on these two challenges, the relevance of the study lies in their union, that is, working with medium-sized texts in Spanish for semantic comparison between them. Analyzing the results in detail, we see that the models obtained, although they achieve acceptable performance (Pearson correlation around 82% for the best two), are far from being a solution that can be used autonomously, without human review. In this regard, the volume of data used to train the models must be taken into account: a total of 240 labeled question-answer pairs. With this volume of data it has been possible to assess, to a certain extent, whether the proposed architecture would be valid, but it would be advisable to train the models with a larger volume of labeled data. In addition to starting with a larger volume
References
1. Dandelion API. https://dandelion.eu/semantic-text/text-similarity-demo. Accessed 28 Feb 2023
2. Retina API. https://www.cortical.io/retina-api-documentation. Accessed 28 Feb 2023
3. Babić, K., Guerra, F., Martinčić-Ipšić, S., Meštrović, A.: A comparison of approaches for measuring the semantic similarity of short texts based on word embeddings. J. Inf. Organ. Sci. 44(2) (2020). https://doi.org/10.31341/jios.44.2.2, https://jios.foi.hr/index.php/jios/article/view/142
4. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: PML4DC at ICLR 2020, pp. 1–10 (2020)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). https://doi.org/10.48550/ARXIV.1810.04805, https://arxiv.org/abs/1810.04805
6. Gonçalo Oliveira, H., Sousa, T., Alves, A.: Assessing lexical-semantic regularities in Portuguese word embeddings. Int. J. Interact. Multimed. Artif. Intell. 6, 34 (2021). https://doi.org/10.9781/ijimai.2021.02.006
7. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019). https://doi.org/10.48550/ARXIV.1907.11692, https://arxiv.org/abs/1907.11692
8. Qiu, X., Sun, T., Xu, Y., et al.: Pre-trained models for natural language processing: a survey. Sci. China Technol. Sci. 63, 1872–1897 (2020). https://doi.org/10.1007/s11431-020-1647-3