
Short Text Similarity Calculation Based on Jaccard and Semantic Mixture

Shushu Wu¹, Fang Liu¹, and Kai Zhang¹,²

¹ School of Computer Science, Wuhan University of Science and Technology, Wuhan 430081, China
² Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan 430081, China

Abstract. To enhance the accuracy of short text similarity calculation, a short text similarity calculation method based on a mixture of Jaccard and semantic similarity is proposed. Jaccard is a traditional similarity algorithm based on literal matching: it considers only word form, so its ability to capture semantics is limited. Word vectors can represent semantic similarity by computing the cosine similarity of two terms in the vector space; the semantic similarity of two sentences is then obtained by pairing and averaging the word similarities according to a fixed procedure. The two scores are combined with a weighting factor to compute the final text similarity. Experiments show that the algorithm improves the recall rate and F value of short text similarity calculation to some extent.

Keywords: Short text similarity · Jaccard · Word vector · Semantic similarity

1 Introduction

With the development of computer technology and the Internet, more and more information is presented as short texts, so correctly computing the similarity of short texts has become particularly important and is now a hot spot in natural language processing. Text similarity refers to the degree of semantic similarity between texts. It can be applied not only to search engines, document duplicate checking, and automatic question answering systems, but also to document classification and clustering and to accurate document push [1], all of which bring great convenience. Compared with long texts, short texts have shorter content and sparser words, which makes the calculation more difficult. For example, the same word can express different meanings, and different words can express the same meaning, i.e., polysemy and synonymy. Moreover, even if the word composition of two sentences is exactly the same, their meanings differ when their structures differ. According to the characteristics of the calculation methods, text similarity can be divided into literal matching similarity, semantic similarity, and structural similarity. The calculation method of literal matching considers the similarity of the text only from its morphology,

which is a severe limitation; the semantic similarity method solves the semantic matching of words but relies on a corpus; structural similarity can analyze the grammatical structure of the text, but its accuracy decreases as sentence length increases [2]. The three calculation methods each have advantages and disadvantages, and all need further optimization.
For the study of short text similarity, Huang Xianying et al. added word order to terms and combined an overlap similarity algorithm with a word-order similarity algorithm over common word blocks to calculate short text similarity [3]. Gu Zhixiang et al. used part-of-speech and word-frequency weighting to improve the SimHash algorithm [4]. Li Lian et al. optimized the vector space model algorithm by considering the influence of feature words shared between texts on text similarity [5]. These methods all consider features of the term beyond word form and improve the text similarity algorithm to a certain extent, but they do not touch the semantic level of the sentence. Sentence semantics can be realized with a corpus such as HowNet or WordNet. Yuan Xiaofeng used HowNet to calculate the semantic similarity of words, used the TF-IDF values of a small number of feature words to weight the vectors in the VSM, and then computed the similarity between texts [6]. Building on an improved edit distance, Che Wanxiang et al. used two semantic resources, HowNet and the synonym dictionary Cilin, to calculate the semantic distance between words, and obtained better results by taking both word order and semantics into account [7]. Zhang Jinpeng et al. studied text similarity based on the vector space model and a semantic dictionary, and discussed the semantic similarity of texts of different lengths and its applications [8]. Liao Zhifang et al. proposed a short text similarity algorithm based on syntax and semantics: by calculating the similarity of short texts with the same syntactic structure and considering the contribution of phrase order to the similarity, the similarity of Chinese short texts was calculated [9]. These methods consider the part of speech and semantics of the sentence, but the semantics rely on an external dictionary, which cannot calculate the semantic similarity between words of different parts of speech. Word vectors make up for this shortcoming; they can also be trained on an extended corpus, converting one-hot vectors into low-dimensional word vectors. Therefore, this paper combines Jaccard with a word-vector-based semantic algorithm to compute text similarity. Jaccard is an algorithm based on literal matching that considers the morphology of the text. It works well for sentence pairs with many co-occurring words, but for two sentences that share no words at all, the calculated similarity is 0, and it cannot capture the similarity of synonymous words. Word vectors can calculate the semantic similarity of words, which makes up for this shortcoming of the Jaccard algorithm, so the two are combined to compute short text similarity.
The first part of this article introduces short texts and related research, the second part describes the algorithms used in detail, the third part presents the experimental results and comparative analysis, and the fourth part gives the conclusion.

2 Related Algorithms
2.1 Jaccard Algorithm
The Jaccard ratio is an indicator used to measure the similarity of two sets. It is defined as the size of the intersection of the two sets divided by the size of their union. The Jaccard ratio focuses only on words with the same form: the more feature words two sentences share, the greater the Jaccard ratio. For two sentences S1 and S2, their Jaccard similarity is:

Sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|    (1)

The numerator is the number of terms shared by the two sentences, and the denominator is the total number of distinct terms in both.
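A minimal sketch of Eq. (1) in Python (the function and token lists are illustrative; the paper assumes sentences have already been segmented and stop words removed):

```python
def jaccard_similarity(s1_tokens, s2_tokens):
    """Jaccard similarity of Eq. (1): |S1 ∩ S2| / |S1 ∪ S2|."""
    set1, set2 = set(s1_tokens), set(s2_tokens)
    if not set1 and not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

# Three shared terms out of five distinct terms -> 0.6
print(jaccard_similarity(["short", "text", "similarity", "calculation"],
                         ["short", "text", "similarity", "method"]))
```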

2.2 Semantic Algorithm Based on Word Vector


Word2vec. With the popularization of deep learning in natural language processing, word vectors have been widely adopted. Word embedding refers to mapping the words or phrases of a vocabulary to real-valued vectors by some method. Traditional one-hot encoding converts words into discrete individual symbols; it represents word vectors simply, but the vocabulary is often very large, which makes the vectors high-dimensional and sparse. Word2vec [10] transforms these high-dimensional sparse vectors into low-dimensional dense vectors, and the positions of synonyms in the vector space are close to each other. Word2vec was released by Google in 2013 and can be used to generate and compute word vectors. It can be trained efficiently on a data set, and the trained word vectors measure the similarity between words well.

Behind word2vec is a shallow neural network that includes two models: the CBOW model and the Skip-gram model. Both models contain an input layer, a hidden layer, and an output layer, as shown in Fig. 1, but CBOW predicts the current word from its context, while Skip-gram predicts the context from the current word. This paper trains word vectors with the CBOW model. Input layer: the one-hot encoding vectors of the context words within the selected window size. Hidden layer: a simple sum and average of the word vectors of the context words. Output layer: a Huffman tree whose leaf nodes are the words appearing in the corpus; each non-leaf node acts as a binary classifier, with the left subtree as the negative class and the right subtree as the positive class, so the probability of each word can be calculated along its path. To reduce the complexity of the model compared with the neural probabilistic language model, CBOW replaces concatenation with cumulative summation in the hidden layer and replaces the linear structure of the output layer with a tree structure.

Fig. 1. word2vec model
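A hedged sketch of CBOW training with the gensim library (the paper does not name its toolkit, so this is one plausible realization; the 200-dimensional setting comes from Sect. 3.1, and the Huffman-tree output layer corresponds to hierarchical softmax):

```python
from gensim.models import Word2Vec

# corpus: an iterable of tokenized sentences, e.g. Wikipedia plus the data sets
corpus = [["short", "text", "similarity"], ["word", "vector", "semantics"]]

model = Word2Vec(
    sentences=corpus,
    vector_size=200,   # dimensionality used in the experiments (Sect. 3.1)
    sg=0,              # 0 selects the CBOW model described above
    hs=1, negative=0,  # hierarchical softmax: the Huffman-tree output layer
    window=5,          # context window size (an assumed value)
    min_count=1,
)

# Cosine similarity between two trained word vectors
print(model.wv.similarity("short", "text"))
```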

Semantic Algorithm. The traditional Jaccard algorithm does not involve similarity at the semantic level, while word vectors can capture the similarity of synonyms. Sentences S1 and S2 are used to illustrate the algorithm steps:

Step 1: Use word2vec to train on the corpus to generate the model and the word vector of each vocabulary item;

Step 2: Segment the sentences S1 and S2 and remove stop words. The words in S1 are ai (i = 1, 2, 3, ..., m), and the words in S2 are bj (j = 1, 2, 3, ..., n);

Step 3: Compute the semantic similarity of each word in S1 with each word in S2 through the word vectors to form a two-dimensional matrix M. The formula is:

cos(ai, bj) = (ai · bj) / (|ai| × |bj|)    (2)
Step 4: Find the largest value in the matrix M, add it to the set P, and set the values of its row and column to −1; repeat this step until all values in the matrix are −1;

Step 5: Sum all the values in the set P and divide by the set length n to obtain the average similarity, which is taken as the final semantic similarity of the two sentences. The formula is:

Sim(S1, S2) = (∑_{i=0}^{n−1} P(i)) / n    (3)
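A sketch of Steps 3–5 in Python, assuming `model` is a trained word2vec model as above and both sentences arrive as token lists whose words are in the vocabulary:

```python
import numpy as np

def semantic_similarity(model, words1, words2):
    # Step 3: cosine similarity matrix M over all word pairs (Eq. 2)
    M = np.array([[model.wv.similarity(a, b) for b in words2] for a in words1])
    P = []
    # Step 4: repeatedly take the global maximum, then overwrite its row
    # and column with -1 so each word is matched at most once
    while (M > -1).any():
        i, j = np.unravel_index(np.argmax(M), M.shape)
        P.append(float(M[i, j]))
        M[i, :] = -1
        M[:, j] = -1
    # Step 5: average the collected maxima (Eq. 3); len(P) = min(m, n)
    return sum(P) / len(P)
```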

2.3 Algorithm Based on Jaccard and Semantics

Calculate the Jaccard similarity and semantic similarity for sentences S1 and S2, recorded as Sim1(S1, S2) and Sim2(S1, S2), and mix the two with a certain weight to obtain the final similarity. The formula is:

Sim(S1, S2) = α · Sim1(S1, S2) + (1 − α) · Sim2(S1, S2) (4)

Here α is the weight adjustment factor, whose specific value is determined in the experiments. The resulting sentence similarity Sim(S1, S2) is compared with a similarity threshold β: if it is greater, the two sentences are judged similar; otherwise they are not.
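A sketch of Eq. (4) and the threshold test, reusing the two functions above; α = 0.35 and β = 0.58 are the values determined experimentally in Sect. 3.2:

```python
def combined_similarity(s1_tokens, s2_tokens, model, alpha=0.35):
    sim1 = jaccard_similarity(s1_tokens, s2_tokens)           # literal matching
    sim2 = semantic_similarity(model, s1_tokens, s2_tokens)   # word-vector semantics
    return alpha * sim1 + (1 - alpha) * sim2                  # Eq. (4)

def is_similar(sim, beta=0.58):
    return sim > beta   # greater than the threshold -> judged similar
```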

3 Experiment Design and Result Analysis

3.1 Experimental Details

In this article, three data sets are used to verify the algorithm. Each data set has human annotations: two sentences with the same meaning are labeled 1, otherwise 0. Data set I is the MSRP data set [11] from the Microsoft Research Paraphrase Corpus; it has more positive examples than negative ones and is drawn from news sources on the Web. Data set II is the STS data set [12], which measures the similarity of sentence meaning and has more negative examples than positive ones. Data set III consists of 2000 pairs selected from the Quora data set, with a positive-to-negative ratio of 1:1.
To reduce experimental error and improve accuracy, all text is preprocessed: uppercase is converted to lowercase, and useless information such as punctuation marks and extra spaces is removed. The tf-idf method is used to extract keywords from the text during testing. After this processing, all sentences are segmented and word2vec is trained to generate the model and the word vector corresponding to each word. The CBOW model is used during training, and the vectors are 200-dimensional. The training corpus consists of two parts: the English Wikipedia corpus, about 500 MB in size, and the data sets themselves. The latter prevents the situation where data set words are absent from the word vector model, making semantic similarity incomputable. The semantic similarity between words can then be computed by loading the model.
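The paper does not specify how the tf-idf keywords are extracted; a hedged sketch using scikit-learn's TfidfVectorizer is one common choice (the function name and k are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(texts, k=5):
    """Return the k highest tf-idf weighted terms of each text."""
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    tfidf = vectorizer.fit_transform(texts)
    terms = vectorizer.get_feature_names_out()
    keywords = []
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:k]   # indices of the k largest weights
        keywords.append([terms[i] for i in top if row[i] > 0])
    return keywords
```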
The evaluation uses the precision rate P, recall rate R, and F score commonly used in information retrieval to estimate the performance of the algorithm. Precision P = (correctly predicted positives) / (all predicted positives); recall R = (correctly predicted positives) / (all actual positives). Precision and recall affect each other: ideally both are high, but in general when one is high the other is low.

Therefore, when both need to be high, the F value can be used as a single measure. The definition of F is:
F = (2 × P × R) / (P + R)    (5)
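As a check on the arithmetic, with the best values reported later in Sect. 3.2 (P = 0.651, R = 0.842), F = 2 × 0.651 × 0.842 / (0.651 + 0.842) ≈ 0.734, matching the F value reported there.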
This article includes two experiments. The first determines the similarity threshold for two sentences and the respective weights of the two algorithms. The second demonstrates the validity of the short text similarity algorithm put forward in this article, using the weighting factor obtained in the first experiment to mix the Jaccard and semantic algorithms and comparing the result with other algorithms.

3.2 Experimental Results and Analysis

Value of Similarity Threshold β and Weighting Factor α. To determine whether two sentences are similar, a similarity threshold must be set. Too small a threshold will judge dissimilar sentences similar, and too large a threshold also causes judgment errors, so a suitable similarity threshold must be selected. The specific method is as follows:

1) Divide the interval according to the similarity value: choose an integer m and divide [0, 1] into [0, 1/m), [1/m, 2/m), ..., [(m − 1)/m, 1];
2) Select the minimum and maximum values of each interval and generate a series of values evenly distributed between them;
3) Find the threshold with the lowest error rate in each interval and its corresponding accuracy rate, and record them;
4) Screen the thresholds; the arrays Z1 and Z2 record the screened thresholds and accuracies respectively;
5) Weight the similarity thresholds by the normalized accuracies and sum to obtain the final similarity threshold.

In the experiment, data set I and data set II are combined to find the similarity threshold and weighting factor. With m set to 10, the final similarity threshold is 0.58. The weighting factor is then calculated under this fixed threshold, taking different values of α in steps of 0.05. The result is shown in Fig. 2.
It can be seen from Fig. 2 that as α increases, the precision rate gradually increases while the recall rate gradually decreases, and the F value first increases and then decreases. Taking the F value as the selection criterion, when α is 0.35 the F value reaches its maximum of 0.734, with a precision rate of 0.651 and a recall rate of 0.842. Therefore, this paper uses a weighting factor of 0.35 for the subsequent performance evaluation.
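A sketch of this parameter sweep, assuming a labeled development set of token-list pairs and reusing combined_similarity from Sect. 2.3; β = 0.58 is the threshold fixed above:

```python
import numpy as np

def sweep_alpha(pairs, labels, model, beta=0.58):
    best_alpha, best_f = 0.0, -1.0
    for alpha in np.arange(0.0, 1.0001, 0.05):   # alpha in steps of 0.05
        preds = [combined_similarity(s1, s2, model, alpha) > beta
                 for s1, s2 in pairs]
        tp = sum(1 for p, l in zip(preds, labels) if p and l == 1)
        precision = tp / sum(preds) if any(preds) else 0.0
        recall = tp / sum(labels) if sum(labels) else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        if f > best_f:
            best_alpha, best_f = alpha, f
    return best_alpha, best_f   # the paper reports alpha = 0.35, F = 0.734
```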

Fig. 2. Comparison of experimental results with different weighting factors

Similarity Algorithm Performance Evaluation. To demonstrate the performance of the algorithm put forward in this article, it is compared with several classic algorithms. The experiments are conducted on data set I, data set II, and data set III. The compared algorithms are as follows:

Method 1: Traditional Jaccard algorithm;
Method 2: Vector-based cosine similarity algorithm;
Method 3: Term-based edit distance algorithm;
Method 4: The hybrid Jaccard-and-semantics algorithm of this article.

The results of the experiments are shown in the following tables:

Table 1. Data set-I evaluation results.

                Method 1  Method 2  Method 3  Method 4
Precision rate  0.892     0.781     0.878     0.756
Recall rate     0.428     0.763     0.427     0.867
F               0.579     0.772     0.575     0.807

Table 2. Data set-II evaluation results.

                Method 1  Method 2  Method 3  Method 4
Precision rate  0.764     0.605     0.653     0.542
Recall rate     0.349     0.682     0.437     0.810
F               0.479     0.641     0.524     0.649

Table 3. Data set-III evaluation results.

                Method 1  Method 2  Method 3  Method 4
Precision rate  0.667     0.651     0.639     0.635
Recall rate     0.334     0.641     0.389     0.793
F               0.445     0.646     0.484     0.705

Table 1, Table 2 and Table 3 show the text similarity calculation performance of the different methods on data sets I, II and III. From the experimental results, on every data set the algorithm put forward in this article, which combines Jaccard and semantics, achieves a significant improvement in recall rate and F value over the other three algorithms, though its precision rate is lower. The traditional Jaccard algorithm pays attention only to word form without considering the semantics of terms, so its recall and F value are low. The vector-based cosine similarity algorithm, which concentrates on word forms and their counts, is more stable: its precision and recall do not differ much, and its F value is also higher. The term-based edit distance algorithm considers word order to a certain extent but likewise ignores sentence semantics; its evaluation results are similar to those of the Jaccard algorithm. The algorithm in this paper considers both the information of co-occurring terms and the semantic information of non-co-occurring terms, and obtains good results. Comparing the evaluation results on the three data sets, the F value of the proposed algorithm ranks data set I > data set III > data set II, which may be related to the characteristics of the data sets: data set I has more positive examples than negative, data set II has more negative than positive, and data set III is balanced. Moreover, data set III was not used to find the similarity threshold and weighting factor, yet with a weighting factor of 0.35 good experimental results are still obtained.

4 Conclusion
This article puts forward a text similarity algorithm based on a mixture of Jaccard and semantics. The algorithm first considers the effect of co-occurring words on text similarity, using the traditional Jaccard algorithm to compute the similarity of the two sentences. Second, word vectors are obtained by training on an external corpus; the cosine values between the word vectors of the two sentences are calculated, the maximum value is repeatedly extracted while the corresponding word vectors in the two sentences are removed, and all the extracted maxima are averaged as the semantic similarity of the two short sentences. Finally, the Jaccard similarity and semantic similarity are combined with weights to compute the final similarity of the two sentences. Experiments were carried out on three data sets, comparing the algorithm with the conventional Jaccard algorithm, the cosine similarity algorithm, the edit distance algorithm, etc. The results show that the algorithm of this paper exceeds the other methods in the recall rate R and F value of text similarity calculation, demonstrating its effectiveness. However, the improvement is not dramatic, which is related to linguistic features such as the quality of the trained word vectors, sentence syntax, and word order: the larger the training corpus, the better the trained word vectors, but the training corpus in this article is only of medium size, and the algorithm does not consider the semantic impact of word order and sentence structure on sentence meaning.

References
1. Erjing, C., Enbo, J.: A review of text similarity calculation methods. Data Anal.
Knowl. Discov. 1(6), 1–11 (2017)
2. Hanru, W., Yangsen, Z.: A review of research progress in text similarity calculation.
J. Beijing Inf. Sci. Technol. Univ. (Nat. Sci. Edn.) 34(01), 68–74 (2019)
3. Xianying, H., Yingtao, L., Qinfei, R.: An English short text similarity algorithm
based on common chunks. J. Chongqing Univ. Technol. (Nat. Sci.) 29(08), 88–93
(2015)
4. Zhixiang, G., Xie Longen, D.Y.: Implementation and improvement of SimHash
algorithm for text similarity calculation. Inf. Commun. 01, 27–29 (2020)
5. Li, L., Zhu, A., Su, T.: Research and implementation of an improved text similarity
algorithm based on vector space. Comput. Appl. Softw. (02), 282–284 (2012)
6. Yuan, X.: Research on text similarity based on HowNet. J. Chengdu Univ. (Nat. Sci. Edn.) 33(3), 251–253 (2014). https://doi.org/10.3969/j.issn.1004-5422.2014.03.015
7. Wanxiang, C., Ting, L., Bing, Q., et al.: Chinese similar sentence retrieval based on improved edit distance. High-Tech Commun. 14(7), 15–19 (2004). https://doi.org/10.3321/j.issn:1002-0470.2004.07.004
8. Jinpeng, Z.: Research and application of text similarity algorithm based on seman-
tics. Chongqing University of Technology (2014)
9. Zhifang, L., Guoen, Z., Junfeng, L., Fei, L., Fei, C.: Chinese short text grammar
semantic similarity algorithm. J. Hunan Univ. (Nat. Sci. Edn.) 43(02), 135–140
(2016)
10. Mikolov, T., Chen, K., Corrado, G., et al.: Efficient estimation of word represen-
tations in vector space (2013). arXiv preprint arXiv:1301.3781
11. Dolan, W., Quirk, C., Brockett, C., et al.: Unsupervised construction of large
paraphrase corpora: exploiting massively parallel news sources (2004)
12. Cer, D.M., Diab, M.T., Agirre, E., et al.: SemEval-2017 Task 1: semantic tex-
tual similarity multilingual and crosslingual focused evaluation. In: Meeting of the
Association for Computational Linguistics, pp. 1–14 (2017)
