Multi-Task Pre-Training Language Model For Semantic Network Completion
Abstract—Semantic networks, such as knowledge graphs, represent knowledge by leveraging a graph structure. Although knowledge graphs show promising value in natural language processing, they suffer from incompleteness. This paper focuses on knowledge graph completion by predicting linkage between entities, which is a fundamental yet critical task. Semantic matching is a potential solution for link prediction because it can deal with unseen entities, with which translational distance based methods struggle. However, to achieve performance competitive with translational distance based methods, semantic matching based methods require large-scale datasets for training, which are typically unavailable in practical settings. Therefore, we employ a language model and introduce a novel knowledge graph architecture named LP-BERT, which contains two main stages: multi-task pre-training and knowledge graph fine-tuning. In the pre-training phase, three tasks drive the model to learn relationship information from triples by predicting either entities or relations. In the fine-tuning phase, inspired by contrastive learning, we design a triple-style negative sampling within a batch, which greatly increases the proportion of negative samples while keeping the training time almost unchanged. Furthermore, we propose a new data augmentation method that utilizes the inverse relationship of triples to improve the performance and robustness of the model. To demonstrate the effectiveness of the proposed framework, we conduct extensive experiments on three widely used knowledge graph datasets: WN18RR, FB15k-237, and UMLS. The experimental results demonstrate the superiority of our method, which achieves state-of-the-art results on the WN18RR and UMLS datasets. Notably, the Hits@10 indicator is improved by 5% over the previous state-of-the-art result on the WN18RR dataset, while reaching 100% on the UMLS dataset.

Index Terms—Knowledge Graph, Link Prediction, Semantic Matching, Translational Distance, Multi-Task Learning.

I. INTRODUCTION

The applications of knowledge graphs (KGs) are evident in both industrial and academic fields [1], including question answering, recommendation systems, and natural language processing [2]. These applications have in turn attracted considerable interest in the construction of large-scale KGs. Despite sustained efforts, many existing knowledge graphs suffer from incompleteness [3], since it is difficult to store all facts at once. To address this incompleteness issue, many link prediction approaches have been explored, which aim to discover unannotated relations between entities in order to complete knowledge graphs; this is challenging but critical due to its potential to boost downstream applications. Link prediction for KGs is also known as KG completion.

Previous link prediction methods can be classified into two main categories: translational distance based approaches and semantic matching based approaches [4]. Translational distance based models typically embed both entities and relations into a vector space and exploit scoring functions to measure the distance between them. Although the distance representation of entity relations can be very diverse, it is difficult for such models to predict entities that did not appear in the training phase. As a promising alternative, semantic matching based approaches utilize the semantic information of entities and relationships and are capable of embedding unseen entities based on their text descriptions. However, due to the high structural complexity of these models and their slow training speed, the proportion of negative sampling used during training is much lower, which leads to insufficient learning of negative-sample entity information in the entity library and severely constrains model performance.

To address the aforementioned issues, in particular the poor prediction performance of translational distance models on unseen nodes and the insufficient training of text matching models, we propose a novel pre-training framework for knowledge graphs, namely LP-BERT. Specifically, LP-BERT adopts a semantic matching representation and leverages a multi-task pre-training strategy, including a masked language model task (MLM) for context learning, a masked entity model task (MEM) for entity semantic learning, and a masked relation model task (MRM) for relational semantic learning. With these pre-training tasks, LP-BERT can learn both the relational information and the unstructured semantic knowledge of structured knowledge graphs. Moreover, to solve the problem of insufficient training induced by the low negative sampling ratio, we propose an in-batch negative sampling strategy inspired by contrastive learning, which significantly increases the proportion of negative samples while ensuring that the training time remains unchanged. At the same time, we propose a data augmentation method based on the inverse relationship of triples to increase sample diversity, which further boosts performance.

To demonstrate the effectiveness and robustness of the proposed solution, we comprehensively evaluate the performance of LP-BERT on the WN18RR, FB15k-237, and UMLS datasets.
Without bells and whistles, LP-BERT outperforms a group of competitive methods [5], [6], [7], [8], [9] and achieves state-of-the-art results on the WN18RR and UMLS datasets. The Hits@10 indicator is improved by 5% over the previous state-of-the-art result on the WN18RR dataset, while reaching 100% on the UMLS dataset.

The remainder of this paper is structured as follows. Section II discusses the relationship between the proposed method and prior works. In Section III, we provide the details of our methodology, while Section IV describes comprehensive experimental results. Section V concludes the paper.

II. RELATED WORK

A. Knowledge Graph Embedding

KG embedding is a well-studied topic, and comprehensive surveys can be found in [4], [10]. Traditional methods can only utilize the structural information observed in triples to complete the knowledge graph. For example, TransE [11] and TransH [12] were two representative works that iteratively updated the vector representations of entities by calculating the distance between entities. Convolutional Neural Networks (CNNs) have also achieved satisfactory performance in knowledge graph embedding [13], [14], [15]. In addition, different types of external information, such as entity types, logical rules, and text descriptions, have been introduced to enhance results [4]. For text descriptions, [16] first represented entities by averaging the word embeddings contained in their names, where the word embeddings are learned from external corpora. [4] embedded entities and words into the same vector space by aligning Wikipedia anchors and entity names. CNNs have also been utilized to encode the word sequences in entity descriptions [17]. [18] proposed Semantic Space Projection (SSP) to jointly learn topics and knowledge graph embeddings by depicting the strong correlation between fact triples and text descriptions.

Despite the satisfactory performance of these models, they learn only a single text representation for entities and relationships, whereas the words in entity/relationship descriptions can have different meanings or importance weights in different triples. To address these problems, [19] proposed a text-enhanced KG embedding model, TEKE, which can assign different embeddings to a relationship in different triples and uses the co-occurrence of entities and words in an entity-labeled text corpus. [20] used a Long Short-Term Memory (LSTM) encoder with an attention mechanism to construct context-dependent text representations given different relationships. [21] proposed an accurate text-enhanced KG embedding method using a mutual attention mechanism between triple-specific relation mentions and entity descriptions.

Although these methods can deal with the semantic changes of entities and relations in different triples, they cannot make full use of the syntactic and semantic information in large-scale free text, because they only use entity descriptions, relation mentions, and word co-occurrence with entities. KG-BERT [22], MLMLM [23], StAR [9], and other works have introduced a pre-training paradigm. As aforementioned, due to the high complexity of these models and their slow training speed, the proportion of negative sampling is far lower than that of previous work, which leads to insufficient learning of negative-sample entity information in the entity library.

Compared with the aforementioned methods, our method introduces a contrastive learning strategy. During training, a novel negative sampling approach is carried out within the batch, so the proportion of negative sampling can be increased exponentially, which alleviates the insufficient learning issue. Moreover, we also optimize the pre-training strategy so that the model learns not only context knowledge but also the element information of the left entity, the relation, and the right entity, which greatly improves the performance of the model.

B. Link Prediction

Link prediction is an active research area in KG embedding and has received much attention in recent years. KBGAT [24], a novel attention-based feature embedding, was proposed to capture both entity and relation features in any given entity's neighbourhood. AutoSF [25] endeavored to automatically design scoring functions (SFs) for distinct KGs using AutoML techniques. CompGCN [26] aimed to build a novel graph convolutional framework that jointly embeds both nodes and relations in a relational graph, leveraging a variety of entity-relation composition operations from knowledge graph embedding techniques and scaling with the number of relations. Meta-KGR [27] was a meta-based multi-hop reasoning method adopting meta-learning to learn effective meta parameters from high-frequency relations that can quickly adapt to few-shot relations. ComplEx-N3-RP [28] designed a new self-supervised training objective for multi-relational graph representation learning by simply incorporating relation prediction into the commonly used 1vsAll objective, which contains not only terms for predicting the subject and object of a given triple, but also a term for predicting the relation type. AttH [29] was a class of hyperbolic KG embedding models that simultaneously capture hierarchical and logical patterns by combining hyperbolic reflections and rotations with attention to model complex relational patterns.

RotatE [30] defined each relation as a rotation from the source entity to the target entity in the complex vector space, being able to model and infer various relation patterns, including symmetry/antisymmetry, inversion, and composition. HAKE [8] combined TransE and RotatE to model entities at different levels of a hierarchy while distinguishing entities at the same level. GAAT [31] integrated an attenuated attention mechanism and assigned diverse weights to relation paths to acquire information from the neighborhoods. StAR [9] partitioned a triple into two asymmetric parts, as in translational distance based graph embedding approaches, and encoded both parts into contextualized representations with a Siamese-style textual encoder. However, the pre-training stage of these textual models can only learn the context knowledge of the text while ignoring the graph structure.
Fig. 2. An instance demonstrating the pre-training tasks. Entities, relations, and entity descriptions are concatenated into a single sequence. For MLM, random tokens are masked; for MEM, either the head entity or the tail entity is masked, so we qualify MEM with the subscript "h" or "t"; for MRM, the relation is masked. Note that these pre-training tasks can be combined using the multi-task learning paradigm during the pre-training procedure.
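For illustration, the following simplified Python sketch shows how such pre-training samples could be constructed: a triple and its entity descriptions are serialized into one token sequence, and one of the MEM-h, MEM-t, or MRM masking modes is applied (cf. Fig. 2 and Algorithm 1 below). The whitespace tokenization, region bookkeeping, and helper names (serialize_triple, mask_region, build_pretraining_sample) are assumptions of this sketch, not details taken from the LP-BERT implementation; MLM masking of the remaining regions is omitted for brevity.

```python
import random

MASK, PAD = "[MASK]", "[PAD]"

def serialize_triple(head, head_desc, rel, tail, tail_desc):
    """Concatenate entity names, descriptions, and the relation into one token list,
    remembering the index span of each region (cf. Fig. 2)."""
    regions, tokens = {}, []
    for name, text in [("Eh", head), ("Dh", head_desc), ("R", rel),
                       ("Et", tail), ("Dt", tail_desc)]:
        start = len(tokens)
        tokens += text.split()              # whitespace tokenization, for illustration only
        regions[name] = (start, len(tokens))
    return tokens, regions

def mask_region(tokens, span):
    """MEM/MRM-style masking: replace a whole region with [MASK] and return the target."""
    s, e = span
    return tokens[:s] + [MASK] * (e - s) + tokens[e:], tokens[s:e]

def build_pretraining_sample(tokens, regions):
    """Randomly pick MEM-h, MEM-t, or MRM with the 0.4/0.4/0.2 split of Algorithm 1."""
    r = random.random()
    if r < 0.4:                              # mask the head entity, hide its description
        x, y = mask_region(tokens, regions["Eh"])
        s, e = regions["Dh"]
        x = x[:s] + [PAD] * (e - s) + x[e:]
        task = "MEM_h"
    elif r < 0.8:                            # mask the tail entity, hide its description
        x, y = mask_region(tokens, regions["Et"])
        s, e = regions["Dt"]
        x = x[:s] + [PAD] * (e - s) + x[e:]
        task = "MEM_t"
    else:                                    # mask the relation
        x, y = mask_region(tokens, regions["R"])
        task = "MRM"
    return x, y, task

tokens, regions = serialize_triple(
    "half mile", "a unit of length equal to half of 1 mile",
    "hypernym", "linear unit", "a unit of measurement of length")
x, y, task = build_pretraining_sample(tokens, regions)
print(task, y)
```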
Algorithm 1 Sample construction in the pre-training phase.
Require: Tokens: token list of a triple
Ensure: x: masked input, y1: MRM or MEM target, y2: MLM target
1: function MLM(x, y2, region)
2:   for i in region do
3:     if random[0, 1] < 0.15 then
4:       y2[i] = x[i]
5:       if random[0, 1] < 0.8 then
6:         x[i] = [MASK]
7:       else
8:         if random[0, 1] > 0.5 then
9:           x[i] = sample(Vocab)
10:        end if
11:      end if
12:    end if
13:  end for
14:  return x, y2
15: end function
16:
17: x ← X̃ in Equation 1
18: y1, y2 ← [PAD] * len(X̃)
19: random_state ← random([0, 1])
20: if random_state < 0.4 then
21:   y1[Eh] ← x[Eh]
22:   x[Eh] ← [MASK] * len(Eh)
23:   x[Dh] ← [PAD] * len(Dh)
24:   x, y2 ← MLM(x, y2, R‖Et‖Dt)
25: end if
26: if 0.4 ≤ random_state < 0.8 then
27:   y1[Et] ← x[Et]
28:   x[Et] ← [MASK] * len(Et)
29:   x[Dt] ← [PAD] * len(Dt)
30:   x, y2 ← MLM(x, y2, Eh‖Dh‖R)
31: end if
32: if random_state ≥ 0.8 then
33:   y1[R] ← x[R]
34:   x[R] ← [MASK] * len(R)
35:   x, y2 ← MLM(x, y2, regions of x other than R)
36: end if
37: return x, y1, y2

D. Knowledge Graph Fine-tuning

1) Knowledge Representation: The sample construction method in the fine-tuning stage is different from that in the pre-training stage. Inspired by StAR [9], for each triple we concatenate the Eh and R texts. The pre-trained LP-BERT model then uses a Siamese structure [44] to encode Eh‖R and Et, respectively. The fine-tuning objective is to bring the two representations of a positive sample closer and push those of a negative sample further apart. Here, a positive sample means that (Eh, R, Et) exists in the knowledge base, while a negative sample does not.

Since a triple (Eh, R, Et) is split into Eh‖R and Et, the knowledge graph can only supply positive samples. Hence, to conduct binary classification during fine-tuning, we propose a simple but effective method to generate negative samples. As illustrated in Figure 3, for a mini-batch of size n, by interleaving Ehi‖Ri and Etj (1 ≤ i, j ≤ n), we take the pairs (Ehi‖Ri, Etj) with i ≠ j as negative samples. Therefore, for a mini-batch, LP-BERT only needs two forward passes to obtain n² sample distances, which greatly increases the proportion of negative sampling and reduces the training time. The detailed procedure for constructing negative samples during fine-tuning is shown in Algorithm 2.

Fig. 3. Label matrix for a batch of size n. Ehi‖Ri and Eti in the i-th (1 ≤ i ≤ n) element of the batch can only generate 1 positive sample and n − 1 negative samples. Therefore, there are n² samples in the batch, including n positive samples and n(n − 1) negative samples.

Algorithm 2 Batch sample construction in the fine-tuning phase.
Require: x1: Eh‖R batch tokens, x2: Et batch tokens, dict: a dictionary giving the positive entities for each Eh‖R
Ensure: p: model prediction results, y: ground truth
1: Emb1 ← encoding of x1
2: Emb2 ← encoding of x2
3: p ← cosine similarity of Emb1 and Emb2
4: y = []
5: for Eh‖R ∈ x1 do
6:   for Et ∈ x2 do
7:     if Et ∈ dict[Eh‖R] then
8:       y.append(1)
9:     else
10:      y.append(0)
11:    end if
12:  end for
13: end for
14: return p, y
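For illustration, the following PyTorch-style sketch shows one way to realize the in-batch negative sampling of Figure 3 and Algorithm 2, together with the inverse-relation rewriting used for head entity prediction that is described in the Triple Augmentation subsection below. It is a simplified sketch under stated assumptions (an encode callable stands in for the LP-BERT Siamese encoder, and text pairs are built with illustrative string formatting), not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def in_batch_scores_and_labels(encode, eh_r_texts, et_texts, positives):
    """Score every (Eh||R, Et) combination in a mini-batch.

    encode:    callable mapping a list of n strings to an (n, d) tensor
               (stand-in for the LP-BERT Siamese encoder).
    positives: dict mapping each "Eh||R" string to the set of correct tail
               entities, so accidental in-batch positives are labeled correctly.
    Returns an (n, n) cosine-similarity matrix and an (n, n) 0/1 label matrix.
    """
    emb1 = F.normalize(encode(eh_r_texts), dim=-1)   # (n, d)
    emb2 = F.normalize(encode(et_texts), dim=-1)     # (n, d)
    scores = emb1 @ emb2.t()                         # (n, n) cosine similarities
    labels = torch.tensor(
        [[1.0 if et in positives[ehr] else 0.0 for et in et_texts]
         for ehr in eh_r_texts])
    return scores, labels

def add_inverse_samples(triples):
    """Rewrite (?, R, Et) head-prediction queries as (Et, R_rev, ?) so the same
    Siamese pairing can be reused (see the Triple Augmentation subsection)."""
    augmented = []
    for eh, r, et in triples:
        augmented.append((f"{eh} || {r}", et))        # tail prediction
        augmented.append((f"{et} || {r}_rev", eh))    # head prediction via inverse relation
    return augmented
```

With n positive pairs per mini-batch, this yields n(n − 1) in-batch negatives from only two encoder forward passes, which matches the label matrix of Figure 3.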
2) Triple Augmentation: The above pair-based knowledge graph representation has a limitation: it cannot directly represent the (Eh, R‖Et) pair needed for the head entity prediction task. Specifically, when constructing negative samples we can only corrupt Et but not Eh, which limits the diversity of negative samples, especially when there are more relationships in the dataset. Therefore, we propose a dual-relationship data augmentation method. For each relationship R, we define a corresponding inverse relationship Rrev. A head entity prediction sample of the form (?, R, Et) is rewritten into the form (Et, Rrev, ?) to augment the data. In this way, we can use the vector representation of (Et‖Rrev, Eh) to replace (Eh, R‖Et), improving the diversity of sampling and the robustness of the model. Moreover, we use a mixed precision strategy with fp16 and fp32 to reduce the GPU memory usage of gradient computation, which allows a larger batch size n and thus a higher negative sampling ratio.

3) Fine-tuning Loss Design: We design two distance calculation methods to jointly compute the loss function,

L = L_1(V_{EhR}, V_{Et}) + L_2(V_{EhR}, V_{Et})    (7)

where V_{EhR} and V_{Et} are the encoded vectors of Eh‖R and Et, respectively.

L_1(d_1) = \begin{cases} -\alpha_t (1 - d_1)^{\gamma} \log(d_1), & y = 1 \\ -\alpha_t (d_1)^{\gamma} \log(1 - d_1), & y \neq 1 \end{cases}    (8)

L_2(d_2) = \begin{cases} \mathrm{Sigmoid}(\mathrm{sum}(d_2)), & y = 1 \\ 1 - \mathrm{Sigmoid}(\mathrm{sum}(d_2)), & y \neq 1 \end{cases}    (9)

where

d_1 = \frac{V_{EhR} \cdot V_{Et}}{\|V_{EhR}\| \, \|V_{Et}\|}, \qquad d_2 = \|V_{EhR} - V_{Et}\|    (10)

Here α_t is used to adjust the weights of positive and negative samples, and γ is used to adjust the weights of samples that are difficult to distinguish. The two distance measures, computed in different ways, jointly characterize the distance relationship between the multi-task learning vector pairs.
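A simplified PyTorch sketch of this joint loss is given below for illustration. It follows Eqs. (7)-(10) assuming that emb1 and emb2 are the encoded vectors of Eh‖R and Et; a single α is used for both branches, the cosine similarity is rescaled into (0, 1) before taking the logarithm, and d2 is taken as the summed absolute difference. These are implementation assumptions made to keep the sketch runnable, not details specified by the equations.

```python
import torch

def lp_bert_loss(emb1, emb2, y, alpha=0.8, gamma=2.0, eps=1e-7):
    """Joint fine-tuning loss of Eqs. (7)-(10), sketched.

    emb1: (N, d) encodings of Eh||R,  emb2: (N, d) encodings of Et,
    y:    (N,) binary labels (1 = positive pair, 0 = negative pair).
    """
    # d1: cosine similarity, Eq. (10), rescaled into (0, 1) so log() is defined (assumption).
    d1 = torch.cosine_similarity(emb1, emb2, dim=-1)
    d1 = ((d1 + 1.0) / 2.0).clamp(eps, 1.0 - eps)

    # L1: focal-style term on the similarity, Eq. (8).
    pos_l1 = -alpha * (1.0 - d1) ** gamma * torch.log(d1)
    neg_l1 = -alpha * d1 ** gamma * torch.log(1.0 - d1)
    l1 = torch.where(y == 1, pos_l1, neg_l1)

    # L2: sigmoid of the summed element-wise distance, Eq. (9).
    d2 = (emb1 - emb2).abs().sum(dim=-1)
    sig = torch.sigmoid(d2)
    l2 = torch.where(y == 1, sig, 1.0 - sig)

    return (l1 + l2).mean()   # Eq. (7), averaged over the batch
```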
IV. EXPERIMENTS

In this section, we first detail the experimental settings. We then demonstrate the effectiveness of the proposed model on multiple widely used benchmark datasets, namely WN18RR [13], FB15k-237 [45], and UMLS [13]. Next, ablation studies are conducted in which we tease each component apart from our standard model while keeping the other structures unchanged, with the goal of justifying the contribution of each improvement. Finally, we conduct an extensive analysis of the prediction performance on unseen entities, and a case study is also provided to show the effectiveness of the proposed LP-BERT.

A. Experiment Settings and Datasets

We evaluate the proposed models on the WN18RR [13], FB15k-237 [45], and UMLS [13] datasets. WN18RR was adopted from WordNet for the link prediction task [46] and consists of English phrases and the corresponding semantic relations. FB15k-237 [45] is a subset of Freebase [47], which consists of real-world named entities and their relations. Both WN18RR and FB15k-237 are derived from WN18 and FB15k [11], respectively, by removing inverse relations and data leakage, and they are among the most popular benchmarks. UMLS [13] is a small KG containing medical semantic entities and their relations. The summary statistics of the datasets are shown in Table I.

TABLE I
THE SUMMARY STATISTICS OF THE USED DATASETS, INCLUDING WN18RR, FB15K-237, AND UMLS.

                 WN18RR   FB15k-237   UMLS
Entities          40943       14541    135
Relations            11         237     46
Train samples     86835      272115   5216
Valid samples      3034       17535    652
Test samples       3034       20466    661

We implement LP-BERT using the PyTorch framework (https://pytorch.org/) on a workstation with an Intel Xeon processor, 64 GB of RAM, and an Nvidia P40 GPU for training. The AdamW optimizer is used with 5% of the steps for warmup. For the hyper-parameters of LP-BERT pre-training, we set the number of epochs to 50, the batch size to 32, and the learning rates to 10^{-4} and 5 × 10^{-5} for the linear and attention parts, respectively; the early-stop epoch number is set to 3. In the knowledge graph fine-tuning phase, we set the batch size to 64 on WN18RR, 120 on FB15k-237, and 128 on UMLS, based on the best Hits@10 on the development set. The learning rates are set to 10^{-3} and 5 × 10^{-5} for the linear and attention parts initialized from LP-BERT, respectively. The number of training epochs is 7 on WN18RR and FB15k-237 and 30 on UMLS, with α = 0.8 on WN18RR and UMLS, α = 0.5 on FB15k-237, and γ = 2 in Eq. (8).

In the inference phase, every other entity in the knowledge graph is used as a corrupting candidate for the head or tail entity of each test triple, and the trained model must rank the correct triple against these corrupted candidates under the "filtered" setting. The evaluation metrics cover two aspects: (1) Hits@N, the ratio of test instances for which the correct candidate is ranked within the top N; and (2) the mean rank (MR) and the mean reciprocal rank (MRR), which reflect the absolute ranking.

B. Results

We benchmark the link prediction tasks using the proposed method and competitive approaches, including both translational distance based and semantic matching based methods. For the translational distance models, 18 widely used solutions are tested in our experiments, including TransE [11], DistMult [48], ComplEx [49], R-GCN [15], ConvE [13], KBAT [24], QuatE [50], RotatE [30], TuckER [51], AttH [29], DensE [53], Rot-Pro [5], QuatDE [6], LineaRE [7], CapsE [54], RESCAL-DURA [55], and HAKE [8]. As expected, only a few previous attempts employ semantic matching based methods, due to the difficulties of training; here, KG-BERT [22] and StAR [9] are used for the quantitative comparison.
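Before turning to the detailed results, the following sketch illustrates the filtered evaluation protocol of Section IV-A, computing Hits@N, MR, and MRR from ranked candidates. It is an illustrative implementation of the standard protocol rather than code from the paper; score_fn and the data structures are assumptions (tail prediction is shown, head prediction is symmetric).

```python
import numpy as np

def filtered_metrics(test_triples, all_entities, known_triples, score_fn, ks=(1, 3, 10)):
    """Filtered link-prediction evaluation.

    score_fn(h, r, candidates) -> array of scores, higher = more plausible.
    known_triples: set of all (h, r, t) in train/valid/test, used to filter out
                   other correct answers before ranking ("filtered" setting).
    """
    ranks = []
    for h, r, t in test_triples:
        candidates = [e for e in all_entities
                      if e == t or (h, r, e) not in known_triples]
        scores = np.asarray(score_fn(h, r, candidates))
        order = np.argsort(-scores)                       # descending by score
        rank = int(np.where(np.array(candidates)[order] == t)[0][0]) + 1
        ranks.append(rank)
    ranks = np.asarray(ranks, dtype=float)
    metrics = {f"Hits@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["MR"] = float(ranks.mean())
    metrics["MRR"] = float((1.0 / ranks).mean())
    return metrics
```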
TABLE II
EXPERIMENTAL RESULTS ON THE WN18RR, FB15K-237, AND UMLS DATASETS. THE BOLD NUMBERS DENOTE THE BEST RESULTS IN EACH GENRE, WHILE THE UNDERLINED ONES ARE STATE-OF-THE-ART PERFORMANCE. LP-BERT ACHIEVES STATE-OF-THE-ART PERFORMANCE IN MULTIPLE EVALUATION RESULTS ON THE WN18RR AND UMLS DATASETS AND OUTPERFORMS OTHER SEMANTIC MATCHING MODELS ON THE FB15K-237 DATASET. ↑ MEANS THAT HIGHER VALUES INDICATE BETTER PERFORMANCE, WHILE ↓ MEANS THAT LOWER VALUES INDICATE BETTER PERFORMANCE.

The detailed results are shown in Table II. As can be observed from the table, LP-BERT achieves state-of-the-art or competitive performance on all three widely used datasets, namely WN18RR, FB15k-237, and UMLS. The improvement is especially significant in terms of Hits@10 and Hits@3, owing to the superior generalization performance of the multi-task pre-training textual encoding approach, which will be further analyzed in the sections below. Furthermore, LP-BERT surpasses all other methods by a large margin in terms of Hits@3 and Hits@10 on WN18RR and Hits@10 on UMLS. Although it achieves inferior performance on the FB15k-237 dataset and on the Hits@1 and MRR metrics of WN18RR compared to translational distance models, it still remarkably outperforms other semantic matching models of the same genre, such as KG-BERT and StAR, by introducing structured knowledge. In particular, LP-BERT outperforms StAR [9], the previous state-of-the-art model, on all three datasets with only one third of its parameters. On the WN18RR dataset, Hits@1 is increased from 0.243 to 0.343, and Hits@3 is increased from 0.491 to 0.563.

The experimental results show that semantic matching models perform well on the top-K recall metrics, but their Hits@1 results are significantly inferior to those of translational distance models. This is because the features of semantic matching models are based on text, which causes the vector representations of textually similar entities to be close and hence difficult to distinguish. Although translational distance models perform well on Hits@1, they are incapable of understanding text semantics. For new entities not seen in the training set, the predictions of translational distance models are essentially random, while those of semantic matching models remain reliable, which is why the Hits@3 and Hits@10 of LP-BERT can far exceed the translational distance models and achieve state-of-the-art performance.

Similar to KG-BERT and StAR, our model relies on BERT, and we compare LP-BERT with KG-BERT and StAR on WN18RR in detail, including different initialization manners. As shown in Table III, LP-BERT consistently achieves superior performance on most metrics. The evaluation results of LP-BERT based on RoBERTa-base already exceed those of KG-BERT and StAR based on RoBERTa-large. As for empirical efficiency, due to the small number of model parameters and the in-batch negative sampling strategy used during training, our model is faster than KG-BERT and StAR in both the training and inference phases.

TABLE III
QUANTITATIVE COMPARISONS WITH KG-BERT AND STAR ON THE WN18RR DATASET. "TRAIN" DENOTES THE TIME PER TRAINING EPOCH, AND "INFERENCE" DENOTES THE TOTAL INFERENCE TIME ON THE TEST SET. THE VALUES WERE COLLECTED USING A TESLA P40 WITHOUT MIXED PRECISION.

C. Ablation Study

In this part, we tease each module apart from our standard model and keep the other structures as they are, with the goal of justifying the contribution of each improvement. The ablation experiments are conducted on the WN18RR dataset, and we observe similar behavior on all three datasets. Table IV shows the results. After adding the batch-based triple-style negative sampling strategy combined with the focal loss, the appropriate negative sampling ratio dramatically improves the evaluation results, as shown in the second line. The original pre-training weights (BERT or RoBERTa pre-trained weights) are not familiar with the corpus of the link prediction task; after adding the MLM-based pre-training strategy, the evaluation results improve. However, the MLM-based pre-training strategy does not fully excavate the relational information of triples, and the multi-task pre-training strategy combining MLM, MEM, and MRM yields the best evaluation results.

TABLE IV
ABLATION STUDY FOR LP-BERT ON THE WN18RR DATASET.

D. Unseen Entities

To verify the prediction performance of LP-BERT on unseen entities, we re-split the dataset.
Specifically, we randomly select 10% of the entity triples as the validation set and the test set, respectively, ensuring that the training, validation, and test sets do not overlap on any entity. We then re-train and evaluate LP-BERT as well as the other baselines; the results are shown in Table V.

TABLE V
PERFORMANCE OF LP-BERT ON UNSEEN ENTITIES.

Models     Hits@1↑   Hits@3↑   Hits@10↑   MRR↑     MR↓
TransE     0.0010    0.0010    0.0010     0.0010   21708
DistMult   0.000     0.000     0.000      0.000    33955
ComplEx    0.000     0.000     0.000      0.000    24678
RotatE     0.0010    0.0010    0.0010     0.0010   21023
LineaRE    0.0010    0.0010    0.0010     0.0010   21502
QuatDE     0.0010    0.0010    0.0010     0.0010   21301
MLMLM      0.0490    0.0932    0.1413     0.0812   6324
StAR       0.1108    0.2355    0.4053     0.2072   535
LP-BERT    0.1204    0.2978    0.4919     0.2434   1998

As can be seen from the table, all models experience a dramatic performance drop across the five metrics. In particular, the distance based methods are inferior in coping with unseen entities. As mentioned above, such methods only encode the relationship and distance between entities without including semantic information, and are thus incapable of encoding entities not seen in the training set. In contrast, pre-training based models, including MLMLM, StAR, and our LP-BERT, display the ability to cope with unseen entities. Furthermore, our LP-BERT surpasses MLMLM and StAR on almost all metrics, proving its superiority in processing unseen entities. In particular, LP-BERT outperforms StAR, the previous state of the art, by 6-9 points on Hits@3 and Hits@10. However, the score of LP-BERT on the mean rank metric is not as good as on the other metrics, indicating that LP-BERT performs worse on those failed entities.

E. Case Study

To further demonstrate the performance of LP-BERT, additional case studies are conducted on the WN18RR dataset, and a visualization of the results is provided in Table VI. In the table, each row denotes a real sample randomly selected from the test set. The first column is a triple formatted as (left entity, relation, ?) ← right entity. The prediction models employ the left entity and the relation to predict the right entity. From the second column to the fourth column, we present the Top-5 ranked entities with the highest predicted probability obtained using different pre-training approaches. The entities are ordered by predicted probability, and the correct answers are highlighted in bold font. Column 2 presents the predictions of the proposed pre-training strategy of LP-BERT, Column 3 provides the results obtained by only using MLM-based pre-training, and the last column presents the results obtained without pre-training.

For the different approaches, the rank of the correctly predicted result is given in the table (the number in each cell). For LP-BERT, the ranks of the correct predictions are [1, 2, 1, 2, 2], while the ranks are [3, 3, 5, 3, 3] for MLM-based pre-training and [6, 21, 12, 6, 4] without pre-training. The results suggest that LP-BERT provides superior performance compared with MLM-based pre-training and with the model without pre-training. Note that all the presented results are typical and not cherry-picked, so as to avoid misrepresenting the actual performance of the proposed method.

TABLE VI
CASE STUDY USING THE WN18RR DATASET. THE FIRST COLUMN IS A TRIPLE FORMATTED AS (LEFT ENTITY, RELATION, ?) ← RIGHT ENTITY. FROM THE SECOND COLUMN TO THE FOURTH COLUMN, WE PRESENT THE TOP-5 RANKED ENTITIES WITH THE HIGHEST PREDICTED PROBABILITY OBTAINED FROM DIFFERENT PRE-TRAINING APPROACHES (THE PROPOSED LP-BERT PRE-TRAINING, MLM-BASED PRE-TRAINING, AND NO PRE-TRAINING). THE ENTITIES ARE ORDERED BY PREDICTED PROBABILITY. THE CORRECT ANSWERS ARE HIGHLIGHTED IN BOLD FONT. THE NUMBER IN EACH CELL IS THE RANK OF THE CORRECTLY PREDICTED RESULT.

V. CONCLUSION AND FUTURE WORK

This paper focuses on a fundamental yet critical task in the natural language processing field, i.e., semantic network completion. More specifically, we aim to predict the linkage between entities in the semantic network of the knowledge graph.
We employ the language model and introduce LP-BERT, which contains multi-task pre-training and knowledge graph fine-tuning phases. In the pre-training phase, we propose two novel pre-training tasks, MEM and MRM, to encourage the model to learn both the contextual knowledge and the structural information of the knowledge graph. In the fine-tuning phase, we design a triple-style negative sampling within a batch, which greatly increases the proportion of negative samples while keeping the training time almost unchanged. Extensive experimental results on three datasets demonstrate the efficiency and effectiveness of our proposed LP-BERT. In future work, we will explore more diverse pre-training tasks and increase the model parameter size to enable LP-BERT to store larger graph knowledge.

VI. ACKNOWLEDGEMENTS

This work is partially supported by the National Key Research and Development Program of China (2021ZD0112901).

REFERENCES

[1] S. M. Kazemi and D. Poole, "Simple embedding for link prediction in knowledge graphs," Advances in Neural Information Processing Systems, vol. 31, 2018.
[2] M. Sundermeyer, H. Ney, and R. Schlüter, "From feedforward to recurrent LSTM neural networks for language modeling," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 517–529, 2015.
[3] A. Rossi, D. Barbosa, D. Firmani, A. Matinata, and P. Merialdo, "Knowledge graph embedding for link prediction: A comparative analysis," ACM Transactions on Knowledge Discovery from Data, vol. 15, no. 2, pp. 1–49, 2021.
[4] Q. Wang, Z. Mao, B. Wang, and L. Guo, "Knowledge graph embedding: A survey of approaches and applications," IEEE Transactions on Knowledge and Data Engineering, 2017.
[5] T. Song, J. Luo, and L. Huang, "Rot-Pro: Modeling transitivity by projection in knowledge graph embedding," International Conference on Neural Information Processing Systems, vol. 34, 2021.
[6] H. Gao, K. Yang, Y. Yang, R. Y. Zakari, J. W. Owusu, and K. Qin, "QuatDE: Dynamic quaternion embedding for knowledge graph completion," arXiv preprint arXiv:2105.09002, 2021.
[7] Y. Peng and J. Zhang, "LineaRE: Simple but powerful knowledge graph embedding for link prediction," in ICDM. IEEE, 2020, pp. 422–431.
[8] Z. Zhang, J. Cai, Y. Zhang, and J. Wang, "Learning hierarchy-aware knowledge graph embeddings for link prediction," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[9] B. Wang, T. Shen, G. Long, T. Zhou, Y. Wang, and Y. Chang, "Structure-augmented text representation learning for efficient knowledge graph completion," in International World Wide Web Conference, 2021, pp. 1737–1748.
[10] P. Goyal and E. Ferrara, "Graph embedding techniques, applications, and performance: A survey," Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
[11] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, "Translating embeddings for modeling multi-relational data," International Conference on Neural Information Processing Systems, 2013.
[12] Z. Wang, J. Zhang, J. Feng, and Z. Chen, "Knowledge graph embedding by translating on hyperplanes," in Proceedings of the AAAI Conference on Artificial Intelligence, 2014.
[13] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, "Convolutional 2D knowledge graph embeddings," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[14] D. Q. Nguyen, D. Q. Nguyen, T. D. Nguyen, and D. Phung, "A convolutional neural network-based model for knowledge base completion and its application to search personalization," Semantic Web, vol. 10, no. 5, pp. 947–960, 2019.
[15] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, "Modeling relational data with graph convolutional networks," in ESWC. Springer, 2018, pp. 593–607.
[16] R. Socher, D. Chen, C. D. Manning, and A. Ng, "Reasoning with neural tensor networks for knowledge base completion," Advances in Neural Information Processing Systems, vol. 26, 2013.
[17] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma, "Collaborative knowledge base embedding for recommender systems," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 353–362.
[18] H. Xiao, M. Huang, L. Meng, and X. Zhu, "SSP: Semantic space projection for knowledge graph embedding with text descriptions," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[19] Z. Wang, J. Li, Z. Liu, and J. Tang, "Text-enhanced representation learning for knowledge graph," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2016, pp. 4–17.
[20] J. Xu, K. Chen, X. Qiu, and X. Huang, "Knowledge graph representation with jointly structural and textual encoding," arXiv preprint arXiv:1611.08661, 2016.
[21] B. An, B. Chen, X. Han, and L. Sun, "Accurate text-enhanced knowledge graph representation learning," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 745–755.
[22] L. Yao, C. Mao, and Y. Luo, "KG-BERT: BERT for knowledge graph completion," arXiv preprint arXiv:1909.03193, 2019.
[23] L. Clouatre, P. Trempe, A. Zouaq, and S. Chandar, "MLMLM: Link prediction with mean likelihood masked language model," arXiv preprint arXiv:2009.07058, 2020.
[24] D. Nathani, J. Chauhan, C. Sharma, and M. Kaul, "Learning attention-based embeddings for relation prediction in knowledge graphs," arXiv preprint arXiv:1906.01195, 2019.
[25] Y. Zhang, Q. Yao, W. Dai, and L. Chen, "AutoSF: Searching scoring functions for knowledge graph embedding," in International Conference on Data Engineering. IEEE, 2020.
[26] S. Vashishth, S. Sanyal, V. Nitin, and P. Talukdar, "Composition-based multi-relational graph convolutional networks," in International Conference on Learning Representations, 2019.
[27] X. Lv, Y. Gu, X. Han, L. Hou, J. Li, and Z. Liu, "Adapting meta knowledge graph information for multi-hop reasoning over few-shot relations," in Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing, 2019, pp. 3376–3381.
[28] Y. Chen, P. Minervini, S. Riedel, and P. Stenetorp, "Relation prediction as an auxiliary training objective for improving multi-relational graph representations," in AKBC, 2021.
[29] I. Chami, A. Wolf, D.-C. Juan, F. Sala, S. Ravi, and C. Ré, "Low-dimensional hyperbolic knowledge graph embeddings," in Annual Meeting of the Association for Computational Linguistics, 2020.
[30] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang, "RotatE: Knowledge graph embedding by relational rotation in complex space," in International Conference on Learning Representations, 2018.
[31] R. Wang, B. Li, S. Hu, W. Du, and M. Zhang, "Knowledge graph embedding via graph attenuated attention networks," IEEE Access, vol. 8, pp. 5212–5224, 2019.
[32] Y. Cui, W. Che, T. Liu, B. Qin, and Z. Yang, "Pre-training with whole word masking for Chinese BERT," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3504–3514, 2021.
[33] J. Qiang, Y. Li, Y. Zhu, Y. Yuan, Y. Shi, and X. Wu, "LSBert: Lexical simplification based on BERT," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3064–3076, 2021.
[34] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, vol. 26, 2013.
[35] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[36] J. Sarzynska-Wawer, A. Wawer, A. Pawlak, J. Szymanowska, I. Stefaniak, M. Jarkiewicz, and L. Okruszek, "Detecting formal thought disorder by deep contextualized word representations," Psychiatry Research, vol. 304, p. 114135, 2021.
[37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[38] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," International Conference on Learning Representations, 2020.
[39] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, "SpanBERT: Improving pre-training by representing and predicting spans," Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020.
[40] Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu, "Revisiting pre-trained models for Chinese natural language processing," in Conference on Empirical Methods in Natural Language Processing, 2020.
[41] H. Wang, V. Kulkarni, and W. Y. Wang, "Dolores: Deep contextualized knowledge graph embeddings," arXiv preprint arXiv:1811.00147, 2018.
[42] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, "ERNIE: Enhanced language representation with informative entities," arXiv preprint arXiv:1905.07129, 2019.
[43] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi, "COMET: Commonsense transformers for automatic knowledge graph construction," arXiv preprint arXiv:1906.05317, 2019.
[44] J. Mueller and A. Thyagarajan, "Siamese recurrent architectures for learning sentence similarity," in Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
[45] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon, "Representing text for joint embedding of text and knowledge bases," in Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1499–1509.
[46] G. A. Miller, WordNet: An Electronic Lexical Database. MIT Press, 1998.
[47] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: A collaboratively created graph database for structuring human knowledge," in International Conference on Management of Data, 2008, pp. 1247–1250.
[48] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng, "Embedding entities and relations for learning and inference in knowledge bases," arXiv preprint arXiv:1412.6575, 2014.
[49] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, "Complex embeddings for simple link prediction," in International Conference on Machine Learning, 2016.
[50] S. Zhang, Y. Tay, L. Yao, and Q. Liu, "Quaternion knowledge graph embeddings," International Conference on Neural Information Processing Systems, 2019.
[51] I. Balažević, C. Allen, and T. Hospedales, "TuckER: Tensor factorization for knowledge graph completion," in Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing, 2019, pp. 5185–5194.
[52] Y. Bai, Z. Ying, H. Ren, and J. Leskovec, "Modeling heterogeneous hierarchies with relation-specific hyperbolic cones," in International Conference on Neural Information Processing Systems, 2021.
[53] H. Lu and H. Hu, "DensE: An enhanced non-abelian group representation for knowledge graph embedding," arXiv preprint arXiv:2008.04548, 2020.
[54] T. Vu, T. D. Nguyen, D. Q. Nguyen, D. Phung et al., "A capsule network-based embedding model for knowledge graph completion and search personalization," in Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 2180–2189.
[55] Z. Zhang, J. Cai, and J. Wang, "Duality-induced regularizer for tensor factorization based knowledge graph completion," International Conference on Neural Information Processing Systems, vol. 33, 2020.