
Available online at www.sciencedirect.com

Procedia Computer Science 00 (2018) 000–000

www.elsevier.com/locate/procedia

The 4th International Conference on Arabic Computational Linguistics (ACLing 2018),

November 17-19 2018, Dubai, United Arab Emirates

Applying Deep Learning for Arabic Keyphrase Extraction

Muhammad Helmy∗, R. M. Vigneshram, Giuseppe Serra∗, Carlo Tasso
Artificial Intelligence Laboratory, Department of Mathematics, Computer Science, and Physics, University of Udine, Udine 33100, Italy

Abstract
Arabic keyphrase extraction is a crucial task due to the significant and growing amount of Arabic text on the web generated by
a huge population. It is becoming a challenge for the Arabic natural language processing community because of the severe
shortage of resources and published processing systems. In this paper we propose a deep learning based approach for Arabic
keyphrase extraction that achieves better performance than related competitive approaches. We also provide the
community with a large-scale annotated dataset of about 6000 scientific abstracts, which can be used for training, validating, and
evaluating deep learning approaches for Arabic keyphrase extraction.

© 2018 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.
Keywords: Arabic NLP; Keyphrase Extraction; Deep Learning

1. Introduction

A keyphrase (KP) is a phrase composed of one or more words (usually up to five) that expresses a main idea or topic
of a natural language text document [41]. The objective of any automatic keyphrase extraction (KPE) mechanism is
to compile a condensed list of high-quality KPs for a given document.
Considering that a massive amount of text documents is produced daily, KPE has received growing attention as a supportive
task in different fields of Natural Language Processing (NLP), information retrieval, document clustering,
data mining, text summarization, and text classification [19, 15, 11].
Typically, KPE systems have two sequential phases: candidate KP identification, then candidate KP ranking and
selection. In candidate KP identification, a set of potential KPs is extracted from the text according to some
morphological, syntactic [5], and spatial features. After that, every candidate KP is assigned a score, which reflects
its expressiveness according to statistical and semantic measures computed over its document. Finally, the top KPs are selected.
Many paradigms have been developed to tackle the KPE task, including machine learning and graph-based methods

∗ Corresponding author. Tel.: +39-351-2482124 ; fax: +39-432-558499.

E-mail address: [email protected]

1877-0509 © 2018 The Authors. Published by Elsevier B.V.


[42, 40, 7]. These systems need the features for ranking and selecting KPs to be predetermined before running the
system. These features cannot be learnt or modified during the system lifetime. It is difficult to enumerate all of the
relevant features for a specific domain, as many kinds of linguistic, statistical, and external knowledge about the text should
be exploited.
Deep learning (DL) [29] techniques have introduced promising approaches for various NLP tasks which do not require
predetermined features. Since DL approaches require huge datasets to train their models, the KPE approaches
based on this methodology [31, 6] have been designed and developed for English text.
However, there are about 400 million native Arabic speakers in 28 countries and about
1.7 billion Muslims who use Arabic as a ritual language. Therefore, Arabic KPE is a crucial task due to the revolution
in Arabic digital content, especially after the Arab Spring, and the emerging need for annotating and classifying this
content.
Moreover, the Arabic language has specific characteristics [14] which do not exist in western languages like English
and should be considered when Arabic text is processed. These characteristics include diglossia, an agglutinative
nature, nonconcatenative morphology, ambiguity, and high polysemy [18]. To the best of our knowledge, no
DL-based approach targeting the Arabic KPE task has been reported.
To fill this gap, in this paper we introduce a large-scale dataset suitable for training and testing Arabic KPE models,
especially DL approaches. We believe that this is the first work which introduces such a dataset. We then propose a
DL approach for extracting high-quality KPs from Arabic text. The architecture is based on a Bidirectional Long Short-Term
Memory (Bi-LSTM) Recurrent Neural Network, which is able to exploit the previous and future context of a given
word in text. The experimental evaluation shows that the proposed approach outperforms competitive methods. The
dataset is publicly available¹ so that the community can take the opportunity to develop new DL strategies
for this task.

2. Related Work

Deep Learning has achieved good performance in various NLP tasks including, but not limited to, language modeling
[25], automatic machine translation (AMT) [12, 17], named entity recognition [28], sentiment analysis [16],
question answering [37], and lately KPE [44, 6, 31].
DL has only recently been taken into consideration for KPE. As mentioned before, the presented DL approaches are
devoted to the English language because of the abundance of resources created for developing DL systems, e.g., training
datasets and word embeddings.
Zhang et al. [44] proposed a novel deep recurrent neural network (RNN) model to tackle the problem of extracting
important KPs from tweets, where the length restrictions of Twitter-like sites clearly decrease the performance of existing KPE
systems. Tweet length is about 140 characters, whereas the mean size of documents in most of the KPE
datasets is more than 500 words [27, 20]. In addition, a huge dataset of tweets was constructed to evaluate the proposed
approach.
Basaldella et al. [6] proposed a deep Long Short-Term Memory neural network approach to extract
KPs from scientific documents. Since the system does not require hand-crafted features dedicated to a specific field, it
can be utilized in a wide range of domains. The system has been evaluated on the INSPEC dataset [22].
DL has also been utilized in KP generation and assignment, where the system can recognize KPs that do not exist in the
document and takes into account the actual semantic meaning behind the text. Meng et al. [31] introduced a generative
model (deep keyphrase generation) for KP prediction with an encoder-decoder framework. The authors built a dataset
consisting of 20,000 scientific documents to train and evaluate the system.
For the sake of completeness, Deep Learning has also been employed in a few areas of Arabic NLP, such as text categorization
[23], sentiment analysis [4], and question answering [38]. As far as we know, there is no system which employs
DL approaches for KPE from Arabic text.

1 http://ailab.uniud.it/arabickpe/

Fig. 1: An example of a dataset item, where each keyphrase is colored to indicate where it occurs in the abstract and title.

3. Arabic Keyphrase Dataset

No large-scale dataset for Arabic KPE exists that can be used to train, validate, and test a deep learning
model. We found only three small publicly available datasets:

• Arabic Keyphrase Extraction Corpus (AKEC) [20]: the corpus consists of 160 Arabic documents and their
assigned KPs. The authors employed the Crowdflower crowdsourcing platform to construct the collection
with the support of 226 workers. AKEC² is the first dataset which is not customized or annotated by the authors
of a KPE system.
• Arabic Dataset³ [1]: the dataset contains 400 documents and covers 18 different topics. All of the documents
were assigned to only six readers, who read them and extracted 10 KPs for each.
• WikiAll [13]: it is composed of 100 documents collected from Arabic Wikipedia⁴. The average document size
is 804 words and the average number of assigned KPs per document is 8.1. The documents are not preprocessed
or organized in categories. Moreover, Wikipedia metadata is still present in the document text.

Since these datasets are fairly small (about 660 documents in total), they cannot be used as
training datasets, although they may be employed as test sets. Therefore, we built a large dataset of
abstracts of scientific articles written and published in Arabic.
We targeted the web sites of scientific journals of Arabic universities and some Arabic literature publishers to
crawl the freely available abstracts with their keyphrases, titles, and topics. A set of 6219 abstracts was crawled.
The total number of KPs assigned by authors is 26,685, of which 15,730 KPs appear verbatim in the text and 10,955 KPs
do not occur in the abstract text. Finally, we removed all of the absent KPs and excluded the documents which have no
assigned KPs. The total number of documents after preprocessing was about 6000. The total number of
words in the dataset text is 1,223,723, and the vocabulary size is 68,108 unique words.
The collection of abstracts was arbitrarily split into three sets: a training set of about 4,000 documents (used for building and
training the model), a validation set of about 1,000 documents (used to evaluate model variants with different
parameters and select the best performing one), and a test set with the remaining 1,000 abstracts (used to obtain impartial
results for different systems). The dataset is stored in JSON format, where each item
(document) contains the title, the abstract, the keyphrases, and the topic the item belongs to. Figure 1 shows an
example of a dataset item.
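The filtering step described above (drop absent KPs, then drop documents left without any KP) can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the JSON key names (`title`, `abstract`, `keyphrases`, `topic`), the English toy items, and the plain substring match over title plus abstract are all assumptions made for the example.

```python
import json


def split_keyphrases(item):
    """Split an item's keyphrases into those found verbatim in the text and the rest.

    Assumption: 'verbatim' is approximated by a whitespace-normalized substring
    match against the title followed by the abstract.
    """
    text = " ".join((item["title"] + " " + item["abstract"]).split())
    present = [kp for kp in item["keyphrases"] if " ".join(kp.split()) in text]
    absent = [kp for kp in item["keyphrases"] if kp not in present]
    return present, absent


def clean_dataset(items):
    """Keep only present keyphrases and drop items that end up with none."""
    cleaned = []
    for item in items:
        present, _ = split_keyphrases(item)
        if present:
            cleaned.append({**item, "keyphrases": present})
    return cleaned


# Hypothetical items in the assumed JSON layout (toy English text for readability).
sample = json.loads("""
[
  {"title": "Keyphrase extraction for Arabic text",
   "abstract": "We study deep learning for keyphrase extraction.",
   "keyphrases": ["keyphrase extraction", "deep learning", "neural networks"],
   "topic": "computer science"},
  {"title": "Another title",
   "abstract": "An abstract with no matching phrase.",
   "keyphrases": ["information retrieval"],
   "topic": "computer science"}
]
""")

cleaned = clean_dataset(sample)
```

In this toy run the first item keeps two of its three keyphrases and the second item is removed entirely, mirroring the reduction from 6219 crawled abstracts to about 6000 documents.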
Table 1 shows statistics about the dataset. Docs refers to the total number of documents in each set. KPs is
the total number of KPs that appear verbatim in the text. Words is the count of all words within the documents, including
repetitions. Vocabulary is the number of unique words, i.e., without repetition. Finally, the table presents the
maximum, minimum, average, and median values of document size (Doc size), in words, and of the number of KPs (No.
of KPs) assigned to documents.

2 https://github.com/ailab-uniud/akec
3 https://github.com/logmani/ArabicDataset
4 https://ar.wikipedia.org/wiki/

Table 1: Statistical information of our dataset.

                      Training   Validation   Test
Docs                  4000       1000         1000
KPs                   10582      2583         2565
Words                 1026938    195630       196785
Vocabulary            62204      24424        24373
Doc size     Max      994        761          634
             Min      25         45           31
             Avg      210        207.24       209.12
             Med      195        194          194
No. of KPs   Max      11         9            13
             Min      1          1            1
             Avg      2.69       2.74         2.72
             Med      3          3            3

To determine whether our dataset is comparable to the well-established English datasets, we compare the total
number of KPs and the shares of present and absent KPs in our dataset against four English datasets. The comparison is presented
in Table 2. The four English datasets with author-assigned keyphrases are:

• Krapivin [26]: includes about 2,304 high-quality documents representing scientific articles from the computer science
domain. It is dedicated to training and evaluating machine learning-based KPE approaches.
• NUS [35]: consists of 211 conference articles, with a length range of 4-12 pages. The documents were originally
downloaded as PDF documents using the Google SOAP API and converted into plain text format. Volunteers were
recruited to assign KPs to each document, which allows multiple judgments besides the author-assigned KPs.
• Inspec [22]: is a collection of 2,000 abstracts, with their corresponding titles and KPs, from Inspec⁵, which is an
indexing database of scientific and technical literature published by the Institution of Engineering and Technology
(IET)⁶. The dataset was randomly divided into three parts: a training set consisting of 1,000 documents,
a validation set consisting of 500 documents, and a test set with the remaining 500 abstracts.
• SemEval-2010 [24]: it is composed of 288 documents collected from the ACM Digital Library. The dataset was constructed
for evaluating participant systems of Task 5 of the Workshop on Semantic Evaluation 2010 (SemEval-2010)⁷.
The documents range from 6 to 8 pages in size and cover a variety of topics. The collection is
divided into three parts: training (144 documents), test (100 documents) and trial (40 documents).

4. The Proposed System

A DL model was developed based on LSTM. We utilized existing general-purpose Arabic word embeddings for
training the model. The system components are described in the following subsections.

4.1. Word Embeddings

Word embeddings map the words or phrases of natural text into vectors of real numbers. The two main approaches
available for building word embeddings from raw text are GloVe [36] and the word2vec model [32]. Word2vec,
in turn, has two approaches for computing the word vectors: skip-gram, which predicts the context words from a
given source word, and Continuous Bag-Of-Words (CBOW), which predicts a word given its context window [33].
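To make the difference between the two word2vec objectives concrete, the sketch below only builds the training examples each objective would see; it does not train any vectors. The function names, the window size, and the toy sentence are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (source_word, context_word) pairs: the source word predicts each neighbour."""
    for i, source in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield source, tokens[j]


def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) examples: the context window predicts the word."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            yield context, target


tokens = ["I", "drove", "my", "car"]
sg = list(skipgram_pairs(tokens, window=1))
cb = list(cbow_examples(tokens, window=1))
```

With a window of one, skip-gram turns the word "drove" into two separate examples, ("drove", "I") and ("drove", "my"), whereas CBOW produces a single example that predicts "drove" from the context ["I", "my"].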

5 https://inspecdirect.theiet.org/
6 https://www.theiet.org/
7 http://semeval2.fbk.eu/semeval2.php?location=tasks#T6

Table 2: Comparison of present and absent keyphrase percentages: four English datasets [31] and our dataset.

Dataset        #Keyphrase   % Present   % Absent
Inspec         19,275       55.69       44.31
Krapivin       2,461        44.74       55.26
NUS            2,834        67.75       32.25
SemEval        12,296       42.01       57.99
Our dataset    26,685       58.95       41.05
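The percentages in the "Our dataset" row follow directly from the counts reported in Section 3. The short check below recomputes them; it uses only figures stated in the text and in Table 1, and the variable names are ours.

```python
# Counts reported in Section 3.
total_kps = 26_685
present_kps = 15_730
absent_kps = 10_955

# Present KPs per split, as listed in Table 1 (training, validation, test).
present_per_split = [10_582, 2_583, 2_565]

pct_present = round(100 * present_kps / total_kps, 2)
pct_absent = round(100 * absent_kps / total_kps, 2)
splits_sum = sum(present_per_split)
```

The present and absent counts add up to the total, the per-split KP counts in Table 1 add up to the 15,730 present KPs, and the rounded shares agree with the 58.95 and 41.05 shown in Table 2.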

Table 3: Forms of Arabic text during preprocessing.

Type        Text
Original    رَكِبْتُ سَيَّارَتِي إِلَى مَدِينَةِ القَاهِرَةِ
Trans       I drove my car to Cairo city
No Diac     ركبت سيارتي إلى مدينة القاهرة
Normal      ركبت سيارتي الى مدينة القاهرة
Segmented   ركب ت سيارة ي الى مدينة القاهرة

Fig. 2: Model architecture

Researchers working on Arabic DL systems typically build customized word embeddings for their applications, and most
of these are not published [9, 3, 2]. However, we found two public general-purpose word embedding sets: the first one uses
GloVe and Word2Vec with a vector size of 300 only [43]; the second one is called AraVec and uses the Word2Vec
approach with three different vector sizes (300, 100, and 50) [39]. We decided to use AraVec for our system since
the first one includes bigram phrases, which are not required in our pipeline.

4.2. Model Architecture

KPE is performed by the following procedure. The document text is first preprocessed to represent it in the form
of separated words. Preprocessing Arabic text includes removing Arabic diacritics (which represent short vowels and
consonant doubling), normalizing the different shapes of an Arabic character into a single shape (e.g., the Alef letter has
different shapes, أ, إ, and آ, which are normalized to ا), and finally segmenting the text into single tokens (an Arabic word
may contain more than one token) using the Stanford CoreNLP Toolkit [30]. Table 3 shows the different forms of Arabic text during
preprocessing. Then, the documents are divided into sentences and the tokens are associated with their word embedding
representations.
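The first two preprocessing steps can be approximated with regular expressions over Unicode code points, as in the sketch below. It is an illustration under stated assumptions rather than the authors' pipeline: the exact set of diacritics and Alef variants is our choice, and the final step is a plain whitespace split standing in for the Stanford CoreNLP segmenter, which additionally separates clitics.

```python
import re

# Assumption: diacritics are the marks U+064B..U+0652 (tanween, short vowels,
# shadda, sukun) plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumption: the Alef shapes to normalize are alef with madda (U+0622),
# alef with hamza above (U+0623), and alef with hamza below (U+0625).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALEF = "\u0627"


def remove_diacritics(text):
    """Strip Arabic diacritic marks, leaving the base letters."""
    return DIACRITICS.sub("", text)


def normalize_alef(text):
    """Map the Alef variants onto the bare Alef."""
    return ALEF_VARIANTS.sub(BARE_ALEF, text)


def preprocess(text):
    """Remove diacritics, normalize Alef, then split on whitespace.

    The whitespace split is only a placeholder for morphological segmentation.
    """
    return normalize_alef(remove_diacritics(text)).split()


# The word 'ila' ("to") written with hamza-below alef and two short vowels.
diacritized_ila = "\u0625\u0650\u0644\u064E\u0649"
tokens = preprocess(diacritized_ila)
```

Keeping the code points explicit avoids depending on how an editor renders or reorders right-to-left text.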
Let the word embeddings of the input tokens be represented as {x1, ..., xn}; a Recurrent Neural Network (RNN) determines
the output vector of each token by iterating over the sequence.
The embedding layer works as a lookup table that transforms discrete features, such as the words of Arabic text,
into continuous real-valued vector representations, which are then concatenated and provided to the neural network.
Instead of a feed-forward network, we utilize a bidirectional long short-term memory (Bi-LSTM) network.
KPE can be considered a sequence labeling task, which involves the algorithmic assignment of a categorical
label to each member of a sequence of observed values. In such a task, a bidirectional LSTM model can take into
consideration a sufficiently large amount of context on both sides of a word and eliminates the limited-context problem
that applies to any feed-forward model.

Table 4: Comparison results on our dataset.

              Top 5 KPs              Top 10 KPs             Top 15 KPs
Approach      P      R      F1       P      R      F1       P      R      F1       MAP
TF-IDF        0.170  0.255  0.204    0.085  0.256  0.128    0.057  0.256  0.093    0.255
KP-Miner      0.201  0.303  0.242    0.101  0.304  0.151    0.067  0.305  0.110    0.303
Our System    0.305  0.444  0.361    0.208  0.588  0.308    0.160  0.671  0.258    0.471

The bidirectional LSTM network also exploits the future context: with this architecture we are able to make use
of both the past context and the future context of a specific word. It consists of two separate hidden layers: it computes the
forward hidden sequence, then the backward hidden sequence, and finally combines the forward hidden
sequence and the backward hidden sequence to generate the output. The combination (concat) layer is connected to a
softmax output layer with three neurons for each word. The three neurons are associated with three possible output
classes, which respectively mark tokens that are not part of a keyphrase, the first token of a keyphrase, and the internal tokens
of a keyphrase. Dropout is applied between the Bi-LSTM and the dense layer to prevent overfitting.
Figure 2 shows the basic structure of the model.
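The three output classes imply that each training document is converted into one label per token. A minimal sketch of that conversion is shown below; the integer class ids, the function name, and the English toy sentence are illustrative assumptions, and overlapping keyphrases are handled naively (a later match overwrites an earlier one).

```python
# Illustrative class ids: not part of a keyphrase, first token, internal token.
NOT_KP, FIRST, INSIDE = 0, 1, 2


def label_tokens(tokens, keyphrases):
    """Return one class id per token, given keyphrases as lists of tokens."""
    labels = [NOT_KP] * len(tokens)
    for kp in keyphrases:
        n = len(kp)
        if n == 0:
            continue
        # Mark every occurrence of the keyphrase in the token sequence.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == kp:
                labels[start] = FIRST
                for k in range(start + 1, start + n):
                    labels[k] = INSIDE
    return labels


tokens = "we propose a deep learning approach for keyphrase extraction".split()
kps = [["deep", "learning"], ["keyphrase", "extraction"]]
labels = label_tokens(tokens, kps)
```

At prediction time the mapping runs in the other direction: a token labelled as a first token, together with the internal tokens that immediately follow it, is read back as one extracted keyphrase.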

4.3. Model Implementation

We used Keras⁸ with TensorFlow⁹ as a backend, which in turn allowed us to employ CUDA¹⁰ to train our neural
network on a GPU (GeForce GTX 1080 Ti graphics card)¹¹. After trying different configurations for the
network, we obtained the best results with 150 neurons for the Bi-LSTM layer, 150 neurons for the hidden
dense layer, and a value of 0.25 for the dropout layers. During training, we used the Root Mean Square
Propagation (RMSProp) optimization algorithm and a batch size of 32. The Keras early stopping rule is used to
terminate the training process when the training loss does not decrease for two consecutive epochs.
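A network with the reported hyperparameters could be assembled in Keras roughly as follows. This is a sketch under several assumptions that the text does not settle: 150 is taken as the number of units per LSTM direction, the hidden dense layer uses a ReLU activation, the loss is categorical cross-entropy, the embedding dimension defaults to 300, and the embedding layer is randomly initialized here rather than loaded from AraVec. TensorFlow is imported lazily inside the functions so the configuration can be inspected without it installed.

```python
# Values reported in the text; the dictionary keys are our own naming.
HPARAMS = {
    "lstm_units": 150,    # Bi-LSTM layer size (assumed to be per direction)
    "dense_units": 150,   # hidden dense layer size
    "dropout": 0.25,      # dropout rate
    "batch_size": 32,
    "patience": 2,        # epochs without training-loss improvement before stopping
    "num_classes": 3,     # not a keyphrase / first token / internal token
}


def build_model(vocab_size, embedding_dim=300, hp=HPARAMS):
    """Assemble an embedding -> Bi-LSTM -> dense -> softmax sequence labeller."""
    from tensorflow.keras import layers, models, optimizers

    model = models.Sequential([
        layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
        layers.Bidirectional(layers.LSTM(hp["lstm_units"], return_sequences=True)),
        layers.Dropout(hp["dropout"]),
        # The activation of the hidden dense layer is an assumption.
        layers.TimeDistributed(layers.Dense(hp["dense_units"], activation="relu")),
        layers.Dropout(hp["dropout"]),
        layers.TimeDistributed(layers.Dense(hp["num_classes"], activation="softmax")),
    ])
    model.compile(optimizer=optimizers.RMSprop(), loss="categorical_crossentropy")
    return model


def early_stopping(hp=HPARAMS):
    """Stop when the training loss has not improved for `patience` epochs."""
    from tensorflow.keras.callbacks import EarlyStopping

    return EarlyStopping(monitor="loss", patience=hp["patience"])
```

Training would then pass `batch_size=HPARAMS["batch_size"]` and `callbacks=[early_stopping()]` to `model.fit`, with one-hot label sequences matching the three output classes.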

5. Evaluation and Experimental Results

The evaluation experiments were conducted on two datasets: our test dataset and the WikiAll dataset [13]. We chose
WikiAll as an evaluation dataset because it has been used by most of the published Arabic KPE systems.
The first experiment was carried out using our test dataset. We compare the performance of our system against two
available published systems. The first one is Distiller TF-IDF (D-TF-IDF) [8], which is a pipeline implemented within
the Distiller framework for extracting KPs using the simple statistical approach of Term Frequency-Inverse Document
Frequency (TF-IDF). Distiller [10] is a knowledge extraction framework which provides flexible, multilingual KPE
functionalities for about five languages, one of which is Arabic. The second system is KP-Miner [13], which is based on
an unsupervised approach to KPE. It does not need to be trained on a particular document set in order to perform its
task. KP-Miner can extract KPs from a single document or a corpus of documents. Its heuristic rules can be configured
to suit the document domain and the user's understanding of the document nature.
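For readers unfamiliar with the TF-IDF baseline, the sketch below scores the terms of one document against a small corpus using the textbook form tf × log(N / df). It is not the Distiller implementation; the single-word candidates, the unsmoothed logarithm, and the toy corpus are assumptions made for brevity.

```python
import math
from collections import Counter


def tfidf_scores(corpus, index):
    """Score each term of corpus[index] with tf * log(N / df).

    corpus is a list of token lists; the scored document must be part of the
    corpus so that every term has a document frequency of at least one.
    """
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))          # count each term once per document
    doc = corpus[index]
    tf = Counter(doc)
    return {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}


corpus = [
    ["arabic", "keyphrase", "extraction"],
    ["arabic", "sentiment", "analysis"],
    ["arabic", "machine", "translation"],
]
scores = tfidf_scores(corpus, 0)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The term "arabic" occurs in every document and therefore scores zero, while "keyphrase" and "extraction" are specific to the first document and rank above it.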
We check the systems' performance over the top 5, 10, and 15 candidate KPs returned by each system. The lemmatized
versions of the returned KPs are matched with the lemmatized KPs assigned to the dataset documents. Then,
we calculate Precision (P), Recall (R), F1-score (F1), and Mean Average Precision (MAP) as evaluation metrics.
Table 4 shows the comparison results, where our system achieves the highest score on every metric.
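The metrics named above can be computed per document as in the following sketch. The exact conventions used in the paper are not spelled out, so two choices here are assumptions: precision divides by the number of KPs actually returned within the top k, and average precision divides by the number of gold KPs. Exact string equality stands in for the lemmatized matching, and the ranked list is assumed to contain no duplicates.

```python
def precision_recall_f1_at_k(ranked, gold, k):
    """Precision, recall and F1 over the top-k ranked keyphrases of one document."""
    top = ranked[:k]
    hits = sum(1 for kp in top if kp in gold)
    p = hits / len(top) if top else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def average_precision(ranked, gold):
    """Mean of the precision values at each rank where a gold keyphrase appears."""
    hits, total = 0, 0.0
    for rank, kp in enumerate(ranked, start=1):
        if kp in gold:
            hits += 1
            total += hits / rank
    return total / len(gold) if gold else 0.0


def mean_average_precision(runs):
    """MAP over (ranked, gold) pairs, one pair per document."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs) if runs else 0.0


ranked = ["a", "b", "c", "d"]
gold = {"a", "c", "x"}
p, r, f1 = precision_recall_f1_at_k(ranked, gold, k=2)
ap = average_precision(ranked, gold)
```

The same F1 formula reproduces the table entries from their P and R columns; for example, P = 0.305 and R = 0.444 give 2 × 0.305 × 0.444 / 0.749 ≈ 0.36, matching the reported 0.361 up to rounding of the inputs.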
The second experiment was conducted on the WikiAll dataset. We compare the performance in terms of Precision,
Recall, and the average number of correctly detected KPs (Avg. Keys), which are the metrics used by the competing systems. Table 5
shows the performance of five different approaches and our approach. The five approaches are KP-Miner [13], Arabic
TF-IDF, Word2Vec, the Hybrid model [34], and MorphKE [21].

8 https://keras.io/
9 https://www.tensorflow.org/
10 https://developer.nvidia.com/cuda-toolkit
11 https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/

Table 5: Performance results of our approach compared to other published results on the WikiAll dataset.

Approach      Precision   Recall   Avg. Keys
KP-Miner      0.131       0.383    2.491
TF-IDF        0.112       0.349    2.253
Word2Vec      0.092       0.294    1.701
Hybrid        0.101       0.312    2.002
MorphKE       0.131       0.377    2.530
Our System    0.216       0.419    2.851

In Arabic TF-IDF, the candidate KPs are weighted and scored using the TFxIDF algorithm, which gives low weights
to unimportant KPs. In addition, it uses a list of stopwords, which is very beneficial for Arabic text as some stopwords
in Arabic are compound ones and do not occur frequently. The Word2Vec approach employed Google's Word2Vec¹²
library to measure the similarity between the candidate patterns and the document title. The system was trained using
a Wikipedia Arabic dump¹³ to get the vector representation of the words, then cosine similarity was used to measure
the distance between the title of each document and its valid KP patterns. The hybrid approach is a combination
of the Arabic TFxIDF and Word2Vec models [34]. MorphKE is an unsupervised approach based on utilizing the
rich Arabic morphology and syntax to generate a restricted set of meaningful candidate KPs for a single document
[21]. The experimental results in Table 5 show that the proposed approach achieves higher precision, recall, and Avg. Keys
than the previous methods.
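The title-similarity idea behind the Word2Vec baseline can be illustrated with a few lines of plain Python. How that system builds a vector for a multi-word title or candidate is not described here, so averaging the word vectors is an assumption, and the two-dimensional vectors are invented toy values rather than trained embeddings.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors (assumed phrase representation)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical word vectors, for illustration only.
toy_vectors = {
    "keyphrase": [1.0, 0.0],
    "extraction": [1.0, 0.0],
    "weather": [0.0, 1.0],
}

title_vec = mean_vector([toy_vectors["keyphrase"], toy_vectors["extraction"]])
sim_related = cosine(title_vec, toy_vectors["keyphrase"])
sim_unrelated = cosine(title_vec, toy_vectors["weather"])
```

Candidates would then be ranked by their similarity to the title vector, so a candidate pointing in the same direction as the title outranks an unrelated one.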

6. Conclusion

In this article, we introduced a deep learning KPE approach based on the Bi-LSTM neural network model for
extracting keyphrases from Arabic text. Since there is a shortage of large-scale datasets for training and evaluating
deep learning models for Arabic KPE, we constructed a new dataset consisting of about 6,000 abstracts of scientific
Arabic documents. The dataset attributes are comparable to those of the English datasets. We used the dataset to train, validate,
and test our approach against the existing systems. The evaluation results show that our approach achieves state-of-the-art
performance in the Arabic KPE domain.

References

[1] Al Logmani, M., Al Muhtaseb, H., 2017. Arabic dataset for automatic keyphrase extraction, in: International Conference on Computer Science,
Information Technology and Applications, pp. 217–222.
[2] Al-Sallab, A., Baly, R., Hajj, H., Shaban, K.B., El-Hajj, W., Badaro, G., 2017. Aroma: A recursive deep learning model for opinion mining in
arabic as a low resource language. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 16, 1–20.
[3] Altowayan, A.A., Tao, L., 2016. Word embeddings for arabic sentiment analysis, in: IEEE International Conference on Big Data, pp. 3820–
3825.
[4] Badaro, G., Baly, R., Hajj, H., Habash, N., El-Hajj, W., 2014. A large scale arabic sentiment lexicon for arabic opinion mining, in: Conference
on Empirical methods in Natural Language Processing (EMNLP) Workshop on Arabic Natural Language Processing (ANLP), pp. 165–173.
[5] Barker, K., Cornacchia, N., 2000. Using noun phrase heads to extract document keyphrases, in: Conference of the Canadian Society for
Computational Studies of Intelligence, pp. 40–52.
[6] Basaldella, M., Antolli, E., Serra, G., Tasso, C., 2018a. Bidirectional lstm recurrent neural network for keyphrase extraction, in: Italian Research
Conference on Digital Libraries (IRCDL), pp. 180–187.
[7] Basaldella, M., Helmy, M., Antolli, E., Popescu, M.H., Serra, G., Tasso, C., 2017. Exploiting and evaluating a supervised, multilanguage
keyphrase extraction pipeline for under-resourced languages, in: International Conference Recent Advances in Natural Language Processing
(RANLP), pp. 78–85.
[8] Basaldella, M., Serra, G., Tasso, C., 2018b. The distiller framework: Current state and future challenges, in: IRCDL, pp. 93–100.
[9] Dahou, A., Xiong, S., Zhou, J., Haddoud, M.H., Duan, P., 2016. Word embeddings and convolutional neural network for arabic sentiment
classification, in: International Conference on Computational Linguistics, pp. 2418–2427.

12 https://code.google.com/p/word2vec/
13 https://github.com/anastaw/Arabic-Wikipedia-Corpus

[10] De Nart, D., Degl’Innocenti, D., Tasso, C., 2015. Introducing distiller: a lightweight framework for knowledge extraction and filtering, in: The
23rd Conference on User Modelling, Adaptation and Personalization (UMAP).
[11] Degl’Innocenti, D., De Nart, D., Helmy, M., Tasso, C., 2018. Fast, accurate, multilingual semantic relatedness measurement using wikipedia
links, in: Intelligent Natural Language Processing: Trends and Applications, pp. 571–584.
[12] Deselaers, T., Hasan, S., Bender, O., Ney, H., 2009. A deep learning approach to machine transliteration, in: Association for Computational
Linguistics (ACL) Workshop on Statistical Machine Translation, pp. 233–241.
[13] El-Beltagy, S.R., Rafea, A., 2009. Kp-miner: A keyphrase extraction system for english and arabic documents. Information Systems 34,
132–144.
[14] Farghaly, A., Shaalan, K., 2009. Arabic natural language processing: Challenges and solutions. TALLIP 8, 14:1–14:22.
[15] Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G., 1999. Domain-specific keyphrase extraction, in: International Joint
Conference on Artificial Intelligence, pp. 668–673.
[16] Glorot, X., Bordes, A., Bengio, Y., 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach, in: International
Conference on Machine Learning (ICML), pp. 513–520.
[17] Guzmán, F., Bouamor, H., Baly, R., Habash, N., 2016. Machine translation evaluation for arabic using morphologically-enriched embeddings,
in: International Conference on Computational Linguistics, pp. 1398–1408.
[18] Habash, N.Y., 2010. Introduction to arabic natural language processing. Synthesis Lectures on Human Language Technologies 3, 1–187.
[19] Hasan, K.S., Ng, V., 2014. Automatic keyphrase extraction: A survey of the state of the art, in: ACL, pp. 1262–1273.
[20] Helmy, M., Basaldella, M., Maddalena, E., Mizzaro, S., Demartini, G., 2016a. Towards building a standard dataset for arabic keyphrase
extraction evaluation, in: 20th International Conference on Asian Language Processing (IALP), pp. 26–29.
[21] Helmy, M., De Nart, D., Degl’Innocenti, D., Tasso, C., 2016b. Leveraging arabic morphology and syntax for achieving better keyphrase
extraction, in: 20th International Conference on Asian Language Processing (IALP), pp. 340–343.
[22] Hulth, A., 2003. Improved automatic keyword extraction given more linguistic knowledge, in: EMNLP, pp. 216–223.
[23] Jindal, V., 2016. A personalized markov clustering and deep learning approach for arabic text categorization, in: ACL Student Research
Workshop, pp. 145–151.
[24] Kim, S.N., Medelyan, O., Kan, M.Y., Baldwin, T., 2010. Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles, in:
ACL Workshop on Semantic Evaluation, pp. 21–26.
[25] Kim, Y., Jernite, Y., Sontag, D., Rush, A.M., 2016. Character-aware neural language models., in: Association for the Advancement of Artificial
Intelligence, pp. 2741–2749.
[26] Krapivin, M., Autayeu, A., Marchese, M., 2009. Large dataset for keyphrases extraction. Technical Report. University of Trento.
[27] Krapivin, M., Autayeu, M., Marchese, M., Blanzieri, E., Segata, N., 2010. Improving machine learning approaches for keyphrases extraction
from scientific documents with natural language knowledge, in: the joint JCDL/ICADL international digital libraries conference, pp. 102–111.
[28] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C., 2016. Neural architectures for named entity recognition, in: Conference
of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, pp. 260–270.
[29] LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
[30] Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D., 2014. The Stanford CoreNLP natural language processing
toolkit, in: ACL, System Demonstrations, pp. 55–60.
[31] Meng, R., Zhao, S., Han, S., He, D., Brusilovsky, P., Chi, Y., 2017. Deep keyphrase generation, in: ACL, pp. 582–592.
[32] Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Efficient estimation of word representations in vector space. arXiv preprint
arXiv:1301.3781 .
[33] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013b. Distributed representations of words and phrases and their
compositionality, in: Advances in neural information processing systems, pp. 3111–3119.
[34] Nabil, M., Atiya, A., Aly, M., 2015. New approaches for extracting arabic keyphrases, in: International Conference on Arabic Computational
Linguistics, pp. 133–137.
[35] Nguyen, T.D., Kan, M.Y., 2007. Keyphrase extraction in scientific publications, in: International Conference on Asian Digital Libraries, pp.
317–326.
[36] Pennington, J., Socher, R., Manning, C., 2014. Glove: Global vectors for word representation, in: EMNLP, pp. 1532–1543.
[37] Qiu, X., Huang, X., 2015. Convolutional neural tensor network architecture for community-based question answering., in: International Joint
Conferences on Artificial Intelligence (IJCAI), pp. 1305–1311.
[38] Romeo, S., Da San Martino, G., Belinkov, Y., Barrón-Cedeño, A., Eldesouki, M., Darwish, K., Mubarak, H., Glass, J., Moschitti, A., 2017.
Language processing and learning models for community question answering in arabic. Information Processing & Management .
[39] Soliman, A.B., Eissa, K., El-Beltagy, S.R., 2017. Aravec: A set of arabic word embedding models for use in arabic nlp. Procedia Computer
Science 117, 256–265.
[40] Tixier, A., Malliaros, F., Vazirgiannis, M., 2016. A graph degeneracy-based approach to keyword extraction, in: EMNLP, pp. 1860–1870.
[41] Turney, P.D., 2000. Learning algorithms for keyphrase extraction. Information retrieval 2, 303–336.
[42] Ying, Y., Qingping, T., Qinzheng, X., Ping, Z., Panpan, L., 2017. A graph-based approach of automatic keyphrase extraction. Procedia
Computer Science 107, 248–255.
[43] Zahran, M.A., Magooda, A., Mahgoub, A.Y., Raafat, H., Rashwan, M., Atyia, A., 2015. Word representations in vector space and their
applications for Arabic, in: International Conference on Intelligent Text Processing and Computational Linguistics, pp. 430–443.
[44] Zhang, Q., Wang, Y., Gong, Y., Huang, X., 2016. Keyphrase extraction using deep recurrent neural networks on Twitter, in: EMNLP, pp.
836–845.
