RENAR: A Rule-Based Arabic Named Entity Recognition System
Named entity recognition has served many natural language processing tasks such as information retrieval,
machine translation, and question answering systems. Many researchers have addressed the name
identification issue in a variety of languages and recently some research efforts have started to focus on
named entity recognition for the Arabic language. We present a working Arabic information extraction (IE)
system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types
person, organization, location, date, and number, as well as quotations (direct reported speech) by and about
people. The named entity recognition (NER) system was not developed for Arabic, but instead a multilingual
NER system was adapted to also cover Arabic. The Semitic language Arabic substantially differs from
the Indo-European and Finno-Ugric languages currently covered. This article thus describes what Arabic
language-specific resources had to be developed and what changes needed to be made to the rule set in order
to be applicable to the Arabic language. The achieved evaluation results are generally satisfactory, but could
be improved for certain entity types.
Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing
General Terms: Algorithms, Languages, Design
Additional Key Words and Phrases: Named entity recognition, rule-based systems, Arabic natural language
processing, information extraction
ACM Reference Format:
Zaghouani, W. 2012. RENAR: A rule-based Arabic named entity recognition system. ACM Trans. Asian
Lang. Inform. Process. 11, 1, Article 2 (March 2012), 13 pages.
DOI = 10.1145/2090176.2090178 http://doi.acm.org/10.1145/2090176.2090178
Author’s address: W. Zaghouani, University of Pennsylvania, Linguistic Data Consortium, 3600 Market
Street, Suite 810, Philadelphia, PA, 19104, USA; email: [email protected].
ACM Transactions on Asian Language Information Processing, Vol. 11, No. 1, Article 2, Publication date: March 2012.
1. INTRODUCTION
The named entity recognition (NER) task was first introduced in 1995
by the Message Understanding Conference (MUC-6) [Grishman and Sundheim 1996].
Three subtasks were defined for MUC-6: ENAMEX for proper names (persons,
locations, and organizations), NUMEX for numeric expressions, and TIMEX for
temporal expressions.
NER is used in various areas of natural language processing (NLP) and
information retrieval (IR) [Sekine 2004]. For instance, news analysis systems such as
NewsVine, SiloBreaker, and the Europe Media Monitor (EMM) application NewsExplorer
[Steinberger et al. 2009] go beyond the basic IR setting by additionally
extracting information from the news text and by further linking information found in
the news.
Most such systems are monolingual (typically English). NewsExplorer, in contrast,
currently covers 19 languages, including Arabic. Such high multilinguality is most
likely to be achieved only if the effort for each language is limited. The EMM family
of applications processes an average of approximately 100,000 articles per day. The
Arabic language is relatively well represented within EMM.
Steinberger et al. [2008] propose to use language-independent rules and as few
language-specific resources as possible. These should furthermore be simple and easy
to produce, and they should be organized in a compositional manner so that any new
language can simply be plugged in to the overall system once the language-specific
resources are available. In this article, we present the effort of adding the Semitic
language Arabic to EMM-NewsExplorer.
Arabic is significantly different from the other, mostly Indo-European EMM lan-
guages [Vergyri et al. 2004], and is thus a good test case for the proposed method. In
the next section, we discuss the state of the art of Arabic NER. In Section 3, we present
some challenges specific to the Arabic language. In Section 4, we discuss the method
we used to build our resources. Section 5 illustrates the architecture of the system and
its various components. In Section 6, we present the evaluation results achieved with
the described method. Section 7 summarizes the results.
2. RELATED WORK
NER is by now a well-established task and has been integrated into several products.
For a few widely used languages, a large variety of NER tools exist [Nadeau and
Sekine 2009], and three main approaches have been used to implement these tools:
rule-based, statistical, and hybrid.
Rule-based methods are usually based on an existing lexicon of proper names
and a local grammar that describes patterns to match NEs using internal evidence
(gazetteers) and external evidence provided by the context in which the NEs appear.
Statistical and machine learning approaches generally require a large amount of
manually annotated training data. Hybrid methods combine the statistical
and the rule-based approaches. A remaining challenge in the field is how to develop
such systems quickly and at minimal cost. For Arabic, available software is mostly
commercial, including ANEE (Coltec), IdentiFinder (BBN), NetOwlExtractor
(NetOwl), Siraj (Sakhr), Clear Tags (ClearForest), and InXight-Smart-Discovery-Entity-Extractor
(InXight). Little information is available on the inner workings and
on formal evaluation results of these systems.
Maloney and Niv [1998] presented TAGARAB, an Arabic name recognizer that
combines the pattern-matching module with a morphological analyzer to improve per-
formance. TAGARAB was evaluated on a corpus of 3,214 tokens and the overall per-
formance obtained for the various categories (time, person, location, and number) was
a precision of 89.5%, a recall of 80.8% and an F-measure of 85%.
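The relation between the three figures reported for TAGARAB can be checked with a short calculation. The sketch below assumes the balanced F-measure (the harmonic mean of precision and recall); the cited work may have used a different weighting, and the function name is ours.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TAGARAB figures quoted above, in percent.
tagarab_f = f_measure(89.5, 80.8)
print(round(tagarab_f, 1))  # 84.9, consistent with the reported 85%
```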
Abuleil [2004] developed a rule-based system that makes use of hand-written rules
and trigger words. The system starts by marking the phrases that could include
names, then it builds up a graph that represents the words in these phrases and the
relationships between them and finally, rules are applied to classify and generate the
names before saving them in a database. Abuleil’s system has been evaluated on a
corpus of 500 news articles from the Alraya newspaper and has obtained a precision of
90.4% on person, 93% on location and 92.3% on organization.
Doaa et al. [2005] used a parallel Arabic-Spanish corpus together with a Spanish
named entity tagger in order to extract NEs from the Arabic corpus. A simple mapping
technique was used to transliterate the words in the Arabic text and return those
matching named entities in the Spanish text as named entities in Arabic. The
size of the parallel corpus was limited to 1,200 sentence pairs, and 300 sentences were
randomly selected from the Spanish corpus with their equivalents in Arabic. For each
pair, the output of the NE tagger was compared to the manually annotated gold standard
set. Moreover, a stop word filter was applied to exclude stop words from the
potential transliterated candidates. The filter improved the precision from 84% to 90%,
and the recall was very high at 97.5%.
Zitouni et al. [2005] presented an Arabic mention detection system that looks for
nominals, pronominals, references to entities, and named entities. A Maximum En-
tropy Markov model was implemented using lexical and syntactic features. The sys-
tem was evaluated against the ACE 2004 data set. The overall score obtained was an
F-measure of 69%.
Traboulsi [2006] presented NExtract, a rule-based named entity recognition model
that uses local grammar and dictionaries. He showed that his approach could lead to
good results when tested in a small-scale experiment with a Reuters corpus. Later
on, Traboulsi [2009] proposed a local grammar-based approach combined with a
finite-state automaton to detect Arabic named entities. No further details are available at
this time about this method.
Benajiba et al. [2008] proposed a system that combines two machine learning approaches:
Support Vector Machines (SVMs) and Conditional Random Fields (CRFs).
Moreover, the system uses lexical, syntactic, and morphological features and a
multi-classifier approach where each classifier is designed to tag one NE class separately. The
system obtained an F-measure of 83.5% when tested on the ACE 2003 data.
Benajiba [2009a] compared in his thesis the results obtained by ANERsys with various
machine learning (ML) approaches, such as Maximum Entropy, Support Vector
Machines, and Conditional Random Fields. Benajiba concluded that no single ML approach
is better than the others for the Arabic NER task and that the best
results were obtained with a multi-classifier approach where each classifier
used the best ML technique for the specific named entity class.
In another experiment, Benajiba et al. [2009b] explored a combination of lexical,
contextual, and morphological features. The impact of the different features was
measured both in isolation and in combination. The best F-measure of 82.71% was obtained
when language-independent features were combined with language-specific features.
Shaalan and Raza [2009] presented a NER system for Arabic (NERA) using a rule-
based approach, a dictionary of names, a local grammar in the form of regular ex-
pressions, and a filtering mechanism. The filter serves mainly to revise the system
output by using a blacklist to reject the incorrect named entities. NERA obtained
an F-measure of 87.7% for person, 85.9% for locations, 83.15% for organizations, and
91.6% for dates.
LingPipe7 is a tool kit of various information extraction tools. Its named entity
recognition module is based on a supervised training model combined with a dictionary
and a regular expression matching method. The NER module has been trained on
ANERCorp, a corpus of 150,000 words created by Benajiba et al. [2007]. The default
detection in LingPipe distinguishes between three types of entities (person, place, and
organization). LingPipe was evaluated and obtained an F-measure of 67%.
7 LingPipe is freely available at http://alias-i.com/lingpipe/.
Similar to the other rule-based systems, our own system, RENAR, is also based on
hand-written local patterns. However, a big difference is that we use a set of language-
independent rules in combination with language-specific parameter files, containing
the relevant vocabulary and an optional set of extra language-specific rules. The gen-
eral mechanism will be explained in Section 4.3. In the same section, we will discuss
the changes made to support the peculiarities of the Arabic language.
The choice of a rule-based approach over a machine learning approach was motivated
mainly by the fact that the current architecture of the EMM interface is
optimized for rule-based systems and by the encouraging results obtained by various
similar Arabic rule-based systems such as those of Maloney and Niv [1998], Abuleil [2004],
and Shaalan and Raza [2009].
4.1. Corpus
We used the freely available ArabiCorpus8 (AC) as the main corpus to build our resources.
AC is a collection of Arabic text (68,943,447 words) from various sources such as
newspapers, the Quran, and Arabic literature. AC includes a user interface that offers
advanced searching capabilities, which we used in our preparation.
8 ArabiCorpus is freely available at http://arabicorpus.byu.edu/.
4.4. Gazetteers
The name part lists contain a total of 19,600 names. The person names list is composed
of a first names list (17,000 entries), including an extensive list of common name parts
(not only Arabic first names, but also common international names such as جون John,
جان Jean, خوان Juan written in Arabic), a list of name infixes (بن bin, عبد abd, أبو abu,
ال Al, etc.), and finally a last names list of 2,600 entries.
The locations are recognized through a simple gazetteer lookup procedure, without
any grammatical patterns. The gazetteer currently consists of 2,200 names of
countries, cities, towns, villages, states, and political regions found mainly in the
multilingual KNAB gazetteer produced by the Institute of the Estonian Language and
recently enriched with entries from the Arabic Wikipedia. The organizations lexicon is limited to a
list of 4,000 company and organization names that we extracted from the AC.
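A gazetteer lookup of this kind can be sketched as a longest-match search over token sequences. This is only an illustration under our own assumptions: the two-entry gazetteer, the function name, and the whitespace tokenization are hypothetical, and the sketch matches exact tokens only.

```python
from typing import List, Set, Tuple

# Hypothetical two-entry gazetteer; each entry is a tuple of tokens.
LOCATIONS: Set[Tuple[str, ...]] = {("تونس",), ("لوس", "أنجلوس")}


def lookup_gazetteer(tokens: List[str],
                     gazetteer: Set[Tuple[str, ...]],
                     max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end, name) spans, preferring the longest match at each position."""
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first so multi-word names win over their parts.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in gazetteer:
                match_end = i + n
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end, " ".join(tokens[i:match_end])))
            i = match_end
    return spans


tokens = "سافر إلى لوس أنجلوس ثم تونس".split()
for start, end, name in lookup_gazetteer(tokens, LOCATIONS):
    print(start, end, name)
```

Because the comparison is on exact tokens, a spelling variant or a name with an attached prefix would not be found by this sketch.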
are stored in various dictionary files. The person name recognition tool distinguishes
various types of trigger words, lists of modifiers, and stop words.
The trigger word lists include titles (السيدة Mrs., أستاذ Prof., دكتور Dr., etc.), professions
or positions (مدير director, رئيس president, محام lawyer, الكاهن priest, etc.), country
adjectives (التونسي Tunisian, الكندي Canadian, etc.), religious and ethnic groups (الكاثوليكي
Catholic, السنية Sunni, بربر Berber) and many other expressions indicating, for example,
that in Latin-based languages some uppercase words (X) may be names (e.g., ‘X
declared’, ‘X died’, [0-9]+-year-old X, etc.).
The word lists for each language are kept in a language-specific parameter file. The
language-specific parameter file furthermore contains a word list loosely named mod-
ifiers, that is, words that can appear in certain places between the name mention and
the trigger words. The modifier list can contain all sorts of modifiers, but also auxiliary
verbs and more (e.g., has, yesterday). The rule set thus refers to different subsets of
these language-specific word lists. Here are some examples, using the notation: (\w+)
represents an unknown word, \b an obligatory word boundary (white space, possibly
with punctuation); + indicates one or more elements, * means zero or more elements:
Rule (1) will recognize any combination of at least two uppercase words as person
names if they are found next to trigger words or expressions (e.g., Mr. Ahmed Issa).
Note that we require at least two name parts because all names are unambiguously
grounded to real-world entities so that the extracted information can be displayed on
NewsExplorer11.
Rule (2) captures apposition constructions such as “Hamid Karzai, the newly elected
Afghan president.” The words “the”, “newly”, and “elected” can be found in the English
modifier lists, while both Afghan and president are part of the trigger word lists. Our
recognition expressions (e.g., for modifier) can be more complex than what is shown
here. Details of the approach that are worth highlighting here are (a) determiners,
adverbs and other elements may be loosely combined under the heading modifier, as
the application would not benefit from a distinction; (b) the order of elements within
the groups modifier and person trigger is also not specified.
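As a rough illustration of how rules of the kind described for (1) and (2) can be assembled from word lists, the sketch below builds two regular expressions in Python. It is not the system's actual rule set: the word lists are tiny hypothetical samples, the pattern shapes are our reading of the prose description, and all names are ours.

```python
import re

# Hypothetical miniature word lists; the real lists live in the
# language-specific parameter files and are far larger.
TRIGGERS = ["Mr.", "Dr.", "president", "Afghan"]
MODIFIERS = ["the", "newly", "elected"]


def alternation(words):
    """Build a regex alternation that matches any listed word as a whole token."""
    body = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return rf"(?<!\w)(?:{body})(?!\w)"


TRIGGER = alternation(TRIGGERS)
MODIFIER = alternation(MODIFIERS)
UPPER = r"[A-Z][a-z]+"                       # one uppercase-initial word
NAME = rf"(?P<name>{UPPER}(?:\s+{UPPER})+)"  # at least two such words

# Rule (1)-style: a trigger expression followed by a candidate name.
RULE_1 = re.compile(rf"{TRIGGER}\s+{NAME}")
# Rule (2)-style: name, comma, optional modifiers, then a trigger (apposition).
RULE_2 = re.compile(rf"{NAME},\s+(?:{MODIFIER}\s+)*{TRIGGER}")


def find_person_names(text):
    """Return candidate person names found by either sketch rule."""
    names = []
    for rule in (RULE_1, RULE_2):
        names.extend(m.group("name") for m in rule.finditer(text))
    return names


print(find_person_names("Mr. Ahmed Issa said"))
print(find_person_names("Hamid Karzai, the newly elected Afghan president."))
```

The modifier group is deliberately loose, mirroring the point made above that determiners, adverbs, and other elements need not be distinguished for recognition.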
These are both examples of under-specified rules that make it easier to write the
rules and apply them to different languages. While generation grammars would need
this information, our recognition grammars do not. These generic rules refer to case
information, which is not available in Arabic or Farsi. When a person trigger expres-
sion is found in these Arabic script languages, we cannot know whether the preceding
or following words are names or not (because we do not have access to generic dic-
tionaries, part-of-speech or syntactic information). Furthermore, we would not know
where the name boundaries are. For this reason, we needed to write separate, safer
rules, which were then placed in the language-specific parameter file so that they only
apply to Arabic. The following are examples of such Arabic-specific rules:
11 See, for example, the page for Iraqi politician Nouri al-Maliki:
http://emm.newsexplorer.eu/NewsExplorer/entities/en/77049.html.
Rule (3) recognizes combinations of known name parts (first names or last names; in
some Arabic countries, the distinction is mostly irrelevant). The names can optionally
be separated by one or more name parts (e.g., بن bin, عبد abd, أبو abu, ال Al), referred
to as name infixes, to stay in line with other languages where we can find name infixes
such as van der, de la, della, von, etc. This rule would successfully recognize the name
محمد علي بن حليمة (Mohammed Ali ben Halima), assuming that both Mohammed and
Halima are in the list of known names. The known name list used contains thousands
of name parts from different parts of the Arab world. Similar lists exist for most EMM
languages.
Rule (4) recognizes combinations of unknown (or, optionally, known) words if they
are linked by name infixes. Similar rules will also allow for longer combinations of
three, four or more names, as long as each element is linked to the others via name
infixes.
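The behavior described for rules (3) and (4) can be sketched over a list of tokens as follows. This is an illustration only, not the system's rule syntax: the word lists are hypothetical samples (here Ali is also listed as a known name part), and the function names are ours.

```python
# Hypothetical miniature resources; the system's real lists hold thousands of entries.
KNOWN_NAME_PARTS = {"محمد", "علي", "حليمة", "أحمد"}
NAME_INFIXES = {"بن", "عبد", "أبو", "ال"}


def known_name_sequences(tokens):
    """Rule (3)-style: runs of known name parts, optionally linked by infixes.

    A run must start and end with a known name part and contain at least two.
    Returns (start, end) token offsets, end exclusive.
    """
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i] not in KNOWN_NAME_PARTS:
            i += 1
            continue
        j, last_known, n_known = i, i, 0
        while j < len(tokens) and tokens[j] in KNOWN_NAME_PARTS | NAME_INFIXES:
            if tokens[j] in KNOWN_NAME_PARTS:
                n_known += 1
                last_known = j
            j += 1
        if n_known >= 2:
            spans.append((i, last_known + 1))
        i = j
    return spans


def infix_linked_pairs(tokens):
    """Rule (4)-style: any two words joined by a name infix, known or not."""
    spans = []
    for k in range(1, len(tokens) - 1):
        if (tokens[k] in NAME_INFIXES
                and tokens[k - 1] not in NAME_INFIXES
                and tokens[k + 1] not in NAME_INFIXES):
            spans.append((k - 1, k + 2))
    return spans


sentence = "قال محمد علي بن حليمة أمس".split()
print([" ".join(sentence[s:e]) for s, e in known_name_sequences(sentence)])
```

The second function accepts any word on either side of an infix, which is why such a rule depends on the infix list being short and reliable.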
Rule (5) will recognize an unknown word and a known name part as a name if they
follow a trigger expression (e.g., السيد عيسى أحمد – Mr. Issa Ahmed, assuming that Ahmed
is a known name and Issa is unknown). Rule (6) is the Arabic equivalent to rule (2),
that is, it will capture apposition constructions. However, in order to recognize the
left-hand-side boundary of the name, we make use of stop words. Stop words are typically
high-frequency words from a long list of words that cannot be name parts.
In the example below, the rule will thus recognize “Hamid Karzai” or any other
name as a person, even if the words involved are not known name parts. The words
“and said”, found in the name stop word list, ensure that the left-hand-side border of
the name is correctly identified as Hamid. The uppercase condition applicable to most
EMM languages is thus replaced in Arabic by the introduction of the name stop word
list:
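A minimal sketch of this stop-word boundary check is given below. It is our own illustration rather than the rule as implemented: the trigger and stop word lists are hypothetical samples, and the window of four tokens and the requirement of at least two name parts are assumptions made for the example.

```python
# Hypothetical miniature lists for illustration only.
PERSON_TRIGGERS = {"الرئيس", "الأفغاني"}   # "the president", "the Afghan"
NAME_STOP_WORDS = {"وقال", "في", "من"}      # "and said", "in", "from"


def names_before_trigger(tokens, max_name_len=4):
    """Find candidate names that sit between a stop word and a trigger word."""
    names = []
    for t, token in enumerate(tokens):
        # Only consider the first word of a trigger expression.
        if token not in PERSON_TRIGGERS or (t > 0 and tokens[t - 1] in PERSON_TRIGGERS):
            continue
        start = t
        # Walk left until a stop word closes the name on its left-hand side.
        while (start > 0 and t - start < max_name_len
               and tokens[start - 1] not in NAME_STOP_WORDS):
            start -= 1
        bounded = start > 0 and tokens[start - 1] in NAME_STOP_WORDS
        if bounded and t - start >= 2:
            names.append(" ".join(tokens[start:t]))
    return names


tokens = "وقال حامد كرزاي الرئيس الأفغاني".split()
print(names_before_trigger(tokens))
```

Without a stop word to the left, the sketch returns nothing, which reflects the preference for safe rules when no case information is available.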
full name is composed of eight words. Moreover, it was very difficult to cover with
our local grammar all regional variants of the names of organizations because labels
and standards differ from one country to another and from one culture to another. For
instance, companies that originate from the Maghreb region will frequently use French
words in their names (such as محمد داوود بياس أوتو – Mohamed Daoud Pièces Auto).
On the other hand, companies in the Gulf region will use English words (such as
مدني تايلورز – Madani Tailors). We have therefore limited our coverage to a lookup of
major internationally known organizations, as well as those from the Arab world.
6. EVALUATION
6.1. Corpus
The results of our system were evaluated against a standard evaluation corpus: Benajiba’s
ANERCorp12, which is freely available online. Evaluation results are available for
three other systems that were tested against this corpus [Benajiba and Rosso 2007; Benajiba
et al. 2007; LingPipe13], which allows a fair and direct comparison with the performance of
RENAR.
The corpus consists of 316 articles and 150,286 tokens, which were collected from
various online news sources14. Proper names represent 11% of the whole corpus, and
their distribution ratio per class is shown in Table I.
6.2. Results
We have used the CoNLL 2002 evaluation tool15 to process the results obtained by
RENAR. According to the official CoNLL evaluation guidelines, a named entity is
considered correctly detected only if the classification is correct and all the constituent
words of the named entity are recognized. Table II summarizes the results obtained
by our system. We used the standard evaluation measures (i.e., precision, recall,
and F-measure), allowing direct comparison with the results from other systems.
We have included the precision, recall, and F-measure values for each named entity
category except for the miscellaneous category, which was not applicable to our
system.
Finally, Table III summarizes the recognition accuracy, in terms of precision, recall,
and F-measure, achieved by RENAR and the other NER systems for Arabic. All systems
in Table III (RENAR, LingPipe, and Benajiba’s ANERsys 1.0 and ANERsys 2.0) have
been evaluated against Benajiba’s ANERCorp.
Table III. Precision, Recall and F-Measure Obtained by Existing Arabic NER Systems
System                  Person                 Location               Organization
                        P      R      F        P      R      F        P      R      F
RENAR                   71.18  54.20  61.54    90.21  85.20  87.63    58.79  47.00  52.23
LingPipe                63.40  65.70  64.50    78.20  78.80  78.50    60.90  52.70  56.50
Benajiba ANERsys 1.0    54.21  41.01  46.69    82.17  78.42  80.25    45.16  31.04  36.79
Benajiba ANERsys 2.0    56.27  48.56  52.13    91.69  82.23  86.71    47.95  45.02  46.43
12 See http://www.dsic.upv.es/~ybenajiba.
13 See http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html.
14 This is the list of sources used and the articles distribution ratio [Benajiba et al. 2008]:
- http://www.aljazeera.net 34.8%
- Other newspapers and magazines 17.8%
- http://www.raya.com 15.5%
- http://ar.wikipedia.org 6.6%
- http://www.alalam.ma 5.4%
- http://www.ahram.eg.org 5.4%
- http://www.alittihad.ae 3.5%
- http://www.bbc.co.uk/arabic/ 3.5%
- http://arabic.cnn.com 2.8%
- http://www.addustour.com 2.8%
- http://kassioun.org 1.9%
15 Available at http://bredt.uib.no/download/conlleval.txt.
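The exact-match criterion used in this section, under which an entity counts as correct only when both its boundaries and its class agree with the gold standard, can be sketched as below. This is a simplified illustration and not the CoNLL script itself: entities are assumed to be already extracted as (start, end, label) triples, and the function name and sample data are ours.

```python
def score_exact_match(gold, predicted):
    """Per-class precision, recall, and F-measure under exact matching.

    gold, predicted: sets of (start, end, label) triples.
    A prediction is correct only if all three fields match a gold entity.
    """
    correct = gold & predicted
    results = {}
    for label in sorted({e[2] for e in gold | predicted}):
        tp = sum(1 for e in correct if e[2] == label)
        n_pred = sum(1 for e in predicted if e[2] == label)
        n_gold = sum(1 for e in gold if e[2] == label)
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_gold if n_gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[label] = (p, r, f)
    return results


# Made-up example: the ORG prediction misses the last word of the gold entity.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (8, 11, "ORG")}
predicted = {(0, 2, "PER"), (5, 6, "LOC"), (8, 10, "ORG")}
for label, (p, r, f) in score_exact_match(gold, predicted).items():
    print(label, p, r, f)
```

In the example, the partially recognized organization counts both as a false positive and as a missed entity, which is how partial matches lower precision and recall at the same time.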
organization category is the most challenging one, with a low F-measure ranging from
36.79% (ANERsys 1.0) through 52.23% (RENAR) to 56.50% (LingPipe). The analysis of detection
errors for the organization category showed that they are sometimes due to factors such
as the inconsistency of transcribing foreign organization names to Arabic, the
extended length of the names, and the relatively small size of our organization-name
gazetteer.
Moreover, our tool failed to identify many organization entities, as shown by the
low recall of 47%. A deeper analysis revealed many cases of incorrect categorization
due to the ambiguity of some Arabic words. Another type of error was that names
were only partially recognized, especially organization and person names. In addition,
the absence of rigorous standards for keyboarding Arabic text has led to inconsistencies
in the spelling of some words and has therefore influenced our results. For
example, the use of the letter Hamza ء, an Arabic glottal stop, varies in word-initial
position (e.g., a name like Ahmad could be written as أحمد or as احمد).
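To make the effect of this variation concrete, the sketch below shows an exact lookup missing the hamza-less spelling, together with one simple way of folding the variants. The normalization step is our own assumption and is not described as part of RENAR; the names used are hypothetical.

```python
import re

# Alef with hamza above, hamza below, and madda: أ إ آ
_ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622]")
_BARE_ALEF = "\u0627"


def normalize_alef(word: str) -> str:
    """Fold hamza-bearing alef forms to a bare alef."""
    return _ALEF_VARIANTS.sub(_BARE_ALEF, word)


known_names = {"أحمد"}   # gazetteer entry spelled with hamza
variant = "احمد"         # same name typed without hamza

print(variant in known_names)                       # exact lookup misses it
normalized_names = {normalize_alef(n) for n in known_names}
print(normalize_alef(variant) in normalized_names)  # folded lookup finds it
```

Folding both the gazetteer and the input in the same way keeps the lookup symmetric, at the cost of merging words that differ only in these letters.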
Some errors were also caused by the spelling variants of translated or transliterated
entities that were not present in our gazetteer (e.g., a location name like Los
Angeles could be written in four different ways as لوس أنجيلس, لوس أنجيليس, لس أنجلوس
or لوس أنجليس). In addition, the ambiguity and length of some named entities
prevented our system from detecting all parts of multi-word names (e.g.,
detecting Bin Ahmed Talel بن أحمد طلال instead of the full name Katib Bin Ahmed Talel
كَاتِب بن أحمد طلال). The ambiguity here involves the word Katib كَاتِب, which can be the
active participle of the verb “to write” or a person’s name. More fine-grained rules will
be needed to address such cases.
With this small-scale evaluation, we showed promising results. Our long-term evaluation
plan involves a larger evaluation corpus to check the accuracy and comprehensiveness
of our system. We are also planning to improve our detection rules and to
extend their coverage to more named entity combinations,
especially for the person and organization categories.
7. CONCLUSION
We have presented our work on adapting a multilingual NER system to the Arabic
language. Many of the existing language-independent rules had to be adapted to
Arabic, mostly because the lack of orthographic case makes it difficult to know where
a name starts and ends. More than for other languages, we needed long lists of
potential name parts, and we had to make much more use of stop words. Not having
access to full dictionaries and to part-of-speech information made the NER task rather
difficult for Arabic, more so than for the EMM languages using the Latin script (and
thus orthographic case). In NewsExplorer, we are currently mostly making use of safe
NER rules (such as those using lists of known name parts and name infixes). The
reason is that we did not have an Arabic speaker in the group for a long time. Not
having any control instance, we needed to optimize precision at the cost of lowering
recall.
We are now planning to work on improving recall. Moreover, we showed that
several important factors greatly influenced the results reported above, such as
the relative lack of standardized orthography in Arabic, especially in
the transcription of foreign names, the high ambiguity found in Arabic text, and the
relatively small size of our gazetteers. Having said this, we feel that the results are
relatively good, considering the simplicity of the approach and the fact that the
approach remains the same across all 20 languages in which we currently recognize
named entities.
ACKNOWLEDGMENTS
The author would like to thank Ralf Steinberger and Bruno Pouliquen of the Language Technology Group
at the Joint Research Center, for all the help provided during this project. A special thanks to David Graff
of the Linguistic Data Consortium for his valuable suggestions and comments.
REFERENCES
ABULEIL, S. 2004. Extracting names from Arabic text for question-answering systems. In Proceedings of
Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval (RIAO’04).
638–647.
BENAJIBA, Y. 2009a. Named entity recognition. Doctoral dissertation, Universidad Politecnica de
Valencia.
BENAJIBA, Y. AND ROSSO, P. 2007. ANERsys 2.0: Conquering the NER task for the Arabic language by
combining the maximum entropy with POS-tag information. In Proceedings of the Workshop on
Language-Independent Engineering (LIE’07).
BENAJIBA, Y., ROSSO, P., AND BENEDI, J.-M. 2007. Arabic ANERsys: An Arabic named entity recognition
system based on maximum entropy. In Proceedings of the Conference on Computational Linguistics and
Intelligent Text Processing (CLITP’07). 143–153.
BENAJIBA, Y., DIAB, M., AND ROSSO, P. 2008. Arabic named entity recognition using optimized feature
sets. In Proceedings of the Joint Meeting of the Conference on Empirical Methods in Natural Language
Processing (EMNLP’08). 284–293.
BENAJIBA, Y., DIAB, M., AND ROSSO, P. 2009b. Using language independent and language specific features
to enhance Arabic NER. Int. Arabic J. Inf. Technol., 463–471.
DEBILI, F. AND ACHOUR, H. 1998. Voyellation automatique de l’arabe. In Proceedings of the Workshop on
Computational Approaches to Semitic Languages (CASL’98). 42–49.
DOAA, S., MORENO-SANDOVAL, A., AND GUIRAO, J.-M. 2005. A proposal for an Arabic named entity tagger
leveraging a parallel corpus (Spanish-Arabic). In Proceedings of Recent Advances in Natural Language
Processing (RANLP’05). 459–465.
GRISHMAN, R. AND SUNDHEIM, B. 1996. Message Understanding Conference - 6: A brief history. In
Proceedings of the International Conference on Computer Linguistics (COLING’96). 466–471.
MALONEY, J. AND NIV, M. 1998. TAGARAB: A fast, accurate Arabic name recognizer using high precision
morphological analysis. In Proceedings of the Workshop on Computational Approaches to Semitic
Languages (CASL’98). 8–15.
NADEAU, D. AND SEKINE, S. 2009. A survey of named entity recognition and classification. In Named
Entities – Recognition, Classification and Use. S. Sekine and E. Ranchhod Eds., Benjamins Current
Topics, Vol. 19, John Benjamins Publishing Company, Amsterdam.
POULIQUEN, B., STEINBERGER, R., IGNAT, C., TEMNIKOVA, I., WIDIGER, A., ZAGHOUANI, W., AND ŽIŽKA,
J. 2005. Multilingual person name recognition and transliteration. Corela, Numéros spéciaux, Le
traitement lexicographique des noms propres.
SEKINE, S. 2004. Named entity: History and future. http://cs.nyu.edu/~sekine/papers/NEsurvey200402.pdf.
SHAALAN, K. 2005. Arabic GramCheck: A grammar checker for Arabic. Softw. Prac. Exp. 35, 7, 643–665.
SHAALAN, K. AND RAZA, H. 2008. Arabic named entity recognition from diverse text types. In Proceedings
of the 6th International Conference (GoTAL’09). 440–451.
SHAALAN, K. AND RAZA, H. 2009. NERA: Named entity recognition for Arabic. J. Amer. Soc. for Inf. Sci.
Technol. 60, 8, 1652–1663.
STEINBERGER, R., POULIQUEN, B., AND IGNAT, C. 2008. Using language-independent rules to achieve high
multilinguality in Text Mining. In Mining Massive Data Sets for Security. F.-S. Françoise, D. Perrotta,
J. Piskorski, and R. Steinberger Eds., IOS Press, 217–240.
STEINBERGER, R., POULIQUEN, B., AND VAN DER GOOT, E. 2009. An introduction to the Europe media
monitor family of applications. In Proceedings of the Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (SIGIR-CLIR’09). F. Gey, N. Kando, and J. Karlgren
Eds. 1–8.
TRABOULSI, H. N. 2006. Named Entity Recognition: A local grammar-based approach. Doctoral dissertation,
Department of Computing, Surrey University, Guildford, U.K.
TRABOULSI, H. N. 2009. Arabic named entity extraction: A local grammar-based approach. In Proceedings
of the International Multiconference on Computer Science and Information Technology (IMCSIT’09).
139–143.
VERGYRI, D., KIRCHHOFF, K., DUH, K., AND STOLCKE, A. 2004. Morphology-based language modeling for
Arabic speech recognition. In Proceedings of International Conference on Spoken Language Processing
(ICSLP’04). 2245–2248.
ZAGHOUANI, W. 2009. Le repérage automatique des entités nommées dans la langue arabe: Vers la création
d’un système à base de règles. Master’s thesis, University of Montreal.
ZITOUNI, I., SORENSEN, J., LUO, X., AND FLORIAN, R. 2005. The impact of morphological stemming on
Arabic mention detection and coreference resolution. In Proceedings of the Workshop of Computational
Approaches to Semitic Languages (ACL’05). 79–86.
Received June 2010; revised February 2011, April 2011; accepted May 2011