NLP | Regex and Affix tagging
Last Updated : 24 Sep, 2021
Regular expression matching can be used to tag words. For example, numbers can be matched with \d and assigned the tag CD (cardinal number). Known word patterns, such as the suffix "ing", can be matched in the same way.
Understanding the concept -
- RegexpTagger is a subclass of SequentialBackoffTagger. It can be placed just before a DefaultTagger in a backoff chain, so that it tags words the n-gram tagger(s) missed.
- At initialization, the patterns are saved in the RegexpTagger instance. When choose_tag() is called, it iterates over the patterns and returns the tag of the first expression that matches the current word using re.match().
- So, if two expressions both match a word, the tag of the first one is returned and the second expression is never tried.
- If the last pattern is a catch-all such as (r'.*', 'NN'), the RegexpTagger can replace the DefaultTagger.
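The first-match rule described above can be sketched in plain Python. This is a minimal illustration, not NLTK's actual implementation: first_match_tag and demo_patterns are hypothetical names introduced here, and only the standard re module is used.

```python
import re

# Hypothetical helper for illustration only; not part of NLTK.
def first_match_tag(word, patterns, default=None):
    """Return the tag of the first pattern that matches `word`."""
    for regex, tag in patterns:
        # re.match() anchors the search at the start of the word
        if re.match(regex, word):
            return tag
    # No pattern matched: fall through, like a tagger with no backoff
    return default

# Order matters: the catch-all must come last, or it shadows every other rule
demo_patterns = [
    (r'^\d+$', 'CD'),    # cardinal numbers
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*', 'NN'),       # catch-all, plays the role of a default tagger
]

tags = [first_match_tag(w, demo_patterns) for w in ['42', 'running', 'table']]
print(tags)  # ['CD', 'VBG', 'NN']
```

Because the rules look only at surface form, a noun such as "ring" would also be tagged VBG here, which is one reason a regex tagger is usually a backoff rather than the primary tagger.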
Code #1 : Python regular expression module and re syntax
Python3
patterns = [
    # cardinal numbers, e.g. 42
    (r'^\d+$', 'CD'),
    # gerunds, e.g. wondering
    (r'.*ing$', 'VBG'),
    # e.g. wonderment
    (r'.*ment$', 'NN'),
    # e.g. wonderful
    (r'.*ful$', 'JJ'),
]
The RegexpTagger class expects a list of 2-tuples:
-> the first element of each tuple is a regular expression
-> the second element is the tag to assign when that expression matches
Code #2 : Using RegexpTagger
Python3
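# Loading libraries
# tag_util is assumed to be a local module holding the patterns list from Code #1
from tag_util import patterns
from nltk.tag import RegexpTagger
from nltk.corpus import treebank

# Sentences from index 3000 onward serve as test data
test_data = treebank.tagged_sents()[3000:]
tagger = RegexpTagger(patterns)
# Note: recent NLTK releases deprecate evaluate() in favor of accuracy()
print ("Accuracy : ", tagger.evaluate(test_data))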
Output :
Accuracy : 0.037470321605870924
What is Affix tagging?
AffixTagger is a subclass of ContextTagger. For AffixTagger, the context is either the prefix or the suffix of a word, so the class learns tags from fixed-length substrings at the beginning or end of a word.
By default, it uses three-character suffixes. A word must be at least five characters long to be tagged; None is returned as the tag for any shorter word.
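How the affix context is derived from a word can be sketched as follows. This is a simplified illustration under stated assumptions, not NLTK's source: affix_context is a hypothetical helper, and its defaults (affix_length=-3, min_stem_length=2) are chosen to reproduce the three-character suffix and five-character minimum described above.

```python
# Hypothetical helper for illustration only; not part of NLTK.
def affix_context(word, affix_length=-3, min_stem_length=2):
    """Return the prefix or suffix used as context, or None if the word is too short."""
    # Assumed rule: the word must hold the affix plus a minimal stem
    min_word_length = min_stem_length + abs(affix_length)
    if len(word) < min_word_length:
        return None
    if affix_length > 0:
        return word[:affix_length]   # positive length: prefix
    return word[affix_length:]       # negative length: suffix

contexts = {w: affix_context(w) for w in ['publishing', 'group', 'the']}
print(contexts)  # {'publishing': 'ing', 'group': 'oup', 'the': None}
print(affix_context('publishing', affix_length=3))  # pub
```

A trained tagger would then map each such context to the tag seen most often with it in the training data; words that yield None fall through to the backoff tagger, if one is set.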
Code #3 : Understanding AffixTagger.
Python3
# Loading libraries
from nltk.corpus import treebank
from nltk.tag import AffixTagger
# initializing training and testing set
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]
print ("Train data : \n", train_data[1])
# Initializing tagger
tag = AffixTagger(train_data)
# Testing
print ("\nAccuracy : ", tag.evaluate(test_data))
Output :
Train data :
[('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'),
('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'),
('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')]
Accuracy : 0.27558817181092166
Code #4 : AffixTagger by specifying 3 character prefixes.
Python3
# Specifying 3 character prefixes
prefix_tag = AffixTagger(train_data,
                         affix_length = 3)
# Testing
accuracy = prefix_tag.evaluate(test_data)
print ("Accuracy : ", accuracy)
Output :
Accuracy : 0.23587308439456076
Code #5 : AffixTagger by specifying 2-character suffixes
Python3
# Specifying 2 character suffixes (a negative affix_length selects suffixes)
suffix_tag = AffixTagger(train_data,
                         affix_length = -2)
# Testing
accuracy = suffix_tag.evaluate(test_data)
print ("Accuracy : ", accuracy)
Output :
Accuracy : 0.31940427368875457