Lab 02 - Regular Expression
References:
1. Regular Expressions: The Complete Tutorial, by Jan Goyvaerts, 2007.
2. Speech and Language Processing, by Dan Jurafsky and James H. Martin. Prentice Hall Series in Artificial
Intelligence, 2008.
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2014.
QUICK REVIEW
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of
regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all
text files in a file manager. The regex equivalent is «.*\.txt». Basically, a regular expression is a pattern describing a
certain amount of text.
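To make the wildcard analogy concrete, here is a minimal sketch (the filenames are made-up examples) showing how the pattern «.*\.txt» selects the text files with Python's re module:
import re
# re.fullmatch() requires the whole string to match the pattern, which mirrors
# how a file manager applies *.txt to an entire filename.
filenames = ['notes.txt', 'report.pdf', 'data.txt', 'script.py']
print([name for name in filenames if re.fullmatch(r'.*\.txt', name)])
# ['notes.txt', 'data.txt']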
A regular expression “engine” is a piece of software that can process regular expressions, trying to match the pattern to
the given string. Usually, the engine is part of a larger application and you do not access the engine directly. Rather, the
application will invoke it for you when needed, making sure the right regular expression is applied to the right file or
data.
PRACTICES
import re
word = 'supercalifragilisticexpialidocious'
print(re.findall(r'[aeiou]', word))
print(len(re.findall(r'[aeiou]', word)))
# Search
text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# An if-statement after search() tests whether it succeeded
if match:
    print('found', match.group())  # 'found word:cat'
else:
    print('did not find')
Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:
import nltk, re
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))
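To see their relative frequencies, we can inspect the distribution with FreqDist's most_common() and freq() methods:
# The ten most common vowel sequences, with their counts and relative frequencies
for vs, count in fd.most_common(10):
    print(vs, count, round(fd.freq(vs), 4))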
It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left
out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel
sequences. The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all
consonants; everything else is ignored. This three-way disjunction is processed left-to-right: if one of the three parts matches the word, any later parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together.
import re
import nltk

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
url = "http://www.gutenberg.org/ebooks/2554?msg=welcome_stranger"
with urllib.request.urlopen(url) as f:
print(f.read(300).decode('utf-8'))
Next, let's combine regular expressions with conditional frequency distributions. Here we will extract all consonant-
vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a
conditional frequency distribution. We then tabulate the frequency of each pair:
import nltk, re

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
Examining the rows for s and t, we see they are in partial "complementary distribution", which is evidence that they are
not distinct phonemes in the language. Thus, we could conceivably drop s from the Rotokas alphabet and simply have a
pronunciation rule that the letter t is pronounced s when followed by i. (Note that the single entry having su, namely
kasuari 'cassowary', is borrowed from English.)
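If you want to look more closely at just the rows discussed here, the tabulate() method accepts conditions and samples keyword arguments:
# Restrict the table to the s and t rows and the five vowel columns
cfd.tabulate(conditions=['s', 't'], samples=['a', 'e', 'i', 'o', 'u'])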
If we want to be able to inspect the words behind the numbers in the above table, it would be helpful to have an index,
allowing us to quickly find the list of words that contains a given consonant-vowel pair, e.g. cv_index['su'] should
give us all words containing su. Here's how we can do this:
import nltk, re

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cv_word_pairs = [(cv, w) for w in rotokas_words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
print(cv_index['su'])
print(cv_index['po'])
This program processes each word w in turn, and for each one, finds every substring that matches the regular expression
«[ptksvr][aeiou]». In the case of the word kasuari, it finds ka, su and ri. Therefore, the cv_word_pairs list will
contain ('ka', 'kasuari'), ('su', 'kasuari') and ('ri', 'kasuari'). One further step, using nltk.Index(),
converts this into a useful index.
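nltk.Index() is essentially a dictionary of lists built from (key, value) pairs. A minimal sketch using just the kasuari pairs mentioned above:
import nltk

# nltk.Index groups (key, value) pairs into a dictionary mapping each key to a list of values
pairs = [('ka', 'kasuari'), ('su', 'kasuari'), ('ri', 'kasuari')]
index = nltk.Index(pairs)
print(index['ka'])  # ['kasuari']
print(index['su'])  # ['kasuari']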
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(stem('usually'))
import re
print(re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is
because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to
specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of
many arcane subtleties of regular expressions. Here's the revised version.
import re
print(re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
However, we'd actually like to split the word into stem and suffix. So we should just parenthesize both parts of the
regular expression:
import re
print(re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
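With processing the greedy version happens to split correctly, but the greedy star can grab too much of other words. For example, processes loses only its final s, because .* consumes as much as it can before the suffix group is tried:
import re

print(re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes'))
# [('processe', 's')] -- the -es suffix is split incorrectly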
If we use the "non-greedy" version of the star operator, written *?, we get what we want. Making the whole suffix group optional (with a trailing ?) also lets the pattern match words that have no suffix at all:
import re

inputWord = 'processes'  # any word will do
print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', inputWord))
print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', inputWord))
This approach still has many problems (can you spot them?) but we will move on to define a function to perform
stemming, and apply it to a whole text:
import re  # Python 2
from nltk.tokenize import word_tokenize

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

# sampleText can be any text; this passage is the example the discussion below refers to
sampleText = ("DENNIS: Listen, strange women lying in ponds distributing swords is no basis "
              "for a system of government. Supreme executive power derives from a mandate "
              "from the masses, not from some farcical aquatic ceremony.")
tokens = word_tokenize(sampleText)
print([stem(t) for t in tokens])
Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words
like distribut and deriv, but these are acceptable stems in some applications.
import re  # Python 3
import nltk

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

tokens = nltk.tokenize.word_tokenize(sampleText)  # sampleText as defined above
print([stem(t) for t in tokens])
import re
text = "I am a human"
tokens = re.split(' ', text)
print(tokens)
s_marks = 'one-two+three#four'
print(re.split('[-+#]', s_marks))
string.split(separator, maxsplit)

Parameter   Description
separator   Optional. The separator to use when splitting the string. By default, any whitespace is a separator.
maxsplit    Optional. How many splits to do. The default value is -1, which means "all occurrences".
txt = "apple#banana#cherry#orange"
x = txt.split("#")
print(x)
txt = "apple#banana#cherry#orange"
# setting the maxsplit parameter to 1 will return a list with 2 elements
x = txt.split("#", 1)
print(x)
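re.split() accepts a maxsplit argument as well (its default is 0, which means split on all occurrences), so the same kind of limit can be applied when splitting on a pattern:
import re

s_marks = 'one-two+three#four'
# maxsplit=1 stops after the first split, just like txt.split("#", 1) above
print(re.split('[-+#]', s_marks, maxsplit=1))  # ['one', 'two+three#four']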
2.7. Stemming
Stemming with the Python nltk package. "Stemming is the process of reducing inflection in words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language."
Using nltk.PorterStemmer()
In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base, or root form (generally a written word form).
import nltk

porter = nltk.PorterStemmer()
# Stem each token from the tokenized sample text above
for t in tokens:
    print(porter.stem(t))
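As a quick comparison, here is a sketch that puts our simple regex stem() function and the Porter stemmer side by side (it assumes stem() and tokens are still defined from the earlier examples):
# Compare the regex stemmer with the Porter stemmer, token by token
for t in tokens:
    print(t, stem(t), porter.stem(t))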