Lab 02 - Regular Expression
References:
1. Regular Expressions: The Complete Tutorial, by Jan Goyvaerts, 2007.
2. Speech and Language Processing, by Dan Jurafsky and James H. Martin. Prentice Hall Series in Artificial
Intelligence, 2008.
3. Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, 2014.
QUICK REVIEW
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of
regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all
text files in a file manager. The regex equivalent is «.*\.txt». Basically, a regular expression is a pattern describing a
certain amount of text.
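To make the wildcard analogy concrete, here is a minimal sketch (the filenames are made-up examples) showing how the pattern «.*\.txt» selects the text files with Python's re module:
import re
# re.fullmatch() requires the whole string to match the pattern, which mirrors
# how a file manager applies *.txt to an entire filename.
filenames = ['notes.txt', 'report.pdf', 'data.txt', 'script.py']
print([name for name in filenames if re.fullmatch(r'.*\.txt', name)])
# ['notes.txt', 'data.txt']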
A regular expression “engine” is a piece of software that can process regular expressions, trying to match the pattern to
the given string. Usually, the engine is part of a larger application and you do not access the engine directly. Rather, the
application will invoke it for you when needed, making sure the right regular expression is applied to the right file or
data.
PRACTICES
import re
word = 'supercalifragilisticexpialidocious'
print(re.findall(r'[aeiou]', word))
print(len(re.findall(r'[aeiou]', word)))
# Search
text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# An if-statement after search() tests whether it succeeded
if match:
    print('found', match.group())  # 'found word:cat'
else:
    print('did not find')
Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:
import nltk, re
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj
                   for vs in re.findall(r'[aeiou]{2,}', word))
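To see their relative frequencies, we can inspect the distribution with FreqDist's most_common() and freq() methods:
# The ten most common vowel sequences, with their counts and relative frequencies
for vs, count in fd.most_common(10):
    print(vs, count, round(fd.freq(vs), 4))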
It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left
out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel
sequences. The regular expression in our next example matches initial vowel sequences, final vowel sequences, and all
consonants; everything else is ignored. This three-way disjunction is processed left-to-right: if one of the three parts matches the word, any later parts of the regular expression are ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together.
import re
import nltk

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))
url = "http://www.gutenberg.org/ebooks/2554?msg=welcome_stranger"
with urllib.request.urlopen(url) as f:
print(f.read(300).decode('utf-8'))
Next, let's combine regular expressions with conditional frequency distributions. Here we will extract all consonant-
vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a
conditional frequency distribution. We then tabulate the frequency of each pair:
import nltk, re

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]', w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()
Examining the rows for s and t, we see they are in partial "complementary distribution", which is evidence that they are
not distinct phonemes in the language. Thus, we could conceivably drop s from the Rotokas alphabet and simply have a
pronunciation rule that the letter t is pronounced s when followed by i. (Note that the single entry having su, namely
kasuari 'cassowary', is borrowed from English.)
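If you want to look more closely at just the rows discussed here, the tabulate() method accepts conditions and samples keyword arguments:
# Restrict the table to the s and t rows and the five vowel columns
cfd.tabulate(conditions=['s', 't'], samples=['a', 'e', 'i', 'o', 'u'])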
If we want to be able to inspect the words behind the numbers in the above table, it would be helpful to have an index,
allowing us to quickly find the list of words that contains a given consonant-vowel pair, e.g. cv_index['su'] should
give us all words containing su. Here's how we can do this:
import nltk, re

rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
cv_word_pairs = [(cv, w) for w in rotokas_words
                 for cv in re.findall(r'[ptksvr][aeiou]', w)]
cv_index = nltk.Index(cv_word_pairs)
print(cv_index['su'])
print(cv_index['po'])
This program processes each word w in turn, and for each one, finds every substring that matches the regular expression
«[ptksvr][aeiou]». In the case of the word kasuari, it finds ka, su and ri. Therefore, the cv_word_pairs list will
contain ('ka', 'kasuari'), ('su', 'kasuari') and ('ri', 'kasuari'). One further step, using nltk.Index(),
converts this into a useful index.
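nltk.Index() is essentially a dictionary of lists built from (key, value) pairs. A minimal sketch using just the kasuari pairs mentioned above:
import nltk

# nltk.Index groups (key, value) pairs into a dictionary mapping each key to a list of values
pairs = [('ka', 'kasuari'), ('su', 'kasuari'), ('ri', 'kasuari')]
index = nltk.Index(pairs)
print(index['ka'])  # ['kasuari']
print(index['su'])  # ['kasuari']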
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(stem('usually'))
import re
print(re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is
because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to
specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of
many arcane subtleties of regular expressions. Here's the revised version.
import re
print(re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
However, we'd actually like to split the word into stem and suffix. So we should just parenthesize both parts of the
regular expression:
import re
print(re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing'))
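With processing the greedy version happens to split correctly, but the greedy star can grab too much of other words. For example, processes loses only its final s, because .* consumes as much as it can before the suffix group is tried:
import re

print(re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes'))
# [('processe', 's')] -- the -es suffix is split incorrectly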
If we use the "non-greedy" version of the star operator, written *?, we get what we want. Making the whole suffix group optional (with a trailing ?) also lets the pattern match words that have no suffix at all:
import re

inputWord = 'processes'  # any word will do
print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', inputWord))
print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', inputWord))
This approach still has many problems (can you spot them?) but we will move on to define a function to perform
stemming, and apply it to a whole text:
import re  # Python 2
from nltk.tokenize import word_tokenize

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

# sampleText can be any text; this passage is the example the discussion below refers to
sampleText = ("DENNIS: Listen, strange women lying in ponds distributing swords is no basis "
              "for a system of government. Supreme executive power derives from a mandate "
              "from the masses, not from some farcical aquatic ceremony.")
tokens = word_tokenize(sampleText)
print([stem(t) for t in tokens])
Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words
like distribut and deriv, but these are acceptable stems in some applications.
import re  # Python 3
import nltk

def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

tokens = nltk.tokenize.word_tokenize(sampleText)  # sampleText as defined above
print([stem(t) for t in tokens])
import re
text = "I am a human"
tokens = re.split(' ', text)
print(tokens)
s_marks = 'one-two+three#four'
print(re.split('[-+#]', s_marks))
string.split(separator, maxsplit)

Parameter   Description
separator   Optional. The separator to use when splitting the string. By default, any whitespace is a separator.
maxsplit    Optional. How many splits to do. The default value is -1, which means "all occurrences".
txt = "apple#banana#cherry#orange"
x = txt.split("#")
print(x)
txt = "apple#banana#cherry#orange"
# setting the maxsplit parameter to 1 will return a list with 2 elements
x = txt.split("#", 1)
print(x)
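re.split() accepts a maxsplit argument as well (its default is 0, which means split on all occurrences), so the same kind of limit can be applied when splitting on a pattern:
import re

s_marks = 'one-two+three#four'
# maxsplit=1 stops after the first split, just like txt.split("#", 1) above
print(re.split('[-+#]', s_marks, maxsplit=1))  # ['one', 'two+three#four']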
2.7. Stemming
Stemming with the Python nltk package. "Stemming is the process of reducing inflection in words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language."
Using nltk.PorterStemmer()
In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base, or root form (generally a written word form).
import nltk

porter = nltk.PorterStemmer()
# Stem each token from the tokenized sample text above
for t in tokens:
    print(porter.stem(t))
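As a quick comparison, here is a sketch that puts our simple regex stem() function and the Porter stemmer side by side (it assumes stem() and tokens are still defined from the earlier examples):
# Compare the regex stemmer with the Porter stemmer, token by token
for t in tokens:
    print(t, stem(t), porter.stem(t))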