
Basic Text Processing: Regular Expressions and Text Normalization

DR. K.M. AZHARUL HASAN

Dr. Azhar, KUET 11/17/2024


ELIZA: A natural language processing system

• User: You are like my father in some ways.
• ELIZA: WHAT RESEMBLANCE DO YOU SEE
• User: You are not very aggressive but I think you don't want me to notice that.
• ELIZA: WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
• User: You don't argue with me.
• ELIZA: WHY DO YOU THINK I DON'T ARGUE WITH YOU
• User: You are afraid of me.
• ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU

Weizenbaum (1966)

An Intelligent Bangla Conversational Agent: TUNI

Rahman Joy, M.T., Shahriar Akash, M.N., Hasan, K.M.A. (2021). An Intelligent Bangla Conversational Agent: TUNI. In: Uddin, M.S., Bansal, J.C. (eds) Proceedings of International Joint Conference on Advances in Computational Intelligence. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-0586-4_34
Corpus
• Corpus linguistics encompasses the compilation and analysis of collections of spoken and written texts as the source of evidence for describing the nature, structure, and use of languages.
• A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
• Corpora (the plural of corpus) vary widely in size and design.
  • Most are nowadays in electronic form, built so that computer software can support their analysis.
  • Corpora are often annotated to show grammatical classes, structures, and functions, which software can then use to analyse grammatical structures or identify patterns of use.

Regular expressions

• A formal language for specifying text strings.
• Regular expressions are particularly useful for searching in texts, when we have a pattern to search for.
• A regular expression search function will search through the corpus, returning all texts that match the pattern.
• The corpus can be a single document or a collection.
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

Regular Expressions: Disjunctions

• The string of characters inside the square brackets specifies a disjunction of characters to match.

Pattern        Matches
[wW]oodchuck   Woodchuck, woodchuck
[1234567890]   Any digit

• Ranges: [A-Z]

Pattern   Matches                Example
[A-Z]     An upper case letter   Drenched Blossoms
[a-z]     A lower case letter    my beans were impatient
[0-9]     A single digit         Chapter 1: Down the Rabbit Hole
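These bracket expressions can be checked directly with Python's re module (a quick sketch; the test strings come from the tables above):

```python
import re

text = "Woodchuck and woodchuck; Chapter 1: Down the Rabbit Hole"

# [wW] matches either an upper- or lower-case first letter.
print(re.findall(r"[wW]oodchuck", text))          # ['Woodchuck', 'woodchuck']

# [0-9] matches any single digit.
print(re.findall(r"[0-9]", text))                 # ['1']

# [A-Z] matches any single upper-case letter.
print(re.findall(r"[A-Z]", "Drenched Blossoms"))  # ['D', 'B']
```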

Regular Expressions: Negation in Disjunction
• The pattern /[2-5]/ specifies any one of the characters 2, 3, 4, or 5. The pattern /[b-g]/ specifies one of the characters b, c, d, e, f, or g.
• Negation: [^Ss]
  • The caret means negation only when it is the first character inside [].

Pattern   Matches                    Example
[^A-Z]    Not an upper case letter   Oyfn pripetchik
[^Ss]     Neither 'S' nor 's'        "I have no exquisite reason"
[e^]      Either e or ^              Look here
a^b       The pattern a^b            Look up a^b now
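The same table can be tried in Python. One Python-specific caveat: outside brackets a literal caret must be escaped (\^), since a bare ^ is the start anchor:

```python
import re

# [^Ss] matches any character that is neither 'S' nor 's'.
print(re.findall(r"[^Ss]", "Ss!"))             # ['!']

# Inside brackets but not first, ^ is a literal caret: [e^] is e or ^.
print(re.findall(r"[e^]", "here a^b"))         # ['e', 'e', '^']

# Outside brackets, escape the caret to match the literal string a^b.
print(re.findall(r"a\^b", "Look up a^b now"))  # ['a^b']
```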
Regular Expressions: ? * + .

Pattern   Matches                          Examples
colou?r   Optional previous char           color, colour
oo*h!     0 or more of previous char       oh! ooh! oooh! ooooh!
o+h!      1 or more of previous char       oh! ooh! oooh! ooooh!
baa+      1 or more a after ba             baa baaa baaaa baaaaa
beg.n     Any single char between beg, n   begin begun beg3n

The * and + operators are called the Kleene star and Kleene plus, after Stephen C. Kleene.
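A quick check of these counters with Python's re:

```python
import re

# ? makes the previous character optional.
print(re.findall(r"colou?r", "color or colour"))  # ['color', 'colour']

# * is zero-or-more, + is one-or-more of the previous char.
print(re.findall(r"oo*h!", "oh! ooh! oooh!"))     # ['oh!', 'ooh!', 'oooh!']
print(re.findall(r"o+h!", "oh! ooh! oooh!"))      # ['oh!', 'ooh!', 'oooh!']

# . matches any single character.
print(re.findall(r"beg.n", "begin begun beg3n"))  # ['begin', 'begun', 'beg3n']
```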

Regular Expressions: Anchors ^ $

• Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are the caret ^ and the dollar sign $.
• The caret matches the start of a line. The pattern /^The/ matches the word "The" only at the start of a line.
• The caret ^ has three uses:
  • to match the start of a line,
  • to indicate a negation inside of square brackets,
  • and just to mean a caret.

Regular Expressions: Anchors ^ $

• The dollar sign $ matches the end of a line. So the pattern / $/ (a space followed by the dollar sign) is a useful pattern for matching a space at the end of a line.
  • /^The dog\.$/ matches a line that contains only the phrase "The dog."
• There are also two other anchors: \b matches a word boundary, and \B matches a non-boundary.
  • /\bthe\b/ matches the word the but not the word other.
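The anchors behave the same way in Python (without re.MULTILINE, ^ and $ anchor at the start and end of the whole string):

```python
import re

# /^The/ matches "The" only at the start.
print(re.search(r"^The", "The dog") is not None)   # True
print(re.search(r"^The", "saw The dog") is None)   # True

# /^The dog\.$/ matches a line containing only "The dog."
print(re.fullmatch(r"The dog\.", "The dog.") is not None)  # True

# \b is a word boundary: "the" matches, but not inside "other" or "theology".
print(re.findall(r"\bthe\b", "the other theology"))  # ['the']
```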

Disjunction, Grouping, and Precedence

• The disjunction operator, also called the pipe symbol, is |. The pattern /cat|dog/ matches either the string cat or the string dog.
• To apply the disjunction operator only to a specific pattern, use the parenthesis operators ( and ).
  • The pattern /gupp(y|ies)/ specifies that we mean the disjunction to apply only to the suffixes y and ies.
  • We could write the expression /(Column [0-9]+ *)*/ to match the word Column, followed by a number and optional spaces, the whole pattern repeated any number of times.
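The grouping examples, sketched in Python. Non-capturing (?:...) groups are used so that findall returns the whole match rather than just the parenthesized group:

```python
import re

# /cat|dog/ matches either alternative.
print(re.findall(r"cat|dog", "the cat saw the dog"))      # ['cat', 'dog']

# Parentheses limit the disjunction to the suffixes y and ies.
print(re.findall(r"gupp(?:y|ies)", "guppy and guppies"))  # ['guppy', 'guppies']

# /(Column [0-9]+ *)*/ matches repeated "Column <number>" groups.
m = re.match(r"(?:Column [0-9]+ *)*", "Column 1 Column 2 Column 3")
print(m.group(0))                                         # 'Column 1 Column 2 Column 3'
```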

Operator precedence hierarchy

Parenthesis             ()
Counters                * + ? {}
Sequences and anchors   the ^my end$
Disjunction             |

• Counters have a higher precedence than sequences: /the*/ matches theeeee but not thethe.
• Sequences have a higher precedence than disjunction: /the|any/ matches the or any but not theny.
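Both precedence rules can be confirmed in Python:

```python
import re

# Counters bind tighter than sequences: /the*/ is th followed by e*.
print(re.findall(r"the*", "theeeee"))           # ['theeeee']
print(re.findall(r"the*", "thethe"))            # ['the', 'the']

# Sequences bind tighter than disjunction: /the|any/ is (the)|(any).
print(re.search(r"the|any", "theny").group(0))  # 'the'
print(re.search(r"the|any", "many").group(0))   # 'any'
```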

Greedy pattern

• Patterns can be ambiguous in another way.
  • Consider the expression /[a-z]*/ matching against the text once upon a time.
  • Since /[a-z]*/ matches zero or more letters, this expression could match nothing, or just the first letter o, or on, onc, or once.
  • In this case the RE matches the longest sequence it can.
• REs always match the largest string they can.
  • Therefore, we say that "patterns are greedy", expanding to cover as much of a string as they can.
• Ways to make matching non-greedy:
  • *? : Kleene star -> match as little text as possible
  • +? : Kleene plus -> match as little text as possible
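Greedy vs. non-greedy matching in Python:

```python
import re

text = "once upon a time"

# Greedy: [a-z]* grabs as many letters as it can.
print(re.match(r"[a-z]*", text).group(0))   # 'once'

# Non-greedy: [a-z]*? matches as little as possible (the empty string).
print(re.match(r"[a-z]*?", text).group(0))  # ''

# The classic case: matching between delimiters.
html = "<b>one</b> <b>two</b>"
print(re.findall(r"<.*>", html))            # ['<b>one</b> <b>two</b>']
print(re.findall(r"<.*?>", html))           # ['<b>', '</b>', '<b>', '</b>']
```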

Example 1

• Find all instances of the word "the" in a text.

/the/                       Misses capitalized examples (The)
/[tT]he/                    Incorrectly returns other or theology
/\b[tT]he\b/                But \b counts digits and underscores as word characters, so it misses the in contexts like the_ or the25
/[^a-zA-Z][tT]he[^a-zA-Z]/  Finds those, but won't find the word the when it begins a line
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
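The final pattern can be sketched in Python. One adaptation: the trailing context is written as a lookahead here, so that adjacent occurrences such as "the the" are all found:

```python
import re

PATTERN = r"(?:^|[^a-zA-Z])([tT]he)(?=[^a-zA-Z]|$)"

def find_the(line):
    """Return every occurrence of the word 'the' or 'The' in a line."""
    return re.findall(PATTERN, line)

print(find_the("The other day the theology class met"))  # ['The', 'the']
print(find_the("the_ and the25"))                        # ['the', 'the']
```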

Example 2

• Find "any machine with at least 6 GHz and 500 GB of disk space for less than $1000"
• We need expressions that look like "6 GHz or 500 GB or Mac or $999.99":

/$[0-9]+/                              a dollar sign followed by a string of digits
/$[0-9]+\.[0-9][0-9]/                  fractions of dollars; matches $199.99 but not $199
/(^|\W)$[0-9]+(\.[0-9][0-9])?\b/       adds a word boundary
/(^|\W)$[0-9]{0,3}(\.[0-9][0-9])?\b/   limits the digits (the previous pattern allows prices like $199999.99)
/\b[6-9]+ *(GHz|[Gg]igahertz)\b/       specifications for >6 GHz processor speed
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/   for disk space
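A sketch of the price and speed patterns in Python, with two adaptations: the dollar sign is escaped (\$), since in Python regexes a bare $ is the end anchor, and {1,3} is used so that at least one digit is required:

```python
import re

price = re.compile(r"(?:^|\W)(\$[0-9]{1,3}(?:\.[0-9][0-9])?)\b")
print(price.findall("a Mac for $999.99 today"))  # ['$999.99']
print(price.findall("worth $199999.99"))         # [] (more than 3 digits)

ghz = re.compile(r"\b[6-9]+ *(?:GHz|[Gg]igahertz)\b")
print(bool(ghz.search("an 8 GHz machine")))      # True

disk = re.compile(r"\b[0-9]+(?:\.[0-9]+)? *(?:GB|[Gg]igabytes?)\b")
print(bool(disk.search("500 GB of disk space"))) # True
```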

Errors

• The process we just went through was based on fixing two kinds of errors:
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)

Errors cont.

• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives)

Summary

• Regular expressions play a surprisingly large role.
  • Sophisticated sequences of regular expressions are often the first model for any text processing task.
• For many hard tasks, we use machine learning classifiers.
  • But regular expressions are used as features in the classifiers.
  • They can be very useful in capturing generalizations.

Text Normalization

• Normalizing text means converting it to a more convenient, standard form.
  • Ex.: most of what we are going to do with language relies on first separating out, or tokenizing, words (word boundary detection).
  • English words are often separated from each other by whitespace, but whitespace is not always sufficient.
  • New York and rock 'n' roll are sometimes treated as single words despite the fact that they contain spaces,
  • while sometimes we'll need to separate I'm into the two words I and am.
  • For processing tweets or texts we need to tokenize emoticons like :) or hashtags like #nlproc.

Tokenization

• Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens.
  • Ex.: "After sleeping for four hours, he decided to sleep for another four"
  • {'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}
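A minimal whitespace-and-punctuation tokenizer in Python (a sketch only; real tokenizers must handle the exceptions discussed on the later "Issues in Tokenization" slide):

```python
import re

def tokenize(text):
    """Split text into word tokens, dropping punctuation (a minimal sketch)."""
    return re.findall(r"\w+", text)

print(tokenize("After sleeping for four hours, he decided to sleep for another four"))
# ['After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided',
#  'to', 'sleep', 'for', 'another', 'four']
```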

Lemmatization and Stemming

• A form of text normalization is lemmatization:
  • the task of determining that two words have the same root, despite their surface differences.
  • For example, the words sang, sung, and sings are forms of the verb sing.
  • A lemmatizer maps all of these to sing: sang, sung, sings -> sing
• Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
  • sings -> sing

Lemma

• A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
  • Seuss's cat in the hat is different from other cats!
  • cat and cats = same lemma
• Word form: the full inflected surface form.
  • cat and cats = different word forms

Lemmatization and Stemming

• Lemmatization is the task of determining that two words have the same root, despite their surface differences.
  • The words am, are, and is have the shared lemma be;
  • the words dinner and dinners both have the lemma dinner.
• Representing a word by its lemma is important for web search, and especially important in morphologically complex languages.
  • Ex.: He is reading detective stories would thus be He be read detective story.

How is lemmatization done?

• The most sophisticated methods for lemmatization involve complete morphological parsing of the word, i.e., analysing the way the word is constructed.
• Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes.
• Two broad classes of morphemes:
  • Stems: the central morpheme of the word, supplying the main meaning
  • Affixes: adding "additional" meanings of various kinds

Lemmatization
• Reduce inflections or variant forms to base form:
  • am, are, is -> be
  • car, cars, car's, cars' -> car
  • the boy's cars are different colors -> the boy car be different color
• Lemmatization has to find the correct dictionary headword form.
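The slide's example can be reproduced with a toy lookup-table lemmatizer. The table below is hypothetical and hand-built for this one sentence; real lemmatizers combine morphological parsing with a dictionary of headwords:

```python
# A toy dictionary-based lemmatizer; the LEMMAS table is illustrative only.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "car": "car", "cars": "car", "car's": "car", "cars'": "car",
    "colors": "color", "boy's": "boy",
}

def lemmatize(tokens):
    """Map each token to its dictionary headword, if known."""
    return [LEMMAS.get(t.lower(), t) for t in tokens]

print(lemmatize("the boy's cars are different colors".split()))
# ['the', 'boy', 'car', 'be', 'different', 'color']
```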

Text Normalization

• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words
  2. Normalizing word formats
  3. Segmenting sentences in running text

Words
• What counts as a word?
  • Look at one particular corpus: a collection of text or speech.
  • The Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, etc.)
• How many words are in the following Brown sentence?
  "He stepped out into the hall, was delighted to encounter a water brother."
  • 13 words if we don't count punctuation marks as words, 15 if we count punctuation.
  • Whether we treat period ("."), comma (","), and so on as words depends on the task. Punctuation is critical for finding boundaries.

Stop words

• Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text) because they are insignificant in that context.
• There is no single universal list of stop words used by all NLP tools, nor any agreed-upon rules for identifying stop words, and indeed not all tools even use such a single list.
  • Therefore, the set of stop words can vary with the purpose.
• Ex.: "This is a sample sentence, showing off the stop words filtration."
  • ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
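A sketch of stop-word filtration with a small hand-picked stop list. The list is illustrative, not any toolkit's official one; the filter is case-sensitive, which is why the capitalized 'This' survives, as in the slide's output:

```python
# An illustrative stop list; real toolkits ship their own, and they differ.
STOP_WORDS = {"this", "is", "a", "off", "the"}

def remove_stopwords(tokens):
    """Drop exact matches against the stop list (case-sensitive)."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'filtration', '.']
print(remove_stopwords(tokens))
# ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
```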

How many words?

• Types are the number of distinct words in a corpus.
• Tokens are the total number of running words.
• "They picnicked by the pool, then lay back on the grass and looked at the stars."
• How many? 16 tokens and 14 types.
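Counting types and tokens for the example sentence in Python:

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

tokens = re.findall(r"[A-Za-z]+", sentence)  # running words, punctuation ignored
types = set(tokens)                          # distinct word forms

print(len(tokens), len(types))  # 16 14  ("the" occurs three times)
```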

Herdan's Law / Heaps' Law

• The larger the corpora we look at, the more types we find.
• The relationship between the number of types |V| and the number of tokens N is called Herdan's Law (or Heaps' Law):

  |V| = k N^β

  where k and β are positive constants, and 0 < β < 1.
• The value of β depends on the corpus size and the genre; for large corpora, β ranges from 0.67 to 0.75.
• So the vocabulary size |V| for a text goes up significantly faster than the square root of its number of tokens.
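A small numeric illustration of the law. The constants k and β below are made-up illustrative values (both are corpus-dependent); the point is only that with β > 0.5 the vocabulary grows faster than the square root of N:

```python
k, beta = 30, 0.7  # illustrative constants only; both depend on the corpus

def vocab_size(n_tokens):
    """Predicted |V| under Herdan's/Heaps' law: |V| = k * N**beta."""
    return k * n_tokens ** beta

# Going from 10k to 1M tokens (100x) multiplies |V| by 100**0.7 (about 25x),
# while square-root growth would only give a factor of 10.
print(vocab_size(1_000_000) / vocab_size(10_000))
print((1_000_000 ** 0.5) / (10_000 ** 0.5))  # 10.0
```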
How many words?

• N = number of tokens
• V = vocabulary = set of types; |V| is the size of the vocabulary
• Church and Gale (1990): |V| > O(N½)

Corpus                            Tokens = N    Types = |V|
Switchboard phone conversations   2.4 million   20 thousand
Shakespeare                       884,000       31 thousand
Google N-grams                    1 trillion    13 million

Tokenization: Byte Pair Encoding (BPE)



Another option for text tokenization

• Instead of
  • white-space segmentation
  • single-character segmentation
• use the data to tell us how to tokenize.
• Subword tokenization (because tokens can be parts of words as well as whole words)
Subword tokenization

• Three common algorithms:
  • Byte-Pair Encoding (BPE) (Sennrich et al., 2016)
  • Unigram language modeling tokenization (Kudo, 2018)
  • WordPiece (Schuster and Nakajima, 2012)
• All have 2 parts:
  • A token learner takes a raw training corpus and induces a vocabulary (a set of tokens).
  • A token segmenter takes a raw test sentence and tokenizes it according to that vocabulary.


Byte Pair Encoding (BPE) token learner

• Let the vocabulary be the set of all individual characters = {A, B, C, D, ..., a, b, c, d, ...}
• Repeat:
  • Choose the two symbols that are most frequently adjacent in the training corpus (say 'A', 'B')
  • Add a new merged symbol 'AB' to the vocabulary
  • Replace every adjacent 'A' 'B' in the corpus with 'AB'
• Until k merges have been done.
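The learner and segmenter can be sketched in plain Python. This is a toy implementation of the textbook algorithm, using '_' as the end-of-word symbol; on the corpus from the slides that follow, it recovers the merge list er, er_, ne, new, lo, low, newer_, low_:

```python
from collections import Counter

def bpe_learn(words, k):
    """Token learner: learn up to k BPE merges ('_' ends each word)."""
    vocab = Counter(tuple(w) + ("_",) for w in words)
    merges = []
    for _ in range(k):
        pairs = Counter()                  # adjacent-pair frequencies
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():   # replace the pair everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def bpe_segment(word, merges):
    """Token segmenter: apply the learned merges, in order, to a new word."""
    symbols = list(word) + ["_"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3 + ["new"] * 2
merges = bpe_learn(corpus, 8)
print([a + b for a, b in merges])
# ['er', 'er_', 'ne', 'new', 'lo', 'low', 'newer_', 'low_']
print(bpe_segment("newer", merges))  # ['newer_']
print(bpe_segment("lower", merges))  # ['low', 'er_']
```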
BPE token learner example

• Let the corpus be: low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new
• Start from the base vocabulary of single characters, with each word represented as a sequence of characters plus an end-of-word symbol _ (shown as figures on the slides).
• Merge e r to er
• Merge er _ to er_
• Merge n e to ne
• The next merges follow, giving the full merge list: er, er_, ne, new, lo, low, newer_, low_
• On the test data, run each merge learned from the training data:
  • Test string "n e w e r _" would be tokenized as a full word: newer_
  • Test string "l o w e r _" would be two tokens: low er_

Properties of BPE tokens

• BPE tokens usually include frequent words
• and frequent subwords
  • which are often morphemes like -est or -er.
• A morpheme is the smallest meaning-bearing unit of a language.
  • unlikeliest has 3 morphemes: un-, likely, and -est
BPE token learner algorithm

• Training corpus:
  This is the hugging face course. This chapter is about tokenization. This section shows several tokenizer algorithms.
• Testing corpus: hugs
The Porter Stemmer

• One of the most widely used stemming algorithms is the simple and efficient Porter (1980) algorithm.
  • ATIONAL -> ATE (e.g., relational -> relate)
  • ING -> ø if stem contains vowel (e.g., motoring -> motor)
  • SSES -> SS (e.g., grasses -> grass)
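These three rules can be sketched directly. The `stem` helper below is a hypothetical toy covering only the rules shown; the full Porter algorithm has many more rules, conditions, and ordering constraints:

```python
import re

# A sketch of just the three Porter-style rules above, not the full algorithm.
def stem(word):
    if word.endswith("sses"):
        return word[:-2]              # SSES -> SS
    if word.endswith("ational"):
        return word[:-7] + "ate"      # ATIONAL -> ATE
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]              # ING -> ø only if the stem has a vowel
    return word

print([stem(w) for w in ["relational", "motoring", "sing", "grasses"]])
# ['relate', 'motor', 'sing', 'grass']
```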

Porter's algorithm
Step 1a
  sses -> ss      caresses -> caress
  ies  -> i       ponies -> poni
  ss   -> ss      caress -> caress
  s    -> ø       cats -> cat
Step 1b
  (*v*)ing -> ø   walking -> walk
                  sing -> sing
  (*v*)ed  -> ø   plastered -> plaster
Step 2 (for long stems)
  ational -> ate  relational -> relate
  izer    -> ize  digitizer -> digitize
  ator    -> ate  operator -> operate
Step 3 (for longer stems)
  al   -> ø       revival -> reviv
  able -> ø       adjustable -> adjust
  ate  -> ø       activate -> activ
Viewing morphology in a corpus
Why only strip -ing if there is a vowel?

(*v*)ing -> ø   walking -> walk
                sing -> sing

Words ending in "ing"   Words ending in "ing" whose stem contains a vowel
1312 King               548 being
 548 being              541 nothing
 541 nothing            152 something
 388 king               145 coming
 375 bring              130 morning
 358 thing              122 having
 307 ring               120 living
 152 something          117 loving
 145 coming             116 Being
 130 morning            102 going
Issues in Tokenization

• Some exceptions to handle:
  • Finland's capital -> Finland Finlands Finland's ?
  • what're, I'm, isn't -> What are, I am, is not
  • Hewlett-Packard -> Hewlett Packard ?
  • state-of-the-art -> state of the art ?
  • Lowercase -> lower-case lowercase lower case ?
  • San Francisco -> one token or two?
  • m.p.h., Ph.D. -> ??
  • $44.55 -> prices ?
  • 01/02/06 -> dates ?
  • abc@xyz, www.kuet.ac.bd/cse -> emails and URLs ?

Case folding

• Case folding is another form of normalization:
  • reduce all letters to lower case.
  • Used in applications like IR from text, since users tend to use lower case.
• Possible exception: upper case in mid-sentence?
  • e.g., General Motors
  • US vs. us
  • SAIL vs. sail
• For sentiment analysis, MT, and information extraction, case is helpful.
Sentence Segmentation

• ! and ? are relatively unambiguous.
• The period "." is quite ambiguous:
  • sentence boundary
  • abbreviations like Inc. or Dr.
  • numbers like .02% or 4.3
• Build a binary classifier that:
  • looks at a "."
  • decides EndOfSentence / NotEndOfSentence
  • Classifiers: hand-written rules, regular expressions, or machine learning
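A hand-written-rule sketch of such a classifier in Python. The abbreviation list and the rules are illustrative only; real systems use much larger lists or learned models:

```python
import re

ABBREVS = {"Inc.", "Dr.", "Mr.", "Ms."}   # a tiny illustrative list

def split_sentences(text):
    """Split after . ! ? unless the token is a known abbreviation
    or the period is part of a number (a rule-based sketch)."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if (token.endswith(("!", "?")) or
                (token.endswith(".") and token not in ABBREVS and
                 not re.fullmatch(r"[0-9.%]+", token))):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith works at Acme Inc. in town. He owns 4.3 acres. Really?"))
```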

Determining if a word is end-of-sentence: a Decision Tree
(decision-tree figure shown on the slide)

THANK YOU