Regular Expression

Basic Regular Expression Patterns

The use of the brackets [] to specify a disjunction of characters.


Use of the brackets [] plus the dash - to specify a range

Caret ^ for negation, or just to mean a literal ^
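
The three bracket behaviours above can be tried out in a short sketch. It assumes Python's standard re module; the sample strings are illustrative only and are not taken from the slides.

```python
import re

# [wW] is a disjunction: it matches either "w" or "W".
disjunction = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck")

# [0-9] is a range: any single digit.
digit_range = re.findall(r"[0-9]", "Chapter 1: Down the Rabbit Hole")

# [^A-Z] is a negation: here, the first character that is not an upper-case letter.
negation = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# When ^ is not the first symbol inside the brackets, it is a literal caret.
literal_caret = re.findall(r"[e^]", "look up ^ now")

print(disjunction, digit_range, negation, literal_caret)
```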


The question mark ? marks optionality of the previous expression

? - Zero or one instances of the previous character


* - zero or more occurrences of the immediately previous character or regular
expression

/a*/ means “any string of zero or more a’s”. It matches a or aaaaaa, but it also matches the empty string at the start of Off Minor, since that string begins with zero a’s.
/aa*/ means one a followed by zero or more a’s.
/[ab]*/ means “zero or more a’s or b’s”, like aaaa or ababab or bbbb.
Kleene +: one or more occurrences of the immediately preceding character or regular expression.
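
A minimal sketch of the ?, * and + operators, assuming Python's standard re module. The test strings other than Off Minor are made up for illustration.

```python
import re

# ? : zero or one instance of the previous character.
optional = re.findall(r"woodchucks?", "woodchuck woodchucks")

# * : zero or more. At the start of "Off Minor" there are zero a's,
# so /a*/ succeeds there with an empty match.
empty_match = re.match(r"a*", "Off Minor").group()

# aa* : one a followed by zero or more a's.
one_then_more = re.findall(r"aa*", "baaa ba b")

# + : one or more, equivalent to aa* here.
kleene_plus = re.findall(r"a+", "baaa ba b")

# [ab]* : zero or more a's or b's, checked against the whole string.
all_a_or_b = re.fullmatch(r"[ab]*", "ababab") is not None

print(optional, repr(empty_match), one_then_more, kleene_plus, all_a_or_b)
```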

Use of the period . to specify any character:


Anchors in regular expressions

The caret ^ has three uses:
to match the start of a line
to indicate a negation inside of square brackets
just to mean a caret

/\b99\b/ (\b matches a word boundary)

Matches the 99 in "There are 99 bottles of beer on the wall" (because 99 follows a space).
Does not match the 99 in "There are 299 bottles of beer on the wall" (since 99 follows a number).
Matches the 99 in "$99" (since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
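
The word-boundary cases can be checked with a small sketch, assuming Python's standard re module. The helper name has_99 and the sentence used for the ^ anchor are hypothetical.

```python
import re

# \b matches at a word boundary: a position between a word character
# (letter, digit or underscore) and a non-word character or string edge.
boundary_99 = re.compile(r"\b99\b")

def has_99(text):
    """Return True if 99 occurs as a standalone number in text."""
    return boundary_99.search(text) is not None

after_space = has_99("There are 99 bottles of beer on the wall")
inside_299 = has_99("There are 299 bottles of beer on the wall")
after_dollar = has_99("$99")

# ^ outside brackets anchors the match to the start of the line.
starts_with_the = re.search(r"^The", "The dog barked") is not None
not_at_start = re.search(r"^dog", "The dog barked") is not None

print(after_space, inside_299, after_dollar, starts_with_the, not_at_start)
```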
Disjunction, Grouping, and Precedence
Search for either the string cat or the string dog : /cat|dog/
To match both guppy and guppies: /gupp(y|ies)/

Match repeated instances of a string:

Column 1 Column 2 Column 3

Match the word Column, followed by a number and optional spaces, the whole pattern repeated zero or more times.
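
The pattern for the repeated Column string did not survive in the text, so the one below is a reconstruction from the description (the word Column, a number, optional spaces, all repeated). It is a sketch in Python's standard re module, not necessarily the exact pattern from the slides.

```python
import re

line = "Column 1 Column 2 Column 3"

# Disjunction: either the string cat or the string dog.
pets = re.findall(r"cat|dog", "a cat and a dog")

# Grouping limits the disjunction to the suffix only.
guppy_ok = re.fullmatch(r"gupp(y|ies)", "guppy") is not None
guppies_ok = re.fullmatch(r"gupp(y|ies)", "guppies") is not None

# Without parentheses, the pattern covers a single "Column N " unit.
single = re.match(r"Column [0-9]+ *", line).group(0)

# With parentheses, * applies to the whole group, so the match
# extends over every repetition (reconstructed pattern, see above).
repeated = re.match(r"(Column [0-9]+ *)*", line).group(0)

print(pets, guppy_ok, guppies_ok, repr(single), repr(repeated))
```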
Operator precedence, from highest to lowest

Counters have a higher precedence than sequences:
/the*/ matches theeeee but not thethe.

Sequences have a higher precedence than disjunction:
/the|any/ matches the or any but not thany or theny.
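
These precedence claims can be verified with a short sketch, assuming Python's standard re module. It uses re.fullmatch so that each pattern is tested against the whole string; a plain search would still find the inside thethe. The parenthesised variants are added for contrast and are not from the slides.

```python
import re

def matches_whole(pattern, text):
    """Return True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Counters bind tighter than sequences: * applies only to the final e.
theeeee = matches_whole(r"the*", "theeeee")
thethe = matches_whole(r"the*", "thethe")
thethe_grouped = matches_whole(r"(the)*", "thethe")

# Sequences bind tighter than disjunction: | splits "the" from "any".
the_or_any = matches_whole(r"the|any", "the") and matches_whole(r"the|any", "any")
thany = matches_whole(r"the|any", "thany")
theny = matches_whole(r"the|any", "theny")
theny_grouped = matches_whole(r"th(e|a)ny", "theny")

print(theeeee, thethe, thethe_grouped, the_or_any, thany, theny, theny_grouped)
```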
More Operators
Aliases for common sets of characters:

Regular expression operators for counting


Some characters that need to be backslashed
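
The tables for these three topics are missing from the text. As a stand-in, here is a sketch of a few standard features of Python's re module: the period, the aliases \d and \w, the {n,m} counter, and backslash escaping. The choice of examples is an assumption and may not match the original tables.

```python
import re

# The period matches any single character.
any_char = re.findall(r"beg.n", "begin, begun, beg'n")

# \d is an alias for [0-9]; \w covers letters, digits and underscore.
digits = re.findall(r"\d+", "Party of 5, table 12")
words = re.findall(r"\w+", "Blue moon")

# {n,m} counts: from n to m occurrences of the previous expression.
in_range = re.fullmatch(r"\d{3,5}", "1234") is not None
too_short = re.fullmatch(r"\d{3,5}", "12") is not None

# Special characters such as . must be backslashed to match literally.
periods = re.findall(r"\.", "Dr. Livingston, I presume.")

print(any_char, digits, words, in_range, too_short, periods)
```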
Reference

Code: https://github.com/codebasics/py/blob...
Exercise: https://github.com/codebasics/py/blob.
Substitution, Capture Groups, and ELIZA
The substitution operator s/regexp1/pattern/ replaces a string matched by a regular expression with another string

s/colour/color/

To change the 35 boxes to the <35> boxes:

- put parentheses ( and ) around the first pattern and use the number operator \1 in the second pattern to refer back
- s/([0-9]+)/<\1>/
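
In Python, the s/.../.../ operator corresponds roughly to re.sub. A minimal sketch of the two substitutions above, with made-up input sentences:

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "my favourite colour")

# s/([0-9]+)/<\1>/ : \1 refers back to what the parentheses captured.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

print(american)
print(bracketed)
```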
the Xer they were, the Xer they will be
/the (.*)er they were, the \1er they will be/
Eg: the bigger they were, the bigger they will be

Capture group
/the (.*)er they (.*), the \1er we \2/
Eg: the faster they ran, the faster we ran

Non-capturing group
/(?:some|a few) (people|cats) like some \1/
Eg: some cats like some cats
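
The back-reference and non-capturing patterns can be exercised as follows. This is a sketch assuming Python's standard re module; the non-matching sentences are added for contrast.

```python
import re

# \1 must repeat exactly what the first group captured.
one_group = re.compile(r"the (.*)er they were, the \1er they will be")
same_x = one_group.search("the bigger they were, the bigger they will be")
different_x = one_group.search("the bigger they were, the faster they will be")

# Two capture groups, referenced as \1 and \2.
two_groups = re.compile(r"the (.*)er they (.*), the \1er we \2")
both = two_groups.search("the faster they ran, the faster we ran")

# (?: ... ) groups without capturing, so \1 is (people|cats).
non_capturing = re.compile(r"(?:some|a few) (people|cats) like some \1")
cats = non_capturing.search("some cats like some cats")
a_few = non_capturing.search("some cats like some a few")

print(same_x.group(1), different_x, both.groups(), cats.group(1), a_few)
```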
ELIZA works through a cascade of regular expression substitutions, each of which matches and changes some part of the input lines. After the input is uppercased, substitutions change all instances of MY to YOUR, and I’M to YOU ARE, and so on.
Text normalization
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Simple Unix Tools for Word Tokenization
Word Tokenization
Top-down (rule-based) tokenization
Define a standard and implement rules
A Bottom-up Tokenization Algorithm
Use simple statistics
Byte-Pair Encoding: A Bottom-up Tokenization
Algorithm
Deal with unknown words
Training corpus : low, new, newer
Test corpus: lower
Byte-Pair Encoding Algorithm
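
The algorithm figure is missing from the text, so the following is a sketch of a common formulation of the BPE token learner rather than the slide's own pseudocode. It repeatedly merges the most frequent adjacent symbol pair in the training corpus. The function names, the "_" end-of-word marker and the number of merges are assumptions; ties between equally frequent pairs are broken by iteration order.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols by one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(corpus_words, num_merges):
    """Learn up to num_merges merges from a list of training words."""
    # Start from single characters plus an end-of-word marker "_".
    vocab = Counter(tuple(word) + ("_",) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges

def segment(word, merges):
    """Tokenize a possibly unseen word by replaying the learned merges in order."""
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe(["low", "new", "newer"], num_merges=4)
print(merges)
print(segment("lower", merges))
```

Because segment falls back to single characters, the unseen test word lower is still split into known subword units instead of becoming an unknown word.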
Word Normalization and Stemming

Normalization : To Standard Format


• Applications like IR: reduce all letters to lower case
• Since users tend to use lower case
• Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• For sentiment analysis, MT, Information extraction
• Case is helpful (US versus us is important)
Lemmatization

• Reduce inflections or variant forms to base form


• am, are, is → be
• car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different
color
• Lemmatization: have to find correct dictionary headword
form
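
A toy illustration of lemmatization as dictionary lookup. The table below is hypothetical and covers only the example words above; a real lemmatizer needs a full dictionary and morphological analysis to find the correct headword.

```python
# Hypothetical lookup table built from the examples in the text.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(sentence):
    """Map each whitespace-separated token to its headword if known."""
    return " ".join(LEMMAS.get(token, token) for token in sentence.split())

lemmatized = lemmatize("the boy's cars are different colors")
print(lemmatized)
```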
Morphology
• Morphemes:
• The small meaningful units that make up words
• Stems: The core meaning-bearing units
• Affixes: Bits and pieces that adhere to stems
• Often with grammatical functions
• Eg: fox consists of one morpheme
cats consists of two: the morpheme cat and the morpheme -s.
Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
• language dependent
• e.g., automate(s), automatic, automation all reduced to
automat.

Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm
The most common English stemmer
Step 1a
sses → ss       caresses → caress
ies → i         ponies → poni
ss → ss         caress → caress
s → ø           cats → cat

Step 1b
(*v*)ing → ø    walking → walk
                sing → sing
(*v*)ed → ø     plastered → plaster

Step 2 (for long stems)
ational → ate   relational → relate
izer → ize      digitizer → digitize
ator → ate      operator → operate
…

Step 3 (for longer stems)
al → ø          revival → reviv
able → ø        adjustable → adjust
ate → ø         activate → activ
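
A sketch of Steps 1a and 1b only, covering just the rules listed above. It assumes (*v*) means the remaining stem must contain a vowel and treats only a, e, i, o, u as vowels. The full Porter algorithm has further rules and conditions, so this is not a complete stemmer.

```python
import re

VOWEL = re.compile(r"[aeiou]")

def step_1a(word):
    """Apply the first matching plural rule: sses, ies, ss, s."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss
    if word.endswith("s"):
        return word[:-1]   # s -> (nothing)
    return word

def step_1b(word):
    """Strip ing or ed, but only if the remaining stem contains a vowel."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if VOWEL.search(stem) else word
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["walking", "sing", "plastered"]:
    print(w, "->", step_1b(w))
```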
