Regular Expression
/a*/ means “any string of zero or more a’s”. It matches a or aaaaaa, but it will
also match the empty string at the start of Off Minor, which begins with zero a’s
/aa*/, meaning one a followed by zero or more a’s
/[ab]*/ means “zero or more a’s or b’s”, like aaaa or ababab or bbbb
Kleene +: one or more occurrences of the immediately
preceding character or regular expression, so /a+/ is equivalent to /aa*/
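The behaviour of the Kleene star and Kleene plus described above can be checked with a short sketch using Python's standard re module. The variable names are illustrative only and are not part of the slides.

```python
import re

# re.fullmatch succeeds only if the entire string matches the pattern.
star_matches_empty = re.fullmatch(r"a*", "") is not None      # True: zero a's is allowed
star_matches_run = re.fullmatch(r"a*", "aaaaaa") is not None  # True
aa_star_matches_empty = re.fullmatch(r"aa*", "") is not None  # False: at least one a is required
plus_matches_run = re.fullmatch(r"a+", "aaa") is not None     # True: a+ behaves like aa*
class_star = re.fullmatch(r"[ab]*", "ababab") is not None     # True

# re.search returns the leftmost match; "Off Minor" starts with zero a's,
# so a* matches the empty string at position 0.
first = re.search(r"a*", "Off Minor")
print(repr(first.group()), first.start())  # '' 0
```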
The caret ^ has three uses:
to match the start of a line
to indicate a negation inside of square brackets
just to mean a caret
/\b99\b/ matches 99 only as a whole word (\b is a word boundary): 99 in 99 bottles, but not in 299
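A minimal sketch of the three caret uses and the word boundary \b, again with Python's re module. The sample strings are made up for illustration.

```python
import re

text = "There are 99 bottles\nThere are 299 bottles"

# 1. Outside brackets, ^ anchors at the start of a line
#    (re.MULTILINE lets it match after every newline, not just at position 0).
line_starts = re.findall(r"^There", text, flags=re.MULTILINE)

# 2. As the first character inside brackets, ^ negates the class:
#    here, anything that is neither a lowercase letter nor a space.
negated = re.findall(r"[^a-z ]", "Off Minor")

# 3. Anywhere else inside brackets, ^ is just a literal caret.
literal = re.findall(r"[e^]", "e^x")

# \b matches at a word boundary, so 99 inside 299 is skipped.
whole_word = re.findall(r"\b99\b", text)

print(line_starts)  # ['There', 'There']
print(negated)      # ['O', 'M']
print(literal)      # ['e', '^']
print(whole_word)   # ['99']
```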
Code: https://github.com/codebasics/py/blob...
Exercise: https://github.com/codebasics/py/blob.
Substitution, Capture Groups, and ELIZA
Substitution operator s/regexp1/pattern/: a string matched by the regular
expression regexp1 is replaced by another string, pattern
s/colour/color/
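In Python, the closest counterpart to the s/.../.../ operator is re.sub. The sketch below applies the colour-to-color substitution to a made-up sentence.

```python
import re

sentence = "The colour of the colourful wall"

# re.sub(pattern, replacement, string) replaces every match, like s/colour/color/g.
corrected = re.sub(r"colour", "color", sentence)

print(corrected)  # The color of the colorful wall
```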
Capture group
/the (.*)er they (.*), the \1er we \2/
Eg: the faster they ran, the faster we ran
Non-capturing group
/(?:some|a few) (people|cats) like some \1/
Eg: some cats like some cats
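The two patterns above can be tried out directly in Python, whose re module uses the same \1 back-reference and (?: ... ) syntax. The non-matching test sentences are illustrative additions.

```python
import re

# \1 and \2 must repeat exactly what the first and second groups captured.
comparative = re.compile(r"the (.*)er they (.*), the \1er we \2")
hit = comparative.fullmatch("the faster they ran, the faster we ran")
miss = comparative.fullmatch("the faster they ran, the faster we ate")

print(hit.groups())  # ('fast', 'ran')
print(miss)          # None

# (?: ... ) groups without capturing, so \1 refers to (people|cats).
quantified = re.compile(r"(?:some|a few) (people|cats) like some \1")
ok = quantified.fullmatch("some cats like some cats")
bad = quantified.fullmatch("some cats like some some")

print(ok.group(1))  # cats
print(bad)          # None
```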
ELIZA works by a cascade of regular expression substitutions, each of
which matches and changes some part of the input lines. After the input
is uppercased, substitutions change all instances of MY to YOUR,
and I’M to YOU ARE, and so on
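A minimal sketch of this first stage, assuming only the two rewrites named above. The rule list, the function name eliza_normalize, and the use of a plain ASCII apostrophe are illustrative assumptions, not the original ELIZA rule set.

```python
import re

# Hypothetical rule list: (pattern, replacement) pairs applied in order.
# \b keeps MY from matching inside longer words such as MYSELF.
RULES = [
    (r"\bI'M\b", "YOU ARE"),
    (r"\bMY\b", "YOUR"),
]

def eliza_normalize(line):
    """Uppercase the input, then apply each substitution in turn."""
    text = line.upper()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

reply = eliza_normalize("I'm sad about my job")
print(reply)  # YOU ARE SAD ABOUT YOUR JOB
```

A fuller system would follow these rewrites with response patterns that use capture groups to echo part of the input back to the user.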
Text normalization
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Simple Unix Tools for Word Tokenization
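The usual Unix approach chains tr, sort, and uniq -c to split text on non-letters and count the resulting words. The same idea can be sketched in Python; the function name word_counts and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Count case-folded alphabetic tokens in text."""
    # Like tr -sc 'A-Za-z' '\n': every run of non-letters acts as a separator.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Like tr A-Z a-z | sort | uniq -c: fold case, then count each word type.
    return Counter(token.lower() for token in tokens)

counts = word_counts("The cat sat. The cat ran!")
print(counts["the"], counts["cat"], counts["sat"])  # 2 2 1
```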
Word Tokenization
Top-down (rule-based) tokenization
Define a standard and implement rules
Bottom-up tokenization
Use simple statistics of letter sequences to induce subword tokens
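The top-down approach can be sketched as a single regular expression whose alternatives encode the tokenization standard. The particular rules below (abbreviations, currency and numbers, hyphenated words, punctuation) are one possible standard chosen for illustration, not one prescribed by the slides.

```python
import re

# Alternatives are tried left to right, so more specific rules come first.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Z]\.)+           # abbreviations such as U.S.A.
  | \$?\d+(?:\.\d+)?%?     # currency amounts, numbers, percentages
  | \w+(?:[-']\w+)*        # words, allowing internal hyphens or apostrophes
  | [^\w\s]                # any other single symbol, e.g. punctuation
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found by the rule-based pattern."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("That U.S.A. poster-print costs $12.40.")
print(tokens)  # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '.']
```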
Byte-Pair Encoding: A Bottom-up Tokenization Algorithm
Deals with unknown words by building tokens from subword units
Training corpus: low, new, newer
Test corpus: lower (never seen in training)
Byte-Pair Encoding Algorithm
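A minimal sketch of a BPE token learner and segmenter, using the training words low, new, newer and the test word lower from above. The word frequencies, the end-of-word marker "_", the number of merges, and all function names are assumptions made for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a word -> frequency mapping."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("_",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges

def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the merges in learned order."""
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Assumed frequencies for the training words.
merges = learn_bpe({"low": 5, "new": 2, "newer": 6}, num_merges=7)
pieces = segment("lower", merges)
print(merges[0])  # ('n', 'e')
print(pieces)     # ['low', 'e', 'r', '_']
```

Although lower never appears in training, it is still segmented into known units: the learned subword low followed by single characters.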
Word Normalization and Stemming