Basic Text Processing: Regular Expressions and Text Normalization
Corpus
Corpus linguistics encompasses the compilation and analysis of collections of spoken and written texts as the source of evidence for describing the nature, structure, and use of languages.
A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
Corpora (plural of corpus) vary greatly in size and design.
Most are nowadays in electronic form, built for the purpose of supporting computer-aided analysis.
Corpora are often annotated to show grammatical classes,
structures and functions.
Software can then be used to analyse these grammatical structures or to identify such features.
Regular Expressions: Ranges [A-Z]
Pattern   Matches                    Example
[A-Z]     an upper case letter       Drenched Blossoms
[a-z]     a lower case letter        my beans were impatient
[0-9]     a single digit             Chapter 1: Down the Rabbit Hole
Pattern   Matches                       Example
[^A-Z]    not an upper case letter      Oyfn pripetchik
[^Ss]     neither 'S' nor 's'           I have no exquisite reason”
[e^]      either 'e' or '^'             Look here
a^b       the pattern a^b (a caret b)   Look up a^b now
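These classes can be tried directly with Python's re module; a small sketch using the example strings from the tables above:

```python
import re

# Character ranges: re.search finds the first character in each class.
print(re.search(r"[A-Z]", "Drenched Blossoms").group())                # 'D'
print(re.search(r"[a-z]", "my beans were impatient").group())          # 'm'
print(re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group())  # '1'

# Negation: ^ negates a class only when it is the first symbol in the brackets.
print(re.search(r"[^A-Z]", "Oyfn pripetchik").group())                 # 'y'
print(re.search(r"[^Ss]", "I have no exquisite reason").group())       # 'I'
print(re.search(r"[e^]", "Look here").group())                         # 'e'
# Outside brackets, ^ is an anchor in Python, so a literal caret is escaped.
print(re.search(r"a\^b", "Look up a^b now").group())                   # 'a^b'
```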
Regular Expressions: ? * + .
Pattern   Matches                    Example
colou?r   optional previous char     color colour
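A quick check of these operators in Python (the oo*h!, o+h!, and beg.n patterns are standard illustrations, not from the table above):

```python
import re

print(re.findall(r"colou?r", "color colour"))     # ['color', 'colour']       (? = optional)
print(re.findall(r"oo*h!", "oh! ooh! oooh!"))     # ['oh!', 'ooh!', 'oooh!']  (* = zero or more)
print(re.findall(r"o+h!", "oh! ooh! oooh!"))      # ['oh!', 'ooh!', 'oooh!']  (+ = one or more)
print(re.findall(r"beg.n", "begin began begun"))  # ['begin', 'began', 'begun'] (. = any char)
```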
Operator precedence hierarchy, from highest to lowest:
1. Parenthesis ()
2. Counters * + ? {}
3. Sequences and anchors (e.g. the ^my end$)
4. Disjunction |
Example: finding instances of the English article "the". Since /[a-z]*/ matches zero or more letters, the pattern must explicitly require a non-letter (or a line boundary) on each side, to avoid matching embedded strings like "other" or "theology":
/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/
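Running this pattern over a small text shows the embedded cases being skipped (a sketch; the sample sentence is made up):

```python
import re

pattern = r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)"
text = "The other day, the theology student read the book."

# 'other' and 'theology' are skipped: they have letters adjacent to 'the'.
for m in re.finditer(pattern, text):
    print(repr(m.group()))   # 'The ', ' the ', ' the '
```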
Example task: find "any machine with at least 6 GHz and 500 GB of disk space for less than $1000".
To do this we first need to be able to look for expressions like "6 GHz" or "500 GB" or "Mac" or "$999.99".
Pattern                                        Purpose
/\$[0-9]+/                                     a dollar sign followed by a string of digits
/\$[0-9]+\.[0-9][0-9]/                         fractions of dollars; matches $199.99 but not $199
/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/              makes the cents optional and requires a word boundary
/(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b/          limits the dollars; the previous pattern allows prices like $199999.99
/\b[6-9]+ *(GHz|[Gg]igahertz)\b/               processor speed of at least 6 GHz
/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/    disk space in GB
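The final patterns can be tested against the original query (a sketch, with the patterns copied from the table):

```python
import re

ad = "any machine with at least 6 GHz and 500 GB of disk space for $999.99"

patterns = {
    "price": r"(^|\W)\$[0-9]{0,3}(\.[0-9][0-9])?\b",
    "speed": r"\b[6-9]+ *(GHz|[Gg]igahertz)\b",
    "disk":  r"\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b",
}

for name, pat in patterns.items():
    hits = [m.group().strip() for m in re.finditer(pat, ad)]
    print(name, hits)   # price ['$999.99']  speed ['6 GHz']  disk ['500 GB']
```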
The process we just went through was based on fixing two kinds of errors:
• matching strings that we should not have matched (there, then, other): false positives (Type I errors);
• not matching strings that we should have matched (The): false negatives (Type II errors).
Reducing the error rate thus means increasing precision (minimizing false positives) and increasing recall (minimizing false negatives).
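A toy calculation with made-up counts, just to make the two quantities concrete:

```python
# Hypothetical counts for one pattern's output (not from any real experiment).
tp = 80   # true positives: correct matches
fp = 10   # false positives (Type I): matched but should not have
fn = 20   # false negatives (Type II): should have matched but did not

precision = tp / (tp + fp)   # 0.889 -- fraction of returned matches that are right
recall    = tp / (tp + fn)   # 0.800 -- fraction of true instances that were found
print(f"precision={precision:.3f}, recall={recall:.3f}")
```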
Text Normalization
Stems: the core meaning-bearing units, carrying the main meaning.
Affixes: adding "additional" meanings of various kinds.
1. Segmenting/tokenizing words
2. Normalizing word formats
3. Segmenting sentences in running text
How many words are in the following sentence?
"He stepped out into the hall, was delighted to encounter a water brother."
13 words if we don't count punctuation marks as words, 15 if we count punctuation.
Whether we treat period (“.”), comma (“,”), and so on as words depends on
the task. Punctuation is critical for finding boundaries.
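Both counts can be reproduced with a crude regex tokenizer (illustrative only, not a standard tokenizer):

```python
import re

sent = "He stepped out into the hall, was delighted to encounter a water brother."

tokens = re.findall(r"\w+|[^\w\s]", sent)       # words plus punctuation marks
words = [t for t in tokens if t[0].isalnum()]   # drop the punctuation tokens

print(len(words))    # 13
print(len(tokens))   # 15
```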
Example: stop word filtration on the sentence "This is a sample sentence, showing off the stop words filtration." yields:
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
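This is the classic NLTK stop-word example; a sketch that reproduces the output above (assuming the input sentence given):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("stopwords", quiet=True)   # stop-word lists

sentence = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words("english"))

tokens = word_tokenize(sentence)
# Case-sensitive check: 'This' survives because the list contains only 'this'.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)
# ['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
```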
Instead of
• white-space segmentation
• single-character segmentation
Use the data to tell us how to tokenize.
Subword tokenization (so called because tokens can be parts of words as well as whole words).
Consider a corpus: low low low low low lowest lowest newer newer newer newer newer newer wider wider wider new new
Its representation as symbol sequences with counts, using the end-of-word symbol _:
5  l o w _
2  l o w e s t _
6  n e w e r _
3  w i d e r _
2  n e w _
BPE token learner: repeatedly merge the most frequent pair of adjacent symbols.
Merge 1: e r → er
Merge 2: er _ → er_
Merge 3: n e → ne
...
Result:
Test set "n e w e r _" would be tokenized as a full word
Test set "l o w e r _" would be two tokens: "low er_"
Sentence segmentation: build a binary classifier, trained on a training corpus, that looks at each "." and decides EndOfSentence/NotEndOfSentence. Such a classifier can use hand-written rules, regular expressions, or machine-learning.
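A toy rule-based stand-in for such a classifier (the abbreviation list is hypothetical; a real system would use a large lexicon or a model learned from the training corpus):

```python
# Hypothetical, tiny abbreviation list standing in for a learned model.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

def classify_period(text, i):
    """Decide EndOfSentence/NotEndOfSentence for the '.' at index i."""
    word = text[: i + 1].split()[-1].lower()   # the token ending at this period
    if word in ABBREVIATIONS:
        return False                           # known abbreviation: not a boundary
    rest = text[i + 1:].lstrip()
    return rest == "" or rest[0].isupper()     # boundary if next word is capitalized

text = "Dr. Azhar teaches at KUET. The course covers text normalization."
ends = [i for i, c in enumerate(text) if c == "." and classify_period(text, i)]
print(ends)   # the two sentence-final periods; the '.' in 'Dr.' is excluded
```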
THANK YOU