2 - Text Operations
Information Storage and Retrieval 2.7 Baeza-Yates, Berthier Ribeiro-Neto, 2022
Word distribution: Zipf's Law...
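• r * f = c, where r is a word's rank when words are sorted by decreasing frequency, f is its frequency, and c is a constant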
• Different collections have different constants c.
• The table shows the most frequently occurring words from 336,310
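The rank-frequency relation can be checked on any token list. The following is a minimal sketch, not taken from the slides: the function name `rank_frequency_table` and the toy token list are assumptions made for illustration, and the toy counts were chosen so that the product stays roughly constant.

```python
from collections import Counter

def rank_frequency_table(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy token list, invented for illustration only (not the collection from the slides).
tokens = ("the " * 6 + "of " * 3 + "and " * 2 + "zipf").split()

for rank, word, freq, product in rank_frequency_table(tokens):
    print(rank, word, freq, product)
# prints:
# 1 the 6 6
# 2 of 3 6
# 3 and 2 6
# 4 zipf 1 4
```

On a real collection the product r * f only stays approximately constant, and the value of c depends on the collection.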
– Elimination of stop words: filter out words which are not useful
in the retrieval process
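Stop word elimination can be sketched as a simple filter over the token stream. The stop list `STOP_WORDS` and the helper `remove_stop_words` below are hypothetical and only illustrate the idea; real stop lists are much longer and depend on the language and the application.

```python
# Hypothetical stop list for illustration; real lists are far larger.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The distribution of words in a collection".split()
index_terms = remove_stop_words(tokens)
print(index_terms)  # ['distribution', 'words', 'collection']
```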
[Figure: text operations pipeline: document corpus (free text) → tokenization → stop words → stemming → thesaurus → index terms]
• For example,
• Tokenization Issues
– How can we ensure that the same meaning is represented by identical
terms in both the index and the query?
BT: Vehicles
RT: Road Engineering
Road Transport
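One way to use such an entry is to expand a query term with its broader terms (BT) and related terms (RT). The sketch below is an illustration under assumptions: the entry term "cars" is hypothetical, since the slide does not show which term the entry belongs to, and "Road Transport" is assumed to be a second RT. The name `expand_term` is also invented here.

```python
# Toy thesaurus. The entry term "cars" is a hypothetical placeholder;
# the BT/RT values are the ones listed in the entry above.
THESAURUS = {
    "cars": {
        "BT": ["Vehicles"],
        "RT": ["Road Engineering", "Road Transport"],
    },
}

def expand_term(term, thesaurus=THESAURUS, relations=("BT", "RT")):
    """Return the term followed by its broader and related terms, if any."""
    entry = thesaurus.get(term.lower(), {})
    expanded = [term]
    for relation in relations:
        expanded.extend(entry.get(relation, []))
    return expanded

print(expand_term("cars"))
# ['cars', 'Vehicles', 'Road Engineering', 'Road Transport']
print(expand_term("bicycles"))
# ['bicycles']
```

Applying the same expansion at indexing time and at query time is one possible way to map the same meaning to identical terms on both sides.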
– Language-specific
– Often application-specific