(ISR)
Chapter 2
Document Representation and Text Operations
1
Document representation
• An IR system/search engine does not scan each document to see if it satisfies the query
• It uses an index to quickly locate the relevant documents
• Index: a list of concepts with pointers to the documents that discuss (represent) them
  – What goes in the index is very important
• Document representation: deciding which concepts should go in the index
2
Document representation cont. …
Two options:
1. Controlled vocabulary – a set of manually constructed concepts that describe the major topics covered in the collection
2. Free-text indexing – the set of individual terms that occur in the collection
3
1. Document representation cont. …
• Controlled vocabulary: a set of well-defined concepts
  – Assigned to documents by humans (or automatically)
    • E.g. subject headings, keywords, etc.
  – May include parent-child relations between concepts
    • E.g. Computers > Software/Hardware > Information Retrieval
  – Facilitate non-query-based browsing and exploration, because users can browse the parent-child relationships
4
Controlled Vocabulary: Advantages
• Concepts do not need to appear explicitly in the text
• Relationships between concepts facilitate non-query-based navigation and exploration
• Developed by experts who know the data and the users
• Represent the concepts/relationships that users (presumably) care the most about
• Describe the concepts that are most central to the document
• Concepts are unambiguous and recognizable (necessary for
annotators and good for users)
5
Controlled Vocabulary: Disadvantages
• Time consuming
• Users must know the concepts in the index
• Labor intensive
6
2. Free Text Indexing
• Represent documents using terms within the document
• Which terms? Only the most descriptive terms? Only the unambiguous ones? All of them?
  – Usually, all of them (a.k.a. full-text indexing)
• The user will use term combinations to express higher-level concepts
• Query terms will hopefully disambiguate each other (e.g., “volkswagen golf”)
• The search engine will determine which terms are important
7
How are the texts handled?
• What happens if you take the words exactly as they appear in the original text?
• What about punctuation, capitalization, etc.?
• What about spelling errors?
• What about plural vs. singular forms of words?
• What about cases and declension (the variation of the form of a noun, pronoun, or adjective to express its grammatical case) in non-English languages?
• What about non-Roman alphabets?
8
Free Text Indexing: Steps
1. Mark-up removal
2. Normalization – e.g., down-casing
   – Information and information
   – Retrieval and RETRIEVAL
   – US and us – down-casing can change the meaning of words
3. Tokenization – splitting text into words (based on sequences of non-alphanumeric characters)
   – Problematic cases: ph.d. → ph d, isn’t → isn t
4. Stop word removal
5. Apply steps 1–4 to every document in the collection, and
6. Create an index using the union of all remaining terms (see the sketch below)
9
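The following is a minimal sketch (not from the slides) of steps 1–6; the toy stop list and the regular expressions are illustrative assumptions, not a production design.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}  # toy stop list (assumption)

def preprocess(text):
    """Apply steps 1-4 to one document."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. mark-up removal (strip tag-like spans)
    text = text.lower()                           # 2. normalization (down-casing)
    tokens = re.split(r"[^a-z0-9]+", text)        # 3. tokenization on non-alphanumeric runs
    return [t for t in tokens if t and t not in STOPWORDS]  # 4. stop word removal

def build_index(docs):
    """Steps 5-6: preprocess every document and index the union of remaining terms."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in preprocess(text):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = ["<p>Information Retrieval is fun.</p>",
        "Retrieval of information from an index."]
print(build_index(docs))
# e.g. {'information': {0, 1}, 'retrieval': {0, 1}, 'fun': {0}, 'from': {1}, 'index': {1}}
```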
Controlled Vs. Free Text Indexing
• Comparison dimensions: cost of assigning index terms, ambiguity of index terms, and detail of representation
11
Statistical Properties of words in a Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• Such properties of a text collection greatly affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system
• There are three well-known results (each named after a researcher) that describe statistical properties of words in a text:
1. Zipf’s law: models the word distribution in a text corpus
2. Luhn’s idea: measures word significance
3. Heaps’ law: shows how vocabulary size grows with the growth of the corpus size
12
1. Zipf's Law: Word Distribution/Frequency
• A few words are very common.
  – The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
• Most words are very rare.
  – Half the words in a corpus appear only once; they are called “read only once” words, or hapax legomena (in Greek)
13
Zipf's Law: Word distribution cont. ...
• Zipf's law, named after the Harvard linguistics professor George Kingsley Zipf (1902–1950), attempts to capture the distribution of the frequencies (i.e., numbers of occurrences) of the words within a text.
• For all the words in a collection of documents, for each word w:
  f : the frequency of w
  r : the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
• Zipf's law states that f is proportional to 1/r, i.e. f × r ≈ constant
14
Word distribution: Zipf's Law
• Zipf's law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf.
• The table shows the most frequently occurring words from a corpus of 336,310 documents containing 125,720,891 total words, of which 508,209 are unique.
15
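As a quick sketch (not from the slides), the rank-frequency behaviour can be checked on any plain-text corpus; the file name corpus.txt below is a placeholder assumption.

```python
from collections import Counter
import re

def zipf_table(text, top=10):
    """Rank terms by frequency; Zipf's law predicts rank * frequency to be roughly constant."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ranked = Counter(tokens).most_common(top)
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ranked, start=1)]

with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus file
    for rank, term, freq, rf in zipf_table(f.read()):
        print(f"{rank:>4}  {term:<15} {freq:>8}  rank*freq = {rf}")
```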
Methods that Build on Zipf's Law
• Stop lists: ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: take the words in between the most frequent (upper cut-off) and least frequent (lower cut-off) words
• Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
16
Zipf’s Law: Impact on IR
17
2. Word significance: Luhn’s Ideas
• Luhn’s idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance
• For this, Luhn specifies two cutoff points, an upper and a lower cutoff, based on which non-significant words are excluded
  – Words exceeding the upper cutoff were considered to be too common
  – Words below the lower cutoff were considered to be too rare
  – Hence neither group contributes significantly to the content of the text
  – The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cutoffs
20
Vocabulary Growth: Heaps’ Law
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  – This determines how the size of the inverted index will scale with the size of the corpus.
• Heaps’ law estimates the vocabulary size of a given corpus
  – The vocabulary size grows as K·n^β, where β is a constant between 0 and 1.
  – If V is the size of the vocabulary and n is the length of the corpus in words, Heaps’ law gives:
      V = K · n^β
    where the constants typically are:
      K ≈ 10–100
      β ≈ 0.4–0.6 (approximately square-root growth)
21
Heaps’ law distribution
• (Figure) Distribution of the vocabulary size vs. the total number of terms extracted from a text corpus
22
Example: Heaps’ Law
• We want to estimate the size of the vocabulary for a corpus of 1,000,000 words
• Assume that, based on statistical analysis of smaller corpora:
  – a corpus of 100,000 words contains 50,000 unique words, and
  – a corpus of 500,000 words contains 150,000 unique words
• Estimate the vocabulary size for the 1,000,000-word corpus (a worked sketch follows below)
  – What about for a corpus of 1,000,000,000 words?
23
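A worked sketch of this example, assuming Heaps’ law V = K·n^β fits the two observations exactly:

```python
import math

def fit_heaps(n1, v1, n2, v2):
    """Fit V = K * n**beta from two (corpus size, vocabulary size) observations."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    K = v1 / n1 ** beta
    return K, beta

K, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
print(f"K = {K:.1f}, beta = {beta:.3f}")                         # K ~ 19.3, beta ~ 0.683
print(f"V(1,000,000)     = {K * 1_000_000 ** beta:,.0f}")        # ~ 241,000 unique words
print(f"V(1,000,000,000) = {K * 1_000_000_000 ** beta:,.0f}")    # ~ 27 million unique words
```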
Text Operations
• Not all words in a document are equally significant for representing the contents/meaning of the document
  – Some words carry more meaning than others
  – Nouns are the most representative of a document's content
• Therefore, we need to preprocess the text of the documents in a collection before using it as a source of index terms
• Tokenization issues
  – numbers, hyphens, punctuation marks, apostrophes …
28
Issues in Tokenization
• One word or multiple: how to handle special cases involving hyphens, apostrophes, punctuation marks, etc.? C++, C#, URLs, e-mail, …
29
Cont. …
• Two words may be connected by hyphens
  – Should two words connected by a hyphen be taken as one word or two? Should a hyphenated sequence be broken up into two tokens?
30
Cont. …
• Two words may be connected by punctuation marks
  – Punctuation marks: remove entirely unless significant, e.g. program code: x.exe vs. xexe. What about Kebede’s, www.command.com?
• Two words (a phrase) may be separated by a space
  – E.g. Addis Ababa, San Francisco, Los Angeles
31
Issues in Tokenization
• Numbers: are numbers/digits words and used as index terms?
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers (+251923415--)
– IP addresses (100.2.86.144)
  – Numbers alone (like 1910, 1999) are not good index terms, but “510 B.C.” is unique. Generally, don’t index numbers as text, even though they can be very useful (one possible policy is sketched below).
• What about the case of letters (e.g. Data vs. data vs. DATA)?
  – Case is usually not important, so there is a need to convert everything to upper or lower case. Which one do human beings mostly use?
34
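One possible tokenization policy for these cases, as a sketch (the specific choices below are assumptions, not the only reasonable answer): lowercase everything, keep internal hyphens and apostrophes, and optionally drop tokens that are pure numbers.

```python
import re

# Keep letters/digits, allowing internal hyphens and apostrophes (one policy among many).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text, index_numbers=False):
    tokens = TOKEN_RE.findall(text.lower())
    if not index_numbers:
        # Drop tokens that consist only of digits (dates, phone numbers, IP fragments).
        tokens = [t for t in tokens if not t.replace("-", "").replace("'", "").isdigit()]
    return tokens

print(tokenize("Addis Ababa's anti-discriminatory law of 3/12/91, see www.command.com"))
# ['addis', "ababa's", 'anti-discriminatory', 'law', 'of', 'see', 'www', 'command', 'com']
```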
2. Stop-word Removal
• Stopwords: words that we ignore because we expect them not to be useful in distinguishing between relevant and non-relevant documents for any query
• A stopword is a term that is discarded from the document representation
• Stopwords are extremely common words across document collections that have no discriminatory power
• Assumption: stopwords are unimportant because they are frequent in every document
  – They may occur in 80% of the documents in a collection.
  – They appear to be of little value in helping select documents matching a user need, and so should be filtered out as potential index terms
35
Stopword Removal cont. …
• Stopwords are typically function words:
  – Examples of stopwords are articles, prepositions, conjunctions, etc.:
    • articles (a, an, the); pronouns (I, he, she, it, their, his)
    • some prepositions (on, of, in, about, besides, against)
    • conjunctions/connectors (and, but, for, nor, or, so, yet), verbs (is, are, was, were)
    • adverbs (here, there, out, because, soon, after) and
    • adjectives (all, any, each, every, few, many, some) can also be treated as stopwords
• Stopwords are language dependent
36
Why Stopword Removal?
• Intuition:
  – Stopwords have little semantic content; it is typical to remove such high-frequency words
  – Stopwords take up about 50% of the text; hence, document size reduces by 30–50% after their removal
• Smaller indices for information retrieval
  – Better compression of indices: the 30 most common words account for about 30% of the tokens in written text
• With the removal of stopwords, we can get a better approximation of term importance for text classification, text categorization, text summarization, etc.
37
How to detect a stopword?
38
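The slide leaves this question open. One common answer, sketched below as an assumption rather than the slides' method, is to derive a stop list from document frequency using the ~80% heuristic mentioned earlier, where `index` maps each term to the set of documents containing it (as in the indexing sketch above).

```python
def detect_stopwords(index, num_docs, df_threshold=0.80):
    """Treat any term that occurs in more than df_threshold of all documents as a stopword
    (the 80% document-frequency heuristic from the previous slide)."""
    return {term
            for term, postings in index.items()
            if len(postings) / num_docs > df_threshold}

# Usage with the build_index() sketch shown earlier:
# index = build_index(docs)
# stopwords = detect_stopwords(index, num_docs=len(docs))
```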
Trends in Stopwords
• Stopword elimination used to be standard in older IR systems, but the trend is away from doing this nowadays.
• Most web search engines index stopwords:
  – Good query optimization techniques mean you pay little extra at query time for including stopwords.
– You need stopwords for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
  – Elimination of stopwords might reduce recall (e.g. “To be or not to be”: all terms eliminated except “be”, resulting in no retrieval or irrelevant retrieval)
• Therefore, stopword handling still needs further improvement.
39
3. Normalization
• Normalization is canonicalizing tokens (words or terms) so that matches occur despite superficial differences in the character sequences of the tokens
  – Need to “normalize” terms in the indexed text as well as query terms into the same form
  – Example: we want to match U.S.A. and USA, by deleting periods in a term (see the sketch below)
• Case folding: often best to lowercase everything, since users will use lowercase regardless of ‘correct’ capitalization…
  – Republican vs. republican
  – Fasil vs. fasil vs. FASIL
  – Anti-discriminatory vs. anti-discriminatory
  – Car vs. Automobile?
40
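A minimal normalization sketch for the U.S.A./USA example and case folding (an illustration, not a complete normalizer):

```python
def normalize(term):
    """Delete periods and case-fold so superficially different tokens match."""
    return term.replace(".", "").lower()

assert normalize("U.S.A.") == normalize("USA") == "usa"
assert normalize("Republican") == normalize("republican")
```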
Normalization issues
• Good for
  – Allows instances of Automobile at the beginning of a sentence to match a query for automobile
  – Helps a search engine when most users type ferrari while they are interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, Kebede…
• Solution:
– lowercase only words at the beginning of the sentence
• In IR, lowercasing is most practical because of the way
users issue their queries
41
4. Stemming/Morphological analysis
• Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
• Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words. Similarly, destruction → destroy.
42
Stemming/morphological analysis
• Basic question: words occur in different forms. Do we
want to treat different forms as different index terms?
• Conflation: treating different (inflectional and derivational) variants as the same index term
• What are we trying to achieve by conflating morphological variants?
• Goal: help the system ignore unimportant variations of language usage.
43
Stemming cont. …
• The final output from a conflation algorithm is a set of classes, one for each stem detected
  – A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
  – Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
  – Thus, [automate, automatic, automation] all reduce to automat
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document
  – A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords
• Queries: queries are handled in the same way
44
Ways to implement stemming
There are basically two ways to implement stemming (a sketch of both follows below):
  – The first approach is to create a big dictionary that maps words to their stems
    • The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear
  – The second approach is to use a set of rules that extract stems from words
    • Techniques widely used include rule-based, statistical, machine-learning, and hybrid approaches
    • The advantages of this approach are that the code is typically small and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes
  – But since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen
45
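A sketch of the two approaches; the dictionary entries and suffix list below are illustrative assumptions, far smaller than anything real.

```python
# Approach 1: dictionary lookup (exact, but expensive to build and maintain).
STEM_DICT = {
    "connected": "connect", "connecting": "connect",
    "connection": "connect", "connections": "connect",
}

# Approach 2: rule-based suffix stripping (small code, handles new words, makes mistakes).
SUFFIXES = ["ations", "ation", "ings", "ing", "ions", "ion", "ies", "ed", "es", "s"]

def stem(word):
    word = word.lower()
    if word in STEM_DICT:                 # try the dictionary first
        return STEM_DICT[word]
    for suffix in SUFFIXES:               # otherwise fall back to crude rules
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["connections", "automation", "cats", "running"]])
# ['connect', 'autom', 'cat', 'runn']  (note the occasional mistakes)
```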
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving its stem
  – Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right
  – Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html
46
Porter stemmer
• Most common algorithm for stemming English words to
their common grammatical root
• It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
  – SSES → SS      caresses → caress
  – IES → I        ponies → poni
  – SS → SS        caress → caress
  – S → (null)     cats → cat
  – EMENT → (null) replacement → replac
                   cement → cement (unchanged, because removing EMENT would leave too short a stem)
47
Porter stemmer
• The Porter stemmer works in steps.
  – Step 1a gets rid of plurals (-s and -es), while
  – step 1b removes -ed and -ing.
  e.g.
    agreed -> agree      disabled -> disable
    matting -> mat       mating -> mate
    meeting -> meet      milling -> mill
    messing -> mess      meetings -> meet
    feed -> feed
48
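These rules are implemented, for example, in NLTK's PorterStemmer (assuming the nltk package is available); a quick check against the slide's examples:

```python
from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "replacement", "cement",
             "agreed", "disabled", "matting", "mating", "meetings", "feed"]:
    print(f"{word:<12} -> {stemmer.stem(word)}")
```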
Stemming: challenges
• May produce unusual stems that are not English words:
  – e.g. removing ‘UAL’ from FACTUAL and EQUAL gives FACT and EQ
49
5. Thesaurus Construction
• A thesaurus demonstrates inter-term relationships. It is like a book that lists words in groups of synonyms and related concepts.
• Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example "broader" and "related") are made explicit
51
Thesaurus Construction
Example: a thesaurus built to assist IR when searching for cars and vehicles:
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
52
More Example
Example: a thesaurus built to assist IR in the field of Information Systems:
TERM: natural languages
– UF natural language processing (UF = used for)
– BT languages (BT = broader term)
– TT languages (TT = top term)
– RT artificial intelligence (RT = related term/s)
     computational linguistics
     formal languages
     query languages
     speech recognition
53
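One simple way to represent such an entry and use it for query expansion, as a sketch (the relationship codes follow the slides; the data structure itself is an assumption):

```python
# Thesaurus entries keyed by preferred term; relationship codes as on the slides.
THESAURUS = {
    "motor vehicles": {
        "UF": ["automobiles", "cars", "trucks"],       # used for (non-preferred synonyms)
        "BT": ["vehicles"],                            # broader term
        "RT": ["road engineering", "road transport"],  # related terms
    },
}

def expand_query(term):
    """Add UF and RT entries to a query term when the term appears in the thesaurus."""
    entry = THESAURUS.get(term.lower(), {})
    return [term.lower()] + entry.get("UF", []) + entry.get("RT", [])

print(expand_query("Motor vehicles"))
# ['motor vehicles', 'automobiles', 'cars', 'trucks', 'road engineering', 'road transport']
```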
Language-specificity
• Many of the above features embody transformations
that are:
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing
process
• Both open source and commercial plug-ins are
available for handling these.
54
Index Term Selection
• The index language is the language used to describe documents and requests
• The elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.
  – If a full-text representation of the text is adopted, then all words in the text are used as index terms (= full-text indexing)
  – Otherwise, we need to select the words to be used as index terms, reducing the size of the index file, which is basic to designing an efficient IR search system
55
• The end
56