IR 02 02 Tokens

This document introduces tokenization in information retrieval: how text is broken into tokens, the basic units considered for indexing. It covers open questions such as how to treat apostrophes, hyphens, multiword names, and numbers, and language-specific challenges in French, German, Chinese, Japanese, and Arabic.

Uploaded by

Afeefa Noorain

Introduction to Information Retrieval

Tokens
Introduction to Information Retrieval Sec. 2.2.1

Tokenization
• Input: “Friends, Romans and Countrymen”
• Output: Tokens
  • Friends
  • Romans
  • Countrymen
• A token is an instance of a sequence of characters
• Each such token is now a candidate for an index entry, after further processing
  • Described below
• But what are valid tokens to emit?
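
As a minimal sketch of the step above, assuming the simplest possible rule (a token is a maximal run of word characters), tokenization can be written in a few lines of Python. The function name `tokenize` is illustrative and not taken from the slides.

```python
import re

def tokenize(text):
    """Return maximal runs of word characters (a deliberately naive rule)."""
    return re.findall(r"\w+", text)

tokens = tokenize("Friends, Romans and Countrymen")
print(tokens)  # ['Friends', 'Romans', 'and', 'Countrymen']
```

Note that this naive rule also emits `and`, which the slide's output list leaves out; removing such words would be a separate processing step that this sketch does not attempt.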

Tokenization
• Issues in tokenization:
  • Finland’s capital → Finland AND s? Finlands? Finland’s?
  • Hewlett-Packard → Hewlett and Packard as two tokens?
    • state-of-the-art: break up hyphenated sequence.
    • co-education
    • lowercase, lower-case, lower case?
    • It can be effective to get the user to put in possible hyphens
  • San Francisco: one token or two?
    • How do you decide it is one token?
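
The apostrophe and hyphen questions above can be made concrete by running the same strings through three alternative policies. This is only a sketch: the three function names are hypothetical, and none of the policies is presented as the right answer.

```python
import re

def split_on_punct(text):
    """Policy A: break at every non-alphanumeric character."""
    return re.findall(r"[A-Za-z0-9]+", text)

def keep_internal_punct(text):
    """Policy B: keep apostrophes and hyphens that sit between alphanumerics."""
    return re.findall(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*", text)

def drop_internal_punct(text):
    """Policy C: tokenize as in B, then delete the apostrophes and hyphens."""
    return [re.sub(r"['’-]", "", t) for t in keep_internal_punct(text)]

for sample in ["Finland’s capital", "Hewlett-Packard", "state-of-the-art"]:
    print(sample)
    print("  A:", split_on_punct(sample))
    print("  B:", keep_internal_punct(sample))
    print("  C:", drop_internal_punct(sample))
```

For `Finland’s capital` the three policies give `Finland`, `s`, `capital`; `Finland’s`, `capital`; and `Finlands`, `capital`. Whichever policy is chosen has to be applied to queries as well as documents, otherwise the terms will not match.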

Numbers
 3/20/91 Mar. 12, 1991 20/3/91
 55 B.C.
 B-52
 My PGP key is 324a3df234cb23e
 (800) 234-2333
 Often have embedded spaces
 Older IR systems may not index numbers
 But often very useful: think about things like looking up error
codes/stacktraces on the web
 (One answer is using n-grams: IIR ch. 3)
 Will often index “meta-data” separately
 Creation date, format, etc.
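
One way to keep number-like strings such as the examples above together is to try a few specific patterns before falling back to a generic one. The sketch below assumes three hand-written patterns; they are illustrative only and cover just some of the formats on the slide (for instance, not `Mar. 12, 1991` or `55 B.C.`).

```python
import re

# Hypothetical patterns, tried left to right; the first alternative that matches wins.
TOKEN_RE = re.compile(r"""
    \(\d{3}\)\s*\d{3}-\d{4}           # phone number such as (800) 234-2333
  | \d{1,2}/\d{1,2}/\d{2,4}           # slash-separated date such as 3/20/91
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*    # alphanumeric run, optionally hyphenated (B-52)
""", re.VERBOSE)

def tokenize(text):
    """Return tokens, keeping phone numbers, slash dates, and hyphenated codes whole."""
    return TOKEN_RE.findall(text)

print(tokenize("Call (800) 234-2333 about the B-52 on 3/20/91"))
# ['Call', '(800) 234-2333', 'about', 'the', 'B-52', 'on', '3/20/91']
print(tokenize("My PGP key is 324a3df234cb23e"))
# ['My', 'PGP', 'key', 'is', '324a3df234cb23e']
```

Keeping `3/20/91` as one token does not settle what it means: the same pattern also matches `20/3/91`, and nothing here decides which field is the month.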

Tokenization: language issues


• French
  • L'ensemble → one token or two?
    • L? L’? Le?
    • Want l’ensemble to match with un ensemble
      • Until at least 2003, it didn’t on Google
      • Internationalization!
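
A minimal sketch of one way to make l’ensemble match un ensemble: strip an elided article or pronoun from the front of each token before indexing. The list of elided forms in `ELISION_RE` is an assumption and is not complete, and the helper names are hypothetical.

```python
import re

# Assumed, incomplete list of French elided forms (l', d', j', m', t', s', n', c', qu').
ELISION_RE = re.compile(r"^(?:l|d|j|m|t|s|n|c|qu)['’]", re.IGNORECASE)

def strip_elision(token):
    """Remove a leading elided form such as l' or qu' from a single token."""
    return ELISION_RE.sub("", token)

def index_terms(text):
    """Whitespace-split, strip elisions, and lowercase."""
    return [strip_elision(t).lower() for t in text.split()]

print(index_terms("L'ensemble"))   # ['ensemble']
print(index_terms("un ensemble"))  # ['un', 'ensemble']
```

Both inputs now share the term `ensemble`. Words with an internal apostrophe that is not an elision, such as aujourd'hui, are left alone because the pattern is anchored at the start of the token.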

• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • ‘life insurance company employee’
  • German retrieval systems benefit greatly from a compound splitter module
    • Can give a 15% performance boost for German
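
To illustrate what a compound splitter does, here is a small dictionary-based sketch. It assumes a toy lexicon of four words and a single linking element ("s"); real splitters need a large lexicon and more linking rules, and the function name `split_compound` is hypothetical.

```python
def split_compound(word, lexicon, linkers=("", "s")):
    """Return lexicon parts that cover `word`, or None if no split is found."""
    word = word.lower()
    if not word:
        return []
    # Try longer heads first, backtracking if the remainder cannot be split.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        for link in linkers:
            rest = word[end:]
            if not rest.startswith(link):
                continue
            rest = rest[len(link):]
            if link and not rest:
                continue  # a linking element must be followed by another part
            tail = split_compound(rest, lexicon, linkers)
            if tail is not None:
                return [head] + tail
    return None

# Toy lexicon, assumed for this example only.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}

parts = split_compound("Lebensversicherungsgesellschaftsangestellter", LEXICON)
print(parts)  # ['leben', 'versicherung', 'gesellschaft', 'angestellter']
```

Indexing the parts alongside the full compound lets a query for Versicherung retrieve documents that only contain the long compound.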

Tokenization: language issues


• Chinese and Japanese have no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Scripts labeled in the example: Katakana, Hiragana, Kanji, Romaji

End-user can express query entirely in hiragana!
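
A common baseline for segmenting text without spaces is greedy forward maximum matching against a word list. The sketch below applies it to the Chinese sentence above; the lexicon is a toy one assumed for this example, and the function name `max_match` is hypothetical.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching: take the longest lexicon word at each position."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy lexicon, assumed for this example only.
LEXICON = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}

segments = max_match("莎拉波娃现在居住在美国东南部的佛罗里达。", LEXICON)
print(segments)
# ['莎拉波娃', '现在', '居住', '在', '美国', '东南部', '的', '佛罗里达', '。']
```

Because the method commits to the longest match at each position, it returns exactly one segmentation even when several are possible, which is where the non-uniqueness noted above shows up as errors.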



Tokenization: language issues


• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures

[Figure: Arabic example sentence with reading-direction arrows ← → ← → ← and a "start" marker]
• ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
• With Unicode, the surface presentation is complex, but the