IR 02 02 Tokens

This document introduces tokenization in information retrieval: how text is broken into tokens, the basic units considered for indexing. It raises common tokenization questions, such as how to handle apostrophes, hyphens, multiword names, and numbers. It then covers language-specific challenges in French, German, Chinese, Japanese, and Arabic, where word boundaries or writing direction work differently than in English.

Uploaded by

Afeefa Noorain

Introduction to Information Retrieval

Tokens
Introduction to Information Retrieval Sec. 2.2.1

Tokenization
• Input: “Friends, Romans and Countrymen”
• Output: Tokens
  • Friends
  • Romans
  • Countrymen
• A token is an instance of a sequence of characters
• Each such token is now a candidate for an index entry, after further processing
  • Described below
• But what are valid tokens to emit?
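
The input/output example above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular system: the letter-only regular expression and the one-word `STOP_WORDS` set are assumptions introduced here. The slide's output list does not show "and", so the second step drops it to reproduce that list.

```python
import re

# Hypothetical stop list, assumed only so the output matches the slide.
STOP_WORDS = {"and"}

def tokenize(text):
    """Emit every maximal run of ASCII letters as one token."""
    return re.findall(r"[A-Za-z]+", text)

def candidate_terms(tokens, stop_words=STOP_WORDS):
    """Further processing step: drop tokens found in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = tokenize("Friends, Romans and Countrymen")
print(tokens)                   # every token instance, including "and"
print(candidate_terms(tokens))  # candidates for index entries
```

The split between `tokenize` and `candidate_terms` mirrors the slide's point that a token only becomes an index entry after further processing.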

Tokenization
• Issues in tokenization:
  • Finland’s capital → Finland AND s? Finlands? Finland’s?
  • Hewlett-Packard → Hewlett and Packard as two tokens?
    • state-of-the-art: break up hyphenated sequence.
    • co-education
    • lowercase, lower-case, lower case ?
    • It can be effective to get the user to put in possible hyphens
  • San Francisco: one token or two?
    • How do you decide it is one token?
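
The apostrophe and hyphen choices listed above can be made concrete with a small configurable tokenizer. This is a sketch under stated assumptions: the function name `tokenize` and its two flags are hypothetical, and real systems typically apply richer, language-specific rules.

```python
def tokenize(text, split_hyphens=True, keep_apostrophes=False):
    """Whitespace tokenizer with two switchable policies.

    split_hyphens:    "Hewlett-Packard" -> "hewlett", "packard"
    keep_apostrophes: "Finland's" stays "finland's" instead of "finlands"
    """
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"()")
        if not keep_apostrophes:
            # Remove both the ASCII and the typographic apostrophe.
            word = word.replace("'", "").replace("\u2019", "")
        parts = word.split("-") if split_hyphens else [word]
        tokens.extend(p.lower() for p in parts if p)
    return tokens

print(tokenize("Finland's capital"))
print(tokenize("Finland's capital", keep_apostrophes=True))
print(tokenize("Hewlett-Packard is state-of-the-art"))
print(tokenize("Hewlett-Packard is state-of-the-art", split_hyphens=False))
```

Whatever policy is chosen, queries must be tokenized with the same settings as the documents, otherwise a query for "Hewlett-Packard" may never meet the indexed tokens. Deciding that "San Francisco" is one token needs more than character rules, for example a phrase list, which this sketch does not attempt.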

Numbers
 3/20/91 Mar. 12, 1991 20/3/91
 55 B.C.
 B-52
 My PGP key is 324a3df234cb23e
 (800) 234-2333
 Often have embedded spaces
 Older IR systems may not index numbers
 But often very useful: think about things like looking up error
codes/stacktraces on the web
 (One answer is using n-grams: IIR ch. 3)
 Will often index “meta-data” separately
 Creation date, format, etc.
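
One way to keep number-like sequences with embedded spaces or punctuation together is to try specific patterns before falling back to plain word characters. The patterns below are illustrative assumptions that cover only the example formats on this slide; they are not a general date or phone-number grammar.

```python
import re

# Alternatives are tried in order, so the specific patterns win over \w+.
NUMBER_AWARE = re.compile(
    r"""
      \(\d{3}\)\s*\d{3}-\d{4}          # phone number with an area code in parentheses
    | \d{1,2}/\d{1,2}/\d{2,4}          # numeric date with slashes
    | [A-Za-z]+\.?\s\d{1,2},\s\d{4}    # month name, day, comma, four-digit year
    | [A-Za-z]-\d+                     # letter-hyphen-number designation
    | \w+                              # fallback: any run of word characters
    """,
    re.VERBOSE,
)

def tokenize_numbers(text):
    """Return tokens, keeping the number-like patterns above in one piece."""
    return NUMBER_AWARE.findall(text)

print(tokenize_numbers("Call (800) 234-2333 about the B-52 on 3/20/91"))
print(tokenize_numbers("Filed Mar. 12, 1991"))
```

Note that recognizing 3/20/91 and 20/3/91 as tokens does not resolve which field is the month; that ambiguity has to be handled by later normalization.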

Tokenization: language issues

• French
  • L'ensemble → one token or two?
    • L ? L’ ? Le ?
    • Want l’ensemble to match with un ensemble
      • Until at least 2003, it didn’t on Google
        • Internationalization!
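
A minimal sketch of one possible answer to the French question above: split an elided clitic from the word that follows it, so that l’ensemble and un ensemble share the token "ensemble". The clitic list in `ELISION` and the function name `tokenize_french` are assumptions made for illustration, not a complete treatment of French elision.

```python
import re

# Assumed list of elided clitics; matches l', d', j', qu', etc. at word start.
ELISION = re.compile(r"^(l|d|j|m|n|s|t|c|qu)['\u2019](?=\w)", re.IGNORECASE)

def tokenize_french(text):
    """Lowercase, split on whitespace, and detach a leading elided clitic."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"()")
        if not word:
            continue
        match = ELISION.match(word)
        if match:
            tokens.append(match.group(1) + "'")  # normalized clitic, e.g. l'
            tokens.append(word[match.end():])    # the content word
        else:
            tokens.append(word)
    return tokens

doc = tokenize_french("L\u2019ensemble")
query = tokenize_french("un ensemble")
print(doc, query)
print(set(doc) & set(query))  # the shared token that lets the two match
```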

• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • ‘life insurance company employee’
  • German retrieval systems benefit greatly from a compound splitter module
    • Can give a 15% performance boost for German
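
To illustrate what a compound splitter module does, here is a small dictionary-based sketch. It assumes a lexicon of known word parts and the linking element "s" between them; both the tiny `LEXICON` and the `split_compound` function are hypothetical, and production splitters use much larger vocabularies and usually corpus statistics.

```python
# Hypothetical mini-lexicon covering only the example compound.
LEXICON = {"leben", "versicherung", "gesellschaft", "angestellter"}

def split_compound(word, lexicon=LEXICON, linkers=("", "s")):
    """Split a compound into known parts, or return None if that fails.

    Tries the longest known prefix first, then recursively splits the rest,
    optionally skipping a linking element between the parts.
    """
    word = word.lower()
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        if not rest:
            return [head]
        for link in linkers:
            if rest.startswith(link) and len(rest) > len(link):
                tail = split_compound(rest[len(link):], lexicon, linkers)
                if tail is not None:
                    return [head] + tail
    return None

print(split_compound("Lebensversicherungsgesellschaftsangestellter"))
```

Indexing the parts alongside the full compound lets a query for one component retrieve documents that contain only the long compound.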

Tokenization: language issues

• Chinese and Japanese have no spaces between words:
  • 莎拉波娃现在居住在美国东南部的佛罗里达。
  • Not always guaranteed a unique tokenization
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats

フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )

[Figure labels: the example mixes Katakana, Hiragana, Kanji, and Romaji]

End-user can express query entirely in hiragana!
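
One common baseline for segmenting text written without spaces is greedy forward maximum matching against a word list. The sketch below applies it to the Chinese example sentence; the `LEXICON` is a hypothetical word list chosen to cover that sentence, and the greedy strategy is only one option, which is exactly why a unique tokenization is not guaranteed.

```python
# Hypothetical word list covering the example sentence.
LEXICON = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}

def max_match(text, lexicon=LEXICON):
    """Greedy segmentation: at each position take the longest lexicon word.

    Characters that start no known word are emitted as single tokens.
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

print(max_match("莎拉波娃现在居住在美国东南部的佛罗里达。"))
```

A different lexicon, or matching from the end of the sentence instead of the start, can produce a different segmentation of the same string.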


Tokenization: language issues

• Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
• Words are separated, but letter forms within a word form complex ligatures

• [Figure: Arabic example sentence; arrows ← → ← → ← mark the reading direction of each segment, beginning at "start"]
• ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
• With Unicode, the surface presentation is complex, but the
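
The mixed directionality described on this slide can be inspected with Python's standard `unicodedata` module, which reports each character's Unicode bidirectional class. The sketch below groups a string into direction runs. It is a deliberate simplification and not the Unicode Bidirectional Algorithm: the `direction_runs` helper is hypothetical, and neutral characters such as spaces are simply attached to the preceding run.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}        # Hebrew letters, Arabic letters
LTR_CLASSES = {"L", "EN", "AN"}  # Latin letters, European and Arabic-Indic digits

def direction_runs(text):
    """Group characters into (direction, substring) runs in stored order."""
    runs = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in RTL_CLASSES:
            direction = "RTL"
        elif cls in LTR_CLASSES:
            direction = "LTR"
        else:
            # Neutral character: keep it with the current run, if any.
            direction = runs[-1][0] if runs else "neutral"
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(direction, chars) for direction, chars in runs]

# The Arabic word for Algeria followed by a year, written with escapes.
sample = "\u0627\u0644\u062c\u0632\u0627\u0626\u0631 1962"
for direction, chars in direction_runs(sample):
    print(direction, repr(chars))
```

A tokenizer works on the characters in the order they are stored in the string, so splitting on whitespace here yields the Arabic word and the number as two tokens regardless of how a renderer lays them out.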