Lec 4

Reusable test collections are used for information retrieval evaluation. They consist of a document collection, topics representing information needs, and relevance judgments indicating which documents are relevant to each topic. Such test collections allow researchers to develop and evaluate new retrieval methods in a standardized way.

Reusable Test Collections

Reusable test collections (for evaluation purposes) consist of:

1. Document Collection
2. Topics (sample of information needs)
3. Relevance judgments (qrels)
How can we get it?
 For web search, companies run their own studies to assess the performance of their search engines.
 Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for selected users’ queries
 In academia (or lab settings), test collections are obtained in two ways:
● someone goes out and builds them (expensive), or
● they arise as a byproduct of large-scale evaluations (a collaborative effort).
 IR evaluation campaigns were created for this reason.
IR Evaluation Campaigns
 IR test collections are provided for scientific communities to develop
better IR methods.
 Collections and queries are provided, relevance judgements are built
during the campaign.
 TREC = Text REtrieval Conference (http://trec.nist.gov/)
● Main IR evaluation campaign, sponsored by NIST (US gov).
● Series of annual evaluations, started in 1992.
 Other evaluation campaigns
● CLEF: European version (since 2000)
● NTCIR: Asian version (since 1999)
● FIRE: Indian version (since 2008)
TREC Tracks and Tasks
 TREC (and other campaigns) is organized as a set of tracks; each track covers one or more search tasks.
● Each track/task is about searching a set of documents of a given genre and domain.
 Examples
● TREC Web track
● TREC Medical track
● TREC Legal track
● CLEF-IP track
● NTCIR patent mining track
● TREC Microblog track
• Adhoc search task
• Filtering task
1. TREC Collection
A set of hundreds of thousands or millions of documents
● 1 billion documents in the case of web search (TREC ClueWeb09)
 The typical format of a document:
<DOC>
<DOCNO> 1234 </DOCNO>
<TEXT>
This is the document.
Multilines of plain text.
</TEXT>
</DOC>
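As an illustration only (not part of the original slides, and not the official TREC tooling), a minimal Python sketch that extracts DOCNO/TEXT pairs from a collection in this format might look like the following; the helper name parse_trec_docs is hypothetical.

import re

# Illustrative sketch: extract (DOCNO, TEXT) pairs from a string in the
# <DOC>/<DOCNO>/<TEXT> format shown above.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>\s*(.*?)\s*</DOCNO>.*?<TEXT>\s*(.*?)\s*</TEXT>\s*</DOC>",
    re.DOTALL,
)

def parse_trec_docs(raw):
    """Yield (docno, text) for every document in the raw collection string."""
    for match in DOC_RE.finditer(raw):
        yield match.group(1), match.group(2)

sample = "<DOC>\n<DOCNO> 1234 </DOCNO>\n<TEXT>\nThis is the document.\n</TEXT>\n</DOC>"
print(list(parse_trec_docs(sample)))   # [('1234', 'This is the document.')]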
2. TREC Topic
 Topic: a statement of information need
 Multiple topics (~50) developed (mostly) at NIST for a collection.
 Developed by experts and associated with additional details.
● Title: the query text
● Description: description of what is meant by the query.
● Narrative: what should be considered relevant.
<num>189</num>
<title>Health and Computer Terminals</title>
<desc>Is it hazardous to the health of individuals to work with computer terminals on a
daily basis?</desc>
<narr>Relevant documents would contain any information that expands on any
physical disorder/problems that may be associated with the daily working with
computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said
to be associated, but how widespread are these or other problems and what is being
done to alleviate any health problems</narr>
3. Relevance Judgments
 For each topic, the set of relevant documents needs to be known for an effective evaluation:
a) Exhaustive assessment is usually impractical
● TREC usually has 50 topics
● Collection usually has >1 million documents
b) Random sampling won’t work
● If relevant docs are rare, none may be found!
c) IR systems can help focus the sample (Pooling)
● Each system finds some relevant documents
● Different systems find different relevant documents
● Together, enough systems will find most of them
Pooled Assessment Methodology
1. Systems submit top 1000 documents per topic (Ranked)
2. Top 100 documents from each are manually judged
• Single pool, duplicates removed, arbitrary order
• Judged by the person who developed the topic
3. Treat unevaluated documents as not relevant
4. Compute MAP (or others) down to 1000 documents

 To make pooling work:


● Good number of participating systems
● Systems must do reasonably well
● Systems must be different (not all “do the same thing”)
Example
In one of the TREC tracks, 3 teams (T1, T2, and T3) participated and were asked to retrieve up to 15 documents per query. In reality (with exhaustive judgments), a query Q has 9 relevant documents in the collection: A, B, C, D, E, F, G, H, and I.
The submitted ranked lists are as follows:

Rank  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
T1    A  M  Y  R  K  L  B  Z  E  N  D  C  W
T2    Y  A  J  R  N  Z  M  C  G  B  X  P  D  K  W
T3    G  B  Y  K  E  A  Z  L  N  C  H  K  W  X

We constructed the judging pool for Q using only the top 5 documents of each of the submitted ranked lists. What is the Average Precision of each of the 3 teams?

Solution
Given: the pool is the top 5 documents of each run, and judging is done only on the pooled documents.
Relevant documents w.r.t. the pool: A, G, B, E (the other relevant documents never enter the pool, so they are treated as not relevant).
AP of team 1 = (1/1 + 2/7 + 3/9) / 4
We can easily repeat the same methodology for the other teams.
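As an illustration (not part of the slides), a small Python sketch that builds the depth-5 pool and computes AP for each run, normalizing by the number of pooled relevant documents as in the solution above:

# Minimal sketch, assuming pooling depth 5 and AP normalized by the number
# of pooled relevant documents (as in the slide's solution).
runs = {
    "T1": list("AMYRKLBZENDCW"),
    "T2": list("YAJRNZMCGBXPDKW"),
    "T3": list("GBYKEAZLNCHKWX"),
}
true_relevant = set("ABCDEFGHI")   # exhaustive judgments (unknown in practice)

# Depth-5 pool: union of the top 5 documents of every run.
pool = {doc for ranking in runs.values() for doc in ranking[:5]}
qrels = pool & true_relevant        # judged relevant: {'A', 'B', 'E', 'G'}

def average_precision(ranking, qrels):
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in qrels:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(qrels) if qrels else 0.0

for team, ranking in runs.items():
    print(team, round(average_precision(ranking, qrels), 3))
# T1: (1/1 + 2/7 + 3/9) / 4 ≈ 0.405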
Ranked Retrieval (vector space model & BM25)
Ranked Retrieval
 Typical queries: free-text queries (no Boolean operators)
 Results are “ranked” with respect to a query
 Large result sets are not an issue
● We just show the top k (≈ 10) results
● We don’t overwhelm the user
 Top-ranked documents are the most likely to satisfy the user’s query.
 Assign a score, say in [0, 1], to each document.
 Score(d, q) measures how well doc d matches query q.
Scoring Example: Jaccard coefficient
 A commonly-used measure of the overlap of two sets A and B:

   jaccard(A, B) = |A ∩ B| / |A ∪ B|

 In our context, the two sets are the terms of the document (after preprocessing) and the terms of the query.
 Note that:
 jaccard(A, A) = 1
 jaccard(A, B) = 0 if A ∩ B = ∅
 A and B don’t have to be of the same size.
 Jaccard always assigns a number between 0 and 1.
Example
For the query "the arab world" and the document "fifa world cup in arab country", what is the Jaccard similarity (after removing stop words)?

Solution: after removing the stop words (the, in):
query = {arab, world}, document = {fifa, world, cup, arab, country}
intersection = {arab, world} (2 terms), union = {fifa, world, cup, arab, country} (5 terms)
Jaccard = 2/5
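As a quick sketch (not from the slides; the stop-word list is illustrative), the same computation in Python:

# Minimal Jaccard scoring over token sets.
STOP_WORDS = {"the", "in", "is", "a", "of"}

def tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

q = tokens("the arab world")
d = tokens("fifa world cup in arab country")
print(jaccard(q, d))   # 2/5 = 0.4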
Issues and Problems with Jaccard for Scoring
1) Term frequency
 It doesn’t consider term frequency (how many times a term occurs in a document).
2) Term importance
 It treats all terms equally!
● What about rare terms in a collection? They are more informative than frequent terms.
3) Length
 It needs a more sophisticated way of normalizing for document length.
TF-IDF ranking
Term-Document Count Matrix
 Each document is represented as a count (frequency) vector:

term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0

 Each value in the matrix represents a term frequency (tf).
 The bag-of-words model doesn’t consider the ordering of words in a document:
● “John is quicker than Mary” and “Mary is quicker than John” have the same vectors (information lost by the bag-of-words assumption).
1. Frequent Terms in a Document
Term Frequency
 tft,d: the number of times that term t occurs in doc d.
 We want to use tf when computing query-document match
scores. But how?

 Raw term frequency?


● A document with 10 occurrences of the term is more relevant than a
document with 1 occurrence of the term.
● But not 10 times more relevant.
 Relevance does not increase linearly with tf.
Log-Frequency Weighting
 [Plot: the weight of a term in a document as a function of raw tf (sublinear growth).]
 The log-frequency weight of term t in d (the term-weighting function) is:

   w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
   w_{t,d} = 0                     otherwise

 Examples: tf 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
 Score for a document-query pair: the sum of the weights over terms t that appear in both q and d:

   Score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10(tf_{t,d}))

 The score is 0 if none of the query terms is present in the document.
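For example (a sketch, not from the slides, with toy counts), the weighting and the overlap score can be written as:

from math import log10

# Minimal sketch of log-frequency weighting and the overlap score above.
def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_term_counts):
    """Sum the log-tf weights of the query terms that occur in the document."""
    return sum(log_tf_weight(doc_term_counts.get(t, 0)) for t in query_terms)

doc = {"brutus": 4, "caesar": 232, "mercy": 2}   # toy tf counts
print(score(["brutus", "caesar", "calpurnia"], doc))  # 1.602 + 3.366 + 0 ≈ 4.97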
2. Informative Terms in a Collection
 Rare terms are more informative than frequent terms
● Recall stop words
 We want a high weight for rare terms.
 Collection Frequency cf_t:
● the number of occurrences of term t in the collection
 Document Frequency df_t:
● the number of documents that contain t
● an inverse measure of the informativeness of t
● df_t ≤ N
Inverse Document Frequency (idf)
 The idf (inverse document frequency) of term t:

   idf_t = log10(N / df_t)

● N is the number of documents in the collection
● We use log10(N/df_t) instead of N/df_t to “dampen” the effect of idf.
● The higher the idf, the more informative the term.

 Suppose N = 1 million:

   term        df_t        idf_t
   calpurnia   1           6
   animal      100         4
   sunday      1,000       3
   fly         10,000      2
   under       100,000     1
   the         1,000,000   0

 “calpurnia” is the most informative term (its idf is the largest); if a term such as “the” is in all documents, its idf is 0.
Summary
 idf: a property of a term relative to the collection
 tf: a property of a term relative to a document

tf.idf Term Weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight:

   w_{t,d} = (1 + log10(tf_{t,d})) × log10(N / df_t)

 It is one of the best-known weighting schemes in IR:
● it increases with the number of occurrences within a document,
● it increases with the rarity of the term in the collection.
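As an illustration (not from the slides), a small Python sketch that combines the two weights on a toy corpus:

from math import log10
from collections import Counter

# Minimal tf-idf sketch: log-frequency tf combined with log10(N/df) idf.
docs = {
    "d1": "antony and cleopatra met caesar",
    "d2": "brutus killed caesar",
    "d3": "calpurnia warned caesar",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)

def tf_idf(term, doc_id):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    count = tf[doc_id][term]
    if count == 0 or df[term] == 0:
        return 0.0
    return (1 + log10(count)) * log10(N / df[term])

print(tf_idf("caesar", "d1"))     # 0.0: "caesar" occurs in every doc (idf = 0)
print(tf_idf("calpurnia", "d3"))  # rare term, higher weight (≈ 0.48)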
Solve this Example (each student should do a detailed solution)
 Collection of 5 documents, D1–D5, where each document is a set of colored balls (balls = terms). [Figure of the 5 documents omitted.]
 Query: "the destructive storm"
 Which is the most relevant document? Which is the least relevant document?
 Start your solution with:
• df(yellow) = 5
• df(red) = 3
• df(green) = 3
 What next?
Vector Space Model
Binary → Count → Weight Matrix
 Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V| (the dictionary = the vocabulary V):

term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      5.25                   3.18            0             0        0         0.35
Brutus      1.21                   6.1             0             1        0         0
Caesar      8.59                   2.54            0             1.51     0.25      0
Calpurnia   0                      1.54            0             0        0         0
Cleopatra   2.85                   0               0             0        0         0
mercy       1.51                   0               1.9           0.12     5.25      0.88
worser      1.37                   0               0.11          4.15     0.25      1.95

Documents as Vectors
 As an illustration (see the figure in the original slides), with only 2 terms the space has 2 dimensions and each document is a point in the plane.
 In general, the dimension of the space = the number of distinct terms in the whole collection:
● a |V|-dimensional vector space, where |V| is the vocabulary size
● terms are the axes of the space
● documents are points or vectors in this space
 Note that:
 This is very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
 These are very sparse vectors: most entries are zero.
 This leads to the Vector Space Model.

Queries as Vectors
 Key idea 1: do the same for queries, i.e., represent them as vectors in the space.
   (We treat the query as if it were a small document and represent it as a vector in the same space.)
 Key idea 2: rank documents according to their proximity to the query in this space.
   (Using linear algebra, we can measure the distance between the query vector and a document vector to see how close they are.)
 proximity = similarity of vectors

Is Euclidean Distance Suitable Here?
 Euclidean distance is the distance between the end points of the two vectors.
 It is large for vectors of different lengths.
 Thought experiment: take a document d and append it to itself; call this document d′.
● “Semantically”, d and d′ have the same content.
● Yet their Euclidean distance can be quite large.

Angle Instead of Distance
 An angle of 0 between the two vectors corresponds to maximal similarity.
   (The smaller the angle between q and d, the more relevant the document, because the similarity is higher; the function that captures this inverse relationship is the cosine.)
 Key idea: rank documents according to their angle with the query.
● Rank documents in increasing order of the angle with the query,
● equivalently, in decreasing order of cosine(query, document).
 Cosine is a monotonically decreasing function on the interval [0°, 180°].

3. Length Normalization
 A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

   ||x||_2 = sqrt( Σ_i x_i² )

 Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
 Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
● Long and short documents now have comparable weights.
Cosine “Similarity” (Query, Document)

   cos(q, d) = (q · d) / (‖q‖ ‖d‖) = Σ_i q_i d_i / ( sqrt(Σ_i q_i²) · sqrt(Σ_i d_i²) )

 q_i is the [tf-idf] weight of term i in the query.
 d_i is the [tf-idf] weight of term i in the document.
 For length-normalized vectors, the cosine is simply the dot product: cos(q, d) = q · d = Σ_i q_i d_i.

Computing Cosine Scores
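The computation can be sketched in Python as follows (an illustrative sketch with toy vectors; real systems score sparse vectors via an inverted index rather than dense lists):

from math import sqrt

# Minimal cosine-scoring sketch over dense term-weight vectors.
def l2_normalize(vec):
    norm = sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else vec

def cosine(q, d):
    """Cosine similarity = dot product of the L2-normalized vectors."""
    qn, dn = l2_normalize(q), l2_normalize(d)
    return sum(qi * di for qi, di in zip(qn, dn))

# Toy tf-idf vectors over a 3-term vocabulary.
query = [1.0, 0.0, 2.0]
docs = {"d1": [0.5, 0.0, 1.0], "d2": [0.0, 3.0, 0.0]}

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)   # ['d1', 'd2'] — d1 points in the same direction as the query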

Variants of tf-idf Weighting

 Many search engines allow for different weightings for queries vs. documents.
 SMART notation: a weighting combination is written as ddd.qqq, using the acronyms from the SMART table (not reproduced here).
 A very standard weighting scheme is lnc.ltc.
BM25 ranking
Okapi BM25 Ranking Function
 [Plots: the Okapi tf component as a function of raw tf, for L_d/L = 0.5, 1.0, and 2.0, and the Okapi idf component vs. the classic idf as a function of raw df.]
 L_d: length of d
 L: average document length in the collection
 K_1, b: tuning parameters
 A common form of the BM25 scoring function using these quantities is:

   score(d, q) = Σ_{t ∈ q} idf_t · ( tf_{t,d} · (K_1 + 1) ) / ( tf_{t,d} + K_1 · (1 − b + b · L_d / L) )

 Typical parameter values: K_1 = 2, b = 0.75.
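A small Python sketch (not from the slides) of this common BM25 form, with illustrative toy data; the log10(N/df) idf is a simplification, since TREC-style BM25 often uses a smoothed idf variant:

from math import log10

# Minimal BM25 sketch using the common form above.
def bm25_score(query_terms, doc_tf, doc_len, avg_len, N, df, k1=2.0, b=0.75):
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or df.get(t, 0) == 0:
            continue
        idf = log10(N / df[t])
        norm = k1 * (1 - b + b * doc_len / avg_len)
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score

# Toy usage: the same term counts in a short and a long document.
df = {"storm": 10}
N, avg_len = 1000, 100
short = bm25_score(["storm"], {"storm": 3}, doc_len=50, avg_len=avg_len, N=N, df=df)
long_ = bm25_score(["storm"], {"storm": 3}, doc_len=300, avg_len=avg_len, N=N, df=df)
print(short > long_)   # True: length normalization rewards the shorter document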
Summary – Vector Space Ranking
 Represent the query as a term-weighted (e.g., tf-idf) vector.
 Represent each document as a term-weighted (e.g., tf-idf) vector.
 Compute the cosine similarity score for the query vector and
each document vector.
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user.

