COURSEWORK1 Details
IMPORTANT DATES
ASSIGNMENT OBJECTIVES
Pre-processes text
- Tokenisation
- Stopword removal using this list of words. You MUST use this list.
- Porter stemming. You can use packages for this part, such as Snowball or
NLTK.
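The three pre-processing steps above can be sketched as a small pipeline. This is only an illustration, not the required implementation: the stop-word file name is hypothetical, and the `stem` argument is a stand-in for a real Porter stemmer such as NLTK's `PorterStemmer().stem`.

```python
import re

def load_stopwords(path):
    # Load the provided stop-word list, one word per line.
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def preprocess(text, stopwords, stem=lambda t: t):
    # Tokenise by splitting on every non-letter character (the simplest option
    # the handout allows), lowercase, drop stop words, then stem.
    # In the real system, `stem` would be e.g. nltk.stem.PorterStemmer().stem.
    tokens = re.split(r"[^a-zA-Z]+", text.lower())
    return [stem(t) for t in tokens if t and t not in stopwords]
```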
Creates a positional inverted index
Your index can have whatever structure you like, and can be stored in any format
you like, but you will need to output it to a text file using the format specified
below.
Uses your positional inverted index to perform:
- Boolean search
- Phrase search
- Proximity search
- Ranked IR based on TFIDF
ADDITIONAL DETAILS
Use the collections from Lab2 for testing your system. You can download from here.
Focus on the trec collection which contains 1000 sample news articles.
Note 1: use the file in XML format. Your code needs to be able to parse it. It is
worth noting that the file follows the standard TREC format, which might not be
parsed directly by existing XML parsers. You might need to add a header and footer
to the file to make it parsable by existing tools. You are allowed to do so if
needed (or feel free to write your own parser).
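One way to handle the missing header and footer is to wrap the raw file in a dummy root element before handing it to a standard parser. The sketch below assumes flat DOC/DOCNO/HEADLINE/TEXT elements; check the actual tag names and nesting in trec.sample.xml before relying on it.

```python
import xml.etree.ElementTree as ET

def parse_trec(path):
    # Wrap the raw TREC file in a dummy root element so a standard XML
    # parser accepts it (TREC files have no single root of their own).
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    root = ET.fromstring("<root>" + raw + "</root>")
    docs = {}
    for doc in root.iter("DOC"):
        docno = (doc.findtext("DOCNO") or "").strip()
        headline = doc.findtext("HEADLINE") or ""
        text = doc.findtext("TEXT") or ""
        # Per Note 2: the indexed text is the headline followed by the body.
        docs[docno] = headline.strip() + " " + text.strip()
    return docs
```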
Note 2: For the trec collection, please include the headline of the article in the
index. Put simply, the document text should include both the headline and text
fields. For term positions, please start counting from the headline and continue
through the text.
Note 3: Term position is counted AFTER stop words removal.
The test collection will be released 4 days before the deadline. It will have
exactly the same format as trec.sample.xml. The size of this collection will be
around 5000 documents (five times the current version). If your system runs fine
with the current collection, it should run smoothly with the new collection.
For tokenisation, you can simply split on every non-letter character, or you can
give special treatment to some cases (such as - or '). Please explain your
choices in your report and why you made them.
For stopping, please use the stop-word list mentioned above.
Again, for stemming, you do NOT need to write your own stemmer. Use any
available package for the Porter stemmer, and state in your report which one you
used. You must use the Porter stemmer, nothing else.
For the TFIDF search function, please use the formula from lecture 7, slide 13
with title "TFIDF term weighting". Note: this is different from other
implementations of TFIDF in some toolkits, so please implement it as shown in
lecture.
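As an illustration only: a commonly taught TFIDF weighting is w(t, d) = (1 + log10 tf(t, d)) x log10(N / df(t)), summed over the query terms that appear in the document. This is an assumption about what the slide shows, so verify it against lecture 7, slide 13 before implementing.

```python
import math

def tfidf_weight(tf, df, N):
    # Assumed lecture formulation: (1 + log10 tf) * log10(N / df).
    # Check this against lecture 7, slide 13 before using it.
    return (1 + math.log10(tf)) * math.log10(N / df)

def score(query_terms, doc_id, index, N):
    # Sum the weights of the (pre-processed) query terms that occur in the
    # document; index is assumed to be term -> {docID: [positions]}.
    total = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        if doc_id in postings:
            tf = len(postings[doc_id])  # occurrences of term in this document
            df = len(postings)          # number of documents containing term
            total += tfidf_weight(tf, df, N)
    return total
```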
Please use the queries in lab 2 and lab 3 for testing your code. A new list of
queries will be released with the collection 4 days before the deadline. These 4
days should be more than enough to run the new queries and get the results.
Notes about the expected queries:
- Queries are expected to be very similar to those in the labs.
- Two query files will be provided, one for Boolean search, and the other for
ranked retrieval.
- Boolean queries will not contain more than one "AND" or "OR" operator at a
time, but a mix of a phrase query and one logical "AND" or "OR" operator may
appear (like query q9 in lab 2). "AND" or "OR" can also be combined with NOT,
e.g. Word1 AND NOT Word2.
- 10 queries will be provided in a file named queries.boolean.txt in the
following format:
1 term11 AND term12
2 "term21 term22"
- Proximity search queries will have the format "#15(term1,term2)", which
means: find documents that contain both term1 and term2, where the distance
between term1 and term2 is less than or equal to 15 (after stop-word removal).
- 10 free text queries for ranked retrieval will be provided in a file
named queries.ranked.txt in the following format:
1 this is a sample query
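A proximity query like #15(term1,term2) can be answered directly from the positional index. The sketch below assumes an index shaped as term -> {docID: [positions]}; the structure of your own index may differ.

```python
def proximity_search(term1, term2, k, index):
    # index is assumed to be term -> {docID: [positions]}; positions are
    # counted after stop-word removal (Note 3).
    docs1 = index.get(term1, {})
    docs2 = index.get(term2, {})
    results = []
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        # Match if any pair of occurrences lies within k positions.
        if any(abs(p1 - p2) <= k for p1 in docs1[doc_id] for p2 in docs2[doc_id]):
            results.append(doc_id)
    return results
```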
1. index.txt: a formatted version of your positional inverted index. Each line of this
file describes a token, a document it appears in, and its position within that
document:
term:df
	docID: pos1, pos2, ....
	docID: pos1, pos2, ....
where df is the document frequency of the term. Example:
newspaper:2
	23: 2,15
	93: 234
.......
Here, the token "newspaper" appeared in 2 documents: in document 23 twice, at
positions 2 and 15, and in document 93 once, at position 234.
Each document ID (docID) should be on a separate line that starts with a tab ("\t"
in Python). Positions of a term should be separated by commas. Lines should end
with a line break ("\n" in Python).
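Writing the index in exactly this format could look like the following sketch (again assuming an index shaped as term -> {docID: [positions]}):

```python
def write_index(index, path="index.txt"):
    # index: term -> {docID: [positions]}
    with open(path, "w") as f:
        for term in sorted(index):
            postings = index[term]
            f.write(f"{term}:{len(postings)}\n")       # term:df
            for doc_id in sorted(postings):
                positions = ",".join(str(p) for p in postings[doc_id])
                f.write(f"\t{doc_id}: {positions}\n")  # tab, docID, positions
```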
2. results.boolean.txt: contains results of the queries.boolean.txt in the
following query_number,document_id format:
1,710
1,213
2,103
This means that for query "1" you retrieved two documents - numbers "710" and
"213".
For query "2" you retrieved one document - "103".
The two values on each line should be separated by a comma. Lines should end
with a line break ("\n" in Python).
Your boolean results file should list every matching document, per query.
3. results.ranked.txt: contains results of the queries.ranked.txt in the
following query_number,document_id,score format:
1,710,0.6234
1,213,0.3678
2,103,0.9761
This means that for query "1" you retrieved two documents - document number
"710" (with score 0.6234) and document number "213" (with score 0.3678).
For query "2" you retrieved one document, number "103", with score 0.9761.
Scores should be rounded to four decimal places.
The three values on each line should be separated by a comma. Lines should end
with a line break ("\n" in Python).
Print results for queries in order of their score - that is, all results for query "1" are
sorted by score, then results for query 2, 3, ... 10.
Your ranked results file should list only up to the top 150 results per query.
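Putting the ranked-output rules together (comma-separated fields, scores to four decimal places, at most 150 results per query, sorted by descending score), a sketch of the writer could look like:

```python
def write_ranked_results(results, path="results.ranked.txt"):
    # results: {query_number: [(doc_id, score), ...]}
    with open(path, "w") as f:
        for qnum in sorted(results):
            ranked = sorted(results[qnum], key=lambda pair: pair[1], reverse=True)
            for doc_id, score in ranked[:150]:             # top 150 per query
                f.write(f"{qnum},{doc_id},{score:.4f}\n")  # 4 decimal places
```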
4. code.py: a single file containing the code used to
generate index.txt, results.boolean.txt and results.ranked.txt.
If you use something other than Python, let us know before submission!
Please try to make your code as readable as possible: commented code is highly
recommended.
Please DO NOT submit the collection file with the code!
5. report.pdf: Your report on the work.
This is 2 to 4 pages and should include:
- details on the methods you used for tokenisation and stemming
- details on how you implemented the inverted index
- details on how you implemented the four search functions
- a brief commentary on the system as a whole and what you learned from
implementing it
- what challenges you faced when implementing it
- any ideas on how to improve and scale your implementation
Submit ONLY these five files! Do not submit any of the assignment files.