CITS1401 Project#02, Sem2, 2024
with Python
Dataset
The dataset for this project comprises two key files: a CSV and a TXT file. The CSV file
includes information on countries, hospital IDs, hospital categories, and the number of deaths
recorded in 2022 and 2023, among other details. Meanwhile, the TXT file provides a detailed
account of patient admissions for each country and corresponding hospitals, specifying the
number of cases for Covid, Stroke, and Cancer in 2022.
You are required to write a Python 3 program that will read two different files: a CSV file and
a TXT file. After reading the data file(s), your program will perform four different tasks
outlined below.
Tasks
i. Task 1: Create three dictionaries [Country_to_hospitals, Country_to_death,
Country_to_covid_stroke] to map each country to its hospital IDs, deaths in 2022, and
the total number of patients admitted for covid and stroke in 2022, respectively. The
dictionaries will have country names as keys and lists as values, where each entry in the
lists corresponds to a specific hospital. The order of the entries in all lists across the
dictionaries must be perfectly aligned to ensure consistency across the data. For example:
o Country_to_hospitals['argentina'] = ['H1', 'H2', 'H3']
o Country_to_death['argentina'] = [10, 15, 8] (number of deaths for H1,
H2, H3 in 2022)
o Country_to_covid_stroke['argentina'] = [50, 60, 45] (covid and stroke
admissions for H1, H2, H3 in 2022)
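A minimal sketch of one way to keep the three dictionaries aligned, assuming the file contents have already been parsed into one record per hospital. The record layout, the helper name build_country_dicts, and the lower-casing of country names (suggested by the lowercase keys in the examples) are assumptions for illustration, not part of the specification.

```python
def build_country_dicts(records):
    """Build three aligned dictionaries from per-hospital records.

    Each record is assumed (hypothetically) to be a tuple:
    (country, hospital_id, deaths_2022, covid_2022, stroke_2022).
    """
    country_to_hospitals = {}
    country_to_death = {}
    country_to_covid_stroke = {}
    for country, hospital_id, deaths, covid, stroke in records:
        key = country.lower()  # assumption: keys are lowercase country names
        if key not in country_to_hospitals:
            country_to_hospitals[key] = []
            country_to_death[key] = []
            country_to_covid_stroke[key] = []
        # Appending to all three lists in the same iteration keeps index i
        # referring to the same hospital in every dictionary.
        country_to_hospitals[key].append(hospital_id)
        country_to_death[key].append(deaths)
        country_to_covid_stroke[key].append(covid + stroke)
    return [country_to_hospitals, country_to_death, country_to_covid_stroke]


# Made-up sample data that reproduces the 'argentina' example above.
sample = [
    ("Argentina", "H1", 10, 30, 20),
    ("Argentina", "H2", 15, 40, 20),
    ("Argentina", "H3", 8, 25, 20),
]
op1 = build_country_dicts(sample)
```

Because every hospital is appended to all three lists in a single step, the ordering cannot drift between the dictionaries.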
ii. Task 2: Calculate Cosine Similarity
For each country, compute the Cosine Similarity between the number of deaths in 2022
and the total number of patients admitted for covid and stroke. Store these results in a
dictionary (Cosine_dict) with the country names as the keys and the Cosine Similarity
as the values.
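As an illustration, the Cosine Similarity of two equal-length lists can be computed without importing any module, using ** 0.5 in place of a square-root function. The function name cosine_similarity is hypothetical, and returning zero for a zero denominator is an assumption based on the error-state rule in the Requirements.

```python
def cosine_similarity(xs, ys):
    """Return the cosine similarity of two equal-length numeric lists."""
    dot = 0
    sum_x_sq = 0
    sum_y_sq = 0
    # A single loop accumulates the dot product and both squared norms.
    for x, y in zip(xs, ys):
        dot += x * y
        sum_x_sq += x * x
        sum_y_sq += y * y
    denominator = (sum_x_sq ** 0.5) * (sum_y_sq ** 0.5)
    if denominator == 0:
        # Assumption: an all-zero list is treated as an error state.
        return 0
    return dot / denominator


deaths = [10, 15, 8]
admissions = [50, 60, 45]
# Round only when storing the final value.
similarity = round(cosine_similarity(deaths, admissions), 4)
```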
iii. Task 3: Calculate Variance
Create a dictionary (Variance_dict) that stores the variance in cancer patient admissions
across hospitals in each country for a specified hospital category (e.g., 'children'). The
key is the country name, and the value is the variance in the number of cancer admissions
for the specified hospital category.
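A short sketch of the sample variance (dividing by n - 1, as in the formula at the end of this document), written without imports. The name sample_variance is hypothetical, and returning zero when fewer than two values are available is an assumption based on the error-state rule in the Requirements.

```python
def sample_variance(values):
    """Return the sample variance of a list of numbers (n - 1 denominator)."""
    n = len(values)
    if n < 2:
        # Assumption: variance is undefined here, so zero is returned.
        return 0
    mean = sum(values) / n
    squared_diffs = 0
    for value in values:
        squared_diffs += (value - mean) ** 2
    return squared_diffs / (n - 1)


# Made-up cancer admission counts for hospitals of one category in one country.
variance = sample_variance([2, 4, 4, 4, 5, 5, 7, 9])
```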
Output
The following four outputs are expected:
i. OP1= list of dictionary items: [Country_to_hospitals, Country_to_death,
Country_to_covid_stroke].
a) Country_to_hospitals maps each country name to a list of hospitals within that
country. In this dictionary, the key is the name of the country, and the value is a list of
hospital IDs in that country.
b) Country_to_death maps each country name to a list of the number of deaths in
2022 for each hospital in that country. The key is the name of the country, and the value
is a list where each element represents the death count for a corresponding hospital in
that country.
c) Country_to_covid_stroke maps each country name to a list of the total number
of patients admitted for covid and stroke in each hospital in that country. The key is the
name of the country, and the value is a list where each element represents the total
number of covid and stroke admissions for a hospital in that country.
Important Note: It is essential that the order of the entries in the values (lists) for all three
dictionaries (Country_to_hospitals, Country_to_death, and
Country_to_covid_stroke) is perfectly aligned. For example, the first entry in
Country_to_death['brazil'] and Country_to_covid_stroke['brazil'] should
relate to the first hospital ID in Country_to_hospitals['brazil'], and so on for the
remaining entries.
ii. OP2= Cosine_dict: A dictionary where the key is the country name, and the value is the
Cosine Similarity between the number of deaths in 2022 and the total number of patients
admitted for covid and stroke in that country.
iii. OP3= Variance_dict: A dictionary where the key is the country name, and the value is
the variance in the number of cancer patient admissions for a specified category,
such as 'children', across hospitals within that country.
iv. OP4= Category_Country_dict: A nested dictionary that stores information for each
hospital category ‘C’. The outer dictionary uses the hospital category as the key, and the
value is another dictionary ‘D’.
a) The dictionary ‘D’ uses the country name as the key, and the value is a list containing
the following data for hospitals in category ‘C’ within that country:
• The average number of female patients treated in hospitals under category ‘C’.
• The maximum number of staff working in hospitals within category ‘C’.
• The percentage change between the average number of deaths from 2022 to 2023
for hospitals in category ‘C’.
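The nested structure described above could be built in two steps: accumulate running totals per (category, country), then turn them into the final three-element lists. This is a sketch under stated assumptions: the record layout and the name build_category_country_dict are hypothetical, and the percentage change is taken relative to the 2022 average, which is one common reading rather than a confirmed formula.

```python
def build_category_country_dict(records):
    """Each record is assumed (hypothetically) to be a tuple:
    (category, country, female_patients, staff, deaths_2022, deaths_2023)."""
    totals = {}
    for category, country, female, staff, d22, d23 in records:
        per_country = totals.setdefault(category, {})
        # Running totals: [count, female sum, max staff, 2022 sum, 2023 sum]
        acc = per_country.setdefault(country, [0, 0, staff, 0, 0])
        acc[0] += 1
        acc[1] += female
        if staff > acc[2]:
            acc[2] = staff
        acc[3] += d22
        acc[4] += d23

    result = {}
    for category, per_country in totals.items():
        result[category] = {}
        for country, (count, f_sum, max_staff, d22_sum, d23_sum) in per_country.items():
            avg_d22 = d22_sum / count
            avg_d23 = d23_sum / count
            if avg_d22 == 0:
                # Assumption: a zero baseline is treated as an error state.
                pct_change = 0
            else:
                pct_change = (avg_d23 - avg_d22) / avg_d22 * 100
            # Values are rounded only here, when the output is stored.
            result[category][country] = [
                round(f_sum / count, 4),
                max_staff,
                round(pct_change, 4),
            ]
    return result


# Made-up sample records.
sample = [
    ("children", "canada", 100, 40, 10, 12),
    ("children", "canada", 200, 55, 30, 38),
    ("general", "canada", 50, 20, 5, 10),
]
op4 = build_category_country_dict(sample)
```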
All returned numeric outputs must contain values rounded to four decimal places (if required
to be rounded off). Do not round the values during calculations. Instead, round them only at
the time when you save them into the final output variables.
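The difference between rounding during a calculation and rounding once at the end can be seen with a small made-up example:

```python
values = [0.12344, 0.12344, 0.12344]

# Rounding each value before summing discards digits that matter later.
rounded_early = round(sum(round(v, 4) for v in values), 4)

# Keeping full precision and rounding once, when the result is stored.
rounded_late = round(sum(values), 4)
```

Here the two results differ in the fourth decimal place, which is enough for an automated comparison to fail.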
Requirements
i. You are not allowed to import any external or internal module in Python.
ii. Ensure your program does NOT call the input() function at any time. Calling the
input() function will cause your program to hang, waiting for input that the automated
testing system will not provide (in fact, if the marking program detects any such call, it
will not test your code at all, which may result in a grade of zero).
iii. Your program should also not call the print() function at any time except in the case of
graceful termination (if needed). If your program encounters an error state and exits
gracefully, it should return a cosine-similarity/variance/mean/percentage-change value of
zero and print an appropriate error message. At no point should you print the program’s
outputs or provide a printout of the program’s progress in calculating such outputs. Outputs
should be returned by the program instead.
iv. Do not assume that the input file names will end in .csv or .txt. File name suffixes such
as .csv and .txt are not mandatory in systems other than Microsoft Windows. Do not
enforce within your program that the file must end with a specific extension, nor should
you attempt to add an extension to the provided file name. Doing so can result in loss of
marks.
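One possible way to satisfy requirements iii and iv together is sketched below: the file name is passed to open() exactly as given, and a read failure prints a message and returns a sentinel instead of crashing. The helper name read_lines and the choice of None as the sentinel are assumptions for illustration.

```python
def read_lines(file_name):
    """Return the lines of a text file, or None if it cannot be opened or read."""
    try:
        # The name is used exactly as given: no extension check, none appended.
        with open(file_name, "r") as handle:
            return handle.read().splitlines()
    except OSError:
        # Printing is acceptable here because this is graceful termination.
        print("Error: could not read file", file_name)
        return None
```

A caller would check for None and return zero-valued outputs rather than continuing with missing data.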
Examples
>> len(OP1)
>> OP1[0]['afghanistan']
['4eb9d3e5cf79b91', 'bba52b87bb6a32f', '8a9190a50adf241']
>> OP1[1]['afghanistan']
[20, 2, 12]
>> OP1[2]['afghanistan']
32
>> OP2['afghanistan']
0.5746
>> OP2['albania']
0.9257
>> OP3['afghanistan']
785004.5
24420.5
>> len(OP4['children'])
32
>> OP4['children']['canada']
Note that you have not been asked to write specific functions. The task has been left to you.
However, it is essential that your program defines the top-level function main(CSVfile,
TXTfile, category), commonly written as ‘main()’ in the project documents to save
space. When main() is written, it still implies that the function is defined with its three
input arguments. The idea is that within main(), the program calls the other
functions. (Of course, these functions may then call further functions.) This is important
because when your code is tested on Moodle, the testing program will call your main()
function. So, if you fail to define main(), the testing program will not be able to test your
code and your submission will be graded zero. Don’t forget the submission guidelines provided
at the start of this document.
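A bare skeleton of the required entry point might look like the following. Only the function name and its three parameters come from this document; the placeholder values and the order of the returned outputs are assumptions.

```python
def main(CSVfile, TXTfile, category):
    # Hypothetical helper functions would be called here to read both files
    # and compute each output; this sketch returns empty placeholders only.
    OP1 = [{}, {}, {}]  # [Country_to_hospitals, Country_to_death, Country_to_covid_stroke]
    OP2 = {}            # Cosine_dict
    OP3 = {}            # Variance_dict
    OP4 = {}            # Category_Country_dict
    # Outputs are returned, never printed.
    return OP1, OP2, OP3, OP4
```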
Marking rubric
24 out of 30 marks will be awarded automatically based on how well your program completes
a number of tests, reflecting normal use of the program, and how the program handles various
states including, but not limited to, different numbers of rows in the input file and / or any error
states. You need to think creatively about what your program may face. Your submission will be
graded using data files other than the provided data file. Therefore, you need to be creative in
investigating corner or worst cases. I have provided a few guidelines from the ACS Accreditation
manual at the end of the project sheet, which will help you to understand the expectations.
6 out of 30 marks will be awarded on style (3/6) “the code is clear to read” and efficiency (3/6)
“your program is well constructed and runs efficiently”. For style, think about use of comments,
sensible variable names, your name at the top of the program, student ID, etc. (Please watch
the lectures where this is discussed.)
Style Rubric:
0 Gibberish, impossible to understand
1 Style is really poor or fair.
2 Style is good or very good, with small lapses.
3 Excellent style, really easy to read and follow
Your program will be traversing text files of various sizes (possibly including large csv files)
so you need to minimise the number of times your program looks at the same data items.
Efficiency rubric:
0 Code too complicated to judge efficiency or wrong problem tackled
1 Very poor efficiency, additional loops, inappropriate use of readline()
2 Acceptable or good efficiency with some lapses
3 Excellent efficiency, should have no problem on large files, etc.
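To illustrate what looking at each data item only once can mean in practice, the sketch below indexes one set of rows by hospital ID in a single pass, so that matching the other set needs a dictionary lookup rather than a repeated scan. The row layouts and the name index_by_hospital are hypothetical.

```python
def index_by_hospital(rows):
    """Map hospital ID (assumed to be in column 0) to its row, in one pass."""
    index = {}
    for row in rows:
        index[row[0]] = row
    return index


# Made-up rows standing in for already-split lines of the two files.
csv_rows = [["H1", "argentina", "10"], ["H2", "argentina", "15"]]
txt_rows = [["H2", "40", "20"], ["H1", "30", "20"]]

admissions = index_by_hospital(txt_rows)
merged = []
for hospital_id, country, deaths in csv_rows:
    # A dictionary lookup avoids rescanning txt_rows for every CSV row.
    _, covid, stroke = admissions[hospital_id]
    merged.append((country, hospital_id, int(deaths), int(covid) + int(stroke)))
```

Each list is traversed exactly once, so the work grows linearly with the file sizes instead of with their product.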
Automated testing is being used so that all submitted programs are being tested the same way.
Sometimes it happens that there is one mistake in the program that means that no tests are
passed. If the marker can spot the cause and fix it readily, then they are allowed to do that and
your - now fixed - program will score whatever it scores from the tests, minus 4 marks, because
other students will not have had the benefit of marker intervention. Still, that's way better than
getting zero. On the other hand, if the bug is hard to fix, the marker needs to move on to other
submissions.
Extract from Australian Computer Society Accreditation manual 2019
As per Seoul Accord section D, a complex computing problem will normally have some or all
of the following criteria:
• involves wide-ranging or conflicting technical, computing, and other issues.
• has no obvious solution and requires conceptual thinking and innovative analysis to
formulate suitable abstract models.
• a solution requires the use of in-depth computing or domain knowledge and an
analytical approach that is based on well-founded principles.
• involves infrequently encountered issues.
• are outside problems encompassed by standards and standard practice for professional
computing.
• involves diverse groups of stakeholders with widely varying needs.
• has significant consequences in a range of contexts.
• is a high-level problem possibly including many component parts or sub-problems.
• identification of a requirement or the cause of a problem is ill defined or unknown.
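Necessary formulas
1) Cosine Similarity:
$$\text{Cosine Similarity} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$
where,
$x_i$ = first set of data
$y_i$ = second set of data
$n$ = the number of samples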
2) Variance, $S^2$:
$$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
where,
$S^2$ = sample variance
$x_i$ = the value of one observation
$\bar{x}$ = the mean value of all observations
$n$ = the number of observations
3) Average Percentage Change:
Percentage change between the average number of deaths from 2022 to 2023, for all hospitals
of a specific hospital category within a country.
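No expression is given for this quantity here. One common reading, stated as an assumption rather than as the official marking formula, measures the change relative to the 2022 average. The symbols $\bar{d}_{2022}$ and $\bar{d}_{2023}$ are introduced only for this illustration and denote the average deaths per hospital in each year:

```latex
$$\text{Percentage change} = \frac{\bar{d}_{2023} - \bar{d}_{2022}}{\bar{d}_{2022}} \times 100$$
```

For example, averages of 20 deaths in 2022 and 25 deaths in 2023 would give (25 - 20) / 20 × 100 = 25.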