Data Mining Assignment 1
Assignment 1
Due: February 13 (Tuesday) at 12 midnight
Instructions: (1) You may discuss the assignment with others. However, you MUST
do and submit your OWN work. (2) Submit a soft-copy report to the submission
folder on LMS. Include the report and the code needed to reproduce your results.
a. Fill in the missing values in the highest and lowest price columns using the
appropriate filter. Report statistics of both columns (max, mean, average, std
dev, variance).
b. If a value in the first appeared or last appeared column is not valid, replace it
with the year from the row above or below it. Report the number of affected rows.
c. Sort by dish name and do the following on this column:
   - Find the unique entries and report their number.
   - Remove records with invalid entries (e.g. a name cannot be a number).
   - Remove null entries.
   - Remove punctuation.
   - Remove leading and trailing spaces.
   - Change to lower case.
   - Remove duplicate entries.
   - What changes will appear in the other columns when you remove duplicate
     entries? Perform and report them.
   - Find at least 10 different near-duplicate entries and record each group as
     one entry, e.g. "egg muffin" and "egg muffins" are near duplicates. Similarly,
     "zuppa del Giorno" and "zuppa del girono" are the same dish with a spelling
     mistake.
   - Find the unique entries again and report their number.
d. Report the dataset size and the number of affected rows after every operation in
(a), (b), and (c), and discuss why it changed. Comment on whether any redundant
information remains and on ways to remove it.
e. Report the 10 most popular dishes.
f. Report the 10 most expensive and the 10 cheapest dishes. Plot a bar graph of the
number of times they appear on the menu.
g. Report at least 5 dishes that appeared on the menu for the longest or shortest
period of time.
h. What is the best representation strategy for the different attributes, providing
maximum information for this dataset? Justify your choice in the report.
i. Comment on the results. Present and discuss any other observation or result that
you find interesting.
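The name-cleaning steps in part (c) could be sketched roughly as follows. This is only an illustration using the Python standard library, not a prescribed solution: the helper names normalize_name and group_near_duplicates, the sample names, and the similarity threshold of 0.85 are all assumptions made for the example. A greedy similarity threshold can also merge dishes that are genuinely different, so any groups it proposes should be checked by hand.

```python
import string
from difflib import SequenceMatcher


def normalize_name(raw):
    """Return a cleaned dish name, or None if the entry should be dropped.

    Hypothetical helper: strips punctuation, trims spaces, lower-cases,
    and rejects null, empty, or purely numeric entries.
    """
    if raw is None:
        return None
    no_punct = raw.translate(str.maketrans("", "", string.punctuation))
    name = no_punct.strip().lower()
    if not name or name.replace(" ", "").isdigit():
        return None
    return name


def group_near_duplicates(names, threshold=0.85):
    """Map each name to the first earlier name it closely resembles.

    The threshold is an assumed value; tune it against the real data.
    """
    canonical = []
    mapping = {}
    for name in names:
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, name, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Toy input, invented for illustration only.
raw_names = [
    "Egg Muffin", " egg muffins ", "Zuppa del Giorno",
    "zuppa del girono!", "1234", None, "Coffee", "coffee",
]

# Clean, drop invalid entries, remove exact duplicates, and sort.
cleaned = sorted({n for n in map(normalize_name, raw_names) if n is not None})

# Collapse near duplicates onto one representative name.
mapping = group_near_duplicates(cleaned)
unique_after_merge = sorted(set(mapping.values()))

print(len(cleaned), "unique names after cleaning")
print(len(unique_after_merge), "unique names after merging near duplicates")
```

When duplicates are merged on the real dataset, the other columns of the merged rows (prices, years, appearance counts) have to be combined as well, which is what the question on changes to other columns asks you to work out and report.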
a. Generate the correlation matrix for the attributes in this dataset. In particular, observe
the correlation between each attribute and the class label, as well as any significant
correlations between attributes, and report your observations.
b. Compute the chi-square statistic between the attributes ‘age’ and ‘number of warts’ in this
dataset.
c. Comment on the results. (This whole part deals with analysis. Make sure to present
comprehensive observations after performing these operations.)
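As a rough illustration of the two quantities asked for above, the sketch below computes a Pearson correlation coefficient and a chi-square statistic by hand with the standard library. The function names pearson and chi_square and all sample values are invented for the example and are not taken from the assignment dataset. Note the assumption that numeric attributes such as age are first binned into categories before the chi-square statistic is computed; the bins shown are arbitrary.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def chi_square(xs, ys):
    """Chi-square statistic and degrees of freedom for two categorical lists.

    Builds the contingency table of observed counts and compares each cell
    with its expected count (row total * column total / n).
    """
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Toy values, invented for illustration only.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])

# Assumed binning of 'age' and 'number of warts' into two categories each.
age_group = ["young"] * 4 + ["old"] * 4
wart_bin = ["few", "few", "few", "many", "few", "many", "many", "many"]
chi2_stat, dof = chi_square(age_group, wart_bin)

print("pearson r:", r)
print("chi-square:", chi2_stat, "dof:", dof)
```

A full correlation matrix is the same pearson computation applied to every pair of attributes. To judge significance, the chi-square statistic is compared with the critical value for the given degrees of freedom.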
You are required to do this task in Python. There is no restriction on using library
functions.
Note: You should discuss the results/outcomes of each part in detail in your report and
provide all RapidMiner files in your submission. Zip the folder and name it
rollnumber_Name_SubjectCode, e.g. 16030000_JohnSnow_CS536. Marks will be
deducted if the submission instructions are not properly followed. In case of
plagiarism (in any part), the whole assignment will be graded zero.