Data Mining Assignment 1

This document outlines the instructions for Assignment 1 of the CS 536/CS 432 Data Mining course. It includes 3 parts: 1) data understanding, visualization, and preprocessing on the Dishes dataset using RapidMiner, 2) data preprocessing tasks on the Communities and Crime dataset from UCI, and 3) dependency analysis on the cryotherapy dataset including generating a correlation matrix and chi square statistic. Students are instructed to complete the tasks individually, submit a report and code, and discuss all results in detail.

Uploaded by

Zain Aamir

CS 536/CS 432 – Data Mining

Assignment 1
Due: February 13 (Tuesday) at 12 midnight

Instructions: (1) You may discuss the assignment with others. However, you MUST
do and submit your OWN work. (2) Submit a soft-copy report to the submission
folder on LMS. Include the report and the code needed to reproduce your results.

1. Data Understanding, Visualization and Preprocessing (55 points)


Explore and understand the Dishes dataset provided on LMS. Report the
results/outcomes of every part in your report. All tasks must be performed in
RapidMiner.

You need to perform the following tasks:

a. Fill in the missing values in the highest price and lowest price columns using
an appropriate filter. Report statistics of both columns (max, mean (average),
std dev, variance).
b. If a value in the first appeared or last appeared column is not valid, replace
it with the year in the row above or below it. Report the number of affected rows.
c. Sort by dish name and do the following on this column:
• Find the unique entries and report their number
• Remove records with false entries (e.g., a name can't be a number)
• Remove null entries
• Remove punctuation
• Remove leading and trailing spaces
• Change to lower case
• Remove duplicate entries
• What changes appear in the other columns when you remove duplicate
entries? Perform these changes and report them.
• Find at least 10 different near-duplicate entries and record each group as
one entry, e.g., egg muffin and egg muffins are near duplicates. Similarly,
zuppa del Giorno and zuppa del girono are the same dish with a spelling
mistake.
• Find the unique entries again and report their number
d. Report the dataset size and the number of affected rows after every operation
in (a), (b), (c) and discuss why it changed. Comment on whether any redundant
information remains and on ways to remove it.
e. Report the 10 most popular dishes.
f. Report the 10 most expensive and the 10 cheapest dishes. Plot a bar graph of
the number of times each of them appears on the menu.
g. Report at least 5 dishes that appeared for the longest or shortest period of time
on the menu.
h. What is the best representation strategy for the different attributes, providing
maximum information for this dataset? Justify your choice in the report.
i. Comment on the results. Present and discuss any other observation/result that
you find interesting.
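
Every task in this part must be performed in RapidMiner, but the name clean-up steps in (c) can be cross-checked outside it. The following is only a minimal Python sketch under stated assumptions: the sample names, the helper functions `clean` and `near_duplicates`, and the 0.9 similarity threshold are all hypothetical choices for illustration and are not taken from the Dishes dataset or the assignment.

```python
import string
from difflib import SequenceMatcher

# Hypothetical sample of dish names; the real values come from the Dishes dataset.
names = ["Egg Muffin", " egg muffins ", "Zuppa del Giorno", "zuppa del girono",
         "1234", None, "Consomme, hot", "consomme hot"]

def clean(name):
    """Return a normalized name, or None for null or purely numeric entries."""
    if name is None:
        return None
    # Remove punctuation characters.
    text = name.translate(str.maketrans("", "", string.punctuation))
    # Lower-case, trim the ends, and collapse repeated inner spaces.
    text = " ".join(text.lower().split())
    if not text or text.isdigit():
        return None
    return text

# Normalize, drop invalid entries, then drop exact duplicates (order kept).
cleaned = [c for c in map(clean, names) if c is not None]
unique = list(dict.fromkeys(cleaned))

def near_duplicates(items, threshold=0.9):
    """Pair up names whose similarity ratio meets the (assumed) threshold."""
    pairs = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

pairs = near_duplicates(unique)
print(len(unique), "unique names after cleaning")
print(pairs)
```

A similarity threshold only proposes candidate pairs; which spelling to keep as the single recorded entry, and how to combine the other columns of the merged rows, remains a judgment to explain in the report.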

CS 536 (Sp 17-18) – Dr. Mian Muhammad Awais Page 1 of 2


2. Data Preprocessing (15 points)
Use RapidMiner to perform the following tasks. Download the Communities and
Crime dataset from the UCI repository, study it, and perform the following
tasks:
a. Report the number of missing values in the dataset (per attribute and per
object). Also report the 5 attributes with the most missing values.
b. Fill in the missing values in the data using an appropriate filter.
c. Standardize the dataset to zero mean and unit variance (z-score normalization)
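
The tasks above must be done in RapidMiner, but the expected numbers can be sanity-checked with a few lines of pandas. This is a rough sketch, not a prescribed method: the tiny table and its column names (`attr_a`, `attr_b`, `attr_c`) are hypothetical, the use of "?" as the missing-value marker is an assumption about the downloaded file, and mean imputation is just one possible choice of filter.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; "?" is assumed to mark a missing value.
raw = pd.DataFrame({
    "attr_a": ["0.19", "0.00", "0.04", "0.01", "0.26"],
    "attr_b": ["0.20", "?", "?", "0.12", "0.40"],
    "attr_c": ["?", "0.03", "?", "?", "0.07"],
})

# Convert every column to numbers; "?" (and anything non-numeric) becomes NaN.
data = raw.apply(pd.to_numeric, errors="coerce")

# (a) Missing values per attribute and per object (row), plus the worst attributes.
missing_per_attribute = data.isna().sum()
missing_per_object = data.isna().sum(axis=1)
top_missing = missing_per_attribute.sort_values(ascending=False).head(5)

# (b) One simple filter: replace each gap with the mean of its column.
filled = data.fillna(data.mean())

# (c) z-score: subtract the column mean and divide by the column std dev.
# ddof=0 uses the population standard deviation, giving exactly unit variance.
standardized = (filled - filled.mean()) / filled.std(ddof=0)

print(top_missing)
print(standardized.round(3))
```

Note that a column with zero variance after imputation would divide by zero in step (c), so constant attributes need separate handling.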

3. Dependency Analysis (30 points)


Perform dependency analysis on the cryotherapy dataset (download the dataset from
goo.gl/rzspPa). The result of treatment is your class label, identifying whether the
treatment was successful or not (this is a binary classification task).

You are required to do the following tasks:

a. Generate the correlation matrix for the attributes in this dataset. In particular, observe
the correlation between each attribute and the class label, as well as any significant
correlations between attributes, and report your observations.
b. Compute the chi-square statistic between the attributes ‘age’ and ‘number of warts’ in
this dataset.
c. Comment on the results. (This whole part deals with analysis. Make sure to present
comprehensive observations after performing these operations.)

You are required to do this task in Python. There is no restriction on using library
functions.
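
As a starting point, tasks (a) and (b) could be sketched with pandas and SciPy roughly as follows. This is an illustration under assumptions, not the required solution: the rows are made-up toy values rather than the real cryotherapy data, the column names `age`, `number_of_warts` and `result_of_treatment` are hypothetical, and splitting both numeric attributes at their median before the chi-square test is only one possible way to build a contingency table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy rows standing in for the cryotherapy data; the real file
# and its exact column names are not reproduced here.
df = pd.DataFrame({
    "age": [15, 16, 17, 18, 20, 23, 27, 29, 32, 34, 35, 40, 45, 50, 63, 67],
    "number_of_warts": [2, 3, 1, 4, 5, 2, 7, 9, 10, 11, 12, 8, 6, 12, 11, 9],
    "result_of_treatment": [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

# (a) Pearson correlation matrix over all numeric columns.
corr = df.corr(method="pearson")
# Correlation of every attribute with the class label, strongest first.
label_corr = (corr["result_of_treatment"]
              .drop("result_of_treatment")
              .sort_values(key=abs, ascending=False))

# (b) Chi-square test of independence. Both attributes are numeric, so this
# sketch first splits each at its median (an assumption, not a requirement).
age_group = (df["age"] > df["age"].median()).map(
    {False: "younger", True: "older"})
warts_group = (df["number_of_warts"] > df["number_of_warts"].median()).map(
    {False: "fewer", True: "more"})
table = pd.crosstab(age_group, warts_group)
# correction=False returns the plain Pearson chi-square statistic
# (no Yates continuity correction on the 2x2 table).
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)

print(corr.round(2))
print(label_corr.round(2))
print(table)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")
```

The binning choice affects the statistic, so the report should state how the contingency table was built and check that the expected counts are not too small for the test to be meaningful.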

Note: You should discuss the results/outcomes of each part in detail in your report and
provide all RapidMiner files in your submission. Zip the folder and name it
rollnumber_Name_SubjectCode, e.g., 16030000_JohnSnow_CS536. Marks will be
deducted if the submission instructions aren’t properly followed. In case of
plagiarism (in any part), the whole assignment will be graded zero.
