Assignment #1: k-Nearest Neighbor

CSCI 374 Fall 2022 Oberlin College

Due: Wednesday September 21 at 11:59 PM

Background

Our first assignment this semester has two main goals:

1. Implement and evaluate our first machine learning algorithm, and
2. Practice using GitHub for source code management

Gitting Started

To begin this assignment, you will need to create a GitHub account if you don’t already have
one, as well as download the code for this assignment. Instructions can be found in the
Introduction to GitHub document on the class webpage (under Handouts).

Once you have a GitHub account, you can get started on the assignment by following this link:
https://classroom.github.com/a/a5CyZXHb

Data Sets

For this assignment, we will learn from four pre-defined data sets:
1. monks1.csv: A data set describing two classes of robots using all nominal attributes and
a binary label. This data set has a simple rule set for determining the label: if
head_shape = body_shape ∨ jacket_color = red, then yes, else no.
Each of the attributes in the monks1 data set is nominal. Monks1 was one of the first
machine learning challenge problems (http://www.mli.gmu.edu/papers/91-95/91-28.pdf).
This data set comes from the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/MONK%27s+Problems
2. penguins.csv: A data set describing observed measurements of different animals
belonging to three species of penguins. The four attributes are each continuous
measurements, and the label is the species of penguin. Special thanks and credit to
Professor Allison Horst at the University of California Santa Barbara for making this data
set public: see this Twitter post and thread with more information
(https://twitter.com/allison_horst/status/1270046399418138625) and GitHub repository
(https://github.com/allisonhorst/palmerpenguins).
3. mnist_100.csv: A data set of optical character recognition of numeric digits from images.
Each instance represents a different grayscale 28x28 pixel image of a handwritten
numeric digit (from 0 through 9). The attributes are the intensity values of the 784 pixels.
Each attribute is ordinal (treat them as continuous for the purpose of this assignment), and
the label is nominal. This version of MNIST contains 100 instances of each handwritten
numeric digit, randomly sampled from the original training data for MNIST. The overall
MNIST data set is one of the main benchmarks in machine learning:
http://yann.lecun.com/exdb/mnist/. It was converted to a CSV file using the Python code
provided at:
https://quickgrid.blogspot.com/2017/05/Converting-MNIST-Handwritten-Digits-Dataset-into-CSV-with-Sorting-and-Extracting-Labels-and-Features-into-Different-CSV-using-Python.html
4. mnist_1000.csv: The same as mnist_100, except containing 1000 instances of each
handwritten numeric digit.

The file format for each of these data sets is as follows:

• The first row contains a comma-separated list of the names of the label and attributes
• Each successive row represents a single instance
• The first entry (before the first comma) of each instance is the label to be learned, and all
other entries (following the commas) are attribute values.
• Some attributes are strings (representing nominal values), some are integers, and others
are real numbers. Each label is a string.
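
As a rough illustration of this file format, the sketch below reads a data set with Python's built-in csv module. The function name load_dataset, the (label, attributes) tuple layout, and the two sample rows are assumptions made for this example, not part of the assignment's starter code. Attribute values are left as strings here; they would still need converting to numbers before a numeric distance function could use them.

```python
import csv
import io


def load_dataset(handle):
    """Return (header, instances); each instance is a (label, attributes) pair."""
    reader = csv.reader(handle)
    header = next(reader)  # first row: names of the label and the attributes
    instances = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        # First entry is the label, everything after it is an attribute value.
        instances.append((row[0], tuple(row[1:])))
    return header, instances


# Made-up two-instance sample in the described layout (not real monks1 rows).
sample = io.StringIO(
    "label,head_shape,body_shape\n"
    "yes,round,round\n"
    "no,square,round\n"
)
header, instances = load_dataset(sample)
```

With a real file, the same function could be given the handle returned by open(path, newline="").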

Program

Your assignment is to write a program called knn that behaves as follows:

1) It should take as input five parameters:
a. The path to a file containing a data set
b. The name of the distance function to use, from the set {H, E} (where H stands for
Hamming and E stands for Euclidean)
c. The value of k to use in the k-Nearest Neighbors algorithm
d. The percentage of instances to use for a training set (given as a fraction, e.g., 0.75)
e. An integer to use as a random seed

For example, if I wrote my program in Python, I might run

python knn mnist_100.csv E 1 0.75 12345

which will perform 1-Nearest Neighbors on mnist_100.csv using the Euclidean distance
function and a random seed of 12345, where 75% of the data will be used for training
(and the remaining 25% will be used for testing).
2) To begin execution, the program should read in the data set as a set of instances
3) Then, the instances should be randomly split into training and test sets (using the
percentage and random seed input to the program)
4) Next, predictions should be made for each instance in the test set created in Step 3, using
the training set as the instances to compare to in the k-Nearest Neighbors algorithm.
5) A confusion matrix should be created based on the predictions made during Step 4, then
the confusion matrix should be output as a file with its name following the pattern:
results-<DataSet>-<k>-<Seed>.csv (e.g., results-mnist_100-1-12345.csv).
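
Steps 3 and 4 can be sketched as follows, using only the standard library. This is one possible arrangement under stated assumptions, not the required design: the function names, the (label, attributes) tuple layout, the rounding of the training-set size, and the way voting ties are broken are all choices made for this example.

```python
import math
import random
from collections import Counter


def hamming(a, b):
    """Number of attribute positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def euclidean(a, b):
    """Straight-line distance between two numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split(instances, train_fraction, seed):
    """Shuffle a copy with a seeded generator, then cut it into train and test."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the order
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))  # assumption: round to nearest
    return shuffled[:cut], shuffled[cut:]


def predict(train, attributes, k, distance):
    """Most common label among the k training instances closest to `attributes`."""
    nearest = sorted(train, key=lambda inst: distance(inst[1], attributes))[:k]
    votes = Counter(label for label, _ in nearest)
    # Assumption: ties are left to Counter's ordering, which is deterministic here.
    return votes.most_common(1)[0][0]
```

Sorting the whole training set for every test instance is the simplest approach but not the fastest; since the rubric includes runtime efficiency, a faster way to select the k closest instances may be worth considering.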

Program Output

The file format for your output file should be as follows:

• The first row should be a comma-separated list of the possible labels in the data set,
representing the list of possible predictions your program could make. This row should
end in a comma.
• The second row should be a comma-separated list of the counts of how many times each of
the different labels was predicted for instances in the test set whose true label is the first
possible label, ending with the name of the first possible label (and not a final comma).
• The third row should be a comma-separated list of the counts of how many times each of
the different labels was predicted for instances in the test set whose true label is the second
possible label, ending with the name of the second possible label (and not a final
comma).
• Etc. for the remaining possible true labels

For example, the confusion matrix:

    Predicted Label
    Yes    No
    200    100    Yes    Actual
    50     250    No     Label

would be output as:

Yes,No,
200,100,Yes
50,250,No

The output for your program should be consistent with the random seed. That is, if the same
seed is input twice, your program should output the exact same confusion matrix.
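
The output format described above can be sketched as follows. The helper names confusion_rows and results_filename are hypothetical, and the order of the labels list is assumed to be chosen by the caller. The sample data reproduces the Yes/No matrix from the example.

```python
import os
from collections import Counter


def confusion_rows(labels, actual, predicted):
    """Build the lines of the results file from parallel label lists."""
    counts = Counter(zip(actual, predicted))  # (true label, predicted label) -> count
    lines = [",".join(labels) + ","]  # header row ends in a comma
    for true_label in labels:
        cells = [str(counts[(true_label, guess)]) for guess in labels]
        lines.append(",".join(cells) + "," + true_label)  # row ends with its true label
    return lines


def results_filename(data_path, k, seed):
    """Name the output file results-<DataSet>-<k>-<Seed>.csv."""
    name = os.path.splitext(os.path.basename(data_path))[0]
    return f"results-{name}-{k}-{seed}.csv"


# The 600 test instances from the example matrix.
actual = ["Yes"] * 300 + ["No"] * 300
predicted = ["Yes"] * 200 + ["No"] * 100 + ["Yes"] * 50 + ["No"] * 250
lines = confusion_rows(["Yes", "No"], actual, predicted)
```

Writing "\n".join(lines) to the path returned by results_filename would then produce the file.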

Programming Languages

I would recommend using either the Java or Python programming languages to complete this
assignment. If you have a different preferred language, please talk to me to make sure that I will
be able to run your submission in that language.

You are allowed to use any library built in to your chosen language (e.g., random, sys, math, and
csv in Python; java.util.*, including java.util.Random, in Java), but you should not use any
external libraries (e.g., scikit-learn, numpy, and Pandas in Python; Weka in Java).

Research Questions

Please use your program to answer these research questions and record your answers in a
README.md file:
1) Pick a single random seed and a single training set percentage (document both in your
README) and run k-Nearest Neighbors with a k = 1 on each of the four data sets. What
is the accuracy you observed on each data set?
2) Using the accuracies from Question 1, calculate a 95% confidence interval around the
accuracy on each data set.
3) How did your accuracy compare between the mnist_100 and mnist_1000 data sets?
Which had the higher average? Why do you think you observed this result? Did their
confidence intervals overlap? What conclusion can we draw based on their confidence
intervals?
4) Pick one data set and three different values of k (document both in your README). Run
the program with each value of k on that data set and compare the accuracy values
observed. Did changing the value of k have much of an effect on your results? Speculate
as to why or why not.
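
For Question 2, one common way to put a 95% confidence interval around an accuracy is the normal approximation to the binomial: accuracy ± 1.96 × sqrt(accuracy × (1 − accuracy) / n), where n is the number of test instances. The sketch below assumes that formula; the method presented in class may differ, so treat it as an illustration only. The function name is hypothetical.

```python
import math


def accuracy_interval(correct, total, z=1.96):
    """Approximate confidence interval for accuracy (z = 1.96 for about 95%)."""
    accuracy = correct / total
    standard_error = math.sqrt(accuracy * (1 - accuracy) / total)
    return accuracy - z * standard_error, accuracy + z * standard_error


# The example confusion matrix has 200 + 250 = 450 correct out of 600 predictions.
low, high = accuracy_interval(450, 600)
```

For that example the accuracy is 0.75 and the interval runs from roughly 0.715 to 0.785.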

Bonus Question (5 points)

5) Pick 10 different random seeds (document them in your README file) and rerun k-
Nearest Neighbors with a k = 1 on the penguins.csv data. Record the average of the
accuracy across the 10 runs.

Next, rerun the program on the same 10 seeds but only consider two attributes at a time
(ignoring the other two attributes not in the chosen pair). Record the average accuracy
for each pair of attributes across the 10 seeds. Since there are four attributes, there are
six possible pairs of attributes (e.g., bill_length_mm-bill_depth_mm is one pair, so
flipper_length_mm and body_mass_g would be ignored for this pair).

Finally, compare the average accuracy results between (1-6) all six pairs of attributes and
(7) the results using all four attributes. Did any pairs of attributes do as well (or better)
than learning using all four attributes? Speculate why you observed your results.
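
The six attribute pairs can be enumerated with itertools.combinations, as in the sketch below. The four attribute names come from the text above, but their column order in penguins.csv is an assumption here, as are the helper name project, the (label, attributes) tuple layout, and the sample measurements.

```python
from itertools import combinations

# Assumed column order; check the header row of penguins.csv.
ATTRIBUTES = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]


def project(instance, columns):
    """Keep only the attribute values at the given column positions."""
    label, attributes = instance
    return label, tuple(attributes[i] for i in columns)


# Every unordered pair of column positions: (0, 1), (0, 2), ..., (2, 3).
pairs = list(combinations(range(len(ATTRIBUTES)), 2))

# Illustrative instance with made-up measurements.
example = ("Adelie", (39.1, 18.7, 181.0, 3750.0))
reduced = project(example, pairs[0])
```

Applying project to every instance before splitting leaves the rest of the program unchanged for each pair.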

README

Within a README file, you should include:

1) Your answers to the questions above,
2) A short paragraph describing your experience during the assignment (what did you enjoy,
what was difficult, etc.),
3) An estimation of how much time you spent on the assignment, and
4) An affirmation that you adhered to the honor code

Please remember to commit your solution code, results files, and README file to your
repository on GitHub. You do not need to wait to commit your code until you are done with the
assignment; it is good practice to do so not only after each coding session, but also after
hitting important milestones or solving bugs during a coding session. Make sure to document
your code, explaining how you implemented the different components of the assignment.

Honor Code

Each student is to complete this assignment individually. However, students are encouraged to
collaborate with one another to discuss the abstract design and processes of their
implementations. For example, please feel free to discuss the pseudocode for the learning
algorithm. You might also want to discuss the processes used to generate the training and test
sets from the read in data set. Or, you might need to discuss how to work with the input and
output files. At the same time, since this is an individual assignment, no code can be shared
between students, nor can students look at each other’s code.

Grading Rubric

Your solution and README will be graded based on the following rubric:

Followed input directions: /5 points
Properly created training and test sets: /15 points
Correctly implemented the k-Nearest Neighbors algorithm: /25 points
Runtime efficiency: /10 points
Followed output directions: /10 points
Correctly answered research questions: /25 points
Provided requested README information: /5 points
Appropriate code documentation: /5 points

By appropriate code documentation, I mean including a header comment at the top of each file
explaining what the file provides, as well as at least one comment per function explaining the
purpose of the function.
