
AMERICAN INTERNATIONAL UNIVERSITY-BANGLADESH

Assignment Cover Page


Assignment Title: Final Term Project Report
Assignment No: - Date of Submission: 10 May 2024
Course Title: Programming in Python
Course Code: CSC 4162 Section: B
Semester: Spring 2023-24 Course Teacher: Dr. Abdus Salam

Declaration and Statement of Authorship:


1. I/we hold a copy of this Assignment/Case-Study, which can be produced if the original is lost/damaged.
2. This Assignment/Case-Study is my/our original work and no part of it has been copied from any other student’s work or from
any other source except where due acknowledgement is made.
3. No part of this Assignment/Case-Study has been written for me/us by any other person except where such collaboration has
been authorized by the concerned teacher and is clearly acknowledged in the assignment.
4. I/we have not previously submitted, and am/are not currently submitting, this work for any other course/unit.
5. This work may be reproduced, communicated, compared and archived for the purpose of detecting plagiarism.
6. I/we give permission for a copy of my/our marked work to be retained by the Faculty for review and comparison, including
review by external examiners.
7. I/we understand that Plagiarism is the presentation of the work, idea or creation of another person as though it is your own. It
is a form of cheating and is a very serious academic offence that may lead to expulsion from the University. Plagiarized material can be
drawn from, and presented in, written, graphic and visual form, including electronic data, and oral presentations. Plagiarism
occurs when the origin of the material used is not appropriately cited.
8. I/we also understand that enabling plagiarism is the act of assisting or allowing another person to plagiarize or to copy my/our
work.

* Student(s) must complete all details except the faculty use part.
** Please submit all assignments to your course teacher or the office of the concerned teacher.

Group Name/No.: -

No Name ID Program Signature


1 Tazrif Yamshit Raim 21-45012-2 BSc [CSE]
2 Israk Hossain Pantho 21-44401-1 BSc [CSE]

Faculty use only


FACULTY COMMENTS

Marks Obtained

Total Marks
Dataset Title: Breast Cancer Wisconsin (Diagnostic) Dataset (Modified)

Dataset Description: The dataset provides a diagnosis label based on the features. With diagnosis as the target,
the dataset can be used for breast cancer classification, i.e., whether a case is Benign or Malignant. The
diagnosis field is categorical with 2 possible values, Benign ('B') and Malignant ('M'). All other fields are
continuous except the perimeter fields, which are categorical in this modified version of the dataset.

Dataset features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.

The separating plane referenced in the original dataset documentation was obtained using the Multisurface
Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming," Proceedings of the
4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method
which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive
search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that
described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of
Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

Attribute Information:

1) ID number

2) Diagnosis (M = malignant, B = benign)

3-32)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features
were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
All continuous feature values are recorded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant; 569 total instances

Imported Libraries: The libraries necessary to complete the tasks are imported first -
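The import cell appears as a screenshot in the original report; a plausible reconstruction covering every library call referenced in the tasks below (the exact list in the original code is an assumption) might look like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pointbiserialr                      # Task 4: feature screening
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score)
```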

Task 1: Read/Load the dataset file in your program. Use Pandas library to complete this task.

Data is read from the 'data.csv' file using read_csv(), provided by the pandas library, and the loaded data is
stored in df. The data is shuffled so that training is not affected by any ordering in the file, although shuffling
may not be strictly necessary in this scenario. A random seed is set so that the random operations are
reproducible and produce the same distribution of data on every run.
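A minimal sketch of this step; the file name 'data.csv' comes from the report, while the seed value 321 is an assumption (the report only fixes random_state=321 later, for train_test_split):

```python
np.random.seed(321)                        # assumed seed; the report does not show its value
df = pd.read_csv('data.csv')               # load the dataset into a dataframe
df = df.sample(frac=1, random_state=321)   # shuffle all rows, keeping the original index
```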

Task 2: Apply appropriate data cleaning techniques to the dataset. In this step, replace bad data
using proper methods and do not delete any record except duplicate records. Use Pandas library to
complete this task.

The id column was dropped, as it does not hold any meaningful value for the process.
info() gives a brief insight into the dataframe.
The data was then checked for missing values - there are none.

As the task is to apply Gaussian Naïve Bayes, categorical values need to be converted to continuous values.
The three categorical features in the dataset are binned into 7 ordered bins, from lowest to highest values.
Because the categories have a natural numerical order, they can be interpreted numerically and used for
Gaussian Naïve Bayes with good results. The values were mapped from 0 to 6: the lowest bin to 0 and the
highest to 6. Duplicates were also checked; since no instances were dropped, the dataset contained no
duplicates. With this, all the features have been converted to continuous values.
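A sketch of the cleaning step, assuming the three categorical fields are the perimeter columns (consistent with the dataset description) and that their bin labels sort from lowest to highest; the actual label strings are not shown in the report:

```python
df = df.drop(columns=['id'])       # id carries no predictive information
df.info()                          # brief overview: columns, dtypes, non-null counts
print(df.isnull().sum())           # confirms there are no missing values
df = df.drop_duplicates()          # removes nothing here: the data has no duplicates

# Ordinal-encode the 7 bins of each categorical column to 0.0 .. 6.0.
# Column names and label ordering are assumptions based on the description.
for col in ['perimeter_mean', 'perimeter_se', 'perimeter_worst']:
    codes, _ = pd.factorize(df[col], sort=True)   # assumes labels sort lowest -> highest
    df[col] = codes.astype('float64')
```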

Task 3: Draw graphs to analyze the frequency distributions of the features. Use Matplotlib library to
complete this task. Draw all the plots in a single figure so that all plots can be seen in one diagram
(use subplot() function).

Histograms of all the features, each with at most 30 bins, were plotted in a single figure using matplotlib,
arranged in 6 columns.
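A sketch of the plotting loop; the figure size and title styling are assumptions:

```python
features = df.columns.drop('diagnosis')
n_cols = 6
n_rows = -(-len(features) // n_cols)          # ceiling division

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 3 * n_rows))
for ax, col in zip(axes.flat, features):
    ax.hist(df[col], bins=30)                 # at most 30 bins per feature
    ax.set_title(col, fontsize=8)
for ax in axes.flat[len(features):]:          # hide any unused grid cells
    ax.axis('off')
plt.tight_layout()
plt.show()
```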
Task 4: Draw graphs to illustrate if there is any relationship between target column to any other
columns of the dataset. Use Matplotlib library to complete this task. Also use subplot() function to
show all plots in one figure.

All the features are shown divided into two segments, Benign and Malignant: the sky blue plots are Benign
cases and the orange plots are Malignant cases. The distribution of each feature is shown separately for
Benign and for Malignant instances.
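A sketch of the class-conditional histograms, reusing the grid from Task 3; the overlap transparency is an assumption beyond the sky blue/orange scheme described above:

```python
benign = df[df['diagnosis'] == 'B']
malignant = df[df['diagnosis'] == 'M']

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 3 * n_rows))
for ax, col in zip(axes.flat, features):
    ax.hist(benign[col], bins=30, color='skyblue', alpha=0.7, label='Benign')
    ax.hist(malignant[col], bins=30, color='orange', alpha=0.7, label='Malignant')
    ax.set_title(col, fontsize=8)
for ax in axes.flat[len(features):]:
    ax.axis('off')
plt.tight_layout()
plt.show()
```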
Some features show a clear distinction in values between Benign and Malignant cases, such as
concave_points_worst and radius_mean; these features are vital for classification. On the other hand, some
features, such as fractal_dimension_mean, show no distinction between Benign and Malignant; they are not
significant and can be dropped. Since this is a binary classification problem and all the features are
continuous, point-biserial correlation can be used to determine the significant features.

Benign is mapped to 0 and Malignant is mapped to 1. Every feature is then checked against the class for
correlation, i.e., whether it helps predict whether a case is Benign or Malignant. pointbiserialr() returns the
correlation coefficient and the p-value. If a feature's p-value is greater than 0.05, the feature is considered
insignificant for classification; it is dropped from the dataframe and is not used for classification.
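A sketch of the screening loop that produces the output below; the 0.05 threshold and the B/M mapping are from the report:

```python
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})

insignificant = []
for col in df.columns.drop('diagnosis'):
    r, p = pointbiserialr(df['diagnosis'], df[col])
    print(f"Point-biserial correlation coefficient for {col}: {r:.2f}, p-value: {p:.4f}")
    if p > 0.05:
        print(f"Correlation is statistically insignificant for {col}")
        insignificant.append(col)

df = df.drop(columns=insignificant)   # 5 features removed, 25 remain
```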

Point-biserial correlation coefficient for radius_mean: 0.73, p-value: 0.0000


Point-biserial correlation coefficient for texture_mean: 0.42, p-value: 0.0000
Point-biserial correlation coefficient for perimeter_mean: 0.73, p-value: 0.0000
Point-biserial correlation coefficient for area_mean: 0.71, p-value: 0.0000
Point-biserial correlation coefficient for smoothness_mean: 0.36, p-value: 0.0000
Point-biserial correlation coefficient for compactness_mean: 0.60, p-value: 0.0000
Point-biserial correlation coefficient for concavity_mean: 0.70, p-value: 0.0000
Point-biserial correlation coefficient for concave points_mean: 0.78, p-value: 0.0000
Point-biserial correlation coefficient for symmetry_mean: 0.33, p-value: 0.0000
Point-biserial correlation coefficient for fractal_dimension_mean: -0.01, p-value: 0.7599
Correlation is statistically insignificant for fractal_dimension_mean
Point-biserial correlation coefficient for radius_se: 0.57, p-value: 0.0000
Point-biserial correlation coefficient for texture_se: -0.01, p-value: 0.8433
Correlation is statistically insignificant for texture_se
Point-biserial correlation coefficient for perimeter_se: 0.48, p-value: 0.0000
Point-biserial correlation coefficient for area_se: 0.55, p-value: 0.0000
Point-biserial correlation coefficient for smoothness_se: -0.07, p-value: 0.1103
Correlation is statistically insignificant for smoothness_se
Point-biserial correlation coefficient for compactness_se: 0.29, p-value: 0.0000
Point-biserial correlation coefficient for concavity_se: 0.25, p-value: 0.0000
Point-biserial correlation coefficient for concave points_se: 0.41, p-value: 0.0000
Point-biserial correlation coefficient for symmetry_se: -0.01, p-value: 0.8766
Correlation is statistically insignificant for symmetry_se
Point-biserial correlation coefficient for fractal_dimension_se: 0.08, p-value: 0.0631
Correlation is statistically insignificant for fractal_dimension_se
Point-biserial correlation coefficient for radius_worst: 0.78, p-value: 0.0000
Point-biserial correlation coefficient for texture_worst: 0.46, p-value: 0.0000
Point-biserial correlation coefficient for perimeter_worst: 0.76, p-value: 0.0000
Point-biserial correlation coefficient for area_worst: 0.73, p-value: 0.0000
Point-biserial correlation coefficient for smoothness_worst: 0.42, p-value: 0.0000
Point-biserial correlation coefficient for compactness_worst: 0.59, p-value: 0.0000
Point-biserial correlation coefficient for concavity_worst: 0.66, p-value: 0.0000
Point-biserial correlation coefficient for concave points_worst: 0.79, p-value: 0.0000
Point-biserial correlation coefficient for symmetry_worst: 0.42, p-value: 0.0000
Point-biserial correlation coefficient for fractal_dimension_worst: 0.32, p-value: 0.0000
Task 5: Perform scaling to the features of the dataset. Remember that you will need to apply data
conversion before performing scaling if it is needed.

All the data were already converted to float64 during data cleaning/preprocessing, so no further data
conversion is needed. The features are scaled for better classification results: the data is standardized with
Z-score normalization, where each value has its feature's mean subtracted and the result is divided by that
feature's standard deviation (z = (x - mean) / std).
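A minimal sketch of the standardization, applied to every column except the target:

```python
feature_cols = df.columns.drop('diagnosis')
# Z-score: subtract each feature's mean, divide by its standard deviation
df[feature_cols] = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
print(df)
```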

Scaled features:

diagnosis radius_mean texture_mean perimeter_mean area_mean \


295 0 -0.351099 -0.834601 -0.681320 -0.392963
16 0 -0.138276 -0.685800 0.136551 -0.236106
431 0 0.276020 -0.674174 0.136551 0.055726
453 0 -0.745532 -0.195220 -0.681320 -0.703266
15 0 -0.915791 -1.471657 -0.681320 -0.818635
.. ... ... ... ... ...
440 0 -0.802285 -0.255671 -0.681320 -0.754414
165 0 -0.754045 -0.757875 -0.681320 -0.716621
7 0 -0.705805 -0.223120 -0.681320 -0.688773
219 0 -0.944167 -2.227289 -0.681320 -0.844777
326 1 3.771999 1.622947 3.408035 5.245913

smoothness_mean compactness_mean concavity_mean concave points_mean \


295 -1.292670 -0.161722 0.284756 -0.387063
16 -1.387237 -0.828417 -0.880952 -0.816671
431 1.325338 1.445844 0.313607 0.938613
453 -0.206929 -0.841293 -0.782984 -0.727502
15 -1.508112 -1.271681 -1.075132 -1.090929
.. ... ... ... ...
440 -0.031305 0.533186 0.827908 -0.525197
165 -0.398196 -0.861174 -0.789381 -0.662301
7 1.268455 -0.050007 -0.227037 -0.362580
219 -0.029883 -0.889576 -0.796406 -0.823114
326 0.856059 1.788564 3.445827 3.092063

symmetry_mean ... radius_worst texture_worst perimeter_worst \


295 -1.384748 ... -0.490187 -0.974125 -0.421982
16 -1.672919 ... -0.330873 -0.404673 -0.421982
431 0.690813 ... -0.032936 -1.195398 -0.421982
453 0.081641 ... -0.682604 -0.523444 -0.421982
15 -1.348270 ... -0.808813 -1.216549 -1.264466
.. ... ... ... ... ...
440 0.884143 ... -0.763295 0.371409 -0.421982
165 -0.647905 ... -0.777778 -0.795154 -0.421982
7 -0.038734 ... -0.647431 0.582920 -0.421982
219 -1.570782 ... -0.966058 -2.222039 -1.264466
326 0.909677 ... 4.090590 0.926218 3.790436

area_worst smoothness_worst compactness_worst concavity_worst \


295 -0.500535 -1.450069 -0.143419 0.298199
16 -0.393221 -1.027862 -0.610571 -0.801386
431 -0.207222 0.272919 0.216320 -0.365195
453 -0.652812 -0.616167 -0.949335 -0.916185
15 -0.721135 -0.668724 -1.088909 -1.215815
.. ... ... ... ...
440 -0.716041 0.102109 1.465235 2.259620
165 -0.710948 0.907981 -0.904209 -0.833836
7 -0.630331 1.595599 0.074585 0.072434
219 -0.819491 0.491906 -0.817134 -0.802824
326 5.924959 0.145907 1.088972 1.970583

concave points_worst symmetry_worst fractal_dimension_worst


295 -0.196345 -1.457560 -0.701823
16 -0.437322 -0.896684 -0.204627
431 0.421311 -0.502293 -0.340830
453 -0.747976 -0.259839 -1.056173
15 -1.142150 -0.263072 -0.392875
.. ... ... ...
440 0.109440 0.658253 2.533276
165 -0.747368 -0.080423 0.203983
7 0.109440 -0.153159 0.388909
219 -1.043265 -1.310472 -0.385123
326 2.249939 -0.419858 -0.535722

[569 rows x 26 columns]

25 features and 1 class/target. 569 instances.


Task 6: Split your data into two parts: Training dataset and Testing dataset. You must use the
function train_test_split() to complete this task and use value 321 as the value of the random_state
parameter of this function.

The dataframe is divided into train and test sets with 80% training and 20% testing data. There are 25
features and 1 binary class/target; the training set has 455 instances and the test set has 114.

X holds the features and y holds the class labels; both are passed to train_test_split(), which returns the
split train and test X (features) and y (class).
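A sketch of the split; test_size=0.2 follows the 80/20 split described above, and random_state=321 is required by the task:

```python
X = df.drop(columns=['diagnosis'])    # 25 features
y = df['diagnosis']                   # binary target: 0 = Benign, 1 = Malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=321)
print(X_train.shape, X_test.shape)    # (455, 25) (114, 25)
```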

Task 7: Apply Naïve Bayes Classifier to the dataset. Build (train) your prediction model in this step.

An object of GaussianNB(), provided by the sklearn library, is created. The model is fitted (trained) using the
training features and training class labels produced by the previous split, and then predicts the class for the
test features. The predictions are compared with the stored test class labels. A classification report is also
produced, showing how good the prediction is on various criteria such as precision, recall and F1-score.
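A sketch of training and prediction with GaussianNB:

```python
gnb = GaussianNB()
gnb.fit(X_train, y_train)             # train on the 455 training instances
y_pred = gnb.predict(X_test)          # predict on the 114 test instances
print(classification_report(y_test, y_pred,
                            target_names=['Benign', 'Malignant']))
```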
Task 8: Calculate the confusion matrix for your model. Interpret it in detail in the report.

The confusion matrix can be calculated using functions provided by sklearn.

0: Benign

1: Malignant
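A sketch of the computation; with the default label order [0, 1], rows are true classes and columns are predicted classes:

```python
cm = confusion_matrix(y_test, y_pred)
print(cm)   # [[TN, FP], [FN, TP]] = [[62, 6], [4, 42]]
```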
True negative (top left): 62 of 68 true Benign cases were predicted as Benign (91%).

False positive (top right): 6 of 68 true Benign cases were predicted as Malignant (9%).

True positive (bottom right): 42 of 46 true Malignant cases were predicted as Malignant (91%).

False negative (bottom left): 4 of 46 true Malignant cases were predicted as Benign (9%).

In this scenario, false negatives can be dangerous: roughly 9% of true Malignant cases are predicted as
Benign, which might be life threatening for the patient.

Task 9: Calculate the train and test accuracy of your model and compare them.

gnb.predict() is applied to the features of both the train and test sets, and the resulting predictions are
checked against the corresponding class labels, which gives the train and test accuracy. Train accuracy is
93% and test accuracy is 91%; the two are close, which indicates the model generalizes well rather than
overfitting the training data.
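A sketch of the comparison:

```python
train_acc = accuracy_score(y_train, gnb.predict(X_train))
test_acc = accuracy_score(y_test, gnb.predict(X_test))
print(f"Train accuracy: {train_acc:.0%}")   # ~93%
print(f"Test accuracy:  {test_acc:.0%}")    # ~91%
```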
Task 10: Show how 10-fold cross validation can be used to build a naïve bayes classifier and report
the accuracy of this model.

The object gnb of GaussianNB() is used for k-fold cross validation with k set to 10 folds. The entire dataset
is divided into 10 smaller portions; in each round, one portion is held out as test data and the rest are used
as training data. cross_val_score() from the sklearn library computes the cross-validation accuracy scores.
The accuracies of all 10 train/test splits are printed as the cross validation scores, and the mean accuracy
is 93%.
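A sketch of the cross validation; cross_val_score() clones the estimator internally, so the previously fitted gnb can be passed directly:

```python
scores = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')
print("Cross validation scores:", scores)     # one accuracy per fold
print(f"Mean accuracy: {scores.mean():.0%}")  # ~93%
```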

Conclusion: Gaussian Naïve Bayes has the potential to be a reliable classifier for breast cancer diagnosis
based on fine needle aspirate (FNA) features. The model exhibits strong predictive capability and
generalizes well to unseen data, as evidenced by both the test set accuracy and the cross-validation results,
achieving 91-93% accuracy. However, about 9% of Malignant cases are classified as Benign, which is
potentially hazardous given the severity of the disease. Further studies could focus on reducing these false
negatives (Malignant cases classified as Benign) so that the solution does not pose a threat to patients upon
deployment.
