python_final_project_group_03
Dataset Title: Breast Cancer Wisconsin (Diagnostic) Dataset (Modified)
Dataset Description: The dataset provides a diagnosis label derived from the features. With diagnosis as the target, the dataset can be used for binary classification of breast cancer cases as Benign or Malignant. The diagnosis field is categorical with 2 possible values, Benign ('B') and Malignant ('M'). All other fields are continuous except perimeter, which is categorical.
Dataset features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.
The separating plane described in the original dataset documentation was obtained using Multisurface Method-Tree (MSM-T) [K. P.
Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest
Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method
which uses linear programming to construct a decision tree. Relevant features were selected using
an exhaustive search in the space of 1-4 features and 1-3 separating planes.
The actual linear program used to obtain the separating plane in the 3-dimensional space is that
described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of
Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].
Attribute Information:
1) ID number
2) Diagnosis (M = Malignant, B = Benign)
3-32) Ten features are computed for each cell nucleus:
a) radius
b) texture
c) perimeter
d) area
e) smoothness
f) compactness
g) concavity
h) concave points
i) symmetry
j) fractal dimension
The mean, standard error and "worst" or largest (mean of the three largest values) of these features
were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
All continuous feature values are recorded with four significant digits.
Task 1: Read/Load the dataset file in your program. Use Pandas library to complete this task.
Data is read from the 'data.csv' file using read_csv() provided by the pandas library, and the loaded data is stored in df. The data is shuffled so that training is not affected by any ordering in the file, although shuffling may not be strictly necessary in this scenario. A random seed is set so that the random operations are reproducible and yield the same distribution of data on every run.
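A minimal sketch of this step, assuming the file name data.csv from the report and a hypothetical seed value (the report does not state which seed was used):

    import pandas as pd

    SEED = 42  # hypothetical seed; the report only states that a seed is set

    # Load the dataset into a dataframe.
    df = pd.read_csv("data.csv")

    # Shuffle all rows reproducibly; frac=1 returns every row in random order.
    df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)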
Task 2: Apply appropriate data cleaning techniques to the dataset. In this step, replace bad data
using proper methods and do not delete any record except duplicate records. Use Pandas library to
complete this task.
The id column was dropped, as it does not hold any meaningful value for the classification process. df.info() gives a brief overview of the dataframe (column dtypes and non-null counts). Missing values were then checked for each column.
Since the task is Gaussian Naïve Bayes, categorical values need to be converted to continuous values. The 3 categorical features in the dataset are binned into 7 ordinal bins from the lowest to the highest values. Because these categories have a natural numerical ordering, they can be interpreted numerically and used for Gaussian Naïve Bayes with good results. The values were mapped from 0 to 6: the lowest bin was mapped to 0 and the highest to 6. No duplicate records were found in the dataset, so no instances were dropped. After this step, all the features can be interpreted as continuous.
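A sketch of these cleaning steps might look as follows. The ordinal bin labels and the perimeter column names are assumptions; the report only states that 3 categorical features were binned into 7 ordered levels mapped to 0-6:

    # Drop the id column; it carries no predictive value.
    df = df.drop(columns=["id"])

    df.info()                    # brief overview: dtypes, non-null counts
    print(df.isnull().sum())     # count missing values per column

    # Hypothetical ordinal labels for the 3 categorical perimeter columns.
    bin_order = ["lowest", "lower", "low", "medium", "high", "higher", "highest"]
    mapping = {label: i for i, label in enumerate(bin_order)}  # lowest -> 0 ... highest -> 6
    for col in ["perimeter_mean", "perimeter_se", "perimeter_worst"]:
        df[col] = df[col].map(mapping)

    # Remove duplicate records (none were found in this dataset).
    df = df.drop_duplicates()

    # Represent every feature as a continuous float64 value.
    feature_cols = [c for c in df.columns if c != "diagnosis"]
    df[feature_cols] = df[feature_cols].astype("float64")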
Task 3: Draw graphs to analyze the frequency distributions of the features. Use Matplotlib library to
complete this task. Draw all the plots in a single figure so that all plots can be seen in one diagram
(use subplot() function).
Histograms of all the features, with at most 30 bins each, were plotted in a single figure using Matplotlib, arranged in 6 columns of subplots.
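A sketch of the plotting code, assuming the feature columns are everything except the diagnosis column:

    import matplotlib.pyplot as plt

    features = [c for c in df.columns if c != "diagnosis"]
    n_cols = 6
    n_rows = -(-len(features) // n_cols)   # ceiling division

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 3 * n_rows))
    for ax, col in zip(axes.ravel(), features):
        ax.hist(df[col], bins=30)          # at most 30 bins per feature
        ax.set_title(col, fontsize=8)
    for ax in axes.ravel()[len(features):]:
        ax.axis("off")                     # hide unused subplot cells
    fig.tight_layout()
    plt.show()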
Task 4: Draw graphs to illustrate if there is any relationship between target column to any other
columns of the dataset. Use Matplotlib library to complete this task. Also use subplot() function to
show all plots in one figure.
Each feature's distribution is shown split into two segments, Benign and Malignant: sky blue plots are for Benign cases and orange plots are for Malignant cases.
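A sketch of the per-class plots, reusing the subplot layout from the previous step and assuming the diagnosis column still holds the 'B'/'M' labels at this point:

    benign = df[df["diagnosis"] == "B"]
    malignant = df[df["diagnosis"] == "M"]

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 3 * n_rows))
    for ax, col in zip(axes.ravel(), features):
        ax.hist(benign[col], bins=30, color="skyblue", alpha=0.7, label="Benign")
        ax.hist(malignant[col], bins=30, color="orange", alpha=0.7, label="Malignant")
        ax.set_title(col, fontsize=8)
    for ax in axes.ravel()[len(features):]:
        ax.axis("off")
    axes.ravel()[0].legend()
    fig.tight_layout()
    plt.show()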
Some features show a clear distinction in values between Benign and Malignant cases, such as concave_points_worst and radius_mean; these features are vital for classification. On the other hand, some features, such as fractal_dimension_mean, show no distinction between Benign and Malignant; they are not significant and can be dropped. Since this is a binary classification problem and all the features are continuous, point-biserial correlation can be used to determine the significant features.
Benign is mapped to 0 and Malignant is mapped to 1. Each feature is checked against the class for whether it has any correlation with a case being Benign or Malignant. pointbiserialr() returns a correlation coefficient and a p-value. If a feature's p-value is greater than 0.05, the feature is considered insignificant for classification and is dropped from the dataframe, so it is not used for classification.
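A sketch of this feature-selection step using scipy.stats.pointbiserialr:

    from scipy.stats import pointbiserialr

    # Map the target to 0/1 as described above.
    df["diagnosis"] = df["diagnosis"].map({"B": 0, "M": 1})

    insignificant = []
    for col in features:
        r, p = pointbiserialr(df["diagnosis"], df[col])
        if p > 0.05:                 # not significant at the 5% level
            insignificant.append(col)

    # Drop features that show no significant correlation with the class.
    df = df.drop(columns=insignificant)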
All the data had already been converted to float64 during the data cleaning/preprocessing step.
The features need to be scaled for better classification results. The data is standardized with Z-score normalization: each value has its feature's mean subtracted, and the result is divided by that feature's standard deviation, i.e. z = (x - mean) / std.
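A sketch of the standardization, applied column-wise with pandas:

    # Z-score normalization: z = (x - mean) / std, per feature column.
    feature_cols = [c for c in df.columns if c != "diagnosis"]
    df[feature_cols] = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()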
The whole dataframe is divided into train and test sets, with 80% train and 20% test data. There are 25 features and 1 binary class/target; the training set has 455 instances and the test set has 114 instances. X holds the features and y holds the class labels. They are passed to train_test_split(), which returns the split train and test X (features) and y (class).
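A sketch of the split, reusing the seed assumed earlier for reproducibility:

    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["diagnosis"])   # the 25 surviving features
    y = df["diagnosis"]                  # binary target: 0 = Benign, 1 = Malignant

    # 80% train / 20% test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=SEED
    )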
Task 7: Apply Naïve Bayes Classifier to the dataset. Build (train) your prediction model in this step.
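A minimal sketch of this step with scikit-learn's GaussianNB, assuming the train/test split from the previous step; the confusion matrix discussed below is computed on the test set:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import confusion_matrix

    # Fit Gaussian Naive Bayes on the training data.
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)

    # Rows of the matrix are true labels, columns are predicted labels.
    y_pred = gnb.predict(X_test)
    print(confusion_matrix(y_test, y_pred))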
In the resulting confusion matrix, 0 denotes Benign and 1 denotes Malignant.
True negative (top left): 62 out of 68 true Benign cases were predicted as Benign (91%).
False positive (top right): 6 out of 68 true Benign cases were predicted as Malignant (9%).
True positive (bottom right): 42 out of 46 true Malignant cases were predicted as Malignant (91%).
False negative (bottom left): 4 out of 46 true Malignant cases were predicted as Benign (9%).
In this scenario, false negatives can be dangerous: about 9% of the time a Malignant case is predicted as Benign, which could be life-threatening for the patient.
Task 9: Calculate the train and test accuracy of your model and compare them.
gnb.predict() is called on the features of the train and test sets to obtain predictions. Those predictions are then checked against the class labels of the train and test data, which yields the train and test accuracy. Train accuracy is 93% and test accuracy is 91%. The two are close, which suggests the model generalizes well rather than overfitting the training data.
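A sketch of the accuracy computation:

    from sklearn.metrics import accuracy_score

    train_acc = accuracy_score(y_train, gnb.predict(X_train))
    test_acc = accuracy_score(y_test, gnb.predict(X_test))
    print(f"Train accuracy: {train_acc:.2%}")   # ~93% in the report
    print(f"Test accuracy:  {test_acc:.2%}")    # ~91% in the report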
Task 10: Show how 10-fold cross validation can be used to build a naïve bayes classifier and report
the accuracy of this model.
The GaussianNB() object gnb is used for k-fold cross-validation, with k set to 10 folds. The entire dataset is divided into 10 smaller portions; each time, one portion is held out as test data and the rest are used as training data. cross_val_score() from the sklearn library is used to compute the cross-validation accuracy scores. The accuracies of all 10 train/test splits are printed as the cross-validation scores, and the mean accuracy is 93%.
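A sketch of the cross-validation step:

    from sklearn.model_selection import cross_val_score

    # 10-fold cross-validation over the full feature matrix and target.
    scores = cross_val_score(gnb, X, y, cv=10)
    print("Cross-validation scores:", scores)
    print(f"Mean accuracy: {scores.mean():.2%}")   # ~93% in the report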
Conclusion: Gaussian Naïve Bayes has the potential to be a reliable classifier for breast cancer diagnosis based on fine needle aspirate (FNA) features. The model exhibits strong predictive capability and generalizes well to unseen data, as evidenced by both the test set accuracy and the cross-validation results, achieving 91-93% accuracy. However, about 9% of Malignant cases are classified as Benign, which is potentially hazardous given the severity of the disease. Further work should focus on reducing false negatives (Malignant cases classified as Benign) so that the solution does not pose a risk to patients upon deployment.