Breast Cancer Classification With Machine Learning
Problem Statement
Determine whether a breast tumor is benign or malignant from measurements of the cell nuclei.
In [1]: import warnings
warnings.filterwarnings('ignore')
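The cell that imports the core libraries and loads the data did not survive the export; a minimal reconstruction, assuming the Wisconsin Diagnostic Breast Cancer CSV is stored locally as data.csv (the file name is an assumption):

In [2]: #importing the core libraries and loading the dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')
df.head()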
Out[2]: (first five rows of the dataframe; columns start with id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, …; 5 rows × 33 columns)
Attribute Information:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
The mean, standard error, and "worst" (mean of the three largest values) of these
features were computed for each image, resulting in 30 features.
In [3]: df.shape
Out[3]: (569, 33)
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
In [5]: plt.figure(figsize=(5,5))
ax = sns.countplot(x=df['diagnosis'])
#annotating each bar with its count
for p in ax.patches:
    ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x() + 0.25, p.get_height()))
plt.savefig('count_plot.jpg')
plt.show()
There is multicollinearity in this dataset: several features show strong positive
correlation (for example, radius_mean, perimeter_mean, and area_mean move together).
In [11]: #finding correlated features
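#the body of this cell was truncated in the export; a plausible
#reconstruction that visualizes pairwise correlations with a heatmap
corr = df.drop(columns=['id', 'Unnamed: 32', 'diagnosis']).corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm')
plt.show()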
Data Preprocessing
In [13]: #making copy of dataframe for preprocessing
data = df.copy()
Out[15]: 0
Dealing with Multicollinearity
In [17]: data['diagnosis'].unique()
Out[17]: array(['M', 'B'], dtype=object)
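The cell that removes the correlated columns did not survive the export; a sketch of one standard approach (the 0.9 threshold and the dropping of id / Unnamed: 32 are assumptions; the head() output below shows which columns actually remained):

#dropping the non-informative columns first (assumed)
data = data.drop(columns=['id', 'Unnamed: 32'])

#dropping one feature from every highly correlated pair
corr = data.drop('diagnosis', axis=1).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
data = data.drop(columns=to_drop)
data.head()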
Out[18]: (first five rows after dropping the highly correlated columns; remaining columns: diagnosis, radius_mean, texture_mean, smoothness_mean, compactness_mean, symmetry_mean, …)
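The encoding step itself is also missing from the export; a one-line reconstruction that matches the convention stated below:

#encoding the target: M (malignant) -> 1, B (benign) -> 0
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})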
In [19]: data['diagnosis'].unique()
Out[19]: array([1, 0])
1 represents Malignant
0 represents Benign
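The train/test split cell was lost in the export; a sketch that reproduces the 455/114 split reported next (test_size=0.2 of 569 rows gives 114 test rows; random_state is an assumption):

from sklearn.model_selection import train_test_split

#separating features and target
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

#80/20 split: 455 train rows, 114 test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)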
There are 455 observations in the train set and 114 observations in the test set.
In [23]: X_train.head()
Out[23]:
(first five rows of X_train; columns: radius_mean, texture_mean, smoothness_mean, compactness_mean, symmetry_mean, fractal_dimension_mean, …)
Feature scaling
In [24]: from sklearn.preprocessing import StandardScaler

#fit on the train set only; apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [25]: X_train
1. Logistic regression
Model training
In [26]: from sklearn.linear_model import LogisticRegression

logReg = LogisticRegression(random_state=5)
#training the model with train set
logReg.fit(X_train, y_train)

Out[26]: LogisticRegression(random_state=5)
Model evaluation
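The evaluation cell for logistic regression was lost in the export; a minimal sketch using the same metrics that appear later in the notebook (the results table layout and variable name are assumptions):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#predictions on the held-out test set
y_pred = logReg.predict(X_test)

#collecting the metrics in a small table
results = pd.DataFrame({'accuracy': [accuracy_score(y_test, y_pred)],
                        'precision': [precision_score(y_test, y_pred)],
                        'recall': [recall_score(y_test, y_pred)],
                        'f1 score': [f1_score(y_test, y_pred)]})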
results
Cross validation
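The cross-validation cell is missing from the export; a sketch consistent with the mean and standard deviation printed below (the fold count cv=10 is an assumption):

from sklearn.model_selection import cross_val_score

#10-fold cross validation of the logistic regression on the training set
accuracies = cross_val_score(estimator=logReg, X=X_train, y=y_train, cv=10)
print("Accuracy is {:.2f} %".format(accuracies.mean() * 100))
print("Standard Deviation is {:.2f} %".format(accuracies.std() * 100))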
Accuracy is 95.84 %
Standard Deviation is 2.28 %
2. Random forest

Model Training

In [31]: from sklearn.ensemble import RandomForestClassifier

ranForest = RandomForestClassifier(random_state=5)
#training the model with train set
ranForest.fit(X_train, y_train)

Out[31]: RandomForestClassifier(random_state=5)
Model Evaluation
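As with logistic regression, the evaluation cell was lost; presumably the same metrics were computed on the random forest's test-set predictions, e.g. (the variable name y_pred_rf is ours):

#test-set predictions of the random forest
y_pred_rf = ranForest.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred_rf))
print('F1 score:', f1_score(y_test, y_pred_rf))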
Cross Validation
In [36]: from sklearn.model_selection import cross_val_score
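#the rest of this cell was truncated; a plausible completion (cv=10 assumed)
accuracies = cross_val_score(estimator=ranForest, X=X_train, y=y_train, cv=10)
print("Accuracy is {:.2f} %".format(accuracies.mean() * 100))
print("Standard Deviation is {:.2f} %".format(accuracies.std() * 100))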
Accuracy is 94.28 %
Standard Deviation is 3.01 %
Hyperparameter Tuning
In [37]: #specifying different hyperparameters for random search cross validation
from sklearn.model_selection import RandomizedSearchCV
#the tail of the C list was truncated; values past 0.75 are assumed, but 1.5
#must be present since it is reported below as the best value
params = {'penalty': ['l1', 'l2', 'elasticnet', 'none'],
          'C': [0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5],
          'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
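The cell that instantiates and fits the search did not survive the export; a minimal sketch (n_iter, cv, and scoring are assumptions):

random_search = RandomizedSearchCV(estimator=LogisticRegression(random_state=5),
                                   param_distributions=params,
                                   n_iter=20, cv=10, scoring='accuracy',
                                   random_state=5)
#incompatible penalty/solver pairs simply fail and are skipped
#(warnings were silenced at the top of the notebook)
random_search.fit(X_train, y_train)
random_search.best_estimator_   #displayed as Out[39] below
random_search.best_score_       #displayed as Out[40] below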
Out[39]: LogisticRegression(C=1.5, penalty='none', random_state=5, solver='saga')
Out[40]: 0.9915658504781224
The best hyperparameter values found were solver = 'saga', penalty = 'none', and
regularization parameter C = 1.5.
Final Model
In [42]: #training the model with best hyperparameters
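#the body of this cell was truncated; a reconstruction from the reported
#best values (solver='saga', penalty='none', C=1.5)
classifier = LogisticRegression(C=1.5, penalty='none', solver='saga', random_state=5)
classifier.fit(X_train, y_train)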
Out[42]: LogisticRegression(C=1.5, penalty='none', random_state=5, solver='saga')
Model evaluation
In [44]: #calculating evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#predictions of the tuned model on the test set
y_pred = classifier.predict(X_test)

acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
In [69]: sample_obs = [11.13, 16.62, 0.08151, 0.03834, 0.1511, 0.06148, 0.1415, 0.9671, 0.00, …]  #list truncated in the export
#making prediction
classifier.predict(scaler.transform([sample_obs]))
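Given the earlier encoding (1 represents Malignant, 0 represents Benign), the returned array indicates whether the model classifies this observation as malignant or benign.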