4 Exploratory Data Analysis.
12) Exercise:
1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle account to download the data.
(https://www.kaggle.com/gilsousa/habermans-survival-data-set)
2. Perform a similar analysis as above on this dataset with the following sections:
High level statistics of the dataset: number of points, number of features, number of classes, data-points per class.
Explain our objective.
Perform Univariate analysis (PDF, CDF, Boxplot, Violin plots) to understand which features are useful towards
classification.
Perform Bi-variate analysis (scatter plots, pair-plots) to see if combinations of features are useful in classification.
Write your observations in English as crisply and unambiguously as possible. Always quantify your results.
Objective
To analyse the survival of patients who had undergone surgery for breast cancer and to predict their survival status.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
haberman = pd.read_csv("haberman.csv")
haberman.head(10)
Out[40]:
   age  year  nodes  status
0   30    64      1       1
1   30    62      3       1
2   30    65      0       1
3   31    59      2       1
4   31    65      4       1
5   33    58     10       1
6   33    60      0       1
7   34    59      0       2
8   34    66      9       2
9   34    58     30       1
(306, 4)
Out[43]: 1 225
2 81
Name: status, dtype: int64
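The shape and the per-class counts shown above could be produced along the following lines. This is a minimal sketch, not the notebook's original cells: it assumes the columns are named as in the `head()` output and, to stay self-contained, uses only the ten rows printed there (the name `sample` is introduced here for illustration). On the full `haberman` DataFrame the same calls would apply.

```python
import pandas as pd

# First ten rows of the dataset, copied from the head() output above.
# In the notebook, the full `haberman` DataFrame would be used instead.
sample = pd.DataFrame({
    "age":    [30, 30, 30, 31, 31, 33, 33, 34, 34, 34],
    "year":   [64, 62, 65, 59, 65, 58, 60, 59, 66, 58],
    "nodes":  [1, 3, 0, 2, 4, 10, 0, 0, 9, 30],
    "status": [1, 1, 1, 1, 1, 1, 1, 2, 2, 1],
})

n_points, n_columns = sample.shape   # rows, columns
n_features = n_columns - 1           # every column except the class label `status`
class_counts = sample["status"].value_counts()                 # data points per class
class_share = sample["status"].value_counts(normalize=True)    # same, as fractions

print("points:", n_points, "features:", n_features)
print("classes:", sample["status"].nunique())
print(class_counts)
print(class_share.round(3))
```

Reporting the class shares next to the raw counts makes it easier to see how imbalanced the classes are.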
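Observation(s):
1. The dataset has 306 data points and 4 columns: 3 features (age, year, nodes) and the class label (status).
2. There are 2 classes: 225 patients have status 1 and 81 patients have status 2, so the dataset is imbalanced.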
haberman.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
dtypes: int64(4)
memory usage: 9.7 KB
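Observation(s):
1. None of the 4 columns has missing values (306 non-null entries each), and all of them are of type int64.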
haberman.describe()
Out[45]:
age year nodes status
Observation(s):
In [48]: sns.set_style("whitegrid");
# `height` replaces the older `size` argument of FacetGrid (renamed in seaborn 0.9)
sns.FacetGrid(haberman, hue="status", height=4) \
    .map(plt.scatter, "age", "status") \
    .add_legend();
plt.suptitle('2D Scatter plot(colored)')
plt.show();
Observation(s):
1. Patients younger than 40 are slightly more likely to live more than 5 years.
2. For patients older than 40, survival status appears to be independent of age.
Pair-plot
Pair plot for bivariate analysis
Observation(s):
1. The classes overlap heavily; the patients who did not survive more than 5 years are mostly in the age range 45-65.
Observation(s):
1. The distributions mostly overlap, so the chance of survival is largely independent of age. However, patients aged 30
to 40 have a better chance of survival than patients older than 40.
2. Age alone cannot decide the survival status.
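In [27]: # sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.FacetGrid(haberman, hue='status', height=5) \
    .map(sns.histplot, 'year', kde=True, stat='density') \
    .add_legend();
plt.suptitle('PDF of year');
plt.show();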
Observation(s):
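In [28]: # sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
sns.FacetGrid(haberman, hue='status', height=5) \
    .map(sns.histplot, 'nodes', kde=True, stat='density') \
    .add_legend();
plt.suptitle('PDF of nodes');
plt.show();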
Observation(s):
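# Patients with status 2 (did not survive)
status_2 = haberman[haberman['status'] == 2]
# Histogram of the node counts over 10 equal-width bins
counts_2, bin_edges_2 = np.histogram(status_2['nodes'], bins=10, density=True)
# Normalise so that the bin values sum to 1 (fraction of patients per bin)
pdf_2 = counts_2 / counts_2.sum()
print(pdf_2);
print(bin_edges_2);
# Running total of the PDF gives the CDF
cdf_2 = np.cumsum(pdf_2)
plt.plot(bin_edges_2[1:], pdf_2, label='PDF (not survived)');
plt.plot(bin_edges_2[1:], cdf_2, label='CDF (not survived)')
plt.xlabel('nodes')
plt.suptitle('PDF and CDF of nodes')
plt.legend()
plt.show()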
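To quantify what the CDF shows, the same histogram computation can be wrapped in a small helper and read off bin by bin. The sketch below is illustrative only: `histogram_pdf_cdf` is a hypothetical helper name, and the `nodes` values are the ten shown in the `head()` output rather than the full column. It also checks that, for equal-width bins, normalising the `density=True` output by its sum gives the same per-bin fractions as normalising the raw counts.

```python
import numpy as np

def histogram_pdf_cdf(values, bins=10):
    """Return per-bin fractions (PDF), their running total (CDF) and the bin edges."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points falling in each bin
    cdf = np.cumsum(pdf)          # fraction of points up to the end of each bin
    return pdf, cdf, bin_edges

# `nodes` values of the first ten rows, copied from the head() output above.
nodes = np.array([1, 3, 0, 2, 4, 10, 0, 0, 9, 30])
pdf, cdf, edges = histogram_pdf_cdf(nodes, bins=10)

# With equal-width bins, density=True followed by division by the sum
# yields the same fractions as dividing the raw counts by their sum.
density, _ = np.histogram(nodes, bins=10, density=True)
same_pdf = np.allclose(density / density.sum(), pdf)

# Each bin is half-open [left, right), except the last one, which includes its right edge.
for left, right, share in zip(edges[:-1], edges[1:], cdf):
    print(f"bins up to [{left:.0f}, {right:.0f}): cumulative share {share:.0%}")
```

Reading the cumulative share at a given bin gives statements of the form "x% of the patients in this class have fewer than n nodes", which is the kind of quantified observation the exercise asks for.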
Observation(s):
Box plot
In [32]: sns.boxplot(x='status',y='age', data=haberman)
plt.suptitle('Box plot for age');
plt.show()
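The numbers behind a box plot (quartiles and interquartile range per class) can also be computed directly, which helps when quantifying observations. The following is a minimal sketch using only the `age` and `status` values of the ten rows shown in the `head()` output; the names `sample` and `quartiles` are introduced here for illustration.

```python
import pandas as pd

# `age` and `status` of the first ten rows, copied from the head() output above.
sample = pd.DataFrame({
    "age":    [30, 30, 30, 31, 31, 33, 33, 34, 34, 34],
    "status": [1, 1, 1, 1, 1, 1, 1, 2, 2, 1],
})

# 25th, 50th and 75th percentile of age for each class: the edges and the
# centre line of the boxes drawn by sns.boxplot(x='status', y='age', ...).
quartiles = sample.groupby("status")["age"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]  # height of each box

print(quartiles)
```

On the full dataset the same call on `haberman` would give the quartiles for all 306 patients.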
Violin plots
Violin plot 1 (Age)
1. Most of the patients who died within 5 years were in the age group 45 to 65.
2. Age alone cannot decide the survival status.
3. The plot shows a lot of overlap, but roughly speaking a significant number of the patients operated on in the years
1958 to 1960 and 1963 to 1965 died.
4. Patients with survival status 1 have fewer nodes than patients with status 2, which means that patients with more nodes
have a lower chance of survival.
5. Most of the patients who survived have zero nodes, but many patients with zero nodes also died within 5 years, so the
absence of nodes does not guarantee survival.
Observation(s):
1. In the years 1958 to 1964, operations were performed mostly on patients aged 45 to 55.
Conclusions:
1. The chance of survival is lower when the number of positive axillary nodes is higher, but the absence of positive axillary
nodes does not guarantee survival.
2. Age alone cannot decide the chance of survival, although patients younger than 35 have a better chance of survival.
3. The year of operation does not play a major role in deciding the chance of survival.
4. The dataset is imbalanced and the classes overlap on many features, so the survival status cannot be inferred directly.
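The imbalance mentioned in the last conclusion can be stated in numbers. The short sketch below uses only the class counts printed earlier (225 patients with status 1 and 81 with status 2); the variable names are introduced here for illustration.

```python
# Class counts taken from the value_counts() output earlier in the notebook.
count_status_1 = 225
count_status_2 = 81

total = count_status_1 + count_status_2
share_status_1 = count_status_1 / total
share_status_2 = count_status_2 / total
imbalance_ratio = count_status_1 / count_status_2  # status 1 patients per status 2 patient

print(f"status 1: {share_status_1:.1%}, status 2: {share_status_2:.1%}")
print(f"ratio: {imbalance_ratio:.2f} to 1")
```

Roughly three out of four patients belong to status 1, so a model that always predicts status 1 would already be right about 73.5% of the time; any classifier built on these features should be judged against that figure.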