Diabetic Prediction Using Logistic Regression
June 8, 2023
a) Load the dataset using pandas: Use the pandas library to load the dataset from the
‘diabetes.csv’ file.
[1]: # Cell [1] reconstructed: these imports are required by the cells below.
     import pandas as pd
     from sklearn.model_selection import train_test_split
     from sklearn.linear_model import LogisticRegression

[2]: data = pd.read_csv("diabetes.csv")
     data.head(5)
b) Exploring the dataset: Exploring the dataset is crucial for understanding its structure, identifying missing values, data distribution, correlations, and outliers, as well as for making informed decisions regarding data preprocessing and feature selection. It provides insights that guide data analysis and model building.
[3]: data.shape # Checking the number of rows and columns in the dataset
[3]: (768, 9)
[4]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
data.info() - Displays information about the dataset, including the number of entries, the
number of columns, the column names, the data type of each column, and any missing values.
This helps in understanding the structure and properties of the dataset.
[5]: data.describe()
data.describe() - Computes descriptive statistics for the numerical columns in the dataset. The
statistics include count, mean, standard deviation, minimum value, 25th percentile (Q1), median
(50th percentile or Q2), 75th percentile (Q3), and maximum value. This provides a summary of the
central tendency, spread, and distribution of the numerical data.
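One caveat worth surfacing with describe(): several physiological columns in this dataset contain zeros that cannot be real measurements (e.g. a Glucose or BloodPressure of 0), and such zeros typically stand in for missing values. A minimal sketch of how to count them; the small sample DataFrame below is made up for illustration so the snippet runs on its own (in the notebook you would run the same check on `data`):

```python
import pandas as pd

# Illustrative sample only -- in the notebook, run the same check on `data`.
# Zeros in Glucose, BloodPressure, or Insulin are physically implausible and
# usually indicate missing measurements.
sample = pd.DataFrame({
    "Glucose":       [148, 85, 183, 0, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "Insulin":       [0, 0, 0, 94, 168],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1],
})
print((sample == 0).sum())  # count of zero entries per column
```

Checking the "min" row of describe() output is a quick way to spot which columns are affected.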
[6]: data.isnull().sum() # Checking if the dataset contains any null values.
[6]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
[7]: num_duplicates = data.duplicated().sum()  # count duplicate rows

     if num_duplicates > 0:
         print(f"The dataset contains {num_duplicates} duplicate values")
         data = data.drop_duplicates()  # drop_duplicates must be called, not just referenced
         print("Number of duplicate values after dropping:", data.duplicated().sum())
     else:
         print("The dataset doesn't contain any duplicate values.")
c) Extract data from every column except the outcome column as a variable named X:
Extract the data from all columns except the ‘Outcome’ column and assign it to a variable called X.
[9]: X = data.iloc[:,:-1]
X.head(5)
[9]:    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
     0            6      148             72             35        0  33.6
     1            1       85             66             29        0  26.6
     2            8      183             64              0        0  23.3
     3            1       89             66             23       94  28.1
     4            0      137             40             35      168  43.1

        DiabetesPedigreeFunction  Age
     0                     0.627   50
     1                     0.351   31
     2                     0.672   32
     3                     0.167   21
     4                     2.288   33
d) Extract data from the outcome column as a variable named Y: Extract the values
from the ‘Outcome’ column and assign them to a variable called Y.
[10]: Y = data.iloc[:,-1]
Y.head(5)
[10]: 0 1
1 0
2 1
3 0
4 1
Name: Outcome, dtype: int64
e) Divide the dataset into two parts for training and testing: Split the dataset into a
training set and a testing set in a 70% - 30% proportion. This will be used to train the model on
the training set and evaluate its performance on the testing set.
[11]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30,
                                                          random_state=51)
[12]: print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(537, 8)
(231, 8)
(537,)
(231,)
[13]: # Cell input reconstructed: the fitted model is named `logistic`,
      # since logistic.predict() is used in the cells below.
      logistic = LogisticRegression()
      logistic.fit(X_train, Y_train)

[13]: LogisticRegression()
1.5 Evaluate the model:
[14]: Y_predict = logistic.predict(X_test)
print("Y_predict:\n",Y_predict)
Y_predict:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1
0 1 0 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0
1 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 1 1 1 0
0 0 0 0 1 0 1 0 0]
[15]: print("Y_test:\n",Y_test)
Y_test:
737 0
505 0
296 1
711 0
329 0
..
405 0
315 0
131 1
364 0
322 1
Name: Outcome, Length: 231, dtype: int64
[16]: from sklearn.metrics import confusion_matrix

      # Cell input reconstructed; the matrix below is the original output.
      print("Confusion Matrix :\n", confusion_matrix(Y_test, Y_predict))

Confusion Matrix :
 [[131 11]
 [ 37 52]]
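From the confusion matrix above (rows = actual class, columns = predicted class), the usual summary metrics can be computed directly. A sketch using the reported numbers:

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[131, 11],
               [ 37, 52]])
tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

accuracy = (tn + tp) / cm.sum()   # (131 + 52) / 231 ≈ 0.792
precision = tp / (tp + fp)        # 52 / 63 ≈ 0.825
recall = tp / (tp + fn)           # 52 / 89 ≈ 0.584

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

The relatively low recall shows the model misses many diabetic cases, which matters more than raw accuracy in a medical screening context.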
[17]: import matplotlib.pyplot as plt

      fig, ax = plt.subplots()
      # Plot the actual outcomes from the test set
      ax.scatter(range(len(Y_test)), Y_test, color='blue', label='Actual Outcome')
      # Add a legend
      ax.legend()
      plt.show()
1.7 Use the model:
Once the model is trained and evaluated, you can use it to make predictions on new, unseen data.
This can be done by providing new input values to the model and using the predict function to
obtain the predicted outcome.
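A minimal sketch of this step. The training data here is a synthetic stand-in so the snippet runs on its own (in the notebook you would simply reuse the `logistic` model fitted earlier), and the new patient's feature values are made up for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic stand-in training data (scaled 0-1, toy labels); in the notebook,
# reuse the `logistic` model already fitted on the real training set instead.
rng = np.random.default_rng(51)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(100, 8)), columns=FEATURES)
Y_demo = (X_demo["Glucose"] > 0.5).astype(int)  # toy rule, not real labels

logistic = LogisticRegression().fit(X_demo, Y_demo)

# A hypothetical new patient -- feature values are made up for illustration
new_patient = pd.DataFrame([[0.3, 0.8, 0.5, 0.4, 0.2, 0.6, 0.1, 0.5]],
                           columns=FEATURES)
print(logistic.predict(new_patient))        # predicted class: 0 or 1
print(logistic.predict_proba(new_patient))  # probability of each class
```

Passing the new sample as a DataFrame with the same column names as the training data avoids scikit-learn's feature-name mismatch warnings.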