Logistic Regression vs. SVMs - Solution
Activity Overview
This activity is designed to consolidate your knowledge about logistic regression and support vector
machines (SVMs) from a practical point of view.
In particular, we'll try to diagnostically predict whether a patient has diabetes by using the Python libraries
pandas and sklearn . We will begin by reading the dataset diabetes
(https://www.kaggle.com/uciml/pima-indians-diabetes-database) from Kaggle.
This activity is designed to help you apply the machine learning algorithms you have learned using Python
packages. Python concepts, instructions, and starter code are embedded within this Jupyter Notebook to
guide you as you progress through the activity. Remember to run the code in each code cell prior to
submitting the assignment. Upon completing the activity, we encourage you to compare your work against
the solution file to perform a self-assessment.
Support vector machines (SVMs) are supervised learning models with associated learning algorithms that
analyze data for classification and regression analysis.
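For reference, a minimal sketch of the loading cell, assuming the Kaggle dataset has been downloaded locally as diabetes.csv (the file name is an assumption):
In [1]: import pandas as pd
df = pd.read_csv('diabetes.csv') # read the dataset into the dataframe df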
We begin by visualizing the first ten rows of the dataframe df using the function .head() . By default,
.head() displays the first five rows of a dataframe.
Complete the code cell below by passing the desired number of rows to the function .head() as an
integer.
In [2]: df.head(10)
#df.head( )
Out[2]:
[Table: first ten rows of df, with columns Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome]
Next, we retrieve some more information about our dataframe by using the properties .shape and
.columns and the function .describe() .
In [3]: df.shape
Out[3]: (768, 9)
In [4]: df.columns
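Out[4]: Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')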
In [5]: df.describe()
Out[5]:
[Table: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for each column of df]
Observe the dataframe df above. Which variable would you like to predict for this problem?
Understanding which variable we are trying to predict is important as we need to split our dataframe df
into X and y dataframes. The X dataframe will contain all the variables in df that will be used to make
the prediction; y will contain the dependent variable, in this case Outcome .
Run the code cell below to create our X dataframe and to visualize the first five rows using the command
head() .
In [6]: X = df.iloc[:, :-1] # select all the columns in df except the last one
X.head()
Out[6]:
[Table: first five rows of X, with columns Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age]
Next, we need to separate the Outcome from our original dataframe df . In the code cell below, fill in the
ellipsis with the name of our target variable.
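A minimal completion of that cell, selecting the Outcome column (the cell number is inferred from the surrounding cells):
In [7]: y = df['Outcome'] # select the target variable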
In the code cell below, use the function head() to visualize the first five rows of y .
In [8]: y.head()
Out[8]: 0 1
1 0
2 1
3 0
4 1
Name: Outcome, dtype: int64
As a reminder, the training set is the portion of the original dataset that we use to train the model. The model
sees and learns from this data.
The testing dataset is the sample of data used to provide an unbiased evaluation of a final model fit on the
training dataset. It is only used once a model is completely trained.
To split the data into training and testing datasets we can use the function train_test_split
(https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from
sklearn . This function splits arrays or matrices into random train and test subsets and returns a list
containing the train-test splits of the inputs.
X : Input dataframe
y : Output dataframe
test_size : Should be between 0.0 and 1.0 and represent the proportion of the dataset to include in
the test split
random_state : Controls the shuffling applied to the data before applying the split. Ensures the
reproducibility of the results across multiple function calls
In the code cell below, fill in the ellipsis to set the argument test_size equal to 0.25 and
random_state equal to 0 .
In [10]: from sklearn.model_selection import train_test_split
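The completed call, assuming the conventional names for the resulting splits (X_train, X_test, y_train, and y_test are assumed names):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)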
In the code cell below, we've imported the classifiers LogisticRegression
(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and SVC
(https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC) from
sklearn .
classifiers = [
LogisticRegression(),
SVC(kernel="linear")]
In the code cell below, we instantiate the LogisticRegression classifier, fit it to our training set, and compute its score.
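A sketch of this cell; since the logistic regression's test accuracy reported later is 80.21%, the score below is assumed to be the accuracy on the training set:
In [13]: log_clf = LogisticRegression()
log_clf.fit(X_train, y_train) # fit the classifier on the training data
log_clf.score(X_train, y_train) # mean accuracy on the training set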
Out[13]: 0.7604166666666666
In the code cell below, fill in the ellipsis with the name of the classifier we have imported for SVM.
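A sketch following the same pattern, assuming the classifier is named svm_clf (an assumed name):
In [14]: svm_clf = SVC(kernel="linear")
svm_clf.fit(X_train, y_train) # fit the linear SVM on the training data
svm_clf.score(X_train, y_train) # mean accuracy on the training set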
Out[14]: 0.7586805555555556
Which model performs better? Based on the scores above, logistic regression performs slightly better (0.760 vs. 0.759).
In the code cell below, we have used the function predict() with the logistic regression classifier
log_clf to make predictions on the testing set.
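A sketch of that cell, assuming the predictions are stored in a variable named y_eval (a hypothetical name) and the accuracy is computed with accuracy_score from sklearn.metrics:
from sklearn.metrics import accuracy_score
y_eval = log_clf.predict(X_test) # predictions on the testing set
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_eval) * 100))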
Accuracy: 80.21%
Finally, we are interested in looking at the accuracy of the SVM model. In the code cell below, compute the
predictions on the testing set (y_eval_svm) by following the code above.
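A minimal solution, reusing the fitted SVM classifier (svm_clf is the name assumed above):
y_eval_svm = svm_clf.predict(X_test) # predictions on the testing set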
Run the code cell below to compute the accuracy for this model.
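Following the logistic regression example above, the cell presumably reads:
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_eval_svm) * 100))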
Accuracy: 77.08%