Machine Learning Project Report
PROJECT
ABHISHEK.V.
PGDSBA
List of Contents / Index.
Problem 1:
1.1 Read the dataset. Do the descriptive statistics and do the null value condition
check. Write an inference on it. (4 Marks)
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check
for Outliers. (7 Marks)
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here
or not? Data Split: Split the data into train and test (70:30). (4 Marks)
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting. (7 marks)
1.7 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score
for each model. Final Model: Compare the models and write inference which model
is best/optimized. (7 marks)
1.8 Based on these predictions, what are the insights? (5 marks)
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the
nltk in Python. We will be looking at the following speeches of the Presidents of the
United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sent() for extracting counts)
2.1 Find the number of characters, words, and sentences for the mentioned
documents. – 3 Marks
2.2 Remove all the stopwords from all three speeches. – 3 Marks
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords) – 3 Marks
1.1) Read the dataset. Describe the data briefly. Interpret the inferences for each.
Initial steps like head(), .info(), data types, etc.; null value check, summary stats,
and skewness must be discussed.
Dataset head:
All the variables except vote and gender are of int64 datatype.
However, looking at the values in the dataset, the remaining variables all appear to be
categorical except age.
The dataset has a few duplicate rows, and removing them is the best choice as duplicates
do not add any value. The snippets below also show the shape of the dataset after
removing the duplicates.
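A minimal sketch of these initial checks, assuming the data is read with pandas from a file named "Election_Data.csv" (the actual file name may differ):

import pandas as pd

df = pd.read_csv("Election_Data.csv")   # file name assumed for illustration

print(df.head())                         # first five rows
df.info()                                # data types and non-null counts
print(df.describe(include="all"))        # summary statistics
print(df.isnull().sum())                 # null value check per column
print(df.skew(numeric_only=True))        # skewness of the numeric columns

print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()
print("Shape after removing duplicates:", df.shape)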
Bivariate Analysis.
Pairplot-
The pairplot shows the interaction of each variable with every other variable present.
No strong relationship is present between the variables.
There is a mixture of positive and negative relationships, which is expected.
Overall, the pairplot gives only a rough estimate of the interactions; a clearer picture
can be obtained from a heatmap and other kinds of plots.
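A minimal sketch of the pairplot, assuming the cleaned DataFrame is named df and seaborn is available:

import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, diag_kind="hist")   # pairwise scatter plots with histograms on the diagonal
plt.show()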
People above 45 years of age think that Blair is doing a good job.
Analysis of Hague and Age.
Hague has a slightly higher concentration of neutral ratings than Blair among people
above 50 years of age.
Multicollinearity is an important issue that can harm the model. A heatmap is a good way
of identifying this issue because it gives a basic idea of the relationships the
variables have with each other.
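A minimal sketch of the correlation heatmap on the numeric columns, again assuming the DataFrame is named df:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")  # annotated correlation matrix
plt.show()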
Observations:
* The highest positive correlation is between ‘economic_cond_national’ and
‘economic_cond_household’ (35%). The good thing is that it is not huge.
* The highest negative correlation is between Blair and Hague (35%), but that too is
not huge.
Thus, multicollinearity will not be an issue in this dataset.
Outlier Check/Treatment:
Using boxplot.
There are outliers present in the ‘economic_cond_national’ and
‘economic_cond_household’ variables, as can be seen from the boxplots.
We will find the upper and lower limits to get a clearer picture of the outliers.
The upper and lower limits are not far apart, and the outliers lie on the lower side
only, with a value of 1 where the lower limit is 1.5.
So it is not advisable to treat the outliers in this case.
We will move forward without treating the outliers.
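A minimal sketch of how the IQR-based upper and lower limits can be obtained for the two variables:

for col in ["economic_cond_national", "economic_cond_household"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1                                   # inter-quartile range
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits used by the boxplot
    print(col, "-> lower limit:", lower, ", upper limit:", upper)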
1.3) Encode the data (having string values) for Modelling. Is scaling necessary here
or not? Data Split: Split the data into train and test (70:30).
Encoding the dataset.
As many machine learning models cannot work with string values, we will encode the
categorical variables and convert their datatypes into integer type.
From the info of the dataset, we know there are 2 categorical variables, so we need
to encode these 2 variables with a suitable technique.
Those 2 variables are ‘vote’ and ‘gender’. Their distributions are given below.
Gender Distribution.
Vote Distribution.
From the above results we can see that both variables contain only two classes each.
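A minimal sketch of the encoding step; the exact technique in the original notebook may differ (e.g. LabelEncoder instead of category codes):

# convert the two object columns to integer codes (0/1), since both are binary
df["vote"] = df["vote"].astype("category").cat.codes
df["gender"] = df["gender"].astype("category").cat.codes
df.info()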
Info after encoding.
Data after encoding.
Scaling.
We are not going to scale the data for the Logistic Regression, LDA, and Naive Bayes
models, as it is not necessary.
But in the case of KNN, scaling is necessary because it is a distance-based algorithm.
Scaling gives similar weightage to all the variables.
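A minimal sketch of the 70:30 split and the scaling used for the KNN model only; the target column is assumed to be 'vote' and the random_state is illustrative:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop("vote", axis=1)
y = df["vote"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit only on the train set
X_test_scaled = scaler.transform(X_test)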
1.4) Apply Logistic Regression and LDA (Linear Discriminant Analysis). Interpret
the inferences of both models.
* Logistic Regression.
Applying Logistic Regression and fitting the train data.
Confusion matrix display.
The model is neither overfitting nor underfitting. The training and testing results show
that the model is excellent, with good precision and recall values.
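A minimal sketch of the Logistic Regression step on the unscaled split from above:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)   # max_iter raised to ensure convergence
lr.fit(X_train, y_train)
print("Train accuracy:", lr.score(X_train, y_train))
print("Test accuracy:", lr.score(X_test, y_test))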
* LDA (Linear Discriminant Analysis).
Applying LDA and fitting the train data.
The training and testing results show that the model is excellent, with good precision
and recall values. The LDA model is better than Logistic Regression, with better test
accuracy and recall values.
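A minimal sketch of the LDA model on the same split:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("Train accuracy:", lda.score(X_train, y_train))
print("Test accuracy:", lda.score(X_test, y_test))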
1.5) Apply KNN Model and Naïve Bayes Model. Interpret the inferences of each
model.
Applying KNN model and fitting the train data.
Confusion matrix display.
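A minimal sketch of both models: KNN on the scaled data and Gaussian Naive Bayes on the unscaled data:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

knn = KNeighborsClassifier(n_neighbors=5)   # default k, tuned later with grid search
knn.fit(X_train_scaled, y_train)
print("KNN test accuracy:", knn.score(X_test_scaled, y_test))

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes test accuracy:", nb.score(X_test, y_test))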
1.6) Model Tuning, Bagging and Boosting (1.5 pts). Apply grid search on each
model (include all models) and build models on the best params.
BAGGING:
Bagging is a machine learning ensemble meta-algorithm designed to improve the
stability and accuracy of machine learning algorithms used in statistical
classification and regression.
Bagging reduces variance and helps to avoid overfitting. A Decision Tree Classifier
is used as the base estimator for bagging below.
Applying the Bagging model and fitting the train data.
Confusion matrix display.
A Random Forest is also used for bagging below.
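A minimal sketch of the tuning and ensemble steps; the parameter grids shown here are illustrative and may differ from the ones actually used:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier)

# Bagging with a decision tree as the base estimator
bag_dt = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
bag_dt.fit(X_train, y_train)

# Random Forest tuned with GridSearchCV
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 7, None]},
    cv=5)
rf_grid.fit(X_train, y_train)
print("Best Random Forest params:", rf_grid.best_params_)

# AdaBoost for boosting
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X_train, y_train)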
1.7) Performance Metrics: Check the performance of predictions on the train and test
sets using Accuracy, Confusion Matrix, ROC curve plots and ROC_AUC scores for each
model, along with the classification report. Final Model: compare the models and state
which one is best/optimized.
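A minimal sketch of how these metrics can be computed, shown here for the Logistic Regression model (lr); the same pattern applies to every other fitted model:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)
import matplotlib.pyplot as plt

for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = lr.predict(X_)
    prob = lr.predict_proba(X_)[:, 1]          # probability of the positive class
    print(name, "accuracy:", accuracy_score(y_, pred))
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print("ROC AUC:", roc_auc_score(y_, prob))
    fpr, tpr, _ = roc_curve(y_, prob)
    plt.plot(fpr, tpr, label=name + " ROC")
plt.legend()
plt.show()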
For each model, the confusion matrices for the train and test sets are shown in the
snippets, along with the ROC curves. The AUC-ROC scores are summarised below.

Model                                      Train AUC   Test AUC
Logistic Regression                        0.889       0.883
LDA                                        0.883       0.884
LDA (GridSearchCV)                         0.894       0.865
KNN                                        0.932       0.871
KNN (GridSearchCV)                         1.000       0.824
Naïve Bayes                                0.886       0.885
Naïve Bayes (GridSearchCV)                 0.892       0.855
Bagging (Decision Tree)                    1.000       0.877
Bagging (Random Forest)                    0.997       0.897
Bagging (Random Forest, GridSearchCV)      0.998       0.864
Boosting (AdaBoost)                        0.910       0.880
Boosting (AdaBoost, GridSearchCV)          0.915       0.861
Model comparison:
This is the process through which we compare all the models built and find the best
optimized among them. There are a total of 8 different kinds of models, each built
twice, in the following fashion:
- Without scaling.
- With scaling (GridSearchCV).
The basis on which the models are evaluated is known as performance metrics. The
metrics on which the models are evaluated are:
- Accuracy
- AUC
- Recall
- Precision
- F1-score
Without Scaling:
All models performed well, with only slight differences (1-5%).
With Scaling (GridSearchCV):
All models performed well, with only slight differences (1-5%).
Comparing the scaled and unscaled models, scaling improved the performance only of the
distance-based algorithm; for the others it slightly decreased performance overall.
Only the scaled KNN model performed slightly better than the unscaled KNN model.
Best optimized model: on the basis of all the comparisons and performance metrics, the
Logistic Regression model without scaling performed best of all.
1.8) Based on your analysis and working on the business problem, detail out
appropriate insights and recommendations to help the management solve the
business objective.
Inferences:
Logistic Regression performed the best out of all the models built.
The logistic regression equation for the model shows how each feature contributes to
the predicted output (a minimal sketch of reading these off the fitted model follows
this list).
Use the Logistic Regression model without scaling to predict the outcome, as it has
the best optimized performance.
Hyper-parameter tuning is an important aspect of model building. Its limitation is that
processing all the parameter combinations requires a huge amount of processing power.
But if tuning can be done with more sets of parameters, we might get even better
results.
Gathering more data will also help in training the models and thus improve their
predictive power.
We can also create a function in which all the models predict the outcomes in sequence.
This will help in better understanding the probability of what the outcome will be.
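As referenced above, a minimal sketch of how the logistic regression equation (intercept and feature coefficients) can be read off the fitted model lr:

import pandas as pd

coefs = pd.Series(lr.coef_[0], index=X.columns).sort_values()
print("Intercept:", lr.intercept_[0])
print(coefs)   # positive coefficients push the prediction towards the positive class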
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the
nltk in Python. We will be looking at the following speeches of the Presidents of the
United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1) Find the number of characters, words and sentences for the mentioned
documents. (Hint: use .words(), .raw(), .sent() for extracting counts)
1. President Franklin D. Roosevelt's 1941 address has 7571 characters, 1536 words and
68 sentences.
2. President John F. Kennedy's 1961 address has 7618 characters, 1546 words and 52
sentences.
3. President Richard Nixon's 1973 address has 9991 characters, 2028 words and 69
sentences.
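A minimal sketch of the counts using the nltk inaugural corpus (file ids assumed to follow nltk's 'year-Name.txt' convention):

import nltk
nltk.download("inaugural")
from nltk.corpus import inaugural

for fileid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    print(fileid,
          len(inaugural.raw(fileid)), "characters,",
          len(inaugural.words(fileid)), "words,",
          len(inaugural.sents(fileid)), "sentences")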
2.2) Remove all the stopwords from the three speeches. Show the word count
before and after the removal of stopwords. Show a sample sentence after the
removal of stopwords.
1. Words along with stopwords in Roosevelt's speech.
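A minimal sketch of the stopword removal, assuming nltk's English stopword list (punctuation is dropped as well):

import nltk
nltk.download("inaugural")
nltk.download("stopwords")
from nltk.corpus import inaugural, stopwords

stop_words = set(stopwords.words("english"))
cleaned = {}
for fileid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    words = [w.lower() for w in inaugural.words(fileid)
             if w.isalpha() and w.lower() not in stop_words]
    cleaned[fileid] = words
    print(fileid, "word count after stopword removal:", len(words))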
2.3) Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)
In the snippets below we can see the words that occur the most in each president's
inaugural address.
1. President Roosevelt's 1941 speech.
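A minimal sketch of the frequency count on the cleaned word lists from 2.2 (the dictionary named cleaned above):

from collections import Counter

for fileid, words in cleaned.items():
    print(fileid, "top three words:", Counter(words).most_common(3))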
2.4) Plot the word cloud of each of the three speeches. (after removing the
stopwords).
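A minimal sketch of the word clouds, assuming the third-party wordcloud package is installed and reusing the cleaned word lists from 2.2:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

for fileid, words in cleaned.items():
    wc = WordCloud(background_color="white").generate(" ".join(words))
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(fileid)
plt.show()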