Machine Learning Business Report
Prepared by:
Mitesh Kumar Agrawal
DSBA June’21
Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents
2.2 Remove all the stopwords from all three speeches
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words (after removing the stopwords)
Table5: Skewness
Data Ingestion
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference
on it
Sample Data:
Data Info:
Inferences:
The dataset consists of 1525 voters (rows) and 9 columns. Two columns, vote and gender, are of object datatype and the
remaining 7 are of integer datatype.
There are no null values anywhere in the dataset.
Data Summary:
The vote column has 2 unique values, of which 'Labour' is the most frequent.
The minimum and maximum ages of voters are 24 and 93 respectively, and the average age is 54.
Female voters outnumber male voters.
Skewness:
Table5: Skewness
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers
Univariate Analysis
Categorical Variable:
Observations-
More voters support the Labour party than the Conservative party.
The number of female voters is slightly higher than the number of male voters.
Numerical Variable:
All the numerical variables are approximately normally distributed, although some show multiple modes.
Outliers are present in two variables, economic.cond.household and economic.cond.national.
In a few of the boxplots, the minimum and maximum values are not clearly visible.
Bivariate Analysis
Categorical variable:
Female voters are more in number for both parties i.e. Labour and Conservative
Numerical Variable:
The household economic condition reported by Labour voters is slightly better than that reported by Conservative voters.
Fig10: Pairplot
11 | P a g e GreatLakes Assignment: Machine Learning
The pairplot for this dataset does not give a clear idea of the relationships between the variables.
Outlier Check:
From the plot we can see that outliers are present in both economic.cond.household and economic.cond.national. Since
each variable has only one outlier, and it lies on the lower end, we will not apply outlier treatment.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split
the data into train and test
Encoding the Data
Machine Learning models do not work with string values. So we need to encode the categorical variables.
Vote:
Gender:
Since both variables have only two values and the categories have no level or order, label encoding and one hot
encoding (with one dummy column dropped) give the same result.
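The equivalence of the two encodings for a two-valued column can be illustrated with a minimal sketch. The helper names and the four sample votes below are hypothetical and are not taken from the report's notebook or dataset.

```python
def label_encode(values):
    """Map each category to an integer code, in alphabetical order."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_drop_first(values):
    """One-hot encode, dropping the first (alphabetical) category."""
    categories = sorted(set(values))
    kept = categories[1:]
    return [[int(v == c) for c in kept] for v in values], kept

# Hypothetical sample values, for illustration only.
votes = ["Labour", "Conservative", "Labour", "Labour"]
labels, mapping = label_encode(votes)
dummies, dummy_names = one_hot_drop_first(votes)

print(labels)                        # integer codes per row
print([row[0] for row in dummies])   # the single remaining dummy column
```

For a binary variable the single dummy column left after dropping one level is identical to the label codes, which is why the choice of encoder does not matter here.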
Scaling brings variables measured on widely different ranges onto a similar relative range, which can improve model
performance.
Feature scaling is recommended for distance-based algorithms such as KNN, since they are very sensitive to the range of
the data points. Models fitted with gradient descent or regularization, such as logistic regression, also tend to benefit
from it. Centering can also reduce the multicollinearity that arises when interaction or polynomial terms are added.
Tree-based methods generally do not require scaling, because they split on one variable at a time and the splits are
unaffected by the variable's scale.
Since most of the variables are in the range of 0-10 except age, we will scale only the age variable, using the Z-score
method.
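As a minimal sketch of the Z-score method, each value is replaced by (value - mean) / standard deviation. The ages below are hypothetical, and the population standard deviation is assumed (the convention scikit-learn's StandardScaler also follows).

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

# Hypothetical ages, not rows from the report's dataset.
ages = [24, 35, 54, 67, 93]
scaled_ages = z_score(ages)
print([round(a, 3) for a in scaled_ages])
```

After scaling, age is expressed in standard deviations from its mean, so it no longer dominates the 0-10 variables in distance calculations.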
We need to split the data into train and test sets so that we can compare the performance of the model on data it was
trained on and on data it has not seen. A 70:30 ratio of train to test is a common choice, and we have used it
here.
Shape of the data after splitting, where X refers to the independent variables and y refers to the dependent/target variable:
Train-
X(1067,8)
y(1067,1)
Test-
X(458,8)
y(458,1)
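The row counts above follow directly from a 70:30 split of 1525 rows. The sketch below is a dependency-free illustration; the function name and the seed are assumptions, and rounding the test size up is assumed because it reproduces the reported 1067 and 458 rows.

```python
import random

def split_indices(n_rows, test_percent=30, seed=1):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # hypothetical seed, for repeatability
    n_test = (n_rows * test_percent + 99) // 100  # round the test size up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1525)
print(len(train_idx), len(test_idx))  # 1067 458
```

Every row lands in exactly one of the two sets, so the model is never evaluated on rows it was trained on.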
Logistic regression models the probability of a discrete outcome given one or more input variables. The most
common form of logistic regression models a binary outcome, something that can take two values such as true/false or
yes/no. Multinomial logistic regression can model scenarios where there are more than two possible discrete
outcomes. Logistic regression is a useful analysis method for classification problems, where you are trying to determine
which category a new sample best fits.
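A minimal sketch of how a fitted binary logistic regression turns inputs into a probability is shown here. The weights, bias and feature values are hypothetical and are not the coefficients of the report's model.

```python
import math

def predict_probability(features, weights, bias):
    """Return P(class = 1) from the logistic (sigmoid) of a linear score."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold."""
    return int(predict_probability(features, weights, bias) >= threshold)

# Hypothetical weights and inputs, for illustration only.
weights = [0.6, -0.4]
bias = 0.1
print(predict_probability([1.0, 2.0], weights, bias))
print(predict_class([1.0, 2.0], weights, bias))
```

A positive weight raises the predicted probability as its variable increases, and a negative weight lowers it.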
Rsquare on Train:
Linear Discriminant Analysis, as its name suggests, is a linear model for classification and dimensionality reduction. It is
most commonly used for feature extraction in pattern classification problems.
Inferences:
Both models performed well on both the train and test datasets.
There is no sign of underfitting or overfitting, as the train and test accuracies are very close.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model
K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately. Instead,
it stores the dataset and performs the computation at the time of classification.
In the training phase the KNN algorithm just stores the dataset. When it gets new data, it classifies that
data into the category of the stored points that are most similar to it.
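The store-then-vote behaviour described above can be sketched in a few lines. The points and labels below are hypothetical, Euclidean distance and k=3 are assumptions, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest stored points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# "Training" is only storing these hypothetical points and labels.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["Labour", "Labour", "Labour",
          "Conservative", "Conservative", "Conservative"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (8.5, 8.5)))
```

Because the vote depends on raw distances, a variable with a much larger range than the others would dominate the result, which is the reason age was scaled earlier.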
Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving
classification problems.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a
hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
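In its standard form the theorem reads P(A|B) = P(B|A) * P(A) / P(B), where P(A) is the prior probability of the hypothesis, P(B|A) is the likelihood of the evidence under the hypothesis, and P(B) is the overall probability of the evidence. A small worked sketch follows; the function name and the probabilities are hypothetical.

```python
def bayes_posterior(prior, likelihood, likelihood_if_not):
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # Total probability of the evidence B.
    evidence = likelihood * prior + likelihood_if_not * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: P(A) = 0.5, P(B|A) = 0.8, P(B|not A) = 0.2.
posterior = bayes_posterior(prior=0.5, likelihood=0.8, likelihood_if_not=0.2)
print(round(posterior, 3))  # 0.8
```

Naive Bayes applies this rule to each class, with the additional assumption that the input variables are conditionally independent given the class.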
Inferences:
The results of the KNN model show that the accuracies on the train and test data are far apart.
The results of the Naive Bayes model show that the accuracy on train and test is the same, but it is lower than that of the
Logistic regression model.
So, we can conclude that these models did not perform as well.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting
Tuning is the process of maximizing the model's performance without overfitting or creating too high a variance. In
machine learning, this is accomplished by selecting appropriate hyper-parameters.
Ensemble models such as Bagging, AdaBoost and Gradient Boosting are generally less prone to overfitting than a single
decision tree, although boosting can still overfit if too many models are added.
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed to improve
the stability and accuracy of machine learning algorithms used in statistical classification and regression. It decreases
the variance and helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of
the model averaging approach.
Fig12: Bagging
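The bagging idea can be sketched as: draw bootstrap samples with replacement, fit one simple model per sample, and combine the models by majority vote. In the sketch below a one-variable threshold rule (a decision stump) stands in for the decision trees of a Random Forest; the data, function names and seed are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold and orientation with the fewest training errors."""
    best = None
    for threshold in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def fit_bagging(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample],
                                [ys[i] for i in sample]))
    return models

def predict_bagging(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical one-variable data with two classes.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagging(xs, ys)
print(predict_bagging(models, 1), predict_bagging(models, 9))
```

Each stump sees a slightly different sample, so their individual errors differ, and voting averages those differences out. This is the variance reduction the text describes.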
Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak
classifiers. It does so by building weak models in series. First, a model is built from the training
data. Then a second model is built which tries to correct the errors made by the first model. This procedure is
continued, and models are added until either the complete training data set is predicted correctly or the maximum
number of models has been added.
Fig13: Boosting
There are many types of Boosting, of which we have used two:
AdaBoost reweights the training samples after each round so that misclassified samples receive more attention from the
next weak learner, which corresponds to minimising an exponential loss. The method was mainly designed for binary
classification problems and can be used to boost the performance of decision trees.
Gradient Boosting fits each new model to the gradient of a differentiable loss function. The technique can be used for both
classification and regression problems.
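The sequential error-correcting idea can be sketched with a small AdaBoost implementation on one variable. Labels are coded as -1/+1, the weak learner is a threshold rule, and the ten data points are a hypothetical toy set that no single threshold can separate; none of this is taken from the report's notebook.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` above the threshold and `-polarity` otherwise."""
    return polarity if x > threshold else -polarity

def fit_weighted_stump(xs, ys, weights):
    """Return (error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def fit_adaboost(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this model
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified rows, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict_adaboost(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold classifies it perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = fit_adaboost(xs, ys, n_rounds=3)
print([predict_adaboost(ensemble, x) for x in xs])
```

A single threshold rule misclassifies three of these ten points, while the weighted combination of three rules classifies all of them correctly.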
Inferences:
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models
and write inference which model is best/optimized.
1. Logistic Regression
Confusion Matrix on Train
2. LDA
Confusion Matrix on Train and test
Fig19: LDA ROC curve and ROC_AUC score on Train and Test
3. Naive Bayes
Accuracy
AUC
Recall
Precision
F1-Score
On the basis of Accuracy: Logistic Regression, LDA and Naive Bayes performed well
On the basis of AUC: Logistic Regression and LDA performed well
On the basis of Recall: Bagging performed well
On the basis of Precision: Logistic Regression, LDA and Naive Bayes performed well
On the basis of F1-Score: Logistic Regression performed well
Logistic Regression performed well on several metrics (Accuracy, AUC, Precision and F1-Score), so it is selected as
the best optimized model.
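For reference, the accuracy, precision, recall and F1-score compared above can all be derived from the four cells of a confusion matrix. The sketch below uses hypothetical labels and predictions, not the report's model output.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical true labels and predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision penalises false positives and recall penalises false negatives, which is why the models can rank differently depending on the metric chosen.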
Hague: -0.8379785998010736
Blair: 0.5746328767419848
political.knowledge: -0.48267335884793466
economic.cond.national: 0.3375643790321967
age: -0.3246813099750099
Problem 2:
In this project, we work with the inaugural corpus from nltk in Python. We will be
looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned documents.
Characters
1. President Franklin D. Roosevelt in 1941: 7571
2. President John F. Kennedy in 1961: 7618
3. President Richard Nixon in 1973: 991
Words
1. President Franklin D. Roosevelt in 1941: 1536
2. President John F. Kennedy in 1961: 1546
3. President Richard Nixon in 1973: 2028
Sentences:
1. President Franklin D. Roosevelt in 1941: 68
2. President John F. Kennedy in 1961: 52
3. President Richard Nixon in 1973: 69
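The three counts can be sketched without any external library, as below. This is only an illustration on a hypothetical sample string using simple regular expressions. The report's figures come from nltk's inaugural corpus, whose tokenizers treat punctuation differently, so the counts from this sketch would not match nltk's exactly.

```python
import re

def text_counts(text):
    """Count characters, word tokens and sentences in a text."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Treat runs of '.', '!' or '?' as sentence boundaries (a simplification).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }

# Hypothetical sample text, not an excerpt from the speeches.
sample = "We meet today. The task is clear! Are we ready?"
counts = text_counts(sample)
print(counts)
```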
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the
top three words. (after removing the stopwords)
Results after removing stopwords:
1. President Franklin D. Roosevelt in 1941
top3 words: e, n, r
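Single-letter results such as these usually indicate that the frequencies were counted over the characters of a string rather than over a list of word tokens. A minimal sketch of counting over tokens is given below; the stopword set and the sample sentence are hypothetical stand-ins for nltk's English stopword list and the speech text.

```python
import re
from collections import Counter

# Tiny hypothetical stopword set; nltk's English list is much longer.
STOPWORDS = {"the", "and", "is", "a", "of", "to", "in"}

def top_words(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())  # split into word tokens
    kept = [token for token in tokens if token not in STOPWORDS]
    return Counter(kept).most_common(n)

# Hypothetical sample text, not an excerpt from the speeches.
sample = "the river meets the sea and the sea meets the sky and the sea is wide"

print(top_words(sample))
# Counting over the raw string counts characters instead of words:
print(Counter(sample.replace(" ", "")).most_common(3))
```

If the stopwords are removed and the remaining tokens are joined back into a single string before counting, the counter sees characters again, so the frequency step should receive the token list itself.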