
Predictive Modelling Project

Problem 1 - Define the problem and perform exploratory Data Analysis

- Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables

Observations:
- Data set contains 8192 rows with 22 columns (Shape).
Observations:
- Data set contains missing values in rchar and wchar.
- Data set contains one object column, named 'runqsz'.
- Data set contains 13 float columns, 8 integer columns and 1 object column (checks sketched below).
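A minimal sketch of these checks, assuming the data has been loaded into a pandas DataFrame named df (the file name here is hypothetical):

import pandas as pd

# Hypothetical file name; adjust to the actual dataset path.
df = pd.read_csv("compactiv.csv")

print(df.shape)           # (8192, 22)
print(df.dtypes)          # 13 float, 8 int and 1 object (runqsz) column
print(df.isnull().sum())  # missing values appear in rchar and wchar
print(df.describe().T)    # statistical summary of the numeric columns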
Observations:
- lread (reads, transfers per second, between system memory and user memory) ranges from 0 to 1845.
- lwrite (writes, transfers per second, between system memory and user memory) ranges from 0 to 575.
- scall (number of system calls of all types per second) ranges from 109 to 12493.
- sread (number of system read calls per second) ranges from 6 to 5318.
- swrite (number of system write calls per second) ranges from 7 to 5456.
- fork (number of system fork calls per second) ranges from 0 to 20.12.
- exec (number of system exec calls per second) ranges from 0 to 59.56.
- rchar (number of characters transferred per second by system read calls) ranges from 278 to 2526649.
- wchar (number of characters transferred per second by system write calls) ranges from 1498 to 1801623.
- pgout (number of page-out requests per second) ranges from 0 to 81.44.
- ppgout (number of pages paged out per second) ranges from 0 to 184.20.
- pgfree (number of pages per second placed on the free list) ranges from 0 to 523.
- pgscan (number of pages checked per second for whether they can be freed) ranges from 0 to 1237.
- atch (number of page attaches, i.e. satisfying a page fault by reclaiming a page in memory, per second) ranges from 0 to 211.58.
- pgin (number of page-in requests per second) ranges from 0 to 141.20.
- ppgin (number of pages paged in per second) ranges from 0 to 292.61.
- pflt (number of page faults caused by protection errors, i.e. copy-on-writes) ranges from 0 to 899.80.
- vflt (number of page faults caused by address translation) ranges from 0.2 to 1365.
- freemem (number of memory pages available to user processes) ranges from 55 to 12027.
- freeswap (number of disk blocks available for page swapping) ranges from 2 to 2243187.
- usr (portion of time, in %, that CPUs run in user mode) ranges from 0 to 99.
Observations:
- All the variables except usr have outliers.
- Every column has outliers. Linear regression is sensitive to outliers, but in my opinion outlier treatment is not appropriate here, because every record is a unique observation in its own right (a per-column outlier count is sketched below).
- Treating the outliers would alter the original values of the data and could lead to wrong predictions, so we will proceed with the outliers in place.
- In every column, 0 plays an important role, as it creates a huge spread in the range of the data. If we treated the zeros, the data would change (as with null values), and the real data may legitimately contain 0, so we will keep these values as well.
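A sketch of how the outliers can be counted, assuming the usual 1.5 x IQR whisker rule and the DataFrame df from above:

# Count outliers per numeric column using the 1.5 * IQR whisker rule.
for col in df.select_dtypes(include="number").columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {n_out} outliers")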
Observations:
- Most of the variables in this data are right skewed, with long tails toward high values; only usr is left skewed.

Dropped the categorical column before computing correlations, since correlation applies only to numerical values (see the sketch below).
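A sketch of this step, assuming seaborn for the heatmap:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation applies only to numeric columns, so drop the object column first.
corr = df.drop(columns=["runqsz"]).corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()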
Observations:
- The variables 'lread', 'lwrite', 'scall', 'sread', 'swrite', 'fork', 'exec', 'rchar', 'wchar', 'pgout', 'ppgout', 'pgfree', 'pgscan', 'atch', 'pgin', 'ppgin', 'pflt', 'vflt', 'freemem', 'freeswap' and 'usr' show correlations with one another.
- Correlation values near 1 or -1 indicate strong positive or strong negative correlation respectively; values near 0 indicate little or no correlation.

Observations:
- No prominent relationship between the variables can be seen in this pairplot.
- The pairplot shows the relationship between variables as scatterplots and the distribution of each variable as a histogram. From the histograms we can see that most of the dataset is right skewed.
- As the given dataset contains a large number of columns, the pairplot looks a little cluttered.
- From the plot we can see that some column pairs have positive correlation, some have no correlation, and some have negative correlation.
- Now let us split the data and build a model.

Problem 1 - Data Pre-processing


Prepare the data for modelling: - Missing Value Treatment (if needed) - Outlier Detection
(treat, if needed) - Feature Engineering - Encode the data - Train-test split.

Observations:
- There are no duplicated columns, and the dataset does not have duplicate rows either.
- The dataset has no null values except in the rchar and wchar columns.
- Let us use a for loop to treat these null values by replacing them with the column means, as sketched below.
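A minimal sketch of that loop, imputing the column mean wherever nulls remain:

# Replace missing values with the column mean for every column that
# still contains nulls (here: rchar and wchar).
for col in df.columns[df.isnull().any()]:
    df[col] = df[col].fillna(df[col].mean())

print(df.isnull().sum().sum())  # 0 -> no missing values remain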
Observations:
- After the treatment, the null values in the dataset are cleared, with no disturbance to the rest of the data.
- Linear regression is sensitive to null values.
Observations:
- After treating the missing values, we can see there are no more missing values in the dataset.
Observations:
- We can see there are outliers, and we need to treat them before building the model using a technique known as cap and floor, so that the data does not get distorted (see the sketch below).
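A sketch of the cap-and-floor treatment, assuming the 1.5 x IQR whiskers are used as the floor and cap:

# Cap-and-floor: clip every numeric column to its IQR whiskers so that
# extreme values are pulled in rather than dropped.
for col in df.select_dtypes(include="number").columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)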
Observations:
- As we can see, after treating the outliers with the cap-and-floor technique, all the outliers have been adjusted.
Observations:
- This is the final dataset after the label-encoding step, replacing 'CPU_Bound' with 1 and 'Not_CPU_Bound' with 0 in the runqsz column.
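A sketch of the encoding; the exact category strings are assumptions based on the dummy name runqsz_Not_CPU_Bound used later:

# Label-encode runqsz: 1 for CPU-bound, 0 for not CPU-bound (category
# strings assumed). Equivalently, pd.get_dummies(df, columns=["runqsz"],
# drop_first=True) would create the runqsz_Not_CPU_Bound dummy.
df["runqsz"] = df["runqsz"].map({"CPU_Bound": 1, "Not_CPU_Bound": 0})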

Observations:
- This is the dataset with only the independent variables.

Observations:
- The dataset is split into X and y training and test sets in a 75:25 ratio.
- We will build the model on the training data first.
- With the train and test data split, we can proceed to create the linear model. For the OLS model we can use the OLS class from the statsmodels API package.
- Then we fit the model on X_train and y_train, as sketched below.
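A sketch of the split and the statsmodels fit (the random_state is an arbitrary assumption):

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = df.drop(columns=["usr"])  # independent variables
y = df["usr"]                 # dependent variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# statsmodels OLS needs an explicit intercept column.
olsmod = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(olsmod.summary())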
Problem 1- Model Building - Linear regression
- Apply linear Regression using Sklearn - Using Statsmodels Perform checks for significant
variables using the appropriate method - Create multiple models and check the
performance of Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Observations:
- In this model R2 is 0.794, which is close to 1, so we can say it is a good model; the closer R2 is to 1, the better the model.
- The variables 'lread', 'lwrite', 'scall', 'swrite', 'exec', 'rchar', 'wchar', 'pgout', 'pgscan', 'atch', 'ppgin', 'pflt', 'vflt', 'freemem', 'freeswap' and 'runqsz_Not_CPU_Bound' have p-values of (approximately) 0, so they are significant variables.
- The variables 'sread', 'fork', 'ppgout', 'pgfree' and 'pgin' are insignificant.
- The R-square value tells us that the model can explain 79.4% of the variance in the training set.
- Adjusted R-square is also close to the R-square, at roughly 79.4%.
After dropping the features causing strong multicollinearity and the statistically insignificant ones, model performance has not dropped sharply. This shows that those variables did not have much predictive power (metric computation sketched below).
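A sketch of how these metrics can be read off the fitted statsmodels object and computed on both sets:

import numpy as np
import statsmodels.api as sm

print(olsmod.rsquared)      # R-square on the training data (0.794 here)
print(olsmod.rsquared_adj)  # adjusted R-square

# RMSE on the train and test sets.
pred_train = olsmod.predict(sm.add_constant(X_train))
pred_test = olsmod.predict(sm.add_constant(X_test))
print(np.sqrt(np.mean((y_train - pred_train) ** 2)))
print(np.sqrt(np.mean((y_test - pred_test) ** 2)))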
Testing the Assumptions of Linear Regression

For Linear Regression, we need to check that the following assumptions hold:

1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of error terms
5. No strong Multicollinearity
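For assumption 5, a sketch of the multicollinearity check using variance inflation factors (VIF > 5 is a common rule of thumb):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i)
     for i in range(X_const.shape[1])],
    index=X_const.columns)
print(vif.sort_values(ascending=False))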

Observations:
- Actual values are the observed data points.
- Fitted values are the predicted values.
- Residuals are the errors (Actual - Predicted).
Observations:

- There is no pattern in the residuals, so the assumptions of linearity and independence of the predictors are satisfied.
- The variance appears roughly constant across fitted values.

Observations:
- Since the p-value > 0.05, we can say that the residuals are approximately normal.
Observations:
- Since the p-value < 0.05, we can say that the residuals are heteroscedastic (both tests sketched below).
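A sketch of both residual tests, assuming the fitted model is olsmod_6: Shapiro-Wilk for normality, and Goldfeld-Quandt for homoscedasticity (the exact homoscedasticity test used in the report is not stated, so this choice is an assumption):

from scipy import stats
import statsmodels.stats.api as sms

residuals = olsmod_6.resid

# Normality: p > 0.05 -> residuals look approximately normal.
print(stats.shapiro(residuals))

# Homoscedasticity: p < 0.05 -> heteroscedastic residuals.
print(sms.het_goldfeldquandt(residuals, olsmod_6.model.exog))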

The model built, olsmod_6, satisfies the linearity, independence and no-strong-multicollinearity assumptions of Linear Regression; as the tests above show, the normality and homoscedasticity of the residuals hold only approximately.


Observations:
- The R2 value is 0.736.
- The Adjusted R2 value is 0.735.
- All the variables become significant.

Observations:
- For every 1 unit increase in lread (reads per second between system memory and user memory), usr decreases by 0.162 units.
- For every 1 unit increase in lwrite (writes per second between system memory and user memory), usr increases by 0.175 units.
- For every 1 unit increase in scall (system calls of all types per second), usr decreases by 0.000501 units.
- For every 1 unit increase in sread (system read calls per second), usr decreases by 0.00279 units.
- For every 1 unit increase in swrite (system write calls per second), usr decreases by 0.0179 units.
- For every 1 unit increase in exec (system exec calls per second), usr decreases by 1.582 units.
- For every 1 unit increase in rchar (characters transferred per second by system read calls), usr decreases by approximately 8.072e-06 units.
- For every 1 unit increase in pgscan (pages checked per second for whether they can be freed), usr decreases by 1.4133 units.
- For every 1 unit increase in freemem (memory pages available to user processes), usr decreases by 0.000507 units.
- For every 1 unit increase in freeswap (disk blocks available for page swapping), usr increases by 8.553e-06 units.
- For a process with runqsz_Not_CPU_Bound = 1 (a run queue that is not CPU bound), usr is 1.238 units higher, all else equal.
Observations:
- The RMSE calculated on the test data is not much different from the RMSE on the training data.

Key Takeaways:
- The R-squared of the model is 0.736 and the adjusted R-squared is 0.735, which shows that the model is able to explain roughly 74% of the variance in the data. This is quite good.
- A unit increase in lwrite will result in a 0.1751 unit increase in usr, all other variables remaining constant.
- The usr of a Not_CPU_Bound process will be 1.24 units higher, all other variables remaining constant.
- RMSE on the train data: 5.0275
- RMSE on the test data: 5.2515
- We can see that the RMSE values on the train and test sets are comparable, so our model is not suffering from overfitting.
- Hence, we can conclude the final model is good for prediction as well as inference purposes.
Inference:
We constructed a number of models by removing variables one at a time in order to produce an effective model. The variables were eliminated by taking into account several aspects, such as R-squared, adjusted R-squared, p-values and VIF. Beforehand, we had to clean up the data by handling the outliers and imputing the missing values before moving on to the linear regression model. We first tried building a linear regression without treating the outliers, which gave a low R-squared value, showing that that model was not efficient: before outlier treatment, the R-squared and adjusted R-squared values were 74% and 73% respectively. Since this score was considered too low, we moved on to building a more effective linear regression model.
The variable column lists the variables we dropped one by one, with the corresponding changes in R-squared and adjusted R-squared filled in beside them. It shows we started at 0.794 and ended at 0.736; even though this is a decrease in R-squared, the reason we chose to drop those variables is that they exhibited multicollinearity with the other independent variables, which undermines an effective model.
For linear regression, the residuals should be normally distributed; in our model the residuals are close to a normal distribution, which supports the model. The Shapiro test checks whether the residuals are normally distributed; the p-value on the Shapiro test (p = 1.3018460702341419e-37) is less than 0.05, so strictly the residuals are not normally distributed. A homoscedasticity test was performed to check whether non-constant variance in the error terms results in heteroscedasticity; the p-value on this test is 0.03885140, so the null hypothesis is rejected and we can say the residuals are heteroscedastic. Last but not least, our model worked well on both the training and test data, as tested using RMSE: the RMSE values on the train and test sets are comparable (train data: 5.027, test data: 5.251). Therefore, our model is not suffering from overfitting.
The dependent variable usr (portion of time CPUs run in user mode) rises as the following variables fall:
lread - reads (transfers per second) between system memory and user memory
scall - number of system calls of all types per second
rchar - number of characters transferred per second by system read calls
wchar - number of characters transferred per second by system write calls
ppgin - number of pages paged in per second
pflt - number of page faults caused by protection errors (copy-on-writes)
runqsz - process run queue size
freemem - number of memory pages available to user processes
freeswap - number of disk blocks available for page swapping
Through this model, we advise that there is a greater likelihood of an increase in the amount of time CPUs spend in user mode when the aforementioned factors are kept low.

Problem 2 - Define the problem and perform exploratory Data Analysis


- Problem definition - Check shape, Data types, statistical summary - Univariate analysis -
Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables

Observation:
- Data set contains 1437 rows with 10 columns (Shape).
Observation:
 Data set contains missing values in Wife_age and No_of_children_born.
 Data set contains object variables: 'Wife_ education', 'Husband_education', 'Wife_religion', 'Wife_Working', 'Standard_of_living_index', 'Media_exposure' and 'Contraceptive_method_used'.
 Data set contains 2 float columns, 1 integer column and 7 object columns.

Observations:

 Using describe(include='all'), many of the summary statistics show NaN, mostly because only 3 variables are numeric.

Observations:

 The minimum age of a wife is 16 and the maximum is 49.
 The minimum number of children born is 0 and the maximum is 16.
 The minimum of Husband_Occupation is 1 and the maximum is 4.

Univariate analysis
Checking the spread of the data using boxplot and
histogram for the continuous variables.
Observation:

 No variable except No_of_children_born has outliers.


Observations:

 Only the No_of_children_born variable is skewed in this data; it is right skewed, with a long tail toward larger families.

Multivariate analysis

Checking for Correlations.


Observations:

 Correlation values near 1 or -1 indicate strong positive or strong negative correlation respectively; values near 0 indicate little or no correlation.
 The variables 'Wife_age', 'No_of_children_born' and 'Husband_Occupation' show correlations with one another.
 Wife_age vs No_of_children_born has a correlation of 54%.
Pairplot using sns
Observations:

 No prominent relationship between the variables can be seen in this pairplot.
 The pairplot shows the relationship between the variables in the form of scatterplots.
 The pairplot shows how every pair of variables relates; using these variables, we need to predict whether a contraceptive method is used or not.
 The data is well spread with no degenerate variables, which should help the models perform well.
 Each variable appears to contribute comparably to the Contraceptive_method_used dependent variable.

Problem 2 - Data Pre-processing


Prepare the data for modelling: - Missing value Treatment (if needed) - Outlier
Detection(treat, if needed) - Feature Engineering (if needed) - Encode the data
- Train-test split

Checking duplicate and Null values.

Are there any duplicates ?

Observations:

 There are 80 duplicate rows in total.
 Let us drop these duplicate rows, as sketched below.
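A sketch of the duplicate check and removal, assuming the Problem 2 data sit in a DataFrame named df2 (a hypothetical name):

# Check for and drop exact duplicate rows.
print(df2.duplicated().sum())  # 80 duplicate rows
df2 = df2.drop_duplicates()
print(df2.duplicated().sum())  # 0 -> all duplicates removed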

Observations:

 All the duplicate rows are now removed.

Are there any missing values?

Observations:

 The dataset has no null values except in the 'Wife_age' and 'No_of_children_born' columns.
 Let us use a for loop to treat these null values by replacing them with the column means, as in Problem 1.

Imputing missing values


Observations:

 After the treatment, the null values in the dataset are cleared, with no disturbance to the rest of the data.
 Logistic regression is sensitive to null values.

Observations:

 After treating the missing values, we can see there are no more missing values in the dataset.
Outlier Checks

Observations:

 We can see there are no outliers in most of the variables; the exception is 'No_of_children_born', which does not need to be treated, since treating its few outliers could disturb the whole dataset.

Encoding
Checking the unique values of the categorical variables.
Observations:

 The data has been encoded, which enables us to use it in different models such as Logistic Regression, LDA and CART (see the sketch below).
 Contraceptive_method_used has two unique values, "Yes" and "No"; these values are encoded to "0" and "1" respectively.
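A sketch of the encoding, mapping the target as stated above and one-hot encoding the remaining object predictors:

import pandas as pd

# Encode the target: "Yes" -> 0, "No" -> 1, as stated above.
df2["Contraceptive_method_used"] = df2["Contraceptive_method_used"].map(
    {"Yes": 0, "No": 1})

# One-hot encode the remaining object predictors for use in
# Logistic Regression, LDA and CART.
df2 = pd.get_dummies(df2, drop_first=True)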

Copy all the predictor variables into X dataframe

Copy target into the y dataframe.

Split X and y into training and test set in 75:25 ratio

Observations:

 The Contraceptive_method_used variable is taken as the y variable (dependent variable) and all other variables are taken as the X variables (independent variables).
 The given dataset is split 75:25; 75% of the data is used for training and 25% is held out for testing the model (see the sketch below).
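A sketch of the split and the logistic regression fit (the solver defaults and random_state are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df2.drop(columns=["Contraceptive_method_used"])
y = df2["Contraceptive_method_used"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print(logreg.score(X_train, y_train))  # training accuracy
print(logreg.score(X_test, y_test))    # test accuracy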
Problem 2 - Model Building and Compare the Performance of the Models

 Build a Logistic Regression model - Build a Linear Discriminant Analysis


model - Build a CART model - Prune the CART model by finding the best
hyperparameters using GridSearch - Check the performance of the
models across train and test set using different metrics - Compare the
performance of all the models built and choose the best one with
proper rationale

Fit the Logistic Regression model

Predicting on Training and Test dataset

Getting the Predicted Classes and Probs


Model Evaluation
Accuracy - Training Data

 An AUC value closer to 1 indicates good separability between the predicted classes, and thus that the model is good for prediction.
 The ROC curve visually represents this concept: the plot should be as far as possible from the diagonal (sketched below).
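A sketch of the AUC/ROC computation on the training data (the test-set version is identical with X_test and y_test):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

probs_train = logreg.predict_proba(X_train)[:, 1]
print("AUC:", roc_auc_score(y_train, probs_train))

fpr, tpr, _ = roc_curve(y_train, probs_train)
plt.plot(fpr, tpr, label="ROC")
plt.plot([0, 1], [0, 1], "--", label="chance (diagonal)")
plt.legend()
plt.show()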

AUC and ROC for the training data


Accuracy - Test Data

AUC and ROC for the test data


Confusion Matrix for the training data
Confusion Matrix for test data
Applying GridSearchCV for Logistic Regression
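A sketch of the grid search; this parameter grid is an assumption, not necessarily the one used in the report:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],  # liblinear supports both penalties
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="accuracy", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
best_logreg = grid.best_estimator_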
Prediction on the training set

Getting the probabilities on the test set

Confusion matrix on the training data


Confusion matrix on the test data
Accuracy of predicting 0s and 1s correctly

Conclusion:

Note:

Precision: of all the observations predicted as positive, how many are actually positive.

Recall: of all the observations that are actually positive, how many are predicted as positive.

Inferences:

For predicting Contraceptive_method_used (label 0, i.e. a contraceptive method is used, per the encoding above):

Precision (61%) - of all the women predicted to use a contraceptive method, 61% actually do.

Recall (46%) - of all the women who actually use a contraceptive method, 46% have been predicted correctly.

For predicting Contraceptive_method_used (label 1, i.e. no contraceptive method is used):

Precision (62%) - of all the women predicted not to use a contraceptive method, 62% actually do not.

Recall (74%) - of all the women who actually do not use a contraceptive method, 74% have been predicted correctly.

Overall accuracy of the model - 61% of the total predictions are correct.

Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This suggests no overfitting or underfitting has happened, and overall the model is a good model for classification.

Applying Standard Scaler to scale the data

LDA Model
Split X and y into training and test set in 75:25 ratio

Build LDA Model
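A sketch of the scaling and LDA fit on the same 75:25 split:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)  # reuse the training-set scaling

lda = LinearDiscriminantAnalysis()
lda.fit(X_train_s, y_train)
print(lda.score(X_train_s, y_train))  # training accuracy
print(lda.score(X_test_s, y_test))    # test accuracy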

Prediction
Training Data and Test Data Confusion Matrix Comparison
Training Data and Test Data Classification Report Comparison

Inferences
Note:

Precision: of all the observations predicted as positive, how many are actually positive.

Recall: of all the observations that are actually positive, how many are predicted as positive.

For predicting Contraceptive_method_used (label 0, i.e. a contraceptive method is used):

Precision (68%) - of all the women predicted to use a contraceptive method, 68% actually do.

Recall (50%) - of all the women who actually use a contraceptive method, 50% have been predicted correctly.

For predicting Contraceptive_method_used (label 1, i.e. no contraceptive method is used):

Precision (62%) - of all the women predicted not to use a contraceptive method, 62% actually do not.

Recall (77%) - of all the women who actually do not use a contraceptive method, 77% have been predicted correctly.

Overall accuracy of the model - 68% of the total predictions are correct.

Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This suggests no overfitting or underfitting has happened, and overall the model is a good model for classification.

Probability prediction for the training and test data


Observations:

 AUC for the Training Data: 0.724


 AUC for the Test Data: 0.651

Generate coefficients and intercept for the Linear Discriminant Function.
By the above equation and the coefficients, it is clear that:

 the predictor 'Wife_ education_Tertiary' has the largest magnitude and thus contributes the most to the classification.
 the predictor 'Standard_of_living_index_Very Low' has the smallest magnitude and thus contributes the least.

Using LDA for Dimensionality Reduction

The output above confirms we have only 1 feature for all the records in our training and test sets.
The output below shows that with only this single feature, our machine learning model achieves an accuracy of 61%, which is the same as the accuracy achieved using all the features (sketched below).
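A sketch of that reduction: with a binary target, LDA projects the data onto a single discriminant axis (n_components = n_classes - 1 = 1), and a classifier trained on that one feature reaches the accuracy quoted above:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

lda_1d = LinearDiscriminantAnalysis(n_components=1)
X_train_1d = lda_1d.fit_transform(X_train_s, y_train)
X_test_1d = lda_1d.transform(X_test_s)
print(X_train_1d.shape)  # (n_train_rows, 1) -> a single feature

clf = LogisticRegression()
clf.fit(X_train_1d, y_train)
print(clf.score(X_test_1d, y_test))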

Building a Decision Tree Classifier


Variable Importance

Predicting Test Data

Regularising the Decision Tree


Adding Tuning Parameters
Generating New Tree
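A sketch of the regularised tree; this hyperparameter grid is an assumption, not necessarily the one used in the report:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [5, 10, 20],
    "min_samples_split": [10, 30, 50],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    scoring="accuracy", cv=5)
grid.fit(X_train, y_train)
pruned_tree = grid.best_estimator_

# Variable importance of the pruned tree.
for name, imp in zip(X_train.columns, pruned_tree.feature_importances_):
    print(name, round(imp, 3))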

Variable Importance

Predicting on Training and Test dataset

Predictions are generated for the 1044 training rows and the 349 test rows.


Getting the Predicted Classes.

Getting the Predicted Probabilities


Model Evaluation
Measuring AUC-ROC Curve

AUC and ROC for the training data


AUC and ROC for the test data
Confusion Matrix for the training data
Confusion Matrix for test data

Conclusion:

Accuracy on the Training Data: 92%


Accuracy on the Test Data: 59%
AUC on the Training Data: 98%
AUC on the Test: 56%
Accuracy, AUC, precision and recall for the test data are not in line with the training data. This indicates that overfitting has happened, and overall this model is not a good model for classification.

Key Takeaways:

We constructed three different models (Logistic Regression, LDA and CART) to predict the Contraceptive_method_used dependent variable. By taking into account several aspects such as coefficients, AUC, accuracy, precision, recall and f1-score, we were able to compare the models against each other. Beforehand we did the encoding to make sure the data was ready for building the Logistic Regression, LDA and CART models: outliers were treated and object variables were encoded into numeric variables.
As explained in the model comparison, the CART model performed better than the other models. The variable with the highest coefficient is the main contributor to predicting the dependent variable. In our case Contraceptive_method_used is the dependent variable and all other variables are independent variables. All the variables have positive coefficients, which shows that a unit increase in an independent variable changes the dependent variable by the coefficient times that unit.
For example: a unit increase in Wife_age impacts Contraceptive_method_used by 0.33 times, and a unit increase in No_of_children_born impacts it by 0.25 times.
