Predictive Modelling Project
- Problem definition
- Check shape, data types, statistical summary
- Univariate analysis
- Multivariate analysis
- Use appropriate visualizations to identify patterns and insights
- Key meaningful observations on individual variables and the relationships between variables
Observations:
- The data set contains 8192 rows and 22 columns (shape check).
Observations:
- The data set contains missing values in (rchar) and (wchar).
- The data set contains an object column named 'runqsz'.
- The data set contains 13 float columns, 8 integer columns and 1 object column (see the sketch below).
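A minimal sketch of these checks, assuming the data has been loaded into a pandas DataFrame named df (the file name is an assumption):

```python
import pandas as pd

# Load the data (file name is an assumption)
df = pd.read_csv("compactiv.csv")

print(df.shape)           # (8192, 22): rows and columns
print(df.dtypes)          # 13 float64, 8 int64, 1 object ('runqsz')
print(df.isnull().sum())  # missing values expected only in 'rchar' and 'wchar'
```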
Observations:
- lread (reads, i.e. transfers per second between system memory and user memory) ranges from 0 to 1845.
- lwrite (writes, i.e. transfers per second between system memory and user memory) ranges from 0 to 575.
- scall (number of system calls of all types per second) ranges from 109 to 12493.
- sread (number of system read calls per second) ranges from 6 to 5318.
- swrite (number of system write calls per second) ranges from 7 to 5456.
- fork (number of system fork calls per second) ranges from 0 to 20.12.
- exec (number of system exec calls per second) ranges from 0 to 59.56.
- rchar (number of characters transferred per second by system read calls) ranges from 278 to 2526649.
- wchar (number of characters transferred per second by system write calls) ranges from 1498 to 1801623.
- pgout (number of page-out requests per second) ranges from 0 to 81.44.
- ppgout (number of pages paged out per second) ranges from 0 to 184.20.
- pgfree (number of pages per second placed on the free list) ranges from 0 to 523.
- pgscan (number of pages checked per second for whether they can be freed) ranges from 0 to 1237.
- atch (number of page attaches per second, i.e. satisfying a page fault by reclaiming a page in memory) ranges from 0 to 211.58.
- pgin (number of page-in requests per second) ranges from 0 to 141.20.
- ppgin (number of pages paged in per second) ranges from 0 to 292.61.
- pflt (number of page faults caused by protection errors, i.e. copy-on-writes) ranges from 0 to 899.80.
- vflt (number of page faults caused by address translation) ranges from 0.2 to 1365.
- freemem (number of memory pages available to user processes) ranges from 55 to 12027.
- freeswap (number of disk blocks available for page swapping) ranges from 2 to 2243187.
- usr (portion of time, in %, that CPUs run in user mode) ranges from 0 to 99.
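These ranges come directly from the statistical summary; for example:

```python
# Five-number summary per column; min and max give the ranges listed above
print(df.describe().T[["min", "max"]])
```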
Observations:
- All the variables except (usr) have outliers.
- Every column has outliers, and linear regression is sensitive to outliers. In my opinion, however, outlier treatment is not appropriate at this stage, because each observation is a genuine, unique entry.
- Treating the outliers would alter the original values of the data and could lead to wrong predictions, so for now we proceed with the outliers in place (they are revisited before modelling; see the cap-and-floor step below).
- In every column, 0 plays an important role, as it accounts for much of the spread in the range of the data. If we treated the zeros, the data would change (as with null values), and the real data may legitimately contain 0, so we proceed with them as well.
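A sketch of the outlier check, assuming seaborn/matplotlib and the df DataFrame from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One boxplot per numeric column to eyeball outliers
for col in df.select_dtypes(include="number").columns:
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()
```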
Observations:
- All the variables in this data are right skewed (long right tails); only (usr) is left skewed.
- The categorical column was dropped before computing correlations, since correlation applies only to numerical values.
Observations:
- The variables 'lread', 'lwrite', 'scall', 'sread', 'swrite', 'fork', 'exec', 'rchar', 'wchar', 'pgout', 'ppgout', 'pgfree', 'pgscan', 'atch', 'pgin', 'ppgin', 'pflt', 'vflt', 'freemem', 'freeswap' and 'usr' show correlations with one another.
- Correlation values near 1 or -1 indicate strong positive or strong negative correlation respectively; values near 0 indicate that variables are not correlated with each other.
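A sketch of the correlation step under the same assumptions, dropping the object column 'runqsz' first:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation is defined only for numeric columns, so drop 'runqsz'
corr = df.drop(columns=["runqsz"]).corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()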
Observations:
- No single prominent relationship between the variables stands out in this pairplot.
- The pairplot shows the relationships between variables as scatterplots and each variable's distribution as a histogram. From the histograms we can see that most of the dataset is right skewed.
- As the given data set contains a large number of columns, the pairplot looks a little cluttered.
- From the plot we can see that some column pairs have positive correlation, some have no correlation, and some have negative correlation (see the sketch below).
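A sketch of the pairplot call referenced above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatterplots off-diagonal, histograms on the diagonal;
# with 21 numeric columns the grid is large and hard to read
sns.pairplot(df.select_dtypes(include="number"), diag_kind="hist")
plt.show()
```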
- Now let us split the data and build a model.
Observations:
- There are no duplicated columns, and the data set has no duplicate rows either.
- Apart from the rchar and wchar columns, the data set has no null values.
- Let us use a for loop to treat these null values by replacing them with the column means (see the sketch below).
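A minimal sketch of that loop, with the column names taken from the missing-value check above:

```python
# Replace nulls with the column mean, one column at a time
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())

print(df.isnull().sum().sum())  # 0 remaining nulls expected
```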
Observations:
- After the treatment, the null values in the data set are gone, with no disturbance to the rest of the data.
- Linear regression is sensitive to null values.
Observations:
- After treating the missing values, we can see there are no more missing values in the dataset.
Observations:
- We can see there are outliers; we treat them before building the model using a technique known as cap and floor, so that the data does not get distorted (see the sketch below).
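A sketch of the cap-and-floor treatment; the 1.5 x IQR whisker bounds are an assumption, as the report does not state the exact caps:

```python
# Cap and floor (winsorise) each numeric column at the IQR whiskers
for col in df.select_dtypes(include="number").columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```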
Observations:
- As we can see, after treating the outliers with the cap-and-floor technique, all of them have been adjusted.
Observations:
- This is the final data after the label encoding step, replacing 'CPU_Bound' with 1 and 'Not_CPU_Bound' with 0 (a sketch of the encoding follows).
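The exact encoding call is an assumption: the sentence above states one 0/1 mapping, while the column name 'runqsz_Not_CPU_Bound' used in the model summary later implies a dummy that is 1 for Not_CPU_Bound. The sketch below follows the column name; verify against the report's mapping before use:

```python
import pandas as pd

# One-hot encode 'runqsz', keeping a single dummy column; this yields
# 'runqsz_Not_CPU_Bound' (1 = Not_CPU_Bound, 0 = CPU_Bound)
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True, dtype=int)
```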
Observations:
- This is the dataset with only the independent variables.
Observations:
- The dataset is split into X and y training and test sets in a 75:25 ratio.
- We will build the model on the training data first.
- With the train and test data split, we can proceed to create the linear model. For the OLS model, we can use OLS from the statsmodels API package.
- We then fit the model with X_train and y_train (see the sketch below).
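A sketch of the split and the statsmodels OLS fit; the random_state is an assumption:

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Separate target and predictors ('usr' is the dependent variable)
X = df.drop(columns=["usr"])
y = df["usr"]

# 75:25 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# OLS needs an explicit intercept term
X_train_sm = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_sm).fit()
print(ols_model.summary())  # R-squared, adj. R-squared, p-values
```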
Problem 1 - Model Building - Linear Regression
- Apply linear regression using sklearn.
- Using statsmodels, perform checks for significant variables using the appropriate method.
- Create multiple models and check the performance of predictions on the train and test sets using R-squared, RMSE and adjusted R-squared (a sketch follows below).
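A sketch of the sklearn fit and the train/test scoring, continuing from the split above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit the same regression with sklearn and score both splits
lr = LinearRegression().fit(X_train, y_train)

for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lr.predict(X_)
    rmse = np.sqrt(mean_squared_error(y_, pred))
    print(f"{name}: R2 = {r2_score(y_, pred):.3f}  RMSE = {rmse:.3f}")
```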
Observations:
- In this model, R-squared is 0.794, which is close to 1, so we can say it is a good model; the closer R-squared is to 1, the better the model.
- The variables 'lread', 'lwrite', 'scall', 'swrite', 'exec', 'rchar', 'wchar', 'pgout', 'pgscan', 'atch', 'ppgin', 'pflt', 'vflt', 'freemem', 'freeswap' and 'runqsz_Not_CPU_Bound' have p-values of (approximately) 0, so they are significant variables.
- The variables 'sread', 'fork', 'ppgout', 'pgfree' and 'pgin' are insignificant.
- The R-squared value tells us that the model can explain 79.4% of the variance in the training set.
- The adjusted R-squared is also close to the R-squared, at about 79.4%.
After dropping the features causing strong multicollinearity and the statistically insignificant ones, our model performance has not dropped sharply. This shows that those variables did not have much predictive power. (A sketch of the multicollinearity check follows below.)
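A sketch of the VIF check used to flag multicollinear features, with X_train as in the split sketch above:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor; values above ~5 suggest strong multicollinearity
X_const = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.sort_values(ascending=False))
```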
Testing the Assumptions of Linear Regression
1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of error terms
5. No strong Multicollinearity
Observations:
- Actual values are the observed data points.
- Fitted values are the predicted values.
- Residuals are the errors (actual minus predicted).
Observations:
- There is no pattern in the residuals, so the assumptions of linearity and independence of the predictors are satisfied.
- Visually, the variance of the residuals seems roughly constant (the formal test below is more sensitive).
Observations:
- Since the Shapiro test p-value is well below 0.05 (see the inference section), the residuals are not normally distributed.
Observations:
- Since p-value < 0.05 we can say that the residuals are heteroscedastic.
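A sketch of both residual tests, continuing from the OLS sketch above. Which heteroscedasticity test the report used is an assumption; Goldfeld-Quandt is shown, with Breusch-Pagan as an alternative:

```python
from scipy.stats import shapiro
import statsmodels.stats.api as sms

residuals = ols_model.resid

# Shapiro-Wilk: H0 = residuals are normally distributed
stat, p_norm = shapiro(residuals)
print("Shapiro p-value:", p_norm)    # < 0.05 here, so not normal

# Goldfeld-Quandt: H0 = homoscedastic errors
fval, p_het, _ = sms.het_goldfeldquandt(ols_model.resid, ols_model.model.exog)
print("Het test p-value:", p_het)    # < 0.05 here, so heteroscedastic
```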
Observations:
(All interpretations below hold the other variables constant.)
- For every 1-unit increase in lread (reads per second between system memory and user memory), usr decreases by 0.162 units.
- For every 1-unit increase in lwrite (writes per second between system memory and user memory), usr increases by 0.175 units.
- For every 1-unit increase in scall (system calls of all types per second), usr decreases by 0.000501 units.
- For every 1-unit increase in sread (system read calls per second), usr decreases by 0.00279 units.
- For every 1-unit increase in swrite (system write calls per second), usr decreases by 0.0179 units.
- For every 1-unit increase in exec (system exec calls per second), usr decreases by 1.582 units.
- For every 1-unit increase in rchar (characters transferred per second by system read calls), usr decreases by approximately 8.072e-06 units.
- For every 1-unit increase in pgscan (pages checked per second for whether they can be freed), usr decreases by 1.4133 units.
- For every 1-unit increase in freemem (memory pages available to user processes), usr decreases by 0.000507 units.
- For every 1-unit increase in freeswap (disk blocks available for page swapping), usr increases by 8.553e-06 units.
- When runqsz is Not_CPU_Bound rather than CPU_Bound, usr increases by 1.238 units.
Observations:
- The RMSE calculated on the test data is not much different from the RMSE on the training data (see the sketch below).
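Continuing the OLS sketch above, the test-set RMSE can be computed as:

```python
import numpy as np
import statsmodels.api as sm

# Score the fitted model on the held-out 25%
# (the report quotes train RMSE 5.0275 vs test RMSE 5.2515)
X_test_sm = sm.add_constant(X_test)
test_pred = ols_model.predict(X_test_sm)
print("Test RMSE:", np.sqrt(np.mean((y_test - test_pred) ** 2)))
```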
Key Takeaways:
- The R-squared of the model is 0.736 and the adjusted R-squared is 0.735, which shows that the model is able to explain about 74% of the variance in the data. This is quite good.
- A unit increase in lwrite results in a 0.1751-unit increase in usr, all other variables remaining constant.
- For a Not_CPU_Bound process, usr is higher by 1.24 units, all other variables remaining constant.
- RMSE on the train data: 5.0275
- RMSE on the test data: 5.2515
- The RMSE values on the train and test sets are comparable, so our model is not suffering from overfitting.
- Hence, we can conclude the final model is good for both prediction and inference purposes.
Inference:
We constructed a number of models by removing variables one at a time in order to produce an effective model. Variables were eliminated by taking into account several aspects, such as R-squared, adjusted R-squared, p-values, and VIF. Beforehand, we had to clean the data by handling the outliers and imputing the missing values before moving on to the linear regression model. We first tried to build a linear regression without treating the outliers, which gave a low R-squared and showed that that model was not efficient: before outlier treatment, the R-squared and adjusted R-squared were 74% and 73% respectively. This was considered too low a score, so we moved on to build a more effective linear regression model.
The Variable column lists the variables we dropped one by one, with the corresponding changes in R-squared and adjusted R-squared recorded alongside. It shows we started at 0.794 and ended at 0.736; even though this is a drop in R-squared, we chose to drop these variables because they exhibited multicollinearity with the other independent variables, which undermines an effective model, so the multicollinear variables were removed.
For linear regression, the residuals should be normally distributed; in our model the residual distribution is close in shape to a normal distribution. The Shapiro test checks whether the residuals are normally distributed: its p-value (1.3018460702341419e-37) is less than 0.05, so the residuals are in fact not normally distributed. A homoscedasticity test was performed to check whether non-constant variance in the error terms results in heteroscedasticity; its p-value is 0.03885140, so the null hypothesis is rejected and the residuals are heteroscedastic. Last but not least, our model performed well on both the training and test data, as measured by RMSE: the train and test RMSE values are comparable (train: 5.027, test: 5.251), so our model is not suffering from overfitting.
The dependent variable usr (portion of time CPUs run in user mode) rises as the following variables fall:
lread - reads (transfers per second) between system memory and user memory
scall - number of system calls of all types per second
rchar - number of characters transferred per second by system read calls
wchar - number of characters transferred per second by system write calls
ppgin - number of pages paged in per second
pflt - number of page faults caused by protection errors (copy-on-writes)
runqsz - process run queue size
freemem - number of memory pages available to user processes
freeswap - number of disk blocks available for page swapping
Through this model, we advise that there is a greater likelihood of an increase in the amount of time CPUs run in user mode when the variables above are kept low.
Observation:
- The data set contains 1437 rows and 10 columns (shape check).
Observation:
The data set contains missing values in (Wife_age) and (No_of_children_born).
The data set contains object-typed variables: 'Wife_education', 'Husband_education', 'Wife_religion', 'Wife_Working', 'Standard_of_living_index', 'Media_exposure' and 'Contraceptive_method_used'.
The data set contains 2 float columns, 1 integer column and 7 object columns.
Observations:
By using describe(include='all') we see NaN values in the summary, mainly because only 3 variables are numeric (the numeric statistics are NaN for object columns).
Observations:
Univariate analysis
Checking the spread of the data using boxplots and histograms for the continuous variables.
Observation:
Multivariate analysis
Observations:
Apart from the 'Wife_age' and 'No_of_children_born' columns, the data set has no null values.
Let us use a for loop to treat these null values by replacing them with the column means.
After the treatment, the null values in the data set are gone, with no disturbance to the rest of the data.
Logistic regression is sensitive to null values.
Observations:
After treating the missing values, we can see there are no more missing values in the dataset.
Outlier Checks
Observations:
Encoding
Checking the unique values of the categorical variables.
Observations:
Data has been encoded for the given dataset which enable us to use
the data for different models like Logistic Regression, LDA and CART.
Contraceptive_method_used has two unique values "Yes" and "No",
these values are encoded to "0" and "1" respectively.
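A sketch of the encoding step; the file and DataFrame names are assumptions, and the exact code assignment should be checked against the mapping stated above:

```python
import pandas as pd

# File name is an assumption
cmc = pd.read_csv("Contraceptive_method_dataset.csv")

# Integer-encode every object column; .cat.codes assigns codes
# alphabetically, so verify each mapping against the report's stated
# encoding (e.g. the target's Yes/No coding) before modelling
for col in cmc.select_dtypes(include="object").columns:
    cmc[col] = cmc[col].astype("category").cat.codes
```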
Observations:
An AUC value closer to 1 indicates good separability between the predicted classes, and thus that the model is good for prediction.
The ROC curve visually represents this: the curve should sit as far as possible from the diagonal (see the sketch below).
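A sketch of the AUC/ROC computation; the fitted classifier shown (a logistic regression) and the train/test variable names are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# X_train, X_test, y_train, y_test assumed from the split step
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # probability of class 1

print("AUC:", roc_auc_score(y_test, probs))

fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="chance (diagonal)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```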
Conclusion:
Note:
Precision: of all the observations predicted positive, how many are actually positive.
Recall: of all the observations that are actually positive, how many are predicted positive.
Inferences:
For label 0: Recall (46%) – out of all the women actually in this class, 46% have been predicted correctly.
For label 1: Recall (74%) – out of all the women who actually opt for a contraceptive method of choice, 74% have been predicted correctly.
Overall accuracy of the model – 61% of total predictions are correct.
Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This shows that no overfitting or underfitting has happened and that, overall, the model is a good classification model.
LDA Model
Split X and y into training and test set in 75:25 ratio
Prediction
Training Data and Test Data Confusion Matrix Comparison
Training Data and Test Data Classification Report Comparison
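A sketch of this LDA workflow, continuing from the encoded DataFrame above (the random_state is an assumption):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Separate target and predictors from the encoded data
X = cmc.drop(columns=["Contraceptive_method_used"])
y = cmc["Contraceptive_method_used"]

# 75:25 split, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Train vs test confusion matrices and classification reports
for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lda.predict(X_)
    print(name, "confusion matrix:\n", confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
```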
Inferences
Note:
Precision: of all the observations predicted positive, how many are actually positive.
Recall: of all the observations that are actually positive, how many are predicted positive.
For predicting Contraceptive_method_used (label 0):
Recall (50%) – out of all the women actually in this class, 50% have been predicted correctly.
For label 1:
Recall (77%) – out of all the women who actually opt for a contraceptive method of choice, 77% have been predicted correctly.
Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This shows that no overfitting or underfitting has happened and that, overall, the model is a good classification model.
The output above confirms we have only 1 feature for all the records in our training and test sets.
The output below shows that, with only a single feature, our machine learning model achieves an accuracy of 61%, which is the same as the accuracy achieved using all the features.
Variable Importance
Conclusion:
Key Takeaways:
We constructed three different models (Logistic Regression, LDA and CART) to predict the Contraceptive_method_used dependent variable. By taking into account several aspects such as coefficients, AUC, accuracy, precision, recall and f1-score, we were able to compare the models against each other. Beforehand, we did the encoding to make sure the data were ready for building the Logistic Regression, LDA and CART models: outliers were treated and object variables were encoded into numeric variables.
As explained in the model comparison, the CART model performed better than the other models.
The variable with the highest coefficient is the main contributor in predicting the dependent variable. In our case, Contraceptive_method_used is the dependent variable and all the other variables are independent variables. All the variables have positive coefficients; this shows that a unit increase in an independent variable moves the dependent variable by the size of the coefficient.
For example:
A unit increase in Wife_age impacts Contraceptive_method_used by 0.33; a unit increase in No_of_children_born impacts it by 0.25.