Great Learning Predictive Modelling Project

The document discusses two problems related to predictive modeling. Problem 1 deals with linear regression on system activity data to predict the 'usr' variable. Various data cleaning steps are described along with feature engineering, model building and selection. Problem 2 deals with predicting contraceptive use from survey data using logistic regression, LDA and CART. Model performance is evaluated on train and test sets.

Uploaded by rameshj16708

Problem 1: Linear Regression

The comp-activ database is a collection of computer system activity measures.


The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running
in a multi-user university department. Users would typically be doing a large variety of tasks
ranging from accessing the internet and editing files to running very CPU-bound programs.
As a budding data scientist, you want to find a linear equation to build a model that
predicts 'usr' (the portion of time (%) that CPUs run in user mode) and to find out how
each system attribute affects the time the system spends in 'usr' mode.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the
Data types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis,
Multivariate Analysis.

Solution:

EDA:

Total number of rows and columns:

Reviewing info of the given data


Describing the data:

Checking for null value:


Univariate analysis:
Bivariate analysis:
Correlation matrix:

Insights:
• The data consists of both categorical and numerical values.
• There are 8192 rows and 22 columns in the dataset. Of the 22 columns, only 1
column is of object type, 8 are of integer type and 13 are of float type.
• ‘usr’ is the target variable and all other columns are predictors.
• Looking at the individual fields, outliers are present.
• There are no duplicate records.
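
The checks listed in 1.1 (shape, data types, 5 point summary, null values, duplicates, correlation matrix) could be sketched with pandas roughly as follows. This is a minimal illustration, not the code used in the project: the data frame is a small synthetic stand-in, and the categorical column name `runqsz` and its values are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the comp-activ data (assumption: the real file is not shown here).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "rchar": rng.normal(2e5, 5e4, n),
    "wchar": rng.normal(1e5, 3e4, n),
    "runqsz": rng.choice(["CPU_Bound", "Not_CPU_Bound"], n),  # assumed categorical column
    "usr": rng.uniform(0, 100, n),
})
# Inject a few missing values so the null check has something to report.
df.loc[[3, 17], "rchar"] = np.nan
df.loc[[5], "wchar"] = np.nan

shape = df.shape                               # (rows, columns)
dtype_counts = df.dtypes.value_counts()        # how many columns per data type
# 5 point summary of the numeric columns: min, Q1, median, Q3, max.
five_point = df.describe().loc[["min", "25%", "50%", "75%", "max"]]
null_counts = df.isnull().sum()                # missing values per column
n_duplicates = int(df.duplicated().sum())      # fully duplicated rows
corr = df.select_dtypes(include="number").corr()  # correlation matrix of numeric columns

print(shape)
print(null_counts)
print(five_point)
```

On the real data, `df` would come from reading the provided file instead of being generated.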
1.2 Impute null values if present, also check for the values which are equal to zero. Do
they have any meaning or do we need to change them or drop them? Check for the
possibility of creating new features if required. Also check for outliers and duplicates if
there.

Solution:

There are null values in 2 fields.


Below is the view after replacing the null values:
Outliers are also present. We do not treat them here, since doing so may have a
considerable impact on the outcome.

Insights:
• Null values were observed in 2 fields: rchar and wchar.
• We imputed the null values with the mean of the respective field.
• Most of the continuous fields have outliers, which we have not treated.
• There are no duplicates.
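
The imputation and checks described above could look roughly like the sketch below. It assumes column-wise mean imputation for `rchar` and `wchar`, uses a tiny hand-made data frame rather than the project data, and counts outliers with the common 1.5 × IQR rule, which is an assumption since the report does not say how outliers were identified.

```python
import numpy as np
import pandas as pd

# Tiny hand-made example (assumption: not the project data).
df = pd.DataFrame({
    "rchar": [100.0, np.nan, 300.0, 200.0, np.nan],
    "wchar": [50.0, 70.0, np.nan, 0.0, 80.0],
    "usr":   [90.0, 85.0, 0.0, 95.0, 88.0],
})

# Replace missing values in each field with that field's mean.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())

# Count exact zeros per column so they can be reviewed for meaning.
zero_counts = (df == 0).sum()

def iqr_outlier_count(series: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences (assumed outlier rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())

print(df)
print(zero_counts)
print(iqr_outlier_count(df["usr"]))
```

Whether a zero is a genuine measurement or a placeholder has to be judged per field; the count only shows where to look.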

1.3 Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit-learn. Perform checks for significant variables
using an appropriate method from statsmodels. Create multiple models and check the
performance of predictions on Train and Test sets using R-squared, RMSE & Adjusted R-squared.
Compare these models and select the best one with appropriate reasoning.

Solution:
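
A minimal sketch of the encode, split and fit steps with scikit-learn is given below. It runs on synthetic data, and the categorical column `runqsz`, the coefficients used to generate the target, and the helper `regression_report` are all assumptions made for illustration. Adjusted R-squared is computed from its usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic predictors and target (assumption: stand-in for the real data).
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "rchar": rng.normal(0, 1, n),
    "wchar": rng.normal(0, 1, n),
    "runqsz": rng.choice(["CPU_Bound", "Not_CPU_Bound"], n),  # assumed string column
})
y = (80 - 3 * X["rchar"] - 2 * X["wchar"] + rng.normal(0, 1, n)).rename("usr")

# One-hot encode the string column, dropping one level to avoid redundancy.
X_enc = pd.get_dummies(X, columns=["runqsz"], drop_first=True, dtype=float)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_enc, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

def regression_report(model, X, y):
    """Return R-squared, adjusted R-squared and RMSE for one data split."""
    pred = model.predict(X)
    r2 = r2_score(y, pred)
    n_obs, n_feat = X.shape
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_feat - 1)
    rmse = float(np.sqrt(mean_squared_error(y, pred)))
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}

train_scores = regression_report(model, X_train, y_train)
test_scores = regression_report(model, X_test, y_test)
print(train_scores)
print(test_scores)
```

Comparing the train and test rows side by side is one way to judge whether a candidate model overfits.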
OLS Regression results:
Problem 2:

You are a statistician at the Republic of Indonesia Ministry of Health and you are provided
with data on 1473 females collected from a Contraceptive Prevalence Survey. The
samples are married women who were either not pregnant or did not know if they were at
the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice based on
their demographic and socio-economic characteristics.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, check for duplicates and outliers and write an inference on it. Perform
Univariate, Bivariate and Multivariate Analysis.

Solution:

Top 5 rows:

Last 5 rows:
Data types:

Data describe:

Check for outliers:

Univariate analysis:
Bivariate analysis:

Multivariate analysis

Insights:
• The data consists of both categorical and numerical values.
• There are 1473 rows and 10 columns. Of the 10 columns, 7 are of
object type, 1 is of integer type and the remaining 2 are of float type.
• ‘contraceptive used’ is the target variable and the others are predictor variables.
• In the univariate analysis, outliers appear only in the field
‘Number of children’.
• There is a positive correlation between wife’s age and number of children.
• Null values were identified and imputed.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis) and CART.

Solution:
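
The encode, split and fit steps for the three classifiers could be sketched as follows. The data is synthetic and unscaled, all column names and category labels are hypothetical, and the CART settings (`max_depth=4`) are illustrative rather than the values used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic survey-like data (assumption: hypothetical column names and values).
rng = np.random.default_rng(3)
n = 600
df = pd.DataFrame({
    "wife_age": rng.integers(16, 50, n).astype(float),
    "num_children": rng.integers(0, 8, n).astype(float),
    "wife_education": rng.choice(
        ["Uneducated", "Primary", "Secondary", "Tertiary"], n
    ),
    "media_exposure": rng.choice(["Exposed", "Not-Exposed"], n),
})
logit = -1.0 + 0.5 * df["num_children"] - 0.03 * df["wife_age"]
df["contraceptive_used"] = np.where(
    rng.random(n) < 1 / (1 + np.exp(-logit)), "Yes", "No"
)

# Encode string columns: ordinal codes for education, dummies for the binary field.
education_order = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
X = df.drop(columns="contraceptive_used").copy()
X["wife_education"] = X["wife_education"].map(education_order)
X = pd.get_dummies(X, columns=["media_exposure"], drop_first=True, dtype=float)
y = (df["contraceptive_used"] == "Yes").astype(int)

# 70:30 split, stratified so both sets keep a similar class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "cart": DecisionTreeClassifier(max_depth=4, random_state=1),
}
for clf in models.values():
    clf.fit(X_train, y_train)

test_accuracy = {name: clf.score(X_test, y_test) for name, clf in models.items()}
print(test_accuracy)
```

Limiting the tree depth is one simple way to keep CART from memorising the training set; the best value would normally be tuned.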

Logistic classification report:


2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Final Model: Compare the models and write an inference on which model is
best/optimized.

Solution:
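
The metrics named in 2.3 can be collected for each model on both splits with a small helper, sketched below. The data comes from scikit-learn's `make_classification` as a stand-in for the encoded survey data, the helper `evaluate` is hypothetical, and the ROC curve is returned as arrays rather than plotted.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, roc_auc_score, roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: not the survey data).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

def evaluate(model, X, y):
    """Return accuracy, confusion matrix, ROC AUC and ROC curve points."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y, proba)
    return {
        "accuracy": accuracy_score(y, pred),
        "confusion_matrix": confusion_matrix(y, pred),
        "roc_auc": roc_auc_score(y, proba),
        "roc_curve": (fpr, tpr),
    }

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "cart": DecisionTreeClassifier(max_depth=4, random_state=1),
}
results = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    results[name] = {
        "train": evaluate(clf, X_train, y_train),
        "test": evaluate(clf, X_test, y_test),
    }

for name, res in results.items():
    print(name, res["train"]["accuracy"], res["test"]["accuracy"],
          res["test"]["roc_auc"])
```

The `(fpr, tpr)` pairs can be passed to any plotting library to draw the ROC curves; a large gap between train and test scores for a model suggests overfitting.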
