Great Learning Predictive Modelling Project
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the
Data types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis,
Multivariate Analysis.
Solution:
EDA:
Solution:
Insights:
Null values were observed in two fields: rchar and wchar.
We imputed the null values with the mean of the respective field.
Most of the continuous fields have outliers; we have not treated them.
There are no duplicate rows.
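The checks described in these insights can be sketched with pandas as below. This is only an illustration, not the project's actual notebook: the toy values are made up, only the field names rchar and wchar come from the report, and the 1.5 * IQR rule is an assumed way of flagging outliers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the project data; the values are made up for illustration.
df = pd.DataFrame({
    "rchar": [100.0, 120.0, np.nan, 110.0, 5000.0],
    "wchar": [50.0, np.nan, 55.0, 60.0, 52.0],
})

# Count missing values per field.
null_counts = df.isnull().sum()

# Impute each numeric field with its own mean.
df_imputed = df.fillna(df.mean(numeric_only=True))

# Count fully duplicated rows.
n_duplicates = int(df_imputed.duplicated().sum())

# Flag outliers with the 1.5 * IQR rule (assumed rule; flag only, no treatment).
q1 = df_imputed.quantile(0.25)
q3 = df_imputed.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df_imputed < q1 - 1.5 * iqr) | (df_imputed > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(null_counts)
print("Duplicate rows:", n_duplicates)
print(outlier_counts)
```

Leaving the outliers untreated, as in the report, means the imputed mean is itself pulled towards the extreme values; a median would be less sensitive, but that is a different choice from the one made here.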
1.3 Encode the data (having string values) for Modelling. Split the data into train and test
(70:30). Apply Linear regression using scikit-learn. Perform checks for significant variables
using appropriate method from statsmodels. Create multiple models and check the
performance of Predictions on Train and Test sets using R-squared, RMSE & Adjusted R-squared.
Compare these models and select the best one with appropriate reasoning.
Solution:
OLS Regression results:
Problem 2:
You are a statistician at the Republic of Indonesia Ministry of Health and you are provided
with data on 1473 females collected from a Contraceptive Prevalence Survey. The
samples are married women who were either not pregnant or did not know if they were at
the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice based on
their demographic and socio-economic characteristics.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, check for duplicates and outliers and write an inference on it. Perform
Univariate and Bivariate Analysis and Multivariate Analysis.
Solution:
Top 5 rows:
Last 5 rows:
Data types:
Data describe:
Univariate analysis:
Bivariate analysis:
Multivariate analysis:
Insights:
The data consists of both categorical and numerical values.
There are 1473 rows and 10 columns in total. Of the 10 columns, 7 are of
object type, 1 is of integer type and the remaining 2 are of float type.
‘contraceptive used’ is the target variable and the others are predictor variables.
In the univariate analysis, outliers appear only in the field
‘Number of children’.
There is a positive correlation between wife’s age and number of children.
Null values were identified and imputed.
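The structural checks behind these insights (shape, mix of column types, and the correlation between wife's age and number of children) can be sketched as follows. The column names and the six rows are hypothetical stand-ins, not the survey data.

```python
import pandas as pd

# Hypothetical miniature of the survey table; names and values are illustrative.
df = pd.DataFrame({
    "wife_age": [24.0, 45.0, 43.0, 42.0, 36.0, 19.0],
    "num_children": [3.0, 10.0, 7.0, 9.0, 8.0, 0.0],
    "wife_education": ["Primary", "Uneducated", "Primary",
                       "Secondary", "Secondary", "Tertiary"],
    "contraceptive_used": ["No", "No", "No", "Yes", "Yes", "Yes"],
})

# Rows and columns.
n_rows, n_cols = df.shape

# Split columns into numeric and non-numeric (categorical) groups.
numeric_cols = list(df.select_dtypes(include="number").columns)
categorical_cols = list(df.select_dtypes(exclude="number").columns)

# Pearson correlation between the two numeric fields.
age_children_corr = df["wife_age"].corr(df["num_children"])

# Class balance of the target variable.
target_counts = df["contraceptive_used"].value_counts()

print(n_rows, n_cols)
print(numeric_cols, categorical_cols)
print(round(age_children_corr, 2))
print(target_counts)
```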
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis) and CART.
Solution:
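One way the steps in 2.2 could be carried out with scikit-learn is sketched below, assuming synthetic data. The column names, the synthetic target rule, one-hot encoding via get_dummies and the use of accuracy as the score are all assumptions for illustration; the unscaled inputs, the 70:30 split and the three model types follow the question.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (hypothetical columns and values).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "wife_age": rng.integers(16, 50, size=n),
    "num_children": rng.integers(0, 10, size=n),
    "wife_working": rng.choice(["Yes", "No"], size=n),
})
# Synthetic target loosely tied to the number of children.
noisy_score = df["num_children"] + rng.normal(0, 2, size=n)
df["contraceptive_used"] = np.where(noisy_score > 4, "Yes", "No")

# Encode string columns as 0/1 indicators; no scaling is applied.
X = pd.get_dummies(df.drop(columns="contraceptive_used"), drop_first=True, dtype=int)
y = (df["contraceptive_used"] == "Yes").astype(int)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=1),
}

# Fit each model and record train and test accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
    }

print(pd.DataFrame(results).T)
```

An unpruned CART tree will usually score much higher on train than on test, so a large gap between the two columns is a sign that parameters such as max_depth need tuning.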