Bike Sharing Analysis
Analysis Workflow
● About Dataset
● Feature Engineering
● Missing Value Analysis
● Outlier Analysis
● Correlation Analysis
● Visualizing Distribution Of Data
● Visualizing Count Vs (Month,Season,Hour,Weekday,Usertype)
● Filling 0's In Windspeed Using Random Forest
● Random Forest Model
About Dataset
Bike sharing systems are a means of renting bicycles where the process of obtaining
membership, rental, and bike return is automated via a network of kiosk locations
throughout a city. Using these systems, people are able to rent a bike from one location
and return it to a different place on an as-needed basis. Currently, there are over 500
bike-sharing programs around the world.
Feature Engineering
In the given dataset, the columns "season", "holiday", "workingday" and "weather" should
be of "categorical" data type, but they are currently stored as "int". We transform the
dataset in the following ways before starting the exploratory data analysis (EDA).
● Create new columns "date", "hour", "weekDay" and "month" from the "datetime" column.
● Coerce the data type of "season", "holiday", "workingday" and "weather" to category.
● Drop the "datetime" column, as the useful features have already been extracted from it.
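The three steps above can be sketched with pandas as follows. This is a minimal illustration, not the exact code used for the analysis: the three sample rows and the helper name engineer_features are hypothetical, and only the column names come from the text.

```python
import pandas as pd

# Hypothetical three-row sample that reuses the dataset's column names.
df = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 08:00:00", "2011-07-04 17:00:00"],
    "season": [1, 1, 3],
    "holiday": [0, 0, 1],
    "workingday": [0, 0, 0],
    "weather": [1, 2, 1],
    "count": [16, 40, 310],
})

def engineer_features(frame):
    """Derive date parts, cast integer codes to category, drop 'datetime'."""
    out = frame.copy()
    stamp = pd.to_datetime(out["datetime"])

    # Step 1: new columns extracted from the timestamp.
    out["date"] = stamp.dt.date
    out["hour"] = stamp.dt.hour
    out["weekDay"] = stamp.dt.day_name()
    out["month"] = stamp.dt.month_name()

    # Step 2: integer-coded columns become categorical.
    for col in ["season", "holiday", "workingday", "weather"]:
        out[col] = out[col].astype("category")

    # Step 3: the raw timestamp is no longer needed.
    return out.drop(columns=["datetime"])

features = engineer_features(df)
print(features.dtypes)
```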
Missing Value Analysis
We then ran a missing value analysis and found no missing values in the given dataset.
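A check of this kind can be as short as counting nulls per column. The sketch below runs on a hypothetical sample frame; on the real data the same two lines would be applied to the full DataFrame.

```python
import pandas as pd

# Hypothetical sample; the real dataset would be loaded from file instead.
df = pd.DataFrame({
    "temp": [9.84, 9.02, 9.02],
    "humidity": [81, 80, 80],
    "windspeed": [0.0, 0.0, 0.0],
    "count": [16, 40, 32],
})

# Number of missing entries in each column.
missing_per_column = df.isnull().sum()
# True if at least one column contains a missing value.
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```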
Outlier Analysis
At first look, the "count" variable contains a lot of outlier data points, which skew the
distribution to the right (there are more data points beyond the Outer Quartile Limit). In
addition, the following inferences can be made from the simple boxplots given below.
● The spring season has a relatively lower count. The dip in the median value in
the boxplot gives evidence for it.
● The boxplot for "Hour Of The Day" is quite interesting. The median values
are relatively higher at 7AM - 8AM and 5PM - 6PM. This can be attributed to
regular school and office users at those times.
● Most of the outlier points come from "Working Day" rather than
"Non Working Day". This is clearly visible in figure 4.
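For reference, the points a boxplot draws as outliers can also be listed numerically. The sketch below assumes the usual 1.5 x IQR whisker rule (the text does not say which rule was used) and runs on ten hypothetical "count" values.

```python
import pandas as pd

# Hypothetical "count" values with one extreme point on the right.
counts = pd.Series([5, 12, 20, 35, 40, 55, 60, 80, 120, 977], name="count")

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Assumed rule: points beyond 1.5 * IQR from the quartiles are outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = counts[(counts < lower_limit) | (counts > upper_limit)]
print("Limits:", lower_limit, upper_limit)
print("Outliers:", outliers.tolist())
```

Grouping the frame by "season", "hour" or "workingday" before applying the same rule would give the per-category view that the boxplots show.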
Correlation Analysis
We plot the correlation between "count" and ["temp","atemp","humidity","windspeed"].
● "temp" and "humidity" have positive and negative correlation with
"count" respectively. Although these correlations are not very
prominent, the "count" variable still shows some dependency on "temp" and
"humidity".
● "windspeed" is not going to be a really useful numerical feature, as is visible
from its correlation value with "count".
● "atemp" is not taken into account since "atemp" and "temp" are strongly
correlated with each other. During model building one of the two variables
has to be dropped, since together they introduce multicollinearity into the data.
● "Casual" and "Registered" are also not taken into account since they are
leakage variables in nature and need to be dropped during model building.
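The correlation values behind such a plot can be computed with DataFrame.corr. The sketch below is illustrative only: the six rows are hypothetical values chosen to mimic the signs described above, not figures from the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
df = pd.DataFrame({
    "temp": [9.0, 14.0, 20.0, 26.0, 31.0, 35.0],
    "atemp": [11.0, 16.5, 23.0, 30.0, 34.5, 39.0],
    "humidity": [88, 80, 65, 55, 45, 40],
    "windspeed": [0.0, 12.0, 7.0, 19.0, 0.0, 11.0],
    "count": [20, 60, 150, 240, 330, 400],
})

# Pearson correlation matrix of the numerical features and the target.
corr = df[["temp", "atemp", "humidity", "windspeed", "count"]].corr()

# Correlation of each feature with "count".
corr_with_count = corr["count"].drop("count")
print(corr_with_count)

# A high temp/atemp value signals the multicollinearity mentioned above.
print("temp vs atemp:", corr.loc["temp", "atemp"])
```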
Visualizing Count Vs (Month, Season, Hour, Weekday, Usertype)
● People tend to rent bikes during the summer season, since conditions are most
conducive to riding at that time of year. Therefore June, July and August have
relatively higher demand for bicycles.
● On weekdays more people tend to rent bicycles around 7AM-8AM and 5PM-6PM
(regular office and school days).
● This pattern is not observed on "Saturday" and "Sunday", when more people tend to
rent bicycles between 10AM and 4PM.
● The peak user count around 7AM-8AM and 5PM-6PM is purely contributed by
registered users.
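The aggregates behind these observations can be produced with a groupby. In the sketch below the eight rows and the derived "day_type" column are hypothetical and serve only to show the shape of the computation.

```python
import pandas as pd

# Hypothetical rows: commute hour (8) and midday hour (13) on four days.
df = pd.DataFrame({
    "hour": [8, 8, 13, 13, 8, 8, 13, 13],
    "weekDay": ["Monday", "Tuesday", "Monday", "Tuesday",
                "Saturday", "Sunday", "Saturday", "Sunday"],
    "casual": [10, 14, 30, 26, 20, 16, 150, 140],
    "registered": [300, 320, 100, 110, 60, 50, 180, 170],
    "count": [310, 334, 130, 136, 80, 66, 330, 310],
})

# Assumed helper column separating weekends from weekdays.
is_weekend = df["weekDay"].isin(["Saturday", "Sunday"])
df["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Mean rentals per day type and hour, split by user type.
hourly = df.groupby(["day_type", "hour"])[["casual", "registered", "count"]].mean()
print(hourly)
```

Plotting "hourly" as one line per day type (or per user type) gives the kind of chart from which the commute-hour peaks are read.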
Random Forest Model
The Random Forest model achieved an RMSLE value of 0.102804484141.
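RMSLE (root mean squared logarithmic error) compares the logarithms of predicted and actual counts, so it penalises relative rather than absolute errors. A minimal sketch of the metric using the standard definition is shown below; the function name rmsle and the sample numbers are illustrative and do not reproduce the value reported above.

```python
import math

def rmsle(actual, predicted):
    """Root mean squared logarithmic error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")
    # log1p(x) = log(x + 1), which keeps zero counts well defined.
    squared_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only.
print(rmsle([16, 40, 310], [18, 35, 290]))
```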