Bike Sharing Analysis

The document summarizes a student's analysis of a bike sharing dataset for a course on regression and time series analysis. It includes preprocessing steps like feature engineering, missing value analysis, outlier detection, and correlation analysis. Visualization techniques are used to understand patterns in bike rentals by month, season, hour, weekday, and user type. A random forest model is fit to the data and achieves an RMSLE score of 0.102804484141.

Uploaded by

Devansh Sharma

Name- Gaurav Gupta

Roll no- 16HS20013


Subject- Regression and Time series
Course code- MA20005

Bike sharing Analysis

Workflow of dataset
● About Dataset
● Feature Engineering
● Missing Value Analysis
● Outlier Analysis
● Correlation Analysis
● Visualizing Distribution Of Data
● Visualizing Count Vs (Month,Season,Hour,Weekday,Usertype)
● Filling 0's In Windspeed Using Random Forest
● Random Forest Model

About dataset

Bike sharing systems are a means of renting bicycles where the process of obtaining
membership, rental, and bike return is automated via a network of kiosk locations
throughout a city. Using these systems, people are able to rent a bike from one location
and return it to a different place on an as-needed basis. Currently, there are over 500
bike-sharing programs around the world.

Feature Engineering
In the given dataset, the columns "season", "holiday", "workingday" and "weather" should
be of "categorical" data type, but their current data type is "int". We transform the
dataset in the following ways before starting our exploratory data analysis (EDA).
● Create new columns "date", "hour", "weekDay" and "month" from the "datetime" column.
● Coerce the data type of "season", "holiday", "workingday" and "weather" to category.
● Drop the "datetime" column, as we have already extracted the useful features from it.
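
The three steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the report's actual code: the helper name engineer_features and the two-row sample are made up, and the choice of day_name()/month_name() for "weekDay" and "month" is an assumption about how those columns were encoded.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive date parts from 'datetime' and cast coded columns to category."""
    out = df.copy()
    stamp = pd.to_datetime(out["datetime"])

    # New columns extracted from the timestamp.
    out["date"] = stamp.dt.date
    out["hour"] = stamp.dt.hour
    out["weekDay"] = stamp.dt.day_name()   # assumption: weekday stored as a name
    out["month"] = stamp.dt.month_name()   # assumption: month stored as a name

    # Integer-coded columns that are really categorical.
    for col in ["season", "holiday", "workingday", "weather"]:
        out[col] = out[col].astype("category")

    # The raw timestamp is no longer needed.
    return out.drop(columns=["datetime"])


# Made-up sample rows, only to show the shape of the transformation.
sample = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-07-04 17:00:00"],
    "season": [1, 3],
    "holiday": [0, 1],
    "workingday": [0, 0],
    "weather": [1, 2],
    "count": [16, 250],
})
features = engineer_features(sample)
```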

Missing Values Analysis

Missing value analysis found no missing values in the given dataset.
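
A check of this kind might look like the sketch below. The helper name missing_counts and the small sample frame are hypothetical; the report does not show how the check was actually run.

```python
import pandas as pd


def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return df.isnull().sum()


# Made-up sample with no gaps, standing in for the real dataset.
sample = pd.DataFrame({"temp": [9.84, 9.02], "humidity": [81, 80]})
counts = missing_counts(sample)
has_missing = bool(counts.any())  # False when every column is complete
```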
Outliers Analysis
At first look, the "count" variable contains a lot of outlier data points, which skews the distribution
towards the right (as there are more data points beyond the outer quartile limit). In addition,
the following inferences can be made from the simple boxplots given below.

● The spring season has a relatively lower count. The dip in the median value in
the boxplot gives evidence for it.
● The boxplot with "Hour Of The Day" is quite interesting. The median values
are relatively higher at 7AM - 8AM and 5PM - 6PM. This can be attributed to
regular school and office users at those times.
● Most of the outlier points come from "Working Day" rather than
"Non Working Day". This is quite visible in figure 4.
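
The numbers behind such boxplots can be approximated without plotting. The sketch below assumes the usual boxplot convention that the outer limit is Q3 + 1.5 × IQR (the report does not state which rule it used); the helper name upper_fence and the sample values are invented for illustration.

```python
import pandas as pd


def upper_fence(values: pd.Series) -> float:
    """Upper boxplot whisker limit, assuming the common Q3 + 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return float(q3 + 1.5 * (q3 - q1))


# Made-up counts: one extreme value on a working day.
sample = pd.DataFrame({
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "count": [10, 20, 30, 500, 15, 25, 35, 45],
})

fence = upper_fence(sample["count"])
outliers = sample[sample["count"] > fence]

# How many outliers fall on working vs. non-working days.
outliers_by_daytype = outliers.groupby("workingday").size()

# The median per group is the line drawn inside each box.
median_by_daytype = sample.groupby("workingday")["count"].median()
```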

Correlation Analysis
We plot a correlation plot between "count" and ["temp","atemp","humidity","windspeed"].

● The temp and humidity features have positive and negative correlation with
count, respectively. Although these correlations are not very
prominent, the count variable still has a little dependency on "temp" and
"humidity".
● windspeed is not going to be a really useful numerical feature, as is visible
from its correlation value with "count".
● The "atemp" variable is not taken into account since "atemp" and "temp" have a strong
correlation with each other. During model building, one of the two variables
has to be dropped since together they will exhibit multicollinearity in the data.
● "Casual" and "Registered" are also not taken into account since they are
leakage variables in nature and need to be dropped during model building.
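
A correlation matrix of this kind can be computed with pandas as sketched below. The sample values are invented purely to show the mechanics and are not taken from the dataset, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Invented values for illustration only.
sample = pd.DataFrame({
    "temp": [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "atemp": [6.0, 12.0, 17.5, 23.0, 28.5, 33.0],
    "humidity": [90, 80, 75, 60, 55, 40],
    "windspeed": [12, 7, 15, 9, 14, 10],
    "count": [20, 60, 110, 150, 210, 260],
})

# Pearson correlation between the numerical features and the target.
corr = sample[["temp", "atemp", "humidity", "windspeed", "count"]].corr()
corr_with_count = corr["count"].drop("count")

# "temp" and "atemp" are nearly collinear, so keep only one of them.
model_columns = [c for c in sample.columns if c != "atemp"]
```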

Visualizing Distribution Of Data


As is visible from the figures below, the "count" variable is skewed towards the right. We take
a log transformation of the "count" variable after removing outlier data points. After the
transformation the data looks much better (its skewness is reduced).
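
The effect of the two steps on skewness can be sketched as follows. Two assumptions are made here that the report does not confirm: outliers are taken to be points above Q3 + 1.5 × IQR, and the transform is log(1 + count) so that zero counts stay defined. The sample counts are made up.

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed counts.
counts = pd.Series([1, 2, 3, 4, 5, 8, 12, 20, 40, 90, 200, 600], dtype=float)

# Step 1: remove outliers (assumed rule: above Q3 + 1.5 * IQR).
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
kept = counts[counts <= q3 + 1.5 * (q3 - q1)]

# Step 2: log-transform the remaining counts (assumed form: log(1 + x)).
logged = np.log1p(kept)

# Sample skewness before and after; values closer to 0 are more symmetric.
skew_raw = counts.skew()
skew_kept = kept.skew()
skew_logged = logged.skew()
```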
Visualizing Count Vs (Month, Season, Hour, Weekday, Usertype)

● People tend to rent bikes during the summer season, since it is
really conducive to riding a bike in that season. Therefore June, July and August have
relatively higher demand for bicycles.
● On weekdays more people tend to rent bicycles around 7AM-8AM and 5PM-6PM
(regular office days, school days).
● The above pattern is not observed on "Saturday" and "Sunday". More people tend to
rent bicycles between 10AM and 4PM on those days.
● The peak user count around 7AM-8AM and 5PM-6PM is purely contributed by
registered users.
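
The aggregations behind these comparisons can be sketched with pandas group-bys. The rows below are invented to mimic the described pattern, and computing "count" as the sum of "casual" and "registered" is an assumption about the dataset's layout.

```python
import pandas as pd

# Invented rows for illustration only.
sample = pd.DataFrame({
    "hour": [8, 8, 13, 13, 17, 17],
    "weekDay": ["Monday", "Saturday", "Monday", "Saturday", "Monday", "Saturday"],
    "casual": [10, 15, 20, 90, 25, 60],
    "registered": [300, 40, 80, 110, 350, 70],
})
# Assumption: total count is the sum of the two user types.
sample["count"] = sample["casual"] + sample["registered"]

# Mean count per hour, one column per weekday.
hourly_by_day = sample.pivot_table(
    index="hour", columns="weekDay", values="count", aggfunc="mean"
)

# Mean count per hour, split by user type.
hourly_by_user = sample.groupby("hour")[["casual", "registered"]].mean()

peak_hour_monday = hourly_by_day["Monday"].idxmax()
peak_hour_saturday = hourly_by_day["Saturday"].idxmax()
```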
The random forest model achieves an RMSLE of 0.102804484141.
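
For reference, RMSLE (root mean squared logarithmic error) is the square root of the mean of (log(1 + predicted) - log(1 + actual))^2. A minimal sketch of the metric is given below; the function name rmsle and the example inputs are illustrative, and the report does not show how its own score was computed.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero counts well defined.
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Perfect predictions give an error of zero.
perfect = rmsle([1, 2, 3], [1, 2, 3])

# Predicting e - 1 when the truth is 0 gives a log difference of exactly 1.
one_off = rmsle([0.0], [np.e - 1.0])
```

Because the metric compares logarithms, it penalises relative rather than absolute errors, which suits a right-skewed target such as the rental count.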
