Group7 Report
A report submitted to
By
On
28-12-2020
Research Question
The research question for our group project is to predict the number of users of the Bike-Share
program in Washington D.C. A bicycle-sharing program is a system in which users can rent bicycles
from specific points in the city for short-term use, either for a nominal price or for free. There are more than
500 bike-sharing programs globally. Such programs aim to reduce traffic congestion and air and noise
pollution by providing affordable bicycle rentals for short trips as an alternative to motor vehicles. For each
program, the number of users on any given day varies depending on many factors. If the number of hourly
users can be predicted well, the management authority can run the program more efficiently and cost-
effectively.
The aim of this project is to use Machine Learning models to predict the number of bike-sharing
users in any given 1-hour period of the day, taking the available weather and seasonal factors into consideration.
The data set used for this project has been taken from the University of California, Irvine's Machine Learning
repository. The link to the data set is: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
The data set contains usage information on an hourly and daily basis along with various weather and seasonal
information. It is a CSV file covering 17379 hours spread across 731 days, with 16
features or variables, including:
Record index
Date
Month (1 to 12)
Hour (0 to 23)
For exploratory data analysis, we used R. With the help of the ggplot2 and
ggExtra packages, we made plots to understand the impact of the various features on bicycle
usage. Some of the graphs are shown below:
From the above two scatter plots, we can see a positive correlation of both Temperature and Adjusted
Temperature with the hourly bike usage count over the majority of the temperature range. This seems logical,
since people avoid riding bicycles in cold weather. For the highest temperatures (a very small subset of
the data), the curve dips, since people also avoid riding bicycles in extremely hot weather.
From the histogram on the x-axis, we can see that there are many more clear days (weather
situation 1) than overcast or rainy days (weather situations 2 and 3).
From the above plot, it is not clear how wind speed affects usage; the correlation between the two variables
appears weak. The correlation matrix of all the continuous variables is shown below:
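Although our exploratory plots were made in R, the modelling code in this project uses Python, so for consistency the sketch below computes such a correlation matrix with pandas. The data frame is a tiny synthetic stand-in for the real continuous variables (the column names mirror the UCI data set's conventions; the values are illustrative only).

```python
import pandas as pd

# Synthetic stand-in for a few continuous variables; the real values
# come from the UCI hourly file.
df = pd.DataFrame({
    "temp":      [0.20, 0.40, 0.60, 0.80],
    "atemp":     [0.25, 0.45, 0.60, 0.85],
    "windspeed": [0.30, 0.20, 0.10, 0.05],
    "cnt":       [40, 120, 260, 300],
})

# Pairwise Pearson correlations of all continuous columns.
corr = df.corr()
print(corr.loc["temp", "cnt"].round(2))  # 0.98 on this toy data
```

On the real data, inspecting `corr` in the same way makes relationships such as the weak wind-speed correlation immediately visible as small off-diagonal entries.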
Looking at the above graph, usage is lowest during the late-night hours (minimum usage
between 4 and 5 a.m.) and highest between 8-9 a.m. and 5-7 p.m. The fit is not linear, but with
some data manipulation (which will be discussed later) it can be made approximately linear by expressing
usage as a function of the temporal distance to 4 a.m.
A similar trend can be observed between month and usage, and with analogous data manipulation (calculating
the temporal distance to January) the plot can be made to yield an approximately linear fit.
Finally, the plot of the “Year” variable against usage shows that usage grew from year 1
to year 2, indicating that the program has grown in popularity since its inception.
Data Pre-processing
1. Removing features that do not add any valuable information, which in our project is the record “index”.
2. Extracting the week number from the date for that particular year and using that “week number”
as a predictor variable for the bike usage count.
3. Using one-hot encoding, the method of splitting non-binary categorical variables such as month,
week number, hour, and weather situation into binary sub-features, where each sub-feature indicates whether a
certain category of the original feature is present (1) or not (0).
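As a concrete sketch of step 3, one-hot encoding can be done in Python with pandas' `get_dummies`. The toy frame below is an assumption for illustration; only the column names follow the data set's conventions.

```python
import pandas as pd

# Toy frame: "weathersit" is a non-binary categorical feature
# (1 = clear, 2 = mist/cloudy, 3 = light rain/snow).
df = pd.DataFrame({"hr": [8, 17, 23], "weathersit": [1, 3, 1]})

# One-hot encode: each observed category becomes a 0/1 sub-feature.
encoded = pd.get_dummies(df, columns=["weathersit"], prefix="weathersit")
print(list(encoded.columns))  # ['hr', 'weathersit_1', 'weathersit_3']
```

Note that `get_dummies` only creates columns for categories actually observed in the frame; here weather situation 2 never occurs, so no `weathersit_2` column is produced.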
4. Modifying the cyclic variables to encode the temporal distance from a single reference point. We used the
temporal distance from 4 a.m. for Hours and from the middle of January for Weeks and Months.
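One plausible reading of this "temporal distance" for the hour variable is the shorter of the two gaps around the 24-hour cycle, sketched below; the function name and the circular interpretation are our assumptions, not taken verbatim from the report.

```python
# Temporal distance on a 24-hour cycle: the shorter of the clockwise and
# counter-clockwise gaps between an hour (0-23) and the 4 a.m. reference.
def dist_from_4am(hour: int) -> int:
    raw = abs(hour - 4)
    return min(raw, 24 - raw)

print([dist_from_4am(h) for h in (4, 8, 17, 0)])  # [0, 4, 11, 4]
```

The same idea applies to weeks and months with cycle lengths of 52 and 12 and a mid-January reference point.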
Simulations
METHOD-1
The number of samples passed to the input layer at a time is fixed through the batch_size
parameter, so we feed 256 data points at a time into the model, and we iterate over the whole data set 100 times
(100 epochs) while training.
To avoid overfitting and unnecessary computing time, while ensuring that the model sees enough
sample data repeatedly to capture the pattern without underfitting, we initially simulated
with a fixed 32 neurons, no hidden layer, a sigmoid activation function for the output layer,
the “rmsprop” optimizer, and “mse” as the loss, observing the following results:
We can see that the prediction accuracy improved slightly, but the loss also increased.
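A minimal Keras sketch of this setup is given below. The report's description ("32 neurons, no hidden layer") is ambiguous, so we read it as one 32-unit layer feeding the sigmoid output; the relu activation and the 12-feature input width are assumptions for illustration.

```python
from tensorflow import keras

N_FEATURES = 12  # assumed input width after pre-processing

# One 32-neuron layer feeding a single sigmoid output unit, compiled
# with the "rmsprop" optimizer and "mse" loss as in Method-1.
model = keras.Sequential([
    keras.layers.Input(shape=(N_FEATURES,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["accuracy"])

# Training would then use the batch size and epoch count described above:
# model.fit(X_train, y_train, batch_size=256, epochs=100, validation_split=0.2)
```

A sigmoid output bounds predictions to (0, 1), so the usage count would need to be scaled to that range before training.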
METHOD-2
METHOD-3
loss: 0.03443, accuracy: 0.008812, validation_loss: 0.03552, validation_accuracy: 0.01079
METHOD-4
An RNN method was employed for its usefulness with time-series data, but the accuracy remained constant
and low across all the simulation methods employed, though we tried reducing the loss by curbing overfitting
through a smaller number of epochs.
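A hedged sketch of an RNN of the kind described follows; the window length, feature count, and choice of a SimpleRNN cell are all assumptions, since the report does not specify the architecture.

```python
from tensorflow import keras

WINDOW, N_FEATURES = 24, 12  # assumed: one day of hourly steps, 12 features

# Recurrent layer over the hourly sequence, then one regression output.
model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, N_FEATURES)),
    keras.layers.SimpleRNN(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mse")

# To curb overfitting, training would use fewer epochs than Method-1, e.g.:
# model.fit(X_seq, y_seq, batch_size=256, epochs=20, validation_split=0.2)
```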
ANNEXURE