
PROJECT: Predict the number of bike-sharing users

A report submitted to

Prof. Ujjwal Das

In partial fulfillment of the course

Advanced Methods for Data Analysis (AMDA)

By

Ayushi Birla- 1911046

C K Venkata Sai Anudeep- 1911065

Debdipta Ray- 1911075

Vishwanath Kaimal- 1911297

On

28-12-2020
Research Question

The research question for our group project is to predict the number of users that opt for the Bike-Share program in Washington D.C. A bicycle-sharing program is a system in which users can rent bicycles from designated points in a city for short-term use, either for a nominal price or for free. There are more than 500 bike-sharing programs globally. Such programs aim to reduce traffic congestion and air and noise pollution by providing affordable bicycle rentals for short trips as an alternative to motor vehicles. For each program, the number of users on any given day varies depending on many factors. If the number of hourly users can be predicted well, the managing authority can run the program more efficiently and cost-effectively.

The aim of this project is to use machine learning models to effectively predict the number of bike-sharing users in any given one-hour period of a day, taking other available factors into consideration.

Data Set Used

The data set used for this project has been taken from the University of California, Irvine Machine Learning Repository. The link for the data set is: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

The data set contains records on an hourly and daily basis along with various weather and seasonal information. The file used is a CSV with data from 17,379 hours spread across 731 days, with a record index and 16 further features or variables.
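As a minimal sketch of how the data can be loaded in R (assuming the hourly file hour.csv from the UCI archive has been downloaded and unzipped into the working directory):

    # Load the hourly bike-sharing data
    bike <- read.csv("hour.csv")

    # 17,379 rows: one record index plus 16 features
    str(bike)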

Variables Used: The variables used for the analysis are:

- Record index
- Date
- Season (1: spring, 2: summer, 3: fall, 4: winter)
- Year (0: 2011, 1: 2012)
- Month (1 to 12)
- Hour (0 to 23)
- Holiday: whether the day is a holiday or not
- Weekday: day of the week
- Working day: 1 if the day is neither a weekend nor a holiday, 0 otherwise
- Weather situation:
  1: Clear, few clouds, partly cloudy
  2: Mist + cloudy, mist + broken clouds, mist + few clouds, mist
  3: Light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
  4: Heavy rain + ice pellets + thunderstorm + mist, snow + fog
- Normalized temperature in Celsius; values divided by 41 (max)
- Normalized feeling temperature in Celsius; values divided by 50 (max)
- Normalized humidity; values divided by 100 (max)
- Normalized wind speed; values divided by 67 (max)
- Count of casual users
- Count of registered users
- Count of total rental bikes, including both casual and registered

Exploratory Data Analysis

For exploratory data analysis, we used the R software. With the help of the ggplot2 and ggExtra packages, we made plots to understand the impact of the available features on bicycle usage. Some of the graphs are discussed below.
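As an illustration, a plot like the temperature scatter plot with a marginal histogram can be built roughly as follows; this is a sketch assuming the UCI column names temp and cnt, and the exact aesthetics used in the report may differ:

    library(ggplot2)
    library(ggExtra)

    # Scatter plot of hourly usage against normalized temperature,
    # with a smoothed best-fit curve
    p <- ggplot(bike, aes(x = temp, y = cnt)) +
      geom_point(alpha = 0.2) +
      geom_smooth() +
      labs(x = "Normalized temperature", y = "Hourly usage count")

    # Attach a marginal histogram along the x-axis
    ggMarginal(p, type = "histogram")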
From the two scatter plots of temperature and adjusted (feeling) temperature against the hourly usage count, we can see a positive correlation over the majority of the temperature range. This seems logical, since people avoid riding bicycles in cold weather. At the highest temperatures (a very small subset of the data), the curve dips, since people also avoid riding bicycles in extremely hot weather.

Looking at the scatter plot between humidity and hourly usage, we can see a negative correlation, with a linear fit lying very close to the best-fit curve for the majority of the data, excluding some outliers with very low humidity. The negative correlation exists because the climate of Washington D.C. is very humid, which increases the chance of rainfall, which in turn leads to low bicycle usage. So, we can conclude that the weather situation has an impact on bicycle usage, with rainfall reducing it.

From the histogram on the x-axis, we can see that there are many more clear days (weather situation 1) than overcast or rainy days (weather situations 2 and 3).

From the wind speed plot, the conclusion regarding how wind speed affects usage is not clear; the correlation between the two variables is weak. The correlation matrix of all the continuous variables is shown below.
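A sketch of how such a matrix can be computed, assuming the UCI column names (the report's exact selection of variables may differ):

    # Correlation matrix of the continuous variables
    num_vars <- bike[, c("temp", "atemp", "hum", "windspeed",
                         "casual", "registered", "cnt")]
    round(cor(num_vars), 2)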

Looking at the hourly plot, we can say that usage is lowest during the late-night hours (minimum between 4 and 5 a.m.) and highest between 8 and 9 a.m. and between 5 and 7 p.m. The fit is not linear, but with some data manipulation (discussed later) we can approximate a linear fit by expressing usage in terms of the temporal distance from 4 a.m.

A similar trend can be observed between month and usage; with an analogous manipulation, calculating the temporal distance from January, the plot can be made to yield a linear fit. Finally, the plot of the "Year" variable against usage shows that usage grew from year 1 to year 2, indicating that the program has grown in popularity since its inception.

Data Pre-processing

1. Removing the feature that does not add any valuable information, which in our project is the record index.
2. Extracting the week number from the date for that particular year and using this "week number" as a predictor of the bike usage count.
3. Applying one-hot encoding, which is the method of splitting non-binary categorical variables such as month, week number, hour, and weather situation into binary sub-features, where each sub-feature indicates whether a certain category of the original feature holds (1) or not (0).
4. Transforming the cyclic variables into a temporal distance from a single time point: the distance from 4 a.m. for hours, and from mid-January for weeks and months (see the sketch after this list).
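A minimal sketch of steps 2-4 in R, assuming the UCI column names dteday, weathersit, and hr (the report's actual code may differ):

    # Step 2: week number of the year, extracted from the date
    bike$week <- as.integer(format(as.Date(bike$dteday), "%U"))

    # Step 3: one-hot encode a categorical feature into binary sub-features
    weather_onehot <- model.matrix(~ factor(weathersit) - 1, data = bike)

    # Step 4: temporal distance from 4 a.m. on the 24-hour circle,
    # so hour 3 and hour 5 are both at distance 1 from 4 a.m.
    bike$hr_dist <- pmin(abs(bike$hr - 4), 24 - abs(bike$hr - 4))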

Analyzing the inputs required for the project: RNN

To set up the RNN, we needed to decide on a number of variables, such as the batch size, the number of epochs, the number of nodes, the optimizer, and the method to use. Because there is no way of theoretically deriving the optimal values for these variables, we gathered empirical data to decide which values to use: we ran a number of combinations of the variables and recorded the loss, the absolute error, and the accuracy.

Simulations
METHOD-1

The number of samples passed to the input layer at one time is fixed through the batch_size parameter, so we feed 256 data points at a time into the model, and we pass over the whole dataset 100 times (epochs) while training.

To avoid overfitting and unnecessary computing time, while ensuring that the model sees enough sample data to train itself repeatedly, capture the pattern, and not underfit, we initially simulated with a fixed 32 neurons and no hidden layer, a sigmoid activation for the output layer, "rmsprop" as the optimizer, and "mse" as the loss, observing the results:

loss: 0.00495, accuracy: 0.008812, validation_loss: 0.005169, validation_acc: 0.01115

We can see that the prediction accuracy improved slightly, but the losses also increased.
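One reading of this Method-1 setup, sketched with the keras package for R; the text's "32 neurons and no hidden layer" is taken here as a single 32-unit dense layer feeding one sigmoid output unit, and x_train, y_train (the pre-processed predictors and the normalized usage count) and the validation split are assumptions, since the report does not define them:

    library(keras)

    # Single 32-unit dense layer feeding a sigmoid output unit
    model <- keras_model_sequential() %>%
      layer_dense(units = 32, input_shape = ncol(x_train)) %>%
      layer_dense(units = 1, activation = "sigmoid")

    model %>% compile(
      optimizer = "rmsprop",
      loss = "mse",
      metrics = c("mean_absolute_error", "accuracy")
    )

    # 256 data points per batch, 100 passes over the training data;
    # the validation split is an assumption, not stated in the report
    history <- model %>% fit(
      x_train, y_train,
      batch_size = 256,
      epochs = 100,
      validation_split = 0.2
    )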

METHOD-2

To improve the accuracy of the output, we sequentially increased the number of hidden layers to 3, with the number of neurons set at 32 for each layer, a relu activation for each dense layer, sigmoid for the output layer, "rmsprop" as the optimizer, and "mse" as the loss. To reduce losses, we also reduced the number of epochs from 100 to 80 to check for any marked improvement. The observed results were:

loss: 0.006159, accuracy: 0.008812, validation_loss: 0.005592, validation_acc: 0.01115

Since we added 3 hidden layers, the losses are bound to increase at this stage too, but we can see marked changes due to the reduction in epochs from 100 to 80.
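The corresponding Method-2 architecture, sketched under the same assumptions as the Method-1 code above:

    # Three hidden layers of 32 relu neurons each, sigmoid output
    model2 <- keras_model_sequential() %>%
      layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train)) %>%
      layer_dense(units = 32, activation = "relu") %>%
      layer_dense(units = 32, activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")
    # compiled as before, but trained with epochs = 80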

METHOD-3

We then varied the number of neurons per layer, from 256 initially, to 128, to 64, and then to 32, to see whether the model could now train itself more easily and give the needed accurate predictions, with a relu activation for the 1st and 2nd dense layers, tanh for the 3rd dense layer, sigmoid for the output layer, "rmsprop" as the optimizer, and "mse" as the loss, observing the results:

loss: 0.03443, accuracy: 0.008812, validation_loss: 0.03552, validation_acc: 0.01079
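One reading of the Method-3 layer widths, sketched under the same assumptions; the report lists the widths 256, 128, 64, and 32 but does not fully specify how they map onto the layers, so this mapping is a guess:

    # Decreasing widths, tanh on the third dense layer, sigmoid output
    model3 <- keras_model_sequential() %>%
      layer_dense(units = 256, activation = "relu", input_shape = ncol(x_train)) %>%
      layer_dense(units = 128, activation = "relu") %>%
      layer_dense(units = 64, activation = "tanh") %>%
      layer_dense(units = 1, activation = "sigmoid")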

METHOD-4

We then changed the optimizer to "adam", keeping "mse" as the loss. The observed result is:

loss: 0.03444, accuracy: 0.008812, validation_loss: 0.03555, validation_acc: 0.01079
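The only change from Method-3 is the optimizer in the compile step, sketched under the same assumptions:

    # Same architecture, swapping rmsprop for adam
    model3 %>% compile(
      optimizer = "adam",
      loss = "mse",
      metrics = c("mean_absolute_error", "accuracy")
    )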

Based on the above findings, we fixed the model that seemed most optimal to us and then decided which hidden layers to add to it.

The RNN method was employed for its usefulness with time series. However, the accuracy remained constant and low across all the simulated methods, even though we tried to reduce losses by reducing overfitting through a lower number of epochs.

ANNEXURE

Simulation runs and their recorded metrics; configuration values (units / batch size / epochs) are shown where recorded:

Units / Batch Size / Epochs    Loss          Mean Absolute Error    Accuracy
128 / 256 / 32                 0.001946411   0.027735395            0.008633094
                               0.034260292   0.146898240            0.008633094
256                            0.002471824   0.031979896            0.008633094
128                            0.034191489   0.145444036            0.008633094
256                            0.034202401   0.145819664            0.008633094
                               0.034194238   0.144284979            0.008633094
128                            0.034194194   0.144286588            0.008633094
256                            0.034210850   0.143858105            0.008633094
256 / 100                      0.034185514   0.144897506            0.008633094
128 / 100                      0.034198176   0.145694748            0.008633094
100                            0.034186549   0.145129487            0.008633094
100                            0.034219287   0.146214798            0.008633094
