Group7 Report
A report submitted to
By
On
28-12-2020
Research Question
The research question for our group project is to predict the number of users of the Bike-Share
program in Washington D.C. A bicycle-sharing program is a system in which users can rent bicycles
from specific points in the city for short-term use, either for a nominal price or for free. There are more than
500 bike-sharing programs globally. Such programs aim to reduce traffic congestion and air and noise
pollution by providing affordable bicycle rentals for short trips as an alternative to motor vehicles. For each
program, the number of users on any given day varies depending on many factors. If the number of hourly
users can be predicted well, the management authority can run the program more efficiently and cost-
effectively.
The aim of this project is to use Machine Learning models to predict the number of bike-sharing
users in any given 1-hour period of the day, taking the available weather and seasonal factors into consideration.
The data set used for this project has been taken from the University of California, Irvine's Machine Learning
repository. The link to the data set is: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
The data set contains usage information on an hourly and daily basis along with various weather and seasonal
information. It is a CSV file covering 17379 hours spread across 731 days, with 16
features or variables, including:
Record index
Date
Month (1 to 12)
Hour (0 to 23)
For exploratory data analysis, we used R. With the help of the ggplot2 and
ggExtra packages, we made plots to understand the impact of the various features on bicycle
usage. Some of the graphs are shown below:
From the above two scatter plots, we can see a positive correlation of both Temperature and Adjusted
Temperature with the hourly bike usage count over the majority of the temperature range. This seems logical,
since people avoid riding bicycles in cold weather. For the highest temperatures (a very small subset of
the data), the curve dips, since people also avoid riding bicycles in extremely hot weather.
From the histogram on the x-axis, we can see that there are many more clear days (weather
situation 1) than overcast or rainy days (weather situations 2 and 3).
From the above plot, it is not clear how wind speed affects usage; the correlation between the two variables
appears weak. The correlation matrix of all the continuous variables is shown below:
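Although our exploratory plots were made in R, the modelling code in this project uses Python, so for consistency the sketch below computes such a correlation matrix with pandas. The data frame is a tiny synthetic stand-in for the real continuous variables (the column names mirror the UCI data set's conventions; the values are illustrative only).

```python
import pandas as pd

# Synthetic stand-in for a few continuous variables; the real values
# come from the UCI hourly file.
df = pd.DataFrame({
    "temp":      [0.20, 0.40, 0.60, 0.80],
    "atemp":     [0.25, 0.45, 0.60, 0.85],
    "windspeed": [0.30, 0.20, 0.10, 0.05],
    "cnt":       [40, 120, 260, 300],
})

# Pairwise Pearson correlations of all continuous columns.
corr = df.corr()
print(corr.loc["temp", "cnt"].round(2))  # 0.98 on this toy data
```

On the real data, inspecting `corr` in the same way makes relationships such as the weak wind-speed correlation immediately visible as small off-diagonal entries.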
Looking at the above graph, usage is lowest during the late-night hours (minimum usage
between 4 and 5 a.m.) and highest between 8-9 a.m. and 5-7 p.m. The fit is not linear, but with
some data manipulation (which will be discussed later) it can be made approximately linear by expressing
usage as a function of the temporal distance to 4 a.m.
A similar trend can be observed between month and usage, and with analogous data manipulation (calculating
the temporal distance to January) the plot can be made to yield an approximately linear fit.
Finally, the plot of the “Year” variable against usage shows that usage grew from year 1
to year 2, indicating that the program has grown in popularity since its inception.
Data Pre-processing
1. Removing features that do not add any valuable information, which in our project is the record “index”.
2. Extracting the week number from the date for that particular year and using that “week number”
as a predictor variable for the bike usage count.
3. Using one-hot encoding, the method of splitting non-binary categorical variables such as month,
week number, hour, and weather situation into binary sub-features, where each sub-feature indicates whether a
certain category of the original feature is present (1) or not (0).
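As a concrete sketch of step 3, one-hot encoding can be done in Python with pandas' `get_dummies`. The toy frame below is an assumption for illustration; only the column names follow the data set's conventions.

```python
import pandas as pd

# Toy frame: "weathersit" is a non-binary categorical feature
# (1 = clear, 2 = mist/cloudy, 3 = light rain/snow).
df = pd.DataFrame({"hr": [8, 17, 23], "weathersit": [1, 3, 1]})

# One-hot encode: each observed category becomes a 0/1 sub-feature.
encoded = pd.get_dummies(df, columns=["weathersit"], prefix="weathersit")
print(list(encoded.columns))  # ['hr', 'weathersit_1', 'weathersit_3']
```

Note that `get_dummies` only creates columns for categories actually observed in the frame; here weather situation 2 never occurs, so no `weathersit_2` column is produced.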
4. Modifying the cyclic variables to encode the temporal distance from a single reference point. We used the
temporal distance from 4 a.m. for Hours and from the middle of January for Weeks and Months.
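One plausible reading of this "temporal distance" for the hour variable is the shorter of the two gaps around the 24-hour cycle, sketched below; the function name and the circular interpretation are our assumptions, not taken verbatim from the report.

```python
# Temporal distance on a 24-hour cycle: the shorter of the clockwise and
# counter-clockwise gaps between an hour (0-23) and the 4 a.m. reference.
def dist_from_4am(hour: int) -> int:
    raw = abs(hour - 4)
    return min(raw, 24 - raw)

print([dist_from_4am(h) for h in (4, 8, 17, 0)])  # [0, 4, 11, 4]
```

The same idea applies to weeks and months with cycle lengths of 52 and 12 and a mid-January reference point.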
Simulations
METHOD-1
The number of samples passed to the input layer at a time is fixed through the batch_size
parameter, so we feed 256 data points at a time into the model, and we iterate over the whole data set 100 times
(100 epochs) while training.
To avoid overfitting and unnecessary computing time, while ensuring that the model sees enough
sample data repeatedly to capture the pattern without underfitting, we initially simulated
with a fixed 32 neurons, no hidden layer, a sigmoid activation function for the output layer,
the “rmsprop” optimizer, and “mse” as the loss, observing the following results:
We can see that the prediction accuracy improved slightly, but the loss also increased.
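A minimal Keras sketch of this setup is given below. The report's description ("32 neurons, no hidden layer") is ambiguous, so we read it as one 32-unit layer feeding the sigmoid output; the relu activation and the 12-feature input width are assumptions for illustration.

```python
from tensorflow import keras

N_FEATURES = 12  # assumed input width after pre-processing

# One 32-neuron layer feeding a single sigmoid output unit, compiled
# with the "rmsprop" optimizer and "mse" loss as in Method-1.
model = keras.Sequential([
    keras.layers.Input(shape=(N_FEATURES,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["accuracy"])

# Training would then use the batch size and epoch count described above:
# model.fit(X_train, y_train, batch_size=256, epochs=100, validation_split=0.2)
```

A sigmoid output bounds predictions to (0, 1), so the usage count would need to be scaled to that range before training.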
METHOD-2
METHOD-3
loss: 0.03443, accuracy: 0.008812, validation_loss: 0.03552, validation_accuracy: 0.01079
METHOD-4
An RNN method was employed for its usefulness with time-series data, but the accuracy remained constant
and low across all the simulation methods employed, though we tried reducing the loss by curbing overfitting
through a smaller number of epochs.
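A hedged sketch of an RNN of the kind described follows; the window length, feature count, and choice of a SimpleRNN cell are all assumptions, since the report does not specify the architecture.

```python
from tensorflow import keras

WINDOW, N_FEATURES = 24, 12  # assumed: one day of hourly steps, 12 features

# Recurrent layer over the hourly sequence, then one regression output.
model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW, N_FEATURES)),
    keras.layers.SimpleRNN(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mse")

# To curb overfitting, training would use fewer epochs than Method-1, e.g.:
# model.fit(X_seq, y_seq, batch_size=256, epochs=20, validation_split=0.2)
```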
ANNEXURE