
Uber Data Analysis

Data Import and sanity checks


Read data into R
uber = read.csv("uber.csv")

Check the dimensions of the data set


dim(uber)
[1] 29101    13

View the top and bottom rows to make sure there are no formatting issues and that no header
or footer has been read into the data set
head(uber)
            pickup_dt       borough pickups spd vsb temp dewp    slp pcp01 pcp06 pcp24 sd hday
1 2015-01-01 01:00:00         Bronx     152   5  10   30    7 1023.5     0     0     0  0    Y
2 2015-01-01 01:00:00      Brooklyn    1519   5  10   30    7 1023.5     0     0     0  0    Y
3 2015-01-01 01:00:00           EWR       0   5  10   30    7 1023.5     0     0     0  0    Y
4 2015-01-01 01:00:00     Manhattan    5258   5  10   30    7 1023.5     0     0     0  0    Y
5 2015-01-01 01:00:00        Queens     405   5  10   30    7 1023.5     0     0     0  0    Y
6 2015-01-01 01:00:00 Staten Island       6   5  10   30    7 1023.5     0     0     0  0    Y

tail(uber)
                pickup_dt       borough pickups spd vsb temp dewp    slp pcp01 pcp06 pcp24 sd hday
29096 2015-06-30 23:00:00      Brooklyn     990   7  10   75   65 1011.8     0     0     0  0    N
29097 2015-06-30 23:00:00           EWR       0   7  10   75   65 1011.8     0     0     0  0    N
29098 2015-06-30 23:00:00     Manhattan    3828   7  10   75   65 1011.8     0     0     0  0    N
29099 2015-06-30 23:00:00        Queens     580   7  10   75   65 1011.8     0     0     0  0    N
29100 2015-06-30 23:00:00 Staten Island       0   7  10   75   65 1011.8     0     0     0  0    N
29101 2015-06-30 23:00:00          <NA>       3   7  10   75   65 1011.8     0     0     0  0    N

This looks fine; let us now check the data types and structure.
str(uber)
'data.frame': 29101 obs. of 13 variables:
 $ pickup_dt: Factor w/ 4343 levels "2015-01-01 01:00:00",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ borough  : Factor w/ 6 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 NA 1 2 3 ...
 $ pickups  : int  152 1519 0 5258 405 6 4 120 1229 0 ...
 $ spd      : num  5 5 5 5 5 5 5 3 3 3 ...
 $ vsb      : num  10 10 10 10 10 10 10 10 10 10 ...
 $ temp     : num  30 30 30 30 30 30 30 30 30 30 ...
 $ dewp     : num  7 7 7 7 7 7 7 6 6 6 ...
 $ slp      : num  1024 1024 1024 1024 1024 ...
 $ pcp01    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pcp06    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pcp24    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ sd       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hday     : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...

Pickup date is a date-time variable that has been read in as a factor.

Borough and hday are legitimate factors; all remaining variables are numeric.

Check summary statistics


summary(uber)
               pickup_dt              borough        pickups            spd
 2015-01-01 01:00:00:    7   Bronx        :4343   Min.   :   0.0   Min.   : 0.000
 2015-01-01 02:00:00:    7   Brooklyn     :4343   1st Qu.:   1.0   1st Qu.: 3.000
 2015-01-01 03:00:00:    7   EWR          :4343   Median :  54.0   Median : 6.000
 2015-01-01 04:00:00:    7   Manhattan    :4343   Mean   : 490.2   Mean   : 5.985
 2015-01-01 05:00:00:    7   Queens       :4343   3rd Qu.: 449.0   3rd Qu.: 8.000
 2015-01-01 10:00:00:    7   Staten Island:4343   Max.   :7883.0   Max.   :21.000
 (Other)            :29059   NA's         :3043

      vsb              temp            dewp             slp             pcp01
 Min.   : 0.000   Min.   : 2.00   Min.   :-16.00   Min.   : 991.4   Min.   :0.00000
 1st Qu.: 9.100   1st Qu.:32.00   1st Qu.: 14.00   1st Qu.:1012.5   1st Qu.:0.00000
 Median :10.000   Median :46.00   Median : 30.00   Median :1018.2   Median :0.00000
 Mean   : 8.818   Mean   :47.67   Mean   : 30.82   Mean   :1017.8   Mean   :0.00383
 3rd Qu.:10.000   3rd Qu.:64.50   3rd Qu.: 50.00   3rd Qu.:1022.9   3rd Qu.:0.00000
 Max.   :10.000   Max.   :89.00   Max.   : 73.00   Max.   :1043.4   Max.   :0.28000

     pcp06             pcp24              sd          hday
 Min.   :0.00000   Min.   :0.00000   Min.   : 0.000   N:27980
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.000   Y: 1121
 Median :0.00000   Median :0.00000   Median : 0.000
 Mean   :0.02613   Mean   :0.09046   Mean   : 2.529
 3rd Qu.:0.00000   3rd Qu.:0.05000   3rd Qu.: 2.958
 Max.   :1.24000   Max.   :2.10000   Max.   :19.000

# Almost all boroughs have identical record counts; a few NA's are observed
# pickups shows the possibility of outliers
# a visibility of 0 indicates extreme conditions, but cannot be ruled out
# temperatures are in Fahrenheit, so the range of 2 to 89 translates to roughly -17 to 32 Celsius

Check for any missing Values

anyNA(uber)
[1] TRUE

sum(is.na(uber))
[1] 3043

This corresponds to the missing borough values seen in the summary output


sapply(uber, function(x) sum(is.na(x)))
pickup_dt   borough   pickups       spd       vsb      temp      dewp       slp     pcp01
        0      3043         0         0         0         0         0         0         0
    pcp06     pcp24        sd      hday
        0         0         0         0

sapply iterates over all columns, applying the given function; here it counts the NA values in each column.
This confirms that only one column (borough) has NAs.
Since borough contains a high number of NA values, imputing with any technique might
introduce bias. We will instead create a new category called "Unknown" for the missing values.

uber$borough = as.factor(replace(as.character(uber$borough), is.na(uber$borough), "Unknown"))

table(uber$borough)

        Bronx      Brooklyn           EWR     Manhattan        Queens Staten Island
         4343          4343          4343          4343          4343          4343
      Unknown
         3043

Generate features from the date variable

The date variable is currently a factor, which does not provide meaningful insights on its own.
Let us try to break it into features like month, day, and hour.

# convert the date into date-time form first
library(lubridate)   # provides month(), day() and hour()

uber$start_date = strptime(uber$pickup_dt, '%Y-%m-%d %H:%M')

uber$start_month = month(uber$start_date)
uber$start_day = day(uber$start_date)
uber$start_hour = hour(uber$start_date)
uber$wday = weekdays(uber$start_date)
uber = uber[,-14]   # drop the intermediate start_date column (column 14)

We have added new features for the month of the ride, the day of the month, and the hour of the ride.
Also, wday represents the day of the week.

Check for number of holidays each month

unique(uber[which(uber$hday=="Y"),c("start_day","start_month")])
start_day start_month
1 1 1
2848 19 1
6649 12 2
7293 16 2
20608 10 5
23055 25 5
24526 3 6
We can see two holidays in January, two in February, two in May, and one in June; there are no
holidays in March and April.
table(uber$hday,uber$start_month)

1 2 3 4 5 6
N 4588 4169 4957 4798 4730 4738
Y 309 323 0 0 328 161

This shows the number of hourly records on holidays vs. non-holidays in each month (record counts, not trip counts)


We will come back to the effect of holidays vs. non-holidays on trips.
Before that, let us do some univariate analysis.

Uni-Variate Analysis
Wind speed (spd):
boxplot(uber$spd)

hist(uber$spd)
The boxplot shows there are outliers in the data set.
The histogram also shows the right skew in the distribution.
On average, wind speed is about 6 miles/hour.

Check the distribution of pickups (a sketch of the plots follows below).


Many observations have 0 pickups or close to it.
The right skew is clearly visible.
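
These distribution plots are not reproduced here; a minimal sketch of how they were likely generated (base R; the number of breaks is chosen for illustration):

# Sketch: distribution of hourly pickups
boxplot(uber$pickups, main = "pickups")            # many high-end outliers
hist(uber$pickups, breaks = 50, main = "pickups")  # strong right skew, mass near zero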

Total number of unique days


library(data.table)   # provides uniqueN()
uniqueN(uber, by=c('start_month', 'start_day'))
[1] 181

Average pickups on holidays vs. non-holidays:
plot(aggregate(pickups~hday,data=uber, mean), type="b")

Check for outliers in the other variables as well (a sketch of the boxplots follows below).

Temperature and dew point do not show any outliers.


pcp01 has relatively few outliers; sd shows plenty.
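
These checks were presumably done with boxplots; a minimal sketch:

# Sketch: boxplots for the remaining weather variables
boxplot(uber$temp, main = "temp")
boxplot(uber$dewp, main = "dewp")
boxplot(uber$pcp01, main = "pcp01")   # a few high-precipitation outliers
boxplot(uber$sd, main = "sd")         # many outliers; snow depth is usually 0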
Check variable distributions

Temperature

Two peaks can be seen: one at around 35 and the other at around 60.
The bigger peak at 35 (~1.5 C) suggests cold weather conditions; the summers are not as intense.

Dew Point

The distribution is quite similar to that of temperature.

Sea level pressure


This resembles a normal distribution (see the histogram sketch below).
We would expect pressure, temperature and dew point to show some correlation, hence we can
expect similar distributions for them.
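
A minimal sketch of the histograms discussed in this section:

# Sketch: distributions of temperature, dew point and sea level pressure
hist(uber$temp, breaks = 30, main = "temp")   # bimodal: peaks near 35 and 60
hist(uber$dewp, breaks = 30, main = "dewp")   # similar shape to temp
hist(uber$slp, breaks = 30, main = "slp")     # roughly bell-shaped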

Let us now look at how the variables relate to one another and to pickups.

Bi-Variate analysis
Check for correlation among variables

library(corrplot)   # provides corrplot()
corrplot(cor(uber[,4:12]))

As hypothesized, temperature shows a high correlation with dew point.
Visibility decreases with precipitation in the hour prior to the observation (pcp01).
Temperature and dew point are negatively correlated with sea level pressure.
The other variables show only slight correlations among themselves.

Pick up vs wind speed


plot(uber$spd, uber$pickups, xlab= "speed", ylab="pickup", main ="pickup vs speed")
abline(lm(uber$pickups~uber$spd))
Wind speed does not seem to be a strong predictor of pickups; the regression line is almost
flat.

Visibility vs pickup

Again, not an important predictor (a sketch of this plot follows below).
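
The visibility plot itself is not reproduced here; a minimal sketch following the same pattern as the other scatter plots:

plot(uber$vsb, uber$pickups, xlab= "visibility", ylab="pickup", main ="pickup vs visibility")
abline(lm(uber$pickups~uber$vsb))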


plot(uber$pcp01, uber$pickups, xlab= "pcp01", ylab="pickup", main ="pcp01 vs pickups")
abline(lm(uber$pickups~uber$pcp01))

plot(uber$temp, uber$pickups, xlab= "temp", ylab="pickup", main ="temperature vs pickup")
abline(lm(uber$pickups~uber$temp))

We can see that none of the weather-related variables emerges as a strong predictor of
pickups.

Let us try some time based variables

Monthly bookings for uber


plot(aggregate(pickups~start_month,data=uber, sum), type="b")

There is a clear increasing trend in monthly bookings.

Let us check the bookings done on each day of the month (sketch below).


There is a steep fall in the last days of the month. This can partially be attributed to February
having just 28 days.
But we can see the peak is at around the 21st day of the month.
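
The day-of-month plot was likely generated along the same lines as the monthly one:

plot(aggregate(pickups~start_day,data=uber, sum), type="b")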

Let us exclude the month of February and check daily rides.


library(dplyr)    # provides filter() and %>%
library(ggplot2)  # provides ggplot()

uber %>%
  filter(start_month != 2) %>%
  ggplot(aes(x=start_day, y=pickups)) + geom_bar(stat='identity')

This does not show any steep decline for the 29th and 30th. There is a sharp decrease on the
31st, but that is attributable to the fact that only 3 of the remaining 5 months have a 31st.

Let us check the bookings on hourly basis


plot(aggregate(pickups~start_hour,data=uber, sum), type="b")
We see bookings peak at around the 19th-20th hour and then decrease until 5 in the morning.
There is an increasing trend until 10 (likely due to the office rush), then a slight downward
movement until 12, after which bookings start increasing again.

Bookings based on day of week


ggplot(aes(x = reorder(wday, pickups), y = pickups), data = uber) +
geom_bar(aes(fill=pickups), width=0.5, stat = "identity") + coord_flip()
Saturday has the maximum number of bookings and Monday the least.

Bookings based on borough


plot(aggregate(pickups~borough,data=uber, sum), type="b")
Manhattan has the highest number of bookings, followed by Brooklyn and Queens.
EWR, Unknown and Staten Island have far fewer bookings.

Multi-variate analysis

Let us check the bookings per hour separated by boroughs

ggplot(uber, aes(start_hour, pickups)) +
  geom_jitter(alpha = 0.3, aes(colour = borough)) +
  geom_smooth(aes(color = borough))

A distinct hourly pattern is seen for each borough. This combination could be important for
prediction.

Let us check the pattern in each borough for different precipitation and dew point values.

ggplot(uber, aes(start_hour, borough)) +
  geom_jitter(alpha = 0.4, aes(color = pcp24 > 0)) +
  geom_smooth(aes(color = pcp24 > 0))

The variation for 1-hour precipitation looks much the same (sketch below).
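
The 1-hour precipitation version of this plot is not reproduced here; a minimal sketch (geom_smooth is omitted since borough is discrete):

ggplot(uber, aes(start_hour, borough)) +
  geom_jitter(alpha = 0.4, aes(color = pcp01 > 0))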
We can clearly see that the weather variables do not show any appreciable distinction or
pattern in the number of rides booked.

So, contrary to our assumption that bookings would increase with rain or changes in
temperature, we have seen that the weather variables have little effect on bookings.

We have seen a clear hourly trend in bookings: the number increases from 5 in the morning
until 10, drops slightly until 12, and then increases again until 8 at night.

There was also a pattern in bookings per day of the week, with the highest clocked on
Saturday and the lowest on Monday.

Holidays were another variable of interest: non-holidays see more bookings than holidays.
Note that this comparison does not treat weekly days off as holidays; it just compares the 7
marked holiday dates against the regular days. We could stretch this by considering all
Sundays as holidays and replotting the difference (a sketch of this extension follows below).
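
One possible way to do that extension; a sketch where hday2 is a hypothetical column name, not part of the original analysis (assumes an English locale, so weekdays() returns "Sunday"):

# Sketch: treat Sundays as holidays as well, then compare average pickups
uber$hday2 = ifelse(uber$wday == "Sunday", "Y", as.character(uber$hday))
aggregate(pickups ~ hday2, data = uber, mean)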
Bookings by day of the month do not reveal any obvious pattern, but there is a clear increase
in monthly bookings.

In terms of boroughs, Manhattan contributes the largest share of bookings.

We could explore further: booking patterns on working vs. non-working days as highlighted
above, booking patterns in each borough with respect to holidays, days of the month, etc.

Pickups broken down by borough

We can observe that the majority of rides are in Manhattan.


EWR, Staten Island and the unknown location have a high number of hours with 0 rides (a
sketch of this breakdown follows below).
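
The borough breakdown plot is not reproduced here; a boxplot per borough is one plausible way it was drawn (a sketch, not the author's confirmed method):

# Sketch: distribution of hourly pickups by borough
boxplot(pickups ~ borough, data = uber, las = 2, ylab = "pickups per hour")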

uber %>%
  group_by(borough) %>%
  summarise(`Total Pickups` = sum(pickups)) %>%
  arrange(desc(`Total Pickups`))
# A tibble: 7 x 2
  borough       `Total Pickups`
  <fct>                   <int>
1 Manhattan            10367841
2 Brooklyn              2321035
3 Queens                1343528
4 Bronx                  220047
5 Staten Island            6957
6 Unknown                  6260
7 EWR                       105
Returning to the holiday comparison from earlier, the average number of trips on non-holidays
is around 500, compared to about 430 on holidays.

This could be a good predictor in determining demand for the number of rides.
