
Uber Data Analysis

Data Import and sanity checks


Read data into R
uber = read.csv("uber.csv")

Check the dimensions of the data set


dim(uber)
[1] 29101    13

View the top and bottom rows to make sure there are no formatting issues and that no header
or footer has been read into the data set
head(uber)
            pickup_dt       borough pickups spd vsb temp dewp    slp pcp01 pcp06 pcp24 sd hday
1 2015-01-01 01:00:00         Bronx     152   5  10   30    7 1023.5     0     0     0  0    Y
2 2015-01-01 01:00:00      Brooklyn    1519   5  10   30    7 1023.5     0     0     0  0    Y
3 2015-01-01 01:00:00           EWR       0   5  10   30    7 1023.5     0     0     0  0    Y
4 2015-01-01 01:00:00     Manhattan    5258   5  10   30    7 1023.5     0     0     0  0    Y
5 2015-01-01 01:00:00        Queens     405   5  10   30    7 1023.5     0     0     0  0    Y
6 2015-01-01 01:00:00 Staten Island       6   5  10   30    7 1023.5     0     0     0  0    Y

tail(uber)
                pickup_dt       borough pickups spd vsb temp dewp    slp pcp01 pcp06 pcp24 sd hday
29096 2015-06-30 23:00:00      Brooklyn     990   7  10   75   65 1011.8     0     0     0  0    N
29097 2015-06-30 23:00:00           EWR       0   7  10   75   65 1011.8     0     0     0  0    N
29098 2015-06-30 23:00:00     Manhattan    3828   7  10   75   65 1011.8     0     0     0  0    N
29099 2015-06-30 23:00:00        Queens     580   7  10   75   65 1011.8     0     0     0  0    N
29100 2015-06-30 23:00:00 Staten Island       0   7  10   75   65 1011.8     0     0     0  0    N
29101 2015-06-30 23:00:00          <NA>       3   7  10   75   65 1011.8     0     0     0  0    N

This looks fine; let us now check the data types and structure.
str(uber)
'data.frame': 29101 obs. of 13 variables:
 $ pickup_dt: Factor w/ 4343 levels "2015-01-01 01:00:00",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ borough  : Factor w/ 6 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 NA 1 2 3 ...
 $ pickups  : int  152 1519 0 5258 405 6 4 120 1229 0 ...
 $ spd      : num  5 5 5 5 5 5 5 3 3 3 ...
 $ vsb      : num  10 10 10 10 10 10 10 10 10 10 ...
 $ temp     : num  30 30 30 30 30 30 30 30 30 30 ...
 $ dewp     : num  7 7 7 7 7 7 7 6 6 6 ...
 $ slp      : num  1024 1024 1024 1024 1024 ...
 $ pcp01    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pcp06    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pcp24    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ sd       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hday     : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...

Pickup date is a date-time variable that has been read in as a factor.

Borough and hday are legitimate factors; all remaining variables are numeric.

Check summary statistics


summary(uber)
               pickup_dt              borough        pickups            spd
 2015-01-01 01:00:00:    7   Bronx        :4343   Min.   :   0.0   Min.   : 0.000
 2015-01-01 02:00:00:    7   Brooklyn     :4343   1st Qu.:   1.0   1st Qu.: 3.000
 2015-01-01 03:00:00:    7   EWR          :4343   Median :  54.0   Median : 6.000
 2015-01-01 04:00:00:    7   Manhattan    :4343   Mean   : 490.2   Mean   : 5.985
 2015-01-01 05:00:00:    7   Queens       :4343   3rd Qu.: 449.0   3rd Qu.: 8.000
 2015-01-01 10:00:00:    7   Staten Island:4343   Max.   :7883.0   Max.   :21.000
 (Other)            :29059   NA's         :3043

      vsb              temp            dewp             slp             pcp01
 Min.   : 0.000   Min.   : 2.00   Min.   :-16.00   Min.   : 991.4   Min.   :0.00000
 1st Qu.: 9.100   1st Qu.:32.00   1st Qu.: 14.00   1st Qu.:1012.5   1st Qu.:0.00000
 Median :10.000   Median :46.00   Median : 30.00   Median :1018.2   Median :0.00000
 Mean   : 8.818   Mean   :47.67   Mean   : 30.82   Mean   :1017.8   Mean   :0.00383
 3rd Qu.:10.000   3rd Qu.:64.50   3rd Qu.: 50.00   3rd Qu.:1022.9   3rd Qu.:0.00000
 Max.   :10.000   Max.   :89.00   Max.   : 73.00   Max.   :1043.4   Max.   :0.28000

     pcp06             pcp24              sd          hday
 Min.   :0.00000   Min.   :0.00000   Min.   : 0.000   N:27980
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.000   Y: 1121
 Median :0.00000   Median :0.00000   Median : 0.000
 Mean   :0.02613   Mean   :0.09046   Mean   : 2.529
 3rd Qu.:0.00000   3rd Qu.:0.05000   3rd Qu.: 2.958
 Max.   :1.24000   Max.   :2.10000   Max.   :19.000

# Almost all boroughs have identical record counts; a few NA's are observed
# pickups shows the possibility of outliers
# a visibility of 0 indicates extreme conditions, but cannot be ruled out
# temperatures are in Fahrenheit, so the range of 2 to 89 translates to roughly -17 to 32 Celsius

Check for any missing Values

anyNA(uber)
[1] TRUE

sum(is.na(uber))
[1] 3043

This corresponds to the missing borough values seen in the summary output


sapply(uber, function(x) sum(is.na(x)))
pickup_dt   borough   pickups       spd       vsb      temp      dewp       slp     pcp01
        0      3043         0         0         0         0         0         0         0
    pcp06     pcp24        sd      hday
        0         0         0         0

sapply iterates over all columns, applying the given function; here it counts the NA values in each column.
This confirms that only one column (borough) has NAs.
Since borough contains a high number of NA values, imputing with any technique might
introduce bias. We will instead create a new category called "Unknown" for the missing values.

uber$borough = as.factor(replace(as.character(uber$borough), is.na(uber$borough), "Unknown"))

table(uber$borough)

        Bronx      Brooklyn           EWR     Manhattan        Queens Staten Island
         4343          4343          4343          4343          4343          4343
      Unknown
         3043

Generate features from the date variable

The date variable is currently a factor, which does not provide meaningful insights on its own.
Let us try to break it into features like month, day, and hour.

# convert the date into date-time form first
library(lubridate)   # provides month(), day() and hour()

uber$start_date = strptime(uber$pickup_dt, '%Y-%m-%d %H:%M')

uber$start_month = month(uber$start_date)
uber$start_day = day(uber$start_date)
uber$start_hour = hour(uber$start_date)
uber$wday = weekdays(uber$start_date)
uber = uber[,-14]   # drop the intermediate start_date column (column 14)

We have added new features for the month of the ride, the day of the month, and the hour of the ride.
Also, wday represents the day of the week.

Check for number of holidays each month

unique(uber[which(uber$hday=="Y"),c("start_day","start_month")])
start_day start_month
1 1 1
2848 19 1
6649 12 2
7293 16 2
20608 10 5
23055 25 5
24526 3 6
We can see two holidays in January, two in February, two in May, and one in June; there are no
holidays in March and April.
table(uber$hday,uber$start_month)

1 2 3 4 5 6
N 4588 4169 4957 4798 4730 4738
Y 309 323 0 0 328 161

This shows the number of hourly records on holidays vs. non-holidays in each month (record counts, not trip counts)


We will come back to the effect of holidays vs. non-holidays on trips.
Before that, let us do some univariate analysis.

Uni-Variate Analysis
Wind speed (spd):
boxplot(uber$spd)

hist(uber$spd)
The boxplot shows there are outliers in the data set.
The histogram also shows the right skew in the distribution.
On average, wind speed is about 6 miles/hour.

Check the distribution of pickups (a sketch of the plots follows below).


Many observations have 0 pickups or close to it.
The right skew is clearly visible.
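
These distribution plots are not reproduced here; a minimal sketch of how they were likely generated (base R; the number of breaks is chosen for illustration):

# Sketch: distribution of hourly pickups
boxplot(uber$pickups, main = "pickups")            # many high-end outliers
hist(uber$pickups, breaks = 50, main = "pickups")  # strong right skew, mass near zero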

Total number of unique days


library(data.table)   # provides uniqueN()
uniqueN(uber, by=c('start_month', 'start_day'))
[1] 181

Average pickups on holidays vs. non-holidays:
plot(aggregate(pickups~hday,data=uber, mean), type="b")

Check for outliers in the other variables as well (a sketch of the boxplots follows below).

Temperature and dew point do not show any outliers.


pcp01 has relatively few outliers; sd shows plenty.
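
These checks were presumably done with boxplots; a minimal sketch:

# Sketch: boxplots for the remaining weather variables
boxplot(uber$temp, main = "temp")
boxplot(uber$dewp, main = "dewp")
boxplot(uber$pcp01, main = "pcp01")   # a few high-precipitation outliers
boxplot(uber$sd, main = "sd")         # many outliers; snow depth is usually 0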
Check variable distributions

Temperature

Two peaks can be seen: one at around 35 and the other at around 60.
The bigger peak at 35 (~1.5 C) suggests cold weather conditions; the summers are not as intense.

Dew Point

The distribution is quite similar to that of temperature.

Sea level pressure


This resembles a normal distribution (see the histogram sketch below).
We would expect pressure, temperature and dew point to show some correlation, hence we can
expect similar distributions for them.
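
A minimal sketch of the histograms discussed in this section:

# Sketch: distributions of temperature, dew point and sea level pressure
hist(uber$temp, breaks = 30, main = "temp")   # bimodal: peaks near 35 and 60
hist(uber$dewp, breaks = 30, main = "dewp")   # similar shape to temp
hist(uber$slp, breaks = 30, main = "slp")     # roughly bell-shaped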

Let us now look at how the variables relate to one another and to pickups.

Bi-Variate analysis
Check for correlation among variables

library(corrplot)   # provides corrplot()
corrplot(cor(uber[,4:12]))

As hypothesized, temperature shows a high correlation with dew point.
Visibility decreases with precipitation in the hour prior to the observation (pcp01).
Temperature and dew point are negatively correlated with sea level pressure.
The other variables show only slight correlations among themselves.

Pick up vs wind speed


plot(uber$spd, uber$pickups, xlab= "speed", ylab="pickup", main ="pickup vs speed")
abline(lm(uber$pickups~uber$spd))
Wind speed does not seem to be a strong predictor of pickups; the regression line is almost
flat.

Visibility vs pickup

Again, not an important predictor (a sketch of this plot follows below).
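
The visibility plot itself is not reproduced here; a minimal sketch following the same pattern as the other scatter plots:

plot(uber$vsb, uber$pickups, xlab= "visibility", ylab="pickup", main ="pickup vs visibility")
abline(lm(uber$pickups~uber$vsb))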


plot(uber$pcp01, uber$pickups, xlab= "pcp01", ylab="pickup", main ="pcp01 vs pickups")
abline(lm(uber$pickups~uber$pcp01))

plot(uber$temp, uber$pickups, xlab= "temp", ylab="pickup", main ="temperature vs pickup")
abline(lm(uber$pickups~uber$temp))

We can see that none of the weather-related variables emerges as a strong predictor of
pickups.

Let us try some time based variables

Monthly bookings for uber


plot(aggregate(pickups~start_month,data=uber, sum), type="b")

There is a clear increasing trend in monthly bookings.

Let us check the bookings done on each day of the month (sketch below).


There is a steep fall in the last days of the month. This can partially be attributed to February
having just 28 days.
But we can see the peak is at around the 21st day of the month.
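
The day-of-month plot was likely generated along the same lines as the monthly one:

plot(aggregate(pickups~start_day,data=uber, sum), type="b")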

Let us exclude the month of February and check daily rides.


library(dplyr)    # provides filter() and %>%
library(ggplot2)  # provides ggplot()

uber %>%
  filter(start_month != 2) %>%
  ggplot(aes(x=start_day, y=pickups)) + geom_bar(stat='identity')

This does not show any steep decline for the 29th and 30th. There is a sharp decrease on the
31st, but that is attributable to the fact that only 3 of the remaining 5 months have a 31st.

Let us check the bookings on hourly basis


plot(aggregate(pickups~start_hour,data=uber, sum), type="b")
We see bookings peak at around the 19th-20th hour and then decrease until 5 in the morning.
There is an increasing trend until 10 (likely due to the office rush), then a slight downward
movement until 12, after which bookings start increasing again.

Bookings based on day of week


ggplot(aes(x = reorder(wday, pickups), y = pickups), data = uber) +
geom_bar(aes(fill=pickups), width=0.5, stat = "identity") + coord_flip()
Saturday has the maximum number of bookings and Monday the least.

Bookings based on borough


plot(aggregate(pickups~borough,data=uber, sum), type="b")
Manhattan has the highest number of bookings, followed by Brooklyn and Queens.
EWR, Unknown and Staten Island have far fewer bookings.

Multi-variate analysis

Let us check the bookings per hour separated by boroughs

ggplot(uber, aes(start_hour, pickups)) +
  geom_jitter(alpha = 0.3, aes(colour = borough)) +
  geom_smooth(aes(color = borough))

A distinct hourly pattern is seen for each borough. This combination could be important for
prediction.

Let us check the pattern in each borough for different precipitation and dew point values.

ggplot(uber, aes(start_hour, borough)) +
  geom_jitter(alpha = 0.4, aes(color = pcp24 > 0)) +
  geom_smooth(aes(color = pcp24 > 0))

The variation for 1-hour precipitation looks much the same (sketch below).
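
The 1-hour precipitation version of this plot is not reproduced here; a minimal sketch (geom_smooth is omitted since borough is discrete):

ggplot(uber, aes(start_hour, borough)) +
  geom_jitter(alpha = 0.4, aes(color = pcp01 > 0))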
We can clearly see that the weather variables do not show any appreciable distinction or
pattern in the number of rides booked.

So, contrary to our assumption that bookings would increase with rain or changes in
temperature, we have seen that the weather variables have little effect on bookings.

We have seen a clear hourly trend in bookings: the number increases from 5 in the morning
until 10, drops slightly until 12, and then increases again until 8 at night.

There was also a pattern in bookings per day of the week, with the highest clocked on
Saturday and the lowest on Monday.

Holidays were another variable of interest: non-holidays see more bookings than holidays.
Note that this comparison does not treat weekly days off as holidays; it just compares the 7
marked holiday dates against the regular days. We could stretch this by considering all
Sundays as holidays and replotting the difference (a sketch of this extension follows below).
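
One possible way to do that extension; a sketch where hday2 is a hypothetical column name, not part of the original analysis (assumes an English locale, so weekdays() returns "Sunday"):

# Sketch: treat Sundays as holidays as well, then compare average pickups
uber$hday2 = ifelse(uber$wday == "Sunday", "Y", as.character(uber$hday))
aggregate(pickups ~ hday2, data = uber, mean)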
Bookings by day of the month do not reveal any obvious pattern, but there is a clear increase
in monthly bookings.

In terms of boroughs, Manhattan contributes the largest share of bookings.

We could explore further: booking patterns on working vs. non-working days as highlighted
above, booking patterns in each borough with respect to holidays, days of the month, etc.

Pickups broken down by borough

We can observe that the majority of rides are in Manhattan.


EWR, Staten Island and the unknown location have a high number of hours with 0 rides (a
sketch of this breakdown follows below).
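
The borough breakdown plot is not reproduced here; a boxplot per borough is one plausible way it was drawn (a sketch, not the author's confirmed method):

# Sketch: distribution of hourly pickups by borough
boxplot(pickups ~ borough, data = uber, las = 2, ylab = "pickups per hour")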

uber %>%
  group_by(borough) %>%
  summarise(`Total Pickups` = sum(pickups)) %>%
  arrange(desc(`Total Pickups`))
# A tibble: 7 x 2
  borough       `Total Pickups`
  <fct>                   <int>
1 Manhattan            10367841
2 Brooklyn              2321035
3 Queens                1343528
4 Bronx                  220047
5 Staten Island            6957
6 Unknown                  6260
7 EWR                       105
Returning to the holiday comparison from earlier, the average number of trips on non-holidays
is around 500, compared to about 430 on holidays.

This could be a good predictor in determining demand for the number of rides.
