Uber Data Analysis
View the top and bottom rows to make sure there are no formatting issues and
that no header or footer rows are included in the data set.
head(uber)
            pickup_dt       borough pickups spd vsb temp dewp    slp pcp01 pcp06 pcp24 sd hday
1 2015-01-01 01:00:00         Bronx     152   5  10   30    7 1023.5     0     0     0  0    Y
2 2015-01-01 01:00:00      Brooklyn    1519   5  10   30    7 1023.5     0     0     0  0    Y
3 2015-01-01 01:00:00           EWR       0   5  10   30    7 1023.5     0     0     0  0    Y
4 2015-01-01 01:00:00     Manhattan    5258   5  10   30    7 1023.5     0     0     0  0    Y
5 2015-01-01 01:00:00        Queens     405   5  10   30    7 1023.5     0     0     0  0    Y
6 2015-01-01 01:00:00 Staten Island       6   5  10   30    7 1023.5     0     0     0  0    Y
tail(uber)
                pickup_dt       borough pickups spd vsb temp dewp    slp pcp01 pcp06 pcp24 sd hday
29096 2015-06-30 23:00:00      Brooklyn     990   7  10   75   65 1011.8     0     0     0  0    N
29097 2015-06-30 23:00:00           EWR       0   7  10   75   65 1011.8     0     0     0  0    N
29098 2015-06-30 23:00:00     Manhattan    3828   7  10   75   65 1011.8     0     0     0  0    N
29099 2015-06-30 23:00:00        Queens     580   7  10   75   65 1011.8     0     0     0  0    N
29100 2015-06-30 23:00:00 Staten Island       0   7  10   75   65 1011.8     0     0     0  0    N
29101 2015-06-30 23:00:00          <NA>       3   7  10   75   65 1011.8     0     0     0  0    N
This looks fine. Let us now check the data types and structure.
str(uber)
'data.frame': 29101 obs. of 13 variables:
$ pickup_dt: Factor w/ 4343 levels "2015-01-01 01:00:00",..: 1 1 1 1 1 1 1 2 2 2 ...
$ borough : Factor w/ 6 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 NA 1 2 3 ...
$ pickups : int 152 1519 0 5258 405 6 4 120 1229 0 ...
$ spd : num 5 5 5 5 5 5 5 3 3 3 ...
$ vsb : num 10 10 10 10 10 10 10 10 10 10 ...
$ temp : num 30 30 30 30 30 30 30 30 30 30 ...
$ dewp : num 7 7 7 7 7 7 7 6 6 6 ...
$ slp : num 1024 1024 1024 1024 1024 ...
$ pcp01 : num 0 0 0 0 0 0 0 0 0 0 ...
$ pcp06 : num 0 0 0 0 0 0 0 0 0 0 ...
$ pcp24 : num 0 0 0 0 0 0 0 0 0 0 ...
$ sd : num 0 0 0 0 0 0 0 0 0 0 ...
$ hday : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
# Almost all boroughs have an identical distribution; some NAs are observed
# pickups shows possible outliers
# a visibility of 0 indicates extreme conditions, but cannot be ruled out
# temperatures are in Fahrenheit, so the given range of 2 to 89 translates to
# roughly -17 to 32 Celsius
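The Fahrenheit-to-Celsius conversion mentioned in the comments can be checked with a short sketch. The helper name `f_to_c` is an assumption introduced here for illustration; it is not part of the original analysis.

```r
# Convert degrees Fahrenheit to degrees Celsius.
# f_to_c is a hypothetical helper, not a function from the original report.
f_to_c <- function(temp_f) (temp_f - 32) * 5 / 9

# Range endpoints quoted in the comments above (2 F and 89 F).
temp_range_f <- c(2, 89)
temp_range_c <- round(f_to_c(temp_range_f), 1)
print(temp_range_c)
```

Rounded to one decimal place this gives about -16.7 and 31.7 degrees Celsius.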
anyNA(uber)
[1] TRUE
sum(is.na(uber))
[1] 3043
sapply iterates over all columns and applies the given NA check to each one.
This confirms that only one column (borough) has NA values.
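A per-column NA count of this kind could look like the minimal sketch below. The data frame `toy` is an illustrative stand-in for `uber` with made-up rows, and the exact command used in the original analysis is not shown in the source, so this is one plausible form rather than the author's code.

```r
# Toy data frame standing in for `uber` (values are illustrative only).
toy <- data.frame(
  borough = c("Bronx", NA, "Queens", NA),
  pickups = c(152L, 4L, 405L, 3L),
  spd     = c(5, 5, 3, 7),
  stringsAsFactors = TRUE
)

# Count missing values in every column.
na_per_column <- sapply(toy, function(col) sum(is.na(col)))
print(na_per_column)

# Names of the columns that contain at least one NA.
cols_with_na <- names(na_per_column)[na_per_column > 0]
print(cols_with_na)
```

On the real data the same call would be `sapply(uber, function(col) sum(is.na(col)))`.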
Also, borough contains a high number of NA values, so imputing them with any
technique might introduce bias. Instead, we create a new category called
“Unknown” for the missing values.
uber$borough = as.factor(replace(as.character(uber$borough), is.na(uber$borough), "Unknown"))
table(uber$borough)
The date variable is stored as a factor, which is unlikely to provide meaningful
insights. Let us break it into features such as month, day and hour.
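library(lubridate)
# Assumed missing step: parse pickup_dt (a factor) into a date-time column named start_date
uber$start_date = ymd_hms(as.character(uber$pickup_dt))
uber$start_month = month(uber$start_date)
uber$start_day = day(uber$start_date)
uber$start_hour = hour(uber$start_date)
uber$wday = weekdays(uber$start_date)
# Drop column 14 (the helper start_date column) now that the features are extracted
uber = uber[,-14]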
We have added new features for the month of the ride, the day of the month and
the hour of the ride. In addition, wday records the day of the week.
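The same features can be derived with base R alone, which may help when lubridate is not available. This is a sketch on two example timestamps in the pickup_dt layout; the `features` data frame and the use of UTC are assumptions made for illustration.

```r
# Example timestamps in the same "YYYY-MM-DD HH:MM:SS" layout as pickup_dt.
pickup_dt <- factor(c("2015-01-01 01:00:00", "2015-06-30 23:00:00"))

# Factor -> character -> POSIXct. UTC is assumed here to avoid
# daylight-saving gaps when parsing local clock times.
start_date <- as.POSIXct(as.character(pickup_dt),
                         format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

features <- data.frame(
  start_month = as.integer(format(start_date, "%m")),
  start_day   = as.integer(format(start_date, "%d")),
  start_hour  = as.integer(format(start_date, "%H")),
  wday        = weekdays(start_date)  # day names depend on the session locale
)
print(features)
```

Converting the factor to character first matters: calling a date parser on the factor's integer codes would give wrong results.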
unique(uber[which(uber$hday=="Y"),c("start_day","start_month")])
start_day start_month
1 1 1
2848 19 1
6649 12 2
7293 16 2
20608 10 5
23055 25 5
24526 3 6
We can see that there are 2 holidays in January, 2 in February, 2 in May and 1
in June. There are no holidays in March and April.
table(uber$hday,uber$start_month)
1 2 3 4 5 6
N 4588 4169 4957 4798 4730 4738
Y 309 323 0 0 328 161
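Note that the table above counts rows (one per hour and borough), not days. A sketch of counting distinct holiday days per month is given below on a small made-up data frame; `toy` and its values are illustrative assumptions, not data from the report.

```r
# Toy rows: several hourly records can fall on the same holiday date.
toy <- data.frame(
  start_day   = c(1, 1, 19, 12, 16, 10),
  start_month = c(1, 1, 1, 2, 2, 5),
  hday        = c("Y", "Y", "Y", "Y", "Y", "N")
)

# Keep one row per distinct holiday date, then count dates per month.
holiday_days   <- unique(toy[toy$hday == "Y", c("start_day", "start_month")])
days_per_month <- table(holiday_days$start_month)
print(days_per_month)
```

Applied to `uber`, the same two steps would turn the row counts into a count of holiday dates for each month.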
Uni-Variate Analysis
Speed:
boxplot(uber$spd)
hist(uber$spd)
The boxplot shows that there are outliers in the data set.
The histogram also shows a right skew in the distribution.
On average, the speed (spd) is about 5 miles per hour.
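The visual reading of the boxplot and histogram can be backed by numbers. The sketch below uses a small made-up vector in place of `uber$spd`; it flags points beyond the boxplot whiskers with `boxplot.stats` and compares mean and median as a rough indicator of right skew.

```r
# Illustrative values standing in for uber$spd (not real data).
spd <- c(0, 3, 3, 5, 5, 5, 6, 7, 8, 21)

# Points beyond 1.5 times the hinge spread, the rule boxplot() uses for outliers.
outliers <- boxplot.stats(spd)$out
print(outliers)

# A mean above the median is a simple hint of right skew.
spd_mean   <- mean(spd)
spd_median <- median(spd)
print(c(mean = spd_mean, median = spd_median))
```

On the real column, `boxplot.stats(uber$spd)$out` would list the values drawn as individual points in the boxplot.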
Let us now look at how the other variables relate to one another and to the number of pickups.
Bi-Variate Analysis
Check for correlation among the weather variables (columns 4 to 12)
library(corrplot)
corrplot(cor(uber[,4:12]))
As hypothesized, temperature shows a high correlation with dew point.
Visibility decreases with precipitation in the hour prior to observation (pcp01).
Temperature and dew point are negatively correlated to temperature.
The other variables show only slight correlation with one another.
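To read such relationships off as numbers rather than from the plot, the correlation matrix can be filtered for strong pairs. This is a minimal sketch on made-up values: the `toy` data frame, the 0.8 cutoff and the `strong_pairs` name are assumptions chosen for illustration.

```r
# Made-up values for four of the weather columns (illustrative only).
toy <- data.frame(
  temp  = c(30, 35, 40, 50, 60, 75),
  dewp  = c(7, 10, 20, 30, 45, 65),
  vsb   = c(10, 10, 10, 8, 4, 2),
  pcp01 = c(0, 0, 0, 0.1, 0.3, 0.5)
)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and list pairs with |r| above the cutoff.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.8, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 2)
)
print(strong_pairs)
```

With the real data, `cor(uber[,4:12])` would take the place of `cor(toy)`.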
Visibility vs pickup
Let us check the pattern in each borough for different precipitation and dew point
values
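library(ggplot2)
# y is set to pickups (assumed intent) so the smoother has a numeric response; one panel per borough
ggplot(uber, aes(start_hour, pickups)) +
  geom_jitter(alpha = 0.4, aes(color = pcp24 > 0)) +
  geom_smooth(aes(color = pcp24 > 0)) +
  facet_wrap(~ borough, scales = "free_y")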
Variation for precipitation for 1 hour
We can clearly see that the weather variables do not show any appreciable
distinction or pattern in the number of rides booked.
So, contrary to our assumption that bookings would increase with rain or
changes in temperature, we conclude that the weather variables have little
effect on bookings.
We have seen a clear trend in bookings per hour: the number increases from 5 a.m.
until 10 a.m., drops slightly until noon, and then increases again until 8 p.m.
There was also a pattern in bookings per day of the week: the highest numbers
were recorded on Saturday and the lowest on Monday.
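Hourly and weekday patterns like these can be summarised with `aggregate`. The sketch below runs on a small made-up data frame; `toy` and its values are illustrative assumptions, and the column names follow the features created earlier.

```r
# Made-up rows with the same column names as the engineered features.
toy <- data.frame(
  start_hour = c(5, 5, 10, 10, 20, 20),
  wday       = c("Monday", "Saturday", "Monday", "Saturday", "Monday", "Saturday"),
  pickups    = c(100, 200, 300, 500, 400, 900)
)

# Mean pickups for each hour of the day.
by_hour <- aggregate(pickups ~ start_hour, data = toy, FUN = mean)
print(by_hour)

# Total pickups for each day of the week, and the busiest day.
by_wday     <- aggregate(pickups ~ wday, data = toy, FUN = sum)
busiest_day <- as.character(by_wday$wday[which.max(by_wday$pickups)])
print(busiest_day)
```

Replacing `toy` with `uber` gives the per-hour and per-weekday summaries behind the trends described above.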
Holiday was another variable of interest: more bookings are made on non-holidays
than on holidays. Note that the holiday flag does not include regular weekly
days off; it only compares the 7 holidays against regular days. We could extend
this by treating all Sundays as holidays and replotting the difference.
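The suggested extension, treating Sundays as holidays, could be sketched as follows. The column name `day_off` and the toy rows are assumptions introduced here; the comparison also assumes `wday` holds English day names, which depends on the session locale when it is created with `weekdays()`.

```r
# Made-up rows with the wday and hday columns used in the analysis.
toy <- data.frame(
  wday    = c("Sunday", "Monday", "Saturday", "Sunday", "Monday"),
  hday    = c("N", "Y", "N", "N", "N"),
  pickups = c(300, 200, 500, 350, 400)
)

# Flag a row as a day off if it is an official holiday or a Sunday.
toy$day_off <- ifelse(toy$hday == "Y" | toy$wday == "Sunday", "Y", "N")

# Compare mean pickups between days off and regular days.
mean_by_day_off <- tapply(toy$pickups, toy$day_off, mean)
print(mean_by_day_off)
```

The new flag can then replace `hday` in the earlier plots to redraw the comparison.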
Bookings by day of month do not reveal any obvious pattern, but monthly bookings
increase over the period.