A Framework For Passengers Demand Prediction and Recommendation
Abstract- With the rapid development of mobile internet and wireless network technologies, more and more people use mobile apps to call a taxicab to pick them up. Therefore, understanding passengers’ travel demand becomes crucial to improving the utilization of taxicabs and reducing their cost. In this paper, based on spatio-temporal clustering, we propose a demand hotspot prediction framework to generate recommendations for taxi drivers. Specifically, an adaptive prediction approach is presented to identify demand hotspots and predict their hotness; then, combining the driver’s location and the hotness, top candidates are recommended and visually presented to drivers. Based on the dataset provided by CAR INC., the experiment shows that our approach gains a significant improvement in hotspot prediction and recommendation, with a 15.21% improvement on average F-measure for prediction and a 79.6% hit ratio for recommendation.

Keywords- spatio-temporal cluster, passenger demand hotspot, demand prediction, hotspot recommendation

I. INTRODUCTION
With the rapid development of mobile internet and wireless network technologies in recent years, the transportation industry has changed greatly. More and more passengers in cities rely on mobile apps, such as DiDi¹, Uber², CAR³ and Yongche⁴, to call a taxicab to pick them up. This makes knowledge about potential passengers’ requirements important and valuable, yet many taxicab drivers, especially novice drivers, lack it. Understanding travel requirements not only helps drivers pick up passengers more quickly and earn more money, but also reduces cruising time and energy waste. Therefore, how to understand travel requirements in order to improve efficiency has become an important issue for the transportation industry.
Many approaches have been proposed to address this problem. Clustering approaches [1-3] have been widely used in hotspot analysis, including k-means, hierarchical clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Some papers focus on understanding traffic flow movement and the corresponding benefits for drivers based on GPS traces [4, 5]. Other approaches [6-8] provide recommendations for carpooling services to passengers according to traffic data analysis. Generally speaking, taxi drivers expect to be given the places where passengers may take a taxi, according to the current time and their surrounding location. However, most research on GPS data focuses on recommendation, while the relationships between passenger demand and space-time are rarely captured.
Therefore, to deal with these issues, we present a demand hotspot prediction framework based on spatio-temporal analysis to predict and recommend hotspots for drivers. Based on the analysis of historical data, including when and where passengers get on a taxi, we generate the demand distribution to learn the patterns that help to improve the performance of the spatio-temporal clustering. Then the hotness score of each hotspot is predicted to represent the potential demand of passengers. Considering that it takes time for a driver to reach a given location while the demand is dynamic, the top-k locations, ranked by a combination of hotness and distance, are visually presented to each taxi driver to help improve efficiency.
Hence, the major contribution of this paper is a framework to predict passenger demand and generate recommendations for drivers to improve their efficiency, including the following aspects:
• An adaptive prediction approach is proposed to identify the hotspots and predict the hotness of the passenger demands based on the historical GPS data;
• A method combining the hotness prediction and locations to calculate an attractive score is presented to generate recommendations for each driver;
• A visual prototype system is developed to demonstrate the effectiveness of the presented framework, showing the hotspot distribution denoted by hotness score and the hit ratio with which taxi drivers succeed in picking up passengers in the predicted places.
The rest of this paper is organized as follows. Section II presents our framework for passenger demand prediction and recommendation. Section III details the methodology to predict the hotness of passenger requirements and generate recommendations for drivers. Section IV reports the data and the experiment results. Section V discusses the related work and Section VI concludes the paper.

¹ http://www.xiaojukeji.com/en/index.html
² https://www.uber.com/
³ http://www.10101111.com/
⁴ http://www.yongche.com/
Figure 1. The system framework

In fact, the places where passengers concentrate vary with time, and the location of a taxi changes dynamically.
Hotspot prediction is performed offline. The process is as follows: we first obtain historical taxi pick-up and drop-off points from GPS trajectories and process the pick-up points with a spatio-temporal clustering approach, then extract the hotspots in different time slots and different regions. Specifically, we divide the 24 hours of a day into 24 time segments and partition all pick-up points into 24 subsets accordingly; an adaptive DBSCAN method is then used to obtain the locations with high density in each time segment. In this way we obtain, step by step for every historical date, a core point that stands for the hotspot of a small region. The pick-up points of different time segments are stored in different files prepared for online recommendation. Finally, the results are taken as input parameters of the Exponentially Weighted Moving-Average (EWMA) model, which is a time series forecasting method. The online recommendation process is as follows: we get the current hotspots from the output of the EWMA model according to the taxi driver’s request time and location, and rank them by their attractive score. We then recommend the top-k hotspots around the taxi’s location to the driver.
In our proposed framework, the prediction and the evaluation of the recommendation are critical, and we discuss their details in the following sections.

$DH_i = (p_i, HS_i, t)$    (3)
where HS_i is the hotness score of cluster c_i.

2) Passenger Demand Identify
In order to predict passenger demands, we should identify when and where a demand happens.

a) Data Information
The taxi GPS data used in this paper is provided by CAR Inc.⁵ It was generated by about 3,760 taxis in Beijing from July 1 to July 29, 2015, with each taxi sampled at a time interval of 30 to 40 s. The records denote where and when passengers get on a taxi. Table I reports the fields of the taxi GPS data.

TABLE I. METADATA OF RECORD
Field            Description
VehicleID        the unique taxi ID
Lon              longitude of the point
Lat              latitude of the point
Timestamp        the sampling time
PassengerState   current state of point: 0 indicates the taxi is vacant and 1 indicates the taxi is occupied
RecordState      current state of record: 0 indicates the record is correct and 1 indicates the record is incorrect

⁵ http://www.zuche.com
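The record layout of Table I and the partition into 24 hourly subsets described for the offline stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `GpsRecord`, the function `split_by_hour` and the sample values are hypothetical, and the sketch assumes that records flagged as incorrect are simply skipped.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GpsRecord:
    """One GPS sample with the fields listed in Table I."""
    vehicle_id: str
    lon: float
    lat: float
    timestamp: datetime
    passenger_state: int  # 0 = vacant, 1 = occupied
    record_state: int     # 0 = correct, 1 = incorrect


def split_by_hour(records):
    """Group valid records into hourly subsets keyed by hour of day (0-23)."""
    slots = defaultdict(list)
    for rec in records:
        if rec.record_state != 0:  # assumption: drop records flagged as incorrect
            continue
        slots[rec.timestamp.hour].append(rec)
    return slots


# Hypothetical sample data, for illustration only.
sample = [
    GpsRecord("taxi-1", 116.41, 39.91, datetime(2015, 7, 1, 8, 5), 1, 0),
    GpsRecord("taxi-2", 116.40, 39.92, datetime(2015, 7, 1, 8, 40), 1, 0),
    GpsRecord("taxi-3", 116.39, 39.90, datetime(2015, 7, 1, 17, 15), 1, 1),
]
slots = split_by_hour(sample)
```

Each hourly subset can then be clustered independently, which mirrors the per-segment processing described above.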
b) Data Preprocessing
Because of abnormalities such as GPS device failure, the GPS position may sometimes be incorrect. We focus on the passenger demands of taxis in the city, so the data should be cleaned first. To separate the real vacant and occupied trajectories, we carry out the data preprocessing as follows:

Step 1 Extract the raw taxi trajectories from GPS records
A shift of PassengerState means an occupied/vacant event: a change of PassengerState from 0 to 1 indicates an occupied event, while a change from 1 to 0 indicates a vacant event. An occupied trajectory is defined as a point sequence beginning with an occupied event and ending with a vacant event, while a vacant trajectory is defined as a point sequence beginning with a vacant event and ending with an occupied event. The occupied/vacant event is given by (4).

where Velocity_{i,j} is calculated as the Manhattan distance divided by the time interval between point i and point j. Threshold is set to 120 km/h according to urban traffic regulation. For each record, if f(p_i, p_j) is 0, the record should be filtered.

Step 3 Pick-up and Drop-off Detection
We filter the abnormal trajectories whose average velocity is out of a normal range. During this stage, we detect the places where passengers get on a taxi and where taxis drop off a passenger. For a trajectory TR, we collect the points that satisfy

$P = \{\, p_i \mid S_{p_{i-1}} = 0 \wedge S_{p_i} = 1 \,\}$
$D = \{\, p_i \mid S_{p_{i-1}} = 1 \wedge S_{p_i} = 0 \,\}$    (6)

where S_{p_i} is the PassengerState of point i. Through the processing of trajectories, we get the set of pick-up points P and the set of drop-off points D. For example, Figure 2 shows the pick-up points around Wangfujing Street, which is the most famous commercial area in Beijing. A blue marker represents a pick-up point while a red marker indicates a drop-off point.

Figure 2. The pick-up points around Wangfujing Street

Note that there are two types of points in the dataset, pick-up points and drop-off points. Each pick-up point corresponds to a drop-off point.
After the trajectories have been processed, the pick-up and drop-off points are detected. We then perform a range query according to the location and time of the taxi, which picks out the relevant request records for calculating hotspots. The records that satisfy the condition are selected and form the dataset for clustering. Since the data can sometimes be noisy, we apply the request filtering of Algorithm 1 to reduce false selections.

Algorithm 1 Request Filtering(location, time, expected)
Input:  location, the current location of car
        time, the current time
        records, the records data
        expected, the expected number of records
Output: filtered, the set of filtered data
Procedure:
1.  filtered ← ∅, P ← ∅
2.  n ← sizeof(filtered)
3.  for ri in records
4.    if !is_around(location) /* distance between taxi and location of record more than threshold */
5.       || !is_timeslot(time) /* record is outside the time segment of the request */ then
6.      continue;
7.    end if
8.    if n < expected
9.      P ← ri
10.     filtered ← filtered ∪ P
11.     n ← sizeof(filtered)
12.   end if
13. end for
14. return filtered

Lines 4~7 execute the query in the database to get the relevant records according to the query condition, saved in set P. Lines 8~12 execute the iterative process if the number of returned records is not enough.

3) Hotspot Prediction based on Adaptive DBSCAN
The expected value of e is the value of eps. We use the historical data to learn the parameter eps. For example, as shown in Figure 3, the number of clusters and noise points detected by the algorithm changes with different values of eps_i when i is between 0 and 20. As i increases, the number of clusters and isolated points shows a downward trend. The value of eps_i increases with i, and after i = 7 the number of clusters and noise points converges. Thus we take eps_i at i = 7 as the optimal parameter.

[Figure 3. Legend: clusters, noise]
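Returning to the preprocessing steps, the velocity check and the pick-up/drop-off detection of (6) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the dictionary layout of a point and the degree-to-kilometre conversion (`KM_PER_DEGREE`) are hypothetical; only the 120 km/h threshold, the Manhattan distance and the 0 to 1 / 1 to 0 transitions come from the text.

```python
import math

SPEED_LIMIT_KMH = 120.0  # threshold quoted in the text
KM_PER_DEGREE = 111.32   # assumption: rough length of one degree of latitude


def manhattan_km(p, q):
    """Approximate Manhattan distance in km between two points with lon/lat keys."""
    mean_lat = math.radians((p["lat"] + q["lat"]) / 2.0)
    dx = abs(p["lon"] - q["lon"]) * KM_PER_DEGREE * math.cos(mean_lat)
    dy = abs(p["lat"] - q["lat"]) * KM_PER_DEGREE
    return dx + dy


def is_plausible(p, q):
    """Return False when the implied speed between two samples exceeds the threshold."""
    hours = (q["t"] - p["t"]) / 3600.0  # "t" is a timestamp in seconds
    if hours <= 0:
        return False
    return manhattan_km(p, q) / hours <= SPEED_LIMIT_KMH


def detect_events(trajectory):
    """Collect pick-up points (state 0 -> 1) and drop-off points (state 1 -> 0)."""
    pickups, dropoffs = [], []
    for prev, cur in zip(trajectory, trajectory[1:]):
        if prev["state"] == 0 and cur["state"] == 1:
            pickups.append(cur)
        elif prev["state"] == 1 and cur["state"] == 0:
            dropoffs.append(cur)
    return pickups, dropoffs


# Hypothetical trajectory sampled every 35 s.
trajectory = [
    {"lon": 116.400, "lat": 39.900, "t": 0, "state": 0},
    {"lon": 116.402, "lat": 39.901, "t": 35, "state": 1},
    {"lon": 116.405, "lat": 39.903, "t": 70, "state": 1},
    {"lon": 116.408, "lat": 39.905, "t": 105, "state": 0},
]
pickups, dropoffs = detect_events(trajectory)
```

Because every pick-up is followed by the next drop-off of the same taxi, the two lists pair up one to one for a complete trajectory.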
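Algorithm 1 (Request Filtering) can also be written as a short Python function. The sketch below keeps the control flow of the pseudocode; the helper `distance_km`, the `radius_km` threshold, the use of an hour-of-day field for the time slot and the sample records are all assumptions made for illustration.

```python
import math


def distance_km(a, b):
    """Rough planar distance in km between two (lon, lat) pairs (illustrative only)."""
    mean_lat = math.radians((a[1] + b[1]) / 2.0)
    dx = (a[0] - b[0]) * 111.32 * math.cos(mean_lat)
    dy = (a[1] - b[1]) * 111.32
    return math.hypot(dx, dy)


def request_filtering(location, hour, records, expected, radius_km=2.0):
    """Keep at most `expected` records near `location` within the requested hour slot."""
    filtered = []
    for rec in records:
        near = distance_km(location, (rec["lon"], rec["lat"])) <= radius_km
        same_slot = rec["hour"] == hour
        if not near or not same_slot:
            continue  # lines 4-7 of Algorithm 1
        if len(filtered) < expected:
            filtered.append(rec)  # lines 8-12 of Algorithm 1
    return filtered


# Hypothetical request records.
records = [
    {"lon": 116.410, "lat": 39.910, "hour": 8},
    {"lon": 116.412, "lat": 39.911, "hour": 8},
    {"lon": 116.411, "lat": 39.909, "hour": 17},  # wrong time slot
    {"lon": 116.600, "lat": 40.100, "hour": 8},   # too far away
    {"lon": 116.409, "lat": 39.912, "hour": 8},
]
nearby = request_filtering((116.41, 39.91), 8, records, expected=2)
```

A production version would likely push the range query into the database, as the explanation of lines 4~7 suggests, instead of scanning all records in memory.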
$Y_{n+1} = \frac{\sum_{i=1}^{n} Y_i X_i}{\sum_{i=1}^{n} X_i}$    (11)

where Y_i is the i-th day’s observation, and Y_{n+1} is the value for the day we want to predict. Note that X_i for i = 1, 2, …, n is the weight of the i-th day while n is the size of the time window. In this way, the (n+1)-th day’s prediction value is given by

$EWMA_{n+1} = \frac{p_1 + (1-\alpha)\,p_2 + (1-\alpha)^2\,p_3 + \dots + (1-\alpha)^{n-1}\,p_n}{1 + (1-\alpha) + (1-\alpha)^2 + \dots + (1-\alpha)^{n-1}}$    (12)

Here, α is the attenuation factor, which is set empirically, and N is the number of observation days. The degree of weighting is determined by the factor α. For example, Figure 5 shows the values of the weights when N = 15.

[Figure 5. EWMA weights, N = 15; vertical axis: weight]

B. Recommendation
Online recommendation aims to provide taxi drivers with the best places to cruise, where there is a high probability of getting a passenger. Here we propose the attractive score to evaluate this probability for each driver.
Definition 4 (Attractive Score, AS): Given the hotness score HS, the time t and the distance d between the taxi and a hotspot, the attractive score of the hotspot becomes smaller as the distance increases. The attractive score of c_i at time t is defined as
$AS_{i,t} = \frac{HS_{i,t}}{d^2}$    (13)

In this stage, we provide recommendations based on the proposed model according to the location and time of a taxi driver. Figure 2 outlines the steps of online recommendation, including the prediction of pick-up points, the calculation of the hotness score, the sorting of hotspots and the visualization. Each hotspot contains a location and a hotness score; the higher the hotness score, the higher the value of the place.
Hotspots are urban areas in which passengers request a taxi with high probability. The activities in hotspots characterize the spatial distribution of passengers’ demands. Once the clusters are identified, the hotness score can be calculated. We rank the hotspots by score, and the system returns the top-k places to the taxi driver. The notation is defined in Table III.

TABLE III. NOTIONS IN HOTNESS SCORE
Notation    Description
c_i         The cluster whose id is i
HS_{i,t}    The hotness score of cluster i
AS_{i,t}    The attractive score of cluster i

10. while i < k do
11.   candidate ← candidate ∪ ;
12.   i ← sizeof(candidate);
13. end while
14. return candidate;

Lines 4~7 calculate the AS for each cluster c_i and add it to the set A. Line 9 sorts the set A in descending order. Lines 10~12 generate the top-k candidate places.

IV. RESULTS AND DISCUSSION
A. DataSet
We evaluate our method using taxi trajectory data provided by CAR Inc. It was generated by about 3,760 taxis in the city of Beijing from July 1 to July 29, 2015.
As we know, people travel differently on weekdays and weekends. As shown in Figure 7, this difference is significant. Therefore only weekday data can be used to predict passenger demand on weekdays. We can also observe that on weekdays the number of passengers around the morning rush hour (8 a.m.) is significantly higher than on weekends. This matches the generally accepted assumption that people go to work during the morning rush hours. Likewise, the time slots of 4 p.m. to 7 p.m. correspond to the evening rush hour on workdays, when people go home. This means that people have different travel preferences at different times of the same day. Thus we further segment the day into 24 slots, so that the traffic conditions and the semantic meaning of people’s travel are similar within the same time slot.

[Figure 7. Legend: Weeke…, Weekday; vertical axis: Number of Passengers]

α    0.134    0.188    0.237    0.316    0.422    0.563    0.75

2) Comparing Methodology
In order to prove the effectiveness of the clustering method, we conduct an in-depth comparison of our approach and consider the following four methods:
• ADBSCAN with EWMA method (AE)
• DBSCAN with EWMA method (DE)
• DBSCAN without EWMA method (D)
• ADBSCAN without EWMA method (A)
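The EWMA step that separates AE and DE from A and D follows (12). A minimal Python sketch is given below, under two assumptions that the text does not settle: observations are ordered most recent first (so p_1 carries weight 1), and the weighted sum is normalised by the sum of the same weights, in line with (11). The function name and the sample values are hypothetical.

```python
def ewma_forecast(observations, alpha):
    """Forecast the next value from observations ordered most recent first.

    The k-th most recent observation (k = 0, 1, ...) gets weight (1 - alpha) ** k,
    and the weighted sum is divided by the sum of the weights.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    weights = [(1.0 - alpha) ** k for k in range(len(observations))]
    weighted_sum = sum(w * p for w, p in zip(weights, observations))
    return weighted_sum / sum(weights)


# Hypothetical hotness scores of one hotspot, most recent day first.
history = [40.0, 20.0, 10.0]
prediction = ewma_forecast(history, alpha=0.5)
```

With these inputs the weights are 1, 0.5 and 0.25, so the most recent day dominates the forecast; a larger α makes older days fade faster.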
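Returning to the online recommendation of Section III, the attractive score of (13) and the top-k selection can be sketched as follows. The sketch assumes hotspot positions are already projected to planar coordinates in kilometres and adds a small distance floor to avoid division by zero; both choices, along with all names and sample values, are hypothetical and not taken from the paper.

```python
import heapq
import math


def attractive_score(hotness, distance_km, min_distance_km=0.1):
    """Hotness divided by squared distance, as in (13), with a floor on the distance."""
    d = max(distance_km, min_distance_km)  # assumption: guard against d = 0
    return hotness / (d * d)


def top_k_hotspots(taxi_xy, hotspots, k):
    """Return the ids of the k hotspots with the highest attractive score."""
    scored = []
    for spot in hotspots:
        d = math.hypot(spot["x"] - taxi_xy[0], spot["y"] - taxi_xy[1])
        scored.append((attractive_score(spot["hs"], d), spot["id"]))
    return [spot_id for _, spot_id in heapq.nlargest(k, scored)]


# Hypothetical hotspots: planar position in km and predicted hotness score.
hotspots = [
    {"id": "A", "x": 1.0, "y": 0.0, "hs": 10.0},
    {"id": "B", "x": 2.0, "y": 0.0, "hs": 80.0},
    {"id": "C", "x": 3.0, "y": 0.0, "hs": 45.0},
]
best = top_k_hotspots((0.0, 0.0), hotspots, k=2)
```

Here hotspot B wins despite being farther away than A, because its hotness outweighs the squared-distance penalty, while C is hot but too distant.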
3) Evaluation Metrics
We use three measures to evaluate the prediction approach: precision, recall and F-Measure. Precision is defined as the fraction of predicted records that are relevant. Recall is defined as the fraction of relevant records that are retrieved. F-Measure integrates precision and recall into a single, composite harmonic mean. Formally,

$Precision = \frac{|\{correct\}|}{|\{all\}|}$    (14)

$Recall = \frac{|\{relevant\} \cap \{retrieved\}|}{|\{relevant\}|}$    (15)

$F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$    (16)

4) Result and Discussion
Figure 8 shows the three measures for the four methods over time. Note that the three measures are relatively low from hour 0 to hour 6. As time goes on and more and more people go to work, the approach performs better.

[Figure 8 panels: (a) Precision on four methods; (c) F-Measure on four methods. Horizontal axis: time slot (0 to 20); vertical axis: 0% to 100%; legend: AE, DE, D, A]
Figure 8. Comparison between four methods in different time slots

TABLE V. DETAIL OF THREE MEASURES ON FOUR METHODS
Measures \ Methods   AE       DE       D        A
Precision            67.92%   59.67%   54.47%   54.33%
Recall               71.08%   62.16%   53.88%   54.38%
F-Measure            69.31%   60.76%   58.56%   54.1%

Table V details the evaluation of the four methods on the three measures, averaged over the hours, and shows that our approach (AE) improves the F-measure by 15.21 percentage points over method A (69.31% versus 54.1%) for prediction.

C. Recommendation Effectiveness
1) Comparing Methodology
In order to evaluate the effectiveness of the recommendation, we perform experiments on three methods: only distance considered, only attractive score considered, and a combination of distance and attractive score.
2) Evaluation Metrics
According to (13), the further away a hotspot is from the given location, the lower the attractive score it is assigned. We consider two measures to evaluate effectiveness by comparing predictions with real data: Number of Hit Points (NHP) and Hit Ratio (HR). If the taxi picks up a passenger within one kilometer of the recommended location, we count a hit. NHP is the total number of hits. Formally,

$HR = \frac{|\{hit\}|}{|\{recommended\}|}$    (17)

3) Result and Discussion
As shown in Figure 9, taking the attractive score into account achieves a much better performance than the method which only considers the distance, with a 13.89% improvement in HR.
Actually, by assigning different weights to the attractive score and the distance, we get the best performance when AS_{i,t} : DIS = 0.4 : 0.6, resulting in the following method for recommendation:

[Figure 9]

V. RELATED WORK
Taxicab service falls into two general categories, and research follows this division, occasionally attempting to bridge them [10]. The first category is dispatching, where companies dispatch taxicabs to specific locations requested by customers. The second category is cruising, where the taxicab cruises the streets looking for passengers based on experience.
As for the analysis of taxi demand, Yuan et al. presented a bidirectional recommender for taxi drivers and passengers, using the knowledge of passengers’ mobility patterns and taxi drivers’ pick-up behaviors learned
from the GPS trajectories of taxicabs [11]. Shen et al. developed an analysis method based on a city’s short-dated taxi GPS traces and provided recommendations to help cruising taxi drivers find potential passengers with optimal routes [12]. Li et al. proposed an algorithm that uses taxi GPS traces to characterize usage on a road-segment basis [13]. Moreira-Matias et al. introduced a novel methodology for predicting the spatio-temporal distribution of taxi-passenger demand [14].
In other applications of taxi trajectories, researchers have been concerned with understanding traffic flow movement and the corresponding benefits for drivers [15]. Zheng et al. carried out a series of studies based on GPS trajectories: GeoLife (Geography Life) [16], an application system that is centered on GPS data and displays it on an electronic map. Treating taxi service strategies as the crowd intelligence of massive numbers of taxi drivers, Zhang et al. sought to understand the service strategies of skilled taxi drivers based on a large-scale historical GPS database [17]. In [18], an exhaustive survey is given of the work on mining traces that tell us where passengers are picked up and dropped off, classifying the existing work into several categories.
Different from all previous work, we first apply a spatio-temporal clustering method to extract hotspots from the taxi trajectories. Then, we combine the historical hotspots with a time series forecasting model to predict the demands of passengers in urban areas.

VI. CONCLUSION
This paper proposed a novel framework which combines time-series forecasting techniques and a spatio-temporal clustering method, using historical taxi trajectories, to predict passengers’ demands in urban areas. First, we detect passenger demand, including location and time, from the GPS trajectories. Second, an adaptive prediction approach is proposed to identify the hotspots and predict the hotness of the passenger demands. Third, a recommender combining locations and the hotness prediction is built for taxi drivers. Finally, a visual prototype system is developed to demonstrate the effectiveness of the presented framework. The experiments based on the GPS data generated by 3,760 taxis from CAR INC. show that, compared with the original method, our framework gains a 15.21% improvement on average F-measure for prediction and a 79.6% hit ratio for recommendation.
Hotspot prediction not only helps taxi drivers find passengers quickly, but also reduces traffic jams. In the future, we plan to extend our framework into a platform which combines traffic flow and the road network, providing a scheduling service for the company.

ACKNOWLEDGMENT
This work is supported by the National Natural Science Foundation of China grant 61373035, 61502333, 61502334, 61572350, the Open Fund of Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, and the Tianjin Research Program of Application Foundation and Advanced Technology grant 14JCYBJC15600. Keman Huang is the corresponding author.

REFERENCES
[1] Murtagh, F. "A Survey of Recent Advances in Hierarchical Clustering Algorithms." Computer Journal 26.4(1983):354-359.
[2] Macqueen, J. "Some Methods for Classification and Analysis of MultiVariate Observations." In 5th Berkeley Symp. Math. Statist. Prob 1967:281-297.
[3] Ester, Martin, et al. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." Proceedings of the 2nd International Conference Knowledge Discovery and Data Mining 1996.
[4] Yuan, Jing, et al. "T-drive: driving directions based on taxi trajectories." Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems ACM, 2010:99-108.
[5] V. W. Chu, R. K. Wong, W. Liu, F. Chen and C. S. Perng, "Traffic Analysis as a Service via a Unified Model," Services Computing (SCC), 2014 IEEE International Conference on, Anchorage, AK, 2014, pp. 195-202, doi: 10.1109/SCC.2014.34
[6] S. Ma, Y. Zheng and O. Wolfson, "T-share: A large-scale dynamic taxi ridesharing service," Data Engineering (ICDE), 2013 IEEE 29th International Conference on, Brisbane, QLD, 2013, pp. 410-421, doi: 10.1109/ICDE.2013.6544843
[7] Z. Zhang, G. Wang, B. Cao and Y. Han, "Data Services for Carpooling Based on Large-Scale Traffic Data Analysis," Services Computing (SCC), 2015 IEEE International Conference on, New York, NY, 2015, pp. 672-679, doi: 10.1109/SCC.2015.96
[8] Ming-Kai Jiau, Shih-Chia Huang and Chih-Hsian Lin, "Optimizing the Carpool Service Problem with Genetic Algorithm in Service-Based Computing," Services Computing (SCC), 2013 IEEE International Conference on, Santa Clara, CA, 2013, pp. 478-485, doi: 10.1109/SCC.2013.56
[9] Zhou, Hongfang, and P. Wang. "Research on Adaptive Parameters Determination in DBSCAN Algorithm." Journal of Xian University of Technology (2012).
[10] Zheng, Yu, et al. "Understanding mobility based on GPS data." International Conference on Ubiquitous Computing ACM, 2008:312-321.
[11] Yuan, Jing, et al. "Where to find my next passenger." Proceedings of the 13th international conference on Ubiquitous computing ACM, 2011:109-118.
[12] Shen, Ying, L. Zhao, and J. Fan. "Analysis and Visualization for Hot Spot Based Route Recommendation Using Short-Dated Taxi GPS Traces." Information 6.2(2015):134-151.
[13] Li, Qingquan, et al. "Hierarchical route planning based on taxi GPS-trajectories." Geoinformatics, 2009 17th International Conference on IEEE, 2009:1-5.
[14] Moreira-Matias, L., et al. "Predicting Taxi–Passenger Demand Using Streaming Data." IEEE Transactions on Intelligent Transportation Systems 14.3(2013):1393-1402.
[15] J. Yuan, Y. Zheng, X. Xie and G. Sun, "T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence," in IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 220-232, Jan. 2013, doi: 10.1109/TKDE.2011.200
[16] Zheng, Yu, et al. "GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory." Bulletin of the Technical Committee on Data Engineering 33.2(2010):32-39.
[17] Zhang, Daqing, B. Guo, and Z. Yu. "The Emergence of Social and Community Intelligence." Computer 44.7(2011):21-28.
[18] Castro, Pablo Samuel, et al. "From Taxi GPS Traces to Social and Community Dynamics: A Survey." ACM Computing Surveys 46.2(2014):1167-1182.