Automatic Itinerary Planning For Traveling Services: Gang Chen, Sai Wu, Jingbo Zhou, and Anthony K.H. Tung

This document proposes an automatic itinerary planning service to generate multi-day customized trip plans for travelers. It discusses limitations of previous approaches that only consider single-day trips and popular points of interest. The document introduces a two-stage approach using MapReduce to efficiently precompute single-day itineraries, then combine them to form optimized multi-day itineraries based on users' preferences. Experimental results on real travel data demonstrate the approach can efficiently generate high-quality personalized itineraries.

514 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 3, MARCH 2014

Automatic Itinerary Planning for Traveling Services
Gang Chen, Sai Wu, Jingbo Zhou, and Anthony K.H. Tung

Abstract—Creating an efficient and economic trip plan is the most annoying job for a backpack traveler. Although travel agencies can provide some predefined itineraries, they are not tailored for each specific customer. Previous efforts address the problem by providing an automatic itinerary planning service, which organizes the points of interest (POIs) into a customized itinerary. Because the search space of all possible itineraries is too costly to fully explore, most work simplifies the problem by assuming that the user's trip is limited to some important POIs and will complete within one day. To address the above limitation, in this paper, we design a more general itinerary planning service, which generates multiday itineraries for the users. In our service, all POIs are considered and ranked based on the users' preference. The problem of searching the optimal itinerary is a team orienteering problem (TOP), a well-known NP-complete problem. To reduce the processing cost, a two-stage planning scheme is proposed. In its preprocessing stage, single-day itineraries are precomputed via MapReduce jobs. In its online stage, an approximate search algorithm is used to combine the single-day itineraries. In this way, we transform the TOP, which has no polynomial approximation, into another NP-complete problem (the set-packing problem) with good approximate algorithms. Experiments on real data sets show that our approach can generate high-quality itineraries efficiently.

Index Terms—MapReduce, trajectory, team orienteering problem, itinerary planning, location-based service

• G. Chen and S. Wu are with the College of Computer Science, Zhejiang University, Yuquan Campus, Zheda Road, Hangzhou, Zhejiang 310027, P.R. China. E-mail: {cg, wusai}@zju.edu.cn.
• J. Zhou and A.K.H. Tung are with the School of Computing, National University of Singapore, Computing 1, Computing Drive, Singapore 117417, Singapore. E-mail: {jzhou, atung}@comp.nus.edu.sg.
Manuscript received 16 Nov. 2012; revised 5 Feb. 2013; accepted 6 Feb. 2013; published online 12 Mar. 2013.
Recommended for acceptance by G. Yu.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2012-11-0783.
Digital Object Identifier no. 10.1109/TKDE.2013.46.
1041-4347/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

The traveling market is divided into two parts. Casual customers pick a package from local travel agents. The package, in fact, represents a pregenerated itinerary. The agency helps the customer book the hotels, arrange the transportation, and preorder the tickets of museums and parks. This spares the customers from constructing their own personalized itineraries, which is very time-consuming and inefficient. For instance, Fig. 1 lists a four-day package to Hong Kong, provided by a Singapore agency. It covers the most popular POIs for a first-time traveler, and the customers just need to follow the itinerary to schedule their trips.

Although the travel agencies provide efficient and convenient services, for experienced travelers the itineraries provided by the travel agents lack customization and cannot satisfy individual requirements. Some POIs of interest are missing from the itineraries, and the packages are too expensive for a backpack traveler. Therefore, such travelers have to plan their trips in every detail, such as selecting the hotels, picking POIs to visit, and contacting the car rental service.

Therefore, to attract more customers, travel agencies should allow the users to customize their itineraries and still enjoy the same services as with the predefined itineraries. However, it is impossible to list all possible itineraries for users. A practical solution is to provide an automatic itinerary planning service. The user lists a set of POIs of interest and specifies the time and money budget. The itinerary planning service returns the top-K trip plans satisfying the requirements. In the ideal case, the user selects one of the returned itineraries as his plan and notifies the agent.

However, none of the current itinerary planning algorithms (e.g., [1] and [2]) can generate a ready-to-use trip plan, as they are based on various assumptions.

First, current planning algorithms only consider a single day's trip, while in real cases, most users will schedule an n-day itinerary (e.g., the one shown in Fig. 1). Generating an n-day itinerary is more complex than generating a single-day one. It is not equal to constructing n single-day itineraries and combining them together, as a POI can only appear once in the itinerary. It is tricky to group POIs into different days. One possible solution is to exploit the geolocations; for example, nearby POIs are put in the same day's itinerary. Alternatively, we can also rank POIs by their importance and use a priority queue to schedule the trip.

Second, the travel agents tend to favor the popular POIs. Even for a city with a large number of POIs, the travel agents always provide the same set of trip plans, composed of top POIs. However, those popular POIs may not be attractive for users who have visited the city several times or have a limited time budget. It is impossible for a user to get his personal trip plan. The travel agent's service cannot cover the whole POI set, leading to few choices for the users. In our algorithm, we adopt a different approach by giving high priorities to the selected POIs and generating a customized trip plan on the fly.

Third, suppose we have N available POIs and there are, on average, m POIs in each single day's itinerary. We will end
up with $\frac{N!}{(N-m)!\,m!}$ candidate itineraries. It is costly to evaluate the benefit of every itinerary and select the optimal one. Therefore, in [1] and [2], some heuristic approaches are adopted to simplify the computation. However, the heuristic approaches are based on some assumptions (e.g., popular POIs are selected with a higher probability). They only provide a limited number of itineraries and are not optimized for the backpack traveler, who plans to have a unique journey with his own customized itinerary.

Last but not least, handling newly emerging POIs was tricky in previous approaches. The model needs to be rebuilt to evaluate the benefit of including the new POIs in the itinerary. For systems based on the users' feedback [2], we need to collect the comments for the new POIs from the users, which is very time-consuming.

To address the above problems, in this paper, a novel itinerary planning approach is proposed. The design philosophy of our approach is to generate itineraries that narrow the gap between the agents and travelers. We reduce the overhead of constructing a personalized itinerary for the traveler, and we provide a tool for the agents to customize their services. Fig. 2 shows an overall architecture of our trip-planning system. Specifically, our approach can be summarized as follows.

Fig. 1. A four-day trip to Hong Kong.

Fig. 2. Architecture of a trip planning system.

In the preprocessing, POIs are organized into an undirected graph, G. The distance between two POIs is evaluated by the Google Maps APIs (https://developers.google.com/maps/). Given a request, the system provides interfaces for the user to select preferred POIs explicitly, while the remaining POIs are assumed to be optional POIs. Different ranking functions are applied to different types of POIs. The automatic itinerary planning service needs to return an itinerary with the highest ranking. Searching the optimal itinerary can be transformed into the team orienteering problem (TOP), which is an NP-complete problem without polynomial approximations [3]. Therefore, a two-stage scheme is applied.

In the preprocessing stage, we iterate all candidate single-day itineraries using a parallel processing framework, MapReduce [4]. The results are maintained in the distributed file system (DFS) and an inverted index is built for efficient itinerary retrieval. To construct a multiday itinerary, we need to selectively combine the single-day itineraries. The preprocessing stage, in fact, transforms the TOP into a set-packing problem [5], which has well-known approximate algorithms. In the online stage, we design an approximate algorithm to approximate the optimal itineraries. The approximate algorithm adopts the initialization-adjustment model, and a theoretic bound is given for the quality of the approximate result.

To evaluate the proposed approach, we use the real data from Yahoo Travel (http://travel.yahoo.com/). The experiments show that our approach can efficiently return high-quality customized itineraries. The remainder of this paper is organized as follows: In Section 2, we formalize the problem and give an overview of our approach. Then, Section 3 and Section 4 present the preprocessing stage and online stage of our approach, respectively. We evaluate our approach in Section 5 and review previous work in Section 6. Finally, the paper is concluded in Section 6.

2 OVERVIEW

2.1 Problem Statement

In the itinerary planning system, the user selects a set of POIs of interest, $S_p$, and asks the system to generate a k-day itinerary. We use $(S_p, k)$ to denote a user's request. To model the planning problem, we organize the POIs into a complete graph, the POI graph.

Definition 1 (POI Graph). In the POI graph $G = (V, E)$, we generate a vertex for each POI, and every pair of vertices is connected via an undirected edge in $E$. In $G$, the vertices and edges have the following properties:
1. $\forall v_i \in V$, $w(v_i)$ denotes the weight (importance) of the POI and $t(v_i)$ is the average time that tourists will spend on the POI.
2. $\forall (e_x = v_i \rightsquigarrow v_j) \in E$, $t(e_x)$ is the cost of the edge, computed as the average traveling time from $v_i$ to $v_j$.

Fig. 3 shows a POI graph with five nodes. Each node denotes a POI and has two properties: the weight and travel time (shown in the red blocks). The nodes are connected via weighted edges. The edge's weight is set to the average traveling time for the shortest path between the corresponding POIs in the map. In fact, there are two types of edges. The first type represents that the two nodes are directly connected in the map (no other POI exists on their shortest path, e.g., $0 \rightsquigarrow 1$). The second type is composed of multiple shortest paths in the map (e.g., $0 \rightsquigarrow 3$, which consists of $0 \rightsquigarrow 1$ followed by $1 \rightsquigarrow 3$). Transforming the POI graph into a complete graph reduces the processing cost of our itinerary algorithm.

The definition of the POI graph assumes that the costs of edges are symmetric. Namely, the traveling time from $v_i$ to $v_j$ is equal to the time from $v_j$ to $v_i$. In fact, as our approach
does not rely on the assumption, it can be directly applied to the case of nonsymmetric costs (e.g., traffic differs for $v_i \rightsquigarrow v_j$ and $v_j \rightsquigarrow v_i$).

Fig. 3. POI graph.

Let $w(v_i)$ denote the weight (importance) of POI $v_i$. The initial weight of $v_i$ is generated from the users' reviews (e.g., in Yahoo Travel, users can specify a score ranging from 0 to 5 for each POI; we accumulate the scores and use the average value as the initial weight).

Users can also select a set of preferred POIs, denoted as $S_p$. Given a request $(S_p, k)$, if $v_i$ is selected by the request ($v_i \in S_p$), we intentionally increase its weight to a boosted value based on $(w(v_i) + 1)$, controlled by a parameter that can be set to an arbitrary integer. The intuition is that user-selected POIs are far more important than any other POIs.

For a request $(S_p, k)$, if $k = 1$, we just need to generate a single-day itinerary. A single-day itinerary is represented as $L = v_0 \rightsquigarrow \cdots \rightsquigarrow v_n \rightsquigarrow h_j$, where $h_j$ is a hotel POI. The elapsed time is estimated as

$$t(L) = \sum_{i=0}^{n} t(v_i) + \sum_{i=0}^{n-1} t(v_i \rightsquigarrow v_{i+1}) + t(v_n \rightsquigarrow h_j).$$

In the rest of the discussion, we remove the hotel part and focus on how to merge the POIs into itineraries. After all other POIs are fixed, we will solve the hotel-selection problem.

Assume there are $H$ available hours per day for traveling. The itinerary $L$ must satisfy $t(L) \le H$. A common traveling request always includes a k-day ($k \ge 1$) trip, which is defined as follows.

Definition 2 (k-Day Itinerary). Given a POI graph $G$ and time budget $k$, a valid k-day itinerary consists of $k$ single-day itineraries, $\{L_1, L_2, \ldots, L_k\}$, which satisfy that
1. $\forall i \ne j$, $L_i$ and $L_j$ do not share a POI.
2. $t(L_i) \le H$ for all $1 \le i \le k$.

Based on the POIs included in the itinerary, the score of a k-day itinerary $T = \{L_1, \ldots, L_k\}$ can be computed as

$$w(T) = \sum_{i=1}^{k} \sum_{v_j \in L_i} w(v_j). \qquad (1)$$

The goal of our itinerary planning algorithm is to find the k-day itinerary with the highest score. However, we will show that finding the optimal itinerary is an NP-complete problem, which is equivalent to the team orienteering problem [3]. Even an approximate algorithm within a constant factor does not exist. The existing work [6] solves the problem by employing heuristic algorithms, which may generate arbitrarily bad results.

2.2 System Architecture

In our system, instead of trying to propose new algorithms for the TOP, we transform the optimal itinerary planning problem into a set-packing problem by an offline MapReduce process, and an approximate algorithm is applied to solve the set-packing problem. If the maximal number of POIs in a single-day itinerary is bounded by $m$, the optimal result can be approximated within a factor of $\frac{2(m+1)}{3}$.

Fig. 2 shows the architecture of our trip-planning system. In the first step, the POI graph is constructed via the road network and POI coordinates. The Google Maps APIs are used to evaluate the distance between POIs. The average elapsed time of a POI is estimated from users' blogs and travel agencies' schedules.

After the POI graph is constructed, a set of MapReduce jobs is submitted to iterate all possible single-day itineraries in the preprocessing. The number of itineraries is exponential in the number of POIs. However, using a parallel processing engine, such as MapReduce, we can efficiently generate all itineraries in an offline manner. To speed up the single-day itinerary retrieval, an inverted index is built. Given a POI, all single-day itineraries involving the POI can be efficiently retrieved.

For a user request $(S_p, k)$, POIs' weights are updated based on $S_p$ and we compute the score of each single-day itinerary. The problem of finding the optimal k-day itinerary is transformed into selecting $k$ single-day itineraries that maximize the total score. We show that the new problem can be reduced to the weighted set-packing problem, which has polynomial approximate algorithms. Therefore, we simulate the approximate algorithm for the set-packing problem to generate the k-day itinerary. The algorithm uses a greedy strategy to create an initial solution, which is continuously refined in the adjustment phase. The adjustment phase scans the index to find a potentially better solution.

In the next two sections, we first present how we apply the MapReduce framework to generate and index the single-day itineraries. The parallel processing engine enables us to search the optimal solution in a brute-force manner. Next, we show that, after the preprocessing, the complexity of the TOP is reduced and approximate algorithms are available.

3 PREPROCESSING

The preprocessing includes two steps. In the first step, a set of MapReduce jobs is submitted to produce all possible single-day itineraries. In the second step, the single-day itineraries are reorganized as an itinerary index, which supports efficient itinerary search.

3.1 Intractability of Optimal Itinerary Algorithm

Given a user request $(S_p, k)$, the goal of an itinerary planning algorithm is to provide an itinerary that ranks
highest among all possible itineraries. The score of the itinerary is computed based on the POI weights. However, as shown in the following theorem, this is an NP-complete problem and no polynomial-time algorithm is known.

Theorem 1. Finding the optimal k-day itinerary in a POI graph $G = (V, E)$ is an NP-complete problem.

Proof (Sketch). The optimal k-day itinerary can be reduced to the TOP [3], which is a well-known NP-complete problem. Consider a simple scenario where
1. k vehicles are created, which start from the same position.
2. Each vehicle has a time limit (1 day) for traveling the POIs.
3. Each vehicle collects the profit by visiting the POIs.
4. The POI accessed by a vehicle will not be considered by other vehicles.
5. The POI's profit is equal to its weight.
The TOP is to find the traveling plan that generates the most profit. The result of the TOP is also the best k-day itinerary. □

Due to the complexity of the TOP, it is impractical to find the exact solution. Instead, previous work focuses on proposing heuristic algorithms. The basic idea is to generate an initial plan and then adjust it based on some heuristic rules. Those algorithms have three drawbacks. First, the heuristic algorithms need many iterations to get a good enough result, which incurs a high computation cost [7]. Second, the adjusting rules are too complicated and the potential gains are unknown. Finally, there is no bound on the approximate result, which may be arbitrarily bad in some cases.

In this paper, we reduce the complexity of the TOP by transforming it into a set-packing [8] problem. As the transformation is done in an offline manner, the performance of online query processing is not affected.

3.2 Single-Day Itinerary

The basic idea of the transformation is to iterate all possible single-day itineraries. This is done by a set of MapReduce jobs. In the first job, we generate $|P|$ initial itineraries for the POI set $P$. Each initial itinerary only consists of one POI. Iteratively, the subsequent MapReduce job tries to add one more POI to the itineraries. If no more single-day itineraries can be generated, the process terminates. In the current implementation, we allow at most $m$ MapReduce jobs in the transformation process to reduce the overheads. Therefore, a single-day itinerary contains at most $m$ POIs. This strategy is based on the assumption that users cannot visit too many POIs in one day. In our crawled data set from Yahoo Travel, setting $m$ to 10 is enough for the Singapore data, which include more than 400 POIs. Only a few single-day itineraries can contain more than 10 POIs.

Algorithms 1 and 2 show the pseudocode of the MapReduce job. The mappers load the partial paths from the DFS, which are generated in the previous MapReduce jobs. We try to append a new POI to the existing itineraries. For each new path, we test whether it can be completed within one day. If not, we will discard the new path. If the old path cannot result in any new path, we will output the old path. For the last MapReduce job (the mth job), all the candidate itineraries are used as the results. The output key-value pair uses the sorted POIs in the itinerary as the key.

Algorithm 1. map(Object key, Text value, Context context).
// we allow at most m rounds of MapReduce jobs, i.e., the maximal length of a path is m
// value: existing path; each MapReduce job tries to add one more POI to the path
1: Path P = parsePath(value)
2: for i = 0 to POIGraph.POINumber do
3:   if isConnected(P, i) and !P.contains(i) then
4:     Path newPath = P.append(i)
5:     cost = P.cost + POIGraph.getCost(P.endPOI, i) + POIGraph.getCost(i)
6:     weight = P.weight + POIGraph.getWeight(i)
7:     newPath.cost = cost
8:     newPath.weight = weight
9:     if newPath.cost ≤ H then
10:      Key newKey = parsePath(newPath).sort();
11:      context.collect(newKey, newPath)
12:    else
13:      DFS.write(resultFile, P)

Algorithm 2. reduce(Key key, Iterable values, Context context).
1: bestCost = ∞
2: bestPath = NIL
3: for Path P : values do
4:   if P.cost < bestCost then
5:     bestPath = P
6:     bestCost = P.cost
7: context.collect(key, bestPath)

In the mappers, to compute the weight and cost of a new itinerary, we load the POI graph table from the DFS. As the graph table is small, each reducer maintains a copy in its memory. The table's schema is as follows:

(S_POI, E_POI, S_weight, E_weight, S_cost, E_cost, cost),

where S_POI and E_POI denote the two POIs linked by a specific edge, cost is the traveling cost from S_POI to E_POI, and S_POI is the primary key of the table.

In the reducers (Algorithm 2), we select the path with the smallest cost among the paths with the same POIs. In each reducer, all the paths have the same POIs. We only keep the path with the smallest cost and output that path for the next round. Note that since all the paths have the same POIs, these paths have the same weight.

After all itineraries have been generated, a cleaning process is invoked to remove the duplication. For two itineraries ($L_0 = v_0 \rightsquigarrow \cdots \rightsquigarrow v_n$ and $L_1 = v'_0 \rightsquigarrow \cdots \rightsquigarrow v'_n$), $L_0$ contains $L_1$ iff

$$\forall v'_j \in L_1 \rightarrow \exists v_i \in L_0 \; (v_i = v'_j).$$

Namely, all POIs in $L_1$ are also included in $L_0$. If $L_0$ contains $L_1$, we will only keep $L_0$, as it provides more POIs for the users.
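The round-by-round expansion of Section 3.2 can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions, not the authors' Hadoop implementation: the toy graph, the value of `H`, and the helper names `map_expand`, `reduce_best`, `generate`, and `remove_contained` are all hypothetical. It follows the prose description (an old path is emitted only when none of its extensions fits in a day), keeps the cheapest path per POI set as in the reducer, and then applies the containment-based cleaning step. POI weights are omitted because they do not affect which itineraries are generated.

```python
# Toy POI graph; every number below is a made-up illustration, not data from the paper.
VISIT = {0: 2.0, 1: 1.0, 2: 3.0}                   # t(v): average visiting time (hours)
TRAVEL = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 0.5}   # symmetric travel time (hours)
H = 6.0                                            # available hours per day (assumed)


def travel(a, b):
    """Travel time between two POIs, looked up in either direction."""
    return TRAVEL.get((a, b), TRAVEL.get((b, a)))


def map_expand(path):
    """Map step: try to append one unvisited POI to an existing path.

    A path is a pair (pois, cost). Each feasible extension is emitted under
    a key made of its sorted POIs, so paths over the same POI set meet in
    the same reduce group.
    """
    pois, cost = path
    for i in VISIT:
        if i in pois:
            continue
        new_cost = cost + travel(pois[-1], i) + VISIT[i]
        if new_cost <= H:
            new_pois = pois + (i,)
            yield tuple(sorted(new_pois)), (new_pois, new_cost)


def reduce_best(pairs):
    """Reduce step: keep only the cheapest path for each POI set."""
    best = {}
    for key, path in pairs:
        if key not in best or path[1] < best[key][1]:
            best[key] = path
    return best


def generate(max_rounds):
    """Run up to max_rounds rounds; a path holds at most max_rounds POIs."""
    frontier = [((p,), VISIT[p]) for p in VISIT if VISIT[p] <= H]
    finished = []
    for _ in range(max_rounds - 1):
        pairs = []
        for path in frontier:
            extensions = list(map_expand(path))
            if extensions:
                pairs.extend(extensions)
            else:
                finished.append(path)   # cannot grow any further: output it
        if not pairs:
            frontier = []
            break
        frontier = list(reduce_best(pairs).values())
    finished.extend(frontier)           # last round: all candidates are results
    return finished


def remove_contained(itineraries):
    """Cleaning step: drop an itinerary whose POIs all appear in a larger one."""
    sets = [frozenset(pois) for pois, _ in itineraries]
    return [it for it, s in zip(itineraries, sets)
            if not any(s < other for other in sets)]


print(remove_contained(generate(max_rounds=3)))
```

Keying each emitted path by its sorted POIs is what lets the reduce step compare different visiting orders of the same POI set. The sketch inherits the pruning of Algorithm 2: only one order per POI set survives each round.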
3.3 Itinerary Index

To efficiently locate the single-day itineraries, an inverted index is built. The key is the POI and the values are all itineraries involving the POI. By scanning the index, we can retrieve all the itineraries. Fig. 4 illustrates the index structure. We create an index file for each POI in the DFS. The file includes all single-day itineraries involving the POI, which are sorted based on their weights. For example, in Fig. 4, "1.idx" contains all itineraries for the first POI. The itinerary "1|5|20|12|40" is the most important itinerary in the index file, with weight 320.

Fig. 4. Itinerary index.

The inverted index is constructed via a MapReduce job. Algorithms 3 and 4 show the process. The mappers load the single-day itineraries and generate key-value pairs for each involved POI. The reducers collect all itineraries for a specific POI and sort them based on the weights before creating the index file. In our system, the size of the index file may vary a lot. Some POIs may have an extremely large index file, due to their popularity and short visit time. In the reducers, those POIs may cause a memory overflow exception in the sorting process. To address this problem, in the map phase, instead of using the POI as the key, we generate a composite key by combining the POI and the itinerary weight.

Algorithm 3. map(Object key, Text value, Context context).
// value: single-day itinerary
1: Itinerary it = parse(value)
2: for i = 0 to it.POISize() do
3:   int nextPOI = it.getNext(i)
4:   Key key = new CompositeKey(nextPOI, it.weight/bucketSize)
5:   context.collect(key, it)

Algorithm 4. reduce(Key key, Iterable values, Context context).
1: CompositeKey ck = key, Set s = ∅
2: for Itinerary it : values do
3:   s.add(it)
4: sort(s)
5: DFSFile f = new DFSFile(ck.first + “ ” + ck.second)
6: f.write(s)

In particular, we partition the itineraries into $n$ buckets. The bucket ID is used as a part of the composite key. In this way, we split the itineraries of a POI into $n$ groups, and each group can be efficiently sorted in memory. Each group will result in an index file. However, it is not necessary to merge the files, as the files are partitioned based on the weights. By scanning all files from the nth bucket to the first bucket, we can get a sorted list of all itineraries involving a POI.

To simplify the index manipulation, an index manager is built in our query engine. The index manager only provides one interface, scan(POI), where POI denotes the owner of the index. The interface returns an iterator, which can be used to retrieve all itineraries of the POI. A memory buffer is established to cache the used itineraries, and the LRU strategy is applied to maintain the buffer.

3.4 Discussion: Why MapReduce

Although the input data set (POI graph) is small in size, the partial results of the possible itineraries are extremely large (more than 100 GB or even 1 TB). The computation is also intensive and cannot be completed by a single machine. MapReduce is the solution to partition the partial results and generate the itineraries in parallel. Its advantages are twofold:
1. Parallel computing effectively reduces the running time of preprocessing. The search space explodes when the number of POIs and traveling days increases. It is impractical to generate all possible itineraries on one machine. But by exploiting the power of MapReduce, we can share and balance the workload between multiple machines. Scalability is achieved by adding more nodes to the cluster. In our experiment, the running time of preprocessing is significantly reduced as the number of nodes grows (see Fig. 12).
2. MapReduce algorithms can remove the duplicated itineraries in a simple way. In Algorithm 2, by leveraging the MapReduce framework, we map all the itineraries with the same POIs into the same reducer and only keep the one itinerary with the lowest cost. This approach can prune the low-benefit partial itineraries as early as possible and leads to less input for the next round of computation.

4 GREEDY-BASED APPROXIMATION ALGORITHM

After the itinerary indexes are constructed, the user request $(S_p, k)$ can be processed by selecting the $k$ best itineraries from the indexes. Namely, the problem of generating the optimal k-day itinerary is transformed into a weighted set-packing problem, as shown in the following theorem.

Definition 3 (Weighted Set-Packing Problem). In a universe $U$, we assume that each element in $U$ has a weight and the weight of any subset of $U$ equals the sum of the element weights in the subset. Given a family $S$ of $U$'s subsets, the set-packing problem is to select a subfamily $S'$ from $S$, where all subsets in $S'$ are disjoint and the weight of $S'$ is maximal among all possible selections.

Theorem 2. Finding the optimal k-day itinerary can be reduced to the weighted set-packing problem.

Proof (Sketch). By solving the set-packing problem, we can also get the optimal k-day itinerary, as
1. Each single-day itinerary can be considered as a subset of the POI set $P$.
2. The subsets selected by the set-packing problem are disjoint, and hence in the k-day itinerary, we will not visit a POI twice.
3. Each subset is replicated $k - 1$ times and thus, we have $k$ identical itineraries. For the ith itinerary, a virtual POI $x_i$ is appended, denoting that the itinerary is designed for the ith day.
4. Apply the algorithm of set-packing to get the optimal solution. Let $S_r$ be the result set. If $|S_r| > k$, there must be two itineraries for the same day and they are not disjoint. If $|S_r| < k$, we still have available days for traveling and new itineraries can be added. Therefore, $|S_r| = k$ and $S_r$ can be considered as a k-day itinerary. □

In step 3 of our proof, we replicate the itinerary $k - 1$ times. That is to guarantee that the solution of the set-packing problem returns exactly $k$ subsets. Fig. 5 illustrates the idea. Suppose we have four index files and want to generate a two-day itinerary. Without the replication, the set-packing algorithm may return a three-day itinerary, such as "5|1|6," "2|8|9," and "3|7|4." By replicating the itineraries and adding the virtual elements $X_1$ and $X_2$, the above selection cannot work, as two itineraries will share at least one virtual element. In this case, the set-packing algorithm will return another solution (e.g., "1|2|4|$X_1$" and "7|5|3|$X_2$"), which satisfies our time requirement.

Fig. 5. Example of set-packing.

Although set-packing is also an NP-complete problem, unlike the TOP, the set-packing problem has approximate algorithms in a special case. As mentioned in the preprocessing, we set the maximal number of MapReduce jobs in generating the single-day itineraries to $m$. Therefore, each itinerary can have at most $m$ POIs. It was shown that when the size of the subsets is bounded by a constant, the weighted set-packing problem can be solved by polynomial approximations [8], [9]. Following the above ideas, in this paper, we design a variant of the approximate algorithm in [8], which provides a bound of $\frac{2(m+1)}{3}$ for the quality of the approximate answers. The algorithm includes an initialization phase and an adjustment phase.

4.1 Initialization

For the user request $(S_p, k)$, we adjust the weights of POIs in $S_p$ to emphasize the user's selection. If $v_i \in S_p$, $v_i$'s weight is increased to a boosted value based on $(w(v_i) + 1)$, controlled by a parameter that is an integer larger than 0, where $w(v_i)$ is the original weight of POI $v_i$. Algorithm 5 shows how we generate the seed itineraries using the greedy strategy.

Algorithm 5. Initialization(POIList L, Day k).
1: sortByWeight(L)
2: int i = 0, Set seed = ∅, Set rev = ∅
3: while i < k and L.size() > 0 do
4:   int poi = L.nextPOI();
5:   Set group = new Set()
6:   group.add(poi)
7:   int lastpoi = poi
8:   while not L.isEmpty() do
9:     int newpoi = getNearest(lastpoi, L)
10:    int time = getTravelTime(group, newpoi)
11:    if time ≤ one day then
12:      group.add(newpoi)
13:      L.remove(newpoi)
14:      lastpoi = newpoi
15:    else
16:      break;
17:  i++, seed.add(group)
18: for i = 0 to seed.size() do
19:   Set group = seed.get(i)
20:   IndexIterator iter = indexManager.scan(group.get(0))
21:   while iter.hasMoreElements() do
22:     Itinerary I = iter.next()
23:     if I.contains(group) then
24:       removeReplicatedPOI(I, rev)
25:       rev.add(I)
26:       break
27: return rev

We first sort the selected POIs by their weights (line 1). Then, in each iteration, we try to form a group, which contains a subset of POIs that can be accessed within one day (lines 3-17). We greedily select the POI with the shortest distance and add it into our group (lines 9-14). At most $k$ groups are generated. All groups are used as our seeds for searching the index. We use the first itinerary that contains all the POIs in the group as our candidate itinerary (lines 18-26). Although, after the weight adjustment, itineraries in the index file are no longer sorted by weight, we can still retrieve the itinerary with maximal weight, as shown in the following theorem.

Theorem 3. Given a list of POIs $L = \{v_0, v_1, \ldots, v_n\}$ that can be accessed within one day, by scanning the index of $v_i$ in $L$, we can get the itineraries that contain all POIs in $L$, and the first candidate is the itinerary with maximal weight.

Proof (Sketch). Because $L$ can be finished within one day, there must be some itineraries containing all the POIs in $L$. Let $I_0$ and $I_1$ be the first and second candidate itineraries, respectively. $I_0$'s weight is larger than $I_1$'s because, before weight adjustment, $I_0$ has a higher weight than $I_1$ and, after weight adjustment, both of them receive the same weight boost. □

To improve the weights of the itineraries obtained by the greedy algorithm, we adopt the adjustment phase.

4.2 Adjustment

In the adjustment phase, new solutions are searched and used to replace the greedy itineraries. The process repeats until no improvement can be obtained. In the following discussion, we discard the virtual POIs to simplify our representations. Suppose $idx(v_j)$ returns the itineraries in the index of POI $v_j$. We define the neighborhood of an itinerary as follows.

Definition 4 (Neighborhood). Given an itinerary $L_i$, its neighborhood $ngb(L_i)$ is an itinerary set satisfying:

$$ngb(L_i) = \bigcup_{v_j \in L_i} idx(v_j).$$

For example, in Fig. 5,

ngb(1|2|4) = {5|1|6, 5|2|4, 2|8|9, 4|2|1, 3|4|5}. (2)

The neighborhood of $L_i$ represents the candidate itineraries that can replace $L_i$. However, some itineraries share common POIs and therefore cannot coexist in the result. Therefore, we define the independent set as

Definition 5 (Independent Set). An independent set $IS(L_i)$ is a subset of $ngb(L_i)$. Any two itineraries in $IS(L_i)$ do not share a common POI. Namely, $\forall L_0, L_1 \in IS(L_i)$ with $L_0 \ne L_1$, $L_0$ and $L_1$ are disjoint.

The neighborhood of each itinerary can have multiple independent sets, and each set denotes a different adjustment strategy. Let $S$ be the initial itinerary set returned by Algorithm 5. An alternative solution $S'$ can be constructed from $S$ by replacing the itineraries by their independent sets. More formally,

$$S' = S - f(S, ngb(L_i)) + IS(L_i),$$

where $f(S_a, S_b)$ returns the subset of $S_a$ whose itineraries share at least one POI with itineraries in $S_b$.

For itinerary “1|2|4” in Fig. 5, its independent set is {2|8|9, 3|4|5}. If S = {1|2|4, 7|5|3}, after the adjustment, we will get S' = {2|8|9, 3|4|5}. All itineraries are replaced by new ones. To avoid the case of cascading replacement, the size of $IS(L_i)$ should be less than $k$, as only $k$ single-day itineraries are required. In our implementation, we limit the size of $IS(L_i)$ to $\frac{k}{2}$. Namely, at most half of the itineraries are replaced.

The benefit of itinerary adjustment is computed as

$$B = weight(S') - weight(S).$$

If $B > 0$, we assume that the adjustment improves the quality of the results. Hence, a better itinerary can be produced by replacing the old itineraries with the corresponding independent sets.

Algorithm 6 summarizes the adjustment process. We set a threshold for the maximal number of adjustments. In each iteration, we find the independent sets for the existing itineraries. If one itinerary has multiple independent sets, we select the one with maximal weight (line 6). The new results are then computed by performing the replacement (line 7) and we record the benefit (line 8). After all possible replacement strategies have been checked, we select the one with maximal benefit. If the benefit is larger than 0, the result itineraries are updated to the new ones (line 13). Otherwise, we perform the update only with a small probability (line 15-16). The idea is to simulate the hill-climbing algorithm to avoid being trapped in a suboptimal solution. The algorithm guarantees the quality of the returned itinerary, as shown in the theorem below.

Algorithm 6. Adjustment(Set S, double P, int step).
1: int j = 0;
2: while j < step do
3:   Set cand = ∅, int max = −1, int idx = −1
4:   for i = 0 to S.size() do
5:     Set ngb = S.get(i).getNeighborhood()
6:     Set ind = getIndependentSetWithMaximalWeight(ngb)
7:     Set S' = S − f(S, ngb) + ind
8:     double B = weight(S') − weight(S)
9:     cand.add(S')
10:    if B > max then
11:      max = B, idx = i
12:  if max > 0 then
13:    S = cand.get(idx)
14:  else
15:    if randProb() > P then
16:      S = cand.get(idx)
17:  j++

Theorem 4. Algorithm 6 returns a k-day itinerary, which approximates the optimal solution with the bound $\frac{2(m+1)}{3}$.

Proof (Sketch). In Algorithm 6, we add a virtual POI to each itinerary to mark its traveling day. Therefore, the adjustment algorithm returns at most $k$ disjoint itineraries. Otherwise, there would be two itineraries sharing the same virtual POI. Namely, they would be traveled on the same day, which is not possible. If the algorithm returns fewer than $k$ itineraries, we can repeat the initialization and adjustment to fill in the remaining days. In this way, we guarantee that Algorithm 6 returns exactly a k-day itinerary. Based on Theorem 2, the problem of selecting the k-day itinerary can be reduced to the weighted set-packing problem. Therefore, in Algorithm 6, we simulate the heuristic set-packing algorithm. The heuristic algorithm has been analyzed in [8]. Suppose there are $X$ iterations in the algorithm. Let $I_i$ be the result of iteration $X - i - 1$. $I_1$ will be the final result. Let $d_i$ be the payoff factor of each iteration. We have

$$(m+1)\,w(I_1) \ge \left(2 - \frac{1}{d_1} + \frac{1}{2d_1^2}\right) w(opt),$$

where $w(I_1)$ and $w(opt)$ represent the weights of the itinerary returned by the heuristic algorithm and the optimal itinerary, respectively. The right side of the inequality is minimized when $d_1 = 1$. In that case, we have

$$(m+1)\,w(I_1) \ge \left(1 + \frac{1}{2}\right) w(opt).$$

Therefore, we have a bound of $\frac{2(m+1)}{3}$ for the heuristic approach, where $m$ is the maximal number of POIs in the itinerary ($m$ is the number of MapReduce jobs in our preprocessing). □

The most expensive operations in Algorithm 6 are retrieving the neighborhood sets. We need to scan the indices of the involved POIs to find all itineraries. Because Algorithm 6 only selects one independent set for each itinerary, we can save I/O costs by scanning a small portion of the index file. Therefore, in our implementation, we read the first $n$ itineraries of an index file in batch and, if independent sets are found, the process stops. Otherwise, we continue to load the next $n$ itineraries.
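One replacement step of the adjustment phase can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the POI weights are made up, an itinerary's weight is assumed to be the sum of its POI weights, the neighborhood is assumed to exclude the itinerary itself, and the maximal-weight independent set is approximated by a simple greedy scan. The helper names (`build_index`, `neighborhood`, `greedy_independent_set`, `conflicting`, `adjust_once`) are hypothetical.

```python
from itertools import chain

def build_index(itineraries):
    """idx(v): map each POI to the itineraries that contain it."""
    index = {}
    for it in itineraries:
        for poi in it:
            index.setdefault(poi, []).append(it)
    return index

def neighborhood(itinerary, index):
    """ngb(L_i): union of idx(v_j) over v_j in L_i (L_i itself excluded, by assumption)."""
    found = []
    for poi in itinerary:
        for cand in index.get(poi, []):
            if cand != itinerary and cand not in found:
                found.append(cand)
    return found

def greedy_independent_set(ngb, weight, limit):
    """Greedy stand-in for the maximal-weight independent set:
    scan by descending weight, keep itineraries disjoint from those already kept."""
    chosen, used = [], set()
    for it in sorted(ngb, key=weight, reverse=True):
        if len(chosen) < limit and used.isdisjoint(it):
            chosen.append(it)
            used.update(it)
    return chosen

def conflicting(s_a, s_b):
    """f(S_a, S_b): itineraries of S_a sharing at least one POI with S_b."""
    pois_b = set(chain.from_iterable(s_b))
    return [it for it in s_a if not pois_b.isdisjoint(it)]

def adjust_once(S, i, index, weight, limit):
    """Compute S' = S - f(S, ngb(L_i)) + IS(L_i) and the benefit B."""
    ngb = neighborhood(S[i], index)
    ind = greedy_independent_set(ngb, weight, limit)
    removed = conflicting(S, ngb)
    s_new = [it for it in S if it not in removed] + ind
    benefit = sum(map(weight, s_new)) - sum(map(weight, S))
    return s_new, benefit

# Hypothetical POI weights; the itineraries follow the example in the text.
poi_weight = {1: 1, 2: 3, 3: 4, 4: 2, 5: 3, 6: 1, 7: 1, 8: 5, 9: 4}
weight = lambda it: sum(poi_weight[p] for p in it)
S = [(1, 2, 4), (7, 5, 3)]
indexed = [(5, 1, 6), (5, 2, 4), (2, 8, 9), (4, 2, 1), (3, 4, 5)] + S
S_new, B = adjust_once(S, 0, build_index(indexed), weight, limit=2)
print(S_new, B)  # [(2, 8, 9), (3, 4, 5)] 7
```

With these weights, both seed itineraries conflict with the neighborhood of 1|2|4, so they are replaced by the independent set {2|8|9, 3|4|5}, matching the worked example. The `limit` argument plays the role of the cap on the size of the independent set; it is set to 2 here only to reproduce that example.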

4.3 Hotel Selection


Hotels can be considered a special type of POI. A hotel must appear as the last POI in the itinerary. We need to calculate the traveling time from the other POIs to the hotel POIs. Hotel POIs do not incur access cost, and their weights are set to the users' rankings of the hotels. Based on the user's preference, we have two processing strategies.

4.3.1 Multiple Hotels


If the user does not insist on staying in the same hotel (e.g., he can select $k$ different hotels, one for each day), we can extend the preprocessing algorithm to handle the hotels. In the MapReduce jobs, when a new itinerary $L_i$ is generated, we test every hotel POI and try to append it to the end of $L_i$. Given a hotel POI $h_j$, we use $L_i|h_j$ to represent the combined itinerary. $L_i|h_j$ is considered a single-day itinerary if

1. The total traveling time of $L_i|h_j$ is less than $H$. $H$ is the average traveling time per day.
2. For any other nonhotel POI $v$ which is not included in $L_i$, $L_i|v|h_j$ cannot be completed within $H$ time.

When we detect a new single-day itinerary, we output it to the DFS for indexing.

The itinerary generation algorithm is exactly the same, except that the hotel POI can appear in different itineraries. In Algorithm 6, we do not consider the hotel POIs when performing the disjoint test for itineraries. The output itinerary may contain multiple hotels ($h_i$ represents the hotel POI):

2|5|10|h_1, 3|7|8|h_1, 9|10|0|h_2.

4.3.2 Single Hotel

If the user prefers to stay in the same hotel, the itinerary generation problem cannot be easily reduced to the set-packing problem. Instead, we adopt a best-effort solution. In particular, we still apply Algorithm 6 to find the candidate k-day itinerary without hotel POIs. After that, we invoke Algorithm 7 to append the hotel POI.

Algorithm 7. HotelSelection(Set hotels, Set itinerarySet).
1: double max = 0, Set result = ∅
2: for i = 0 to hotels.size() do
3:   Hotel h_i = hotels.get(i)
4:   Set copy = itinerarySet
5:   for j = 0 to copy.size() do
6:     Set L_j = copy.get(j)
7:     while getTravelTime(L_j, h_i) > H do
8:       L_j.removelast()
9:     L_j.append(h_i)
10:  double weight = getTotalWeight(copy)
11:  if max < weight then
12:    max = weight
13:    result = copy
14: return result

The idea is to discard a few POIs from the end of each itinerary and try to append the hotel POI to the shortened itinerary. In line 7-8, the itinerary progressively removes its last POI until it can include the hotel POI and still form a single-day itinerary, that is, until the total traveling time is no more than $H$. In line 10, we get a new set of $k$ itineraries, where all itineraries contain the same hotel POI. We generate such a k-day itinerary for each hotel. After comparing the weights of the itineraries, the one with maximal weight is returned as our final k-day itinerary.

5 EXPERIMENT EVALUATIONS

5.1 Data Set Description

To evaluate the performance of our proposed approaches, we crawl the traveling information from Yahoo Travel (http://travel.yahoo.com). In particular, we focus on the Singapore POIs. Fig. 6 illustrates our crawling strategy. Yahoo classifies the POIs into hotels, things to do, and cities. We use the first two types in our experiments, as the last one is the geolocation of the city. Things to do contains 254 POIs of Singapore and hotels contains 276 hotels, from unranked to five stars. After removing the duplicated and meaningless POIs, we keep 400 POIs for our experiments. As far as we know, this is the largest data set for automatic itinerary generation. In [2], the largest data set only contains 163 POIs.

Fig. 6. Yahoo POIs.

The POI's weight is also crawled from Yahoo Travel. As shown in Fig. 7, for each POI, Yahoo maintains a page for users' reviews. We accumulate the user scores for each POI as its weight. If a POI has not been reviewed, we assign it an initial weight (e.g., 1).

The average visiting time of a POI is estimated from the shared travel plans in Yahoo Travel. The edge cost between any two POIs is estimated using Google Map. Specifically, the public transit time for the shortest path between two POIs is used as the edge cost. We assume that each user will spend at most 8 hours traveling per day.
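To illustrate how the 8-hour daily budget interacts with the best-effort hotel selection of Algorithm 7, the following Python sketch trims each single-day itinerary until a given hotel fits and keeps the hotel with the largest total weight. It is only a sketch under stated assumptions: all visiting times, transit times, and weights are invented, the day is assumed to start at the first POI, and `travel_time` and `hotel_selection` are hypothetical names. Unlike line 4 of the pseudocode, the itinerary set is copied explicitly for every hotel so that trimming for one hotel does not affect the next.

```python
H = 8.0  # daily time budget in hours

def travel_time(itinerary, hotel, visit, transit):
    """Visiting time of every POI plus transit along the route, ending at the hotel.
    Hotels incur no access cost, so only the transit to the hotel is added."""
    stops = list(itinerary) + [hotel]
    total = sum(visit[p] for p in itinerary)
    total += sum(transit(a, b) for a, b in zip(stops, stops[1:]))
    return total

def hotel_selection(hotels, itinerary_set, weight, visit, transit, budget=H):
    """Best-effort selection of a single hotel shared by all k days."""
    best_weight, best = 0.0, []
    for hotel in hotels:
        # Fresh copy per hotel; the original itineraries stay untouched.
        copy = [list(it) for it in itinerary_set]
        for it in copy:
            # Drop POIs from the end until the hotel fits into the day.
            while it and travel_time(it, hotel, visit, transit) > budget:
                it.pop()
            it.append(hotel)
        total = sum(weight[p] for it in copy for p in it)
        if best_weight < total:
            best_weight, best = total, copy
    return best

# Hypothetical data: four POIs, two hotels; h2 is far from everything.
visit = {"A": 3.0, "B": 3.0, "C": 2.0, "D": 4.0}
weight = {"A": 5, "B": 4, "C": 3, "D": 6, "h1": 2, "h2": 4}
transit = lambda a, b: 2.0 if "h2" in (a, b) else 0.5

plan = hotel_selection(["h1", "h2"], [["A", "B"], ["C", "D"]],
                       weight, visit, transit)
print(plan)  # [['A', 'B', 'h1'], ['C', 'D', 'h1']]
```

In this toy setting the higher-ranked hotel h2 loses: reaching it costs 2 hours, so each day must drop a POI, and the total weight (16) falls below that of the nearby hotel h1 (22).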

Fig. 7. User reviews.

In our experiments, the query is $(S_p, k)$, where $S_p$ is randomly selected from the nonhotel POIs. We allow the users to select the same hotel POIs for different days. The traveling time is set to three days by default. For comparison, we implement the original TOP algorithm proposed in [3].

Table 1 lists the parameters used in our experiments. The experiments are conducted on our in-house cluster, Awan (http://awan.ddns.comp.nus.edu.sg/ganglia/). We use 64 nodes exclusively. Each node has one Intel X3430 2.4-GHz processor, 8-GB memory, two 500-GB SATA hard disks, and gigabit ethernet. Hadoop [10] is used as our MapReduce engine.

TABLE 1. Experiment Settings

5.2 Single-Day Itinerary Generation

In the preprocessing, $m$ MapReduce jobs are submitted sequentially to iterate all possible single-day itineraries. The input is our crawled POIs and the output contains all single-day itineraries. This is, in fact, a brute-force search strategy, but we exploit the parallel processing engine to reduce its cost. After the single-day itineraries are generated, we start another MapReduce job to remove the duplicate itineraries. We call it the Dup-Clean job (the previous $m$ jobs are named MR-Scan). Dup-Clean generates a special namespace for each itinerary by combining its POIs. The namespace is used as the key in the shuffling phase. All duplicated itineraries will be shuffled to the same reducer, where a local clean process is conducted.

Fig. 8 shows the accumulated costs of all $m$ jobs and the cost of the clean job. We vary the number of POIs in the preprocessing from 100 to 400. The cost of MR-Scan increases for a larger POI set. However, even for 400 POIs, the preprocessing takes less than 1 hour. This is an offline process and only needs to be invoked once. In fact, most travel agencies do not maintain such a large number (400) of POIs for a single city. Interestingly, the performance of Dup-Clean is not correlated to the POI number. Its cost is neutralized by the parallel processing strategy. We observe that most nodes are not fully exploited in Dup-Clean. Fig. 9 shows the scalability of the MapReduce jobs (MR-Scan+Dup-Clean). We vary the number of nodes in our cluster from 8 to 64 and we observe a near-linear improvement in performance. Therefore, to handle a larger POI graph, we can simply add more processing nodes to our cluster.

Fig. 8. Preprocessing cost.

Fig. 9. Scalability of preprocessing.

In the preprocessing, the maximal number of MapReduce jobs ($m$) is set to 10. Namely, each single-day itinerary can contain at most 10 POIs. $m$ is a configurable parameter. As shown in Fig. 10, in our data set, most itineraries consist of 4-7 POIs. Setting $m$ to 10 can iterate most itineraries in our case.

Fig. 10. Size of a single-day itinerary.

5.3 Itinerary Indexing

The second step of preprocessing is to build the itinerary index. The index process only requires one MapReduce job and is much faster than the itinerary iteration process. In Fig. 11, we show the indexing cost for different sizes of POI graphs. We can efficiently recreate the index within a few minutes. Fig. 12 shows the scalability test for the indexing process. The indexing process benefits from a larger cluster. Fig. 13 shows the total index size for different POI graphs. The size of the index increases exponentially with the size of the POI graph. But even for the graph with 400 POIs (a large enough POI graph for most cities), only 12 GB of index data are generated. The index is maintained in the DFS and hence, the storage cost is not the system bottleneck.

Fig. 11. Indexing cost.

Fig. 12. Scalability of indexing.

Fig. 13. Size of index.

5.4 Effect of POI Graph Size

In the experiments, we compare our approach (MR-Set) with the original TOP approach in [3]. To evaluate the query performance, two metrics, processing time and weight ratio, are adopted. The weight ratio is used to measure the quality of the generated itineraries. In particular, let $W_i$ and $W_j$ denote the total weights of MR-Set and TOP, respectively. The weight ratio is defined as $\frac{W_i}{W_j}$.³

3. In the ideal case, we should compare the approximate results with the optimal ones. However, it is infeasible to generate the optimal results, given the size of the POI graph and the complexity of the problem.

We first vary the graph size to test the query performance. Figs. 14 and 15 show the processing time and weight ratio, respectively. Our new approach significantly reduces the processing cost, as we have already computed the single-day itineraries in the preprocessing. The previous TOP approach is not scalable. The query cost increases linearly with the number of POIs. If more POIs (e.g., restaurants) are added to the set, the TOP approach may fail to provide satisfactory performance. In contrast, our technique enables the itinerary to be generated within 30 milliseconds. It is not affected by the POI graph size. Moreover, the traveling plan system is accessed by multiple users concurrently. In the case of 400 POIs, the TOP approach can serve up to two requests per second, while our approach can provide a throughput of 40 requests per second. Our approach is more scalable and feasible for real-time processing.

Fig. 14. Effect of graph size (processing time).

Fig. 15. Effect of graph size (quality).

In fact, our approach not only reduces the processing overhead, it also provides results of higher quality. Fig. 15 shows the change of the weight ratio. We have a 20 to 80 percent improvement over the original TOP algorithm. The gap increases for a larger POI graph, as our approach can efficiently exploit the POI combinations. More POIs indicate a higher possibility of finding a good itinerary.

5.5 Effect of Selected POIs

In our query model, we allow the user to explicitly select some POIs as their preferences. The weights of the selected POIs are adjusted to reflect the selection. This strategy may increase the importance of some unpopular POIs and avoids generating the itinerary with the same set of top-popular POIs. This is how the users customize their itineraries in our system. Figs. 16 and 17 show the effect of a varied number of selected POIs (from 5 to 25). The default traveling time is set to three days. In fact, most people will not select too many POIs (e.g., 25) for a three-day itinerary.

Fig. 16. Effect of selected POIs (processing time).

Fig. 17. Effect of selected POIs (quality).

In Fig. 16, the cost of MR-Set increases for a larger number of selected POIs. This is because in the adjustment phase, MR-Set needs to look up the index of the corresponding POIs to search for the replacements. The index is maintained by the DFS and the I/O costs dominate the query cost. However, MR-Set is still much more efficient than the original TOP.

Fig. 17 reveals that the quality gap between MR-Set and the TOP approach widens when the user selects more POIs as his preference. MR-Set can effectively find the itinerary that includes as many selected POIs as possible. It can optimize how the selected POIs and other POIs are combined into the itinerary.

5.6 Effect of Traveling Budget

Besides the POIs, the user can change his expected traveling time as well. With more time budget, his itinerary can include more POIs of interest. Figs. 18 and 19 show the effect of varied time budgets. The original TOP algorithm incurs a higher overhead for the increased time budget, because it needs to generate and refine each single-day itinerary progressively. MR-Set adopts a different strategy. When it tries to adjust the itinerary, it may replace multiple single-day itineraries with new ones. It considers the k-day itinerary as a whole solution, instead of treating each single-day itinerary independently. It is interesting to observe that Fig. 19 shows a different result from Fig. 17. The weight ratio decreases when more traveling budget is given. In fact, the TOP algorithm benefits from a loose time budget, as it can arrange more high-weight POIs into different single-day itineraries.

Fig. 18. Effect of traveling time (processing time).

Fig. 19. Effect of traveling time (quality).

5.7 Effect of Adjustment

The query processing of MR-Set splits into the initialization phase and the adjustment phase. The initialization phase applies the greedy-based heuristic approach to generate a k-day itinerary as the seed, which is further improved in the adjustment phase by replacing the itineraries with their independent sets. In this experiment, we show the effect of the adjustment phase. We vary the number of selected POIs from 1 to 15.

Fig. 20 shows that the adjustment phase greatly increases the processing cost. Algorithm 6 may repeat for several iterations before converging to a high-quality result. As mentioned before, in the adjustment phase, the query engine loads the itinerary index from the DFS, which incurs high I/O cost. One way to reduce the cost is to increase the index buffer size. After an indexed itinerary is loaded from the DFS, we cache it in the buffer. If the buffer is full, we apply the LRU strategy to remove the least recently used entries. In Fig. 22, we change the number of buffered single-day itineraries in the index buffer and test the query performance. Not surprisingly, we can get a huge performance boost by deploying a large enough index buffer. In fact, a single-day itinerary takes less than 64 bytes, and caching 5 million entries only takes about 300 MB of memory. Any modern server can effectively reduce the processing cost by employing a large buffer.

Fig. 20. Effect of adjustment (processing time).

Fig. 21. Effect of adjustment (quality).

Fig. 22. Buffer size.
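The index buffer described above can be sketched as a small LRU cache. The following Python example is an assumption-laden illustration rather than the system's actual code: `IndexBuffer` and its `load_from_dfs` callback are hypothetical names, and the callback stands in for whatever routine reads an index entry from the DFS.

```python
from collections import OrderedDict

class IndexBuffer:
    """LRU buffer for index entries loaded from the DFS (illustrative only)."""

    def __init__(self, capacity, load_from_dfs):
        self.capacity = capacity
        self.load_from_dfs = load_from_dfs  # hypothetical loader: key -> entry
        self.entries = OrderedDict()        # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(key)
            self.hits += 1
            return self.entries[key]
        # Cache miss: pay the I/O cost once, then keep the entry.
        self.misses += 1
        value = self.load_from_dfs(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value

# Toy run with a fake loader and room for two entries.
buffer = IndexBuffer(capacity=2, load_from_dfs=lambda key: f"itineraries-of-{key}")
for poi in [1, 2, 1, 3, 2]:
    buffer.get(poi)
print(buffer.hits, buffer.misses, list(buffer.entries))  # 1 4 [3, 2]

# Memory estimate from the text: 5 million entries of at most 64 bytes each.
print(5_000_000 * 64 / 1e6)  # 320.0 (MB), in line with "about 300 MB"
```

`OrderedDict.move_to_end` and `popitem(last=False)` keep both the hit path and the eviction path constant-time, which matters when the buffer holds millions of entries.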

Although the adjustment phase incurs high processing cost, it can significantly improve the result quality. As shown in Fig. 21, the adjustment phase can double the weight of the generated itinerary if more than 15 POIs are selected.⁴ With more POIs selected, the adjustment phase can generate more replacement itineraries and therefore has a better chance of finding a high-quality result.

4. In this figure, the weight ratio is computed between MR-Set with adjustment and MR-Set without adjustment.

5.8 Effect of Single Hotel Selection

In this section, we justify the effectiveness of the hotel selection algorithm. In Algorithm 7, we adopt a “best-effort” solution to append the hotel to the end of each itinerary. To evaluate the performance of such a solution, we define a new metric, the hotel weight ratio. In particular, let $W_m$ and $W_s$ denote the total weights of generated itineraries in the multiple hotel case and single hotel case, respectively. The hotel weight ratio is defined as $\frac{W_s}{W_m}$. Our “best-effort” solution still provides high-quality results. Fig. 23 shows the change of the hotel weight ratio. We can see that, in the single hotel case, the total weight of generated itineraries is penalized, as each single-day itinerary should end in the same hotel POI. However, the “best-effort” solution can provide an approximate result with 85-90 percent of the total weight of the multiple hotel case. This indicates that Algorithm 7 is still able to find good itineraries under the single hotel constraint.

Fig. 23. Effectiveness of single hotel selection.

5.9 User Study

To evaluate the quality of the generated itineraries, we conduct a user study, which asks the users to manually rank the itineraries. Our study hires 20 undergraduate students as the users. Given a set of selected POIs, we use the TOP and MR-Set methods to generate 20 groups of itineraries (three-day itineraries in the experiment). Each participant assigns a score (ranging from 1 to 5) to each itinerary in his group. The average ranks are then computed for the itineraries generated by the different approaches. Fig. 24 shows the results. Most users prefer the results generated by MR-Set. We also observe that the ratings of both TOP and MR-Set are reduced when more POIs are selected as the necessary POIs. This is because some of the user-selected POIs are missing from the itineraries due to the constraint of travel time.

Fig. 24. User study.

6 RELATED WORK

Most existing work on itinerary generation takes a two-step scheme. It first adopts data mining algorithms to discover the users' traveling patterns from their published images, geolocations, and events [11], [12], [13]. Based on the relationships of those historical data, new itineraries are generated and recommended to the users [14], [15], [16]. This scheme leverages the user data to retrieve POIs and organize the POIs into an itinerary, which targets a different application scenario from ours. We help the traveling agency provide a customized itinerary service, where all details of POIs are known and each user prefers a different itinerary instead of adopting the most popular ones. In our case, the itinerary generation problem is a search problem for the optimal POI combinations.

In fact, searching for the optimal single-day itinerary has been well studied. It can be transformed into the traveling salesman problem (TSP) [5], which is a well-known NP-complete problem. For example, in [17], given a set of POIs, the system will generate a shortest itinerary to access all the POIs. If the distance measure is a metric and symmetric, the TSP has a polynomial approximate solution [18], but the approximate solution incurs high overhead for a large POI graph [19]. Therefore, some heuristic approaches [1] are adopted to simplify the computation.

Some interactive search algorithms [2], [20] have been proposed in recent years. These algorithms still focus on optimal single-day itinerary planning. To reduce the computation overhead and improve the quality of generated itineraries, users' feedback is integrated into the search algorithm. The search algorithm works iteratively. It proposes new itineraries for users based on their previous feedback, and the users can adjust the weights of POIs in the itinerary or select new POIs into the itinerary. In the next iteration, the algorithm will refine its results based on the collected information. Those works can be considered variants of the optimal single-day itinerary planning problem, whereas our algorithms focus on generating multiday itineraries. Moreover, interactive algorithms pose requirements for the users, who may be reluctant to provide feedback.

To the best of our knowledge, no previous work has studied the problem of generating a multiday itinerary. This problem is more challenging than the single-day itinerary, because simply combining multiple optimal single-day itineraries may result in a suboptimal solution. The multiday itinerary problem, as shown in this paper, can be reduced to the team orienteering problem (TOP) [3], which is an NP-complete problem with no approximate solution. Therefore, many heuristic approaches have been proposed [6], [21], [22]. The heuristic approaches cannot guarantee the quality of generated itineraries. To address the problem, in this paper, we apply the MapReduce framework to generate the single-day itineraries. The parallel engine of MapReduce allows us to solve some NP-complete problems more efficiently. Other work [23], [24] also tries to leverage the power of MapReduce to reduce the processing cost of NP-complete problems. The beauty of our approach is that after the transformation, the itinerary planning problem is reduced to the weighted set-packing problem, which has approximate solutions under some constraints.

7 CONCLUSION

In this paper, we present an automatic itinerary generation service for backpack travelers. The service creates a customized multiday itinerary based on the user's preference. This problem is a famous NP-complete problem, the team orienteering problem, which has no polynomial time approximate algorithm. To search for the optimal solution, a two-stage scheme is adopted. In the preprocessing stage, we iterate and index the candidate single-day itineraries using the MapReduce framework. The parallel processing engine allows us to scan the whole data set and index as many itineraries as possible. After the preprocessing stage, the TOP is transformed into the weighted set-packing problem, which has efficient approximate algorithms. In the next stage, we simulate the approximate algorithm for the set-packing problem. The algorithm follows the initialization-adjustment model and can generate a result whose weight is within a factor of $\frac{2(m+1)}{3}$ of the optimal result. Experiments on a real data set from Yahoo's traveling website show that our proposed approach can efficiently generate high-quality customized itineraries.

ACKNOWLEDGMENTS

The work of Sai Wu was supported by the National Science Foundation of China (NSFC Grant 60970124, 61170034). The work of Sai Wu, Jingbo Zhou, and Anthony K.H. Tung was carried out at the SeSaMe Centre. It is supported by the Singapore NRF under its IRC@SG Funding Initiative and administered by the IDMPO. Sai Wu was the corresponding author.

REFERENCES

[1] S. Dunstall, M.E. Horn, P. Kilby, M. Krishnamoorthy, B. Owens, D. Sier, and S. Thiebaux, “An Automated Itinerary Planning System for Holiday Travel,” Information Technology and Tourism, vol. 6, no. 3, pp. 195-210, 2004.
[2] S.B. Roy, G. Das, S. Amer-Yahia, and C. Yu, “Interactive Itinerary Planning,” Proc. IEEE 27th Int'l Conf. Data Eng. (ICDE), pp. 15-26, 2011.
[3] I.-M. Chao, B.L. Golden, and E.A. Wasil, “The Team Orienteering Problem,” European J. Operational Research, vol. 88, no. 3, pp. 464-474, Feb. 1996.
[4] J. Dean and S. Ghemawat, “MapReduce: A Flexible Data Processing Tool,” Comm. ACM, vol. 53, pp. 72-77, Jan. 2010.
[5] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, second ed. The MIT Press and McGraw-Hill Book Company, 2001.
[6] C. Archetti, A. Hertz, and M.G. Speranza, “Metaheuristics for the Team Orienteering Problem,” J. Heuristics, vol. 13, pp. 49-76, Feb. 2007.
[7] P. Vansteenwegen, W. Souffriau, and D.V. Oudheusden, “The Orienteering Problem: A Survey,” European J. Operational Research, vol. 209, pp. 1-10, Feb. 2011.
[8] M.M. Halldórsson and B. Chandra, “Greedy Local Improvement and Weighted Set Packing Approximation,” J. Algorithms, vol. 39, pp. 223-240, May 2001.
[9] E.M. Arkin and R. Hassin, “On Local Search for Weighted K-Set Packing,” Math. Operations Research, vol. 23, pp. 640-648, Mar. 1998.
[10] http://hadoop.apache.org/, 2013.
[11] T. Rattenbury, N. Good, and M. Naaman, “Toward Automatic Extraction of Event and Place Semantics from Flickr Tags,” Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '07), pp. 103-110, 2007.
[12] D.J. Crandall, L. Backstrom, D.P. Huttenlocher, and J.M. Kleinberg, “Mapping the World's Photos,” Proc. 18th Int'l Conf. World Wide Web (WWW), pp. 761-770, 2009.
[13] M. Clements, P. Serdyukov, A.P. de Vries, and M.J. Reinders, “Using Flickr Geotags to Predict User Travel Behaviour,” Proc. 33rd Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), 2010.
[14] C.-H. Tai, D.-N. Yang, L.-T. Lin, and M.-S. Chen, “Recommending Personalized Scenic Itinerary with Geo-Tagged Photos,” Proc. IEEE Int'l Conf. Multimedia and Expo (ICME), pp. 1209-1212, 2008.
[15] M.D. Choudhury, M. Feldman, S. Amer-Yahia, N. Golbandi, R. Lempel, and C. Yu, “Automatic Construction of Travel Itineraries Using Social Breadcrumbs,” Proc. 21st ACM Conf. Hypertext and Hypermedia (HT), pp. 35-44, 2010.
[16] H. Yoon, Y. Zheng, X. Xie, and W. Woo, “Smart Itinerary Recommendation Based on User-Generated GPS Trajectories,” Proc. Seventh Int'l Conf. Ubiquitous Intelligence and Computing (UIC), pp. 19-34, 2010.
[17] I. Hefez, Y. Kanza, and R. Levin, “TARSIUS: A System for Traffic-Aware Route Search under Conditions of Uncertainty,” Proc. 19th ACM SIGSPATIAL Int'l Conf. Advances in Geographic Information Systems (GIS), pp. 517-520, 2011.
[18] N. Christofides, “Worst-Case Analysis of a New Heuristic for the Traveling Salesman Problem,” Technical Report 388, Graduate School of Industrial Administration, Carnegie-Mellon Univ., 1976.
[19] G. Laporte, “The Traveling Salesman Problem: An Overview of Exact and Approximate Algorithms,” European J. Operational Research, vol. 59, no. 2, pp. 231-247, June 1992.
[20] R. Levin, Y. Kanza, E. Safra, and Y. Sagiv, “Interactive Route Search in the Presence of Order Constraints,” Proc. VLDB Endowment, vol. 3, no. 1, pp. 117-128, 2010.
[21] W. Souffriau, P. Vansteenwegen, G.V. Berghe, and D.V. Oudheusden, “A Path Relinking Approach for the Team Orienteering Problem,” Computers and Operations Research, vol. 37, pp. 1853-1859, 2010.
[22] M.V.S.P. de Aragao, H. Viana, and E. Uchoa, “The Team Orienteering Problem: Formulations and Branch-Cut and Price,” Proc. Algorithmic Approaches for Transportation Modeling, Optimization, and Systems (ATMOS), vol. 14, pp. 142-155, 2010.
[23] F. Chierichetti, R. Kumar, and A. Tomkins, “Max-Cover in Map-Reduce,” Proc. 19th Int'l Conf. World Wide Web (WWW), pp. 231-240, 2010.
[24] Z. Zhao, G. Wang, A.R. Butt, M. Khan, V.A. Kumar, and M.V. Marathe, “SAHAD: Subgraph Analysis in Massive Networks Using Hadoop,” IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS), 2012.

Gang Chen received the BSc, MSc, and PhD degrees in computer science and engineering from Zhejiang University in 1993, 1995, and 1998, respectively. He is currently a professor at the College of Computer Science, Zhejiang University. He is also the executive director of Zhejiang University-Netease Joint Lab on Internet Technology. His research interests include database, information retrieval, information security, and computer supported cooperative work.

Sai Wu received the bachelor’s and master’s degrees from Peking University, and the PhD degree from the National University of Singapore in 2011. Now he is an assistant professor at the College of Computer Science, Zhejiang University. His research interests include P2P systems, distributed database, cloud systems, and indexing techniques. He has served as a program committee member for VLDB, ICDE, and CIKM.

Jingbo Zhou is currently working toward the PhD degree in the School of Computing, National University of Singapore. His research interests include indexing and query processing on the complex structure, such as trajectories, trees and graphs.

Anthony K.H. Tung received the BSc (second class honor) and MSc degrees in computer science from the National University of Singapore (NUS), in 1997 and 1998, respectively, and the PhD degree in computer sciences from Simon Fraser University in 2001. He is currently an associate professor in the Department of Computer Science (NUS). His research interests include various aspects of databases and data mining (KDD) including buffer management, frequent pattern discovery, spatial clustering, outlier detection, and classification analysis.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
