Daily Metro Origin-Destination Pattern Recognition Using Dimensionality Reduction and Clustering Methods
Abstract—Widespread usage of smart cards is leading to an unprecedented growth in the quantity of data. However, traditional methods still fail to fully recognize mobility patterns, and better data mining approaches remain to be explored. To achieve reliable pattern recognition results, principal component analysis and singular value decomposition are each applied to reduce dimensionality. Based on the dimensionality-reduced matrix, affinity propagation is selected as a suitable clustering algorithm to recognize demand patterns, and spectral clustering is introduced for comparison. Several clustering evaluation indices serve as objective references. The representative categories obtained correspond to weekdays, weekends, holidays, and different months. The integration of dimensionality reduction and clustering offers a new way to understand daily mobility structure. For metro system operators, this study also provides information on traffic volume variation and its temporal distribution over the whole year. In addition, the procedures for handling the daily demand matrix can be applied in traffic planning, management and operation.

Keywords—Affinity propagation, daily demand matrix, principal component analysis, spectral clustering

This work was supported by the National Natural Science Foundation of China under project 71171147 and the Fundamental Research Funds for the Central Universities.

I. INTRODUCTION
OD matrices in public transportation record how users travel spatially and temporally, so they are a key input to transportation system analyses and transportation planning. Demand patterns are usually recognized by computationally and statistically analyzing information from the OD matrix [1]. As a result, the importance of extracting and classifying OD matrices cannot be overemphasized. Weijermars and Berkum [2] discussed the clustering procedures of trip demand profiles. In both directions, speed and flow parameters sampled by an automatic vehicle location system were aggregated into 15-minute intervals. The results show clearly that working days are clustered into a category distinct from weekends. Building on Weijermars and Berkum, Friedrich et al. later proposed the notion of characteristic traffic days. The traffic OD matrix is collected and analyzed on the basis of floating phone data. The study period is presorted into three types, and the data detected by mobile phone base stations on different calendar days are clustered to obtain a typical travel pattern and OD matrix [2].

However, traditional methods are currently heavily challenged because of their inefficiency in dealing with high-dimensional OD matrices. For instance, the smart card system in a metropolis may produce millions of records scattered over hundreds of stations. Basic parameters such as traffic volume, similarities and distances have to be computed repeatedly in the original data space, so it is impractical to use traditional statistics or pattern recognition methods on the original high-dimensional data. Dimensionality reduction is thus necessary to reduce redundancy and increase efficiency before the clustering procedures [3].

Two common methods of dimensionality reduction, principal component analysis (PCA) and singular value decomposition (SVD) [4], are selected in this paper to avoid the curse of dimensionality. After linear transformation and reduction, the original OD matrix can be reduced to a manageable dimension. The main goal of dimensionality reduction is to capture the main features so that clustering analysis and pattern recognition can proceed efficiently. Affinity propagation (AP) is selected for clustering the dimensionality-reduced matrix and for recognizing the modes of the daily demand matrix (DDM) [5]. Spectral clustering [6] is another clustering method applied afterwards. Unlike the research by Mendes-Moreira et al. [7] and Khiari et al. [8], which groups days into different schedule types by travel time, the current analysis pays more attention to the inner relationships between OD pairs. In-depth comparisons are made between the dimensionality reduction methods and between spectral clustering and AP, which were seldom discussed in previous literature. Finally, the daily mobility structure of the metro DDM is analyzed and visualized on this basis. Although the clustering methods (AP and the spectral algorithm) are not new, the integration of dimensionality reduction and clustering was seldom discussed before. In this paper, the methodologies of dimensionality reduction and clustering are compared in detail and integrated systematically on the basis of our initial version in [9].

II. DATA DESCRIPTION
In this research, analysis and calculation are conducted on Shenzhen metro data from September 1st, 2011 to August 31st, 2012. Smart card usage constitutes 79.7% of the total amount in Shenzhen, so smart card data hold the major and representative records of the metro system. The Shenzhen metro network is made up of 5 lines and 118 stations. The passenger flow of Shenzhen Metro saw a year-on-year rise of 69.9% in 2012, which is up to 780
2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC): Workshop
The dimensionality of matrix M is reduced from 290*13,806 to 290*175: the number of columns is compressed, while each row still represents a day. Because SVD does not change the internal relationships among metro stations, information on the macro level is preserved.

With PCA, four retained principal components are able to reveal the inner structure of the data in the way that best captures the variability along the major axes. Consequently, the number of columns of the original 290*13,806 matrix is compressed to 4 by PCA. The retained matrix holds over 85% of the cumulative contribution rate, and the projection also eases the recognition of metro demand patterns.
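As an illustration only (not the authors' implementation), the two reductions can be sketched with NumPy. The function names `reduce_svd` and `reduce_pca`, the synthetic matrix, and the choice `k=10` are assumptions made for this example. The rule by which 175 columns were retained for SVD is not stated in this part of the text, so `k` is left as a free parameter. The 85% threshold mirrors the cumulative contribution rate quoted for PCA.

```python
import numpy as np


def reduce_svd(ddm, k):
    """Return the day-by-k coordinates on the top-k singular directions."""
    u, s, _ = np.linalg.svd(ddm, full_matrices=False)
    return u[:, :k] * s[:k]


def reduce_pca(ddm, min_rate=0.85):
    """Keep the fewest principal components whose cumulative contribution
    rate (share of total variance) reaches min_rate."""
    centered = ddm - ddm.mean(axis=0)  # PCA works on column-centered data
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    n_keep = min(int(np.searchsorted(cumulative, min_rate)) + 1, len(s))
    return u[:, :n_keep] * s[:n_keep], float(cumulative[n_keep - 1])


# Synthetic stand-in for the DDM (hypothetical data): 40 days x 120 OD pairs,
# generated from 3 latent demand patterns plus small noise.
rng = np.random.default_rng(0)
ddm = rng.uniform(0.5, 1.5, (40, 3)) @ rng.uniform(0, 100, (3, 120))
ddm += rng.normal(0, 1, ddm.shape)

z_svd = reduce_svd(ddm, k=10)                  # rows are still days
z_pca, rate = reduce_pca(ddm, min_rate=0.85)   # few columns, rate >= 0.85
print(z_svd.shape, z_pca.shape, round(rate, 3))
```

In both cases each row of the reduced matrix still corresponds to one day, so the clustering step can treat rows as data points.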
B. Clustering with Affinity Propagation
Dueck and Frey put forward a clustering method named "affinity propagation," which calculates parameters between data pairs (columns of the OD matrix). The similarity $s(i,k)$ indicates to what extent the data point with index $k$ is appropriate to be the "exemplar" (a center selected from the data points) for point $i$. It is defined by the negative squared Euclidean distance: for data points $x_i$ and $x_k$ (distance among columns of the OD matrix),

$$s(i,k) = -\lVert x_i - x_k \rVert^2 .$$

The availability matrix contains values $a(i,k)$ that indicate how appropriate it is for $x_i$ to pick $x_k$ as its "exemplar". The responsibility $r(i,k)$ indicates how appropriate it is for $x_k$ to be chosen as the cluster center of $x_i$. In each iteration, availabilities and responsibilities are both taken into consideration in determining the "exemplars". The 290*175 and 290*4 dimensionality-reduced matrices produced by SVD and PCA are afterwards clustered using affinity propagation.

Fig. 3. The outcomes of PCA displayed on the calendar. [Calendar panels: January (New Year, Spring Festival) and June (the Dragon Boat Festival); legend: 1st to 13th category; dates with a circle are public holidays.]
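The message-passing scheme described in this subsection can be sketched as follows. This is a minimal NumPy version of the standard Frey and Dueck updates, offered as an illustration and not as the authors' code. The function name `affinity_propagation`, the damping factor, the iteration count and the median preference are assumptions. Each row of the input is treated as one data point, for example one day of a reduced matrix.

```python
import numpy as np


def affinity_propagation(x, damping=0.9, max_iter=500, preference=None):
    """Cluster the rows of x; return (exemplar indices, label per row)."""
    n = x.shape[0]
    # Similarity: s(i, k) = -||x_i - x_k||^2
    s = -np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=2)
    if preference is None:  # assumption: median of off-diagonal similarities
        preference = np.median(s[~np.eye(n, dtype=bool)])
    diag = np.diag_indices(n)
    s[diag] = preference
    r = np.zeros((n, n))  # responsibilities r(i, k)
    a = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(max_iter):
        # r(i, k) = s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        total = a + s
        best = np.argmax(total, axis=1)
        first = total[rows, best]
        total[rows, best] = -np.inf
        second = np.max(total, axis=1)
        r_new = s - first[:, None]
        r_new[rows, best] = s[rows, best] - second
        r = damping * r + (1 - damping) * r_new
        # a(i, k) = min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
        rp = np.maximum(r, 0)
        rp[diag] = r[diag]
        col = rp.sum(axis=0)
        a_new = np.minimum(0, col[None, :] - rp)
        a_new[diag] = col - rp[diag]  # a(k, k): sum of positive r(i', k), i' != k
        a = damping * a + (1 - damping) * a_new
    exemplars = np.flatnonzero(np.diag(a + r) > 0)
    if exemplars.size == 0:
        raise RuntimeError("no exemplars found; try other damping/preference")
    labels = np.argmax(s[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels


# Hypothetical example: three well-separated groups of "days" in 4 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0, 0, 0], [20, 0, 0, 0], [0, 20, 0, 0]])
points = np.repeat(centers, 10, axis=0) + rng.normal(0, 0.5, (30, 4))
exemplars, labels = affinity_propagation(points)
print(len(exemplars), labels)
```

The number of clusters is not specified in advance; it follows from the preference value placed on the diagonal of the similarity matrix. scikit-learn's `sklearn.cluster.AffinityPropagation` provides a maintained implementation of the same algorithm.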
TABLE I. MAIN CHARACTERISTICS CORRESPONDING TO 13 CATEGORIES

Category  Characteristics
1         Weekdays from Sept. to Nov. and some weekdays in Dec.
2         Weekends from Sept. to Dec.
3         National Day and Sundays from Oct. to Jan.
4         Some weekdays in Dec.
5         Weekdays in Jan.
6         Days before and after Spring Festival
7         Spring Festival
8         Weekdays from Feb. to Mar. (mostly Tuesdays and Thursdays)
9         Weekdays from Feb. to Mar. (mostly Mondays, Wednesdays and Fridays)
10        Sundays from Feb. to Aug.
11        Weekdays from Jun. to Aug. (mostly Tuesday to Thursday)
12        Weekdays from Jun. to Aug. (mostly Mondays and Fridays)
13        Saturdays from Feb. to Aug.

With PCA, the dimensionality-reduced DDM is divided into 13 clusters. The 290 valid days in the current research are thus grouped into 13 categories, each symbolized by a different grid pattern (the excluded days are shown in white). Public holidays are marked with red circles. They comprise the Mid-Autumn Festival (September 10th to 12th), National Day (October 1st to 7th), New Year (January 1st to 3rd), Spring Festival (January 22nd to 28th) and the Dragon Boat Festival (June 22nd to 24th).

The first category contains September 1st-3rd, 6th-9th, 15th-16th, 20th-23rd, 28th-30th, October 11th-15th, 18th-19th, 25th-27th, November 2nd-3rd, 7th-8th, 10th, 14th-17th, 21st-24th, 28th-30th, December 1st, 6th-8th, 13th-14th and 22nd. It consists mostly of weekdays from September to December, and its mode is recognized as a characteristic weekday in fall. Shenzhen metro OD volume from June to August is vastly different from that in fall, a difference that is recognized automatically in clustering. The second category contains Saturdays as well as Sundays from September to December. Sundays from October to January are contained in the third cluster, while most Thursdays and Fridays from September to January are included in the fourth cluster. The travel mode on Thursdays and Fridays has some distinguishing features in these months. The fifth is mainly weekdays in January. The sixth only has records around Spring Festival.

Apart from that, the seventh category is exclusively January 21st to 29th, fully in accordance with the Spring Festival. Chinese New Year, or Spring Festival, is the most important traditional Chinese holiday for family reunion. Shenzhen has a huge migrant population coming from every corner of China, so in the days around the festival workers have a strong demand to go back home. Metro stations near major external traffic nodes, including railway, bus and ferry stations and airports, are expected to have a large volume. The eighth category is mostly Tuesdays and Thursdays from February to March, and the ninth category is on Mondays, Wednesdays and Fridays. Like the second one, the tenth category covers Sundays, here from February to August. The eleventh and the twelfth are weekdays in June, July and August. The thirteenth is Saturdays from February to August. The results also show that trip patterns in the same week or in the same holiday are not necessarily the same.

Different from PCA, the DDM after dimensionality reduction using SVD is divided into 16 clusters. The SVD clustering outcomes are more specific and detailed than those of PCA, while also capturing the inner features of different patterns. Spring Festival similarly falls into a distinct category, indicating that this holiday has a rather typical demand pattern that can be identified.

One surprising result, as can be seen from Table II, is that Halloween Eve and Christmas Eve are also clustered into a category of their own. These two western festivals are enjoying growing popularity in the youthful city of Shenzhen. One distinguishing feature is that theme parks, shopping malls and other recreational places have huge metro traffic flow on the two eves, especially at night. Each cluster is a unique demand profile, and the identification of this category can guide the planning and overall management of Shenzhen Metro. Pre-arranged schemes and warnings can be worked out to cope with rapid changes in traffic flow.

TABLE II. MAIN CHARACTERISTICS CORRESPONDING TO 16 CATEGORIES

Category  Characteristics
1         Sundays and Mondays from Sept. to Nov.
2         Weekdays from Sept. to Dec.
3         Fridays and Saturdays from Sept. to Jan.
4         Weekends from Sept. to Mar.
5         Sundays from Oct. to Jan.
6         Halloween Eve and Christmas Eve
7         Days before Spring Festival
8         Spring Festival
9         Two days before and after Spring Festival
10        Days after Spring Festival
11        Weekdays from Feb. to Mar.
12        Sundays from Feb. to Mar.
13        Weekdays in Jun.
14        Sundays from Jun. to Aug.
15        Saturdays from Jun. to Aug.
16        Weekdays from Jul. to Aug.

The DDM results using PCA and SVD have many similarities and can hardly be judged subjectively in terms of clustering quality. On the whole, the clustering results under PCA and SVD show the demand patterns of the metro system and can serve as a fundamental basis for analysis in operation and management. However, the dimensionality reduced matrix of PCA is a 290*4 matrix, while that corresponding to SVD is a
Silhouette index are selected to evaluate the clustering quality. A high Silhouette value or Calinski-Harabasz value indicates a good clustering solution. In contrast, the optimal clustering solution is achieved with the minimum Davies-Bouldin index value.

The methods are compared at their respective optimal numbers of clusters. As shown in Table III, all three indices show that spectral clustering is inferior to AP clustering under both PCA and SVD. AP with PCA is better than AP with SVD according to the Calinski-Harabasz index, but slightly worse according to the Davies-Bouldin and Silhouette indices. Judging from these values, the performance of AP with PCA and AP with SVD is almost on the same level. AP clustering with PCA as the dimensionality reduction method is much faster than with SVD and, all things considered, achieves a balance between clustering accuracy and dimensionality reduction efficiency. As a result, AP clustering with PCA is more cost-effective in the current study.

V. CONCLUSIONS
The current research on daily demand recognition forms a set of procedures from which metro operators can select as needed. It provides a basic understanding of daily demand patterns, such as how many people will travel from A to B. The demand patterns can be used as a basis for macro traffic models and as supplementary information for operational management. In the past, by contrast, metro operators formulated schedules according to real-time flow variation, which is subjective and may fail to provide early warning information. Days in the same cluster can be represented by a typical demand matrix (the average of the OD matrices of the days within the cluster). With the help of the typical demand matrix, more specific and targeted demand patterns can be understood, and a database can be built to formulate a long-term reaction mechanism. In spring or fall, on working days or holidays, in downtown or rural areas, metro operators can make plans promptly based on demand patterns.

Moreover, the daily public transit OD matrices may be averaged and retrieved to serve as inputs to other models. Daily trip patterns along with historical OD volume can serve as input or complementary information in traffic volume prediction and in the analysis of congestion patterns. For instance, hot OD pairs or popular areas where traffic jams may occur can be recognized and predicted from historical patterns using the current methods and real-time volume. In case of an emergency in the metro system, traffic flow records alone do not provide sufficient information to make a decision, so the current work can serve as a supplementary tool. Little is known about the real patterns at different stations in different time periods, and this research opens a door to understanding the overall behavior of metro travelers.

Another contribution of this paper concerns how to process huge, high-dimensional smart card OD data and how to divide the data of different trip days into groups that were not obvious before. Dimensionality reduction and clustering, each on its own, are well known and widely used in trip pattern recognition. However, the integration of dimensionality reduction and clustering in pattern recognition was seldom discussed in previous literature. Besides, many existing methods in pattern recognition are sophisticated and difficult to follow. In this paper, the advantages of combining dimensionality reduction and clustering are fully exploited, and different algorithms for dimensionality reduction and clustering are compared in detail and integrated systematically. The paper also sheds light on forming a set of simple and applicable methodologies for processing smart card data to obtain trip patterns.

REFERENCES
[1] M. A. Munizaga and C. Palma, "Estimation of a disaggregate multimodal public transport Origin–Destination matrix from passive smartcard data from Santiago, Chile," Transp. Res., Part C: Emerg. Technol., vol. 24, pp. 9-18, Oct. 2013.
[2] W. Weijermars and E. V. Berkum, "Analyzing highway flow patterns using cluster analysis," in Proc. IEEE Intelligent Transportation Syst., Vienna, Austria, 2005, pp. 308-313.
[3] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, Dec. 2000.
[4] C. Ding, X. He, H. Zha, and H. D. Simon, "Adaptive dimension reduction for clustering high dimensional data," in Proc. IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 147-154.
[5] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, Feb. 2007.
[6] U. V. Luxburg, "A tutorial on spectral clustering," Stat. Comput., vol. 17, no. 4, pp. 395-416, Aug. 2007.
[7] J. Mendes-Moreira, L. Moreira-Matias, J. Gama, and J. F. de Sousa, "Validating the coverage of bus schedules: A machine learning approach," Information Sciences, vol. 293, pp. 299-313, 2015.
[8] J. Khiari, L. Moreira-Matias, V. Cerqueira, and O. Cats, "Automated setting of Bus schedule coverage using unsupervised machine learning," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer International Publishing, 2016, pp. 552-564.
[9] C. Yang, F. F. Yan, and X. D. Xu, "Clustering Daily Metro Origin-Destination Matrix in Shenzhen China," Appl. Mech. Mater., vol. 743, pp. 422-432, Mar. 2015.
[10] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, 2013.
[11] S. Sun, C. Zhang, and G. Yu, "A Bayesian network approach to traffic flow forecasting," IEEE Trans. Intell. Transp. Syst., vol. 7, no. 1, pp. 124-132, Mar. 2006.
[12] S. Jiang, S. Wang, Z. Li, W. Guo, and X. Pei, "Fluctuation Similarity Modeling for Traffic Flow Time Series: A Clustering Approach," in Proc. 18th IEEE Int. Conf. Intelligent Transportation Syst., Canary Islands, 2015, pp. 848-853.
[13] Y. Sun, N. Ye, and X. Xu, "EEG analysis of alcoholics and controls based on feature extraction," in Proc. 8th IEEE Int. Conf. Signal Process, Beijing, China, 2006.
[14] D. Dueck and B. J. Frey, "Non-metric affinity propagation for unsupervised image categorization," in 11th IEEE Int. Conf. Computer Vision, Rio De Janeiro, Brazil, 2007, pp. 1-8.
[15] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London: Academic Press, 1979.
[16] C. Goutte, L. K. Hansen, M. G. Liptrot, and E. Rostrup, "Feature-space clustering for fMRI meta-analysis," Hum. Brain Map., vol. 13, no. 3, pp. 165-183, May 2001.
[17] M. S. Hossain and R. A. Angryk, "Gdclust: A graph-based document clustering technique," in Proc. 7th IEEE Int. Conf. Data Mining Workshops, Omaha, Nebraska, 2007, pp. 417-422.
[18] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. 2, pp. 224-227, Apr. 1979.
[19] T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Commun. Stat.-Theor. M., vol. 3, no. 1, pp. 1-27, 1974.