2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC): Workshop

Daily Metro Origin-Destination Pattern Recognition Using Dimensionality Reduction and Clustering Methods
Chao Yang, Fenfan Yan, Xiangdong Xu*
School of Transportation Engineering, Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University, Shanghai, 201804, China
[email protected] [email protected] [email protected]

Abstract: Widespread usage of smart cards is leading to an unprecedented growth in the quantity of data. However, traditional methods still fail to fully recognize mobility patterns, and a better way of data mining is to be explored. In order to achieve reliable pattern recognition results, principal component analysis and singular value decomposition are each applied. Based on the dimensionality-reduced matrix, affinity propagation is selected as a suitable clustering algorithm to recognize demand patterns. Spectral clustering is introduced for comparison. Different clustering evaluation indicators serve as objective references. Representative categories are clustered, which correspond to weekdays, weekends, holidays, and different months, respectively. The integration of dimensionality reduction and clustering offers a new way to understand daily mobility structure. For metro system operators, this study also provides information on traffic volume variation and temporal distribution over the whole year. Besides, the procedures for dealing with the daily demand matrix can be applied in traffic planning, management and operation.

Keywords: Affinity propagation, daily demand matrix, principal component analysis, spectral clustering

This work was supported by the National Nature Science Foundation of China under project 71171147 and the Fundamental Research Funds for the Central Universities.

978-1-5386-1526-3/17/$31.00 ©2017 IEEE

I. INTRODUCTION

OD matrices in public transportation retain the original information of how users travel spatially and temporally, so they are also a key input to transportation system analyses and transportation planning. Demand patterns are usually recognized by computationally and statistically analyzing information from the OD matrix [1]. As a result, the importance of extracting and classifying OD matrices cannot be overemphasized. Weijermars and Berkum [2] discussed the clustering procedures of trip demand profiles. In both directions, speed and flow parameters sampled by an automatic vehicle location system were aggregated into 15-minute intervals. The results show clearly that working days are clustered into a category distinct from weekends. Building on Weijermars and Berkum, Friedrich et al. later proposed the notion of characteristic traffic days. The traffic OD matrix is collected and analyzed on the basis of floating phone data. The study period is presorted into three types and the data detected by mobile phone base stations on different calendar days are clustered to obtain a typical travel pattern and OD matrix [2].

However, traditional methods are currently vastly challenged because of their inefficiency in dealing with high-dimensional OD matrices. For instance, the smart card system in a metropolis may produce millions of records scattered across hundreds of stations. Basic parameters like traffic volume, similarities and distances have to be computed repeatedly in the original data space, so it is impractical to use traditional statistics or pattern recognition methods to cope with the original high-dimensional data. Dimensionality reduction is thus necessary to reduce redundancy and increase efficiency before clustering procedures [3].

Two common methods of dimensionality reduction, principal component analysis (PCA) and singular value decomposition (SVD) [4], are selected in this paper to avoid the curse of dimensionality. After linear transformation and reduction, the original OD matrix can be reduced to a manageable dimension. The main goal of dimensionality reduction is to capture the main features so that clustering analysis and pattern recognition can proceed efficiently. Affinity propagation (AP) is selected for clustering the dimensionality-reduced matrix as well as for recognizing the modes of the daily demand matrix (DDM) [5]. Spectral clustering [6] is another clustering method applied afterwards. Different from the research by Mendes-Moreira et al. [7] and Khiari et al. [8], which group days into different schedule types by travel time, the current analysis pays more attention to the inner relationships between OD pairs. In-depth comparisons are made between dimensionality reduction methods and between spectral clustering and AP, which were seldom discussed in the previous literature. In the end, the daily mobility structure of the metro DDM is analyzed and visualized on this basis. Although the clustering methods (AP and spectral clustering) are not new, the integration of dimensionality reduction and clustering was seldom discussed before. In this paper, the methodologies of dimensionality reduction and clustering are compared in detail and integrated systematically on the basis of our initial version in [9].

II. DATA DESCRIPTION

In this research, analysis and calculation are conducted on Shenzhen metro data from September 1st, 2011 to August 31st, 2012. Smart card usage constitutes 79.7% of the total amount in Shenzhen, so the smart card data store major and representative records of the metro system. The Shenzhen metro network is made up of 5 lines and 118 stations. The passenger flow of Shenzhen Metro saw a year-on-year rise of 69.9% in 2012, which is up to 780
million per year.

Before data processing, preliminary data cleaning was conducted to ensure the completeness of OD trips. Days with missing data are excluded from the dataset. Days in April and May are thus removed from the original dataset, while 290 valid days are employed to guarantee data quality.

When smart card (SC) users tap their cards on the metro turnstile, the fare payment ticketing device automatically records all the information related to the trip. The system helps to reconstruct the whole course of each individual's trips. It records important information like boarding time, alighting time, card ID, and station ID, which can track the trajectory of individuals. Ridership over the whole network and total trips per day can then be calculated from the same records.

III. METHODOLOGY

As previously mentioned, metro data are sampled from 290 valid days, so the dataset can be represented by a 290*118*117 matrix. Nevertheless, the 3-dimensional matrix is neither efficient to calculate nor easy to understand. In this research, the 3-dimensional matrix is simplified by joining the rows of each daily OD matrix laterally to form a row vector. Each 118*117 OD matrix, which corresponds to the 118 metro stations in Shenzhen, is thus changed to a 1*13,806 matrix. The resulting 2-dimensional 290*13,806 matrix is used in the following discussion. Each row represents a day, while each column holds the ridership of one OD pair on the metro network. Here, we define this 290*13,806 matrix as the DDM. The DDM is a rich data source which enables us to conduct demand pattern extraction and recognition over the whole network.

However, the DDM has a huge size of 13,806 columns. All the columns (OD pairs) are to some extent related to each other. As the columns carry overlapping information, the high-dimensional DDM may fail to expose the major features that specify the demand pattern of each day. Besides, the 13,806-dimensional matrix may give rise to the curse of dimensionality, which hinders revealing daily characteristics in pattern recognition as well as in data mining [10]. In this paper, PCA and SVD are chosen as two major forms of dimensionality reduction to compress the number of matrix columns and so avoid the curse of dimensionality. The linear transformation procedures of PCA and SVD have the advantage of conserving major features of the original high-dimensional matrix [11]. After dimensionality reduction, we select affinity propagation to categorize the metro demand patterns of the DDM.

A. Dimensionality Reduction with Principal Component Analysis and Singular Value Decomposition

In PCA, the cumulative contribution rate of the first m principal components can be calculated as $\sum_{t=1}^{m} \lambda_t / \sum_{t=1}^{n} \lambda_t$, where $\lambda_t\ (t = 1, 2, \ldots, n)$ are eigenvalues sorted in descending order, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$; n is defined as the column number of the DDM.

SVD factorizes a specific matrix M in the form $M = U \Sigma V^T$, which can be decomposed as shown in the formula:

$M_{290 \times 13806} = U_{290 \times 290} \, \Sigma_{290 \times 13806} \, V^T_{13806 \times 13806}$ (1)

Specifically, M stands for the DDM of size 290*13806. U is a 290*290 matrix whose columns are orthogonal eigenvectors of $M M^T$. Σ is a 290*13806 matrix whose diagonal entries are non-negative and are known as the singular values. V is a 13806*13806 matrix whose columns are orthogonal eigenvectors of $M^T M$.

To what extent the dimensionality should be reduced is a problem yet to be explored. The number of principal components is generally determined by the following three rules. First, the major statistical programs in the past used a default setting named the Kaiser criterion, which keeps only factors with an eigenvalue over 1.0 [12]. Second, the scree plot is used to determine the optimal number of components. The principal components are listed in decreasing order of eigenvalues, and the point at which the line begins to bend, or makes an elbow toward a less steep decline, indicates the number of factors that should be retained. Third, if the cumulative variance explained by the components exceeds 85%, the top k principal components are considered representative of the original data [13].

Fig. 1. Scree Plot.

Fig. 2. Cumulative Contribution Rate. (Axes: Principal Components vs. Cumulative Contribution Rate (percent); curves: PCA, SVD; marked values: 85%, 4 and 175 components.)
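The flattening step and the third rule (85% cumulative contribution) can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the synthetic Poisson data and the small tensor shape are assumptions made for illustration, and the PCA eigenvalues are taken here to be those of the sample covariance matrix of the centered DDM, which the paper does not state explicitly.

```python
import numpy as np

def flatten_to_ddm(od_tensor):
    """Join each day's OD matrix into one row: (days, O, D) -> (days, O*D)."""
    od_tensor = np.asarray(od_tensor, dtype=float)
    return od_tensor.reshape(od_tensor.shape[0], -1)

def cumulative_contribution(eigenvalues):
    """Cumulative contribution rate of eigenvalues sorted in descending order."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return np.cumsum(lam) / lam.sum()

def components_for_threshold(eigenvalues, threshold=0.85):
    """Smallest number of components whose cumulative rate reaches the threshold."""
    rate = cumulative_contribution(eigenvalues)
    k = int(np.searchsorted(rate, threshold)) + 1
    return min(k, len(rate))

def pca_eigenvalues(ddm):
    """Eigenvalues of the sample covariance matrix, via SVD of the centered data."""
    centered = ddm - ddm.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values ** 2 / (ddm.shape[0] - 1)

def svd_reduce(ddm, k):
    """Project M onto the first k right singular vectors: M' = M V_k = U_k S_k."""
    _, _, vt = np.linalg.svd(ddm, full_matrices=False)
    return ddm @ vt[:k].T

# Synthetic stand-in for the real data (assumption): 30 days, 5 origins, 4 destinations.
rng = np.random.default_rng(0)
od_tensor = rng.poisson(lam=50, size=(30, 5, 4))
ddm = flatten_to_ddm(od_tensor)              # shape (30, 20)
eigenvalues = pca_eigenvalues(ddm)
k = components_for_threshold(eigenvalues)    # third rule: 85% cumulative contribution
reduced = svd_reduce(ddm, k)                 # shape (30, k)
print(ddm.shape, k, reduced.shape)
```

With the real data the same calls would operate on a 290*118*117 tensor. The full SVD of a 290*13,806 matrix is still cheap in its thin form (`full_matrices=False`), which is why the sketch never builds the 13,806*13,806 matrix V.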


In the current research, the first six eigenvalues of PCA and the first seven eigenvalues of SVD are greater than 1.0. However, an obvious disadvantage of the first rule is its arbitrariness (e.g., an eigenvalue of 1.01 is included whereas an eigenvalue of 0.99 is excluded). As shown in Fig. 1, following the core idea of the scree plot in the second rule, four principal components seem to be the optimal choice for both PCA and SVD. For SVD, however, four components account for only 57% of the total variance, which is far below 85% and may fail to preserve the key information. As a result, the third rule, the cumulative contribution rate, is applied in the rest of this research. As can be seen in Fig. 2, 4 and 175 are the best numbers of retained factors for PCA and SVD, respectively. For PCA, the second and third rules both select four as the optimal number. PCA is an efficient method in that a high cumulative contribution rate is achieved even when the data are reduced to a low dimension. The contribution rate is thus fixed at the same level for both methods, which makes the results comparable and explicable.

For SVD, by multiplying both sides of (1) by $V_{13806 \times 175}$, the formula can be expressed as below:

$M'_{290 \times 175} = M_{290 \times 13806} \, V_{13806 \times 175} = U_{290 \times 175} \, \Sigma_{175 \times 175}$ (2)

The dimensionality of matrix M is reduced from 290*13806 to 290*175, where the number of columns is compressed and each row still represents a day. SVD does not change the internal relationship of metro stations, so information on the macro level is still preserved.

When it comes to PCA, the four retained principal components are able to reveal the inner structure of the data in the way that best depicts the variability along the major axes. Consequently, the number of columns of the original 290*13,806 matrix is compressed to 4 by PCA. The retained matrix holds over 85% of the cumulative contribution rate, and the projection also eases the recognition of metro demand patterns.

B. Clustering with Affinity Propagation

Dueck and Frey put forward a clustering method named "affinity propagation," which works by calculating parameters between data pairs (columns of the OD matrix), where the similarity $s(i, k)$ indicates to what extent the data point with index k is appropriate to be the "exemplar" (center selected from data points) for data point i [14]. To minimize the sum of squared errors, each similarity is calculated in terms of Euclidean distance as below: for data points $x_i$ and $x_k$ (distance among columns of the OD matrix), $s(i, k) = -\lVert x_i - x_k \rVert^2$. The availability matrix contains values $a(i, k)$ that indicate how appropriate it is for $x_i$ to pick $x_k$ as its "exemplar". The responsibility $r(i, k)$ indicates how appropriate it is for $x_k$ to be chosen as the cluster center of $x_i$. In each iteration, availabilities and responsibilities are both taken into consideration in determining the "exemplars". The 290*175 and 290*4 dimensionality-reduced matrices from SVD and PCA are then clustered using affinity propagation.

IV. OUTCOMES AND DISCUSSION

A. Results of Clustering Using Different Dimensionality Reduction Methods

The clustering of the DDM after dimensionality reduction using PCA is shown in Fig. 3 (results in Sep., Oct., Jan. and Feb. are shown).

Fig. 3. The outcomes of PCA displayed on the calendar. (Calendar panels: September, October, January, June; marked holidays: Mid-autumn Festival, National Day, New Year, Spring Festival, Dragon Boat Festival; dates with a circle are public holidays; legend: 1st to 13th category.)
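As a rough illustration of how a reduced matrix could be partitioned into day categories like those in Fig. 3, the affinity propagation updates of Section III-B can be sketched in NumPy. This follows the standard responsibility and availability message updates with the similarity $s(i, k) = -\lVert x_i - x_k \rVert^2$; the damping factor, the iteration count, the choice of the median similarity as self-preference and the synthetic three-blob data are all assumptions, since the paper does not report these settings.

```python
import numpy as np

def affinity_propagation(X, damping=0.7, max_iter=300, preference=None):
    """Return (exemplar indices, cluster label per row) for data matrix X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Similarity s(i, k) = -||x_i - x_k||^2
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    if preference is None:
        # Assumption: median off-diagonal similarity as the self-preference s(k, k).
        preference = np.median(S[~np.eye(n, dtype=bool)])
    np.fill_diagonal(S, preference)

    R = np.zeros((n, n))   # responsibilities r(i, k)
    A = np.zeros((n, n))   # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(max_iter):
        # r(i, k) = s(i, k) - max_{k' != k} [a(i, k') + s(i, k')]
        AS = A + S
        best = AS.argmax(axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = damping * R + (1 - damping) * R_new

        # a(i, k) = min(0, r(k, k) + sum_{i' not in {i, k}} max(0, r(i', k)))
        # a(k, k) = sum_{i' != k} max(0, r(i', k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, self_avail)
        A = damping * A + (1 - damping) * A_new

    # Points with a(k, k) + r(k, k) > 0 are taken as exemplars.
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if exemplars.size == 0:
        raise RuntimeError("no exemplars found; try another preference or damping")
    labels = S[:, exemplars].argmax(axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Synthetic stand-in for a reduced DDM (assumption): three well-separated groups of "days".
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((8, 2)) for c in centers])
exemplars, labels = affinity_propagation(X)
print(len(exemplars), labels)
```

Unlike k-means or spectral clustering, the number of clusters is not passed in; it emerges from the preference value, which is one reason AP fits a setting where the number of day types is unknown beforehand.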
The 290 valid days in the current research are clustered into 13 categories with PCA, each symbolized by a different grid pattern (the excluded days are shown in white). Public holidays are labeled with a red circle. They include Mid-Autumn Festival (September 10th to 12th), National Day (October 1st to 7th), New Year (January 1st to 3rd), Spring Festival (January 22nd to 28th) and the Dragon Boat Festival (June 22nd to 24th).

The first category contains September 1st-3rd, 6th-9th, 15th-16th, 20th-23rd, 28th-30th, October 11th-15th, 18th-19th, 25th-27th, November 2nd-3rd, 7th-8th, 10th, 14th-17th, 21st-24th, 28th-30th, December 1st, 6th-8th, 13th-14th and 22nd. It consists mostly of weekdays from September to December. This mode is recognized as a characteristic weekday in fall. Shenzhen metro OD volume from June to August is vastly different from that in fall, which is automatically recognized in clustering. The second category contains Saturdays as well as Sundays from September to December. Sundays from October to January are contained in the third cluster, while most Thursdays and Fridays from September to January are included in the fourth cluster. The travel mode on Thursdays and Fridays has some distinguishing features in these months. The fifth is mainly weekdays in January. The sixth only has records around Spring Festival.

Apart from that, the seventh category is exclusively January 21st to 29th, which closely matches the Spring Festival. Chinese New Year, or Spring Festival, is the most important traditional Chinese holiday for family reunion. Shenzhen has a huge migrant population coming from every corner of China, so in the days around the festival, workers have a strong demand to go back home. Metro stations near major external traffic nodes, including railway, bus and ferry stations and airports, are expected to have a large volume. The eighth category is mostly Tuesdays and Thursdays from February to March, and the ninth category is on Mondays, Wednesdays and Fridays. Just like the second one, the tenth category covers Sundays, here from February to August. The eleventh and the twelfth are weekdays in June, July and August. The thirteenth is Saturdays from February to August. The results also show that trip patterns in the same week or in the same holidays are not necessarily the same.

TABLE I. MAIN CHARACTERISTICS CORRESPONDING TO 13 CATEGORIES
Category  Characteristics
1         Weekdays from Sept. to Nov. and some weekdays in Dec.
2         Weekends from Sept. to Dec.
3         National Day and Sundays from Oct. to Jan.
4         Some weekdays in Dec.
5         Weekdays in Jan.
6         Days before and after Spring Festival
7         Spring Festival
8         Weekdays from Feb. to Mar. (mostly Tuesdays and Thursdays)
9         Weekdays from Feb. to Mar. (mostly Mondays, Wednesdays and Fridays)
10        Sundays from Feb. to Aug.
11        Weekdays from Jun. to Aug. (mostly Tuesday to Thursday)
12        Weekdays from Jun. to Aug. (mostly Mondays and Fridays)
13        Saturdays from Feb. to Aug.

Different from PCA, the DDM after dimensionality reduction using SVD is divided into 16 clusters. The SVD clustering outcomes are more specific and detailed than those of PCA, and they also capture inner features of different patterns. Spring Festival similarly falls into a distinct category, indicating that this holiday has a rather typical, identifiable demand pattern.

One surprising result is that, as can be seen from Table II, Halloween Eve and Christmas Eve are also uniquely clustered. These two western festivals are enjoying growing popularity in the youthful city of Shenzhen. One distinguishing feature is that theme parks, shopping malls and other recreational places have huge metro traffic flow on the two eves, especially at night. Each cluster is a unique demand profile, and identifying this category can guide the planning and overall management of Shenzhen Metro. Pre-arranged schemes and warnings can be worked out to cope with rapid changes in traffic flow.

TABLE II. MAIN CHARACTERISTICS CORRESPONDING TO 16 CATEGORIES
Category  Characteristics
1         Sundays and Mondays from Sept. to Nov.
2         Weekdays from Sept. to Dec.
3         Fridays and Saturdays from Sept. to Jan.
4         Weekends from Sept. to Mar.
5         Sundays from Oct. to Jan.
6         Halloween Eve and Christmas Eve
7         Days before Spring Festival
8         Spring Festival
9         Two days before and after Spring Festival
10        Days after Spring Festival
11        Weekdays from Feb. to Mar.
12        Sundays from Feb. to Mar.
13        Weekdays in Jun.
14        Sundays from Jun. to Aug.
15        Saturdays from Jun. to Aug.
16        Weekdays from Jul. to Aug.

The results of the DDM using PCA and SVD have many similarities, and their clustering quality can hardly be judged subjectively. On the whole, the clustering results under PCA and SVD show the demand patterns of the metro system and can serve as a fundamental analysis basis in operation and management. However, the dimensionality-reduced matrix of PCA is a 290*4 matrix, while that corresponding to SVD is a
290*175 matrix. PCA reduces the original DDM to a rather low dimension, and its clustering results still capture the major demand patterns. Although SVD achieves satisfactory results, it fails to reduce the high-dimensional matrix efficiently to a manageable dimension. In other words, the SVD-reduced matrix works well at the expense of high time complexity and is much slower than the PCA-reduced matrix when performing AP clustering.

B. Comparisons Among Clustering Methods

AP alone may not be convincing as the better clustering method in this research. Other mainstream methods should be selected and compared in detail as well.

Based on graph theory, spectral clustering conducts dimensionality reduction prior to clustering using the spectrum of the similarity matrix, which measures the similarity of the data and serves as an input. The process of clustering is to find a proper partition of the graph so that edges between different groups have low weights while edges within the same cluster have high weights. Let $G = (V, E)$ be an undirected graph with vertex set $V = \{v_1, v_2, \ldots, v_n\}$ and edge set $E$. The spectral algorithm proceeds as follows. First, the number of clusters k needs to be predetermined. The similarity matrix is defined as a symmetric matrix $S$, and each element $s_{ij}$ measures the similarity between a pair of data points. The unnormalized graph Laplacian matrix is defined as $L = D - W$ ($L$ is the unnormalized Laplacian; $D$ is the matrix of summed weighted adjacency; $W$ is the weight matrix). Then $L$ and the first k generalized eigenvectors $u_1, \ldots, u_k$ of the generalized eigenproblem $L u = \lambda D u$ are calculated. These form a matrix $U$ by combining the column vectors $u_1, \ldots, u_k$. For $i = 1, \ldots, n$, $y_i$ is the vector corresponding to the i-th row of $U$. Finally, the points $y_i$ are clustered with the k-means algorithm [6].

In the first place, the number of clusters needs to be specified. The best choice of k is often controversial and ambiguous, since it depends on the shape and the distribution of data points and the desired clustering resolution. The optimal choice of k should strike a balance between compression of the data using clustering methods and accuracy in assigning data points to appropriate clusters.

One simple rule of thumb sets the number of clusters to $k \approx \sqrt{n/2}$, with n being the number of data points [15]. Information criterion approaches like the Akaike information criterion (AIC), Bayesian information criterion (BIC), and the Deviance information criterion (DIC) are also popular criteria [16]. Compared with the previous criteria, the average silhouette of the data is a relatively simple and efficient criterion for assessing the number. The silhouette measures how closely a data point is linked within its cluster and how loosely it is matched to data in the neighboring cluster. The number of clusters corresponding to the largest average silhouette value is the optimum [17].

The silhouette criterion is defined as follows:

$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$ (3)

where $a(i)$ is the average dissimilarity (Euclidean distance) of observation i with all other points within the same cluster, and $b(i)$ is the lowest average dissimilarity of i to any other cluster it does not belong to. The average silhouette is used here to evaluate clustering and to estimate the best number of clusters.

As shown in Fig. 4, the largest average silhouette value is achieved when the cluster number equals 11. Eleven clusters seem to be the most appropriate in spectral clustering. However, Spring Festival and the first half of February are clustered into the same category, which does not agree with the general perception. National Day is not recognized, and many Sundays and weekdays, which have different features, end up in the same category. Judging from accuracy and universality, AP clustering using the dimensionality reduction method of PCA or SVD seems to outperform spectral clustering in the current research.

Fig. 4. Determination of the best number of clusters.

C. Experiments Done Under Different Clustering Evaluation Indices

To evaluate the performance of clustering methods, the Davies-Bouldin index [18], Dunn's index, and the Calinski-Harabasz index [19] are all widely used. In this study, the Calinski-Harabasz index, the Davies-Bouldin index and the Silhouette index are reported for each method in Table III.

TABLE III. CLUSTERING EVALUATION INDICES OF DIFFERENT METHODS
Clustering Evaluation Indices  PCA+AP (13 categories)  SVD+AP (16 categories)  Spectral Analysis (11 categories)
Calinski-Harabasz              131.3417                96.9219                 93.3874
Davies-Bouldin                 1.5846                  1.3245                  1.819
Silhouette                     0.2833                  0.3766                  0.2422
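To make the three indices of Table III concrete, the following sketch computes them from their textbook definitions using Euclidean distance, with the silhouette following (3). It is an illustration under assumptions, not the implementation used in the paper: the function names and the four-point toy data are hypothetical, and singleton clusters are given a silhouette of 0 by convention.

```python
import numpy as np

def silhouette_values(X, labels):
    """s(i) = (b(i) - a(i)) / max{a(i), b(i)} for every observation."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # convention: a singleton cluster gets s(i) = 0
        a = dist[i, own].sum() / (own.sum() - 1)   # self-distance is zero
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    overall = X.mean(axis=0)
    between = within = 0.0
    for c in clusters:
        pts = X[labels == c]
        centroid = pts.mean(axis=0)
        between += len(pts) * ((centroid - overall) ** 2).sum()
        within += ((pts - centroid) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(X, labels):
    """Mean over clusters of the worst scatter-to-separation ratio (lower is better)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[j], axis=1).mean()
        for j, c in enumerate(clusters)
    ])
    k = len(clusters)
    worst = [
        max((scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return float(np.mean(worst))

# Toy data (assumption): two tight pairs of points, labelled well and badly.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good, bad = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
for name, lab in (("good", good), ("bad", bad)):
    print(name, silhouette_values(X, lab).mean(),
          calinski_harabasz(X, lab), davies_bouldin(X, lab))
```

The same average silhouette can be evaluated for each candidate k when choosing the number of clusters for spectral clustering, as done for Fig. 4.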
The clustering quality is judged as follows: a high Silhouette value or Calinski-Harabasz value indicates a good clustering solution, whereas the optimal clustering solution is achieved with the minimum Davies-Bouldin index value.

The methods are compared at their respective optimal numbers of clusters. As shown in Table III, all three indices show that spectral analysis is inferior to AP clustering with the application of PCA and SVD. AP with PCA is better than AP with SVD judging from the Calinski-Harabasz index, but is slightly worse than AP with SVD judging from the Davies-Bouldin and Silhouette indices. According to the values of these indices, the performance of AP with PCA and AP with SVD is almost on the same level. AP clustering with the dimensionality reduction method of PCA is much faster than with SVD and, all things considered, achieves a balance between accuracy of clustering and efficiency of dimensionality reduction. As a result, AP clustering with the dimensionality reduction method of PCA is more cost-effective in the current study.

V. CONCLUSIONS

The current research on daily demand recognition forms a set of procedures that can be selected accordingly by metro operators. It provides a basic understanding of daily demand patterns, such as how many people will travel from A to B. The demand patterns can be used as a basis for macro traffic models and as supplementary information for operational management. In the past, by contrast, metro operators formulated schedules according to real-time flow variation, which is subjective and may fail to obtain early warning information. Days in the same cluster can be represented by a typical demand matrix (the average of the OD matrices of the days within the same cluster). With the help of the typical demand matrix, more specific and targeted demand patterns can be understood and a database can be built to formulate a long-term reaction mechanism. In spring or fall, on working days or holidays, in downtown or rural areas, metro operators can make instant plans based on demand patterns.

Moreover, the daily public transit OD matrices may be averaged and retrieved to serve as inputs to other models. Daily trip patterns along with historical OD volume can serve as the input or complementary information in traffic volume prediction and analysis of congestion patterns. For instance, hot OD pairs or popular areas where traffic jams may occur can be recognized and predicted from historical patterns using the current methods and real-time volume. In case of emergency in the metro system, traffic flow records alone do not provide sufficient information to make a decision, so the current work can serve as a supplementary tool. Little is known about the real pattern in different stations in different time periods, and this research opens a door to understanding the overall behavior of metro travelers.

Another contribution of this paper is related to how to process huge smart card OD data with high dimension, and how to divide data of different trip days into different groups which were not obvious before. Dimensionality reduction or clustering alone is well known and widely used in trip pattern recognition. However, the integration of dimensionality reduction and clustering in pattern recognition was seldom discussed in the previous literature. Besides, many existing methods in pattern recognition are sophisticated and difficult to follow. In this paper, the advantages of combining dimensionality reduction and clustering are fully exploited, and different algorithms for dimensionality reduction and clustering are compared in detail and integrated systematically. It also sheds light on forming a set of simple and applicable methodologies for processing smart card data to obtain trip patterns.

REFERENCES
[1] M. A. Munizaga and C. Palma, “Estimation of a disaggregate multimodal public transport Origin–Destination matrix from passive smartcard data from Santiago, Chile,” Transp. Res., Part C: Emerg. Technol., vol. 24, pp. 9-18, Oct. 2013.
[2] W. Weijermars and E. V. Berkum, “Analyzing highway flow patterns using cluster analysis,” in Proc. IEEE Intelligent Transportation Syst., Vienna, Austria, 2005, pp. 308-313.
[3] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323-2326, Dec. 2000.
[4] C. Ding, X. He, H. Zha, and H. D. Simon, “Adaptive dimension reduction for clustering high dimensional data,” in Proc. IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 147-154.
[5] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972-976, Feb. 2007.
[6] U. V. Luxburg, “A tutorial on spectral clustering,” Stat. Comput., vol. 17, no. 4, pp. 395-416, Aug. 2007.
[7] J. Mendes-Moreira, L. Moreira-Matias, J. Gama, and J. F. de Sousa, “Validating the coverage of bus schedules: A machine learning approach,” Information Sciences, vol. 293, pp. 299-313, 2015.
[8] J. Khiari, L. Moreira-Matias, V. Cerqueira, and O. Cats, “Automated setting of Bus schedule coverage using unsupervised machine learning,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer International Publishing, pp. 552-564, 2016.
[9] C. Yang, F. F. Yan, and X. D. Xu, “Clustering Daily Metro Origin-Destination Matrix in Shenzhen China,” Appl. Mech. Mater., vol. 743, pp. 422-432, Mar. 2015.
[10] K. Fukunaga, Introduction to statistical pattern recognition. Academic press, 2013.
[11] S. Sun, C. Zhang, and G. Yu, “A Bayesian network approach to traffic flow forecasting,” IEEE Trans. Intell. Transp. Syst., vol. 7, no. 1, pp. 124-132, Mar. 2006.
[12] S. Jiang, S. Wang, Z. Li, W. Guo, and X. Pei, “Fluctuation Similarity Modeling for Traffic Flow Time Series: A Clustering Approach,” in Proc. 18th IEEE Int. Conf. Intelligent Transportation Syst., Canary Islands, 2015, pp. 848-853.
[13] Y. Sun, N. Ye, and X. Xu, “EEG analysis of alcoholics and controls based on feature extraction,” in Proc. 8th IEEE Int. Conf. Signal Process, Beijing, China, 2006.
[14] D. Dueck and B. J. Frey, “Non-metric affinity propagation for unsupervised image categorization,” in 11th IEEE Int. Conf. Computer Vision, Rio De Janeiro, Brazil, 2007, pp. 1-8.
[15] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London: Academic press, 1979.
[16] C. Goutte, L. K. Hansen, M. G. Liptrot, and E. Rostrup, “Feature-space clustering for fMRI meta-analysis,” Hum. brain map., vol. 13, no. 3, pp. 165-183, May 2001.
[17] M. S. Hossain and R. A. Angryk, “Gdclust: A graph-based document clustering technique,” in Proc. 7th IEEE Int. Conf. Data Mining Workshops, Omaha, Nebraska, 2007, pp. 417-422.
[18] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 2, pp. 224-227, Apr. 1979.
[19] T. Caliński and J. Harabasz, “A dendrite method for cluster analysis,” Commun. Stat.-Theor. M., vol. 3, no. 1, pp. 1-27, 1974.
