
Spatio-Temporal Attention-Based Neural Network for Credit Card Fraud Detection

The document presents a novel spatio-temporal attention-based neural network (STAN) for credit card fraud detection, addressing the limitations of traditional machine learning methods that require manual feature engineering. The STAN model integrates spatial and temporal behaviors of transaction records using attention mechanisms and 3D convolution, demonstrating superior performance in detecting fraudulent transactions compared to existing methods. Extensive experiments on real-world datasets validate the effectiveness of STAN in identifying suspicious transactions and uncovering fraud patterns.


The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Spatio-Temporal Attention-Based
Neural Network for Credit Card Fraud Detection
Dawei Cheng, Sheng Xiang, Chencheng Shang,
Yiyi Zhang, Fangzhou Yang, Liqing Zhang∗
MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering,
Shanghai Jiao Tong University, Shanghai, China
{dawei.cheng, yi95yi, lake titicaca}@sjtu.edu.cn, [email protected]

∗Corresponding Author
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract
Credit card fraud is an important issue and incurs a considerable cost for both cardholders and issuing institutions. Contemporary methods apply machine learning-based approaches to detect fraudulent behavior from transaction records. But manually generating features needs domain knowledge and may lag behind the modus operandi of fraud, which means we need to automatically focus on the most relevant patterns in fraudulent behavior. Therefore, in this work, we propose a spatio-temporal attention-based neural network (STAN) for fraud detection. In particular, transaction records are modeled by attention and 3D convolution mechanisms that integrate the corresponding information, including spatial and temporal behaviors. Attentional weights are jointly learned in an end-to-end manner with the 3D convolution and detection networks. Afterward, we conduct extensive experiments on a real-world fraud transaction dataset; the results show that STAN performs better than other state-of-the-art baselines in both AUC and precision-recall curves. Moreover, we conduct empirical studies with domain experts on the proposed method for fraud post-analysis; the results demonstrate the effectiveness of our proposed method in both detecting suspicious transactions and mining fraud patterns.

Introduction
Credit card fraud is a general term for the unauthorized use of funds in a transaction, typically through a credit or a debit card (Bhattacharyya et al. 2011). Global card fraud losses amounted to over 25 billion US dollars in 2018 and are forecast to continue to increase (Wang, Chen, and Chen 2019). This huge amount of losses has increased the importance of fraud-fighting. Figure 1 shows a typical fraud detection framework deployed in a commercial system. The card alliance or banks, such as VISA, MasterCard or Citibank, assess each transaction with an online predictive model once it has passed card checking. Unlike a simple card checking system, which focuses on card blacklists, budget checking, etc., the predictive model is designed to detect fraud patterns automatically and produce a fraud risk score. Investigators can thereby focus on the high-risk transactions effectively and feed the analysis results back to the predictive model for model updating.

Figure 1: The framework of credit card fraud detection. (Diagram labels: Card Check, Blocked Card?, Sufficient Balance?, Rejected, Predictive Model, Score, Investigation, Feedback.)

As attacking strategies from potential fraudsters change, it is essential that a well-behaved system can adapt to the evolving strategies (Randhawa et al. 2018; Jiang et al. 2018). We summarize the following two major observations from real-world fraud transactions: 1) Temporal aggregation. Fraudsters are limited in the time available for their activities. As the cardholder will freeze the card as soon as possible once suspicious transactions have been detected, fraudsters are required to reach the credit limit in a short time. That means the behaviors of the fraud transaction would be exposed in a limited time. 2) Spatial aggregation. Fraudsters are constrained by the cost of the devices and merchants used for transactions. That is, due to the economic constraints, fraudsters will use the card frequently with only a small number of merchants, which are spatially different from those of normal transactions.
Many existing models to deal with fraud transactions have been extensively studied (Patidar, Sharma, and others 2011; Bahnsen et al. 2016; Carneiro, Figueira, and Costa 2017). They mainly fall into one of two directions: 1) Rule-based methods directly generate sophisticated rules by domain experts for identification; for example, (Seeja and Zareapoor 2014) proposed an association rules method for mining frequent fraud rules. 2) Machine learning-based methods learn static models by exploring large amounts of historical data. For example, (Fiore et al. 2017) extracted features based on neural networks and built supervised classifiers for detecting fraudulent transactions. (Fu et al. 2016) advanced the usage of automatic feature engineering in a convolutional neural network (CNN). (Randhawa et al. 2018) applied AdaBoost and majority voting on fraud records. (Jurgovsky et al. 2018) studied this task with a sequence LSTM model. However, all these methods require manually constructing features before feeding them into a classification model, and so fail to automatically learn the joint impact of spatial and temporal patterns, even though spatio-temporal patterns have been observed as the main weaknesses of fraudsters, as also reported by (Gómez et al. 2018).
Recently developed attention mechanisms have shown their benefit for automatic feature learning (Vaswani et al. 2017; Cheng et al. 2019b). The superior performance of 3-dimensional (3D) convolution on spatio-temporal feature learning is also demonstrated in a wide range of prediction tasks (Allamanis, Peng, and Sutton 2016). In the credit card fraud detection task, it is important to jointly consider “temporal aggregation” and “spatial aggregation” and then feed them into a representative, deep classifier that is well suited for spatio-temporal feature learning.
Therefore, in this paper, we present the STAN model for credit card fraud detection, a novel deep learning-based method, which jointly considers “temporal aggregation” and “spatial aggregation” in an attention network. Our proposed approach first converts raw transaction records into spatio-temporal feature slices; then we use an attention mechanism to adaptively learn the importance of different slices. To uncover the hidden fraud patterns, we introduce a 3D convolution layer to capture intrinsic relationships among spatio-temporal patterns. In our experiments, we show that the results of the proposed method significantly outperform the results from other state-of-the-art baselines. In brief, the main contributions of this paper include:
• We present a novel attention-based 3D convolution neural network for credit card fraud detection by jointly capturing two weaknesses displayed by fraudsters, summarized as “temporal aggregation” and “spatial aggregation”. To the best of our knowledge, this is the first time that a fraud detection problem has been addressed by spatio-temporal attention neural network approaches with a 3D convolutional mechanism.
• Our approach is extensively evaluated in a real-world credit card fraud post-analysis system, hosted by a major financial institution. The experimental results demonstrate the superiority of our proposed methods, which could detect more fraud transactions with relatively high precision compared with state-of-the-art baselines.

Preliminaries
In this section, we first briefly present some data analysis to support our intuitions and then present the problem definition of our work.

Spatio-temporal Analysis
Figure 2 visualizes the scaled spatio-temporal feature slices, where the left part shows fraud transactions and the right part shows legitimate ones. We will describe the detailed feature extraction steps in the next section. It can be seen that in temporal analysis, fraud features (shown in Figure 2a) change abruptly across different slices, while legitimate ones change much more gradually (shown in Figure 2b). This confirms our original assumption of “temporal aggregation”.
In spatial analysis, we encode transaction merchants into their location codes and aggregate features according to the codes within a fixed time window (we set it to days here). Figures 2c and 2d show the heat maps of features in spatial slices. As we can see, fraud transactions are obviously located in only a small number of zones, which means fraudsters would use the card frequently under the constraint of locations or devices, while for the normal transactions, there are no noteworthy patterns of user consuming behavior in the given time windows. As a result, this set of data analysis validates our original assumption.

Figure 2: Heat maps of spatio-temporal feature slices from both fraudulent and legitimate transactions.

Problem Definition
Transaction A transaction means the use of a credit card by a consumer u to purchase commodities or services. The purchase price is sent through a processor for authorization; if the amount a is approved, it is automatically submitted to the merchant m in location l.

Transaction Record A transaction record r can be defined as a tuple of attributes in a transaction payment process r = {u, t, l, a}, where u denotes the user, t and l are the time stamp and location of the transaction, and a denotes the amount of this payment.

Fraud Event A fraud event d in this paper refers to a transaction which is not authorized by its cardholder. A fraud event is a special type of transaction, which means it also preserves the {u, t, l, a} attributes.

The complete real-world fraud event data provided by our collaborating institution offers us the unique opportunity to tackle the problem of fraud detection. We now formalize our credit card fraud detection problem as follows:
Given a set of transaction records R = {U, T, L, A}, a set of fraud events D, which are a subset of the transaction collection {D | D ⊂ R}, and time periods t_i and t_{i+1}, for each
transaction, we want to infer the possibility of whether it is a fraud event, based on the transaction records and fraud events from t_1 to t_i. The objective is to achieve a high accuracy of fraud prediction, as well as to explore the fraud patterns of credit card transactions.

The Proposed Approaches
In this section, we first introduce the framework of a spatio-temporal attention-based neural model. After that, we present the process of feature engineering, the spatio-temporal attention layer, the 3D convolution network (3D ConvNet) and the detection layer. Lastly, we introduce the optimization strategy of the proposed methods.

Model Architecture
Figure 3 shows the general network architecture of STAN. The model takes users’ transaction records as input and transfers them into high-order tensor spaces in spatial, temporal and feature orders. Then, we apply spatio-temporal attention and the 3D convolution layer to obtain a transaction representation vector. Spatio-temporal attention helps to draw information from tensor features by different weights, and the 3D convolution layer helps to model hidden patterns of transactions. Finally, we reshape the learned feature representation from tensors to vectors for the fraud detection task by a detection network. We will first introduce each component of the model, then discuss the settings of the detection layer and optimization in the following sections.

Figure 3: The illustration of the proposed spatio-temporal attention-based neural network (STAN) model. Raw transaction records are processed by feature engineering, spatio-temporal attention, and multiple 3D ConvNet to learn high-level representations. Afterward, the learned representations are reshaped to vectors and fed into a detection network for fraud estimation. Attentional weights are jointly optimized in an end-to-end mechanism with 3D convolution and detection networks. (Pipeline stages shown: Transaction Records {u, t, l, a}, Feature Engineering, Spatio-Temporal Attention, Multi 3-D Convolution & Pooling, Reshaping, Dense Layer, Fraud Score. Other diagram labels: Temporal Slices, Temporal Weights, Spatial Slices, Spatial Weights, Attention Layer, Convolution Layer, Full Connect Layer, Error, Back Propagation.)

Feature Extraction
For given transaction records $r = (r_1, r_2, \cdots, r_n)$, each record $r_i = \{u, t, l, a\}$ contains: user ID, timestamp, location code and transaction amount. In preprocessing, we combine users who maintain multiple credit cards into a single user ID and filter out inactive users that have fewer than 10 records within one month. As the number of users who have never been charged with unauthorized transactions is much larger than the number of users who are affected, we adopt user-level downsampling of normal users instead of transaction-level sampling, to maintain the fraud patterns during preprocessing.
Afterward, we construct the feature representation of each record in tensor format $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$, where $N_1$, $N_2$, $N_3$ denote the dimensions of temporal, spatial and feature slices.
Temporal Slices $\mathcal{X}(t, :, :)$. Each temporal feature represents a vector generated in a given time window. The number and diversity in temporal slices reflect its activeness and hence are related to the consuming behavior of the user. We thus extract features in temporal slices including (1) features in the latest 1 second, minute, hour, day, week, month and quarter; (2) features in the last 1, 10, 100, 1000 transactions.
Spatial Slices $\mathcal{X}(:, l, :)$. Based on the observation that fraudsters are affected by the constraints of location, we collect zip codes and the business center from the State Postal Bureau and AliTrip, and divide the features into four levels manually according to the location. They are a one-hot representation of the location ID at the nation, state, city and business center levels, and we concatenate them in the spatial slice.
Features $\mathcal{X}(:, :, f)$. Inspired by Fu’s work (Fu et al. 2016), which found that transaction entropy is one of the important patterns in fraud detection, we identify the extracted features including current amount, average amount, total amount, transaction times and the most recent location.

Spatio-temporal Attention Net
The attention network aims to perform proper credit assignment to the spatial and temporal slices according to their importance in the current transaction. It contains two self-attention layers targeting temporal and spatial slices respectively.
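To make the temporal slices of the Feature Extraction step more concrete, the following is a minimal sketch of how window-based aggregates could be computed for one user. It is an illustration under stated assumptions, not the authors' implementation: the function name `temporal_slices`, the 30-day month and 90-day quarter, the exclusion of the current transaction from the history, and the choice of four features per slice (the most recent location is omitted) are all hypothetical.

```python
from datetime import datetime, timedelta

# Time windows from the text: latest 1 second, minute, hour, day, week,
# month and quarter. The month/quarter lengths are assumptions.
WINDOWS = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "quarter": timedelta(days=90),
}


def temporal_slices(history, now, amount):
    """Build one feature row per time window for the current transaction.

    history: list of (timestamp, amount) pairs of earlier transactions
             by the same user (the current transaction is excluded).
    now:     timestamp of the current transaction.
    amount:  amount of the current transaction.
    Each row holds [current amount, average amount, total amount, count].
    """
    rows = []
    for span in WINDOWS.values():
        in_window = [a for ts, a in history if now - span <= ts < now]
        count = len(in_window)
        total = sum(in_window)
        average = total / count if count else 0.0
        rows.append([amount, average, total, float(count)])
    return rows


# Hypothetical usage with made-up transactions.
now = datetime(2016, 1, 8, 12, 0, 0)
history = [
    (now - timedelta(seconds=0.5), 10.0),
    (now - timedelta(seconds=30), 20.0),
    (now - timedelta(days=2), 30.0),
]
for name, row in zip(WINDOWS, temporal_slices(history, now, 5.0)):
    print(name, row)
```

A burst of transactions inside the shortest windows shows up as large counts in the first rows, which is the kind of "temporal aggregation" signal the slices are meant to expose.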

Figure 4: Illustration of spatio-temporal attention neural networks. (Panels: Temporal Attention, with weights α_{1,1} to α_{1,5} over the temporal slices X(1,:,:) to X(5,:,:); Spatial Attention, with weights α_{2,1} to α_{2,5} over the spatial slices rept(:,1,:) to rept(:,5,:), producing H(t, s, f).)

Temporal Attention Layer Formally, given the extracted feature tensor $\mathcal{X}$ as described above, the temporal attention layer represents the transaction by a weighted sum of the matrix representation of all the temporal slices. Mathematically, it takes the form as follows:

$$\mathrm{rept} = \sum_{t=1}^{N_1} a_{1,t}\, \mathcal{X}(t,:,:) \tag{1}$$

$$a_{1,t} = \frac{\exp\big((1-\lambda_1) \cdot g_t(W_t, \mathcal{X}(t,:,:))\big)}{\sum_{t'=1}^{N_1} \exp\big((1-\lambda_1) \cdot g_{t'}(W_{t'}, \mathcal{X}(t',:,:))\big)} \tag{2}$$

where $a_{1,t}$ is the weight for each temporal slice, and $g_t(\cdot)$ is a fully connected layer with ReLU activation and parameters $W_t$; $\lambda_1 \in [0, 1]$ is the temporal penalty factor to control the importance of temporal attention; $\mathrm{rept}$ is the output of the temporal attention layer. It should be noted that we unfold matrices $\mathcal{X}(t,:,:)$ to row vectors for computational convenience and reshape the output into tensor format $\mathrm{rept} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$.

Spatial Attention Layer Given the output from the temporal net $\mathrm{rept}$, we then apply a spatial attention mechanism on top of the temporal net. It is formulated as follows:

$$H^{a} = \sum_{s=1}^{N_2} a_{2,s}\, \mathrm{rept}(:,s,:) \tag{3}$$

$$a_{2,s} = \frac{\exp\big((1-\lambda_2) \cdot g_s(W_s, \mathrm{rept}(:,s,:))\big)}{\sum_{s'=1}^{N_2} \exp\big((1-\lambda_2) \cdot g_{s'}(W_{s'}, \mathrm{rept}(:,s',:))\big)} \tag{4}$$

where $W_s$ is the weight of the spatial network $g_s$; $H^{a}$ is the output of the attention layer, which we reshape into tensor format with the same order as $\mathcal{X}$; $a_{2,s}$ is the weight for each spatial slice; $\lambda_2 \in [0, 1]$ is the spatial penalty factor to control the importance of spatial attention.

3D Convolutional Layers
For our mission, CNN is an attractive option for three main reasons. First, CNNs can clearly exploit the spatial features of our problem. In particular, they can learn local spatial filters that are useful for classification tasks. Second, by stacking multiple layers, the network can learn more complex features from input spatial spaces. Finally, a CNN can be optimized by SGD-based methods, which can be performed efficiently with commercial graphics hardware.
Compared to 2D convolution networks, 3D ConvNet is ideal for spatio-temporal learning of features. Due to 3D convolution and 3D pooling operations, 3D ConvNet works temporally and spatially, whereas in 2D ConvNet convolution is only spatially executed. In general, the following equation represents a 3D convolution operation:

$$\mathrm{repc}_i^{c}(t,l,f) = \sum_{m,n,o} H^{c-1}(t-m,\, l-n,\, f-o)\, W_i^{c}(m,n,o) \tag{5}$$

in which $W_i^{c}$ is the $i$-th 3D kernel in the $c$-th layer, which convolves over the feature $H^{c-1}$, and $W_i^{c}(m,n,o)$ is the element-wise weight in the 3D convolution kernel. Thus, the feature $H^{c}$ is obtained by different 3D convolution kernels:

$$H^{c} = \sigma\Big(\sum_{i} \mathrm{repc}_i^{c} + b^{c}\Big) \tag{6}$$

where $\sigma$ denotes the sigmoid function.
Then we hierarchically build a deep 3D ConvNet model by stacking convolutional layers (represented as C) and pooling layers (represented as P). In particular, multiple 3D feature volumes are generated in the C layer. In the P layer, the max-pooling operation is also performed in 3D, that is, the feature volume is subsampled based on the cube neighborhood. In the fully connected layer, the 3D feature volume is flattened into a vector as input.

Detection Layer and Optimization
The fraud detection task takes the transaction representation rep, which is the flattened tensor learned by the attention and convolution networks, and aims to learn the probability of whether it is a fraudulent trade. The loss function is the negative log-likelihood defined as follows:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log(\mathrm{detect}(\mathrm{rep}_i : \theta)) + \lambda_3 (1 - y_i) \log(1 - \mathrm{detect}(\mathrm{rep}_i : \theta)) \,\big] \tag{7}$$

where $\mathrm{rep}_i$ denotes the representation of the $i$-th transaction record, which is the output of 3D ConvNet, and $\lambda_3$ indicates the sample weight according to the biased distribution of fraud and legitimate records; $y_i$ denotes the label of the $i$-th record, which is set to 1 if the record is fraud and 0 otherwise; $\mathrm{detect}(\mathrm{rep}_i)$ is the detection function that maps $\mathrm{rep}_i$ to a real-valued score, indicating the probability of whether the current transaction is fraudulent. We implement $\mathrm{detect}(\mathrm{rep}_i : \theta)$ with a two-layer ReLU network followed by a one-layer sigmoid network.
The proposed STAN can be optimized through the standard SGD-based algorithms. In this paper, we used the Adam optimizer to learn the parameters. We set the initial learning rate to 0.001, and the batch size to 256 by default.
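The class-weighted loss of Equation (7) can be evaluated directly. The sketch below is a plain-Python illustration, not the authors' code: the function name `weighted_log_loss`, the clipping constant `eps`, and the example value of `lam3` are assumptions introduced here (the paper only states that λ3 is set from the training distribution).

```python
import math


def weighted_log_loss(labels, probs, lam3, eps=1e-12):
    """Equation (7): -(1/N) * sum[y*log(p) + lam3*(1-y)*log(1-p)].

    labels: 1 for fraud, 0 for legitimate.
    probs:  predicted fraud probabilities, i.e. the detect(rep_i) outputs.
    lam3:   weight applied to the legitimate (y = 0) term.
    eps:    clipping constant to avoid log(0); an implementation detail
            assumed here, not taken from the paper.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + lam3 * (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Hypothetical usage: one fraud record among three legitimate ones,
# with an arbitrary down-weighting of the legitimate class.
labels = [1, 0, 0, 0]
probs = [0.9, 0.2, 0.1, 0.4]
print(weighted_log_loss(labels, probs, lam3=0.25))
```

With `lam3 = 1` this reduces to the usual binary cross-entropy; values below 1 reduce the contribution of the far more numerous legitimate records.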

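The attention weights of Equations (2) and (4) are a softmax over per-slice scores, scaled by (1 − λ). The sketch below illustrates this for flattened slices in plain Python. It rests on simplifying assumptions that are not in the paper: the scoring network g is reduced to a single shared weight vector `w` and `bias` followed by ReLU, and the function returns the weighted sum of Equation (1) as a flat vector rather than a reshaped tensor. The name `slice_attention` is hypothetical.

```python
import math


def relu(x):
    return x if x > 0.0 else 0.0


def slice_attention(slices, w, bias, lam):
    """Attention over slices, following Equations (1) and (2).

    slices: list of slices, each unfolded to a flat list of floats.
    w, bias: parameters of a one-unit fully connected scoring layer
             (a simplification of g assumed for this sketch).
    lam:    penalty factor in [0, 1]; scores are scaled by (1 - lam).
    Returns (weights, rep): softmax weights and the weighted sum of slices.
    """
    scores = [relu(sum(wi * xi for wi, xi in zip(w, s)) + bias) for s in slices]
    scaled = [(1.0 - lam) * sc for sc in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in scaled]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(slices[0])
    rep = [sum(a * s[j] for a, s in zip(weights, slices)) for j in range(dim)]
    return weights, rep


# Hypothetical usage with three tiny slices.
slices = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
for lam in (0.0, 0.5, 1.0):
    weights, rep = slice_attention(slices, w=[1.0, 2.0], bias=0.0, lam=lam)
    print(lam, weights, rep)
```

In this form the penalty factor acts like an inverse temperature: at `lam = 1` all slices receive equal weight, while smaller values let the learned scores concentrate the weight on a few slices. The spatial layer of Equations (3) and (4) would apply the same computation to the spatial slices of the temporal output.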
Experiments

Experiment Settings
Datasets We collected fraud transactions from a major commercial bank; the data comprise real-world credit card transaction records spanning twelve months, from Jan 1 to Dec 31, 2016. The ground truth labels are reported by consumers and confirmed by domain experts in the financial institution. We first filtered suspicious records by domain experts and user reports. Based on this process, the large amount of user records with both legitimate and inactive transactions was excluded. If a trade is reported by a cardholder or identified by financial experts as fraudulent, we label it as 1; otherwise, it is labeled as 0. Finally, the dataset includes 236,706 transaction records, by 1021 users, across 1160 location codes, which were affected by fraud.
In preprocessing, each transaction includes four attributes. We simplify them as: user ID, timestamp, location code and transaction amount. We encode categorical data, such as user ID and location code, into one-hot representations. We round the time record from the millisecond level to a standard DateTime format (yyyy-MM-dd HH:mm:ss). The amount attribute, like many other financial signals, follows a distribution with a long tail. We first cut off the outliers by the three-sigma rule (Friedrich Pukelsheim et al. 1994) and then perform a log transform on the amount value.

Compared Methods We employ the following state-of-the-art methods on our benchmark dataset to highlight the effectiveness of the proposed STAN. In these experiments, the tasks are learned independently. These baselines include: LR (Logistic Regression) (McMahan 2011), GBDT (Ke et al. 2017), MLP (Tang, Deng, and Huang 2015), Deep & Wide (Cheng et al. 2016), CNN-max (Fu et al. 2016), AdaBM (Randhawa et al. 2018), LSTM-seq (Jurgovsky et al. 2018). STAN-notemp/nospat/no3d denote sub-models of STAN in which, respectively, the temporal attention is not used, the spatial attention is not used, and a 2D convolution layer replaces this paper’s proposed 3D ConvNet. STAN-all is the full spatio-temporal attention-based neural network model proposed in this paper.

Parameter Settings and Evaluation Metrics In this experiment, we apply the preferred parameters for each of the baseline methods as they were originally proposed. For STAN, we employ 2 convolution layers, each with a 4 × 4 × 4 convolution kernel, followed by a max-pooling layer. Two fully connected layers are added on top of the 3D ConvNet, each of them consisting of 32 neurons. We set the temporal and spatial parameters (λ1 and λ2) by cross validation. The sample weight λ3 is set by the training distribution. We evaluated the detected results by precision, recall, and F-Score. In our implementation, we tried all possible threshold probabilities in our KS-test from 0 to 1 with a step size of 0.01. To determine the most effective threshold, we tested the detection result against the ground truth labels. We also report the AUC (area under the ROC curve) in our experiments.

Fraud Detection
We evaluated the performance of different models for the fraud detection task. Records of the first nine months were used as training data and then we predicted the fraud transactions in the following three months (Oct, Nov and Dec). We repeated the experiment 10 times and report the average AUC in Table 1.

Table 1: Performance comparison with baselines.

Model          AUC (Oct)   AUC (Nov)   AUC (Dec)
LR             0.7247      0.7163      0.7199
GBDT           0.7868      0.7949      0.7864
MLP            0.7803      0.8012      0.7891
Deep & Wide    0.8210      0.8197      0.8108
CNN-max        0.8352      0.8367      0.8267
AdaBM          0.8243      0.8249      0.8232
LSTM-seq       0.8368      0.8353      0.8290
STAN-notemp    0.8467      0.8395      0.8406
STAN-nospat    0.8602      0.8531      0.8583
STAN-no3d      0.8569      0.8562      0.8506
STAN-all       0.8832∗     0.8789∗     0.8865∗

The first seven rows of Table 1 contain the results of some of the latest baselines. Among all baselines, CNN-max and LSTM-seq proved to be competitive, demonstrating the necessity of deep models for fraud detection. Rows 8-11 show the results of STAN and some of its submodels. STAN-notemp’s performance is close to that of CNN-max; the spatial attention layer proves to be effective. STAN-nospat and STAN-no3d perform much better than the baselines and STAN-notemp, which validates the importance of each submodule. STAN-all outperforms all the other models.

Precision-recall Curves
In Figure 5 we present the precision-recall curves for the latest baselines. As shown, our proposed STAN performs better than the baselines with respect to the area under the precision-recall curves. The results of AdaBM and Deep & Wide are quite similar; both of them are much better than LR. Essentially, this might be because fraud patterns in credit card transaction records are too complex for a simple linear model like LR to address. With the help of deep structures, LSTM-seq achieves a slight improvement compared with AdaBM. Among all baselines, LSTM-seq and CNN-max are shown to be the most competitive. The reason might be that they preserve deep representations of raw features and explicitly make use of the spatial features of our problem, while Deep & Wide and AdaBM are not optimized for local spatial patterns.
Our method, STAN, consistently outperforms other state-of-the-art baselines. The reason is twofold: (1) STAN deals with both spatial and temporal features and integrates them into an attention network, in contrast with CNN-max, which only deals with spatial features and cannot address temporal patterns of transaction records; (2) STAN uses a 3D convolutional network for tensor features instead of 2D convolution so that it can better model spatio-temporal feature learning. Specifically, our methods work as well as, or even better, at
the very beginning of the curve relative to the compared methods. More importantly, our methods can accurately detect many more fraud transactions (high recall) with a relatively high precision, which is quite promising.

Figure 5: Precision recall curve of STAN compared with baseline methods.

Parameter Sensitivity
In this section, we study the model generalization, which includes the penalty parameters, the depth of hidden convolution layers and their impact on our task.
We vary the temporal and spatial penalty parameters (λ1 and λ2) from 0 to 1 with a step of 0.02. As shown in Figure 6a, the parameter λ has a great influence on the model performance. Our model performs better when λ is increased from 0 to 0.1, and the AUC reaches its peak when λ1 = 0.1 and λ2 = 0.15. The performance degrades if we keep on increasing the value of λ. The reason might be that varying λ lets the model consider a proper spatio-temporal window. When we increase λ from 0 to 1, our proposed model considers features in a different spatio-temporal range and reaches a performance peak around λ1 = 0.1 and λ2 = 0.15.
Figure 6b shows the influence of the depth of hidden convolution layers on the AUC. With deeper hidden convolution layers, the model tends to aggregate the temporal and spatial information from a neighborhood into a wider range. As we have seen, the model with a depth of 1 hidden layer does not work well because the information we have is mixed. Our model needs to “swap” information in terms of temporal, spatial and feature aspects, which requires a convolution of at least two hops to display.

Figure 6: Parameter sensitivity experiment on temporal, spatial penalty parameters and the depth of convolution layers.

Case Studies
Table 2 shows the learned coefficients of spatial and temporal attention layers, in which the “Weeks”, “Seconds” and “Hours” weights are noticeable. This is because user behavior normally shows a periodic distribution on a weekly basis, but the fraudulent trades are concentrated in an instant until exceeding the user’s credit limit (there could be more than 100 transactions in one second). This phenomenon is also reported by (Lepoivre et al. 2016). In spatial studies, we present the top seven attention weights in Table 2.

Table 2: The value of attention coefficients.

Temporal    Coefficient    Spatial    Coefficient
Seconds     0.2625         #13        0.1468
Minutes     0.0509         #21        0.1092
Hours       0.2053         #36        0.0371
Days        0.0969         #39        0.0308
Weeks       0.2971         #42        0.0227
Months      0.0161         #47        0.0214
Quarters    0.0003         #48        0.0148

In order to uncover the fraud patterns from learned attention weights, we conducted an empirical study on infected accounts with our collaborating domain experts. We first randomly selected 1000 fraud transactions and then backtracked other records within one week before the fraud event in the infected account. Finally, we grouped the records from infected users by hour and aggregated them by summarizing the spending amount and the number of trades per location ID, as shown in Figure 7. We get the following observations:
Temporally: on average, fraud transactions account for over 70% of a user’s credit limit, illustrated in hour 166 of the x-axis (fraud event time) of Figure 7a, which means an average of 70% loss for each fraud event. We notice that a small equal number of trades are usually issued in 1-3 hours (between hour 162-165 on the x-axis) before the event. After analyzing the records, domain experts confirmed that these are trade attempts by fraudsters. This small number of attempts is important for fraudsters: 1) if successful, the card
367
will be transferred for a large number of fraud transactions; 2) if failed, the cardholder might not notice the tiny amount of failed trade attempts. We also observe that the number of trades in hours 140-160 is low compared with hours 0-140, which suggests the cardholder may have lost the card a whole day before the fraud event.

Figure 7: Case studies of attention weights. We randomly extract 1000 fraud transactions and backtrack their records for the one week before the fraud occurrence. (a) shows the hourly aggregated trading amount. (b) displays the heatmap of transaction locations in an hourly summary.

Spatially: users clearly have a location propensity, as shown in Figure 7b, where a brighter color (red) means a higher frequency. We observe that the two most popular trade merchants are located at IDs #42 and #43, which are two popular online payment systems. It should be noted that fraud transactions are concentrated in a limited number of locations, such as #13 and #21, which are generally different from the user’s historically frequent trading locations. This study confirms our intuition of spatial aggregation and the learned spatial attention coefficients.

Related work
We summarize the related work in two main areas: 1) attentional convolutional neural networks and 2) credit card fraud detection.

Attentional Convolution Neural Network. Many recent works have shown the benefit of combining an attention mechanism with convolutional neural networks for a wide range of prediction tasks (Allamanis, Peng, and Sutton 2016; Vaswani et al. 2017), such as depth estimation (Xu et al. 2018), default prediction (Cheng et al. 2019a) or language understanding (Shen et al. 2018). For instance, pervasive attention is employed on 2D convolutional neural networks for sequence-to-sequence prediction (Elbayad, Besacier, and Verbeek 2018). Attention-gated networks have been considered for integrating multi-scale information in (Xu et al. 2017). In (Chen et al. 2016) an attention model is employed for combining multi-scale features in the context of semantic segmentation and object contour detection. Our approach develops from a similar intuition but further integrates an attention model in both spatial and temporal aspects, which significantly improves the accuracy of the detection. To our knowledge, this is the first paper exploiting joint learning of attention mechanisms with 3D convolutional networks in the context of credit card fraud detection.

Credit Card Fraud Detection. Several machine learning techniques have been used in the literature to approach the credit card fraud detection problem. (Maes et al. 2002) tried Bayesian Belief Networks (BBN) and Artificial Neural Networks (ANN) on a real dataset obtained from Europay International. In (Zaki, Meira Jr, and Meira 2014) neural network based models and decision tree models are compared, and the authors found that neural networks outperform decision trees. The authors in (Fu et al. 2016) show that using a convolution model to extract spatial patterns can achieve higher accuracy compared with neural networks, SVMs and decision trees. (Randhawa et al. 2018) applied AdaBoost and majority voting on fraud records. (Jurgovsky et al. 2018) approached this task from a sequence classification perspective with an improved LSTM model. These methods, however, feed manually generated features into a classification model directly, which ignores joint feature learning over spatial and temporal patterns. As a result, they may not be appropriate for real-world large-scale fraud detection systems with complex and unpredictable fraud patterns. The approach we present in this paper is radically different, as we employ a structured attention model which is jointly learned within a 3D CNN framework.

Conclusion
In this paper, we present a novel attentional 3D convolutional neural network for credit card fraud detection. In particular, we uncover the weaknesses of fraudsters, called “temporal aggregation” and “spatial aggregation”, and propose a 3D convolutional neural network approach based on a spatio-temporal attention mechanism. This is the first work in which an attentional 3D ConvNet has been applied to the credit card fraud detection problem. Our methods achieve promising AUC and precision-recall curves compared with other state-of-the-art baseline methods. Furthermore, we uncover fraud patterns by observing the learned attention weights in case studies. The proposed method is extensively evaluated in an online transaction post-analysis system. The results demonstrate that our methods can effectively detect fraudulent transactions. In the future, we are interested in building a real-time in-process fraud detection system based on an online learning mechanism instead of the offline training approach.

Acknowledgments
The work is supported by the National Key R&D Program of China (2018AAA0100704), the China Postdoctoral Science Foundation, the National Basic Research Program of China (2015CB856004), and the Key Basic Research Program of Shanghai Science and Technology Commission, China (16JC1402800).

References
Allamanis, M.; Peng, H.; and Sutton, C. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, 2091–2100.
Bahnsen, A. C.; Aouada, D.; Stojanovic, A.; and Ottersten, B. 2016. Feature engineering strategies for credit card fraud detection. Expert Systems with Applications 51:134–142.
Bhattacharyya, S.; Jha, S.; Tharakunnel, K.; and Westland, J. C. 2011. Data mining for credit card fraud: A comparative study. Decision Support Systems 50(3):602–613.
Carneiro, N.; Figueira, G.; and Costa, M. 2017. A data mining based system for credit-card fraud detection in e-tail. Decision Support Systems 95:91–101.
Chen, L.-C.; Yang, Y.; Wang, J.; Xu, W.; and Yuille, A. L. 2016. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3640–3649.
Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, 7–10. ACM.
Cheng, D.; Tu, Y.; Ma, Z.; Niu, Z.; and Zhang, L. 2019a. Risk assessment for networked-guarantee loans using high-order graph attention representation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, 5822–5828. AAAI Press.
Cheng, D.; Zhang, Y.; Yang, F.; Tu, Y.; Niu, Z.; and Zhang, L. 2019b. A dynamic default prediction framework for networked-guarantee loans. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2547–2555. ACM.
Elbayad, M.; Besacier, L.; and Verbeek, J. 2018. Pervasive attention: 2d convolutional neural networks for sequence-to-sequence prediction. arXiv preprint arXiv:1808.03867.
Fiore, U.; De Santis, A.; Perla, F.; Zanetti, P.; and Palmieri, F. 2017. Using generative adversarial networks for improving classification effectiveness in credit card fraud detection. Information Sciences.
Fu, K.; Cheng, D.; Tu, Y.; and Zhang, L. 2016. Credit card fraud detection using convolutional neural networks. In International Conference on Neural Information Processing, 483–490. Springer.
Gómez, J. A.; Arévalo, J.; Paredes, R.; and Nin, J. 2018. End-to-end neural network architecture for fraud scoring in card payments. Pattern Recognition Letters 105:175–181.
Jiang, C.; Song, J.; Liu, G.; Zheng, L.; and Luan, W. 2018. Credit card fraud detection: A novel approach using aggregation strategy and feedback mechanism. IEEE Internet of Things Journal 5(5):3637–3647.
Jurgovsky, J.; Granitzer, M.; Ziegler, K.; Calabretto, S.; Portier, P.-E.; He-Guelton, L.; and Caelen, O. 2018. Sequence classification for credit-card fraud detection. Expert Systems with Applications 100:234–245.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T.-Y. 2017. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, 3146–3154.
Lepoivre, M. R.; Avanzini, C. O.; Bignon, G.; Legendre, L.; and Piwele, A. K. 2016. Credit card fraud detection with unsupervised algorithms. Journal of Advances in Information Technology 7(1).
Maes, S.; Tuyls, K.; Vanschoenwinkel, B.; and Manderick, B. 2002. Credit card fraud detection using bayesian and neural networks. In Proceedings of the 1st international naiso congress on neuro fuzzy technologies, 261–270.
McMahan, H. 2011. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. Journal of Machine Learning Research Proceedings Track 15:525–533.
Patidar, R.; Sharma, L.; et al. 2011. Credit card fraud detection using neural network. International Journal of Soft Computing and Engineering (IJSCE) 1(32-38).
Randhawa, K.; Loo, C. K.; Seera, M.; Lim, C. P.; and Nandi, A. K. 2018. Credit card fraud detection using adaboost and majority voting. IEEE access 6:14277–14284.
Seeja, K., and Zareapoor, M. 2014. Fraudminer: A novel credit card fraud detection model based on frequent itemset mining. The Scientific World Journal 2014.
Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Pan, S.; and Zhang, C. 2018. Disan: Directional self-attention network for rnn/cnn-free language understanding. In Thirty-Second AAAI Conference on Artificial Intelligence.
Tang, J.; Deng, C.; and Huang, G.-B. 2015. Extreme learning machine for multilayer perceptron. IEEE transactions on neural networks and learning systems 27(4):809–821.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
Wang, D.; Chen, B.; and Chen, J. 2019. Credit card fraud detection strategies with consumer incentives. Omega 88:179–195.
Xu, D.; Ouyang, W.; Alameda-Pineda, X.; Ricci, E.; Wang, X.; and Sebe, N. 2017. Learning deep structured multi-scale features using attention-gated crfs for contour prediction. In Advances in Neural Information Processing Systems, 3961–3970.
Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; and Ricci, E. 2018. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3917–3925.
Zaki, M. J.; Meira Jr, W.; and Meira, W. 2014. Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press.
