A Fraud Detection Visualization System Utilizing R
1 Introduction
2 Detection Procedure
Fraud detection has been studied extensively in the literature. To the best of our knowledge, there exist only a few works oriented exclusively towards occupational fraud detection. Luell [18] utilizes data-mining and visualization techniques to detect client-advisor fraud in a financial institution. Eberle and Holder [12] detect structural anomalies in transactions and processes perpetrated by employees using a graph representation. SynerScope [23] is an industrial visualization tool for analyzing “Big Data”, capable of detecting financial fraud using a visualization scheme similar to ours that represents the billing links and relations between the company and other entities. The main difference is that our system is oriented exclusively towards occupational fraud detection based on patterns suggested by auditors and, thus, it is equipped with a detection mechanism that preprocesses the data, while the visualization conforms to these patterns. We also utilize animations for the detection of fraud schemes in order to avoid cluttering the visualization. A visualization system based on concentric circles was presented in [2], aiming at identifying periodic events using an algorithm for periodicity detection. Our system extends the one presented in [1], which detects periodic patterns that may conceal occupational fraud, in several ways: (i) the visualization of our system provides a complete view of all the examined patterns and the results of the examination of each pattern (in [1], the detection procedure was a “black-box” determining the order in which the clients were presented in a video representing their activity, with suspicious clients appearing first; partial results of the detection procedure were illustrated in additional plots and charts, which hindered the investigation); (ii) the detection mechanism is based on a decision tree, although we have incorporated most of the patterns presented in [1]; (iii) for periodicity detection, we apply a variation of the Longest Common Subsequence algorithm [24] that tackles noisy data; (iv) a parallel coordinates plot has been added to detect unusual employee behavior (unauthorized access to computers, business systems, etc.); (v) the system provides a database viewer to facilitate the investigation procedure.
Many of the existing publications that deal with fraud detection in general make use of data-mining techniques [4], [17], [19]. The Financial Crimes Enforcement Network AI System [14], [21] correlates and evaluates all reported transactions for indications of money laundering using pattern-matching techniques. The NASD Regulation Detection System [16], [20] identifies patterns of violative activity in the Nasdaq stock market by combining pattern-matching and data-mining techniques and provides visualizations of the results. The Link Analysis Workbench [25] searches for criminal activity and terrorism in noisy and incomplete data, also utilizing pattern-matching techniques. Visualization techniques have also been proposed for financial fraud detection. 3D-treemaps have been used to monitor real-time stock market performance and to identify stocks that may exhibit unusual trading patterns [15]. “WireVis” [7] is a system that provides interactive visualizations of financial wire transactions and aims to detect financial fraud.
A system that correlates data and discovers complex networks of po-
Fig. 2: The startup screen of the system. Custom queries can be performed and results
are presented in the log viewer. The data have been anonymized for security
and data privacy reasons.
client c is related to e and is also related to employee u. For a pair of employee-client (u, c), an event-series T_{u,c} = {e_1^{(u,c)}, e_2^{(u,c)}, . . .} is a sequence of events e_i^{(u,c)} = (t_i, u, c, a_i) related to client c and employee u.
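To make the notation concrete, the following minimal Python sketch groups raw log entries into per-pair event-series; the tuple layout (t, u, c, a) follows the definition above, while the function name and the input format are illustrative assumptions rather than part of the system.

```python
from collections import defaultdict
from operator import itemgetter

def build_event_series(events):
    """Group raw events (t, u, c, a) into event-series T[(u, c)],
    each ordered by time-stamp t."""
    series = defaultdict(list)
    for t, u, c, a in events:        # time-stamp, employee, client, action
        series[(u, c)].append((t, u, c, a))
    for pair in series:
        series[pair].sort(key=itemgetter(0))
    return dict(series)
```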
[Figure: the decision tree. Its first layer distinguishes between “No pattern”, “Same pair of Employee-Client”, “Unauthorized Action”, and “Unauthorized System”; the subsequent layers test “Periodicity?”, “Access Within Working Hours?”, “Is Employee a Frequent User of the System?”, and “Common Actions?”, with each Yes/No answer contributing a 0/1 value and each layer carrying a weight w1, . . . , w5.]
Fig. 3: The decision tree based on which the event-series of the employees are assigned
a severity value.
systems that she is not authorized to use. In the case where pattern (ii) or (iii) occurs, the event-series of the employee is assigned the maximum value (i.e., value 1) so that the employee is clearly distinguished in the visualization. However, in the case where pattern (i) occurs, the investigation has to proceed further. The patterns that are taken into consideration in this case include the following:
– Event-series periodicity: A common pattern while examining such fraud schemes is the occurrence of the events on a regular time basis. For instance, an employee intentionally modifies the account of a client every month within the billing cycle of the account and, more precisely, before its billing date. Assuming that the event-series T_{u,c} related to employee u and client c is ordered according to the time-stamps of the events, the system aims to detect similarities between pattern time-series based on a variation of the Longest Common Subsequence (LCSS) algorithm for time-series [24], which is robust under noisy conditions (a simplified sketch of this comparison is given after this list). The pattern time-series include the ideal time-series in which the events between the entities appear in time intervals that equal exactly 1, 7, 15, or 30 days, as well as other time-series identified in the past as fraud patterns. In the case where similarity with any of the above time-series is detected, we consider the event-series of the employee to be periodic.
– Events occurring outside working hours: Fraudulent activities usually occur outside working hours, on weekends, on holidays, or at the end of the employee’s shift. For this reason, if such events occur, they have to be taken into consideration.
– Employee frequency in recorded systems: Each employee ac-
cording to her responsibilities operates in specific business systems.
If this is not the case, then the employee has to justify the recorded
event. Also, in several systems such as fraud management systems
(FMS), it is expected that an employee monitors the activity of a
suspicious client. Hence, events stemming from these systems have
to be given smaller weight.
– Actions taken by the employee: Similarly to the previous case, there exist some actions that an employee is unlikely, though not unauthorized, to perform since they do not conform to her responsibilities.
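As a rough illustration of the periodicity check described in the first pattern above, the sketch below compares the inter-event gaps of an event-series against ideal constant-period pattern time-series using a basic threshold-based LCSS. The actual system uses the LCSS variation for time-series of [24]; the noise tolerance eps, the similarity threshold, and the exact handling of the pattern time-series here are assumptions for illustration only.

```python
def lcss_similarity(a, b, eps):
    """Longest common subsequence of two numeric sequences, where values
    match if they differ by at most eps (noise tolerance), normalized by
    the length of the shorter sequence."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m] / min(n, m)

def appears_periodic(days, periods=(1, 7, 15, 30), eps=2, threshold=0.8):
    """days: ordered event time-stamps in days. The series is flagged as
    periodic if its gap sequence resembles any ideal constant-period pattern."""
    gaps = [b - a for a, b in zip(days, days[1:])]
    if not gaps:
        return False
    return any(lcss_similarity(gaps, [p] * len(gaps), eps) >= threshold
               for p in periods)
```

For instance, time-stamps on days [0, 30, 61, 90, 121] give gaps [30, 31, 29, 31], which match the 30-day pattern within eps = 2 and would therefore be flagged as periodic.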
The idea behind the decision tree is to assign one fraud pattern to each layer and to create a path according to the result of the examination of each pattern. We consider the importance of the patterns based on their corresponding layer in the decision tree, such that the higher ones (closer to the root) are more important. Let x = [x1, x2, x3, x4, x5] be the pattern vector examined (e.g., x = [0, 0, 0, 0, 0] corresponds to non-fraudulent activity) and let y = [y1, . . . , y5] be the vector resulting from the traversal of the decision tree from its root to its leaves according to the evaluation of the events of pair (u, c) on each factor. If the examination of the events leads to “Unauthorized action” or “Unauthorized system”, the event-series is directly assigned value 1. Otherwise, each tree layer i is assigned a weight, say wi, i = 1, . . . , 5, based on the formula presented in [22], such that dissimilarities between the two vectors that occur at higher levels of the tree are more important. For this reason, the distance between vectors x and y, say d(x, y), representing the dissimilarity from the pattern vector,
is calculated by applying the normalized weighted Euclidean distance metric $d(x, y) = \sqrt{\sum_{i=1}^{5} (x_i - y_i)^2 \, w_i} \,\big/\, \sqrt{\sum_{i=1}^{5} w_i}$. This value
(or the maximum of the already calculated values, if more than one fraud pattern exists) corresponds to the severity value of the event-series of employee u, which is the value finally assigned to the employee. Based on these values, the system generates a heat-map representing all employees by rectangular nodes and gradient colors from blue to red (refer to the upper-left heat-map of Figure 1), such that nodes colored close to red represent employees with strong indications of fraud, whereas blue-colored nodes represent employees with no suspicion of fraud. Similarly to the severity calculation for the event-series of employees, the system also assigns a severity value to clients based on the above patterns. The only difference is that for a given client c, the severity value is calculated on all the events related to c (not only the ones that concern a specific employee). In this manner, a client involved in suspicious activity with two or more employees will be distinguished.
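A small numerical sketch of the severity computation may help. The layer weights below are illustrative placeholders, not the values the authors derive from [22], and the traversal vector is invented for the example.

```python
import math

def severity(x, y, w):
    """Normalized weighted Euclidean distance between the examined pattern
    vector x and the decision-tree traversal vector y; larger weights on the
    layers closer to the root make mismatches there count more."""
    num = math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))
    return num / math.sqrt(sum(w))

w = [5, 4, 3, 2, 1]           # assumed layer weights, root layer first
x = [0, 0, 0, 0, 0]           # reference pattern: non-fraudulent activity
y = [1, 1, 0, 0, 1]           # hypothetical traversal result for a pair (u, c)
print(round(severity(x, y, w), 2))   # 0.82 -> a red-leaning node in the heat-map
```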
Fig. 4: Description of the main-visualization of the system. “Fat” edges (unless referring
to cluster nodes) and in particular, the red-colored ones may be indications of
fraud that have to be further examined.
characterized by a specific color and occupies space proportional to the corresponding aggregate percentage of use by the entity. For each employee, this percentage is calculated based on the aggregated percentage of use over all clients that are currently drawn in the visualization, whereas for each client it is based on the percentage of use by the employee currently visualized (unless more than one employee related to a client is drawn simultaneously in the visualization). Similarly, for cluster nodes the aggregated percentage of use over all clients that belong to the cluster is calculated. Systems for which the employee is unauthorized or is not a frequent user are marked with an X (refer to reference point 3 of Figure 4). Layer L3 corresponds to the actions reported for each entity, which are drawn in a similar manner to the ones in layer L2. Again, unauthorized or suspicious actions are marked with an X. Layer L4 represents the percentage of events that occur within or outside working hours. The light-blue colored parts represent events occurring within working hours, whereas the light-red colored parts indicate the existence of events occurring outside working hours (e.g., see reference points 4 and 5 of Figure 4, respectively). For each client node there exists an additional layer (refer to L5 of Figure 4, e.g., see reference point 6) that indicates whether or not the event-series of the client is periodic. The event-series is compared with the pattern time-series currently stored in the system and a heat-map is generated indicating the degree of similarity with each pattern. Again, light-red colors indicate suspicious cases.
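The per-system shares drawn in layer L2 amount to simple aggregate proportions. The sketch below is a guess at that bookkeeping; it assumes each event also records the business system involved, which is not part of the event tuple defined earlier.

```python
from collections import Counter

def system_shares(events, visible_clients):
    """Share of events per business system for one employee, aggregated over
    the clients currently drawn in the visualization.
    Each event is assumed to be (t, client, system, action)."""
    counts = Counter(system for _, client, system, _ in events
                     if client in visible_clients)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}
```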
Regarding the investigation procedure, as already mentioned, the auditor specifies a threshold and the employees with an assigned severity value above the threshold are presented in the visualization one by one, together with their related clients. The auditor is able to start, pause or stop the video and process the visualization. In the case where a client node is selected, additional employees related to the client can be added to the visualization, which facilitates the possible identification of two or more employees that may cooperate in committing fraud (the case where an employee node is selected is treated similarly). In Figure 4, gray-colored nodes (see reference point 7) represent nodes added during post-processing when a client node is selected (refer to the node pointed to by the green arrow of Figure 4). In the case where more than one node representing an employee exists simultaneously in the visualization, the auditor is able to select one of them and add the related clients to the visualization. Gray color is used for non-selected employee nodes, together with their edges and related clients (if they are not also related to the selected employee), to avoid distracting the auditor.
Since it is possible to switch between employees during the investigation, when the post-processing of a case is completed, the animation resumes from the employee that was last visualized before the animation was paused. In the case where a cluster node is selected, the corresponding rectangles in the heat-map representing clients are marked with an X such that they can be added (if desired) to the visualization. However, the system permits only a specific number of additions of employee or client nodes in order to avoid cluttering the visualization area. If this number is exceeded, the system can optionally produce a visualization where only the innermost layer of the radial drawing (i.e., the one corresponding to the clients and
employees) is drawn along with the relations between them. Again, the width of the edges is proportional to the number of events that relate two entities. In the case where further investigation is needed, the auditor selects the desired node (which becomes larger) and the remaining layers of the radial drawing (i.e., L2–L5) that correspond to the particular node appear in the visualization. In this manner, the system is able to visualize a larger set of entities simultaneously and reveal the relations between them. However, this may slow down the investigation process and, for this reason, we adopted both the animation approach and the simultaneous visualization of all layers for each node.
The ordering of the nodes representing the employees is performed according to their severity value (the more suspicious nodes are presented first in the animation). The clients appear in arbitrary order since no crossings between edges connecting a particular employee with her related clients can exist. The only crossings that may occur are caused by gray-colored employees related to clients already visualized and, since these edges are also gray-colored, they do not confuse the auditor (see Figure 4). The gray-colored employees that are added to the visualization are placed either at the top or at the bottom of the already placed employee nodes to retain the relative positions of the already placed nodes. We have also chosen not to apply crossing-minimization heuristics since the addition of new nodes to the visualization may imply a rearrangement of the positions of the already placed nodes, disturbing the underlying mental map.
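The ordering of the animation frames reduces to filtering by the auditor's threshold and sorting by decreasing severity; a minimal sketch, with illustrative function and parameter names, is given below.

```python
def animation_order(severity_by_employee, threshold):
    """Employees whose severity value is at least the auditor-defined
    threshold, most suspicious first (the order of the animation frames)."""
    return sorted((u for u, s in severity_by_employee.items() if s >= threshold),
                  key=severity_by_employee.get, reverse=True)
```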
Fig. 5: A time-line plot representing the event-series for the selected pair of entities.
occurs for a specific action, the corresponding series will have a part (or the whole series) almost parallel to the x-axis.
Fig. 6: A periodicity-plot that presents all reported actions by distinct series for the
selected pair of entities.
5 Case Study
In this section, we present the results of the evaluation of the system on
real data-sets stemming from two control systems of a telecommunication
company. All data provided to us were anonymized for security and data privacy reasons. The data-set consists of approximately 180,637 entries lying within a time interval of six months and involves 710 distinct employees and 83,030 distinct clients. In the data-set, 66.2% of the clients had only one occurrence, 31.6% had between 2 and 5, 1.4% had between 5 and 10, while the remaining ones (i.e., 0.8%) had more than 10 occurrences.
Fig. 7: A parallel coordinates plot to monitor failed login attempts in a specific system.
rences. The auditors have also included a set of entries corresponding
to a “fictional” fraud case scenario where an employee modifies the ac-
count of a client. However, we were not communicated any information
regarding the billing date of the accounts of the clients.
Since one of the data-sets stems from a fraud management system, recurring activity between the same pair of employee and client is expected (a “suspicious” client reported by a fraud management system is expected to be supervised by an employee). For this reason, we concentrated our study on identifying pairs of employees and clients that appear to have more than 10 related events. The system identified 41 employees (5.8% of the total number of employees) that were related to the same client by more than 10 events (refer to the orange and red colored rectangles of the heat-map of Figure 8).
For each of the above employees, we had to calculate the similarity of the event-series with the pattern time-series (see Section 3) in order to detect periodicity. In particular, the auditors were interested in periodic events that occur monthly (i.e., a periodicity of 28 up to 30 days). Thus, we could have considered only the pattern time-series (refer to Section 3) that corresponds to monthly activity. However, we decided to calculate the severity values based on all pattern time-series in order to distinguish any recurring activity and to decide afterwards, through the visualization,
if suspicious activity really exists. Even though this approach creates more false positives that have to be investigated, it ensures that other possible periodic events will not be omitted. Among the 41 employees that were related to the same client by more than 10 events, 17 (2.4% of the total number of employees) appear to have periodic activity and events occurring outside working hours (refer to the red-colored rectangles of Figure 8), while the remaining 24 employees (3.4% of the total number of employees) had only events occurring outside working hours (refer to the orange-colored rectangles of Figure 8). Also, no employee had performed unauthorized access to systems or unauthorized actions, whereas only one employee had performed uncommon actions. The results were communicated to the auditors, who investigated whether real indications of fraud existed.
Fig. 8: A heat-map indicating the severity values for the employees that participated in the case study.
of Figure 9). The layer of the visualization that accommodates the actions Employee-26 has performed is split into two parts, representing “Action 6” and “Other”. Under the “Other” action, we have clustered actions whose percentage of use was too small (< 0.1%) to be visualized. The problem was caused by the large number of possible actions in this particular system, which does not permit the simultaneous visualization of all actions. However, if one of the clustered actions is not common for a particular employee, the red-colored heat-map in the corresponding layer and the X marking in the part representing the “Other” action would reveal the problem to the auditor.
outside working hours, but they appear to have strong periodic activity. In particular, there exist six heat-maps indicating strong similarities with pattern time-series, including the one that represents monthly periodic activity. For this reason, these clients have to be further investigated.
Fig. 10: All employees related to Client-3 are added to the visualization. The gray-colored node corresponds to the newly added employee (Employee-5), who, surprisingly, is related to almost all high-ranked clients of Employee-29.
reinforced by Figure 11b which represents all reported actions for the
selected pair of entities. In particular, only one action is performed by
Employee-29 towards Client-3 and according to the auditors this action
is part of a monitoring procedure. Studying, in a similar manner, the
time-line plot and the periodicity plot for Client-4, the auditors claimed
that there existed no indications of fraud (the recorded actions were also
part of a monitoring procedure).
Fig. 11: (a) The time-line plot for Employee-40 and Client-6 indicating an obvious monthly activity. (b) A plot illustrating the periodicity of the performed actions for Employee-40 and Client-6.
reference point 8). In the next step, we examined the time-line plot for the two entities (refer to Figure 13). One can distinguish a monthly periodic activity between June and September (the actual dates are 22/6, 21/7, 25/9), despite a small gap (no entries in August) and some noisy data (i.e., 7/7). The auditors that examined the case determined that there were no indications of fraud in this particular case, mostly because the actions performed by the employee were again common monitoring actions. Client-5 was reported by a fraud management system as a suspicious client and thus Employee-26 was monitoring this client on a regular time basis.
Figure 14 illustrates the fictional fraud case which was added by the auditors. In this scenario, Employee-40 is related to Client-6 using two business systems for which she is a frequent user and performing common actions (see reference points 1, 2, 3 and 4 of Figure 14). All the events occur within working hours (see reference point 5 of Figure 14). However, the event-series of Employee-40 appears to have strong similarity with six fraud pattern time-series, including the one that corresponds to a periodicity of one month, which is represented by the first heat-map of each periodicity layer (see reference point 6 of Figure 14). Also, Client-6
Fig. 13: The corresponding time-line plot for Employee-26 and Client-5.
was not related to any other employee (when selected, no other employee was added to the visualization). All other clients related to Employee-40 were placed in the low-severity cluster, whereas no client was placed in the medium-severity cluster (see reference points 7 and 8 of Figure 14, respectively). Even though this visualization closely resembles that of Employee-26 (see Figure 12), the time-line plot (see Figure 15) and the periodicity plot (see Figure 16) explain why this case is considered fraud. The first suspicions, according to the auditors, are raised by the fact that there exists activity between the two entities stemming from two business systems (they also take into consideration the type of the systems, information that was not communicated to us in full detail).
The auditors explained to us that the time-line plot (refer to Figure 15) matches a fraud case scenario according to which there exists some activity at the beginning (between April and May) with no specific periodicity and then periodic activity appears (from May to September). In the first time interval, the fraudster is trying to plan and organize her fraud by performing a number of actions. Once the fraud is organized, only periodic actions are required. Another suspicious fact in this case is that the events from May to September occur close to the same dates of each month (from the 10th to the 15th). Of course, there exist some “noisy” data that have to be excluded in order to understand the fraud pattern. These may have been caused on purpose, to cover up the fraud, or may have been part of the duties of the employee.
The above assumption is reinforced by the plot of Figure 16, which reveals the periodic occurrence of each performed action. For instance, “Action 101” (see reference point 1) appears to have periodicity around the 15th day of the month from its second occurrence onwards. Also, “Action 107” (see reference point 2) appears to have periodicity around the 11th–13th day of the month from its third occurrence onwards. In particular, the vast majority of events are recorded between the 10th and the 15th day of the month. We could be more convinced that this case constitutes fraud if we knew the exact billing cycle of the client’s account.
In a similar manner, the frames of the animation illustrating the other highly-ranked employees were investigated. In particular, we gave more attention to the 17 frames of the animation containing periodic events. Since the data-sets provided for the case analysis were sensitive, we were not given many details about the final results of the investigation. Fortunately for the company, the only real evidence of fraud existed in the fictional data added by the auditors. However, the auditors had not identified all of these cases while examining the data-sets manually and they had to conduct an additional investigation of them.
Fig. 14: A frame of the animation illustrating the activity of Employee-40. This case
corresponds to a fictional fraud case scenario.
Fig. 15: The time-line plot for Employee-40 and Client-6 indicating an obvious
monthly activity.
Fig. 16: A plot indicating the periodic pattern for the performed actions of Employee-
40 and Client-6.
ACKNOWLEDGEMENTS
The work of Evmorfia N. Argyriou has been co-financed by the Euro-
pean Union (European Social Fund - ESF) and Greek national funds
through the Operational Program "Education and Lifelong Learning" of
the National Strategic Reference Framework (NSRF) - Research Fund-
ing Program: Heracleitus II. Investing in knowledge society through the
European Social Fund.
References
1. E. N. Argyriou, A. Sotiraki, and A. Symvonis. Occupational fraud
detection through visualization. In ISI, pages 4–7, 2013.
2. E. N. Argyriou and A. Symvonis. Detecting periodicity in serial data
through visualization. In ISVC, volume 7432, pages 295–304, 2012.
3. Association of Certified Fraud Examiners. Report to the Nation on
Occupational Fraud and Abuse, 2012.
4. R. J. Bolton and D. J. Hand. Statistical fraud detection: A review. Statistical Science, 17, 2002.
5. D. Borland and R. M. Taylor II. Rainbow color map (still) considered
harmful. IEEE Comput. Graph. Appl., 27(2):14–17, 2007.
6. U. Brandes, P. Kenis, and D. Wagner. Communicating centrality in
policy network drawings. IEEE Transactions on Visualization and
Computer Graphics, 9(2):241–253, 2003.
7. R. Chang, A. Lee, M. Ghoniem, R. Kosara, and W. Ribarsky. Scal-
able and interactive visual analysis of financial wire transactions for
fraud detection. Information Visualization, 7(1):63–76, 2008.
8. F. Chevenet, C. Brun, A.-L. Banuls, B. Jacq, and R. Christen. Tree-
dyn: towards dynamic graphics and annotations for analyses of trees.
BMC Bioinformatics, 7(1):1–9, 2007.
9. W. Didimo, G. Liotta, and F. Montecchiani. Vis4aui: Visual analysis
of banking activity networks. In GRAPP/IVAPP, pages 799–802,
2012.
10. W. Didimo, G. Liotta, F. Montecchiani, and P. Palladino. An ad-
vanced network visualization system for financial crime detection. In
PacificVis, pages 203–210, 2011.
11. S. N. Dorogovtsev and J. F. F. Mendes. Evolution of Networks:
From Biological Nets to the Internet and WWW (Physics). Oxford
University Press, Inc., New York, NY, USA, 2003.
12. W. Eberle and L. B. Holder. Mining for insider threats in business
transactions and processes. In CIDM, pages 163–170, 2009.
13. E. D. Giacomo, W. Didimo, G. Liotta, and P. Palladino. Visual
analysis of financial crimes. In AVI, pages 393–394, 2010.
14. H. G. Goldberg and T. E. Senator. Restructuring databases for
knowledge discovery by consolidation and link formation. In KDD,
pages 136–141, 1995.
15. M. L. Huang, J. Liang, and Q. V. Nguyen. A visualization approach
for frauds detection in financial market. IV ’09, pages 197–202, 2009.
16. J. D. Kirkland, T. E. Senator, J. J. Hayden, T. Dybala, H. G. Goldberg, and P. Shyr. The NASD Regulation Advanced Detection System (ADS). In AAAI ’98/IAAI ’98, pages 1055–1062, 1998.
17. Y. Kou, C.-T. Lu, S. Sirwongwattana, and Y.-P. Huang. Survey
of fraud detection techniques. In Networking, Sensing and Control,
2004 IEEE Int. Conf., volume 2, pages 749–754, 2004.
18. J. Luell. Employee fraud detection under real world conditions. PhD
thesis, 2010.
19. C. Phua, V. C. S. Lee, K. Smith-Miles, and R. W. Gayler. A compre-
hensive survey of data mining-based fraud detection research. CoRR,
abs/1009.6119, 2010.
20. T. E. Senator, H. G. Goldberg, P. Shyr, S. Bennett, S. Donoho, and C. Lovell. The NASD Regulation Advanced Detection System: integrating data mining and visualization for break detection in the NASDAQ stock market. Book chapter, pages 363–371, 2002.
21. T. E. Senator, H. G. Goldberg, J. Wooton, M. A. Cottini, A. F. U. Khan, C. D. Klinger, W. M. Llamas, M. P. Marrone, and R. W. H. Wong. The Financial Crimes Enforcement Network AI System (FAIS): identifying potential money laundering from reports of large cash transactions. AI Magazine, 16(4):21–39, 1995.
22. W. G. Stillwell, D. A. Seaver, and W. Edwards. A comparison of
weight approximation techniques in multiattribute utility decision
making. Organ. Behavior and Human Performance, 28(1):62 – 77,
1981.
23. SynerScope. 2011. http://www.synerscope.com/.
24. M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. Keogh. In-
dexing multi-dimensional time-series with support for multiple dis-
tance measures. In ACM SIGKDD int. conf. on Knowledge discovery
and data mining, KDD ’03, pages 216–225, 2003.
25. M. Wolverton, P. Berry, I. Harrison, J. Lowrance, D. Morley, A. Ro-
driguez, E. Ruspini, and J. Thomere. Law: A workbench for approx-
imate pattern matching in relational data. In IAAI, pages 143–150,
2003.