Software-Based Phishing Defense
Abstract—Phishing is a form of cyber attack that leverages social engineering approaches and other sophisticated techniques to harvest personal information from users of websites. The average annual growth rate of the number of unique phishing websites detected by the Anti Phishing Working Group is 36.29% for the past six years and 97.36% for the past two years. In the wake of this rise, alleviating phishing attacks has received a growing interest from the cyber security community. Extensive research and development have been conducted to detect phishing attempts based on their unique content, network, and URL characteristics. Existing approaches differ significantly in terms of intuitions, data analysis methods, as well as evaluation methodologies. This warrants a careful systematization so that the advantages and limitations of each approach, as well as the applicability in different contexts, could be analyzed and contrasted in a rigorous and principled way. This paper presents a systematic study of phishing detection schemes, especially software based ones. Starting from the phishing detection taxonomy, we study evaluation datasets, detection features, detection techniques, and evaluation metrics. Finally, we provide insights that we believe will help guide the development of more effective and efficient phishing detection schemes.

Index Terms—Phishing, Phishing website detection, software based methods.

IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 19, NO. 4, FOURTH QUARTER 2017

Manuscript received December 16, 2016; revised May 8, 2017; accepted August 16, 2017. Date of publication September 13, 2017; date of current version November 21, 2017. (Corresponding author: Mohsen Guizani.)
Z. Dou and A. Khreishah are with the Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ 07102-1982 USA (e-mail: [email protected]; [email protected]).
I. Khalil is with the Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar (e-mail: [email protected]).
A. Al-Fuqaha is with the NEST Research Laboratory, College of Engineering and Applied Sciences Computer Science Department, Western Michigan University, Kalamazoo, MI 49008 USA (e-mail: [email protected]).
M. Guizani is with the Electrical and Computer Engineering Department, University of Idaho, Moscow, ID 83844-1023 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/COMST.2017.2752087
1553-877X © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

PHISHING, one form of cyber-attacks, continues to be a growing concern not only to cyber security specialists but also to e-business users and owners. The severity of such cyber attack vector is continuously growing with the exponential increase in digital information generation and the increased reliance of people and business on cyber space. The Anti-Phishing Working Group (APWG) has seen rapid growth in the number of unique phishing websites detected from 2014 to 2016 [19]. The average annual growth rate is 97.36% and is expected to continue to grow. Estimates of annual direct financial loss to the U.S. economy caused by phishing activities range from $61 million to $3 billion [49].

To mitigate the increasing damage caused by phishing, a broad range of anti-phishing mechanisms have been proposed over the past two decades. These anti-phishing techniques can be categorized into three broad groups [12]: (1) Detective solutions (e.g., website filtering); (2) Preventive solutions (e.g., strong authentication [32]–[34], [43], [53], [54], [85]); and (3) Corrective solutions (e.g., Site takedown [57], [58]). In this paper, we focus on detective solutions. More specifically, we look at software-based phishing detection schemes that are specialized in identifying and classifying phishing websites. This class of approaches is arguably more important than other approaches because it helps in reducing human errors. Preventative and corrective solutions take a different approach, but if the user behind the keyboard has been successfully tricked by a phishing attempt, and willingly submitted sensitive information, then no firewall, encryption software, certificates, or authentication mechanism can help in preventing the attack from materializing [49]. Software-based phishing detection also delivers improved results compared to detection by user education (e.g., [60], [61], and [98]) because phishing attacks normally aim at exploiting human weaknesses [59]. For example, a study of phishing detection using user education [97] shows a 29% false negative rate (FNR) for the best performance, while the software based approaches that are surveyed by the same study have FNR in the range of 0.1% to 10%. For this reason, we focus our study on software based phishing detection systems, and the term “phishing detection” will refer only to this form of detection in the rest of the paper.

Although the research area of phishing detection and classification is relatively rich, there is a lack of systematic analysis of the requirements, the capabilities, and the shortcomings of the existing anti-phishing techniques. For example, websites that offer identification and classification of phishing as a service have been popular in recent years, however, those services leverage different evaluation datasets from various sources at different time periods to validate their outcomes. Albeit those schemes may have similar performance results (e.g., in terms of false positive rate, true positive rate, etc.), it is difficult to compare their performance because of the variation in the evaluation datasets employed. Consequently, a systematic assessment of the datasets used to validate phishing detection approaches is desired, as well as necessary, in
order to provide a foundation for comprehensive comparisons among different phishing detection schemes, and ultimately, select the best in practice.

In this work, we complement the existing survey papers on phishing detection, including [49], [59], and [103], by providing a broad systematic analysis of software based anti-phishing approaches. Varshney et al. [103] focus on studying, analyzing, and classifying the most significant and novel detection techniques, and point out the advantages and disadvantages of each approach. On the other hand, we present a more comprehensive systematic review of phishing detection schemes, not only from the perspective of detection algorithms, but also from a broader perspective that covers other important aspects including the phishing detection life cycle, taxonomy of phishing detection schemes, evaluation datasets, detection features, and evaluation metrics and strategies. The work in [49] focuses more on the attack side of phishing. More specifically, it presents details about phishing attacks including the anatomy of such attacks, why people fall for phishing attacks and how bad phishing is. However, it only provides a high level analysis of the state-of-the-art phishing countermeasures. In order to provide a systematic review of the phishing detection research, we first present the necessary information about the phishing attacks by answering three questions: (1) What is phishing? (2) How does phishing work? and (3) What is the current status of phishing? Then, we conduct a systematic review of phishing detection schemes in a detailed and comprehensive manner. Finally, Khonji et al. [59] present a literature survey about anti-phishing solutions (e.g., user training, email filtering and website detection, etc.), including their classification, detection techniques and evaluation metrics. Compared to [59], we focus on the software based phishing website detection schemes, which have proved to be the most effective anti-phishing solutions and are not systematically studied in [59].

In a nutshell, the objective of this paper is to provide a systematic understanding of existing phishing detection studies and provide a comprehensive way to evaluate phishing detection approaches from different perspectives in order to guide future developments and validations of new or upgraded anti-phishing techniques.

We summarize our contributions in this work as follows:
• Compile a comprehensive profile of phishing through its various definitions, detailed ecosystem (i.e., in terms of phishing life cycle, actors involved and their operations, etc.), and the state-of-the-art phishing trends.
• Present a systematic review of the software based phishing detection schemes from different perspectives including the life cycle, taxonomy, evaluation datasets, detection features, detection techniques and evaluation metrics.
• Introduce a novel feature, Network Round Trip Time (NRTT), for efficient and real time detection of phishing attacks.
• Provide detailed takeaway lessons for researchers and practitioners in the area of phishing detection that we believe will help guide the development of effective phishing detection schemes.

TABLE I
MOST POPULAR DEFINITIONS OF PHISHING

The rest of the paper is organized as follows: Section II describes the state-of-the-art phishing attacks, and presents the life cycle of phishing detection approaches. Section III introduces the taxonomy of phishing detection schemes with the corresponding literature review. Section IV presents a systematic review of software based phishing detection schemes from different perspectives: (1) phishing detection datasets; (2) phishing detection features; (3) phishing detection techniques; and (4) evaluation metrics. Section V provides detailed takeaway lessons for researchers and practitioners in the area of phishing detection. Section VI concludes the paper.

II. BACKGROUND

A. State-of-the-Art Phishing Attacks

In this section, we first present the various definitions of phishing, then we introduce some statistics about phishing between January 2010 and June 2016. Finally, we describe the phishing ecosystem.

1) What Is Phishing?: There is no consensus on how phishing should be defined. Different phishing definitions lead to different research directions and approaches (e.g., email filtering or website detection). It is important to clearly identify the target of any phishing detection approach to avoid confusion about its applicability in different scenarios. The target and scope of phishing detection approaches can be analyzed from the definition of phishing which has been adopted by such approaches. Therefore, presenting a background on the different definitions of phishing can help the readers understand the scope and the capabilities of different approaches. Table I summarizes the popular definitions of phishing. On one hand, the definitions of PhishTank [81], APWG [19], Xiang et al. [118],
DOU et al.: SOK: SYSTEMATIC REVIEW OF SOFTWARE-BASED WEB PHISHING DETECTION 2799
(e.g., brand monitoring [12]). Due to financial constraints, many free-to-use Web hosting providers may not be able to afford deploying good anti-phishing security measures, which leaves their customers not only vulnerable, but even worse, attractive targets for phishing.

Anti-phishing institutes collect and analyze phishing data (e.g., suspicious websites reported by users) from various sources (e.g., users’ reports via anti-phishing toolbars), and provide anti-phishing suggestions and solutions (e.g., up-to-date phishing website blacklist, phishing detection toolbars, etc.). In addition, they may also cooperate with government agencies such as public security and law enforcement to detect and prevent cyber attacks [4].

3) What Is the Current State of Phishing?: According to phishing activity trends reports published by APWG [19] from Jan. 2010 to Jun. 2016 (shown in Figure 3), the number of
Fig. 3. The number of unique phishing sites per month from Jan. 2010 to Jun. 2016.
Fig. 4. The number of phishing sites that use HTTPS. Reprinted from [83].
study of the phishing detection research from five different perspectives, namely, classification of phishing detection techniques, validation datasets, detection features, detection techniques and detection criteria.

III. PHISHING DETECTION SCHEMES: TAXONOMY AND THE CORRESPONDING LITERATURE REVIEW

In phishing literature, software-based phishing detection schemes are usually categorized into heuristic and blacklist based schemes [49], [59]. Heuristic-based approaches examine contents of the Web pages including: (1) surface level content (e.g., the URL); (2) textual content (e.g., terms or words that appear on a given Web page); (3) visual content (e.g., the layout, and the block regions etc.) [122]. These methods can detect phishing attacks as soon as they are launched but also introduce relatively high false positive rates (FPR). Blacklist-based approaches have a higher level of accuracy. However, they do not defend against zero-hour attacks [49], [115]. Combinations of heuristic and blacklist based approaches provide more robust and flexible defense against phishing attacks than either one on a standalone basis.

In this paper, we classify phishing detection approaches as either public phishing detection toolbars or academic phishing detection/classification schemes. Phishing detection toolbars use blacklists and/or selected heuristics to identify phishing websites. There is usually little information about what heuristics these toolbars use and how they are used. Academic phishing detection solutions are similar to phishing detection toolbars, but usually apply more complex technologies and are usually not available/feasible for public use. Most academic phishing classification schemes feed combinations of heuristic features into various data mining algorithms to enhance the classification accuracy. Table IV summarizes the differences between phishing detection toolbars and academic phishing detection/classification schemes. Note, the “scheme details” column in Table IV estimates the amount of publicly available details about detection schemes, such as detection methodology, data mining algorithms, and datasets.

TABLE IV
SUMMARY OF DIFFERENCES BETWEEN PHISHING DETECTION TOOLBARS AND ACADEMIC PHISHING DETECTION/CLASSIFICATION SCHEMES

Furthermore, based on the heuristic/blacklist classification, we further classify the academic phishing detection approaches into more specific and fine-grained sub-categories, namely, (1) heuristic: URL based methods; (2) heuristic: page content based methods; (3) heuristic: visual similarity based methods; (4) heuristic: other methods; (5) blacklist based methods; (6) hybrid methods. Details about each category are introduced in Section IV-B.

A. Public Phishing Detection Toolbars

Many freely available anti-phishing toolbars offer detection and blocking services against Internet phishing attacks. These toolbars typically come in the form of Web browser extensions (i.e., default extensions or third party extensions) that warn users about a suspicious phishing site after clicking on its URL.

Publicly available anti-phishing toolbars are either embedded in the browser as default extensions (e.g., Microsoft SmartScreen Filter [112]) or can be downloaded from third party websites (e.g., Netcraft [3]). They both display security warnings on screen when certain actions are triggered in the browser. These security warnings can be classified into two types [59]:
• Passive warnings: Passive warnings display various information (e.g., user ratings, site suggestions, etc.) about the website that is currently being visited but do not block the content of the website, as depicted in Figure 6.
• Active warnings: Active warnings display warning information about the website a user is trying to visit and block the content of the website, as depicted in Figure 7.

Many studies have shown that the majority of Web service users ignore security warnings provided by anti-phishing toolbars [31], [35], [116]. Furthermore, Egelman et al. [36] found that active warnings are much more effective than passive warnings (79% of participants paid attention to active warnings while only 13% of participants paid attention to passive warnings). Table V summarizes the information gathered about the state-of-the-art anti-phishing toolbars. In the following paragraphs, we discuss the details of those toolbars:

Google Safe Browsing: It uses a browser to check URLs against Google’s constantly updated blacklist of unsafe Web resources (e.g., phishing websites) [50] and provides active warnings to the end users. According to Google Safe Browsing’s website, for different platform and threat types, it examines pages against the safe browsing lists. It also issues reminders before users access risky links.

McAfee SiteAdvisor: This is a Web application that reports on the identity of websites by scanning them for potential malware and spam [111]. The detection result is decided according to a combination of heuristics and manual verification, such as the age and country of the domain registration, the number of links to other known-good sites, third-party cookies, and user reviews [30]. In addition, it provides passive warnings.

Netcraft Anti-Phishing Toolbar: Provides Internet security services including anti-fraud and anti-phishing services, application testing and PCI scanning [113]. According to its website, Netcraft’s toolbar screens and identifies the deceiving contents in URLs. It also ensures that the navigational controls (e.g., toolbar and address bar) are activated in order to prevent pop-up windows (particularly for Firefox). In addition, it shows the geographic information of the hosting location of the sites and analyzes fraudulent URLs (e.g., the real
citibank.com or barclays.co.uk sites have little possibility to be located in the former Soviet Union [3]).

TABLE V
INFORMATION ABOUT SELECTED STATE-OF-THE-ART ANTI-PHISHING TOOLBARS

Fig. 6. Passive warnings from Netcraft anti-phishing toolbar. Reprinted from: http://toolbar.netcraft.com/.

SpoofGuard: A heuristics-based anti-phishing toolbar developed for Internet Explorer with passive warnings. The heuristics used include (1) Domain name check: examines if the domain name for the attempted URL matches recent entries; (2) URL Check: checks if the username, the port number, as well as the domain name, are suspicious; (3) Email Check: determines whether the current URL directs to the browser via email; (4) Password Field Check: determines if the input fields of type “password” are located in the document; (5) Link Check: searches for risky links in the body of the document; (6) Image Check: analyzes the images of the new site vs. the previous sites; (7) Password Tracking: prevents the user from typing the same username and password for multiple sites [63].

Microsoft SmartScreen Filter: A blacklist-based phishing and malware filter implemented in several Microsoft browsers, including Internet Explorer and Microsoft Edge [112]. When browsing the site, SmartScreen helps monitor and identify the possibility of visiting a suspicious page. If so, it issues an active warning before the next step is taken, as well as soliciting feedback from users. SmartScreen also maintains a list of reported phishing and software sites. It screens the list to check if a match is found. In that case, it issues a warning message while blocking the site for the user’s safety. In addition, security checks are also performed when the user starts a download from the site. Moreover, SmartScreen compares the download to a list of existing downloads by other users. A warning is issued if it is a brand new download.

EarthLink Toolbar: Helps to protect the user from on-line scams by displaying a security rating (i.e., passive warning) for all the websites the user visited previously. Additionally, it alerts the user if he tries to access a previously known fraudulent website. It appears to rely on a combination of heuristics, user ratings, and manual verification [30].

eBay Toolbar: Helps the buyers and sellers with real time alerts and keeps users safe from spoofing and fraudulent attacks by detecting fake sites via a combination of heuristics and blacklists through passive warnings [30].

GeoTrust TrustWatch Toolbar: Provides a website verification service that alerts the users to potentially unsafe or phishing Web sites based on the information of several third-party reputation services and certificate authorities via passive warnings [42]. TrustWatch notifies the users that the website has passed the verification scan based on a list of disreputable sites. It would also recommend additional caution when inputting
sensitive information to the website. Furthermore, it blocks the initial attempt when visiting potentially unsafe websites and warns users in case of a risk in revealing information to the site.

Web of Trust (WOT): A browser extension that tells the user which websites he can trust via active warnings [42]. It ensures the user’s Internet safety from scams, malware, rogue Web stores and dangerous links based on community ratings and reviews.

Fig. 7. Active warning from Google Safe Browsing. Reprinted from: https://googleblog.blogspot.com/2015/03/protecting-people-across-web-with.html.

B. Academic Phishing Detection/Classification Schemes

Unlike the public anti-phishing toolbars, which aim at providing real-time warnings about the legitimacy of visited websites, academic phishing detection and classification schemes normally focus on improving the detection accuracy and reducing the number of false alerts by employing sophisticated technologies and various machine learning algorithms. Table VI shows the time-based (from 2005 to 2016) development of 41 selected academic phishing detection/classification approaches. In order to choose the most representative studies, in this paper, we comply with the following criteria based on state-of-the-art literature:
• Pioneering: Research that introduces new ideas or methods to the literature.
• Attention: Research that receives more attention in terms of the number of citations.
• Completeness: Research that presents its work following the entire life cycle of phishing detection in depth.

Based on the proposed criteria, all of the 41 selected works are introduced in the following sections and twelve representative studies are chosen as examples to illustrate the detailed detection methodology in each category. They are listed in Table VI and introduced below.

Visual similarity based methods: Chen et al. [27] describe a novel heuristic anti-phishing system that explicitly employs gestalt and decision theory concepts to model perceptual similarity. More specifically, they apply a logistic regression algorithm to a set of normalized page content features. The proposed scheme can achieve 100% true positive rate and 0.74% false positive rate.

The most representative work in this category is done by Fu et al. [38]. They propose an effective phishing website detection approach via visual similarity assessment based on Earth Mover’s Distance (EMD) [47]. The detection process contains two phases, namely, generating signatures of Web pages and computing a visual similarity score from EMD.

The Web page processing phase (i.e., generate the signature) contains three steps: (1) obtain the image of a Web page from its URL using Graphic Device Interface (GDI) API; (2) perform image normalization (the normalized image size is 100 x 100, and the Lanczos algorithm [93] is used to resize the image); (3) transform the Web page image into a visual signature. The signature is comprised of the image color tuple using the [Alpha, Red, Green, and Blue] (ARGB) scheme and the centroid of its position in the image.

The second step is to compute the EMD between the visual similarity signatures of the two Web pages (legitimate site
and phishing site). Firstly, the normalized Euclidean distance of the degraded ARGB colors and the centroids are computed. Then the two distances are added up with their corresponding weights (i.e., p and q, p + q = 1). The normalized feature distance between $\phi_i$ and $\phi_j$ is defined as:

$$d_{ij} = ND_{feature}(\phi_i, \phi_j) = p \cdot ND_{color}(dc_i, dc_j) + q \cdot ND_{centroid}(C_{dc_i}, C_{dc_j})$$

where $\phi_i = \langle dc_i, C_{dc_i} \rangle$, $dc = \langle dA, dR, dG, dB \rangle$ is the color tuple, and $C_{dc}$ is the centroid value. Suppose we have signature $S_{s,a}$ and signature $S_{s,b}$; the EMD between $S_{s,a}$ and $S_{s,b}$ can be calculated as:

$$EMD(S_{s,a}, S_{s,b}) = \frac{\sum_{i}\sum_{j} f_{ij} \cdot d_{ij}}{\sum_{i}\sum_{j} f_{ij}}$$

where $f_{ij}$ is the flow matrix calculated through linear programming [47]. Note that if EMD = 0, the two images are identical; if EMD = 1, they are completely different.

Finally, the EMD-based visual similarity of two images is defined as:

$$VS(S_{s,a}, S_{s,b}) = 1 - \left[EMD(S_{s,a}, S_{s,b})\right]^{\alpha}$$

where $\alpha \in (0, +\infty)$ is an amplification factor that limits the skewness of the visual similarity distribution in the (0, 1) range.

Large-scale experiments with 10,281 suspected Web pages are carried out and the proposed scheme achieves 0.71% false positive rate and 89% true positive rate.

Similar works based on visual similarity include [15], [25], [26], [39], [46], [70], [76], [77], [94], [106], and [122].

TABLE VI
TIME-LINE BASED DEVELOPMENT OF PHISHING DETECTION SCHEMES FROM 2005 TO 2016

Page content based methods: Zhang et al. [124] propose CANTINA, a novel content-based approach for detecting phishing Web sites based on the Term Frequency/Inverse Document Frequency (TF-IDF) information retrieval metric. In addition, using some heuristics, the false positive rate is reduced. Generally, CANTINA works as follows:
1) CANTINA calculates the TF-IDF scores of each term of the content in the given website.
2) CANTINA generates a lexical signature by taking the five terms with highest TF-IDF weights.
3) CANTINA sends the lexical signature to a search engine (i.e., in their case, Google Search).
4) If the domain name of the current website matches the domain name of the top N search results, it is considered to be a legitimate website. Otherwise, it is concluded to be a phishing site. Note that the value of N affects the false positives.

CANTINA with TF-IDF alone results in a relatively high false positive rate. Therefore, several heuristics are used to reduce the false positive rate, including:
• Age of Domain: it examines the age of the domain name. If the page has been registered for more than 12 months, the heuristic returns +1 (i.e., legitimate), otherwise it returns -1 (phishing).
• Known Images: it examines whether a page contains inconsistent well-known logos.
• Suspicious URL: it examines if the URL contains an “@” or a “-” in the domain name.
• Suspicious Links: for each link in the webpage, it performs the above three URL checks.
• IP Address: it examines if the URL contains an IP address.
• Dots in URL: it examines the number of dots in the URL.
• Forms: it examines if a Web page contains any HTML text entry form requesting sensitive personal data (e.g., password).

In addition, CANTINA uses a simple forward linear model to make the decision:

$$S = f\left(\sum_{i} w_i \cdot h_i\right)$$

where $h_i$ is the result of each heuristic, $w_i$ is the weight of each heuristic, and $f$ is a simple threshold function:

$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x \le 0 \end{cases}$$

Here, 1 means a legitimate site and -1 means a phishing site. The proposed scheme could achieve 97% true positive rate

the number of dots in the URL. In addition, the authors create a binary feature for each token in the hostname (delimited by “.”) and in the path URL (strings delimited by “/”, “?”, “.”, “=”, “-” and “_”). The host-based features contain: (1) IP address properties (e.g., is the IP address in a blacklist?); (2) WHOIS properties (e.g., the date of registration, update, and expiration); (3) Domain name properties (e.g., the time-to-live (TTL) value for the DNS records associated with the hostname); (4) Geographic properties (e.g., the continent/country/city that the IP address belongs to).

All the features of the URL are encoded into high dimensional feature vectors and then different types of classifiers are applied to them. Here are some examples of the classifiers:
• Naive Bayes: Let $x$ denote the feature vectors and $y \in \{0, 1\}$ denote the label of the website, with $y = 1$ for malicious and $y = 0$ for legitimate ones. $P(x|y)$ denotes the conditional probability of the feature vector given its label. Then, assuming that malicious and legitimate websites are equally probable, the posterior probability that the feature vector $x$ belongs to a malicious URL is computed as:

$$P(y = 1 \mid x) = \frac{P(x \mid y = 1)}{P(x \mid y = 1) + P(x \mid y = 0)}$$

Finally, the right hand side of the equation is thresholded to predict the binary label of the feature vector $x$.
• Support Vector Machine (SVM): The decision using SVMs is expressed in terms of a kernel function $K(x, x')$ that computes the similarity between two feature vectors and non-negative coefficients $\alpha_i$ that indicate which training examples lie close to the decision boundary. SVMs classify new examples by computing their distance to the decision boundary:
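For the Naive Bayes classifier described above, the following Python sketch shows how URL tokens could be turned into binary features and how the equal-prior posterior is then thresholded. It is a minimal illustration under stated assumptions: the vocabulary, the per-feature probabilities, and all function names are hypothetical and are not taken from the surveyed work, and hostname and path tokens are merged into one set for brevity, whereas the description above keeps them as separate features.

```python
import re
from urllib.parse import urlsplit


def lexical_tokens(url):
    """Return the set of hostname and path tokens of a URL.

    The hostname is split on "." and the path/query on "/", "?", ".", "=",
    "-" and "_". The URL must include a scheme (e.g. "http://"), otherwise
    urlsplit does not find a hostname.
    """
    parts = urlsplit(url)
    host_tokens = (parts.hostname or "").split(".")
    path_tokens = re.split(r"[/?.=\-_]", parts.path + "?" + parts.query)
    return {t for t in host_tokens + path_tokens if t}


def binary_features(url, vocabulary):
    """One binary feature per vocabulary term: 1 if the token occurs."""
    tokens = lexical_tokens(url)
    return [1 if term in tokens else 0 for term in vocabulary]


def likelihood(x, probs):
    """P(x | y) under the naive assumption of independent binary features.

    probs[j] is the (assumed, pre-estimated) probability that feature j
    equals 1 for the class in question.
    """
    result = 1.0
    for xj, pj in zip(x, probs):
        result *= pj if xj else (1.0 - pj)
    return result


def posterior_malicious(x, probs_malicious, probs_legit):
    """P(y = 1 | x) when both classes are assumed equally probable."""
    p1 = likelihood(x, probs_malicious)
    p0 = likelihood(x, probs_legit)
    return p1 / (p1 + p0)


def predict(x, probs_malicious, probs_legit, threshold=0.5):
    """Threshold the posterior: 1 = malicious, 0 = legitimate."""
    posterior = posterior_malicious(x, probs_malicious, probs_legit)
    return 1 if posterior > threshold else 0
```

A call such as `predict(binary_features(url, vocabulary), probs_malicious, probs_legit)` then yields the binary label. In practice the per-feature probabilities would be estimated from labeled training URLs, and a small smoothing term would be needed so that neither likelihood collapses to zero.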
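The CANTINA decision rule described earlier, a threshold applied to a weighted sum of heuristic results, can be sketched in a few lines of Python. This is an illustrative sketch only: the weights used in the usage note are made-up values, the function names are hypothetical, and only two of the listed heuristics are implemented, following the convention that +1 means legitimate and -1 means phishing.

```python
from urllib.parse import urlsplit


def age_of_domain(months_registered):
    """+1 if the domain has been registered for more than 12 months, else -1."""
    return 1 if months_registered > 12 else -1


def suspicious_url(url):
    """-1 if the URL contains "@" or its domain name contains "-", else +1."""
    hostname = urlsplit(url).hostname or ""
    return -1 if "@" in url or "-" in hostname else 1


def threshold(x):
    """The threshold function f: 1 if x > 0, otherwise -1."""
    return 1 if x > 0 else -1


def cantina_decision(heuristic_results, weights):
    """S = f(sum_i w_i * h_i): 1 = legitimate site, -1 = phishing site."""
    if len(heuristic_results) != len(weights):
        raise ValueError("one weight is required per heuristic result")
    score = sum(w * h for w, h in zip(weights, heuristic_results))
    return threshold(score)
```

For example, `cantina_decision([age_of_domain(24), suspicious_url(url)], [2.0, 1.0])` combines the two heuristics with assumed weights. Note that a weighted sum of exactly zero is classified as phishing, because the threshold function returns -1 for x <= 0.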
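The EMD-based visual similarity of Fu et al. [38] discussed earlier can also be illustrated numerically. The Python sketch below computes the weighted feature distance, the EMD for a given flow matrix, and the resulting similarity score. Several points are assumptions made for illustration: the normalization constants (the largest possible ARGB distance and the diagonal of a 100 x 100 image), the default weights p = q = 0.5, and the function names. The flow matrix is taken as an input here; in the scheme itself it is obtained by linear programming, which this sketch does not implement.

```python
import math

# Assumed normalization constants (not specified in the text above):
# largest Euclidean distance between two ARGB tuples with 0..255 channels,
# and the diagonal of the 100 x 100 normalized page image.
MAX_COLOR_DIST = math.sqrt(4 * 255 ** 2)
MAX_CENTROID_DIST = math.hypot(100, 100)


def feature_distance(phi_i, phi_j, p=0.5, q=0.5):
    """d_ij = p * ND_color + q * ND_centroid, with p + q = 1.

    Each phi is a pair (color, centroid): color is an (A, R, G, B) tuple
    and centroid is an (x, y) position in the normalized image.
    """
    (color_i, centroid_i), (color_j, centroid_j) = phi_i, phi_j
    nd_color = math.dist(color_i, color_j) / MAX_COLOR_DIST
    nd_centroid = math.dist(centroid_i, centroid_j) / MAX_CENTROID_DIST
    return p * nd_color + q * nd_centroid


def emd_from_flow(flow, dist):
    """EMD = sum_ij f_ij * d_ij / sum_ij f_ij for a given flow matrix."""
    total_flow = sum(sum(row) for row in flow)
    work = sum(
        f * d
        for flow_row, dist_row in zip(flow, dist)
        for f, d in zip(flow_row, dist_row)
    )
    return work / total_flow


def visual_similarity(emd, alpha=1.0):
    """VS = 1 - EMD ** alpha, where alpha > 0 is the amplification factor."""
    return 1.0 - emd ** alpha
```

With an identity-like flow over zero distances the EMD is 0 and the similarity is 1 (identical pages), while moving all mass across distance 1 gives an EMD of 1 and a similarity of 0, which matches the interpretation of the two extremes given in the text.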
five heuristics (i.e., top-level domains, IP address, directory structure, query string, brand name) to compute simple combinations of blacklisted sites to discover new phishing sites. Also, it proposes an approximate matching algorithm to determine whether a given URL is a phishing site or not. PhishNet consists of two major components, namely, component I: predicting malicious URLs and component II: approximate matching.
The basic idea of component I is to combine different URL heuristics of known phishing URLs from a blacklist (i.e., PhishTank database) to generate new phishing URLs. These five URL heuristics include: (1) top-level domains (TLDs): by changing the TLDs of known blacklist entries, a list of new URLs can be obtained; (2) IP address: the predicted new phishing sites are obtained by enumerating all the combinations of the hostnames and pathnames of the known blacklisted websites with the same IP address; (3) directory structure: the idea is that two URLs sharing a common directory structure (e.g., www.abc.com/online/signin/paypal.htm and www.xyz.com/online/signin/ebay.htm) may have similar sets of file names. Therefore, the predicted new URLs are www.abc.com/online/signin/ebay.htm and www.xyz.com/online/signin/paypal.htm; (4) query string: starting from the observation that some URLs with the exact same directory structure differ only in query string (e.g., www.abc.com/online/signin/ebay?XYZ and www.xyz.com/online/signin/paypal?ABC), two new URLs, www.abc.com/online/signin/ebay?ABC and www.xyz.com/online/signin/paypal?XYZ, are created; (5) brand name: the intuition here is that phishers often target multiple brand names using the same URL structure. Therefore, the predicted URLs are obtained by changing the brand names embedded in the known phishing URLs.
After obtaining the whole set of the predicted URLs, PhishNet first performs a DNS lookup to filter out sites that cannot be resolved. Then it conducts a content similarity check (i.e., using an online tool at http://www.webconfs.com) between the known phishing URLs and the corresponding predicted URLs. The predicted URL is concluded to be a phishing site if the similarity score exceeds a certain threshold.
The second component performs an approximate match of a given URL to determine whether it is a phishing site or not. It first breaks the input URL into four different entities: IP address, hostname, directory structure and brand name. Then it assesses each entity by matching with the corresponding part of the known phishing URLs to generate an evaluation score. If the score is higher than a certain threshold, it is considered to be a phishing URL.
About 18,000 new phishing URLs are discovered from a set of 6,000 new blacklist entries. The proposed scheme achieves 3% false positive rate and 95% true positive rate.
Similar works based on blacklist/white-list include [23] and [96].
Hybrid methods: Whittaker et al. [108] use a logistic regression classifier to maintain Google's phishing blacklist automatically by examining the URL and the contents of a page. The proposed scheme correctly classifies more than 90% of phishing pages several weeks after training concludes.
Marchal et al. [75] develop a phishing detection system that requires very little training data and is language-independent, resilient to adaptive attacks and implemented entirely on client-side. The proposed target identification algorithm is faster than previous works and can help reduce false positives. The proposed scheme achieves 0.5% false positive rate and 99% true positive rate.
The most representative work in this category is Monarch [102], a real-time system that determines whether the submitted URL is spam or not. The authors deploy a real implementation to demonstrate its scalability, accuracy, and run time performance. Monarch consists of four components: (1) URL aggregation: it accepts URL submissions from a number of major email providers and Twitter's streaming API; (2) Feature collection: it visits a URL via Firefox Web browser to collect page content; (3) Feature extraction: it transforms the raw data generated from the feature collection component into a feature vector (e.g., transforming URLs into binary features and converting HTML content into a bag of words [110]); (4) Classification: feature vectors are applied to a proposed distributed logistic regression classifier for classification. The selected features in [102] are represented by a combination of URL based features, page content based features, whitelist and other features (e.g., routing data), including:
• Initial URL and Landing URL: domain tokens, path tokens, query parameters, number of sub-domains, length of domain, length of path, length of URL.
• Redirects: number of redirects, type of redirect.
• Sources and Frames: URL features for each embedded IFrame links and sources links.
• HTML Content: tokens of main HTML, frame HTML, and script content.
• Page Links: URL features for each link, number of links, ratio of internal domains to external domains.
• JavaScript Events: number of user prompts, tokens of prompts.
• Pop-up Windows: URL features for each window URL.
• Plugins: URL features for each plugin URL.
• HTTP Headers: tokens of all field names and values.
• DNS: IP of each host, mailserver domains and IPs, nameserver domains and IPs.
• Geolocation: country code, city code of each IP.
• Routing Data: ASN/BGP prefix for each IP encountered.
• Whitelist: a whitelist of known good domains.
Logistic Regression (LR) with L1-regularization is chosen as the classifier. To predict the class label (y = −1 means non-spam, y = +1 means spam) of a URL's feature vector x, a linear classifier characterized by a weight vector w is trained. Given a set of n labeled training points $(x_i, y_i)$, $i = 1, \dots, n$, the training process is to find w that minimizes the following objective function:
\[ f(w) = \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \, (x_i \cdot w)\right)\right) + \lambda \, \lVert w \rVert_1 \]
The first component is the negative log-likelihood of the training data as a function of the weight vector. The second component is the regularization term, which adds a penalty to the objective function [71].
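As a concrete reading of the L1-regularized logistic regression objective described above, the following Python sketch evaluates the logistic loss plus the L1 penalty for a given weight vector. It assumes labels in {−1, +1} and dense feature vectors stored as plain lists; the function name and argument names are illustrative, and the sketch only evaluates the objective rather than reproducing Monarch's distributed training procedure.

```python
import math
from typing import Sequence


def l1_logistic_objective(
    w: Sequence[float],
    xs: Sequence[Sequence[float]],
    ys: Sequence[int],
    lam: float,
) -> float:
    """Sum over i of log(1 + exp(-y_i * (x_i . w))), plus lam * ||w||_1."""
    total = 0.0
    for x, y in zip(xs, ys):
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Numerically stable evaluation of log(1 + exp(-margin)).
        if margin > 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    # L1 penalty: encourages sparse weight vectors.
    return total + lam * sum(abs(wj) for wj in w)
```

With all weights at zero, every training point contributes log 2 to the loss, which gives a quick sanity check before handing the function to an optimizer.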
2808 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 19, NO. 4, FOURTH QUARTER 2017
TABLE VII. SOURCES OF DATASETS FOR PHISHING AND LEGITIMATE WEBSITES
To perform the learning process over large-scale datasets in
real time, the data is divided into m shares and then processed
in a distributed manner (i.e., using Hadoop Spark [121]).
Monarch can achieve an overall accuracy of 91% with
0.87% false positives with a throughput of 638,000 URLs per
day. Similar works that use hybrid features include [13], [16],
[69], [79], [80], and [84].
Other methods: Ramesh et al. [90] present a phishing
website detection approach based on the phishing target identi-
fication. After obtaining the target domain name, the proposed
scheme performs third-party DNS look up for comparison to
decide the legitimacy of the suspicious page. The proposed
scheme achieves 0.32% false positive rate and 0.33% false
negative rate.
Similar works based on phishing target identification
include [101] and [107].
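The DNS comparison step used by target-identification approaches such as the one above can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the method of [90]: the helper names are hypothetical, the system's configured resolver stands in for the third-party DNS service, and "sharing at least one resolved IP address" is only one possible comparison rule.

```python
import socket
from typing import Callable, Set


def resolve_ips(domain: str) -> Set[str]:
    """Resolve a domain to its set of IP addresses (empty set on failure)."""
    try:
        infos = socket.getaddrinfo(domain, None)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def shares_target_ip(
    suspicious_domain: str,
    target_domain: str,
    resolver: Callable[[str], Set[str]] = resolve_ips,
) -> bool:
    """Return True if the suspicious domain resolves to any IP of the target domain."""
    suspicious = resolver(suspicious_domain)
    target = resolver(target_domain)
    return bool(suspicious & target)
```

Passing the resolver as a parameter keeps the comparison logic testable without network access and makes it easy to swap in a different DNS source.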
A. Evaluation Datasets
The evaluation is tightly coupled with the ground truth
datasets employed by the various approaches. Different
approaches collect ground truth from different cyber intel-
ligence sources. Such sources may employ different testing
methodologies and target different types of phishing activities,
and hence cover different phishing domains. That is, evaluation
based on one dataset may differ from that based on another.
Therefore, we argue that having publicly available reference datasets is crucial for systematizing the evaluation of various approaches, because it is an important step towards providing a benchmark to compare and contrast the efficiency of various approaches and it can help researchers to further advance the area in a more systematic way. The absence of reference sets, combined with difficulties in sharing code, makes it hard to repeat experiments for systematic comparison of effectiveness. In the following, we list the identifying features of the datasets used in the literature:
• Dataset source: Table VII lists the commonly used data sources of phishing websites and legitimate websites, together with the approaches that leverage each source. There is no common consensus on the quality of the different sources due to the lack of knowledge about the methodologies used in compiling and maintaining each source.
• Dataset size: the evaluation dataset size varies a lot among different approaches. Generally speaking, the larger the dataset, the more credible the results.
• Dataset redundancy: Datasets, especially those of phishing websites, usually contain repeated entries due to multiple submissions and overlap among different sources. However, little information is provided about dataset redundancy in the literature.
• Dataset timeliness: Phishing websites tend to have a very short lifetime. Therefore, phishing blacklist providers usually update information on hourly, daily or weekly schedules. Even if two schemes use the same data source with the same dataset size, they may contain different phishing website information.
• Ratio of legitimate to phishing websites: the ratio of legitimate to phishing instances shows the extent to which the experiments represent a real world distribution (≈ 100/1).
• Training set to testing set ratio: the ratio of training to testing instances indicates the scalability of the approach.
In Section V, we use these aspects to perform a systematic and comprehensive evaluation of the various phishing detection approaches.

B. Phishing Detection Features
1) Most Commonly Used Phishing Detection Features: In this section, we summarize the most commonly used features by various phishing detection approaches. Even though the
listing of atomic features presented here is not exhaustive, it includes the popular features used in most of the state-of-the-art phishing detection approaches. Figure 8 summarizes the features.
(i) URL-based lexical features: URLs are rich in lexical features that have been widely used in various phishing detection approaches [22], [108], [118], including:
• URL replaced with IP address: Some phishing websites do not use host-names, but rather use the IP address directly to locate the fake site. Such behavior is normally employed either to obfuscate the legitimate URL or simply to reduce cost.
• URL Length: Phishing websites usually have longer URLs compared to legitimate websites.
• Number of dots and sub domains: Phishing URLs often contain more “dots” and sub-domains compared to legitimate ones.
• Number of re-directions: Malicious URLs often have multiple URL redirects in order to evade detection by blacklists.
• Use of HTTPS protocol: Legitimate websites often use the HTTPS protocol, while phishing sites usually do not.
(ii) URL-based host features:
• WHOIS information: WHOIS is a query and response protocol that is widely used for querying databases that store registration information about websites [51], [72].
– Registered information about domain names: For most of the observed phishing sites, either the registration record is not available in WHOIS databases or the claimed identity is not accurate in the record.
– Age of Domain: Many of the observed phishing websites have domains that are registered only a few days before phishing emails are sent out, that is, phishing domains are likely to be short lived.
• Geographic information: Geographical location is one of the most commonly used indicators in detecting phishing because phishing websites are likely to be hosted in locations different from those of legitimate websites [3]. For example, Netcraft [3] provides location information (i.e., IP-based country information) to help in identifying fraudulent URLs. For instance, the real bankofamerica.com is unlikely to be hosted in Russia.
• Domain name similarity: A measure of the similarity between a potential phishing domain name and a target domain name. The similarity can be measured in many ways. For example, it can be measured based on the Edit Distance between the two domains [28]. The Edit Distance (a.k.a., Levenshtein distance) is the minimum number of single-character insertions, deletions, or substitutions needed to transform one domain into another. The smaller the number of edits, the higher the similarity.
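Several of the features listed above are cheap to compute directly from a URL string. The sketch below, using only the Python standard library, extracts a few of the lexical features and computes the standard Levenshtein distance (insertions, deletions, and substitutions) between two domain names. The function names and the returned dictionary keys are illustrative choices, and the lexical extractor assumes the URL includes its scheme (e.g., `http://`).

```python
import ipaddress
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Extract a few URL-based lexical features (URL must include a scheme)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "host_is_ip": host_is_ip,           # URL replaced with IP address
        "url_length": len(url),             # URL length
        "num_dots": url.count("."),         # number of dots in the URL
        "uses_https": parts.scheme == "https",
    }


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

A look-alike domain such as `paypa1.com` sits at distance 1 from `paypal.com`, which is the kind of near match that a domain-similarity feature is meant to flag.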
Fig. 10. ROC curves of FPR-TPR for different feature sets.
Fig. 11. ROC curves of FPR-TPR for selected feature sets, NRTT and the all feature sets.
backbone. This type of instability is addressed by measuring NRTT from different vantage points as detailed in [56].
Routing instabilities may result in permanent changes in network communications latency due to, for example, permanent network routing changes. It has been shown in many previous works [29], [64], [65], [91], [95] that only a small portion of the Internet is responsible for the vast majority of the routing instabilities and these routing changes exhibit a strong temporal periodicity, despite the growth of the Internet.
Leveraging NRTT for our problem, that is, distinguishing phishing from legitimate websites, is much more practical than the application envisioned by Khalil et al. [56], that is, Web authentication: (i) Identifying phishing websites does not have the limitation and the concern of mobile clients, not only because Web servers are static but also because NRTT is only computed and compared on the fly for the two websites (the suspected phishing and the target website). (ii) No reference profiles are maintained and stored at the vantage point. (iii) The unsolved routing network instabilities mentioned above do not exist in our case. For Web authentication, the reference profile and the real time profiles are measured at different times. That is, it is possible that permanent route changes occur between measuring reference and real-time profiles, which may severely affect the efficiency. On the other hand, NRTT of both the phishing and the legitimate website are measured at the same time in real time, and hence, permanent instabilities are not a concern. (iv) Long term instabilities are also not a concern in our problem. This is because local network congestion does not apply in the case of Web hosting servers compared to Web clients who may have poor network connections. Additionally, NRTT signals are sent at the same time for the phishing and legitimate websites, that is, the instabilities affect both and hence the difference between the two remains unchanged.
In order to demonstrate the effectiveness of NRTT as a phishing detection feature, we perform a set of experiments to evaluate the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) among different selected features and different feature sets (including NRTT). TPR measures the proportion of positives that are correctly identified (i.e., the percentage of phishing sites which are correctly identified):
\[ \mathrm{TPR} = \frac{\text{\# of correctly detected phishing}}{\text{Total \# of phishing}} \]
FPR measures the proportion of negatives that are incorrectly identified as positives (i.e., the percentage of legitimate sites which are wrongly identified as phishing sites):
\[ \mathrm{FPR} = \frac{\text{\# of wrongly detected legitimate}}{\text{Total \# of legitimate}} \]
Figure 9 shows the ROC (Receiver operating characteristic) curves of FPR vs. TPR for different URL based features including Length of the URL, Number of dots in the URL, Number of re-directions of the URL, and the URL set (i.e., combination of all of the 3 URL features plus the binary features: usage of HTTPS protocol and IP address in the URL). It shows that the URL set alone could achieve about 90% TPR with 2% FPR.
Figure 10 shows the ROC curves of FPR vs. TPR for different feature sets, including: WHOIS set, URL set, Web of trust score and the combination of all of the three sets “all 3 sets” (i.e., the combination of WHOIS set, URL set and Web of trust score). The WHOIS set contains two features, namely, the age of the domain and the existence of the registering information in WHOIS database. The Web of trust score is provided by SEO (search engine optimization) that collects all website ranking information based on Google, Bing, Yahoo, among others. The results show that the combination of all the selected feature sets can achieve about 93% TPR with 0.5% FPR.
Figure 11 shows the ROC curves of FPR vs. TPR for “other 3 sets” (i.e., the combination of WHOIS set, URL set and Web of trust score), NRTT and “all sets” (i.e., “other 3 sets” + NRTT). It clearly shows that, with the combination of all the features, the proposed scheme can achieve 99% TPR and 0.2% FPR.
The evaluation dataset contains 820 verified on-line phishing websites (redundancy reduced) collected from PhishTank
TABLE VIII. SUMMARY OF PHISHING DETECTION/CLASSIFICATION TECHNIQUES
false alarm rate and Type I error in some parts of the literature.
• True Negative Rate (TNR): The number of correctly identified legitimate websites divided by the total number of legitimate websites.
\[ \mathrm{TNR} = \frac{\text{\# of correctly identified legitimate}}{\text{Total \# of legitimate}} \]
• False Negative Rate (FNR): The number of phishing sites that are incorrectly identified as legitimate sites divided by the number of phishing sites. It is also known as miss rate or Type II error in some parts of the literature.
\[ \mathrm{FNR} = \frac{\text{\# of phishing identified as legitimate}}{\text{Total \# of phishing}} \]
• Precision (P): The rate of correctly detected phishing sites in relation to all sites that were detected as phishing.
\[ P = \frac{\text{\# of phishing correctly identified}}{\text{Total \# of sites detected as phishing}} \]
• F1 score: The harmonic mean between precision P and recall R.
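The metrics defined above all follow from the four confusion-matrix counts. The Python sketch below computes them in one place, treating phishing as the positive class; it takes recall R to be the same quantity as TPR and uses the standard harmonic-mean form F1 = 2PR / (P + R). The function and key names are illustrative, and undefined ratios (zero denominators) are returned as NaN by assumption.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute TPR, FPR, TNR, FNR, precision and F1 from confusion counts.

    tp: phishing sites detected as phishing
    fp: legitimate sites wrongly detected as phishing
    tn: legitimate sites identified as legitimate
    fn: phishing sites identified as legitimate
    """
    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    tpr = ratio(tp, tp + fn)        # recall R
    fpr = ratio(fp, fp + tn)
    tnr = ratio(tn, fp + tn)
    fnr = ratio(fn, tp + fn)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * tpr, precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
            "precision": precision, "F1": f1}
```

Note that TPR + FNR = 1 and FPR + TNR = 1 whenever the denominators are non-zero, so reporting one rate from each pair is sufficient.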
to the best of our knowledge, none of the existing approaches provides a framework that can be used to quantitatively evaluate the robustness of features. Most of the approaches that recognize the problem only qualitatively discuss the robustness of some of the important features used in their approaches. Therefore, we believe that providing a framework that outlines qualitative and quantitative evaluation guidelines of the robustness of features is an open problem that calls for attention from the research community. Such a framework has to consider both the complexity of feature forging and its impact on attacker benefits. Such a framework will be an important tool in the face of ever-evolving attacks, as it helps researchers and practitioners to design phishing detection techniques leveraging features that are both adaptive and hard to manipulate without considerably affecting attack utility.
One additional issue to consider while designing the detection features of an approach is the time it takes to mine the feature. Some features could be extremely useful in identifying and detecting phishing attempts; however, they may take a relatively long time to compute, such as page reputation and visual appearance similarity. The use of such features may either result in user inconvenience due to service delays until computation completes, or may result in security risks in case the service is provided before computing the features.

C. Detection Schemes
As presented in Section IV-C, phishing detection systems use various data mining algorithms and detection approaches, each with its own advantages and disadvantages. Understanding the underlying data mining algorithms is important in evaluating the performance, the scalability and the robustness of phishing detection schemes.
When designing a phishing detection scheme, we recommend following the life cycle illustrated in Figure 5 in order to help fellow researchers obtain a comprehensive understanding of the proposed approach and to make it easy for future studies to conduct comparative evaluations. Specifically, a detection approach has to clearly state what it can and what it cannot do in terms of phishing detection and blocking, to avoid relying on the approach in scenarios where it may not be efficient. Additionally, details of dataset specifications in terms of content and volume that better support the approach should be clearly articulated and documented.
A very important lesson that we have learned is that, due to the dynamic nature of cyber attacks, the most reliable and efficient phishing detection approaches are those that can continuously adapt to cope with such dynamisms. References [108] and [118] are examples of such dynamic approaches. Additionally, the robustness of phishing detection approaches is tightly coupled with the robustness of the features used by the approach. Therefore, an approach may result in excellent detection accuracy at the time of design or in its early deployment, but fail badly later due to either changes in the dataset or deliberate manipulation of the features utilized by the approach.
Two main categories have been considered in the underlying technologies (e.g., feature mining, classification, etc.) of phishing detection approaches. The first category of approaches utilizes human expertise to identify phishing attempts, which has been implemented using various heuristics. However, such approaches require a human in the loop, and hence are too slow. They fail to handle large scale datasets and cannot cope with high data rates, frequent dataset changes, or adaptive attack behaviors. Therefore, machine learning technologies, which utilize data-driven algorithms, were introduced to help automate the learning process. Different machine learning algorithms are used. Support Vector Machine (SVM), Logistic Regression (LR), and Bayesian-based classifiers are among the most commonly used algorithms in the literature.
Through our extensive investigation of the large body of phishing detection approaches, we learned that one size does not fit all and hence, it is extremely difficult to recommend one machine learning algorithm over another. Each machine learning algorithm has its own strengths and weaknesses, which have to be carefully considered to optimize the goals of the detection approach. For example, SVM is considered among the most robust and accurate classification algorithms [117]. However, it has the drawback of being computationally inefficient, and hence may not be appropriate for large scale datasets or high data rates. On the other hand, LR is one of the most widely used statistical models for binary data [12]. However, it performs poorly when nonlinear relationships exist between feature sets. Furthermore, even though Bayesian-based classifiers are easy to construct and can be readily applied to large scale datasets [103], they assume independent features, and hence are very restrictive.
Recent research efforts leverage Deep Learning (DL) algorithms to improve the performance of phishing detection schemes. DL allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [102]. DL has been successfully applied in many research fields, such as speech recognition, visual object recognition, drug discovery and genomics. Therefore, we believe that DL could be a viable alternative to traditional machine learning algorithms (e.g., SVM, LR), especially when handling complex and large scale datasets.
Another important issue that we have identified through this survey is the absence of deep and systematic evaluation of the performance of phishing detection approaches. The vast majority of the approaches focus on evaluating and analyzing the detection accuracy, while they overlook the run-time performance of the approach. Some approaches may show acceptable performance during the design and test phases due to the relatively small datasets used during these phases. However, real world datasets are usually more complex and much larger, which may cause such approaches to perform poorly in real world applications. Systematic performance analysis can provide important guidelines to evaluate the scalability of detection approaches, which in turn can help in improving performance by considering, for example, distributed platforms and parallel algorithms.

D. Evaluation Metrics
In addition to evaluating the quality of the detection scheme in terms of FPR and TPR, it is also imperative to have
systematic evaluation of the effectiveness and the scalability of the underlying detection algorithms. Effectiveness metrics have been discussed in Section IV-D; however, we note that most of the work in the literature lacks evaluation of other performance aspects such as speed of detection, usability, and practical deployment, among others.
The majority of phishing detection approaches leverage machine learning concepts including clustering and classification techniques. Therefore, they adopt the evaluation metrics and strategies developed in this domain. However, as mentioned earlier, the cyber security domain is more challenging due to the adaptive nature of attackers. Therefore, the evaluation results during the design phase should be considered with caution, as they may not hold later. In other words, the design phase results are limited in time validity and scope, which calls for the phishing detection community to think about adaptive evaluation strategies that cope with the unique challenges in the cyber security domain. For example, researchers could first classify the dataset into different categories (e.g., by type, time period, country, etc.), then perform the evaluation over every category of the dataset to obtain more comprehensive and convincing results. Another important issue is the difficulty in providing comparative evaluations among different phishing detection techniques. This is mainly due to the lack of standard benchmarks, and the lack of reference datasets as a consequence of the dynamic nature of the attackers and the potential sensitivity of data, which restricts sharing.
Unfortunately, many of the above mentioned challenges across all the aspects continue to exist, and hence call for a collaborative effort among the research community to alleviate their negative impact on the effectiveness and coverage of phishing detection approaches.
Table IX shows the comparison results across the previous four evaluation dimensions. From the performance perspective, we can see that 11 out of 12 schemes focus on the evaluation of true positive rate, false positive rate or other equivalent evaluation metrics. This is mainly because TPR determines the detection capability of the scheme while FPR represents its negative effects. Thus, they together provide the most valuable performance information about the quality of different approaches.
PhishTank is the most dominant source for phishing websites because it provides a large, up-to-date and verified phishing list for free. Yahoo and DMOZ were competitors to each other for providing legitimate website information. However, Yahoo closed its service at the end of 2014 for unknown reasons. Other favorite sources for legitimate websites include Alexa top sites and Google keyword searches. Although researchers try to use a larger number of datasets for more convincing evaluation results, few of them considered some fundamental aspects about the datasets. For example, the ratio of the number of legitimate websites to the number of phishing websites is about 100 to 1 in reality.
Blacklists are commonly used in the public phishing detection toolbars because they have the fastest response time. From Table IX, we can conclude that the most commonly used features (also with the best performance results) are URL based and page content based features. Also, studies that incorporate more features tend to have better performance results. The recent trend is to leverage the classifier itself to optimize the detection accuracy using a large number of various detection features.

VI. CONCLUSION
In this paper, we provide a systematic study of existing phishing detection works from different perspectives. We first describe the background knowledge about the phishing ecosystem and the state-of-the-art phishing statistics. Then we present a systematic review of the automatic phishing detection schemes. Specifically, we provide a taxonomy of the phishing detection schemes, discuss the datasets used in training and evaluating various detection approaches, discuss the features used by various detection schemes, and discuss the underlying detection algorithms and the commonly used evaluation metrics. Finally, we provide recommendations that we believe will help guide the development of more effective phishing detection schemes and make it easy to compare and contrast various schemes.

REFERENCES
[1] (2016). Phishing Trends & Intelligence Report: Hacking the Human. [Online]. Available: https://info.phishlabs.com/pti-report-download
[2] The Alexa Top 500 Sites on the Web. Accessed: Nov. 21, 2016. [Online]. Available: http://www.dmoz.org/
[3] Anti-Phishing Extension: Netcraft. Accessed: Dec. 5, 2016. [Online]. Available: http://toolbar.netcraft.com/
[4] Anti-Phishing Working Group. Accessed: Nov. 15, 2016. [Online]. Available: http://www.antiphishing.org/
[5] Clean MX Malicious URL List. Accessed: Nov. 21, 2016. [Online]. Available: http://support.clean-mx.com/clean-mx/phishing.php?response=alive
[6] DMOZ - The Directory of the Web. Accessed: Nov. 21, 2016. [Online]. Available: http://www.dmoz.org/
[7] Malwarepatrol. Accessed: Oct. 31, 2016. [Online]. Available: https://www.malwarepatrol.net/open-source.shtml
[8] Millersmiles Spoof Email and Phishing Scams List. Accessed: Nov. 21, 2016. [Online]. Available: http://www.millersmiles.co.uk/scams.php
[9] SURBL URL Reputation Data. Accessed: Nov. 21, 2016. [Online]. Available: http://www.surbl.org/lists
[10] G. Aaron and R. Rasmussen, Global Phishing Survey: Trends and Domain Name Use in 2h2009, Anti-Phishing Working Group, Lexington, MA, USA, 2010.
[11] S. Abu-Nimeh and S. Nair, “Bypassing security toolbars and phishing filters via DNS poisoning,” in Proc. IEEE Glob. Telecommun. Conf. (GLOBECOM), New Orleans, LA, USA, 2008, pp. 1–6.
[12] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair, “A comparison of machine learning techniques for phishing detection,” in Proc. ACM Anti Phishing Working Groups 2nd Annu. eCrime Researchers Summit, Pittsburgh, PA, USA, 2007, pp. 60–69.
[13] M. Aburrous, M. A. Hossain, K. Dahal, and F. Thabtah, “Intelligent phishing detection system for e-banking using fuzzy data mining,” Expert Syst. Appl., vol. 37, no. 12, pp. 7913–7921, 2010.
[14] M. Aburrous and A. Khelifi, “Phishing detection plug-in toolbar using intelligent fuzzy-classification mining techniques,” in Proc. Int. Conf. Soft Comput. Softw. Eng., San Francisco, CA, USA, 2013.
[15] S. Afroz and R. Greenstadt, “Phishzoo: An automated Web phishing detection approach based on profiling and fuzzy matching,” in Proc. 5th IEEE Int. Conf. Semantic Comput. (ICSC), 2009.
[16] A. Aggarwal, A. Rajadesingan, and P. Kumaraguru, “PhishAri: Automatic realtime phishing detection on Twitter,” in Proc. IEEE eCrime Researchers Summit (eCrime), 2012, pp. 1–12.
[17] F. Aloul, S. Zahidi, and W. El-Hajj, “Two factor authentication using mobile phones,” in Proc. AICCSA, Rabat, Morocco, 2009, pp. 641–644.
DOU et al.: SOK: SYSTEMATIC REVIEW OF SOFTWARE-BASED WEB PHISHING DETECTION 2817
[18] D. S. Anderson, C. Fleizach, S. Savage, and G. M. Voelker, “Spamscatter: Characterizing Internet scam hosting infrastructure,” in Proc. Usenix Security, Boston, MA, USA, 2007, pp. 1–14.
[19] APWG Phishing Trends Reports, Anti Phishing Working Group, 2016.
[20] M. Aydin and N. Baykal, “Feature extraction and classification phishing websites based on URL,” in Proc. IEEE Conf. Commun. Netw. Security (CNS), Florence, Italy, 2015, pp. 769–770.
[21] A. Bergholz et al., “New filtering approaches for phishing email,” J. Comput. Security, vol. 18, no. 1, pp. 7–35, 2010.
[22] A. Blum, B. Wardman, T. Solorio, and G. Warner, “Lexical feature based phishing URL detection using online learning,” in Proc. 3rd ACM Workshop Artif. Intell. Security, Chicago, IL, USA, 2010, pp. 54–60.
[23] Y. Cao, W. Han, and Y. Le, “Anti-phishing based on automated individual white-list,” in Proc. 4th ACM Workshop Digit. Identity Manag., Alexandria, VA, USA, 2008, pp. 51–60.
[24] D. D. Caputo, S. L. Pfleeger, J. D. Freeman, and M. E. Johnson, “Going spear phishing: Exploring embedded training and awareness,” IEEE Security Privacy, vol. 12, no. 1, pp. 28–38, Jan./Feb. 2014.
[25] K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen, “Fighting phishing with discriminative keypoint features,” IEEE Internet Comput., vol. 13, no. 3, pp. 56–63, May/Jun. 2009.
[26] T.-C. Chen, S. Dick, and J. Miller, “Detecting visually similar Web pages: Application to phishing detection,” ACM Trans. Internet Technol., vol. 10, no. 2, p. 5, 2010.
[27] T.-C. Chen, T. Stepan, S. Dick, and J. Miller, “An anti-phishing system employing diffused information,” ACM Trans. Inf. Syst. Security, vol. 16, no. 4, p. 16, 2014.
[28] N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell, “Client-side defense against Web-based identity theft,” in Proc. NDSS, San Diego, CA, USA, 2004.
[29] G. Comarela, G. Gürsun, and M. Crovella, “Studying interdomain routing over long timescales,” in Proc. Conf. Internet Meas., Barcelona, Spain, 2013, pp. 227–234.
[30] L. F. Cranor, S. Egelman, J. I. Hong, and Y. Zhang, “Phinding phish: An evaluation of anti-phishing toolbars,” in Proc. NDSS, San Diego, CA, USA, 2007.
[31] R. Dhamija, J. D. Tygar, and M. Hearst, “Why phishing works,” in Proc. SIGCHI Conf. Human Factors Comput. Syst., Montreal, QC, Canada, 2006, pp. 581–590.
[32] Z. Dou, I. Khalil, and A. Khreishah, “CLAS: A novel communications latency based authentication scheme,” Security Commun. Netw., vol. 2017, 2017, Art. no. 4286903.
[33] Z. Dou, I. Khalil, and A. Khreishah, “A novel and robust authentication factor based on network communications latency,” IEEE Syst. J., to be published.
[34] Z. Dou, I. Khalil, A. Khreishah, and A. Al-Fuqaha, “Robust insider attacks countermeasure for Hadoop: Design and implementation,” IEEE Syst. J., to be published.
[35] J. S. Downs, M. B. Holbrook, and L. F. Cranor, “Decision strategies and susceptibility to phishing,” in Proc. 2nd Symp. Usable Privacy Security, Pittsburgh, PA, USA, 2006, pp. 79–90.
[36] S. Egelman, L. F. Cranor, and J. Hong, “You’ve been warned: An empirical study of the effectiveness of Web browser phishing warnings,” in Proc. SIGCHI Conf. Human Factors Comput. Syst., Florence, Italy, 2008, pp. 1065–1074.
[37] M. N. Feroz and S. Mengel, “Examination of data, rule generation and detection of phishing URLs using online logistic regression,” in Proc. IEEE Int. Conf. Big Data (Big Data), Washington, DC, USA, 2014, pp. 241–250.
[38] A. Y. Fu, L. Wenyin, and X. Deng, “Detecting phishing Web pages with visual similarity assessment based on earth mover’s distance (EMD),” IEEE Trans. Depend. Secure Comput., vol. 3, no. 4, pp. 301–311, Oct./Dec. 2006.
[39] A. Y. Fu, L. Wenyin, and X. Deng, “EMD based visual similarity for detection of phishing webpages,” in Proc. Int. Workshop Web Doc. Anal., vol. 2005. 2005.
[40] S. Garera, N. Provos, M. Chew, and A. D. Rubin, “A framework for detection and measurement of phishing attacks,” in Proc. ACM Workshop Recurring Malcode, Alexandria, VA, USA, 2007, pp. 1–8.
[41] S. Gastellier-Prevost, G. G. Granadillo, and M. Laurent, “A dual approach to detect pharming attacks at the client-side,” in Proc. 4th IFIP Int. Conf. New Technol. Mobility Security (NTMS), Paris, France, 2011, pp. 1–5.
[42] GeoTrust. Geotrust TrustWatch Toolbar. Accessed: Dec. 5, 2016. [Online]. Available: https://www.geotrust.com/comcasttoolbar/
[43] A. Gharaibeh et al., “Smart cities: A survey on data management, security and enabling technologies,” IEEE Commun. Surveys Tuts., to be published.
[44] (2016). Anti-Phishing Working Group. [Online]. Available: http://www.antiphishing.org
[45] X. Han, N. Kheir, and D. Balzarotti, “Phisheye: Live monitoring of sandboxed phishing kits,” in Proc. ACM SIGSAC Conf. Comput. Commun. Security, Vienna, Austria, 2016, pp. 1402–1413.
[46] M. Hara, A. Yamada, and Y. Miyake, “Visual similarity-based phishing detection without victim site information,” in Proc. IEEE Symp. Comput. Intell. Cyber Security (CICS), Nashville, TN, USA, 2009, pp. 30–36.
[47] F. L. Hitchcock, “The distribution of a product from several sources to numerous localities,” J. Math. Phys., vol. 20, nos. 1–4, pp. 224–230, 1941.
[48] T. Holz, C. Gorecki, K. Rieck, and F. C. Freiling, “Measuring and detecting fast-flux service networks,” in Proc. 15th Netw. Distrib. Syst. Security Symp., 2008.
[49] J. Hong, “The state of phishing attacks,” Commun. ACM, vol. 55, no. 1, pp. 74–81, 2012.
[50] Google Inc. Google Safe Browsing. Accessed: Dec. 5, 2016. [Online]. Available: https://developers.google.com/safe-browsing/
[51] T. N. Jagatic, N. A. Johnson, M. Jakobsson, and F. Menczer, “Social phishing,” Commun. ACM, vol. 50, no. 10, pp. 94–100, 2007.
[52] C. Karlof, U. Shankar, J. D. Tygar, and D. Wagner, “Dynamic pharming attacks and locked same-origin policies for Web browsers,” in Proc. 14th ACM Conf. Comput. Commun. Security, Alexandria, VA, USA, 2007, pp. 58–71.
[53] I. Khalil and S. Bagchi, “Secos: Key management for scalable and energy efficient crypto on sensors,” in Proc. IEEE Depend. Syst. Netw., 2003.
[54] I. Khalil, S. Bagchi, and N. Shroff, “Analysis and evaluation of secos, a protocol for energy efficient and secure communication in sensor networks,” Ad Hoc Netw., vol. 5, no. 3, pp. 360–391, 2007.
[55] I. Khalil, Z. Dou, and A. Khreishah, “TPM-based authentication mechanism for apache hadoop,” in Proc. Int. Conf. Security Privacy Commun. Syst., Beijing, China, 2014, pp. 105–122.
[56] I. Khalil, Z. Dou, and A. Khreishah, “Your credentials are compromised, do not panic: You can be well protected,” in Proc. 11th ACM AsiaCCS, Xi’an, China, 2016, pp. 925–930.
[57] I. Khalil, I. Hababeh, and A. Khreishah, “Secure inter cloud data migration,” in Proc. 7th Int. Conf. Inf. Commun. Syst. (ICICS), Irbid, Jordan, 2016, pp. 62–67.
[58] I. Khalil, T. Yu, and B. Guan, “Discovering malicious domains through passive DNS data graph analysis,” in Proc. 11th ACM Asia Conf. Comput. Commun. Security, Xi’an, China, 2016, pp. 663–674.
[59] M. Khonji, Y. Iraqi, and A. Jones, “Phishing detection: A literature survey,” IEEE Commun. Surveys Tuts., vol. 15, no. 4, pp. 2091–2121, 4th Quart., 2013.
[60] P. Kumaraguru et al., “Getting users to pay attention to anti-phishing education: Evaluation of retention and transfer,” in Proc. Anti Phishing Working Groups 2nd Annu. eCrime Researchers Summit, Pittsburgh, PA, USA, 2007, pp. 70–81.
[61] P. Kumaraguru, S. Sheng, A. Acquisti, L. F. Cranor, and J. Hong, “Teaching Johnny not to fall for phish,” ACM Trans. Internet Technol., vol. 10, no. 2, p. 7, 2010.
[62] M. Kwon et al., “Use of network latency profiling and redundancy for cloud server selection,” in Proc. IEEE 7th Int. Conf. Cloud Comput., Anchorage, AK, USA, 2014, pp. 826–832.
[63] Spoofguard, Stanford Security Lab., Stanford, CA, USA, 2004. [Online]. Available: https://crypto.stanford.edu/SpoofGuard/
[64] C. Labovitz, G. R. Malan, and F. Jahanian, “Internet routing instability,” IEEE/ACM Trans. Netw., vol. 6, no. 5, pp. 515–528, Oct. 1998.
[65] M. Lad, J. H. Park, T. Refice, and L. Zhang, “A study of Internet routing stability using link weight,” Dept. Comput. Sci., Univ. California at San Diego, San Diego, CA, USA, Tech. Rep., 2008.
[66] A. Le, A. Markopoulou, and M. Faloutsos, “PhishDef: Url names say it all,” in Proc. IEEE INFOCOM, Shanghai, China, 2011, pp. 191–195.
[67] C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata, “Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median,” J. Exp. Soc. Psychol., vol. 49, no. 4, pp. 764–766, 2013.
[68] G. L’Huillier, A. Hevia, R. Weber, and S. Ríos, “Latent semantic analysis and keyword extraction for phishing classification,” in Proc. IEEE Int. Conf. Intell. Security Inf. (ISI), Vancouver, BC, Canada, 2010, pp. 129–131.
2818 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 19, NO. 4, FOURTH QUARTER 2017
[69] P. Likarish, E. Jung, D. Dunbar, T. E. Hansen, and J. P. Hourcade, “B-APT: Bayesian anti-phishing toolbar,” in Proc. IEEE Int. Conf. Commun., Beijing, China, 2008, pp. 1745–1749.
[70] G. Liu, B. Qiu, and L. Wenyin, “Automatic detection of phishing target from phishing webpage,” in Proc. 20th Int. Conf. Pattern Recognit. (ICPR), Istanbul, Turkey, 2010, pp. 4153–4156.
[71] Z. Q. J. Lu, “The elements of statistical learning: Data mining, inference, and prediction,” J. Roy. Stat. Soc. A, Stat. Soc., vol. 173, no. 3, pp. 693–694, 2010.
[72] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond blacklists: Learning to detect malicious Web sites from suspicious URLs,” in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., Paris, France, 2009, pp. 1245–1254.
[73] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying suspicious URLs: An application of large-scale online learning,” in Proc. 26th Annu. Int. Conf. Mach. Learn., Montreal, QC, Canada, 2009, pp. 681–688.
[74] S. Marchal, J. François, R. State, and T. Engel, “PhishStorm: Detecting phishing with streaming analytics,” IEEE Trans. Netw. Service Manag., vol. 11, no. 4, pp. 458–471, Dec. 2014.
[75] S. Marchal, K. Saari, N. Singh, and N. Asokan, “Know your phish: Novel techniques for detecting phishing sites and their targets,” in Proc. IEEE 36th Int. Conf. Distrib. Comput. Syst. (ICDCS), Nara, Japan, 2016, pp. 323–333.
[76] M.-E. Maurer and D. Herzner, “Using visual website similarity for phishing detection and reporting,” in Proc. Extended Abstracts Human Factors Comput. Syst. CHI, Austin, TX, USA, 2012, pp. 1625–1630.
[77] E. Medvet, E. Kirda, and C. Kruegel, “Visual-similarity-based phishing detection,” in Proc. 4th Int. Conf. Security Privacy Commun. Netw., Istanbul, Turkey, 2008, p. 22.
[78] I.-C. Mihai and L. Giurea, “Management of eLearning platforms security,” in Proc. Int. Sci. Conf. eLearn. Softw. Educ., vol. 1. 2016, pp. 422–427.
[79] M. Moghimi and A. Y. Varjani, “New rule-based phishing detection method,” Expert Syst. Appl., vol. 53, pp. 231–242, Jul. 2016.
[80] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Predicting phishing websites based on self-structuring neural network,” Neural Comput. Appl., vol. 25, no. 2, pp. 443–458, 2014.
[81] PhishTank: An Anti-Phishing Site, LLC OpenDNS, San Francisco, CA, USA, accessed: Dec. 5, 2016.
[82] A. Oprea, Z. Li, T.-F. Yen, S. H. Chin, and S. Alrwais, “Detection of early-stage enterprise infection by mining large-scale log data,” in Proc. 45th Annu. IEEE/IFIP Int. Conf. Depend. Syst. Netw., Rio de Janeiro, Brazil, Jun. 2015, pp. 45–56.
[83] P. Pajares. Phishing Safety: Is HTTPS Enough? [Online]. Available: http://blog.trendmicro.com/trendlabs-security-intelligence/phishing-safety-is-https-enough/
[84] Y. Pan and X. Ding, “Anomaly based Web phishing page detection,” in Proc. ACSAC, vol. 6, 2006, pp. 381–392.
[85] R. K. Panta, S. Bagchi, and I. M. Khalil, “Efficient wireless reprogramming through reduced bandwidth usage and opportunistic sleeping,” Ad Hoc Netw., vol. 7, no. 1, pp. 42–62, 2009.
[86] B. Parmar, “Protecting against spear-phishing,” Comput. Fraud Security, vol. 2012, no. 1, pp. 8–11, 2012.
[87] P. Prakash, M. Kumar, R. R. Kompella, and M. Gupta, “PhishNet: Predictive blacklisting to detect phishing attacks,” in Proc. IEEE INFOCOM, San Diego, CA, USA, 2010, pp. 1–5.
[88] A. Ramachandran, D. Dagon, and N. Feamster, “Can DNS-based blacklists keep up with bots,” in Proc. 3rd Conf. Email Anti Spam, 2006.
[89] V. Ramanathan and H. Wechsler, “Phishing website detection using latent Dirichlet allocation and adaboost,” in Proc. IEEE Int. Conf. Intell. Security Informat. (ISI), Arlington, VA, USA, 2012, pp. 102–107.
[90] G. Ramesh, I. Krishnamurthi, and K. S. S. Kumar, “An efficacious method for detecting phishing webpages through target domain identification,” Decis. Support Syst., vol. 61, pp. 12–22, May 2014.
[91] J. Rexford, J. Wang, Z. Xiao, and Y. Zhang, “BGP routing stability of popular destinations,” in Proc. 2nd ACM SIGCOMM Workshop Internet Meas., Marseille, France, 2002, pp. 197–202.
[92] P. Robichaux and D. L. Ganger, “Gone phishing: Evaluating anti-phishing tools for windows,” 3Sharp Project, Redmond, WA, USA, Tech. Rep., Sep. 2006.
[93] J. C. Russ and R. P. Woods, “The image processing handbook,” J. Comput. Assisted Tomograph., vol. 19, no. 6, pp. 979–981, 1995.
[94] N. Sanglerdsinlapachai and A. Rungsawang, “Using domain top-page similarity feature in machine learning-based Web phishing detection,” in Proc. 3rd Int. Conf. Knowl. Disc. Data Min. (WKDD), 2010, pp. 187–190.
[95] A. Shaikh, A. Varma, L. Kalampoukas, and R. Dube, “Routing stability in congested networks: Experimentation and analysis,” ACM SIGCOMM Comput. Commun. Rev., vol. 30, no. 4, pp. 163–174, 2000.
[96] M. Sharifi and S. H. Siadati, “A phishing sites blacklist generator,” in Proc. IEEE/ACS Int. Conf. Comput. Syst. Appl., Doha, Qatar, 2008, pp. 840–843.
[97] S. Sheng, M. Holbrook, P. Kumaraguru, L. F. Cranor, and J. Downs, “Who falls for phish?: A demographic analysis of phishing susceptibility and effectiveness of interventions,” in Proc. SIGCHI Conf. Human Factors Comput. Syst., Atlanta, GA, USA, 2010, pp. 373–382.
[98] S. Sheng et al., “Anti-phishing phil: The design and evaluation of a game that teaches people not to fall for phish,” in Proc. 3rd Symp. Usable Privacy Security, Pittsburgh, PA, USA, 2007, pp. 88–99.
[99] S. Sinha, M. Bailey, and F. Jahanian, “Shades of grey: On the effectiveness of reputation-based ‘blacklists,’” in Proc. 3rd Int. Conf. Malicious Unwanted Softw., Fairfax, VA, USA, Oct. 2008, pp. 57–64.
[100] A. K. Sood and S. Zeadally, “A taxonomy of domain-generation algorithms,” IEEE Security Privacy, vol. 14, no. 4, pp. 46–53, Jul./Aug. 2016.
[101] C. L. Tan, K. L. Chiew, K. Wong, and S. N. Sze, “PhishWHO: Phishing webpage detection via identity keywords extraction and target domain name finder,” Decis. Support Syst., vol. 88, pp. 18–27, Aug. 2016.
[102] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song, “Design and evaluation of a real-time URL spam filtering service,” in Proc. IEEE Symp. Security Privacy, Berkeley, CA, USA, 2011, pp. 447–462.
[103] G. Varshney, M. Misra, and P. K. Atrey, “A survey and classification of Web phishing detection schemes,” Security Commun. Netw., vol. 9, no. 18, pp. 6266–6284, 2016.
[104] J. Wang, T. Herath, R. Chen, A. Vishwanath, and H. R. Rao, “Research article phishing susceptibility: An investigation into the processing of a targeted spear phishing email,” IEEE Trans. Prof. Commun., vol. 55, no. 4, pp. 345–362, Dec. 2012.
[105] W. D. Yu, S. Nargundkar, and N. Tiruthani, “A phishing vulnerability analysis of Web based systems,” in Proc. IEEE Symp. Comput. Commun. (ISCC), Marrakech, Morocco, 2008, pp. 326–331.
[106] L. Wenyin, G. Huang, L. Xiaoyue, X. Deng, and Z. Min, “Phishing Web page detection,” in Proc. 8th Int. Conf. Document Anal. Recognit. (ICDAR), Seoul, South Korea, 2005, pp. 560–564.
[107] L. Wenyin, G. Liu, B. Qiu, and X. Quan, “Antiphishing through phishing target discovery,” IEEE Internet Comput., vol. 16, no. 2, pp. 52–61, Mar./Apr. 2012.
[108] C. Whittaker, B. Ryner, and M. Nazif, “Large-scale automatic classification of phishing pages,” in Proc. NDSS, vol. 10. San Diego, CA, USA, 2010.
[109] Avalanche (Phishing Group)—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco, CA, USA, 2016.
[110] Bag-of-Words Model—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco, CA, USA, 2016.
[111] Mcafee Siteadvisor—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco, CA, USA, 2016, accessed: Sep. 6, 2016.
[112] Microsoft Smartscreen—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco, CA, USA, 2016, accessed: Sep. 28, 2016.
[113] Netcraft—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco, CA, USA, 2016, accessed: Sep. 3, 2016.
[114] Yahoo! Directory—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco, CA, USA, 2016, accessed: Jun. 7, 2016.
[115] Zero-Day (Computing)—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco, CA, USA, 2016.
[116] M. Wu, R. C. Miller, and S. L. Garfinkel, “Do security toolbars actually prevent phishing attacks?” in Proc. SIGCHI Conf. Human Factors Comput. Syst., Montreal, QC, Canada, 2006, pp. 601–610.
[117] X. Wu et al., “Top 10 algorithms in data mining,” Knowl. Inf. Syst., vol. 14, no. 1, pp. 1–37, 2008.
[118] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, “Cantina+: A feature-rich machine learning framework for detecting phishing Web sites,” ACM Trans. Inf. Syst. Security, vol. 4, no. 2, 2011, Art. no. 21.
[119] G. Xiang and J. I. Hong, “A hybrid phish detection approach by identity discovery and keywords retrieval,” in Proc. 18th Int. Conf. World Wide Web, Madrid, Spain, 2009, pp. 571–580.
[120] J. Yearwood, M. Mammadov, and A. Banerjee, “Profiling phishing emails based on hyperlink information,” in Proc. Int. Conf. Adv. Soc. Netw. Anal. Min. (ASONAM), Odense, Denmark, 2010, pp. 120–127.
[121] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proc. HotCloud, Boston, MA, USA, 2010, p. 10.
[122] H. Zhang, G. Liu, T. W. S. Chow, and W. Liu, “Textual and visual content-based anti-phishing: A Bayesian approach,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1532–1546, Oct. 2011.
[123] J. Zhang, S. Saha, G. Gu, S.-J. Lee, and M. Mellia, “Systematic mining of associated server herds for malware campaign discovery,” in Proc. 35th IEEE Int. Conf. Distrib. Comput. Syst., Columbus, OH, USA, 2015, pp. 630–641.
[124] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: A content-based approach to detecting phishing Web sites,” in Proc. 16th Int. Conf. World Wide Web, Banff, AB, Canada, 2007, pp. 639–648.

Zuochao Dou received the B.S. degree in electronics from the Beijing University of Technology in 2009, the M.S. degree from the University of Southern Denmark, concentrating on embedded control systems in 2011, and the M.S. degree from the University of Rochester majoring in communications and signal processing in 2013. He is currently pursuing the Ph.D. degree in cloud computing security and network security under the supervision of Dr. A. Khreishah and Dr. I. Khalil.

Ala Al-Fuqaha (S’00–M’04–SM’09) received the M.S. degree in electrical and computer engineering from the University of Missouri-Columbia in 1999 and the Ph.D. degree in electrical and computer engineering from the University of Missouri-Kansas City in 2004. He is currently a Professor and the Director of NEST Research Laboratory, Computer Science Department, Western Michigan University. His research interests include wireless vehicular networks, cooperation and spectrum access etiquettes in cognitive radio networks, smart services in support of the Internet of Things, management and planning of software defined networks, and performance analysis and evaluation of high-speed computer and telecommunications networks. In 2014, he was a recipient of the Outstanding Researcher Award with the College of Engineering and Applied Sciences, Western Michigan University. He is currently serving on the Editorial Board for Security and Communication Networks (Wiley), Wireless Communications and Mobile Computing (Wiley), EAI Transactions on Industrial Networks and Intelligent Systems, and the International Journal of Computing and Digital Systems. He has served as a Technical Program Committee Member and a Reviewer of many international conferences and journals.