
Software-Based Phishing Defense

This document summarizes a systematic review of software-based approaches for detecting web phishing. It notes that phishing attacks are a growing cybersecurity problem, with annual losses in the US estimated at between $61 million and $3 billion. A variety of technical approaches have been proposed to detect phishing, including software-based detection schemes that analyze website content, network, and URL characteristics. However, these approaches differ significantly in their methods and evaluation, which warrants a careful review. The paper aims to systematically analyze phishing detection schemes, especially software-based ones, to provide insights that can help guide the development of more effective techniques.


IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 19, NO. 4, FOURTH QUARTER 2017, p. 2797

Systematization of Knowledge (SoK): A Systematic Review of Software-Based Web Phishing Detection
Zuochao Dou, Student Member, IEEE, Issa Khalil, Member, IEEE, Abdallah Khreishah, Member, IEEE,
Ala Al-Fuqaha, Senior Member, IEEE, and Mohsen Guizani, Fellow, IEEE

Abstract: Phishing is a form of cyber attack that leverages social engineering approaches and other sophisticated techniques to harvest personal information from users of websites. The average annual growth rate of the number of unique phishing websites detected by the Anti Phishing Working Group is 36.29% for the past six years and 97.36% for the past two years. In the wake of this rise, alleviating phishing attacks has received a growing interest from the cyber security community. Extensive research and development have been conducted to detect phishing attempts based on their unique content, network, and URL characteristics. Existing approaches differ significantly in terms of intuitions, data analysis methods, as well as evaluation methodologies. This warrants a careful systematization so that the advantages and limitations of each approach, as well as the applicability in different contexts, could be analyzed and contrasted in a rigorous and principled way. This paper presents a systematic study of phishing detection schemes, especially software based ones. Starting from the phishing detection taxonomy, we study evaluation datasets, detection features, detection techniques, and evaluation metrics. Finally, we provide insights that we believe will help guide the development of more effective and efficient phishing detection schemes.

Index Terms: Phishing, Phishing website detection, software based methods.

Manuscript received December 16, 2016; revised May 8, 2017; accepted August 16, 2017. Date of publication September 13, 2017; date of current version November 21, 2017. (Corresponding author: Mohsen Guizani.)
Z. Dou and A. Khreishah are with the Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ 07102-1982 USA (e-mail: [email protected]; [email protected]).
I. Khalil is with the Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar (e-mail: [email protected]).
A. Al-Fuqaha is with the NEST Research Laboratory, College of Engineering and Applied Sciences Computer Science Department, Western Michigan University, Kalamazoo, MI 49008 USA (e-mail: [email protected]).
M. Guizani is with the Electrical and Computer Engineering Department, University of Idaho, Moscow, ID 83844-1023 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/COMST.2017.2752087

I. INTRODUCTION

PHISHING, one form of cyber-attacks, continues to be a growing concern not only to cyber security specialists but also to e-business users and owners. The severity of such cyber attack vector is continuously growing with the exponential increase in digital information generation and the increased reliance of people and business on cyber space. The Anti-Phishing Working Group (APWG) has seen rapid growth in the number of unique phishing websites detected from 2014 to 2016 [19]. The average annual growth rate is 97.36% and is expected to continue to grow. Estimates of annual direct financial loss to the U.S. economy caused by phishing activities range from $61 million to $3 billion [49].

To mitigate the increasing damage caused by phishing, a broad range of anti-phishing mechanisms have been proposed over the past two decades. These anti-phishing techniques can be categorized into three broad groups [12]: (1) Detective solutions (e.g., website filtering); (2) Preventive solutions (e.g., strong authentication [32]–[34], [43], [53], [54], [85]); and (3) Corrective solutions (e.g., Site takedown [57], [58]). In this paper, we focus on detective solutions. More specifically, we look at software-based phishing detection schemes that are specialized in identifying and classifying phishing websites. This class of approaches is arguably more important than other approaches because it helps in reducing human errors. Preventative and corrective solutions take a different approach, but if the user behind the keyboard has been successfully tricked by a phishing attempt, and willingly submitted sensitive information, then no firewall, encryption software, certificates, or authentication mechanism can help in preventing the attack from materializing [49]. Software-based phishing detection also delivers improved results compared to detection by user education (e.g., [60], [61], and [98]) because phishing attacks normally aim at exploiting human weaknesses [59]. For example, a study of phishing detection using user education [97] shows a 29% false negative rate (FNR) for the best performance, while the software based approaches that are surveyed by the same study have FNR in the range of 0.1% to 10%. For this reason, we focus our study on software based phishing detection systems, and the term “phishing detection” will refer only to this form of detection in the rest of the paper.

Although the research area of phishing detection and classification is relatively rich, there is a lack of systematic analysis of the requirements, the capabilities, and the shortcomings of the existing anti-phishing techniques. For example, websites that offer identification and classification of phishing as a service have been popular in recent years, however, those services leverage different evaluation datasets from various sources at different time periods to validate their outcomes. Albeit those schemes may have similar performance results (e.g., in terms of false positive rate, true positive rate, etc.), it is difficult to compare their performance because of the variation in the evaluation datasets employed. Consequently, a systematic assessment of the datasets used to validate phishing detection approaches is desired, as well as necessary, in
1553-877X © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
horized licensed use limited to: Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology. Downloaded on January 29,2024 at 09:43:48 UTC from IEEE Xplore. Restrictions app

TABLE I: MOST POPULAR DEFINITIONS OF PHISHING

order to provide a foundation for comprehensive comparisons
among different phishing detection schemes, and ultimately,
select the best in practice.
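
The rates used in this comparison (false negative rate, false positive rate, true positive rate) all derive from four confusion-matrix counts. The following is a minimal sketch, assuming the usual textbook definitions with phishing as the positive class; the class name `ConfusionCounts`, the function `rates`, and the example counts are hypothetical and are not taken from any of the surveyed studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for a binary phishing detector (positive class = phishing)."""
    tp: int  # phishing sites flagged as phishing
    fp: int  # legitimate sites flagged as phishing
    tn: int  # legitimate sites passed as legitimate
    fn: int  # phishing sites passed as legitimate


def rates(c: ConfusionCounts) -> dict:
    """Return TPR, FNR, FPR, and TNR as fractions in [0, 1]."""
    phishing = c.tp + c.fn    # all samples that really are phishing
    legitimate = c.fp + c.tn  # all samples that really are legitimate
    if phishing == 0 or legitimate == 0:
        raise ValueError("need at least one phishing and one legitimate sample")
    return {
        "TPR": c.tp / phishing,
        "FNR": c.fn / phishing,
        "FPR": c.fp / legitimate,
        "TNR": c.tn / legitimate,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = rates(ConfusionCounts(tp=90, fp=5, tn=95, fn=10))
print(example)
```

Because every rate is a ratio over the samples of one dataset, the same detector can produce different values on datasets with different compositions, which is why figures reported on different datasets are hard to compare directly.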
In this work, we complement the existing survey papers on phishing detection, including [49], [59], and [103], by providing a broad systematic analysis of software based anti-phishing approaches. Varshney et al. [103] focus on studying, analyzing, and classifying the most significant and novel detection techniques, and point out the advantages and disadvantages of each approach. On the other hand, we present a more comprehensive systematic review of phishing detection schemes, not only from the perspective of detection algorithms, but also from a broader perspective that covers other important aspects including the phishing detection life cycle, taxonomy of phishing detection schemes, evaluation datasets, detection features, and evaluation metrics and strategies. The work in [49] focuses more on the attack side of phishing. More specifically, it presents details about phishing attacks including the anatomy of such attacks, why people fall for phishing attacks, and how bad phishing is. However, it only provides a high level analysis of the state-of-the-art phishing countermeasures. In order to provide a systematic review of the phishing detection research, we first present the necessary information about phishing attacks by answering three questions: (1) What is phishing? (2) How does phishing work? and (3) What is the current status of phishing? Then, we conduct a systematic review of phishing detection schemes in a detailed and comprehensive manner. Finally, Khonji et al. [59] present a literature survey about anti-phishing solutions (e.g., user training, email filtering and website detection, etc.), including their classification, detection techniques and evaluation metrics. Compared to [59], we focus on the software based phishing website detection schemes, which have proved to be the most effective anti-phishing solutions and are not systematically studied in [59].

In a nutshell, the objective of this paper is to provide a systematic understanding of existing phishing detection studies and provide a comprehensive way to evaluate phishing detection approaches from different perspectives in order to guide future developments and validations of new or upgraded anti-phishing techniques.

We summarize our contributions in this work as follows:
• Compile a comprehensive profile of phishing through its various definitions, detailed ecosystem (i.e., in terms of phishing life cycle, actors involved and their operations, etc.), and the state-of-the-art phishing trends.
• Present a systematic review of the software based phishing detection schemes from different perspectives including the life cycle, taxonomy, evaluation datasets, detection features, detection techniques and evaluation metrics.
• Introduce a novel feature, Network Round Trip Time (NRTT), for efficient and real time detection of phishing attacks.
• Provide detailed takeaway lessons for researchers and practitioners in the area of phishing detection that we believe will help guide the development of effective phishing detection schemes.

The rest of the paper is organized as follows: Section II describes the state-of-the-art phishing attacks, and presents the life cycle of phishing detection approaches. Section III introduces the taxonomy of phishing detection schemes with the corresponding literature review. Section IV presents a systematic review of software based phishing detection schemes from different perspectives: (1) phishing detection datasets; (2) phishing detection features; (3) phishing detection techniques; and (4) evaluation metrics. Section V provides detailed takeaway lessons for researchers and practitioners in the area of phishing detection. Section VI concludes the paper.

II. BACKGROUND

A. State-of-the-Art Phishing Attacks

In this section, we first present the various definitions of phishing, then we introduce some statistics about phishing between January 2010 and June 2016. Finally, we describe the phishing ecosystem.

1) What Is Phishing?: There is no consensus on how phishing should be defined. Different phishing definitions lead to different research directions and approaches (e.g., email filtering or website detection). It is important to clearly identify the target of any phishing detection approach to avoid confusion about its applicability in different scenarios. The target and scope of a phishing detection approach can be analyzed from the definition of phishing that the approach adopts. Therefore, presenting a background on the different definitions of phishing can help the readers understand the scope and the capabilities of different approaches. Table I summarizes the popular definitions of phishing. On one hand, the definitions of PhishTank [81], APWG [19], Xiang et al. [118],


Ramesh et al. [90] cover the majority of cases in which phishers aim at stealing sensitive personal information such as authentication credentials. Table II shows the comparison of those phishing definitions based on phishing target and phishing strategy. The most dominant phishing strategies are social engineering (e.g., through fraudulent emails) and technical subterfuge (e.g., malware infection). However, sophisticated techniques (e.g., pharming [52]) are also used to harvest users’ personal information from the Internet. On the other hand, the definitions of Whittaker et al. [108] and Khonji et al. [59] do not limit the attacker’s target (e.g., sensitive personal information). They describe the phishing strategy (e.g., phishing website or socially engineered messages) without stating a specific phishing target (e.g., they only state the attackers’ benefit). To sum up, the definition of Whittaker et al. [108] is the most general among those reviewed, while APWG [19] defines the most commonly used phishing attacks in a specific manner.

TABLE II: TARGETS AND STRATEGIES OF PHISHING

2) How Does Phishing Work?: In this section, we introduce the ecosystem of phishing in terms of phishing process, actors involved, their actions and interactions.

(i) Phishing Process: In a generic/traditional phishing scenario (i.e., mass-email phishing campaigns), an attacker hosts a fake website, and presents users of a Web service with convincing emails containing a link to the fake website. When a user of the Web service opens the link and enters her sensitive data, the data is collected by the server hosting the fake website. As shown in Figure 1, Mihai and Giurea [78] suggest that a generic phishing process can be divided into five steps: (1) Reconnaissance: Phishers look for famous Web service brands with a broad customer base; (2) Weaponization: Phishers design the phishing websites and the socially engineered spam emails; (3) Distribution: Phishers deliver emails to the victims; (4) Exploitation: Phishers exploit weaknesses of humans to lure the victims into phishing traps via socially engineered emails; (5) Exfiltration: Phishers collect sensitive data from the phishing databases.

Unlike generic phishing attacks, spear phishing targets particular individuals or organizations [24], [86], [104]. Spear phishing attacks typically extract sensitive data from their victims by attaching a type of malware to emails or to the phishing website. Industry statistics indicate that spear phishing attacks have a success rate of 19%, while the success rate of generic phishing attacks is less than 5% [86].

For the purpose of this paper, we will not consider email filtering (e.g., [21], [68], and [120]) as a phishing detection method. Our focus is on detection of website phishing for both generic and spear phishing attacks.

(ii) Phishing Actors: There are six actors involved in a typical phishing life cycle (see Figure 2), as defined in the following paragraphs:
• Phisher: Individuals or organizations that conduct phishing attacks in order to obtain a certain type of benefit, such as financial gain, identity hiding (i.e., the situation in which phishers do not use the stolen identities directly, but rather sell them to interested criminals and cyber attackers), fame and notoriety, etc. [59], [105].
• Web service provider: Companies that provide a certain type of service (e.g., email, social network, e-banking, on-line shopping, etc.) on the Internet (usually through a website).
• Web service subscriber: Customers who subscribe to Web services provided by the Web service provider. Subscribers are the potential targets of traditional phishing attacks.
• Web hosting provider: Companies that provide website hosting services to Web service companies.
• Anti-phishing institutes: Institutes that support those tackling the phishing menace and provide advice on anti-phishing controls and information on current trends [4].
• Spear phishing targets: Specific individuals or companies targeted by phishers.

Each actor involved in the phishing process has different actions and reactions (summarized in Table III). Phishers try to use sophisticated techniques to evade phishing detection approaches (e.g., DNS poisoning [11]). In addition, there is a growing trend in which phishers have decoupled the process of phishing website hosting from the process of sending phishing emails in order to evade the anti-phishing solutions (Han et al. [45]).

TABLE III: OPERATIONS OF DIFFERENT PLAYERS INVOLVED IN PHISHING

Web service providers usually announce blacklists of phishing websites and recommend that users adopt strong authentication schemes (e.g., [17], [32]–[34], and [55]). Additionally, Web service subscribers depend heavily on browser filters (e.g., Google Safe Browsing [50]) and other third party anti-phishing toolbars (e.g., Netcraft [3]) to detect and block phishing attempts.

The role of Web hosting providers is rather ambiguous in the phishing process. Reputable providers usually enforce strict “Terms of Use” and avail certain anti-phishing solutions

(e.g., brand monitoring [12]). Due to financial constraints, many free-to-use Web hosting providers may not be able to afford deploying good anti-phishing security measures, which leaves their customers not only vulnerable, but even worse, attractive targets for phishing.

Fig. 1. Illustration of the phishing process.

Fig. 2. Players involved in the phishing process.

Anti-phishing institutes collect and analyze phishing data (e.g., suspicious websites reported by users) from various sources (e.g., users’ reports via anti-phishing toolbars), and provide anti-phishing suggestions and solutions (e.g., up-to-date phishing website blacklist, phishing detection toolbars, etc.). In addition, they may also cooperate with government agencies such as public security and law enforcement to detect and prevent cyber attacks [4].

3) What Is the Current State of Phishing?: According to phishing activity trends reports published by APWG [19] from Jan. 2010 to Jun. 2016 (shown in Figure 3), the number of


unique phishing websites established per month has increased significantly since 2015 (i.e., the average number for 2016 is 2.93 times the average from prior years). It is clear that phishers profit from this type of cyber-attack, which results in financial loss for both Web subscribers and business owners. Therefore, agile techniques to mitigate phishing will continue to be a pressing need.

Fig. 3. The number of unique phishing sites per month from Jan. 2010 to Jun. 2016.

Phishing attacks tend to employ advanced techniques to lure Web service users into their rogue websites. Using the database from Trend Micro Web reputation technology, Pajares [83] reports that the number of phishing sites that use HTTPS connections increased significantly in 2014 compared to 2010 (shown in Figure 4). Attackers have become more cautious and attentive when designing phishing websites in order to evade existing phishing detection methods [1]. Some phishing groups have both the capability and the desire to perform more advanced phishing attacks. Avalanche (commonly known as the Avalanche Gang) is a criminal syndicate involved in phishing attacks [109]. In 2010, APWG reported that Avalanche was responsible for two-thirds of all phishing attacks in the second half of 2009, describing it as “one of the most sophisticated and damaging on the Internet” and “the world’s most prolific phishing gang” [10]. It has been discovered that Avalanche uses different techniques to evade the anti-phishing mechanisms.

Fig. 4. The number of phishing sites that use HTTPS. Reprinted from [83].

In addition, more and more sophisticated techniques are being used to implement phishing attacks. For example, the pharming attack, a refined version of phishing attacks, is designed to steal users’ credentials by redirecting them to fraudulent websites using DNS-based techniques [41], [58]. Many computer security experts predict that the use of pharming attacks will continue to grow as more criminals embrace these techniques [52].

B. Life Cycle of Phishing Detection

As mentioned in Section I, we do not incorporate phishing detection approaches that rely on user education due to their poor performance. In addition, we do not cover phishing detection methods that perform email filtering because it is a different detection theme that warrants a separate comprehensive study on its own. We reemphasize here that our focus is on the area of software-based phishing detection, which aims at detecting or blocking phishing websites.

The life cycle of software-based phishing detection is illustrated in Figure 5. Starting from the initial inputs, the detection scheme extracts phishing detection features (also called heuristics, as detailed in Section IV-B) and/or blacklists from various sources (e.g., URL related information, trusted third party, WHOIS server, etc.) via different feature mining approaches (e.g., search engines, target identification algorithms, etc.). Then, it applies different data mining algorithms and/or proposes various detection strategies to the engineered features to achieve its objectives (e.g., identifying phishing links, blocking phishing websites, etc.). To evaluate the performance of phishing detection schemes, various evaluation datasets are collected from different sources (e.g., PhishTank, Yahoo directory, etc.). Finally, leveraging the collected datasets and following various validation strategies (e.g., cross validation), the proposed scheme is evaluated based on multiple metrics (e.g., False Positive Rate, False Negative Rate, etc.).

Fig. 5. Life cycle of typical phishing detection schemes.

In the coming sections, following the life cycle of software-based phishing detection schemes, we present a comprehensive


study of the phishing detection research from 5 different perspectives, namely, classification of phishing detection techniques, validation datasets, detection features, detection techniques and detection criteria.

III. PHISHING DETECTION SCHEMES: TAXONOMY AND THE CORRESPONDING LITERATURE REVIEW

In phishing literature, software-based phishing detection schemes are usually categorized into heuristic and blacklist based schemes [49], [59]. Heuristic-based approaches examine contents of the Web pages including: (1) surface level content (e.g., the URL); (2) textual content (e.g., terms or words that appear on a given Web page); (3) visual content (e.g., the layout, and the block regions etc.) [122]. These methods can detect phishing attacks as soon as they are launched but also introduce relatively high false positive rates (FPR). Blacklist-based approaches have a higher level of accuracy. However, they do not defend against zero-hour attacks [49], [115]. Combinations of heuristic and blacklist based approaches provide more robust and flexible defense against phishing attacks than either one on a standalone basis.

In this paper, we classify phishing detection approaches as either public phishing detection toolbars or academic phishing detection/classification schemes. Phishing detection toolbars use blacklists and/or selected heuristics to identify phishing websites. There is usually little information about what heuristics these toolbars use and how they are used. Academic phishing detection solutions are similar to phishing detection toolbars, but usually apply more complex technologies and are usually not available/feasible for public use. Most academic phishing classification schemes feed combinations of heuristic features into various data mining algorithms to enhance the classification accuracy. Table IV summarizes the differences between phishing detection toolbars and academic phishing detection/classification schemes. Note that the “scheme details” column in Table IV estimates the amount of publicly available details about detection schemes, such as detection methodology, data mining algorithms, and datasets.

TABLE IV: SUMMARY OF DIFFERENCES BETWEEN PHISHING DETECTION TOOLBARS AND ACADEMIC PHISHING DETECTION/CLASSIFICATION SCHEMES

Furthermore, based on the heuristic/blacklist classification, we further classify the academic phishing detection approaches into more specific and fine-grained sub-categories, namely, (1) heuristic: URL based methods; (2) heuristic: page content based methods; (3) heuristic: visual similarity based methods; (4) heuristic: other methods; (5) blacklist based methods; (6) hybrid methods. Details about each category are introduced in Section IV-B.

A. Public Phishing Detection Toolbars

Many freely available anti-phishing toolbars offer detection and blocking services against Internet phishing attacks. These toolbars typically come in the form of Web browser extensions (i.e., default extensions or third party extensions) that warn users about a suspicious phishing site after clicking on its URL.

Publicly available anti-phishing toolbars are either embedded in the browser as default extensions (e.g., Microsoft SmartScreen Filter [112]) or can be downloaded from third party websites (e.g., Netcraft [3]). They both display security warnings on screen when certain actions are triggered in the browser. These security warnings can be classified into two types [59]:
• Passive warnings: Passive warnings display various information (e.g., user ratings, site suggestions, etc.) about the website that is currently being visited but do not block the content of the website, as depicted in Figure 6.
• Active warnings: Active warnings display warning information about the website a user is trying to visit and block the content of the website, as depicted in Figure 7.

Many studies have shown that the majority of Web service users ignore security warnings provided by anti-phishing toolbars [31], [35], [116]. Furthermore, Egelman et al. [36] found that active warnings are much more effective than passive warnings (79% of participants paid attention to active warnings while only 13% of participants paid attention to passive warnings). Table V summarizes the information gathered about the state-of-the-art anti-phishing toolbars. In the following paragraphs, we discuss the details of those toolbars:

Google Safe Browsing: It uses a browser to check URLs against Google’s constantly updated blacklist of unsafe Web resources (e.g., phishing websites) [50] and provides active warnings to the end users. According to Google Safe Browsing’s website, for different platforms and threat types, it examines pages against the safe browsing lists. It also issues reminders before users access risky links.

McAfee SiteAdvisor: This is a Web application that reports on the identity of websites by scanning them for potential malware and spam [111]. The detection result is decided according to a combination of heuristics and manual verification, such as the age and country of the domain registration, the number of links to other known-good sites, third-party cookies, and user reviews [30]. In addition, it provides passive warnings.

Netcraft Anti-Phishing Toolbar: Provides Internet security services including anti-fraud and anti-phishing services, application testing and PCI scanning [113]. According to its website, Netcraft’s toolbar screens and identifies the deceiving contents in URLs. It also ensures that the navigational controls (e.g., toolbar and address bar) are activated in order to prevent pop-up windows (particularly for Firefox). In addition, it shows the geographic information of the hosting location of the sites and analyzes fraudulent URLs (e.g., the real


citibank.com or barclays.co.uk sites have little possibility to be located in the former Soviet Union [3]).

TABLE V: INFORMATION ABOUT SELECTED STATE-OF-THE-ART ANTI-PHISHING TOOLBARS

Fig. 6. Passive warnings from Netcraft anti-phishing toolbar. Reprinted from: http://toolbar.netcraft.com/.

SpoofGuard: A heuristics-based anti-phishing toolbar developed for Internet Explorer with passive warnings. The heuristics used include (1) Domain name check: examines if the domain name for the attempted URL matches recent entries; (2) URL Check: checks if the username, the port number, as well as the domain name, are suspicious; (3) Email Check: determines whether the current URL directs to the browser via email; (4) Password Field Check: determines if input fields of type “password” are located in the document; (5) Link Check: searches for risky links in the body of the document; (6) Image Check: analyzes the images of the new site vs. the previous sites; (7) Password Tracking: prevents the user from typing the same username and password for multiple sites [63].

Microsoft SmartScreen Filter: A blacklist-based phishing and malware filter implemented in several Microsoft browsers, including Internet Explorer and Microsoft Edge [112]. As the user browses, SmartScreen monitors pages and assesses whether each one is suspicious. If so, it issues an active warning before the next step is taken and solicits feedback from the user. SmartScreen also maintains a list of reported phishing and malicious software sites. It screens the list to check if a match is found. In that case, it issues a warning message while blocking the site for the user’s safety. In addition, security checks are also performed when the user starts a download from the site. Moreover, SmartScreen compares the download to a list of existing downloads by other users. A warning is issued if it is a brand new download.

EarthLink Toolbar: Helps to protect the user from on-line scams by displaying a security rating (i.e., passive warning) for all the websites the user visited previously. Additionally, it alerts the user on any attempt to access a previously known fraudulent website. It appears to rely on a combination of heuristics, user ratings, and manual verification [30].

eBay Toolbar: Helps buyers and sellers with real time alerts and keeps users safe from spoofing and fraudulent attacks by detecting fake sites via a combination of heuristics and blacklists through passive warnings [30].

GeoTrust TrustWatch Toolbar: Provides a website verification service that alerts users to potentially unsafe or phishing Web sites based on the information of several third-party reputation services and certificate authorities via passive warnings [42]. TrustWatch notifies the users that the website has passed the verification scan based on a list of disreputable sites. It would also recommend additional caution when inputting

2804 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 19, NO. 4, FOURTH QUARTER 2017

Fig. 7. Active warning from Google Safe Browsing. Reprinted from: https://googleblog.blogspot.com/2015/03/protecting-people-across-web-with.html.

sensitive information to the website. Furthermore, it blocks the initial attempt when visiting potentially unsafe websites and warns users in case of a risk in revealing information to the site.

Web of Trust (WOT): A browser extension that tells users which websites they can trust via active warnings [42]. It ensures the user’s Internet safety from scams, malware, rogue Web stores and dangerous links based on community ratings and reviews.

B. Academic Phishing Detection/Classification Schemes
Unlike the public anti-phishing toolbars, which aim at providing real-time warnings about the legitimacy of visited websites, academic phishing detection and classification schemes normally focus on improving the detection accuracy and reducing the number of false alerts by employing sophisticated technologies and various machine learning algorithms. Table VI shows the time-based (from 2005 to 2016) development of 41 selected academic phishing detection/classification approaches. In order to choose the most representative studies, in this paper, we comply with the following criteria based on state-of-the-art literature:
• Pioneering: Research that introduces new ideas or methods to the literature.
• Attention: Research that receives more attention in terms of the number of citations.
• Completeness: Research that presents its work following the entire life cycle of phishing detection in depth.

Based on the proposed criteria, all of the 41 selected works are introduced in the following sections and twelve representative studies are chosen as examples to illustrate the detailed detection methodology in each category. They are listed in Table VI and introduced below.

Visual similarity based methods: Chen et al. [27] describe a novel heuristic anti-phishing system that explicitly employs gestalt and decision theory concepts to model perceptual similarity. More specifically, they apply a logistic regression algorithm to a set of normalized page content features. The proposed scheme can achieve 100% true positive rate and 0.74% false positive rate.

The most representative work in this category is done by Fu et al. [38]. They propose an effective phishing website detection approach via visual similarity assessment based on Earth Mover’s Distance (EMD) [47]. The detection process contains two phases, namely, generating the signature of Web pages and computing the visual similarity score from EMD.

The Web page processing phase (i.e., generate the signature) contains three steps: (1) obtain the image of a Web page from its URL using the Graphic Device Interface (GDI) API; (2) perform image normalization (the normalized image size is 100 x 100, and the Lanczos algorithm [93] is used to resize the image); (3) transform the Web page image into a visual signature. The signature is comprised of the image color tuple using the [Alpha, Red, Green, and Blue] (ARGB) scheme and the centroid of its position in the image.

The second step is to compute the EMD between the visual similarity signatures of the two Web pages (legitimate site


TABLE VI
TIME-LINE BASED DEVELOPMENT OF PHISHING DETECTION SCHEMES FROM 2005 TO 2016

and phishing site). Firstly, the normalized Euclidean distances of the degraded ARGB colors and of the centroids are computed. Then the two distances are added up with their corresponding weights (i.e., p and q, p + q = 1). The normalized feature distance between $\phi_i$ and $\phi_j$ is defined as:

$$d_{ij} = ND_{feature}(\phi_i, \phi_j) = p \cdot ND_{color}(dc_i, dc_j) + q \cdot ND_{centroid}(C_{dc_i}, C_{dc_j})$$

where $\phi_i = \langle dc_i, C_{dc_i} \rangle$, $dc = \langle dA, dR, dG, dB \rangle$ is the color tuple, and $C_{dc}$ is the centroid value. Suppose we have signature $S_{s,a}$ and signature $S_{s,b}$; the EMD between $S_{s,a}$ and $S_{s,b}$ can be calculated as:

$$EMD(S_{s,a}, S_{s,b}) = \frac{\sum_{i}\sum_{j} f_{ij} \cdot d_{ij}}{\sum_{i}\sum_{j} f_{ij}}$$

where $f_{ij}$ is the flow matrix calculated through linear programming [47]. Note that if EMD = 0, the two images are identical; if EMD = 1, they are completely different.

Finally, the EMD-based visual similarity of two images is defined as:

$$VS(S_{s,a}, S_{s,b}) = 1 - \left[EMD(S_{s,a}, S_{s,b})\right]^{\alpha}$$

where $\alpha \in (0, +\infty)$ is an amplification factor that limits the skewness of the visual similarity distribution in the (0, 1) range.

Large-scale experiments with 10,281 suspected Web pages are carried out and the proposed scheme achieves 0.71% false positive rate and 89% true positive rate.

Similar works based on visual similarity include [15], [25], [26], [39], [46], [70], [76], [77], [94], [106], and [122].

Page content based methods: Zhang et al. [124] propose CANTINA, a novel content-based approach for detecting phishing Web sites based on the Term Frequency/Inverse Document Frequency (TF-IDF) information retrieval metric. In addition, using some heuristics, the false positive rate is reduced. Generally, CANTINA works as follows:
1) CANTINA calculates the TF-IDF scores of each term of the content in the given website.
2) CANTINA generates a lexical signature by taking the five terms with highest TF-IDF weights.
3) CANTINA sends the lexical signature to a search engine (i.e., in their case, Google Search).


4) If the domain name of the current website matches the domain name of the top N search results, it is considered to be a legitimate website. Otherwise, it is concluded to be a phishing site. Note that the value of N affects the false positives.

CANTINA with TF-IDF alone results in a relatively high false positive rate. Therefore, several heuristics are used to reduce the false positive rate, including:
• Age of Domain: it examines the age of the domain name. If the domain has been registered for more than 12 months, the heuristic returns +1 (i.e., legitimate), otherwise it returns -1 (phishing).
• Known Images: it examines whether a page contains inconsistent well-known logos.
• Suspicious URL: it examines if the URL contains an “@” or a “-” in the domain name.
• Suspicious Links: for each link in the webpage, it performs the above three URL checks.
• IP Address: it examines if the URL contains an IP address.
• Dots in URL: it examines the number of dots in the URL.
• Forms: it examines if a Web page contains any HTML text entry form requesting sensitive personal data (e.g., password).

In addition, CANTINA uses a simple forward linear model to make the decision:

$$S = f\left(\sum_{i} w_i \cdot h_i\right)$$

where $h_i$ is the result of each heuristic, $w_i$ is the weight of each heuristic, and $f$ is a simple threshold function:

$$f(x) = 1 \text{ if } x > 0, \quad f(x) = -1 \text{ if } x \le 0.$$

Here, 1 means a legitimate site and -1 means a phishing site. The proposed scheme could achieve 97% true positive rate while maintaining 1% false positive rate.

In 2011, Xiang et al. [118] extended the work of Zhang et al. [124] by proposing CANTINA+, which is claimed to be the most comprehensive feature-based approach in the literature. It exploits the HTML Document Object Model (DOM), search engines and third party services with machine learning techniques to detect phishing. It has been shown to achieve 0.4% false positive rate and over 92% true positive rate.

Similar works based on page content include [14], [89], and [119].

URL based methods: Garera et al. [40] claim that it is often possible to tell whether or not a URL belongs to a phishing attack without requiring any knowledge of the corresponding page data. By applying several selected features (i.e., page rank, domain name white list and URL based features) to a logistic regression learning algorithm, the proposed scheme is efficient and has a high accuracy.

The most representative work in this category is done by Ma et al. [72]. They propose a phishing detection approach to automatically classify URLs based on different data mining algorithms across both lexical and host based URL features.

The lexical features selected in this method include the length of the hostname, the length of the entire URL, as well as the number of dots in the URL. In addition, the authors create a binary feature for each token in the hostname (delimited by “.”) and in the path URL (strings delimited by “/”, “?”, “.”, “=”, “-” and “_”). The host-based features contain: (1) IP address properties (e.g., is the IP address in a blacklist?); (2) WHOIS properties (e.g., the date of registration, update, and expiration); (3) Domain name properties (e.g., the time-to-live (TTL) value for the DNS records associated with the hostname); (4) Geographic properties (e.g., the continent/country/city that the IP address belongs to).

All the features of the URL are encoded into high dimensional feature vectors and then different types of classifiers are applied to them. Here are some examples of the classifiers:
• Naive Bayes: Let x denote the feature vector and y ∈ {0, 1} denote the label of the website, with y = 1 for malicious and y = 0 for legitimate ones. P(x|y) denotes the conditional probability of the feature vector given its label. Then, assuming that malicious and legitimate websites are equally probable, the posterior probability that the feature vector x belongs to a malicious URL is computed as:

$$P(y = 1 \mid x) = \frac{P(x \mid y = 1)}{P(x \mid y = 1) + P(x \mid y = 0)}$$

Finally, the right hand side of the equation is thresholded to predict the binary label of the feature vector x.
• Support Vector Machine (SVM): The decision using SVMs is expressed in terms of a kernel function $K(x, x')$ that computes the similarity between two feature vectors and non-negative coefficients $\alpha_i$ that indicate which training examples lie close to the decision boundary. SVMs classify new examples by computing their distance to the decision boundary:

$$h(x) = \sum_{i=1}^{n} \alpha_i (2y_i - 1) K(x_i, x)$$

where h(x) is thresholded to predict a binary label for the feature vector x.
• Logistic Regression: LR classification is based on the distance from a hyperplane decision boundary [71]. The decision function is $\sigma(z) = [1 + e^{-z}]^{-1}$, which converts these distances into probabilities that feature vectors have positive or negative labels. The conditional probability that feature vector x has a label y = 1 is:

$$P(y = 1 \mid x) = \sigma(w \cdot x + b)$$

where w (i.e., the weight vector) and b (i.e., the scalar bias) are parameters computed based on the training data. Finally, the right hand side of the equation is thresholded to decide the label of the feature vector x.

The proposed scheme can achieve 0.1% false positive rate and 92.4% true positive rate.

Similar works based on URL related features include [20], [22], [37], [66], and [74].

Blacklist based methods: PhishNet [87] is a predictive blacklisting scheme to detect phishing attacks. Traditional blacklist approaches (i.e., exact match with the blacklisted entries) are easy for attackers to evade. Instead, PhishNet uses


five heuristics (i.e., top-level domains, IP address, directory structure, query string, brand name) to compute simple combinations of blacklisted sites to discover new phishing sites. Also, it proposes an approximate matching algorithm to determine whether a given URL is a phishing site or not. PhishNet consists of two major components, namely, component I: predicting malicious URLs and component II: approximate matching.

The basic idea of component I is to combine different URL heuristics of known phishing URLs from a blacklist (i.e., PhishTank database) to generate new phishing URLs. These five URL heuristics include: (1) top-level domains (TLDs): by changing the TLDs of known blacklist entries, a list of new URLs can be obtained; (2) IP address: the predicted new phishing sites are obtained by enumerating all the combinations of the hostnames and pathnames of the known blacklisted websites with the same IP address; (3) directory structure: the idea is that two URLs sharing a common directory structure (e.g., www.abc.com/online/signin/paypal.htm and www.xyz.com/online/signin/ebay.htm) may have similar sets of file names. Therefore, the predicted new URLs are www.abc.com/online/signin/ebay.htm and www.xyz.com/online/signin/paypal.htm; (4) query string: starting from the observation that some URLs with the exact same directory structure differ only in query string (e.g., www.abc.com/online/signin/ebay?XYZ and www.xyz.com/online/signin/paypal?ABC), two new URLs, www.abc.com/online/signin/ebay?ABC and www.xyz.com/online/signin/paypal?XYZ, are created; (5) brand name: the intuition here is that phishers often target multiple brand names using the same URL structure method. Therefore, the predicted URLs are obtained by changing the brand names embedded in the known phishing URLs.

After obtaining the whole set of the predicted URLs, PhishNet first performs a DNS lookup to filter out sites that cannot be resolved. Then it conducts a content similarity check (i.e., using an online tool at http://www.webconfs.com) between the known phishing URLs and the corresponding predicted URLs. The predicted URL is concluded to be a phishing site if the similarity score exceeds a certain threshold.

The second component performs an approximate match of a given URL to determine whether it is a phishing site or not. It first breaks the input URL into four different entities: IP address, hostname, directory structure and brand name. Then it assesses each entity by matching with the corresponding part of the known phishing URLs to generate an evaluation score. If the score is higher than a certain threshold, it is considered to be a phishing URL.

About 18,000 new phishing URLs are discovered from a set of 6,000 new blacklist entries. The proposed scheme achieves 3% false positive rate and 95% true positive rate.

Similar works based on blacklist/white-list include [23] and [96].

Hybrid methods: Whittaker et al. [108] use a logistic regression classifier to maintain Google’s phishing blacklist automatically by examining the URL and the contents of a page. The proposed scheme correctly classifies more than 90% of phishing pages several weeks after training concludes.

Marchal et al. [75] develop a phishing detection system that requires very little training data, is language-independent, is resilient to adaptive attacks, and is implemented entirely on the client side. The proposed target identification algorithm is faster than previous works and can help reduce false positives. The proposed scheme achieves 0.5% false positive rate and 99% true positive rate.

The most representative work in this category is Monarch [102], a real-time system that determines whether the submitted URL is spam or not. The authors deploy a real implementation to demonstrate its scalability, accuracy, and run time performance. Monarch consists of four components: (1) URL aggregation: it accepts URL submissions from a number of major email providers and Twitter’s streaming API; (2) Feature collection: it visits a URL via the Firefox Web browser to collect page content; (3) Feature extraction: it transforms the raw data generated from the feature collection component into a feature vector (e.g., transforming URLs into binary features and converting HTML content into a bag of words [110]); (4) Classification: feature vectors are applied to a proposed distributed logistic regression classifier for classification. The selected features in [102] are represented by a combination of URL based features, page content based features, whitelist and other features (e.g., routing data), including:
• Initial URL and Landing URL: domain tokens, path tokens, query parameters, number of sub-domains, length of domain, length of path, length of URL.
• Redirects: number of redirects, type of redirect.
• Sources and Frames: URL features for each embedded IFrame links and sources links.
• HTML Content: tokens of main HTML, frame HTML, and script content.
• Page Links: URL features for each link, number of links, ratio of internal domains to external domains.
• JavaScript Events: number of user prompts, tokens of prompts.
• Pop-up Windows: URL features for each window URL.
• Plugins: URL features for each plugin URL.
• HTTP Headers: tokens of all field names and values.
• DNS: IP of each host, mailserver domains and IPs, nameserver domains and IPs.
• Geolocation: country code, city code of each IP.
• Routing Data: ASN/BGP prefix for each IP encountered.
• Whitelist: a whitelist of known good domains.

Logistic Regression (LR) with L1-regularization is chosen as the classifier. To predict the class label (y = −1 means non-spam, y = +1 means spam) of a URL’s feature vector x, a linear classifier characterized by weight vector w is trained. Given a set of n labeled training points $(x_i, y_i)$, i = 1:n, the training process is to find w that minimizes the following objective function:

$$f(w) = \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (x_i \cdot w)\right)\right) + \lambda \|w\|_1$$

The first component is the negative log likelihood of the training data as a function of the weight vector. The second component is the regularization which adds a penalty to the objective function [71].
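As a minimal sketch of the L1-regularized logistic regression objective described above, the following Python code evaluates f(w) and minimizes it on toy data. It reads the product inside the exponent as the dot product between x_i and w, which is an assumption on our part. The solver (proximal gradient descent with soft-thresholding), the function names, the step size, and the data are all hypothetical choices for illustration; they are not taken from [102].

```python
import math

def objective(w, X, y, lam):
    """f(w) = sum_i log(1 + exp(-y_i * (x_i . w))) + lam * ||w||_1."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * sum(a * b for a, b in zip(x_i, w))
        # Numerically stable evaluation of log(1 + exp(-margin)).
        if margin > 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    return total + lam * sum(abs(v) for v in w)

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit(X, y, lam=0.1, step=0.1, iters=500):
    """Proximal gradient descent; an illustrative solver, not Monarch's."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [0.0] * len(w)
        for x_i, y_i in zip(X, y):
            margin = y_i * sum(a * b for a, b in zip(x_i, w))
            # sigmoid(-margin), written with tanh to avoid overflow.
            s = 0.5 * (1.0 - math.tanh(margin / 2.0))
            for j, a in enumerate(x_i):
                grad[j] -= y_i * a * s
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(w_j - step * g_j, step * lam)
             for w_j, g_j in zip(w, grad)]
    return w

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 0.2], [0.9, 0.1], [0.8, 0.3], [-1.0, 0.2], [-0.9, 0.1], [-0.7, 0.3]]
y = [1, 1, 1, -1, -1, -1]  # +1 = spam, -1 = non-spam
w = fit(X, y)
```

In this toy run the weight of the uninformative second feature stays at zero, which illustrates why an L1 penalty is attractive when the feature vectors are very high dimensional and sparse.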


TABLE VII
SOURCES OF DATASETS FOR PHISHING AND LEGITIMATE WEBSITES

To perform the learning process over large-scale datasets in
real time, the data is divided into m shares and then processed
in a distributed manner (i.e., using Hadoop Spark [121]).
Monarch can achieve an overall accuracy of 91% with
0.87% false positives with a throughput of 638,000 URLs per
day. Similar works that use hybrid features include [13], [16],
[69], [79], [80], and [84].
Other methods: Ramesh et al. [90] present a phishing
website detection approach based on the phishing target identi-
fication. After obtaining the target domain name, the proposed
scheme performs third-party DNS look up for comparison to
decide the legitimacy of the suspicious page. The proposed
scheme achieves 0.32% false positive rate and 0.33% false
negative rate.
Similar works based on phishing target identification
include [101] and [107].
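The DNS comparison step used in target identification approaches such as [90] can be illustrated with a short Python sketch. This is only one plausible reading of that step: it treats a suspicious page as consistent with its identified target when the two domains resolve to at least one common IP address. The function names are hypothetical, and the default resolver uses the local system resolver rather than a third-party DNS service, which is a simplification.

```python
import socket
from typing import Callable, Set

def resolve_ips(domain: str) -> Set[str]:
    """Resolve a domain to its IPv4 addresses using the system resolver."""
    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
        return set(socket.gethostbyname_ex(domain)[2])
    except OSError:
        # Unresolvable domains yield an empty set.
        return set()

def matches_target(suspicious_domain: str,
                   target_domain: str,
                   resolver: Callable[[str], Set[str]] = resolve_ips) -> bool:
    """Return True if both domains share at least one resolved IP address."""
    suspicious_ips = resolver(suspicious_domain)
    target_ips = resolver(target_domain)
    return bool(suspicious_ips & target_ips)
```

Passing the resolver as a parameter keeps the comparison logic testable without network access and makes it easy to substitute a trusted third-party DNS client.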

IV. PHISHING DETECTION SCHEMES: A SYSTEMATIC STUDY FROM DIFFERENT PERSPECTIVES
In this section, we perform a systematic review of the
software based phishing detection schemes from different
perspectives including evaluation datasets, detection features,
detection techniques and evaluation metrics.

A. Evaluation Datasets
The evaluation is tightly coupled with the ground truth
datasets employed by the various approaches. Different
approaches collect ground truth from different cyber intel-
ligence sources. Such sources may employ different testing
methodologies and target different types of phishing activities,
and hence cover different phishing domains. That is, evaluation
based on one dataset may differ from that based on another.
Therefore, we argue that having publicly available reference datasets is crucial for systematizing the evaluation of various approaches: it is an important step towards providing a benchmark to compare and contrast the efficiency of various approaches, and it can help researchers to further advance the area in a more systematic way. The absence of reference sets, combined with difficulties in sharing code, makes it hard to repeat experiments for systematic comparison of effectiveness. In the following, we list the identifying features of the datasets used in the literature:
• Dataset source: Table VII lists the commonly used data sources of phishing websites and legitimate websites, together with the approaches that leverage each source. There is no common consensus on the quality of the different sources due to the lack of knowledge about the methodologies used in compiling and maintaining each source.
• Dataset size: the evaluation dataset size varies a lot among different approaches. Generally speaking, the larger the dataset, the more credible the results.
• Dataset redundancy: Datasets, especially those of phishing websites, usually contain repeated entries due to multiple submissions and overlap among different sources. However, little information is provided about dataset redundancy in the literature.
• Dataset timeliness: Phishing websites tend to have a very short lifetime. Therefore, phishing blacklist providers usually update information on hourly, daily or weekly schedules. Even if two schemes use the same data source with the same dataset size, they may contain different phishing website information.
• Ratio of legitimate to phishing websites: the ratio of legitimate to phishing instances shows the extent to which the experiments represent a real world distribution (≈ 100/1).
• Training set to testing set ratio: the ratio of training to testing instances indicates the scalability of the approach.

In Section V, we use these aspects to perform a systematic and comprehensive evaluation of the various phishing detection approaches.

B. Phishing Detection Features
1) Most Commonly Used Phishing Detection Features: In this section, we summarize the most commonly used features by various phishing detection approaches. Even though the


Fig. 8. Most commonly used phishing detection features.

listing of atomic features presented here is not exhaustive, it includes the popular features used in most of the state-of-the-art phishing detection approaches. Figure 8 summarizes the features.

(i) URL-based lexical features: URLs are rich in lexical features that have been widely used in various phishing detection approaches [22], [108], [118], including:
• URL replaced with IP address: Some phishing websites do not use host-names, but rather use the IP address directly to locate the fake site. Such behavior is normally employed either to obfuscate the legitimate URL or simply to reduce cost.
• URL Length: Phishing websites usually have longer URLs compared to legitimate websites.
• Number of dots and sub domains: Phishing URLs often contain more “dots” and sub-domains compared to legitimate ones.
• Number of re-directions: Malicious URLs often have multiple URL redirects in order to evade detection by blacklists.
• Use of HTTPS protocol: Legitimate websites often use the HTTPS protocol, while phishing sites usually do not.

(ii) URL-based host features:
• WHOIS information: WHOIS is a query and response protocol that is widely used for querying databases that store registration information about websites [51], [72].
– Registered information about domain names: For most of the observed phishing sites, either the registration record is not available in WHOIS databases or the claimed identity is not accurate in the record.
– Age of Domain: Many of the observed phishing websites have domains that are registered only a few days before phishing emails are sent out, that is, phishing domains are likely to be short lived.
• Geographic information: Geographical location is one of the most commonly used indicators in detecting phishing because phishing websites are likely to be hosted in locations different from those of legitimate websites [3]. For example, Netcraft [3] provides location information (i.e., IP-based country information) to help in identifying fraudulent URLs: the real bankofamerica.com is unlikely to be hosted in Russia.
• Domain name similarity: A measure of the similarity between a potential phishing domain name and a target domain name. The similarity can be measured in many ways. For example, it can be measured based on the Edit Distance between the two domains [28]. The Edit Distance (a.k.a., Levenshtein distance) is the minimum number of single-character insertions, deletions, or substitutions needed to transform one domain into another. The smaller the number of edits, the higher the similarity.
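The edit distance feature can be illustrated with a minimal Python sketch. It implements the standard Levenshtein distance, which counts single-character insertions, deletions, and substitutions. The normalization into a similarity score in [0, 1] and the function names are assumptions made for illustration; they are not prescribed by [28].

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b via dynamic programming."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def domain_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical names, 0.0 for maximal distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For instance, `edit_distance("paypal.com", "paypa1.com")` is 1, since a single substitution turns one name into the other, so such a look-alike domain scores as highly similar to its target.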


Fig. 9. ROC curves of FPR-TPR for different URL based features and URL sets.

(iii) Website page content: The textual content of the phishing website can be used to determine the identity of the target website.
• Page in top search result: Xiang et al. [118] proposed a content-based approach to detect phishing websites based on the TF-IDF information with the help of the Google search engine, as follows:
– TF-IDF of each term on a suspected Web page is calculated.
– Top 5 terms with highest TF-IDF values are selected.
– The top 5 terms are submitted to a search engine and the domain names of the first n returned entries are stored.
– If the suspected domain name is found within the n returned results, the site is considered legitimate.
• Page rank: PageRank is a link analysis algorithm first used by Google, in which each document on the Web is assigned a numerical weight from 0 to 10, with 0 indicating “least popular” and 10 representing “most popular”.
• Page style & contents: Aburrous et al. [13] propose several page style and content heuristics including: spelling errors, copying website, using forms with submit button, using Pop-Up windows, and disabling Right-Click.
• Bad forms: Phishing attacks are usually accomplished through HTML forms. This feature checks if a page contains potentially harmful HTML forms.
• Non-matching links: Links on phishing sites are usually meaningless or contain URLs of the target legitimate sites. Therefore, this feature examines all the links in the HTML, and checks if the most frequent domain coincides with the page domain [118].

(iv) Website visual similarity:
• Text, image and overall similarity: Medvet et al. [77] present a technique to visually compare a suspected phishing page with the legitimate one via a set of visual features. These features include (i) each visible text section with its visual attributes, (ii) each visible image, and (iii) the overall visual look-and-feel (i.e., the larger composed image) of the Web page visible in the viewport (i.e., the part of the Web page that is visible in the browser window).
• Dominant color and its centroid coordinate: Fu et al. [38] first convert Web pages into low resolution images and then use the dominant color category and its corresponding centroid coordinate to represent the image signatures. Finally, they use Earth Mover’s Distance (EMD) to calculate the signature distances of the images of the Web pages (i.e., legitimate vs. phishing).

2) Network Round Trip Time (NRTT): A New Phishing Detection Feature: In this section, we propose a new phishing detection feature based on the Network Round Trip Time, dubbed NRTT. NRTT has been introduced in [56] as a reliable and robust second Web authentication factor. NRTT simply captures the network round trip time that packets take in their journey from one Internet connected host to another and back to the original sender. We propose here a two-phase approach based on NRTT to detect phishing Web sites. In the first phase, the targeted victim site is identified through content analysis of the phishing website, URL characteristics, spam email content analysis, or a combination of them. Victim website identification is well studied in the literature (e.g., [70], [75], and [107]), and hence we capitalize on the state-of-the-art mechanisms to build our victim identification component. In the second phase, we compare the average NRTT values of both the phishing website and the targeted victim website from a certain vantage point. If the difference between the two NRTT averages is greater than a threshold, the link is highly likely a phishing link. The intuition here is that the phishing website is likely to be hosted on a server different from that of the victim website, and hence, they will experience different average NRTT.

To account for transient network instabilities, NRTT is measured by sending multiple packets (called profiling signals in [56]) from the vantage point to the target website, which acknowledges each packet back to the vantage point. The number of the profiling signals is optimized based on the trade-off between the accuracy of the measurements on one hand and the bandwidth and the delay overhead on the other hand. Based on the observation that NRTT follows a Gaussian distribution [56], [62], Khalil et al. [56] theoretically show that 27 profiling signals are sufficient to create reliable NRTT.

More importantly, NRTT should be robust enough to resist manipulation attacks and should cope with different kinds of network instabilities to avoid false positives. These NRTT challenges have been well studied and addressed in [56], where the authors define and address three types of network instabilities that may affect the authentication accuracy:

Instantaneous instabilities cause transient changes in NRTT and hence, they only affect a few of the profiling signals. This type of instability is addressed by outlier filtering based on median absolute deviation [67].

Long-term instabilities are instabilities that persist long enough to affect all or most of the profiling signals yet are not permanent. These instabilities are mainly caused by traffic congestion at the local network segment connecting to the network

backbone. This type of instability is addressed by measuring NRTT from different vantage points as detailed in [56].
Routing instabilities may result in permanent changes in network communication latency due to, for example, permanent network routing changes. It has been shown in many previous works [29], [64], [65], [91], [95] that only a small portion of the Internet is responsible for the vast majority of the routing instabilities and that these routing changes exhibit a strong temporal periodicity, despite the growth of the Internet.
Leveraging NRTT for our problem, that is, distinguishing phishing from legitimate websites, is much more practical than the application envisioned by Khalil et al. [56], that is, Web authentication: (i) Identifying phishing websites does not have the limitation and the concern of mobile clients, not only because Web servers are static but also because NRTT is only computed and compared on the fly for the two websites (the suspected phishing website and the target website). (ii) No reference profiles are maintained and stored at the vantage point. (iii) The unsolved routing instabilities mentioned above do not exist in our case. For Web authentication, the reference profile and the real-time profiles are measured at different times. That is, it is possible that permanent route changes occur between measuring the reference and real-time profiles, which may severely affect the efficiency. On the other hand, the NRTTs of both the phishing and the legitimate website are measured at the same time in real time, and hence permanent instabilities are not a concern. (iv) Long-term instabilities are also not a concern in our problem. This is because local network congestion does not apply in the case of Web hosting servers, compared to Web clients who may have poor network connections. Additionally, NRTT signals are sent at the same time for the phishing and legitimate websites; that is, the instabilities affect both, and hence the difference between the two remains unchanged.
In order to demonstrate the effectiveness of NRTT as a phishing detection feature, we perform a set of experiments to evaluate the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) among different selected features and different feature sets (including NRTT). TPR measures the proportion of positives that are correctly identified (i.e., the percentage of phishing sites which are correctly identified):
$$\mathrm{TPR} = \frac{\text{\# of correctly detected phishing}}{\text{Total \# of phishing}}$$
FPR measures the proportion of negatives that are incorrectly identified (i.e., the percentage of legitimate sites which are wrongly identified as phishing sites):
$$\mathrm{FPR} = \frac{\text{\# of wrongly detected legitimate}}{\text{Total \# of legitimate}}$$
Figure 9 shows the ROC (Receiver Operating Characteristic) curves of FPR vs. TPR for different URL based features, including Length of the URL, Number of dots in the URL, Number of re-directions of the URL, and the URL set (i.e., the combination of all of the 3 URL features plus the binary features: usage of HTTPS protocol and IP address in the URL). It shows that the URL set alone could achieve about 90% TPR with 2% FPR.
Figure 10 shows the ROC curves of FPR vs. TPR for different feature sets, including: WHOIS set, URL set, Web of trust score and the combination of all of the three sets “all 3 sets” (i.e., the combination of WHOIS set, URL set and Web of trust score). The WHOIS set contains two features, namely, the age of the domain and the existence of the registering information in the WHOIS database. The Web of trust score is provided by SEO (search engine optimization) that collects all website ranking information based on Google, Bing, Yahoo, among others. The results show that the combination of all the selected feature sets can achieve about 93% TPR with 0.5% FPR.
Fig. 10. ROC curves of FPR-TPR for different feature sets.
Figure 11 shows the ROC curves of FPR vs. TPR for “other 3 sets” (i.e., the combination of WHOIS set, URL set and Web of trust score), NRTT and “all sets” (i.e., “other 3 sets” + NRTT). It clearly shows that, with the combination of all the features, the proposed scheme can achieve 99% TPR and 0.2% FPR.
Fig. 11. ROC curves of FPR-TPR for selected feature sets, NRTT and all feature sets.
The evaluation dataset contains 820 verified on-line phishing websites (redundancy reduced) collected from PhishTank
and 612 legitimate websites obtained from Alexa in September 2016.

C. Phishing Detection Techniques
In this section, we review the phishing detection/classification techniques and their supporting technologies (e.g., data mining algorithms, detection strategies, etc.). Table VIII summarizes the results. We can conclude that the Support Vector Machine (SVM), Logistic Regression (LR) and Bayesian-based classifiers are the most popular tools used to support phishing detection in the literature we covered.
We include feature mining algorithms in this category because they are increasingly used in recent detection techniques. For example, the TF-IDF algorithm (as introduced in Section III-B) is widely used in many page content based phishing detection schemes [90], [118], [124].

TABLE VIII
SUMMARY OF PHISHING DETECTION/CLASSIFICATION TECHNIQUES

D. Evaluation Metrics
In this section, we summarize the most commonly used evaluation metrics in the literature.
• Accuracy (ACC): The number of correctly identified phishing and legitimate websites divided by the total number of websites.
$$\mathrm{ACC} = \frac{\text{\# of correctly identified sites}}{\text{Total \# of sites}}$$
• True Positive Rate (TPR): The number of phishing sites correctly identified divided by the total number of phishing sites. It is also known as hit rate, recall (R), and sensitivity in different parts of the literature.
• False Positive Rate (FPR): The number of legitimate sites that are wrongly identified as phishing sites divided by the total number of legitimate sites. It is also known as false alarm rate and Type I error in some parts of the literature.
• True Negative Rate (TNR): The number of correctly identified legitimate websites divided by the total number of legitimate websites.
$$\mathrm{TNR} = \frac{\text{\# of correctly identified legitimate}}{\text{Total \# of legitimate}}$$
• False Negative Rate (FNR): The number of phishing sites that are incorrectly identified as legitimate sites divided by the number of phishing sites. It is also known as miss rate and Type II error in some parts of the literature.
$$\mathrm{FNR} = \frac{\text{\# of phishing identified as legitimate}}{\text{Total \# of phishing}}$$
• Precision (P): The rate of correctly detected phishing sites in relation to all sites that were detected as phishing.
$$P = \frac{\text{\# of phishing correctly identified}}{\text{Total \# of sites detected as phishing}}$$
• F1 score: The harmonic mean of precision P and recall R.
$$F_1 = \frac{2 \, P \, R}{P + R}$$

Although two different phishing detection schemes may use the same evaluation metrics, we cannot simply compare the numerical results. We have to take other aspects into consideration when performing comparisons. For example, the use of different datasets may have a high impact on the results, as discussed in Section IV-A.
Based on our investigations, two of the most commonly used evaluation metrics are TPR and FPR. Due to the relatively high percentage of legitimate websites compared to the phishing ones, the latter metric is considered to be of high importance for practicality and usability reasons. Even a very small FPR may result in a large number of legitimate websites wrongly identified as phishing and hence diminish any potential benefit of the approach.

V. INSIGHTS AND FUTURE RESEARCH DIRECTIONS
Most of the phishing detection related work focuses on the detection quality but often overlooks other important angles such as datasets, features, practicality and performance. In this section, we provide a comprehensive view of all the important aspects that we have discussed as part of studying and evaluating phishing detection schemes.
More importantly, this section provides detailed takeaway lessons for researchers and practitioners in the area of phishing detection. This section is structured following the four aspects of phishing detection that have been discussed in the survey, namely: datasets, detection features, detection schemes, and evaluation metrics. According to our study, every technique (e.g., machine learning algorithms) used in the literature has its own advantages and disadvantages, and has to be carefully selected in order to optimize the goals of the detection approach.
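
As a concrete companion to the metric definitions in Section IV-D, the following minimal Python sketch computes ACC, TPR, FPR, TNR, FNR, P and F1 from confusion counts, with phishing treated as the positive class. It then applies one fixed operating point to two dataset mixes to illustrate why a small FPR still matters when legitimate sites dominate. The names `Confusion`, `metrics` and `expected_confusion` are hypothetical helpers introduced only for this illustration. The operating point (90% TPR, 2% FPR) is borrowed from the URL-set result reported above purely as an example, and the 100-to-1 mix follows the legitimate-to-phishing ratio cited later in this survey. The sketch assumes both classes are non-empty and that at least one site is flagged as phishing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts, with phishing as the positive class."""
    tp: int  # phishing sites flagged as phishing
    fn: int  # phishing sites labelled legitimate (misses)
    fp: int  # legitimate sites flagged as phishing (false alarms)
    tn: int  # legitimate sites labelled legitimate


def metrics(c: Confusion) -> dict:
    """Compute the Section IV-D metrics from raw counts."""
    phishing = c.tp + c.fn
    legitimate = c.fp + c.tn
    tpr = c.tp / phishing                 # recall R
    precision = c.tp / (c.tp + c.fp)      # P
    return {
        "ACC": (c.tp + c.tn) / (phishing + legitimate),
        "TPR": tpr,
        "FPR": c.fp / legitimate,
        "TNR": c.tn / legitimate,
        "FNR": c.fn / phishing,
        "P": precision,
        "F1": 2 * precision * tpr / (precision + tpr),  # harmonic mean of P and R
    }


def expected_confusion(tpr: float, fpr: float,
                       n_phishing: int, n_legitimate: int) -> Confusion:
    """Expected counts for a detector with the given TPR/FPR on a given class mix."""
    tp = round(tpr * n_phishing)
    fp = round(fpr * n_legitimate)
    return Confusion(tp=tp, fn=n_phishing - tp, fp=fp, tn=n_legitimate - fp)


if __name__ == "__main__":
    # Same detector, two class mixes: balanced vs. roughly 100 legitimate per phishing site.
    for label, n_legit in (("balanced 1:1", 1_000), ("skewed 100:1", 100_000)):
        m = metrics(expected_confusion(0.90, 0.02, 1_000, n_legit))
        print(label, {k: round(v, 3) for k, v in m.items()})
```

TPR and FPR are identical in both runs by construction, but precision and F1 fall sharply in the skewed mix because the number of false alarms scales with the number of legitimate sites. This is one reason results obtained on differently balanced datasets are not directly comparable.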

A. Dataset Selection
Datasets considerably affect the evaluation results. This evaluation criterion measures how easy it is to compile the datasets and to extract the necessary features from them. It also provides an understanding of the deployment environment. Different environments may call for different detection techniques. For example, in healthy environments where the vast majority of the websites are legitimate, schemes with low FPR are preferable over those with high TPR and relatively high FPR. In Section IV-A, we elaborate on the various important aspects relevant to datasets used in training and evaluating different phishing detection approaches.
One of the important challenges here is to compile a representative dataset that covers as much as possible of the behaviors of phishing attackers. It is relatively easy to collect phishing related datasets from a single organization. However, such datasets may only offer a limited local view of the threats. On the other hand, compiling datasets from multiple organizations could be challenging due to the potential leakage of sensitive information. Therefore, such efforts are usually hampered by restrictive legal clauses including non-disclosure agreements and regulations against sensitive information leakage. Even when such representative datasets are successfully compiled by some entities, it is generally difficult to share the datasets with the bigger phishing detection community. This not only adds to the difficulty of systematically comparing different phishing detection approaches, but also considerably limits the potential benefits of the datasets. The takeaway lesson here is that it is not always true that the bigger the dataset, the better the detection outcome, but rather, the more representative the dataset, the more comprehensive the features collected and the better the detection performance.
The timeliness of phishing detection datasets is another challenging issue. As mentioned earlier, the cyber space is dynamic by nature, a characteristic that has been extensively exploited by attackers to evade detection. Attackers frequently change behavior and adjust their attack models to evade detection. The most obvious examples include Domain Generation Algorithms (DGAs) and Fast Flux Service Networks (FFSNs). DGAs (e.g., [100]) create phishing websites, among others, that are only accessible for short time periods. This makes the identification of a phishing email, for example, by a known suspicious link that it contains useless, because such a link may continuously change, sometimes on a daily basis. FFSNs (e.g., [48]) exploit short TTL values to frequently change the IP addresses assigned to a specific domain, mainly to evade IP-based blocking. Such behavior makes, for example, IP-based phishing identification inefficient. Dynamic behavior of attackers also complicates the process of dataset collection as it limits the validity of such datasets to only short time periods. For example, a dataset of phishing URLs that was compiled one year ago may not be valid now because most of the listed URLs may no longer be in service.
One last issue is related to the ground truth datasets. It has been shown [88], [99] that blacklists, which are usually used to compile ground truth datasets, have high false positive and false negative rates. Such impurities may negatively impact the detection accuracy of the underlying phishing detection approaches. Therefore, we recommend always cross-checking the ground truth in multiple blacklists, or using majority voting to decide the inclusion of entries in the ground truth dataset.

B. Feature Selection and Engineering
Different phishing detection systems use different combinations of detection features. In this survey, we have discussed the most important detection features used by different schemes. However, the literature lacks systematic evaluation of these features in terms of the availability of the detection features, the time it takes to mine the features, the complexity of extracting the features, and the robustness of the features. For example, a feature that can be easily manipulated by adversaries should not be used irrespective of its effectiveness in detecting current phishing attempts. This is intuitive as adversaries can simply manipulate the feature to avoid detection.
Feature extraction (a.k.a., feature engineering) is a challenging task that has a significant impact on the quality (accuracy and robustness) of the underlying phishing detection approaches. Well-crafted features lead to more successful detection approaches, while poor features may even ruin good detection algorithms. Typically, approaches look for features that maximize the detection accuracy while ignoring the robustness of the features. Some features may have a strong positive impact on the detection accuracy; however, they may be controlled by attackers and hence can be easily manipulated without affecting the attacker utility. For example, the URL-based lexical features (e.g., length of the URL, usage of HTTPS protocol, etc.) can be easily manipulated to evade detection. More specifically, it is no longer uncommon to see phishing attackers use HTTPS, which makes such features less effective in identifying phishing attempts.
The most important takeaway here is that it is paramount for phishing detection approaches to carefully select the features that strike the right balance between detection accuracy and robustness in the face of potential manipulations. Our new NRTT feature is an example of a robust feature that not only provides excellent detection accuracy (as clearly shown by the experimental results), but also continuously adapts to changes in the underlying requirements. According to the work in [33], the design and measurement of NRTT aim at preventing attackers from being able to learn and mimic legitimate NRTTs, and hence the feature is robust. Additionally, the decision threshold specific to the feature is not fixed, but rather is adaptive and may change from the current measurement to the next one, depending on network conditions. More specifically, both the baseline NRTT and the NRTT of the claimed URL are measured and compared at the same time, and hence congestion or other network conditions do not affect the outcome of the comparison.
However, it is quite challenging to evaluate the robustness of a feature in a systematic and measurable way. The importance of the problem has been recognized by many researchers in other domains as well (e.g., [82], [118], and [123]). However,
to the best of our knowledge, none of the existing approaches provides a framework that can be used to quantitatively evaluate the robustness of features. Most of the approaches that recognize the problem only qualitatively discuss the robustness of some of the important features used in their approaches. Therefore, we believe that providing a framework that outlines qualitative and quantitative evaluation guidelines for the robustness of features is an open problem that calls for attention from the research community. Such a framework has to consider both the complexity of feature forging and its impact on attacker benefits. Such a framework will be an important tool in the face of ever-evolving attacks, as it helps researchers and practitioners to design phishing detection techniques leveraging features that are both adaptive and hard to manipulate without considerably affecting attack utility.
One additional issue to consider while designing the detection features of an approach is the time it takes to mine the feature. Some features could be extremely useful in identifying and detecting phishing attempts; however, they may take a relatively long time to compute, such as page reputation and visual appearance similarity. The use of such features may either result in user inconvenience due to service delays until computation completes, or may result in security risks in case the service is provided before computing the features.

C. Detection Schemes
As presented in Section IV-C, phishing detection systems use various data mining algorithms and detection approaches, each with its own advantages and disadvantages. Understanding the underlying data mining algorithms is important in evaluating the performance, the scalability and the robustness of phishing detection schemes.
When designing a phishing detection scheme, we recommend following the life cycle illustrated in Figure 5 in order to help fellow researchers obtain a comprehensive understanding of the proposed approach and to make it easy for future studies to conduct comparative evaluations. Specifically, a detection approach has to clearly state what it can and what it cannot do in terms of phishing detection and blocking, to avoid relying on the approach in scenarios where it may not be efficient. Additionally, details of dataset specifications in terms of content and volume that better support the approach should be clearly articulated and documented.
A very important lesson that we have learned is that, due to the dynamic nature of cyber attacks, the most reliable and efficient phishing detection approaches are those that can continuously adapt to cope with such dynamism. References [108] and [118] are examples of such dynamic approaches. Additionally, the robustness of phishing detection approaches is tightly coupled with the robustness of the features used by the approach. Therefore, an approach may result in excellent detection accuracy at the time of design or in its early deployment, but fail miserably later due to either changes in the dataset or deliberate manipulation of the features utilized by the approach.
Two main categories have been considered in the underlying technologies (e.g., feature mining, classification, etc.) of phishing detection approaches. The first category of approaches utilizes human expertise to identify phishing attempts, which has been implemented using various heuristics. However, such approaches require man-in-the-loop, and hence are too slow. They fail to handle large scale datasets and cannot cope with high data rates, frequent dataset changes, or adaptive attack behaviors. Therefore, machine learning technologies, which utilize data-driven algorithms, were introduced to help automate the learning process. Different machine learning algorithms are used. Support Vector Machine (SVM), Logistic Regression (LR), and Bayesian-based classifiers are among the most used algorithms in the literature.
Through our extensive investigation of the large body of phishing detection approaches, we learned that one size does not fit all and hence it is extremely difficult to recommend one machine learning algorithm over another. Each machine learning algorithm has its own strengths and weaknesses, which have to be carefully considered to optimize the goals of the detection approach. For example, SVM is considered among the most robust and accurate classification algorithms [117]. However, it has the drawback of being computationally inefficient, and hence may not be appropriate for large scale datasets or high data rates. On the other hand, LR is one of the most widely used statistical models for binary data [12]. However, it performs poorly when nonlinear relationships exist between feature sets. Furthermore, even though Bayesian-based classifiers are easy to construct and can be readily applied to large scale datasets [103], they assume independent features, and hence are very restrictive.
Recent research efforts leverage Deep Learning (DL) algorithms to improve the performance of phishing detection schemes. DL allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [102]. DL has been successfully applied in many research fields, such as speech recognition, visual object recognition, drug discovery and genomics. Therefore, we believe that DL could be a viable alternative to traditional machine learning algorithms (e.g., SVM, LR), especially when handling complex and large scale datasets.
Another important issue that we have identified through this survey is the absence of deep and systematic evaluation of the performance of phishing detection approaches. The vast majority of the approaches focus on evaluating and analyzing the detection accuracy, while they overlook the run-time performance of the approach. Some approaches may show acceptable performance during the design and test phases due to the relatively small datasets used during these phases. However, real world datasets are usually more complex and much larger, which may cause such approaches to perform poorly in real world applications. Systematic performance analysis can provide important guidelines to evaluate the scalability of detection approaches, which in turn can help in improving performance by considering, for example, distributed platforms and parallel algorithms.

D. Evaluation Metrics
In addition to evaluating the quality of the detection scheme in terms of FPR and TPR, it is also imperative to have
systematic evaluation of the effectiveness and the scalability of the underlying detection algorithms. Effectiveness metrics have been discussed in Section IV-D; however, we note that most of the work in the literature lacks evaluation of other performance aspects such as speed of detection, usability, and practical deployment, among others.
The majority of phishing detection approaches leverage machine learning concepts including clustering and classification techniques. Therefore, they adopt the evaluation metrics and strategies developed in this domain. However, as mentioned earlier, the cyber security domain is more challenging due to the adaptive nature of attackers. Therefore, the evaluation results during the design phase should be considered with caution, as they may not hold later. In other words, the design phase results are limited in time validity and scope, which calls for the phishing detection community to think about adaptive evaluation strategies that cope with the unique challenges in the cyber security domain. For example, researchers could first classify the dataset into different categories (e.g., by type, time period, country, etc.), then perform the evaluation over every category of the dataset to obtain more comprehensive and convincing results. Another important issue is the difficulty in providing comparative evaluations among different phishing detection techniques. This is mainly due to the lack of standard benchmarks, and the lack of reference datasets as a consequence of the dynamic nature of the attackers and the potential sensitivity of data, which restricts sharing.
Unfortunately, many of the above mentioned challenges across all the aspects continue to exist, and hence call for a collaborative effort among the research community to alleviate their negative impact on the effectiveness and coverage of phishing detection approaches.
Table IX shows the comparison results across the previous four evaluation dimensions. From the performance perspective, we can see that 11 out of 12 schemes focus on the evaluation of true positive rate, false positive rate or other equivalent evaluation metrics. This is mainly because TPR determines the detection capability of the scheme while FPR represents its negative effects. Thus, they together provide the most valuable performance information about the quality of different approaches.
PhishTank is the most dominant source for phishing websites because it provides a large, up-to-date and verified phishing list for free. Yahoo and DMOZ were competitors in providing legitimate website information. However, Yahoo closed its service at the end of 2014 for unknown reasons. Other favorite sources for legitimate websites include Alexa top sites and Google keyword searching. Although researchers try to use a larger number of datasets for more convincing evaluation results, few of them considered some fundamental aspects of the datasets, for example, the ratio of the number of legitimate websites to the number of phishing websites, which is about 100 to 1 in reality.
Blacklists are commonly used in the public phishing detection toolbars because they have the fastest response time. From Table IX, we can conclude that the most commonly used features (also with the best performance results) are URL based and page content based features. Also, studies that incorporate more features tend to have better performance results. The recent trend is to leverage the classifier itself to optimize the detection accuracy using a large number of various detection features.

TABLE IX
EVALUATION OF ACADEMIC PHISHING DETECTION SCHEMES

VI. CONCLUSION
In this paper, we provide a systematic study of existing phishing detection works from different perspectives. We first describe the background knowledge about the phishing ecosystem and the state-of-the-art phishing statistics. Then we present a systematic review of the automatic phishing detection schemes. Specifically, we provide a taxonomy of the phishing detection schemes, discuss the datasets used in training and evaluating various detection approaches, discuss the features used by various detection schemes, and discuss the underlying detection algorithms and the commonly used evaluation metrics. Finally, we provide recommendations that we believe will help guide the development of more effective phishing detection schemes and make it easy to compare and contrast various schemes.

REFERENCES
[1] (2016). Phishing Trends & Intelligence Report: Hacking the Human. [Online]. Available: https://info.phishlabs.com/pti-report-download
[2] The Alexa Top 500 Sites on the Web. Accessed: Nov. 21, 2016. [Online]. Available: http://www.dmoz.org/
[3] Anti-Phishing Extension: Netcraft. Accessed: Dec. 5, 2016. [Online]. Available: http://toolbar.netcraft.com/
[4] Anti-Phishing Working Group. Accessed: Nov. 15, 2016. [Online]. Available: http://www.antiphishing.org/
[5] Clean MX Malicious URL List. Accessed: Nov. 21, 2016. [Online]. Available: http://support.clean-mx.com/clean-mx/phishing.php?response=alive
[6] DMOZ—The Directory of the Web. Accessed: Nov. 21, 2016. [Online]. Available: http://www.dmoz.org/
[7] Malwarepatrol. Accessed: Oct. 31, 2016. [Online]. Available: https://www.malwarepatrol.net/open-source.shtml
[8] Millersmiles Spoof Email and Phishing Scams List. Accessed: Nov. 21, 2016. [Online]. Available: http://www.millersmiles.co.uk/scams.php
[9] SURBL URL Reputation Data. Accessed: Nov. 21, 2016. [Online]. Available: http://www.surbl.org/lists
[10] G. Aaron and R. Rasmussen, Global Phishing Survey: Trends and Domain Name Use in 2H2009, Anti-Phishing Working Group, Lexington, MA, USA, 2010.
[11] S. Abu-Nimeh and S. Nair, “Bypassing security toolbars and phishing filters via DNS poisoning,” in Proc. IEEE Glob. Telecommun. Conf. (GLOBECOM), New Orleans, LA, USA, 2008, pp. 1–6.
[12] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair, “A comparison of machine learning techniques for phishing detection,” in Proc. ACM Anti Phishing Working Groups 2nd Annu. eCrime Researchers Summit, Pittsburgh, PA, USA, 2007, pp. 60–69.
[13] M. Aburrous, M. A. Hossain, K. Dahal, and F. Thabtah, “Intelligent phishing detection system for e-banking using fuzzy data mining,” Expert Syst. Appl., vol. 37, no. 12, pp. 7913–7921, 2010.
[14] M. Aburrous and A. Khelifi, “Phishing detection plug-in toolbar using intelligent fuzzy-classification mining techniques,” in Proc. Int. Conf. Soft Comput. Softw. Eng., San Francisco, CA, USA, 2013.
[15] S. Afroz and R. Greenstadt, “Phishzoo: An automated Web phishing detection approach based on profiling and fuzzy matching,” in Proc. 5th IEEE Int. Conf. Semantic Comput. (ICSC), 2009.
[16] A. Aggarwal, A. Rajadesingan, and P. Kumaraguru, “PhishAri: Automatic realtime phishing detection on Twitter,” in Proc. IEEE eCrime Researchers Summit (eCrime), 2012, pp. 1–12.
[17] F. Aloul, S. Zahidi, and W. El-Hajj, “Two factor authentication using mobile phones,” in Proc. AICCSA, Rabat, Morocco, 2009, pp. 641–644.

[18] D. S. Anderson, C. Fleizach, S. Savage, and G. M. Voelker, [43] A. Gharaibeh et al., “Smart cities: A survey on data management,
“Spamscatter: Characterizing Internet scam hosting infrastructure,” in security and enabling technologies,” IEEE Commun. Surveys Tuts., to
Proc. Usenix Security, Boston, MA, USA, 2007, pp. 1–14. be published.
[19] APWG Phishing Trends Reports, Anti Phishing Working Group, 2016. [44] (2016). Anti-Phishing Working Group. [Online]. Available:
[20] M. Aydin and N. Baykal, “Feature extraction and classification phish- http://www.antiphishing.org
ing websites based on URL,” in Proc. IEEE Conf. Commun. Netw. [45] X. Han, N. Kheir, and D. Balzarotti, “Phisheye: Live monitoring
Security (CNS), Florence, Italy, 2015, pp. 769–770. of sandboxed phishing kits,” in Proc. ACM SIGSAC Conf. Comput.
[21] A. Bergholz et al., “New filtering approaches for phishing email,” Commun. Security, Vienna, Austria, 2016, pp. 1402–1413.
J. Comput. Security, vol. 18. no. 1, pp. 7–35, 2010. [46] M. Hara, A. Yamada, and Y. Miyake, “Visual similarity-based phish-
[22] A. Blum, B. Wardman, T. Solorio, and G. Warner, “Lexical feature ing detection without victim site information,” in Proc. IEEE Symp.
based phishing URL detection using online learning,” in Proc. 3rd ACM Comput. Intell. Cyber Security (CICS), Nashville, TN, USA, 2009,
Workshop Artif. Intell. Security, Chicago, IL, USA, 2010, pp. 54–60. pp. 30–36.
[23] Y. Cao, W. Han, and Y. Le, “Anti-phishing based on automated indi- [47] F. L. Hitchcock, “The distribution of a product from several sources to
vidual white-list,” in Proc. 4th ACM Workshop Digit. Identity Manag., numerous localities,” J. Math. Phys., vol. 20, nos. 1–4, pp. 224–230,
Alexandria, VA, USA, 2008, pp. 51–60. 1941.
[48] T. Holz, C. Gorecki, K. Rieck, and F. C. Freiling, “Measuring and
[24] D. D. Caputo, S. L. Pfleeger, J. D. Freeman, and M. E. Johnson, “Going
detecting fast-flux service networks,” in Proc. 15th Netw. Distrib. Syst.
spear phishing: Exploring embedded training and awareness,” IEEE
Security Symp., 2008.
Security Privacy, vol. 12, no. 1, pp. 28–38, Jan./Feb. 2014.
[49] J. Hong, “The state of phishing attacks,” Commun. ACM, vol. 55, no. 1,
[25] K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen, “Fighting phish- pp. 74–81, 2012.
ing with discriminative keypoint features,” IEEE Internet Comput., [50] Google Inc. Google Safe Browsing. Accessed: Dec. 5, 2016. [Online].
vol. 13, no. 3, pp. 56–63, May/Jun. 2009. Available: https://developers.google.com/safe-browsing/
[26] T.-C. Chen, S. Dick, and J. Miller, “Detecting visually similar [51] T. N. Jagatic, N. A. Johnson, M. Jakobsson, and F. Menczer, “Social
Web pages: Application to phishing detection,” ACM Trans. Internet phishing,” Commun. ACM, vol. 50, no. 10, pp. 94–100, 2007.
Technol., vol. 10, no. 2, p. 5, 2010. [52] C. Karlof, U. Shankar, J. D. Tygar, and D. Wagner, “Dynamic pharm-
[27] T.-C. Chen, T. Stepan, S. Dick, and J. Miller, “An anti-phishing system ing attacks and locked same-origin policies for Web browsers,” in
employing diffused information,” ACM Trans. Inf. Syst. Security, Proc. 14th ACM Conf. Comput. Commun. Security, Alexandria, VA,
vol. 16, no. 4, p. 16, 2014. USA, 2007, pp. 58–71.
[28] N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell, “Client-side [53] I. Khalil and S. Bagchi, “Secos: Key management for scalable and
defense against Web-based identity theft,” in Proc. NDSS, San Diego, energy efficient crypto on sensors,” in Proc. IEEE Depend. Syst. Netw.,
CA, USA, 2004. 2003.
[29] G. Comarela, G. Gürsun, and M. Crovella, “Studying interdomain rout- [54] I. Khalil, S. Bagchi, and N. Shroff, “Analysis and evaluation of secos,
ing over long timescales,” in Proc. Conf. Internet Meas., Barcelona, a protocol for energy efficient and secure communication in sensor
Spain, 2013, pp. 227–234. networks,” Ad Hoc Netw., vol. 5, no. 3, pp. 360–391, 2007.
[30] L. F. Cranor, S. Egelman, J. I. Hong, and Y. Zhang, “Phinding phish: [55] I. Khalil, Z. Dou, and A. Khreishah, “TPM-based authentication
An evaluation of anti-phishing toolbars,” in Proc. NDSS, San Diego, mechanism for apache hadoop,” in Proc. Int. Conf. Security Privacy
CA, USA, 2007. Commun. Syst., Beijing, China, 2014, pp. 105–122.
[31] R. Dhamija, J. D. Tygar, and M. Hearst, “Why phishing works,” in [56] I. Khalil, Z. Dou, and A. Khreishah, “Your credentials are compro-
Proc. SIGCHI Conf. Human Factors Comput. Syst., Montreal, QC, mised, do not panic: You can be well protected,” in Proc. 11th ACM
Canada, 2006, pp. 581–590. AsiaCCS, Xi’an, China, 2016, pp. 925–930.
[32] Z. Dou, I. Khalil, and A. Khreishah, “CLAS: A novel communica- [57] I. Khalil, I. Hababeh, and A. Khreishah, “Secure inter cloud data migra-
tions latency based authentication scheme,” Security Commun. Netw., tion,” in Proc. 7th Int. Conf. Inf. Commun. Syst. (ICICS), Irbid, Jordan,
vol. 2017, 2017, Art. no. 4286903. 2016, pp. 62–67.
[33] Z. Dou, I. Khalil, and A. Khreishah, “A novel and robust authentication [58] I. Khalil, T. Yu, and B. Guan, “Discovering malicious domains through
factor based on network communications latency,” IEEE Syst. J., to be passive DNS data graph analysis,” in Proc. 11th ACM Asia Conf.
published. Comput. Commun. Security, Xi’an, China, 2016, pp. 663–674.
[34] Z. Dou, I. Khalil, A. Khreishah, and A. Al-Fuqaha, “Robust insider [59] M. Khonji, Y. Iraqi, and A. Jones, “Phishing detection: A literature
attacks countermeasure for Hadoop: Design and implementation,” IEEE survey,” IEEE Commun. Surveys Tuts., vol. 15, no. 4, pp. 2091–2121,
Syst. J., to be published. 4th Quart., 2013.
[35] J. S. Downs, M. B. Holbrook, and L. F. Cranor, “Decision strategies [60] P. Kumaraguru et al., “Getting users to pay attention to anti-phishing
and susceptibility to phishing,” in Proc. 2nd Symp. Usable Privacy education: Evaluation of retention and transfer,” in Proc. Anti Phishing
Security, Pittsburgh, PA, USA, 2006, pp. 79–90. Working Groups 2nd Annu. eCrime Researchers Summit, Pittsburgh,
PA, USA, 2007, pp. 70–81.
[36] S. Egelman, L. F. Cranor, and J. Hong, “You’ve been warned: An
[61] P. Kumaraguru, S. Sheng, A. Acquisti, L. F. Cranor, and J. Hong,
empirical study of the effectiveness of Web browser phishing warn-
“Teaching Johnny not to fall for phish,” ACM Trans. Internet Technol.,
ings,” in Proc. SIGCHI Conf. Human Factors Comput. Syst., Florence,
vol. 10, no. 2, p. 7, 2010.
Italy, 2008, pp. 1065–1074.
[62] M. Kwon et al., “Use of network latency profiling and redundancy for
[37] M. N. Feroz and S. Mengel, “Examination of data, rule generation and cloud server selection,” in Proc. IEEE 7th Int. Conf. Cloud Comput.,
detection of phishing URLs using online logistic regression,” in Proc. Anchorage, AK, USA, 2014, pp. 826–832.
IEEE Int. Conf. Big Data (Big Data), Washington, DC, USA, 2014, [63] Spoofguard, Stanford Security Lab., Stanford, CA, USA, 2004.
pp. 241–250. [Online]. Available: https://crypto.stanford.edu/SpoofGuard/
[38] A. Y. Fu, L. Wenyin, and X. Deng, “Detecting phishing Web pages with [64] C. Labovitz, G. R. Malan, and F. Jahanian, “Internet routing
visual similarity assessment based on earth mover’s distance (EMD),” instability,” IEEE/ACM Trans. Netw., vol. 6, no. 5, pp. 515–528,
IEEE Trans. Depend. Secure Comput., vol. 3, no. 4, pp. 301–311, Oct. 1998.
Oct./Dec. 2006. [65] M. Lad, J. H. Park, T. Refice, and L. Zhang, “A study of Internet routing
[39] A. Y. Fu, L. Wenyin, and X. Deng, “EMD based visual similarity for stability using link weight,” Dept. Comput. Sci., Univ. California at
detection of phishing webpages,” in Proc. Int. Workshop Web Doc. San Diego, San Diego, CA, USA, Tech. Rep., 2008.
Anal., vol. 2005. 2005. [66] A. Le, A. Markopoulou, and M. Faloutsos, “PhishDef: Url names say
[40] S. Garera, N. Provos, M. Chew, and A. D. Rubin, “A framework it all,” in Proc. IEEE INFOCOM, Shanghai, China, 2011, pp. 191–195.
for detection and measurement of phishing attacks,” in Proc. ACM [67] C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata, “Detecting out-
Workshop Recurring Malcode, Alexandria, VA, USA, 2007, pp. 1–8. liers: Do not use standard deviation around the mean, use absolute
[41] S. Gastellier-Prevost, G. G. Granadillo, and M. Laurent, “A dual deviation around the median,” J. Exp. Soc. Psychol., vol. 49, no. 4,
approach to detect pharming attacks at the client-side,” in Proc. 4th pp. 764–766, 2013.
IFIP Int. Conf. New Technol. Mobility Security (NTMS), Paris, France, [68] G. L’Huillier, A. Hevia, R. Weber, and S. Ríos, “Latent semantic anal-
2011, pp. 1–5. ysis and keyword extraction for phishing classification,” in Proc. IEEE
[42] GeoTrust. Geotrust TrustWatch Toolbar. Accessed: Dec. 5, 2016. Int. Conf. Intell. Security Inf. (ISI), Vancouver, BC, Canada, 2010,
[Online]. Available: https://www.geotrust.com/comcasttoolbar/ pp. 129–131.

horized licensed use limited to: Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology. Downloaded on January 29,2024 at 09:43:48 UTC from IEEE Xplore. Restrictions app
2818 IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 19, NO. 4, FOURTH QUARTER 2017

[69] P. Likarish, E. Jung, D. Dunbar, T. E. Hansen, and J. P. Hourcade, [94] N. Sanglerdsinlapachai and A. Rungsawang, “Using domain top-page
“B-APT: Bayesian anti-phishing toolbar,” in Proc. IEEE Int. Conf. similarity feature in machine learning-based Web phishing detection,”
Commun., Beijing, China, 2008, pp. 1745–1749. in Proc. 3rd Int. Conf. Knowl. Disc. Data Min. (WKDD), 2010,
[70] G. Liu, B. Qiu, and L. Wenyin, “Automatic detection of phish- pp. 187–190.
ing target from phishing webpage,” in Proc. 20th Int. Conf. Pattern [95] A. Shaikh, A. Varma, L. Kalampoukas, and R. Dube, “Routing sta-
Recognit. (ICPR), Istanbul, Turkey, 2010, pp. 4153–4156. bility in congested networks: Experimentation and analysis,” ACM
[71] Z. Q. J. Lu, “The elements of statistical learning: Data mining, infer- SIGCOMM Comput. Commun. Rev., vol. 30, no. 4, pp. 163–174, 2000.
ence, and prediction,” J. Roy. Stat. Soc. A, Stat. Soc., vol. 173, no. 3, [96] M. Sharifi and S. H. Siadati, “A phishing sites blacklist generator,” in
pp. 693–694, 2010. Proc. IEEE/ACS Int. Conf. Comput. Syst. Appl., Doha, Qatar, 2008,
[72] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Beyond black- pp. 840–843.
lists: Learning to detect malicious Web sites from suspicious URLs,” [97] S. Sheng, M. Holbrook, P. Kumaraguru, L. F. Cranor, and J. Downs,
in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., Paris, “Who falls for phish?: A demographic analysis of phishing susceptibil-
France, 2009, pp. 1245–1254. ity and effectiveness of interventions,” in Proc. SIGCHI Conf. Human
[73] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying Factors Comput. Syst., Atlanta, GA, USA, 2010, pp. 373–382.
suspicious URLs: An application of large-scale online learning,” in [98] S. Sheng et al., “Anti-phishing phil: The design and evaluation of a
Proc. 26th Annu. Int. Conf. Mach. Learn., Montreal, QC, Canada, 2009, game that teaches people not to fall for phish,” in Proc. 3rd Symp.
pp. 681–688. Usable Privacy Security, Pittsburgh, PA, USA, 2007, pp. 88–99.
[74] S. Marchal, J. François, R. State, and T. Engel, “PhishStorm: Detecting [99] S. Sinha, M. Bailey, and F. Jahanian, “Shades of grey: On the effective-
phishing with streaming analytics,” IEEE Trans. Netw. Service Manag., ness of reputation-based ‘blacklists,”’ in Proc. 3rd Int. Conf. Malicious
vol. 11, no. 4, pp. 458–471, Dec. 2014. Unwanted Softw., Fairfax, VA, USA, Oct. 2008, pp. 57–64.
[75] S. Marchal, K. Saari, N. Singh, and N. Asokan, “Know your phish: [100] A. K. Sood and S. Zeadally, “A taxonomy of domain-generation
Novel techniques for detecting phishing sites and their targets,” in Proc. algorithms,” IEEE Security Privacy, vol. 14, no. 4, pp. 46–53,
IEEE 36th Int. Conf. Distrib. Comput. Syst. (ICDCS), Nara, Japan, Jul./Aug. 2016.
2016, pp. 323–333. [101] C. L. Tan, K. L. Chiew, K. Wong, and S. N. Sze, “PhishWHO: Phishing
[76] M.-E. Maurer and D. Herzner, “Using visual website similarity for webpage detection via identity keywords extraction and target domain
phishing detection and reporting,” in Proc. Extended Abstracts Human name finder,” Decis. Support Syst., vol. 88, pp. 18–27, Aug. 2016.
Factors Comput. Syst. CHI, Austin, TX, USA, 2012, pp. 1625–1630. [102] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song, “Design and
[77] E. Medvet, E. Kirda, and C. Kruegel, “Visual-similarity-based phishing evaluation of a real-time URL spam filtering service,” in Proc. IEEE
detection,” in Proc. 4th Int. Conf. Security Privacy Commun. Netw., Symp. Security Privacy, Berkeley, CA, USA, 2011, pp. 447–462.
Istanbul, Turkey, 2008, p. 22. [103] G. Varshney, M. Misra, and P. K. Atrey, “A survey and classification
[78] I.-C. Mihai and L. Giurea, “Management of eLearning platforms secu- of Web phishing detection schemes,” Security Commun. Netw., vol. 9,
rity,” in Proc. Int. Sci. Conf. eLearn. Softw. Educ., vol. 1. 2016, no. 18, pp. 6266–6284, 2016.
pp. 422–427. [104] J. Wang, T. Herath, R. Chen, A. Vishwanath, and H. R. Rao, “Research
article phishing susceptibility: An investigation into the processing of
[79] M. Moghimi and A. Y. Varjani, “New rule-based phishing detection
a targeted spear phishing email,” IEEE Trans. Prof. Commun., vol. 55,
method,” Expert Syst. Appl., vol. 53, pp. 231–242, Jul. 2016.
no. 4, pp. 345–362, Dec. 2012.
[80] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Predicting phishing
[105] W. D. Yu, S. Nargundkar, and N. Tiruthani, “A phishing vulnerabil-
websites based on self-structuring neural network,” Neural Comput.
ity analysis of Web based systems,” in Proc. IEEE Symp. Comput.
Appl., vol. 25, no. 2, pp. 443–458, 2014.
Commun. (ISCC), Marrakech, Morocco, 2008, pp. 326–331.
[81] PhishTank: An Anti-Phishing Site, LLC OpenDNS, San Francisco, CA,
[106] L. Wenyin, G. Huang, L. Xiaoyue, X. Deng, and Z. Min, “Phishing
USA, accessed: Dec. 5, 2016.
Web page detection,” in Proc. 8th Int. Conf. Document Anal.
[82] A. Oprea, Z. Li, T.-F. Yen, S. H. Chin, and S. Alrwais, “Detection Recognit. (ICDAR), Seoul, South Korea, 2005, pp. 560–564.
of early-stage enterprise infection by mining large-scale log data,” [107] L. Wenyin, G. Liu, B. Qiu, and X. Quan, “Antiphishing through phish-
in Proc. 45th Annu. IEEE/IFIP Int. Conf. Depend. Syst. Netw., ing target discovery,” IEEE Internet Comput., vol. 16, no. 2, pp. 52–61,
Rio de Janeiro, Brazil, Jun. 2015, pp. 45–56. Mar./Apr. 2012.
[83] P. Pajares. Phishing Safety: Is HTTPS Enough? [Online]. Available: [108] C. Whittaker, B. Ryner, and M. Nazif, “Large-scale automatic classi-
http://blog.trendmicro.com/trendlabs-security-intelligence/phishing- fication of phishing pages,” in Proc. NDSS, vol. 10. San Diego, CA,
safety-is-https-enough/ USA, 2010.
[84] Y. Pan and X. Ding, “Anomaly based Web phishing page detection,” [109] Avalanche (Phishing Group)—Wikipedia, the Free Encyclopedia,
in Proc. ACSAC, vol. 6, 2006, pp. 381–392. Wikipedia, San Francisco, CA, USA, 2016.
[85] R. K. Panta, S. Bagchi, and I. M. Khalil, “Efficient wireless reprogram- [110] Bag-of-Words Model—Wikipedia, the Free Encyclopedia, Wikipedia,
ming through reduced bandwidth usage and opportunistic sleeping,” Ad San Francisco, CA, USA, 2016.
Hoc Netw., vol. 7, no. 1, pp. 42–62, 2009. [111] Mcafee Siteadvisor—Wikipedia, the Free Encyclopedia, Wikipedia,
[86] B. Parmar, “Protecting against spear-phishing,” Comput. Fraud San Francisco, CA, USA, 2016, accessed: Sep. 6, 2016.
Security, vol. 2012, no. 1, pp. 8–11, 2012. [112] Microsoft Smartscreen—Wikipedia, the Free Encyclopedia, Wikipedia,
[87] P. Prakash, M. Kumar, R. R. Kompella, and M. Gupta, “PhishNet: San Francisco, CA, USA, 2016, accessed: Sep. 28, 2016.
Predictive blacklisting to detect phishing attacks,” in Proc. IEEE [113] Netcraft—Wikipedia, the Free Encyclopedia, Wikipedia, San Francisco,
INFOCOM, San Diego, CA, USA, 2010, pp. 1–5. CA, USA, 2016, accessed: Sep. 3, 2016.
[88] A. Ramachandran, D. Dagon, and N. Feamster, “Can DNS-based [114] Yahoo! Directory—Wikipedia, the Free Encyclopedia, Wikipedia,
blacklists keep up with bots,” in Proc. 3rd Conf. Email Anti Spam, San Francisco, CA, USA, 2016, accessed: Jun. 7, 2016.
2006. [115] Zero-Day (Computing)—Wikipedia, the Free Encyclopedia, Wikipedia,
[89] V. Ramanathan and H. Wechsler, “Phishing website detection San Francisco, CA, USA, 2016.
using latent Dirichlet allocation and adaboost,” in Proc. IEEE Int. [116] M. Wu, R. C. Miller, and S. L. Garfinkel, “Do security toolbars actu-
Conf. Intell. Security Informat. (ISI), Arlington, VA, USA, 2012, ally prevent phishing attacks?” in Proc. SIGCHI Conf. Human Factors
pp. 102–107. Comput. Syst., Montreal, QC, Canada, 2006, pp. 601–610.
[90] G. Ramesh, I. Krishnamurthi, and K. S. S. Kumar, “An effica- [117] X. Wu et al., “Top 10 algorithms in data mining,” Knowl. Inf. Syst.,
cious method for detecting phishing webpages through target domain vol. 14, no. 1, pp. 1–37, 2008.
identification,” Decis. Support Syst., vol. 61, pp. 12–22, May 2014. [118] G. Xiang, J. Hong, C. P. Rose, and L. Cranor, “Cantina+: A feature-
[91] J. Rexford, J. Wang, Z. Xiao, and Y. Zhang, “BGP routing stability of rich machine learning framework for detecting phishing Web sites,”
popular destinations,” in Proc. 2nd ACM SIGCOMM Workshop Internet ACM Trans. Inf. Syst. Security, vol. 4, no. 2, 2011, Art. no. 21.
Meas., Marseille, France, 2002, pp. 197–202. [119] G. Xiang and J. I. Hong, “A hybrid phish detection approach by identity
[92] P. Robichaux and D. L. Ganger, “Gone phishing: Evaluating anti- discovery and keywords retrieval,” in Proc. 18th Int. Conf. World Wide
phishing tools for windows,” 3Sharp Project, Redmond, WA, USA, Web, Madrid, Spain, 2009, pp. 571–580.
Tech. Rep., Sep. 2006. [120] J. Yearwood, M. Mammadov, and A. Banerjee, “Profiling phishing
[93] J. C. Russ and R. P. Woods, “The image processing handbook,” emails based on hyperlink information,” in Proc. Int. Conf. Adv. Soc.
J. Comput. Assisted Tomograph., vol. 19, no. 6, pp. 979–981, 1995. Netw. Anal. Min. (ASONAM), Odense, Denmark, 2010, pp. 120–127.


Zuochao Dou received the B.S. degree in electronics from the Beijing University of Technology in 2009, the M.S. degree from the University of Southern Denmark, concentrating on embedded control systems in 2011, and the M.S. degree from the University of Rochester majoring in communications and signal processing in 2013. He is currently pursuing the Ph.D. degree in cloud computing security and network security under the supervision of Dr. A. Khreishah and Dr. I. Khalil.

Issa Khalil (S’06–M’08) received the Ph.D. degree in computer engineering from Purdue University, USA, in 2007. He joined the College of Information Technology, United Arab Emirates University, where he served as an Associate Professor and the Department Head with the Information Security Department. In 2013, he joined the Cyber Security Group, Qatar Computing Research Institute, a member of Qatar Foundation, as a Senior Scientist, where he was recently promoted to a Principal Scientist. His research interests span the areas of wireless and wireline network security and privacy. He is especially interested in cloud security, malicious domain detection and takedown, and security data analytics. His novel technique to discover malicious domains following the guilt-by-association social principle attracts the attention of local media and stakeholders. He was a recipient of the CIT Outstanding Professor Award for outstanding performance in research, teaching, and service in 2011. He served as an Organizer, a Technical Program Committee Member, and a Reviewer for many international conferences and journals. He is a member of ACM and delivers invited talks and keynotes in many local and international forums.

Abdallah Khreishah received the B.S. (Hons.) degree from the Jordan University of Science and Technology in 2004, and the M.S. and Ph.D. degrees in electrical and computer engineering from Purdue University in 2010 and 2006, respectively. He was with NEESCOM. In 2012, he joined the Electrical and Computer Engineering Department, New Jersey Institute of Technology as an Assistant Professor and was promoted to an Associate Professor in 2017. His research spans the areas of wireless networks, visible-light communication, vehicular networks, congestion control, cloud and edge computing, and network security. His research projects are funded by the National Science Foundation, New Jersey Department of Transportation, and the UAE Research Foundation. He is currently serving as an Associate Editor for the International Journal of Wireless Information Networks. He served as the TPC Chair for WASA 2017, IEEE SNAMS 2014, IEEE SDS-2014, BDSN-2015, BSDN 2015, and IOTSMS-2105. He has also served on the TPC committee of IEEE Infocom 2017, IEEE Infocom 2016, IEEE PIMRC 2016, IEEE WCNC 2016, IEEE CCH 2016, IEEE PIMRC 2015, and ICCVE 2015. He is the Chair of the IEEE EMBS North Jersey Chapter.

Ala Al-Fuqaha (S’00–M’04–SM’09) received the M.S. degree in electrical and computer engineering from the University of Missouri-Columbia in 1999 and the Ph.D. degree in electrical and computer engineering from the University of Missouri-Kansas City in 2004. He is currently a Professor and the Director of NEST Research Laboratory, Computer Science Department, Western Michigan University. His research interests include wireless vehicular networks, cooperation and spectrum access etiquettes in cognitive radio networks, smart services in support of the Internet of Things, management and planning of software defined networks, and performance analysis and evaluation of high-speed computer and telecommunications networks. In 2014, he was a recipient of the Outstanding Researcher Award with the College of Engineering and Applied Sciences, Western Michigan University. He is currently serving on the Editorial Board for Security and Communication Networks (Wiley), Wireless Communications and Mobile Computing (Wiley), EAI Transactions on Industrial Networks and Intelligent Systems, and the International Journal of Computing and Digital Systems. He has served as a Technical Program Committee Member and a Reviewer of many international conferences and journals.

Mohsen Guizani (S’85–M’89–SM’99–F’09) received the B.S. (with Distinction) and M.S. degrees in electrical engineering, and the M.S. and Ph.D. degrees in computer engineering from Syracuse University, Syracuse, NY, USA, in 1984, 1986, 1987, and 1990, respectively. He is currently a Professor and the ECE Department Chair with the University of Idaho, USA. He served as the Associate Vice President of Graduate Studies, Qatar University, the Chair of the Computer Science Department, Western Michigan University, and the Chair of the Computer Science Department, University of West Florida. He also served in academic positions with the University of Missouri-Kansas City, University of Colorado-Boulder, Syracuse University, and Kuwait University. His research interests include wireless communications and mobile computing, computer networks, mobile cloud computing, security, and smart grid. He currently serves on the Editorial Boards of several international technical journals and is the Founder and the Editor-in-Chief of Wireless Communications and Mobile Computing (Wiley). He has authored 9 books and over 450 publications in refereed journals and conferences. He guest edited a number of special issues in IEEE journals and magazines. He also served as a member, the Chair, and the General Chair of a number of international conferences. He was a recipient of teaching awards multiple times from different institutions as well as best research awards from three institutions. He was the Chair of the IEEE Communications Society Wireless Technical Committee and the Chair of the TAOS Technical Committee. He served as the IEEE Computer Society Distinguished Speaker from 2003 to 2005. He is a Senior Member of ACM.
