Project Shokan
List of Figures.....................................................................................................................iv
List of Tables.......................................................................................................................v
Abstract.............................................................................................................................vi
Acknowledgements..........................................................................................................vii
Introduction.......................................................................................................................1
1.1. Background....................................................................................................................1
1.2. Research Aims and Objectives........................................................................................2
1.3. Contribution...................................................................................................................3
1.4. Risks...............................................................................................................................3
1.4.1. Failure to get data..........................................................................................................................4
1.4.2. Hardware defects or failure............................................................................................................4
1.4.3. Low quality of prediction................................................................................................................4
1.5. Dissertation structure.....................................................................................................5
Literature Review...............................................................................................................6
2.1. Malicious URL categorisation..........................................................................................6
2.1.1. Phishing URLs.................................................................................................................................7
2.1.2. Malware URLs.................................................................................................................................8
2.1.3. Spamming URLs..............................................................................................................................8
2.2. Attacks...........................................................................................................................9
2.2.1. Phishing attacks and URL obfuscation techniques..........................................................................9
2.2.2. Injection attacks............................................................................................................................11
2.2.3. Drive-by download attack.............................................................................................................13
2.2.4. Spamming attacks........................................................................................................................13
2.3. Valuable lexical-based features.....................................................................................14
2.4. Detecting malicious URLs..............................................................................................17
2.4.1. Machine Learning approach.........................................................................................................17
2.4.2. Alternative approaches.................................................................................................................26
2.5. Summary of the literature review.................................................................................27
Methodology....................................................................................................................29
3.1. Theoretical approach....................................................................................................29
3.1.1. Research Philosophy.....................................................................................................................30
3.1.2. Theory development....................................................................................................................30
3.1.3. Methodological development......................................................................................................30
3.1.4. Strategy........................................................................................................................................30
3.1.5. Time horizon.................................................................................................................................31
3.1.6. Techniques and procedures..........................................................................................................31
3.2. Practical approach........................................................................................................31
3.2.1. Experimental environment.......................................................................................31
3.2.2. Data..........................................................................................................................33
3.2.3. Model validation and optimisation...........................................................................36
3.2.4. Evaluation metrics........................................................................................................................37
3.3. Summary of the methodology......................................................................................................39
Results..............................................................................................................................41
4.1. Describing data.............................................................................................................41
4.2. Model description.........................................................................................................44
4.3. Results..........................................................................................................................45
4.4. Comparison..................................................................................................................................46
List of Figures
Figure 1. The paper’s structure.................................................................................................10
Figure 2. Example of obfuscation with JavaScript (Chiew, Yong and Tan, 2018)..................16
Figure 3. The generic URL syntax (Berners-Lee, 2005)..........................................................20
Figure 4. Sources for collecting raw data.................................................................................23
Figure 5. SVM classification....................................................................................................27
Figure 6. KNN classification....................................................................................................28
Figure 7. Layers of Onion Framework.....................................................................................33
Figure 8. Technical characteristics of the experimental host environment..............................36
Figure 9. Number and percentage of benign and malicious URLs...........................................38
Figure 10. Visual representation of data splitting and validating processes (Nelson, 2018)....40
Figure 11. Visual representation of K-Folds cross validation method (Nelson, 2018)............41
Figure 12. Structure of confusion matrix for binary classifier.................................................41
Figure 13. Distribution of classes in the testing and training subsets.......................................44
Figure 14. Spearman's correlation matrix.................................................................................45
Figure 15. Misclassification error rate vs number of neighbours.............................................47
Figure 16. Confusion matrix.....................................................................................................48
List of Tables
Table 1. Timetable of the project objectives...............................................................................9
Table 2. URL obfuscation examples.........................................................................................17
Table 3. Injection URL examples.............................................................................................19
Table 4. Spamming URL examples..........................................................................................21
Table 5. References of lexical-based features used by researchers in related studies..............23
Table 6. References of different types of machine learning algorithms used for malicious URL
detection in the last decade......................................................................................................28
Table 7. Decision making table according to MCDM method.................................................33
Table 8. Experimental tools......................................................................................................40
Table 9. List of primary sources for data collection.................................................................41
Table 10. Advanced evaluation metrics....................................................................................46
Table 11. Proportion of training and testing samples...............................................................49
Table 12. Settings of the KNN in Scikit-Learn v0.19.2 library................................................52
Table 13. Values of advanced metrics......................................................................................53
Table 14. Result comparison....................................................................................................54
Abstract
The detection of malicious URLs is one of the highest priority issues for cyber security
practitioners. Despite the large number of studies that have examined different machine
learning techniques to address the issue, the most widely used approach remains blacklisting. The
main obstacle to applying machine learning is the difficulty of data collection.
This paper examines the possibility of identifying malicious URLs through the
analysis of lexical-based features only. For the analysis, an experiment was designed;
beforehand, the known lexical characteristics of malicious URLs were examined based on
previous studies.
The classifier showed a fairly good average accuracy rate of 94%, but it also
showed a poor false positive (FP) rate, which increases the risk of encountering
malicious URLs. Additionally, correlation analysis using Spearman's coefficient showed that
the URL length and the number of special characters are the most discriminative indicators of
malicious URLs.
Key words: malicious URL, machine learning, k-nearest neighbours, lexical-based features
Chapter 1
Introduction
1.1. Background
The internet remains the main vector of attack, where an accidental visit to a malicious
website can trigger a pre-designed criminal activity. Google Inc. (2018a) reported that it
detects thousands of new unsafe web pages daily, many of which are compromised legitimate
websites. This growing threat has increased the demand for security on the internet.
Currently, there are different approaches to the detection of dangerous web pages on the
internet. The blacklisting approach is commonly used by popular online services and antivirus
software (Chen, Huang and Ou, 2015). But, in addition to other shortcomings, the blacklisting
approach is not able to detect targeted attacks and new phishing pages which are not yet
blacklisted.
Recent developments in the fields of machine learning and artificial intelligence have
led to renewed interest in their application to a wide range of cybersecurity issues.
In particular, machine learning has been used to identify malicious web pages. For example, the
Google online service Safe Browsing, in its current version, applies a machine learning
approach to identify suspicious web pages (Wen, 2017). This approach also continues to be
improved by academics, as evidenced by the large number of studies being conducted in this
area.
The available studies in this domain have shown that there are several research vectors
that aim at providing users a safe surfing experience on the internet. Due to practical
constraints, this paper cannot provide a comprehensive review of all of them. Hence, the
scope of this research is limited to the machine learning approach only. Specifically, this
paper examines how the approach performs in malicious Uniform Resource Locator (URL)
detection when only lexical-based features are analysed.
By lexical-based features, this paper refers to predictors that are extracted from
statistical properties of the URL string. For example, the features can include the length of the
URL, the length of the hostname or the top-level domain name. Moreover, the presence of certain
keywords or special characters in a URL can also be a lexical-based feature; hence, in some
literature these features are called Bag-of-Words features.
Despite the large number of studies in the field of malicious URL detection, a number of
problems and practical issues remain open to this day. The main concern is the massiveness of
the data: there are more than 30 trillion unique URLs on the internet (Sullivan, 2012; Lin et
al., 2013), and processing such a huge amount of data remains problematic.
The second concern is the difficulty of feature collection. The choice of an
appropriate set of features is very important for the quality of the classifier's performance.
However, it was found that previous studies mainly applied features such as host-based and
page content-based features. Collecting these features is time-consuming; for example, it can
take a few seconds to obtain the value of some host-based features. Given the above-
mentioned massiveness of the data, the collection of these features is an infeasible task.
Moreover, as was noted by McGrath and Gupta (2008), the majority of malicious URLs have
the property of being available for only a very short period of time. Hence, it is necessary to
find an easy and efficient collection method.
Sahoo, Liu and Hoi (2017) underlined in their survey that the most accessible features are
the lexical-based features of URLs. Additionally, very few studies (Le, Markopoulou and
Faloutsos, 2011; Sorio, Bartoli and Medvet, 2013) have investigated the impact of lexical
features on detecting malicious web pages without mixing them with other, hard-to-collect
features. Consequently, the main research question of this paper is: can a machine learning
approach focusing on only lexical-based features of URLs improve on the current state of the
art?
This study examines the effectiveness of a machine learning algorithm that uses only lexical-
based features in detecting malicious web pages. In order to answer the main research
question within the framework of this project, it is necessary to achieve seven objectives,
which are given in Table 1.
3. Identify valuable lexical-based features  |  01 Mar 2018 / 17 Jun 2018  |  Valuable lexical-based features according to URL types; valuable lexical-based features according to the recommendations of previous studies
4. Identify the state-of-the-art machine learning approach  |  01 Mar 2018 / 17 Jun 2018  |  Top related studies of the last 10 years and their results
5. Choose the most appropriate machine learning algorithm  |  01 Mar 2018 / 17 Jun 2018  |  Machine learning algorithm
6. Collect a dataset of malicious and benign URLs  |  01 Jun 2018 / 27 Jun 2018  |  Primary data or appropriate secondary data
1.3. Contribution
The study makes a major contribution to the malicious URL detection domain by
demonstrating the effectiveness of lexical-based features in malicious URL detection. For
example, taking into account the ease of obtaining these features, it would be possible to create a
system that detects malicious URLs in real time without using a blacklist.
Even in the case of poor results from the learning algorithm, the outcome can still be
considered a contribution, because further studies would receive additional evidence about
the ineffectiveness of lexical-based features for malicious URL detection.
1.4. Risks
It is important to identify the possible risks associated with the research project and mitigate
them to ensure successful completion. Therefore, the following risks were considered in this
paper due to their high or medium likelihood.
1.4.1. Failure to get data
The first risk is related to the existence of an appropriate dataset for conducting the
experiment, where the dataset can be reliably labelled into two classes, malicious and benign
URLs. At the same time, the number of these classes should be balanced.
Mitigating the risk related to the dataset is the most difficult task and requires significant
technical and administrative effort. Because of this, the search for the dataset started before
the project began.
The first preference was given to publicly available datasets that have the most positive
reputation among scholars. At the same time, a dataset was requested from a few authors
of large-scale studies, such as Ma et al. (2010) and Vanhoenshoven et al. (2016). Also, in
order to have a backup plan, it was decided to collect malicious and benign URLs with the help of
custom Python code that parses particular websites. More information about the
obtained dataset can be found in the methodology part of this paper.
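A minimal sketch of that backup collection plan is shown below. The regex and the sample page source are simplified illustrations, not the actual collection code; the page source would normally be fetched first, e.g. with urllib.request:

```python
import re

# Simplified pattern; real URL grammar (RFC 3986) is considerably richer
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(page_source):
    """Return all candidate URLs found in a page's HTML source."""
    return URL_PATTERN.findall(page_source)

sample = '<a href="http://bad.example.com/login">link</a> and https://ok.example.org/'
urls = extract_urls(sample)  # two URLs found in this sample
```

Each URL harvested this way would then be labelled according to the source it was scraped from (a blacklist feed yields 'malicious' candidates, a whitelist yields 'benign' ones).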
1.4.2. Hardware defects or failure
The second risk refers to an unexpected failure of the computer on which the experiment is
conducted. A hardware fault could also cause a significant shift in the project's timetable.
To mitigate this risk, it was decided to store the developed code on the online version control
system GitHub. Additionally, the datasets and valuable configuration files are stored on
online file hosting services. Lastly, the computer was periodically backed up (Apple Inc.,
2017) to allow the researcher to restore it from a snapshot if the experiment environment
develops defects.
1.4.3. Low quality of prediction
The last, but not least, risk associated with the project is the quality of the classification. Despite
the appropriateness of the selected machine learning algorithm and the presented features, there is
always a risk to classification accuracy known as overfitting.
Overfitting is the result 'of an analysis which corresponds too closely or exactly to a
particular set of data, and may therefore fail to fit additional data or predict future
observations reliably' (Oxford Dictionary, 1930). The consequence of overfitting is poor
classifier performance on a new dataset.
Firstly, the basic rule of this study is to perform the experiment several times in order to
ensure the consistency of the results. Secondly, the cross-validation technique is used to
evaluate a classifier against overfitting. This technique is explained in the methodology part
of the paper.
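A minimal sketch of that technique, using the scikit-learn KNN classifier employed in this project, is given below. The feature matrix here is synthetic stand-in data, not the real dataset described in the methodology chapter:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lexical feature matrix: two well-separated classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(40, 5, (50, 2)),   # benign-like samples
               rng.normal(80, 5, (50, 2))])  # malicious-like samples
y = np.array([0] * 50 + [1] * 50)            # 0 = benign, 1 = malicious

# 5-fold cross-validation: each fold is held out once for evaluation,
# which guards the accuracy estimate against overfitting to a single split
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
mean_accuracy = scores.mean()
```

Because every sample serves in a held-out fold exactly once, the mean of the five scores is a less optimistic, more stable estimate than a single train/test split.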
1.5. Dissertation structure
The overall structure of the study takes the form of six chapters, including this introductory
chapter: (1) the Introduction gives a broad view of the general research area and underlines
the research question; (2) Chapter Two, the literature review, begins by laying out the
theoretical dimensions of the research and looks at how the experiment should be
conducted; (3) the third chapter is concerned with the methodology used for this research
project; (4) the fourth chapter presents the results and main findings of the experiment, tying
up the various theoretical and empirical strands in order to answer the main question;
(5) Chapter Five discusses the results, critically evaluating the findings and examining the
limitations of the study; and (6) the Conclusion gives a brief summary and critique of the
findings, together with recommendations for future research. A graphical overview of the
detailed structure of the project is given in Figure 1.
CHAPTER 2
Literature Review
The detection of malicious URLs is an emerging issue in academia. It was found that more
recent attention has focused on the application of machine learning algorithms to tackle this
problem. Hence, the large and growing body of literature in both the fields of machine
learning and cybersecurity was investigated during this project.
What is known about the application of machine learning is largely based upon
empirical studies that investigate the performance of different classifier algorithms in an
experimental environment. However, there have been no controlled studies which
attempt to find a practical approach able to cope with the significant amount of data
in a real-world environment. Moreover, much uncertainty still exists about the ability of
machine learning algorithms to detect malicious URLs by analysing only the static properties
of URL strings.
In order to answer the main question of the research, it is first necessary
to answer a chain of sub-questions. With this background in mind, the literature review
attempts to answer the following research sub-questions: (1) How can malicious URLs be
categorised? (2) What kinds of attacks are conducted via URLs? (3) What valuable
lexical-based features can be extracted from a URL? (4) What is the state-of-the-art machine
learning approach for combating malicious URLs? (5) Which machine learning algorithm can
be effective for building a classifier? These five questions allow us to achieve the first five
research objectives by providing a conceptual theoretical framework based on the literature.
The purpose of this chapter is to review the literature to find an answer to the above-
mentioned questions by exploring primary and secondary sources. To achieve this, the
chapter is divided into four sections: (1) Malicious URL categorisation; (2) Attacks; (3)
Valuable lexical-based features; and (4) Detecting malicious URLs.
2.1. Malicious URL categorisation
As far as the term 'malicious URL' is concerned, an arguable weakness of the majority of
studies is the arbitrariness in the definition of this term. The term 'malicious' is vague; therefore,
it is often necessary to clarify the level of maliciousness for a closer understanding of the
threat. Categorising URLs would therefore provide a better understanding of the
characteristics of the existing types of malicious URLs, and would act as a stepping stone in
the experimental part towards the development of a holistic machine learning classifier.
A number of studies investigating malicious URL detection have been carried out by
scholars in the last decade. However, the majority of them did not define the term 'malicious
URL'. During their experiments, they collected phishing and spamming URLs and marked
them under a single label, 'malicious'. Conversely, Dua and Du (2015) reported that
malicious activities (spamming or phishing) have different properties and that their
identification should be handled differently.
The categorisation and separate detection of malicious URLs was first demonstrated
experimentally by Choi, Zhu and Lee (2011). In their systematic study, malicious URLs were
detected in two stages: (1) a machine learning binary classifier divided samples into benign
and malicious; (2) the malicious URLs were assigned three types of labels: phishing,
malware and spamming. The scholars noted that a URL can belong to different
categories at the same time (e.g. a URL can be both spamming and malware).
A similar perspective has been adopted by Ma et al. (2010) and Sahoo, Liu and Hoi
(2017), who argue that malicious URLs should be categorised according to the content of the
web page to which they refer. These authors applied the same three types of malicious
URLs: (1) phishing, (2) malware and (3) spamming. Therefore, it was decided to accept these
three types of malicious URL, which have been commonly mentioned by different scholars.
Below, a closer look at each of these types is presented.
2.1.1. Phishing URLs
Phishing attacks are a social engineering technique that aims to lure users into providing
confidential information by clicking on a link that looks legitimate. The term 'phishing'
came into active use in the mid-1990s in the telecommunication sector, when the
acquisition of internet service provider account information was a common cybercrime
(Zulfikar, 2010). Since then, the term has acquired a wide range of applications, and different
types of phishing attack have been invented.
The experimental studies on the effectiveness of visually identifying phishing web
pages are rather contradictory. For instance, Kumaraguru et al. (2008) and Sheng et al. (2010)
examined the ability of people to visually identify phishing web pages after training courses
and came to the conclusion that user training courses are highly effective. Conversely,
Alsharnouby, Alaca and Chiasson (2015), who also examined behavioural strategies of users,
reported that the majority of internet users failed the test on detecting phishing web resources
even after being taught to identify them.
Another study, by Aleroud and Zhou (2017), examined the trends in different phishing
attacks. The scholars researched phishing attacks along four dimensions: (1) the
communication media (e.g. social networks) where the attacks are conducted, (2) target
devices, (3) attack techniques and (4) countermeasures. The conclusion of the study was
that the identification of phishing attacks is not a trivial task; accordingly, there is perhaps
no single right approach for identifying them.
2.1.2. Malware URLs
By 'malware', this paper refers to URLs that trigger the downloading of hostile or intrusive
software. Cybercriminals design malware to compromise the integrity, confidentiality and
availability of a user's device. The most common techniques here are cross-site scripting
(XSS) (Chiew, Yong and Tan, 2018) and drive-by download attacks (Choi, Zhu and
Lee, 2011). Whereas phishing attacks rely on users' carelessness, malware URLs are
designed specifically to exploit vulnerabilities of web browsers or web applications
developed on different platforms.
A number of studies, such as Alcaide et al. (2011) and Curtsinger et al. (2011), have
examined different approaches to effectively detect malware activity, but to date none has
achieved sufficient results. There are several reasons for the difficulty of detecting malware
URLs. The main reason is that malware attacks conducted with the help of URLs are usually
developed in different programming languages, such as PHP, ASP or JSP. Hence, it is
necessary to individually develop safety requirements for web applications on the internet
(ibid.).
2.1.3. Spamming URLs
Spamming is the sending of unsolicited content for advertising purposes, and it occurs in
significant volumes (Choi, Zhu and Lee, 2011). In other words, spamming URLs intend to
promote commercial or non-commercial content. Obviously, spamming web pages themselves
are detrimental to the quality of online content and the user experience.
Spamming URLs are usually not physically harmful to a user's device.
However, this paper defines them as malicious, since they are often used for the distribution
of fake news, the fight against which has become one of the priority tasks for states.
Intentionally misleading readers currently has dangerous outcomes for society, and mainly
social networks are used for spreading fake news (Krombholz, Merkl and Weippl, 2012). In
addition, spamming URLs are often used to distribute obscene content (Gao et al., 2010) that
can damage the vulnerable minds of children. In the following section, different techniques
for distributing spamming URLs are revealed.
Garcia-Molina and Gyongyi (2005) and Jelodar et al. (2017) reviewed current
spamming techniques and applied a machine learning approach to detect spamming URLs.
The scholars attempted to detect spamming emails by analysing their lexical-based and
host-based features. The studies show that spamming servers (or spamming farms) usually have a
short lifespan. Other features, such as the structure, content and geography of spamming
URLs, do not give enough clues to distinguish spamming from non-spamming content
automatically.
2.2. Attacks
In contrast, typo-squatting attacks use visually similar domain names (Milletary, 2005;
Jelodar et al., 2017; Chiew, Yong and Tan, 2018). This technique is the most
commonly conducted phishing attack. Phishing domains are selected according to
common typing errors: the attacker relies both on grammatical mistakes and on the user
mistyping the website address by accidentally pressing an adjacent key or missing a character.
The mistyped website address can lead a user to a phishing website that may look like the
legitimate one (Jelodar et al., 2017). Examples of these and other phishing URLs are
presented in Table 2.
#  Example  |  Comment
1  https://heir-fresh.com  |  Sounds like air-fresh.com
2  https://high5.com  |  Text changed to a number, mimicking the URL www.highfive.com
4  https://wwwmybank2us.com  |  Missing-dot typo
5  https://mybankus.com  |  Character omission typo
6  https://mybank2su.com  |  Character permutation typo
7  https://mybanl2us.com  |  Character replacement typo
8  https://mybank2uss.com  |  Character insertion typo
9  https://legitimatesite.legit.com  |  Obfuscation of a legitimate website (legit.legitimatesite.com) by interchanging the domain and subdomain names
10  https://legit.legitimatesite.com.my  |  Obfuscation using a country-code top-level domain
11  http://legit.Iegitimatesite.com.my  |  Obfuscation using character substitution (capital 'I' for lowercase 'l')
12  https://legit.legitimatesite.anothersite.com  |  Obfuscation using part of the legitimate URL as a subdomain
13  https://%68%74%74%70%3a%2f%2f%77%77%77%2e%65%78%61%6d%70%6c%65%2e%63%6f%6d  |  Hexadecimal encoding of the ASCII-text domain name, e.g. http://www.example.com
14  https://192.168.1.1  |  Dotted quad notation
15  https://0xc0a80101  |  Hexadecimal format
16  https://bit.ly/2KnuCGI  |  Shortened URL of a malware website
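Several of the typo categories in Table 2 (omission, permutation and replacement) can be generated programmatically, which is how attackers enumerate candidate domains at scale. A minimal sketch, using a hypothetical domain:

```python
import string

def typo_variants(domain):
    """Generate candidate typo-squatting domains for a second-level domain."""
    name, _, tld = domain.rpartition(".")
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:] + "." + tld)      # character omission
        if i < len(name) - 1:                                  # character permutation
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:] + "." + tld)
        for c in string.ascii_lowercase:                       # character replacement
            variants.add(name[:i] + c + name[i + 1:] + "." + tld)
    variants.discard(domain)  # replacing a letter with itself recreates the original
    return variants

candidates = typo_variants("mybank.com")
```

Defenders use the same enumeration in reverse, registering or monitoring the generated variants of their own domains.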
B. Clickable image
However, not all phishing attacks rely on similarity to the URLs of legitimate web
resources. Another URL obfuscation technique is to use a clickable image instead of text.
Usually, this is used in emails that contain a single image in JPEG format. The image appears
to be a legitimate email from an online bank or shop, usually including official logos. As
a result, users are directed to a phishing web page when they click on this image (Milletary,
2005). This is a commonly used technique that is technically simple and highly effective
(ibid.).
Moreover, it is possible to replace the phishing URL string with an image of a legitimate
URL by using JavaScript. It is also common to add a security icon, which gives the user a
false sense of security (Anthony, 2007). The script controls the chrome part of the browser,
which contains the address bar and the status line (Milletary, 2005). An example of this type of
obfuscation is shown in Figure 2.
Figure 2. Example of obfuscation with JavaScript (Chiew, Yong and Tan, 2018)
C. Alternative encoding
Chiew, Yong and Tan (2018) revealed in detail the more advanced URL obfuscation methods
that are also actively used by phishers. For example, using an alternative encoding is another
obfuscation technique that makes a URL unrecognisable. IP addresses can be specified as
hexadecimal numbers, and alphanumeric characters can be changed to their hexadecimal
representations. Regardless of the encoding in which a URL is presented, web browsers
usually interpret most of these representations correctly. These and other methods
are presented in Table 2.
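The hexadecimal (percent-encoded) representation from Table 2 can be decoded with Python's standard urllib.parse module; a minimal sketch:

```python
from urllib.parse import unquote

# Row 13 of Table 2: each %XX pair is the hexadecimal code of an ASCII character
encoded = "%77%77%77%2e%65%78%61%6d%70%6c%65%2e%63%6f%6d"
print(unquote(encoded))  # www.example.com
```

This is the same normalisation a browser performs before resolving the address, which is why such URLs still work despite being unreadable to the user.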
2.2.2. Injection attacks
There are a number of attacks performed by injecting software code as an illicit way to
perform malicious actions. The possibility of these attacks arises when web resources have a
weak input validation policy. Acunetix (2017) categorised these attacks into nine types: 1)
code injection; 2) CRLF injection; 3) cross-site scripting; 4) email injection; 5) host header
injection; 6) LDAP injection; 7) OS command injection; 8) SQL injection; 9) XPath injection.
There are several ways to reproduce these attacks. Below, the paper gives an overview of the
different types of injection attacks, paying attention to the lexical properties of their URLs.
Examples of such custom URLs are shown in Table 3.
A. Cross-site scripting
Firstly, cross-site scripting (XSS) is perhaps the most common attack in which the attack vector
uses compromised web resources (Ollmann, 2004). This type of attack is performed by
sending a victim a link containing JavaScript or Flash code, which the browser executes
automatically. However, in some cases, attacks are also generated based on
particular browser vulnerabilities.
A study by Vogt, Nentwich and Jovanovic (2006) explored XSS attacks from different
perspectives, such as the attack vector, solutions and the appearance of the attacking script.
The study noted that the URL usually contains particular JavaScript commands such as
escape(document.cookie), alert('error'), and GetParameter('eid').
These and other key commands are useful for building a word dictionary to be used by the
machine learning classifier.
B. SQL injection
Secondly, SQL injection (SQLIA) is also a dangerous and common attack that can be
conducted with the help of URLs. It occurs when an attacker attempts to change the logic,
semantics or syntax of a legitimate SQL statement by inserting new SQL keywords or
operators into the statement (Halfond and Orso, 2005, p. 3). Such injections are usually
designed manually for particular web resources.
A survey by Halfond, Viegas and Orso (2006) reviewed commonly known SQLIA
techniques. According to the study, such URLs contain keywords used in SQL requests, such
as GROUP BY, DROP, or UPDATE. Also, characters such as ';', single quotes and double
quotes are common in SQL injection techniques (Tsai and Yu, 2009). A semicolon
allows the system to execute several consecutive SQL statements.
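The keyword dictionaries discussed above for XSS and SQL injection can be turned into simple binary lexical features. A minimal sketch, with an illustrative (not exhaustive) token list:

```python
# Illustrative dictionary of injection-related tokens; a real feature set
# would be derived from the literature cited above.
SUSPICIOUS_TOKENS = ["alert(", "document.cookie", "escape(",
                     "group by", "drop", "update", ";", "'"]

def injection_features(url):
    """Return one binary feature per suspicious token found in the URL."""
    lowered = url.lower()
    return {token: token in lowered for token in SUSPICIOUS_TOKENS}

feats = injection_features("http://site.com/page?id=1;DROP TABLE users")
# feats["drop"] and feats[";"] are True; feats["alert("] is False
```

Each dictionary entry becomes one column of the feature vector consumed by the classifier.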
2.2.3. Drive-by download attack
A drive-by download attack involves the unintended download of malware without the
user's knowledge. This attack can be triggered not only with the help of malware URLs;
the process can also start after viewing e-mail messages or clicking on a pop-up window
(Le et al., 2013). The attack is usually conducted with malware designed to exploit the
vulnerabilities of a web browser.
Detecting this attack by analysing lexical-based features can be an infeasible task for a
few reasons. Firstly, the attack is mostly conducted from a compromised legitimate website,
or from a legitimate website unknowingly distributing the attacker's content through a third-party
service such as online advertising (Provos et al., 2007). Therefore, these websites have
ordinary URLs that cannot be distinguished from benign ones. Secondly, malicious code can be
inserted into the web page content; the attack is triggered only once the compromised web
page has been downloaded.
2.2.4. Spamming attacks
The amount of web spam has been increasing significantly and has led to a degradation of
search results and users' experience on the internet. There are a number of techniques for
spreading spam URLs on the internet. The main attack vectors remain search
engines, emails and social networks (Garcia-Molina and Gyongyi, 2005; Jelodar et al., 2017).
#  Technique         Example
1  anchor text spam  <a href="page.html">sales, black friday, 90% discount, London</a>
2  url spam          london-90%-discount-black-fraday.camerasx.com
To sum up, this section explored the different attacks that are conducted with the help of URLs:
(1) phishing, (2) injection, (3) cross-site scripting, (4) drive-by download and (5)
spamming attacks. While considering each of these attacks, the lexical properties of the URL
string were taken into account. This knowledge is then used in the next stage of the study,
during feature representation.
It was found that in a few cases, detecting malicious URLs by looking at their lexical
properties is impossible: first, when phishers use images to obfuscate malicious URLs;
second, when malicious URLs are inserted into the HTML tags of compromised websites.
As was pointed out previously, feature representation is an important part of the workflow of
a data scientist, and correctly selected features give a classifier a high accuracy rate. This
section is dedicated to establishing the most valuable features, based on knowledge from
previous studies.
Before starting, it is necessary to define some terms. Firstly, to establish a common
understanding of the different parts of URLs, Figure 3 (adapted from RFC 3986,
Section 3) presents the names of these parts. Secondly, by 'feature extraction', the paper
refers to the process of deriving variables from the lexical, static properties of the URL.
These properties include bag-of-words (BoW), character counts, and n-grams.
http://domain.com:8042/over/there?name=ferret#nose
\__/   \_____________/\_________/ \_________/ \__/
scheme    authority      path        query   fragment
Analysis of the related studies showed the application of several groups of features that can be
conditionally divided into three main categories: BoW, n-grams and special
characters. These categories can also be broken down into subcategories, which are
presented below in this paper.
As far as n-gram features are concerned, these features can have low effectiveness in
the URL detection task. The method is usually applied to detect similarity between words in
the presence of multilingual data (Damashek, 1995). It works by converting text into tokens
of character size 'n' using a window of adjacent characters (Kolari, Finin and Joshi, 2006).
However, the majority of URLs are written in English, although the number of
domain names in other languages is rising. To avoid unnecessary noise caused by features
with low effectiveness, these features were excluded from consideration.
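For illustration only (these features were excluded from the experiment), character n-grams of a URL token can be derived by sliding a window of n adjacent characters:

```python
def char_ngrams(text, n=3):
    """Slide a window of n adjacent characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams("example", 3)  # ['exa', 'xam', 'amp', 'mpl', 'ple']
```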
On the other hand, particular features, such as (1) the length of the path, (2) the length of the
query part, and (3) the length of the fragment part, have appeared to be effective in previous
experiments. For example, after an analysis of the F-score measure of URL features, it was
found that these three features have high weights, which indicates a higher
potential for splitting benign and malicious web pages (Eshete, Villafiorita and Weldemariam,
2013).
Overall, 16 previous studies in which lexical-based features were applied were explored.
All these features were categorised and grouped by author in Table 5. The
recommended features were also applied in the current experiment. Additionally, a few features
were added to this list as the result of the analysis of the attacks discussed in the previous
section. Although these additional features are marked as New in the table, it is not claimed
that they were never applied previously; they were simply not found in the literature
explored within this study.
                               ... 'banking', 'confirm', 'secure',
                               'images', 'com', 'www', 'exe',
                               'account', 'swfNode.php', 'pdfNode.php'
4                              Existence of particular words used in an    New *
                               authentication page: 'username', 'password',
                               'urs', 'user', 'pass', 'pwd'
5  Special characters and      Existence of particular special characters  (Kolari, Finin and Joshi, 2006;
   numbers                     and numbers: '/', '.', '?' and '='          Eshete, Villafiorita and
                                                                           Weldemariam, 2013)
6  Static properties (integer) URL length                                  (Choi, Zhu and Lee, 2011;
                               Path length                                 Thomas et al.,
A number of authors, such as Sabhnani, Serpen and More (2003); Tsai and Yu (2009);
Vanhoenshoven et al. (2016); and Dong, Shang and Yu (2017), have reported the application of
machine learning approaches to the malicious URL detection problem with promising
results. Additionally, this paper reviews related state-of-the-art approaches that applied not
only machine learning but also other alternative techniques. After analysing the
published literature, approaches for detecting malicious URLs can be divided into the three
following categories: (1) machine learning approaches; (2) blacklisting approaches; and (3)
heuristic approaches. Below, these approaches are described.
2.4.1. Machine Learning approach
Based on the number of papers published in the last ten years, it seems that academia
increasingly sees the solution to this problem in the machine learning approach. However, it is
necessary to emphasise that despite the huge number of proposed solutions, almost none of
them currently has practical application in industry. There are a number of
trade-offs between computational cost and performance, and between accuracy and speed.
As Sahoo, Liu and Hoi (2017) emphasised, the issue of data collection is the biggest
obstacle for the machine learning approach, since it does not allow its application on a global
scale. This is because not all features, such as content-based and host-based features, can be
easily collected, due to the cost of collecting them and the significant number of unique URLs
on the internet.
As was previously reported, these heavyweight features provide more
chances of detecting malicious URLs. In this regard, to get a full picture, the
state-of-the-art machine learning approaches below are considered in three dimensions: (1) data
collection sources, (2) applied features and (3) applied machine learning algorithms.
Figure 4. Sources for collecting raw data
Features collected at point (1) are called content-based features. They require a full
download of the web page in order to be collected. Canali et al. (2011) and Eshete, Villafiorita
and Weldemariam (2013) conducted experiments analysing the HTML and JavaScript content
of web pages. These features were created based on the structure of HTML tags and the existence
of particular JavaScript commands or specific ActiveX elements. Additionally, a recent study
by Patil and Patil (2016) extracted lexical features from the content of HTML pages.
Despite the high accuracy that content-based features can ensure, there are two main
disadvantages that should be considered. The first concern is security: to extract these features,
a web page must be fully downloaded, so there is a high probability that malicious code will be
executed before the classifier labels it as malicious. The second is resource consumption:
all the mentioned features require high computational power and processing time. Hence, it is
doubtful that a classifier built with content-based features would be effective on a large scale.
Point (2) in Figure 4 gives more chances to intercept malicious web pages by using
machine learning classifiers. The features that can be extracted at this point are the lexical-based
and host-based features of URLs. Obviously, content-based features are not available at this
point.
Regarding host-based features, this information is usually requested from DNS
servers. WHOIS requests can obtain from DNS information about the domain owner's name,
location, IP address, lifetime, and registration and update dates. It was mentioned in the previous
chapter that malicious URLs tend to change location frequently and live only for a short
period of time. Therefore, host-based features are highly valuable for making an accurate
classification.
There are also other valuable host-based features that can be explicitly obtained from
a web host, for example, connection speed and IP addresses (if the URL contains only an IP
address). Sahoo, Liu and Hoi (2017) pointed out that it is difficult for attackers to change IP
addresses for each new attack. Hence, information about IP addresses can improve the accuracy
of classifiers.
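For instance, the IP address behind a domain can be resolved with a plain DNS lookup via Python's standard socket module; a minimal sketch that also models the availability problem (a failed lookup yields a missing value):

```python
import socket

def host_ip(domain):
    """Resolve a domain to an IPv4 address; None models a missing value."""
    try:
        return socket.gethostbyname(domain)
    except socket.gaierror:
        return None  # DNS unavailable or domain no longer registered

host_ip("localhost")  # typically '127.0.0.1'
```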
However, it should be mentioned that host-based features have obvious disadvantages,
such as availability and speed. According to McGrath and Gupta (2008), DNS servers may be
unavailable during data collection and prediction. Therefore, the training data would contain
missing values that can affect the quality of the classifier. Additionally, the connection speed
to both the DNS server and the web server can periodically decrease, which also affects the
prediction speed. Even under the assumption of a sufficient connection speed, some host-based
information can take several seconds to obtain, which is too long for real-world situations.
These factors make host-based features impractical in real-world environments.
Lexical-based features of URLs are obtained from URL names (strings). In other
words, a classifier learns to distinguish malicious URLs from benign ones according to their
appearance and text structure. Usually, measured features such as URL length, domain name
length, and the count of special characters are extracted. Additionally, binary features such as
the existence of particular characters or words in the given URL are also extracted. These
features are also known as bag-of-words (Vanhoenshoven et al., 2016).
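Bag-of-words features of this kind can be sketched as follows (an illustrative vocabulary, not the one used in the experiment):

```python
import re

# Hypothetical vocabulary; the experiment derives its own from Table 5.
VOCAB = ["secure", "account", "banking", "confirm", "login"]

def bow_features(url):
    """Binary presence indicators for each vocabulary word in the URL."""
    tokens = set(re.split(r"[\W_]+", url.lower()))
    return [1 if word in tokens else 0 for word in VOCAB]

bow_features("http://secure-login.example.com/confirm")
# [1, 0, 0, 1, 1]
```

Each vocabulary word contributes one binary column to the feature vector.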
These features have considerable drawbacks as well. For instance, a classifier built
only on the lexical-based features of URLs can be considered an extension of the blacklisting
approach. One of the drawbacks of URL-based features is that new URL names able to evade
classifiers can be generated algorithmically. However, a number of studies (Yadav et al., 2010;
Schulz et al., 2012) claim that it is possible to recognise algorithmically generated patterns by
analysing their alphanumeric distribution.
Returning to Figure 4, points (3) and (4) refer to IDS and HTTP proxy
servers. These servers are usually installed in a corporate network. The HTTP proxy
additionally allows the system to extract features based on HTTP requests and replies. The
IDS gives a lot of information at the network level. These directions are also
promising: Cisco has been developing a product called Umbrella that identifies
different kinds of intrusion scenarios by analysing IDS logs (Dua and Du, 2015).
Lastly, DNS servers, point (5), appear to be the most suitable place from which data
should be collected (Holz et al., 2008). As was described in Section 1, to hide malware IP
addresses, attackers change domain names every five minutes by registering domain names
with the help of botnets. The only place where such behaviour can be identified is the DNS.
Also, this approach eliminates the problem of losing a connection during data collection,
which was discussed earlier in this chapter.
Table 6. References for different types of machine learning algorithms used for malicious URL
detection in the last decade
Batch: SVM (Nepali, Wang and Alshboul, 2015), (Ma, L. K. Saul, et al., 2009),
(Kolari, Finin and Joshi, 2006), (Pao, Chou and Lee, 2012), (Marchal,
Francois, et al., 2015), (Marchal, State, et al., 2015), (Chu et al., 2013),
(Sorio, Bartoli and Medvet, 2013), (Xu et al., 2013), (Hou et al., 2010),
(Wang et al., 2013), (Bannur, Saul and Savage, 2011), (Huang, Qian and
Wang, 2012), (Ying and Xuhua, 2006), (He et al., 2011)
Batch: Naive Bayes (Canali et al., 2011), (Xu et al., 2013), (Hou et al., 2010), (Cao et al.,
2016), (Aggarwal, Rajadesingan and Kumaraguru, 2012)
Batch: Logistic (Garera et al., 2007), (Ma, L. K. Saul, et al., 2009), (Canali et al., 2011),
regression (Xu et al., 2013), (Wang et al., 2013)
Batch: K-nearest (Choi, Zhu and Lee, 2011; Vanhoenshoven, Napoles, et al., 2016)
neighbours
Online mode (Ma et al., 2010), (Ma, L. Saul, et al., 2009), (Blum, Wardman and
algorithms Warner, 2010)
C. Description of algorithms
1) The Support Vector Machine (SVM) is the most commonly applied learning algorithm for
classification and regression problems in this field. The SVM model is a representation of
examples as points in a multidimensional space, mapped in such a way that the examples of the
individual categories are divided by a clear boundary that is as wide as possible (Cortes
and Vapnik, 1995). In other words, the algorithm finds the maximal margin that separates
two (or more) classes.
Supposing that the extracted features form training vectors $x_i \in \mathbb{R}^p$,
$i = 1, \dots, n$, malicious and benign URLs are classified into two classes, labelled by the
vector $y \in \{1, -1\}^n$. SVM solves the following primal problem:

$$\min_{w, b, \zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i
\quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i, \;\; \zeta_i \geq 0 \qquad (1)$$

Its dual is

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad y^T \alpha = 0, \;\; 0 \leq \alpha_i \leq C, \; i = 1, \dots, n \qquad (2)$$

where $e$ is the vector of all ones, $C > 0$ is the upper bound, and $Q$ is an $n$-by-$n$
positive semidefinite matrix (Yadav, 2010).
To simplify, illustrative points in two-dimensional Cartesian coordinates are presented in
Figure 5. In this example, the positive (1 and 2) and negative (3 and 4) points that are closest
to the opposing class form the support vectors. The centre line of this margin is called the
hyperplane; on this basis a model makes a classification.
The distance between these vectors is a margin that always strives to attain the
maximum value. Maximising the margin is a constrained optimisation problem that can be
formulated as follows (Wenyu and Ya-Xiang, 2006):

$$\min_{x} \; f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \;\; h_j(x) = 0 \qquad (3)$$

where $g_i$ and $h_j$ are the inequality and equality constraints required to be satisfied, and
$f$ is the objective function that needs to be optimised subject to the constraints (Yurkiewicz,
1985). With the help of the constraints, the tolerance of the algorithm with respect to outliers
can be regulated.
This algorithm has a number of advantages that make it attractive for binary and
multi-class classification tasks. One advantage is that the dimensionality of the data does not
limit the accuracy of the classification: a classification can still be effective even if the number
of features exceeds the number of observations.
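In Scikit-Learn, the formulation above corresponds to svm.SVC; a minimal sketch on toy data (not the experimental dataset):

```python
from sklearn import svm

# Toy 2-D feature vectors (e.g. URL length, special-character count)
X = [[10, 0], [12, 1], [80, 9], [95, 12]]
y = [0, 0, 1, 1]  # 0 = benign, 1 = malicious

clf = svm.SVC(kernel="linear", C=1.0)  # C bounds the dual variables
clf.fit(X, y)
clf.predict([[90, 10]])  # labelled malicious (1)
```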
2) K-Nearest Neighbours (KNN) is a classifier that measures the distance from a query point
to its k nearest neighbours and assigns it to the class most common among them. As
shown in Figure 6, the X query point has four negative (-) samples and one positive (+)
sample among its neighbours. As a result, this point is assigned to the negative class, because
it has more negative neighbours than positive ones.
The distances between neighbouring points are measured by the Euclidean metric. Given two
points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$, the distance between them is

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (4)$$
The most important parameter of the algorithm is the number of neighbours. Dua and Du
(2015) argue that the more neighbours a query point has, the more noise the algorithm
receives, and therefore the accuracy of classification is reduced. According to their
recommendations, the value of k should be less than the square root of the total number of
training samples. Also, in binary classification problems, the number of neighbours should be
chosen among odd numbers to avoid tied votes.
The algorithm has advantages as well as disadvantages. Many data scientists choose
this method mainly because it is easy to implement and interpret. However, KNN
classification appears to be time and memory consuming (ibid.).
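The rule of thumb above (k odd and below the square root of the sample count) can be sketched with Scikit-Learn on toy data:

```python
import math
from sklearn.neighbors import KNeighborsClassifier

X = [[10, 0], [12, 1], [14, 0], [80, 9], [95, 12], [88, 10]]
y = [0, 0, 0, 1, 1, 1]

# k: largest odd number not exceeding the square root of the sample count
k = int(math.sqrt(len(X)))
if k % 2 == 0:
    k -= 1
k = max(k, 1)

clf = KNeighborsClassifier(n_neighbors=k)  # Euclidean metric by default
clf.fit(X, y)
clf.predict([[13, 1]])  # nearest neighbours are benign (0)
```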
3) Online learning algorithms have become increasingly popular, which reflects their
practicality in industry. By online machine learning, the paper refers to the machine
learning approach in which data becomes available in sequential order and the weights of
the predictor are updated at each iteration.
Formally speaking, an online learning algorithm addresses a classification problem over a
sequence of time steps. When the model makes a mistake at time step $t$, the algorithm
memorises it and uses it to create the hypothesis for the next time step $t + 1$. This approach
to learning is also called the incremental approach (Ross et al., 2008).
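Scikit-Learn does not implement the Confidence-Weighted algorithm, but the incremental update scheme itself can be sketched with SGDClassifier, whose partial_fit method consumes data in sequential batches (toy data; a stand-in, not the CW algorithm):

```python
from sklearn.linear_model import SGDClassifier

# Online (incremental) learning: weights are updated batch by batch
clf = SGDClassifier()
batches = [([[10, 0], [90, 10]], [0, 1]),   # day 1 of collected URLs
           ([[12, 1], [85, 9]], [0, 1])]    # day 2
for X_t, y_t in batches:
    clf.partial_fit(X_t, y_t, classes=[0, 1])  # update weights at step t
clf.predict([[88, 11]])
```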
One of the largest studies in the field of identifying malicious URLs was conducted by
Ma et al. (2010) using an online machine learning algorithm. Over about 100 days, as part
of the experiment, malicious URLs were collected from the online services of Cisco, Google,
Microsoft, and Yahoo. The scholars compared three algorithms: (1) the Perceptron, (2) the
Passive-Aggressive algorithm and (3) the Confidence-Weighted (CW) algorithm. They justified
the choice of the latter by referring to the advantages of CW that make it well suited to models
with a large number of features. The experiment had significant results; the classifier
showed an accuracy rate of 99%, which made the study the most cited in the field, and the
dataset was used by other scholars for testing alternative machine learning algorithms.
Although the presented approach showed positive results in malicious
URL detection, these findings should be interpreted with caution for two reasons. Firstly,
the paper did not provide a definition of malicious URLs. As was stated earlier, the
malicious intents behind URLs differ and have different feature properties. Accordingly, the
feature vector of one type of malicious URL can be completely ineffective for another and
introduce noise. Secondly, the experiment uses host-based features, which have obvious
drawbacks that make them impractical; as mentioned earlier, in real-world industrial
environments it is impractical to collect host-based features.
D. Algorithm selection
It was decided to select the algorithm by matching its properties with the actual requirements
of the experiment. The approach known as multiple-criteria decision analysis (MCDA),
described by Antunes and Henriques (2016), helped to make this balanced
decision. MCDA is an integrated method that explicitly evaluates multiple conflicting
criteria in a decision-making process. To select the most appropriate algorithm according to
the MCDA approach, it is necessary to complete the following three steps: 1) define the
criteria; 2) prioritise them by assigning a weight to each criterion; 3) present the list of
available options. These steps are explained below in this section.
As far as the criteria for the algorithm are concerned, three main criteria were defined. The
first criterion is (1) the average accuracy rate of the algorithm, which was taken from
previous experiments; a list of the experiments from which these values were obtained can be
found in Appendix A. The second criterion, (2) ubiquity, is measured by the number of
experiments in which a certain algorithm was applied in the last 10 years. The last criterion,
(3) flexibility, relates to how far the algorithm can be tuned in the experimental tool, the
Python library Scikit-Learn. Basically, it is the number of parameters and attributes that
the particular algorithm has in Scikit-Learn v0.19.2.
According to the MCDA decision-making approach, it was also necessary to establish
the available options for the selection. As was mentioned in the literature review, a particular
set of algorithms was applied in almost all experiments during the past 10 years.
But only the top three of them were considered in this research due to practical limitations.
These three algorithms are Logistic Regression (LR), Naïve Bayes (NB), and the
Support Vector Machine (SVM). Table 7 presents the values of these algorithms in the
context of the three criteria mentioned above.
Then, these three criteria were prioritised by assigning them weights from 0 to 1. This
is necessary for determining how important the criteria are to the objective. The result of these
operations is presented in Table 7.
Criterion                                 Weight                Values
Accuracy rate (%)                         0.7     93.92    87.95    95
  average accuracy rate obtained from previous experiments
Ubiquitousness (count)                    0.1     2        15       2
  the number of experiments where an algorithm was applied
Flexibility (count)                       0.2     8        20       0
  the number of parameters and attributes for tuning a model
Scores                                            67.544   67.065   66.700
Finally, to calculate the final score, it was necessary to multiply the values by their weights.
As a result, the analysis shows that the most appropriate algorithm for this experiment is
KNN, with a score of 67.544. Accordingly, the experiment will be conducted with the help of
the KNN algorithm.
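The weighted-sum calculation behind Table 7 can be reproduced as follows (the criterion values for KNN are taken from the table):

```python
weights = {"accuracy": 0.7, "ubiquity": 0.1, "flexibility": 0.2}
knn = {"accuracy": 93.92, "ubiquity": 2, "flexibility": 8}

# MCDA weighted sum: multiply each criterion value by its weight
score = sum(weights[c] * knn[c] for c in weights)
round(score, 3)  # 67.544
```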
approach monitors incoming data, such as URL and cookie data. The distribution component
focuses on objects and operators associated with corrupted values, such as escape() and
encodeURIComponent(). The work describes six such conditions; for a page to be
considered vulnerable, all six components must satisfy their conditions.
Such a flexible approach makes the system capable of detecting malicious URLs that
were not previously on a blacklist. The main advantages of this system are its flexibility and
expandability. But this approach requires persistent human involvement. Additionally, the
system can be developed only for a limited number of common threats and cannot generalise
to all types of (new) attacks. Moreover, such heuristics are easy to bypass using obfuscation
methods.
A more specific version of the heuristic approach analyses the dynamics of the
execution of a web page, as proposed, for example, by Kolbitsch, Livshits and Seifert
(2012) and Eshete, Villafiorita and Weldemariam (2013).
Overall, this chapter helped to achieve five out of seven research objectives, which are (1) to
identify types of malicious URLs; (2) to explore different types of attacks conducted by
URLs; (3) to identify valuable lexical-based features; (4) to identify the state-of-the-art
machine learning approach; and (5) to choose the most appropriate machine learning
algorithm.
The first finding was that all malicious URLs in the explored literature can be
categorised into phishing, spamming and malware URLs. The second finding related to the
attacks conducted with the help of URLs. Together, these two findings gave an understanding
of the basic lexical parameters of malicious URLs. However, it was also found that in a few
cases, lexical-based features are of no help in detecting malicious URLs.
Next, to identify the most valuable lexical-based features, related literature was
reviewed in Section 2.3. It was found that all these features can be grouped into eight
categories. However, after critical evaluation, some features were excluded from further
consideration due to their low effectiveness. Also, based on the knowledge obtained in the
previous sections, a few new features were identified as valuable.
Lastly, to choose the state-of-the-art machine learning approach, 30 related studies
from the last ten years were explored. The results of these experiments were presented
systematically in this section. It was found that mainly three batch-mode machine learning
algorithms (SVM, NB, LR) and one online algorithm (CW) have demonstrated the best
performance. Then, backed by this information, an MCDA analysis was performed to choose
the most appropriate algorithm, which appeared to be KNN.
Chapter 3
Methodology
This chapter is dedicated to describing the methodology used to conduct this experiment.
The aim of the chapter is to establish an experimental configuration baseline and its
subsequent derivation. The chapter is divided into two main sections, covering the theoretical
and practical approaches.
To explain how this study is conducted to achieve the research aim and objectives, the
Onion framework was applied. As shown in Figure 7, the methods of each stage are
presented in the corresponding layer and are described in this section.
3.1.4. Strategy
Denzin and Lincoln (2011) defined the research strategy as the link between research
philosophy and method. Strategy is needed to plan how to collect and process data for
achieving the research objectives (Saunders, Lewis and Thornhill, 2009). There are several
options for the main strategy, such as experiments, surveys, and case studies. This study is
conducted with the help of an experiment.
3.1.5. Time horizon
The time horizon establishes the period of a study. There are two main time horizons:
longitudinal and cross-sectional. Due to the limitation of the dissertation to the university
course timeline, a cross-sectional study was selected as the approach for data collection. This
type of research is conducted at a specific point in time (Gould et al., 2015).
3.1.6. Techniques and procedures
The main technique applied in the research is conducting an experiment with the help of a
machine learning classifier. The procedures of the experiment are described in the next section
(Section 3.2).
This section describes how to practically conduct the experiment. It starts by describing the
three experimental stages. Then the experimental environment and tools used are described to
give other researchers the opportunity to repeat the experiment if necessary; next, the data is
described; this is followed by giving information about how the classification model is
optimised for the experiment; finally, it gives information about the metrics that are used for
evaluating the results.
The experiment was hosted on a MacBook Pro running the macOS High Sierra operating
system. According to the risk mitigation plan, in case of software defects, the operating
system was periodically backed up using the standard macOS functionality. The technical
characteristics of the experimental device are presented in more detail in Figure 8.
The processing of the collected data and all stages of the data mining were performed in the
Python programming language, v3.6.5. This choice is explained by the experimenter's
personal preference and also by the presence of all the necessary libraries (e.g. the Pandas
library) for data preprocessing. The full list of the Python libraries used, with descriptions,
can be found in Appendix B (lines 2 to 24).
The process of selecting the toolkit for building the classifier was the subject of
careful analysis. Taking into account the review of tools for educational data mining by
Slater et al. (2016), the shortlist of considered tools included Python's Scikit-Learn,
RapidMiner, Matlab, Weka, and R. After critical analysis, it was decided to use Scikit-Learn
in the Python environment. This toolkit was chosen for two main reasons. The first is
integrity: as mentioned above, Python is also used for the other tasks of the experiment, so it
is convenient to have a single environment for the whole experiment. The second reason is
processing speed. Pedregosa et al. (2012) examined the processing speed of several toolkits
by running a learning and cross-validation process and found that Scikit-Learn was the
fastest.
To manage releases of the experimental configuration and source code, the online service
GitHub was used. The latest version of the source code can also be found in Appendix B.
The use of this service was also part of the risk mitigation plan mentioned in the
introduction: in the case of hardware failure, it would always be possible to restore the
program code from GitHub.
To recompile the experiment's Python code, it is not necessary to use a particular
code editor. However, all code was written in Jupyter Notebook v4.4.0; hence, the source
code file is stored with the *.ipynb extension.
Additionally, the application of this set of toolkits, libraries and online services
appears to be common practice in the data science community (Stackoverflow, 2018).
The only limiting aspect of the experimental environment was the lack of computational
power for training the classifier. Training on the dataset took several hours, which
of course had a negative impact on the research experience. The full list of tools used can be
found in Table 8.
3.2.2. Data
A. Collection
At the beginning of the research, finding an appropriate dataset was a challenging task. This
was mostly because the experimental data from previous studies did not fit this research. For
example, the dataset of the large-scale studies by Ma et al. (2009), Vanhoenshoven et al.
(2016) and Dong et al. (2017) was in the SVMLight format. The application of this data was
impossible because the features had already been converted into numerical arrays. Despite
the general description of the features, in practice it was not possible to exclude host-based
features from this dataset.
For this reason, data was collected from publicly available data sources. In particular,
benign URLs were obtained from the Open Directory Project (DMOZ). The directory
consists of the largest set of URLs that are manually checked by editors. These editors have a
certain level of trust because they themselves pass a preliminary check.
For collecting malicious URLs, custom code (Appendix B, lines 26–78) was
developed that parses URLs from online services such as Vxvault, Malware Domain List and
Cybercrime-Tracker. A list of these sources can be found in Table 9.
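The actual parser is in Appendix B (lines 26–78); the following is only a minimal sketch of the harvesting idea, extracting http(s) URLs from the raw text of a fetched page. The sample page fragment and its layout are illustrative assumptions, not the real markup of the listed services.

```python
import re

# Illustrative sketch only: the real parser lives in Appendix B (lines 26-78).
# The sample page below is a made-up stand-in for a fetched source page.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(page_text):
    """Return all http(s) URLs found in a page's raw text."""
    return URL_PATTERN.findall(page_text)

sample_page = """
<td>2018-07-01</td><td>http://malicious.example.com/payload.exe</td>
<td>2018-07-02</td><td>http://evil.example.org/drop.php?id=3</td>
"""
urls = extract_urls(sample_page)
```

In practice each source page would be fetched over HTTP before parsing; the regular expression simply stops at whitespace, quotes, or angle brackets, which is sufficient for link lists of this kind.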
The bar chart in Figure 9 shows the number of samples collected from each source, while the
ratio of benign and malicious classes is presented in the pie chart.
B. Preprocessing
In this section, the preprocessing stage, which comes after data collection, is described. By
preprocessing, this paper refers to the stage where data are converted into a form that is more
appropriate for the selected machine learning algorithm. Preprocessing includes stages such as
feature extraction, dealing with missing data and inappropriate values, and feature
engineering. These preprocessing stages are discussed below in this section.
Feature extraction is an important stage, in which raw data are transformed into
features. A list of valuable features is presented in the methodology chapter (Section 3.4). The
full list of extracted features is presented in Appendix C. Additionally, in this stage, all textual
features are converted into numerical form. This is because most machine learning
algorithms, and KNN in particular, work only on numerical (integer or real) data. The classes,
malicious and benign, are also represented in digital format, as 1 and 0 respectively.
The next stage deals with missing and inappropriate data, which usually appear during
the data collection process. There are two options for dealing with missing data, and the
appropriate choice depends on the reason for their absence. First, an entire row should be
deleted from a table if its cells are missing because of difficulties during data collection;
otherwise, these data might be misinterpreted by the learning algorithm. Second, in some
cases missing data have some meaning, and such values should be grouped into a new
category (Brink, Richards and Fetherolf, 2016).
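The two strategies above can be sketched with Pandas, the library used for preprocessing in this experiment; the column names and values here are hypothetical stand-ins for the real features.

```python
import pandas as pd

# Illustrative sketch of the two missing-data strategies; the column
# names and values are hypothetical, not the experiment's real data.
df = pd.DataFrame({
    "url_length": [54, None, 23],    # missing due to a collection glitch
    "registrar":  ["a", "b", None],  # missing value that carries meaning
    "type":       [1, 0, 0],
})

# Option 1: drop rows whose cells were lost during collection.
df = df.dropna(subset=["url_length"])

# Option 2: keep meaningful gaps by grouping them into a new category.
df["registrar"] = df["registrar"].fillna("unknown")
```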
Regarding inappropriate data, it was found that some URLs were created with the help of
URL shortening services such as bit.ly and goo.gl. These URLs carry no value
for the classifier and create unnecessary noise (Shekokar et al., 2015). To identify these
URLs, custom code was developed that uses an application programming interface (API)
provided by the online service longurl.org. This service allows one to automatically
detect and expand shortened URL addresses, and currently supports about 300
popular URL shortening services.
As far as feature engineering is concerned, the term refers to the process of applying
mathematical operations to extracted features in order to create further independent variables.
Operations such as finding the mean, normalising, or calculating ratios are commonly
applied by machine learning practitioners to boost the accuracy and computational
efficiency of classifier models (Brink, Richards and Fetherolf, 2016).
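The three operations named above (mean, ratio, normalisation) can be sketched in a few lines; the input values and the derived feature names are hypothetical, purely for illustration.

```python
# Illustrative feature-engineering sketch; values are made up.
url_lengths = [54, 23, 88, 41]
special_counts = [12, 3, 30, 9]

# Mean of an existing feature.
mean_length = sum(url_lengths) / len(url_lengths)

# Ratio feature: share of special characters in each URL string.
special_ratio = [s / l for s, l in zip(special_counts, url_lengths)]

# Min-max normalisation of url_length into [0, 1].
lo, hi = min(url_lengths), max(url_lengths)
norm_length = [(l - lo) / (hi - lo) for l in url_lengths]
```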
Lastly, the Python code that was used for data preprocessing can be found in
Appendix B (lines 80–155). Fragments of the data and references to the original
file can be found in Appendix D. This should allow other scientists to
smoothly repeat this experiment if necessary.
The validation of a classifier’s performance is an important part of data analytics,
since maximum prediction accuracy cannot be expected from the initial configuration. The
validation was performed using the model_selection module of the Scikit-Learn library.
The goal is to be sure that the model will show a stable result when new data are
received: a single training run does not give an indication of how well the learner will
generalise to a previously unseen dataset. In other words, a classifier should be low in bias
and variance, without overfitting or underfitting for a particular dataset (Freitas, 2000).
To tackle this issue, trial classification is performed, and the obtained results
are compared with the actual values. As a result of this operation, the model yields a
numerical estimate of the difference between classified and actual values, which is called the
training error. This process is also called validation.
The first validation is performed during the model training process, when the entire
dataset is divided into two parts, training and testing, as shown in Figure 10 (a). But by
dividing the data into only two parts, the model gets more chances of receiving non-randomly
distributed data. Therefore, it is necessary to cross-check classification accuracy by
additionally dividing the dataset into more parts (Platt, 2013). This stage is also called
cross-validation and is shown in Figure 10 (b).
Figure 10. Visual representation of data splitting and validating processes (Nelson, 2018)
Figure 11. Visual representation of K-Folds cross validation method (Nelson, 2018)
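The experiment itself used Scikit-Learn's model_selection utilities; the pure-Python sketch below only illustrates the splitting principle of the K-Folds method in Figure 11, with each of the k folds serving once as the test part.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists: each of the k folds serves once
    as the test part while the remaining folds form the training part.
    A didactic sketch of K-Folds, not the Scikit-Learn implementation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

splits = list(k_fold_indices(10, 5))
```

Averaging the validation error over all k train/test pairs gives a more stable accuracy estimate than a single split.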
                Actual Positive     Actual Negative
Test Positive   True-Positive       False-Positive
Test Negative   False-Negative      True-Negative
But there are also advanced metrics that give a broader understanding of the accuracy of a
model. These metrics include the True Positive Rate (TPR), False Positive Rate (FPR),
Precision and Recall, which are calculated from the above-mentioned basic metrics. Overall,
these advanced metrics with the calculation formulas are presented in Table 10.
# Metric Formula
1 TPR (True Positive Rate or Sensitivity) TP / (TP + FN)
2 FPR (False Positive Rate or Fall-out) FP / (FP + TN)
3 Precision TP / (TP + FP)
4 Recall TP / (TP + FN)
From the table, it can be seen that TPR and Recall are identical. Consequently, the following
question arises: if they are the same, why are they named differently? This is partly
because TPR and FPR are usually used for building a receiver operating characteristic
(ROC) curve, so in the literature they are usually used together. Later, the data science
community came up with Recall and Precision, which are also used in pairs. That is, it is
common practice to measure a classifier’s accuracy using either a ROC curve (built from TPR
and FPR) or Recall with Precision.
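The formulas in Table 10 reduce to a few lines of arithmetic over the four basic confusion-matrix counts; the counts below are arbitrary illustrative numbers, not the experiment's results.

```python
# The four metrics of Table 10, computed from basic confusion-matrix
# counts. TP/FP/FN/TN values here are arbitrary illustrative numbers.
TP, FP, FN, TN = 80, 10, 20, 890

tpr = TP / (TP + FN)        # True Positive Rate (Sensitivity)
fpr = FP / (FP + TN)        # False Positive Rate
precision = TP / (TP + FP)
recall = TP / (TP + FN)     # identical to TPR by definition
```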
Drawing on a range of sources (Davis and Goadrich, 2006; Martin, 2011), different
recommendations have been set out about when these metrics should be used. From those
recommendations, several important and simple ideas can be obtained once the scholars’
mathematical arguments are demystified.
Mainly, Precision is recommended (Davis and Goadrich, 2006) where the dataset has
predominantly negative samples (benign URLs) rather than positive ones (malicious URLs).
This is because Precision is more focused on the positive class, hence there are more chances
of correctly detecting a malicious URL. This approach is applied, for example, where the
misclassification of a malicious URL has lamentable consequences.
On the other hand, FPR and TPR (the ROC metrics) measure the ability to distinguish
between two classes (ibid.). Hence, ROC curve metrics should be used when the detection of
both classes is equally important, because these metrics give equal weight to the prediction
ability for both classes. Usually, ROC curve metrics are used when the two classes are
balanced, or when the positive class is larger.
Overall, taking into account the above recommendations, it was decided to
measure the classifier’s accuracy with the help of the Precision and Recall metrics. This
decision was made according to the following factors: (1) as revealed in the methodology
chapter, the collected dataset is unbalanced – it has more negative samples than positive;
(2) the study is conducted under the assumption that a collision with a malware URL will
have serious negative consequences, hence security is a priority. Therefore, the end result of
the classifier will be evaluated with the help of the Precision and Recall metrics. In
particular, the results will be compared with similar metrics from previous works. After the
comparison, the relevant conclusions will be drawn, which can be found in the Results and
Analysis chapter of this paper.
3.3. Summary of the methodology
To sum up, the chapter explored all stages of the experiment step-by-step. The chapter
considered the methodology from two perspectives: theoretical and practical.
The theoretical part was built with the help of the Onion framework. According to this
framework, the following choices were made: positivism was chosen as the paradigm;
theory development follows the deductive method; a quantitative methodology was
developed; the strategy is to conduct an experiment; regarding the time horizon, the study is
cross-sectional; lastly, as the techniques and procedures, an experiment was conducted on a
machine learning classifier.
In the practical part, the experimental environment was systematically explained. This
section has provided a list of tools that were used during the experiment.
Next, detailed information about the experimental dataset was given. That section
described the data and examined some obstacles to obtaining it. Information about data
preprocessing was also given there.
Then the chapter moved to the model optimisation approach that was applied. This stage
was part of the risk mitigation process related to poor classifier quality. As a result, the
K-Fold cross-validation method was chosen for evaluating the classifier.
Lastly, the model evaluation metrics were chosen. After an analysis of the dataset and
algorithm, it was found that Precision and Recall would be appropriate metrics for measuring
the accuracy of the classifier. This gave an understanding of how to compare the obtained
classifier with previously conducted state-of-the-art experiments.
Chapter 4
Results
This chapter describes the experiment’s results. Detailed information about the
methodology of the experiment can be found in the previous chapter. The chapter is divided
into four parts: 1) description of the data; 2) description of the classifier; 3) performance of
the classifier; 4) comparison with other studies.
This section gives more detailed information about the preparation of the dataset before the
experiment. The number of samples in these subsets is presented in Table 11.
The bar chart above shows that the two classes were distributed almost equally between the
training and testing subsets. As mentioned previously, the given dataset is imbalanced, which
means that the total number of benign URLs is much larger than the number of malicious
URLs. For this reason, malicious URLs make up only about 11% of the dataset.
The dataset was divided into training and testing subsets with the help of Scikit-Learn’s
train_test_split function. The main reason for this splitting is the primary validation, which is
described in the methodology chapter of this paper (Section 3.5). During the experiment,
the classifier was trained on these 291 201 URLs, then prediction and validation were made
on the remaining 291 201 URLs.
The next step was to analyse the mutual dependence between all features of the
dataset. This analysis was made with the help of the Pandas library, which builds a
correlation matrix based on Spearman's rank coefficient, or Spearman's rho (Gautheir,
2001), as shown in Figure 14.
The matrix above presents the intercorrelations among all features of the given dataset.
According to the matrix, the feature is_equal has a strong correlation with is_query_part, and
the feature url_length has strong correlations with url_content_length, special_caracters
and slashes. According to Brink, Richards and Fetherolf (2016), strong correlation
between independent features is undesirable; it is therefore recommended to keep only
one of each pair of such features to reduce the size of the dataset. However, as mentioned in
the methodology, each of the extracted features previously demonstrated high performance in
the malicious URL detection task. In this regard, it was decided to keep all features for further
application.
Additionally, it is noticeable that the target feature (type) has a moderate correlation
with the features is_query_part, is_equal, is_ip_based, slashes, special_caracters, and
url_length. According to Brink, Richards and Fetherolf (2016), this means that the
mentioned features carry more weight and improve classification accuracy. In other words,
is_query_part, is_equal, is_ip_based, slashes, special_caracters, and url_length are valuable
features of the given dataset.
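In the experiment the matrix was produced by Pandas; the stand-alone sketch below computes Spearman's rho for a single pair of features to show what each matrix cell contains, using the no-ties shortcut formula (an assumption that keeps the rank computation simple).

```python
def spearman_rho(x, y):
    """Spearman's rank correlation for two equal-length sequences,
    assuming no tied values (the simple no-ties formula).
    Illustrative only; the experiment used Pandas to build the matrix."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rho = spearman_rho([10, 20, 30, 40], [1, 2, 3, 4])  # perfectly monotone pair
```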
This section describes the chosen algorithm, KNN, as implemented in the Scikit-Learn v0.19.2
library. To obtain the maximum accuracy rate, it was necessary to tune three parameters of
the algorithm (Table 12) through empirical testing and based on the recommendations
of Scikit-Learn (2018a).
The table above shows the established values of three parameters: weights, algorithm
and n_neighbors. The values of the first and second parameters were established based on
the recommendations given for binary classification tasks with a small, imbalanced
dataset. The optimal number of nearest neighbours was chosen through empirical testing
of all options between 1 and 20. The result of this test is shown in Figure 15.
The line graph above shows the misclassification error rate for each candidate number of
k neighbours. According to the plot, the algorithm shows a lower misclassification error rate
when the number of neighbours is equal to 3, 5 or 9, with the most appropriate number of
neighbours appearing to be 5. After tuning the algorithm’s parameters, the next step was to
build the classifier and obtain the values of the evaluation metrics.
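The sweep described above used Scikit-Learn's KNeighborsClassifier over k = 1..20; the self-contained sketch below re-implements the idea on a one-dimensional toy dataset purely to illustrate the selection procedure, so the data and error values are assumptions and not those of the experiment.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (1-D toy KNN)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy data: (feature value, class label); 0 = benign, 1 = malicious.
train = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
held_out = [(2.5, 0), (8.5, 1), (5.4, 0)]

# Sweep k and record the misclassification error on the held-out points.
errors = {}
for k in range(1, 6):
    wrong = sum(knn_predict(train, x, k) != y for x, y in held_out)
    errors[k] = wrong / len(held_out)

best_k = min(errors, key=errors.get)
```

In the experiment the same loop structure drives KNeighborsClassifier, and the error-versus-k values are what Figure 15 plots.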
4.3. Results
This section provides information about the classifier’s performance. As established
previously in the methodology chapter, the Precision and Recall metrics were chosen for
evaluating the performance of the classifier. The values of these metrics are shown in Table
13.
Table 13 shows that the total values of the Precision and Recall metrics are 0.93 and
0.94 respectively, while the overall accuracy rate is 0.93. However, the capability of
classifying malicious URLs is poor, with a Precision of 0.76 and a Recall of 0.59. This can
also be seen in the confusion matrix given in Figure 16.
The matrix above shows that the model misclassified malicious URLs (FN) 6899 times.
Although the classifier copes with the classification of benign URLs very well, the
misclassification rate for malicious URLs is considered to be very high.
4.4. Comparison
The values obtained from this experiment were compared with the results of the experiments
that have shown the best performance. The comparison is presented in Table 14. A full list of
the experiments with their results is available in Appendix A.
Chapter 5
Discussion and Analysis
5.1.1. Objective 1
In this work, the methodology part began with a discussion of the state-of-the-art machine
learning approaches for detecting malicious URLs. To understand this, most previous studies
conducted in the last 10 years were reviewed in the literature review. The analysis focused
primarily on the machine learning approach, although alternative approaches such as
heuristics and blacklisting were also considered. By giving a critical appraisal of the
advantages and disadvantages of the proposed approaches, a few papers were chosen for
further examination.
5.1.2. Objective 2
The next objective was dedicated to identifying the types of malicious URLs. It was also
explored in the literature review; to achieve it, recent surveys were reviewed as well.
According to these papers, malicious URLs can be categorised into the following three
groups: (1) spamming; (2) phishing; and (3) malware. After such categorisation, it became
clear which features of the URL would need attention while extracting features. Generally,
this knowledge gave a closer view of the next objective, which attempts to explore attacks
that are conducted with the help of URLs.
5.1.3. Objective 3
The objective of exploring attack types was also addressed in the literature review. The
analysis of different attacks undertaken there has extended our knowledge in this domain. It
helps to pay attention to particular features of a URL string, which in turn allows achieving
the next objective: finding the most valuable lexical-based features.
5.1.4. Objective 4
The next objective was to identify valuable lexical-based features. Using all the gathered
information about the lexical properties of malicious URLs, the most valuable features were
listed in the literature review. Additionally, the features used in previous studies were also
taken into account. Overall, 25 lexical-based features were selected for feeding the machine
learning algorithm.
5.1.5. Objective 5
The following objective was to choose the most appropriate machine learning
algorithm. The Multiple-Criteria Decision Analysis (MCDA) approach was used to make the
final decision regarding the algorithm. As a result, it was found that KNN is the most suitable
algorithm for this experiment.
5.1.6. Objective 6
The data collection process was described in the methodology part of the paper. For this
research, it was desirable to collect a wide variety of malicious and benign URLs in
order to produce rich and interesting results for the experiment. After the tremendous work of
parsing different web resources, an abundant amount of data was found. However, owing to a
limitation of computational power, the number of benign URLs was reduced from 4 million to
400 thousand.
5.1.7. Objective 7
Lastly, it was necessary to ensure the robustness of the machine learning classifier; in other
words, the classifier needed to be resilient to overfitting. To achieve this, the trained
model was cross-validated using the k-Fold technique, whose principle was explained in the
methodology part.
5.2. Findings
Overall, the classifier showed good performance. The average results of this experiment were
lower than the results obtained by Ma et al. (2010) and Dong, Shang and Yu (2017).
However, the average result of 0.93–0.94 is still considered high.
The first finding concerned the value of the selected features. Data analysis with the help of
a correlation matrix based on Spearman's rank coefficient showed that the presence of a query
part, an IP address and special characters in the URL string may indicate that the URL is
malicious.
Second, the chosen KNN binary classifier does not appear to be as fast as reported in
previous studies. It is in fact common to encounter unexpected resource overconsumption in
data analysis workflows. It is also still difficult to calculate precisely how the model will
behave when the dataset becomes several thousand times larger. Hence, an additional
calculation of computational resources is needed before designing a large-scale solution.
Finally, the main finding was that the average performance of the classifier was sufficiently
good. However, the Recall for the malicious class was equal to 0.59, which means that the
classifier is able to correctly detect only slightly more than half of the malicious URLs. This
poor Recall can be explained by some limitations of this experiment and the general
limitations of the machine learning approach, which are discussed below.
5.3. Limitations
Despite the systematic approach to study design, the results should be interpreted with
caution due to some limitations of the experiment and of the general machine learning
approach.
The main limitation of this experiment was the size of the dataset. In particular, the
relatively small number of malicious URLs may have caused the poor Recall of the classifier:
the initial data may not have been sufficient for a qualitative learning process.
The second limitation is the labelling of the training dataset. As previously stated,
the labelling of malicious URLs in the collected data was done manually by various
volunteers. In this regard, the quality of the labelling is not fully reliable.
Another limitation is compromised websites. As mentioned in the literature review,
almost one third of websites may be compromised. This creates obstacles for the machine
learning approach because compromised web resources are actually benign sites with the
lexical properties inherent to benign URLs; hence, the classifier incorrectly detects them as
benign.
Moreover, other general limitations of the machine learning approach should be
addressed as well. For example, various obfuscation techniques also impose solid limitations
on the approach. As reported, the use of obfuscation techniques based on URL shortening
services and QR code generators is becoming a new trend in phishing attacks, and detecting
these kinds of URLs is currently an infeasible task for machine learning (Sahoo, Liu and
Hoi, 2017).
Finally, poisoning attacks are a new challenge for machine learning practitioners
working in the cybersecurity industry. Such an attack is carried out by supplying carefully
designed samples that eventually compromise the learning process of a classifier; it can thus
be regarded as adversarial contamination of the training data (Jagielski et al., 2018).
In the introduction, three risks were identified. During the project, two of them needed to be
mitigated: (1) failure to get data and (2) low quality of the classifier.
Regarding the dataset, it was impossible to find a raw dataset from previously
published studies. Therefore, raw URLs were requested from other large-scale studies,
such as Ma et al. (2009) and Vanhoenshoven et al. (2016). However, it was found that the
original URLs are no longer available. In this regard, it was decided to collect data with the
help of custom software. In the end, the data was collected on time.
The second risk, related to the quality of the classifier, also materialised during the
experiment: several runs of the predictor showed different results. However, as discussed
in the methodology, the k-Fold cross-validation method was applied to avoid overfitting
and underfitting of the model. As a result, the performance of the algorithm was stable by
the end of the experiment.
Overall, the experiment was conducted smoothly. This can be seen as a result of the
mitigation actions discussed at the beginning of the project.
Chapter 6
References
Acunetix (2017) Acunetix, [online]. Available at:
<https://www.acunetix.com/blog/articles/injection-attacks> [Accessed: 1 May 2018].
Aggarwal, A., Rajadesingan, A. and Kumaraguru, P. (2012) ‘PhishAri: Automatic realtime
phishing detection on twitter’, eCrime Researchers Summit, eCrime, pp. 1–12. doi:
10.1109/eCrime.2012.6489521.
Alcaide, A., Blasco, J., Galan, E. and Orfila, A. (2011) ‘Cross-Site Scripting: An Overview’,
Innovations in SMEs and Conducting EBusiness Technologies Trends and Solutions, pp. 61–
75. doi: 10.4018/978-1-60960-765-4.ch004.
Aleroud, A. and Zhou, L. (2017) ‘Phishing environments, techniques, and countermeasures:
A survey’, Computers and Security, 68(May), pp. 160–196. doi: 10.1016/j.cose.2017.04.006.
Alsharnouby, M., Alaca, F. and Chiasson, S. (2015) ‘Why phishing still works: User
strategies for combating phishing attacks’, International Journal of Human Computer Studies.
doi: 10.1016/j.ijhcs.2015.05.005.
Anthony, E. (2007) ‘Phishing: An Analysis of a Growing Problem’. Available at:
<https://www.sans.org/reading-room/whitepapers/threats/phishing-analysis-growing-problem-
1417> [Accessed: 1 May 2018].
Antunes, C. H. and Henriques, C. O. (2016) Multiple Criteria Decision Analysis. doi:
10.1007/978-1-4939-3094-4.
Apple Inc. (2017) macOS Sierra: Back up with Time Machine. Available at:
<https://support.apple.com/kb/PH25710?locale=ru_RU&viewlocale=en_US> [Accessed: 2
June 2017].
Banday, M. T. and Qadri, J. a. (2007) ‘Phishing – A Growing Threat to E-Commerce’, The
Business Review, 12(2), pp. 76–83.
Bannur, S. N., Saul, L. K. and Savage, S. (2011) ‘Judging a site by its content: learning the
textual, structural, and visual features of malicious Web pages’, Proceedings of the 4th ACM
workshop on Security and artificial intelligence - AISec ’11, (Vm), p. 1. doi:
10.1145/2046684.2046686.
Basnet, R. B. and Sung, A. H. (2012) ‘Mining web to detect phishing URLs’, Proceedings -
2012 11th International Conference on Machine Learning and Applications, ICMLA 2012,
1(July 2015), pp. 568–573. doi: 10.1109/ICMLA.2012.104.
Berners-Lee, T. (2005) Uniform Resource Identifier (URI): Generic Syntax. Available at:
<https://tools.ietf.org/html/rfc3986#section-3> [Accessed: 1 May 2018].
Blum, A., Wardman, B. and Warner, G. (2010) ‘Lexical Feature Based Phishing URL
Detection Using Online Learning’, pp. 54–60.
Brink, H., Richards, J. and Fetherolf, M. (2016) Real-World Machine Learning. 1st edn.
Manning Publications.
Canali, D., Cova, M., Vigna, G. and Kruegel, C. (2011) ‘Prophiler : A Fast Filter for the
Large-Scale Detection of Malicious Web Pages Categories and Subject Descriptors’, Proc. of
the International World Wide Web Conference (WWW), pp. 197–206. doi:
10.1145/1963405.1963436.
Cao, J., Li, Q., Ji, Y., He, Y. and Guo, D. (2016) ‘Detection of Forwarding-Based Malicious
URLs in Online Social Networks’, International Journal of Parallel Programming. Springer
US, 44(1), pp. 163–180. doi: 10.1007/s10766-014-0330-9.
Chaudhry, J. A., Chaudhry, S. A. and Rittenhouse, R. G. (2016) ‘Phishing attacks and
defenses’, International Journal of Security and its Applications, 10(1), pp. 247–256. doi:
10.14257/ijsia.2016.10.1.23.
Chen, C. M., Huang, J. J. and Ou, Y. H. (2015) ‘Efficient suspicious URL filtering based on
reputation’, Journal of Information Security and Applications. Elsevier Ltd, 20, pp. 26–36.
doi: 10.1016/j.jisa.2014.10.005.
Chiew, K. L., Yong, K. S. C. and Tan, C. L. (2018) ‘A survey of phishing attacks: Their
types, vectors and technical approaches’, Expert Systems with Applications. doi:
10.1016/j.eswa.2018.03.050.
Chilisa, B. and Kawulich, B. (2012) ‘Selecting a Research Approach: Paradigm,
Methodology, and Methods’, Doing Social Research: A Global Context, (October), pp. 51–61.
Available at: <https://www.researchgate.net/profile/Barbara_Kawulich/publication/
257944787_Selecting_a_research_approach_Paradigm_methodology_and_methods/links/
56166fc308ae37cfe40910fc/Selecting-a-research-approach-Paradigm-methodology-and-
methods.pdf> [Accessed: 1 May 2018].
Choi, H., Zhu, B. B. and Lee, H. (2011) ‘Detecting malicious web links and identifying their
attack types’, WebApps, p. 11. doi: 10.1109/IUCS.2010.5666254.
Chu, W., Zhu, B. B., Xue, F., Guan, X. and Cai, Z. (2013) ‘Protect sensitive sites from
phishing attacks using features extractable from inaccessible phishing URLs’, IEEE
International Conference on Communications, (July), pp. 1990–1994. doi:
10.1109/ICC.2013.6654816.
Clough, P. (2012) A Student’s Guide to Methodology. 3rd edn. Sage Publications Ltd. doi:
1446208621.
Cortes, C. and Vapnik, V. (1995) ‘Support-Vector Networks’, Machine Learning, 20(3), pp.
273–297. doi: 10.1023/A:1022627411411.
Curtsinger, C., Livshits, B., Zorn, B. and Seifert, C. (2011) ‘ZOZZLE: fast and precise in-
browser JavaScript malware detection’, SEC’11 Proceedings of the 20th USENIX conference
on Security, p. 3. Available at: <http://dl.acm.org.oca.korea.ac.kr/citation.cfm?id=2028067.2028070>
[Accessed: 1 May 2018].
Damashek, M. (1995) ‘Gauging Similarity with Language-Independent Categorization of
Text’, 267(5199), pp. 843–848.
Davis, J. and Goadrich, M. (2006) ‘The relationship between Precision-Recall and ROC
curves’, University of Wisconsin-Madison, Madison, WI, pp. 233–240. doi:
10.1145/1143844.1143874.
Dong, H., Shang, J. and Yu, D. (2017) ‘Beyond the blacklists : Detecting malicious URL
through machine learning’.
Dua, S. and Du, X. (2015) Data Mining and Machine Learning in Cybersecurity, Impressoras
3D: O novo meio Produtivo. doi: 10.1017/CBO9781107415324.004.
Enbody, R. and Sood, A. (2011) ‘Fraud & security’, (April).
Eshete, B., Villafiorita, A. and Weldemariam, K. (2013) ‘BINSPECT: Holistic analysis and
detection of malicious web pages’, Lecture Notes of the Institute for Computer Sciences,
Social-Informatics and Telecommunications Engineering, 106 LNICS, pp. 149–166. doi:
10.1007/978-3-642-36883-7_10.
Freitas, A. (2000) Understanding the crucial differences between classification and discovery
of association rules: a position paper. doi: 10.1145/360402.360423.
Gao, H., Hu, J., Wilson, C., Li, Z., Chen, Y. and Zhao, B. Y. (2010) ‘Detecting and
characterizing social spam campaigns’, Proceedings of the 10th annual conference on
Internet measurement - IMC ’10, p. 35. doi: 10.1145/1879141.1879147.
Garcia-Molina, H. and Gyongyi, Z. (2005) ‘Web Spam Taxonomy’, First international
workshop on adversarial information retrieval on the web (AIRWeb 2005), pp. 1–9. Available
at: <http://ilpubs.stanford.edu:8090/771/1/2005-9.pdf> [Accessed: 1 May 2018]..
Garera, S., Provos, N., Chew, M. and Rubin, A. D. (2007) ‘A framework for detection and
measurement of phishing attacks’, Proceedings of the 2007 ACM workshop on Recurring
malcode - WORM ’07, p. 1. doi: 10.1145/1314389.1314391.
Gautheir, T. (2001) ‘Detecting Trends Using Spearman’s Rank Correlation Coefficient’,
Environmental Forensics, 2(4), pp. 359–362.
Google Inc. b (2018) Safe Browsing site status [online]. Available at:
<https://transparencyreport.google.com/safe-browsing/search?hl=en_GB> [Accessed: 1 May
2018].
Gould, S. J. J., Cox, A. L., Brumby, D. P. and Wiseman, S. (2015) ‘Home is Where the Lab
is: A Comparison of Online and Lab Data From a Time-sensitive Study of Interruption’,
Human Computation, 2(1), pp. 45–67. doi: 10.15346/hc.v2i1.4.
Halevi, T. and Lewis, J. (2013) ‘Phishing, Personality Traits and Facebook’, (January).
Available at:
<https://www.researchgate.net/publication/235357780_Phishing_Personality_Traits_and_Facebook>
[Accessed: 1 May 2018].
Halfond, W. G. J. and Orso, A. (2005) ‘AMNESIA: Analysis and Monitoring for
NEutralizing SQL-Injection Attacks’, p. 3.
Halfond, W. G. J., Viegas, J. and Orso, A. (2006) ‘A Classification of SQL Injection Attacks
and Countermeasures’.
He, M., Horng, S. J., Fan, P., Khan, M. K., Run, R. S., Lai, J. L., Chen, R. J. and Sutanto, A.
(2011) ‘An efficient phishing webpage detector’, Expert Systems with Applications. Elsevier
Ltd, 38(10), pp. 12018–12027. doi: 10.1016/j.eswa.2011.01.046.
Holz, T., Gorecki, C., Rieck, K. and Freiling, F. C. (2008) ‘Measuring and Detecting Fast-
Flux Service Networks’, Ndss, pp. 24–31. doi: 10.1.1.140.188.
Hou, Y. T., Chang, Y., Chen, T., Laih, C. S. and Chen, C. M. (2010) ‘Malicious web content
detection by machine learning’, Expert Systems with Applications. Elsevier Ltd, 37(1), pp.
55–60. doi: 10.1016/j.eswa.2009.05.023.
Huang, H., Qian, L. and Wang, Y. (2012) ‘A SVM-based technique to detect phishing URLs’,
Information Technology Journal, 11(7), pp. 921–925. doi: 10.3923/itj.2012.921.925.
Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C. and Li, B. (2018) ‘Manipulating
Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning’, (1).
doi: 10.1109/SP.2018.00057.
Jelodar, H., Wang, Y., Yuan, C. and Jiang, X. (2017) ‘A systematic framework to discover
pattern for web spam classification’, 2017 8th IEEE Annual Information Technology,
Electronics and Mobile Communication Conference, IEMCON 2017, pp. 32–39. doi:
10.1109/IEMCON.2017.8117135.
Kohavi, R. (2016) ‘A Study of Cross-Validation and Bootstrap for Accuracy Estimation and
Model Selection’, Learning, (March 2001), pp. 1137–1143.
Kolari, P., Finin, T. and Joshi, A. (2006) ‘SVMs for the blogosphere: Blog identification and
splog detection’, AAAI Spring Symposium on Computational Approaches to Analyzing
Weblogs, 4, p. 1.
Kolbitsch, C., Livshits, B. and Seifert, C. (2012) ‘Rozzle: De-cloaking internet malware’.
Krombholz, K., Merkl, D. and Weippl, E. (2012) ‘Fake identities in social media: A case
study on the sustainability of the Facebook business model’, Journal of Service Science
Research, 4(2), pp. 175–212. doi: 10.1007/s12927-012-0008-z.
Kuhn, T. S. (1962) The Structure of Scientific Revolutions. Chicago: University of Chicago
Press. doi: 10.1119/1.1969660.
Kumaraguru, P., Sheng, S., Acquisti, A., Cranor, L. F. and Hong, J. (2008) ‘Lessons from a
real world evaluation of anti-phishing training’, eCrime Researchers Summit, eCrime 2008.
doi: 10.1109/ECRIME.2008.4696970.
Lawrence, N. (2013) Research Methods: Qualitative and Quantitative Approaches. Available
at: <http://lib.hpu.edu.vn/handle/123456789/28691> [Accessed: 1 May 2018].
Le, A., Markopoulou, A. and Faloutsos, M. (2011) ‘PhishDef: URL names say it all’,
Proceedings - IEEE INFOCOM, pp. 191–195. doi: 10.1109/INFCOM.2011.5934995.
Le, V. L., Welch, I., Gao, X. and Komisarczuk, P. (2013) ‘Anatomy of drive-by download
attack’, Conferences in Research and Practice in Information Technology Series, 138(Aisc).
Lin, M. S., Chiu, C. Y., Lee, Y. J. and Pao, H. K. (2013) ‘Malicious URL filtering - A big
data application’, Proceedings - 2013 IEEE International Conference on Big Data, Big Data
2013, pp. 589–596. doi: 10.1109/BigData.2013.6691627.
Ma, J., Kulesza, A., Dredze, M., Saul, L. K. and Pereira, F. (2010) ‘Exploiting Feature
Covariance in High-Dimensional Online Learning’, Proceedings of the Artificial Intelligence
and Statistics, 9, pp. 493–500. Available at: <http://citeseerx.ist.psu.edu/viewdoc/download?
doi=10.1.1.169.3701&rep=rep1&type=pdf> [Accessed: 1 May 2018].
Ma, J., Saul, L. K., Savage, S. and Voelker, G. M. (2009) ‘Beyond Blacklists: Learning to
Detect Malicious Web Sites from Suspicious URLs’, World Wide Web Internet And Web
Information Systems, pp. 1245–1253. doi: 10.1145/1557019.1557153.
Ma, J., Saul, L., Savage, S. and Voelker, G. (2009) ‘Identifying suspicious URLs: an
application of large-scale online learning’, … on Machine Learning, pp. 681–688. Available
at: <http://dl.acm.org/citation.cfm?id=1553462> [Accessed: 1 May 2018].
Marchal, S., Francois, J., State, R. and Engel, T. (2015) ‘PhishScore: Hacking phishers’
minds’, Proceedings of the 10th International Conference on Network and Service
Management, CNSM 2014, pp. 46–54. doi: 10.1109/CNSM.2014.7014140.
Marchal, S., State, R. and Engel, T. (2015) ‘PhishStorm: Detecting Phishing with Streaming
Analytics’.
Martin, D. (2011) ‘Evaluation: from Precision, Recall and F-measure to ROC, Informedness,
Markedness and Correlation’. Available at: <http://hdl.handle.net/2328/27165> [Accessed: 1
May 2018].
McGrath, D. K. and Gupta, M. (2008) ‘Behind phishing: an examination of phisher modi
operandi’, Usenix Workshop on Large-Scale Exploits and Emergent Threats (LEET), p. 4.
Available at: http://portal.acm.org/citation.cfm?id=1387713.
Milletary, J. (2005) ‘Technical Trends in Phishing Attacks’, Technical Trends in Phishing,
pp. 1–17. Available at:
<https://resources.sei.cmu.edu/asset_files/WhitePaper/2005_019_001_50315.pdf> [Accessed:
1 May 2018].
Nepali, R. K., Wang, Y. and Alshboul, Y. (2015) ‘Detecting malicious short URLs on
Twitter’, Americas Conference on Information Systems, pp. 1–7.
Ollmann, G. (2004) ‘Second Order Code Injection Attacks’, pp. 1–11.
Oxford Dictionary (1930) Definition of overfitting in English. Available at:
<https://en.oxforddictionaries.com/definition/overfitting> [Accessed: 1 May 2018].
Pan, J. and Mao, X. (2016) ‘DomXssMicro: A micro Benchmark for evaluating DOM-based
cross-site scripting detection’, Proceedings - 15th IEEE International Conference on Trust,
Security and Privacy in Computing and Communications, 10th IEEE International
Conference on Big Data Science and Engineering and 14th IEEE International Symposium
on Parallel and Distributed Proce, pp. 208–215. doi: 10.1109/TrustCom.2016.0065.
Pao, H. K., Chou, Y. L. and Lee, Y. J. (2012) ‘Malicious URL detection based on
Kolmogorov complexity estimation’, Proceedings - 2012 IEEE/WIC/ACM International
Conference on Web Intelligence, WI 2012, pp. 380–387. doi: 10.1109/WI-IAT.2012.258.
Patil, D. R. and Patil, J. B. (2016) ‘Malicious Web Pages Detection Using Static
Analysis of URLs’, International Journal of Information Security and Cybercrime, 5(2), pp.
57–70. doi: 10.19107/IJISC.2016.02.06.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M. and Duchesnay, É. (2012) ‘Scikit-learn: Machine Learning in Python’, Journal
of Machine Learning Research, 12, pp. 2825–2830.
PhishTank (2017) Join the fight against phishing. Available at: <https://www.phishtank.com>
[Accessed: 1 May 2018].
Platt, J. (2013) ‘Probabilistic Outputs for Support Vector Machines and Comparisons to
Regularized Likelihood Methods’, (June 2000).
Provos, N., Mcnamee, D., Mavrommatis, P., Wang, K. and Modadugu, N. (2007) ‘The Ghost
In The Browser: Analysis of Web-based Malware’, Proceedings of the First Workshop on
Hot Topics in Understanding Botnets, 462, p. 4.
Ross, D. A., Lim, J., Lin, R.-S. and Yang, M.-H. (2008) ‘Incremental Learning for Robust
Visual Tracking’, International Journal of Computer Vision, 77(1–3), pp. 125–141. doi:
10.1007/s11263-007-0075-7.
Sabhnani, M., Serpen, G. and More, K. K. (2003) ‘Application of Machine Learning
Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context’,
Proceedings of International Conference on Machine Learning: Models, Technologies, and
Applications (MLMTA), pp. 209–215.
Sahoo, D., Liu, C. and Hoi, S. C. H. (2017) ‘Malicious URL Detection using Machine
Learning: A Survey’, pp. 1–21. Available at: <http://arxiv.org/abs/1701.07179> [Accessed: 1
May 2018].
Saunders, M., Lewis, P. and Thornhill, A. (2009) Understanding research philosophies and
approaches.
Schulz, M.-A., Schmalbach, B., Brugger, P. and Witt, K. (2012) ‘Analysing Humanly
Generated Random Number Sequences: A Pattern-Based Approach’. doi:
10.1371/journal.pone.0041531.
Scikit-learn (2018) 1.6. Nearest Neighbors. Available at:
<http://scikit-learn.org/stable/modules/neighbors.html> [Accessed: 18 July 2018].
Seifert, C., Welch, I. and Komisarczuk, P. (2008) ‘Identification of malicious web pages with
static heuristics’, Proceedings of the 2008 Australasian Telecommunication Networks and
Applications Conference, ATNAC 2008, pp. 91–96. doi: 10.1109/ATNAC.2008.4783302.
Shekokar, N. M., Shah, C., Mahajan, M. and Rachh, S. (2015) ‘An ideal approach for
detection and prevention of phishing attacks’, Procedia Computer Science. Elsevier Masson
SAS, 49(1), pp. 82–91. doi: 10.1016/j.procs.2015.04.230.
Sheng, S., Holbrook, M., Kumaraguru, P., Cranor, L. F. and Downs, J. (2010) ‘Who falls for
phish? A Demographic Analysis of Phishing Susceptibility and Effectiveness of
Interventions’, Proceedings of the 28th international conference on Human factors in
computing systems - CHI ’10, pp. 373–382. doi: 10.1145/1753326.1753383.
Slater, S., Joksimovic, S., Kovanovic, V., Baker, R. S. and Gasevic, D. (2016) ‘EDM: Tools
for educational data mining’, Journal of Educational and Behavioral Statistics, 42(1), pp.
85–106. doi: 10.3102/1076998616666808.
Sorio, E., Bartoli, A. and Medvet, E. (2013) ‘Detection of hidden fraudulent URLs within
trusted sites using lexical features’, Proceedings - 2013 International Conference on
Availability, Reliability and Security, ARES 2013, pp. 242–247. doi: 10.1109/ARES.2013.31.
Spirin, N. and Han, J. (2011) ‘Survey on Web Spam Detection: Principles and Algorithms’,
SIGKDD Explorations Newsletter, 13(2), pp. 50–64. doi: 10.1145/2207243.2207252.
Stackoverflow (2018) Developer Survey Results 2018. Available at:
<https://insights.stackoverflow.com/survey/2018> [Accessed: 1 August 2018].
Stehman, S. (1997) ‘Selecting and interpreting measures of thematic classification accuracy’,
Remote Sensing of Environment, 62(1), pp. 77–89. doi: 10.1016/S0034-4257(97)00083-7.
Stol, K.-J., Ralph, P. and Fitzgerald, B. (2015) ‘Grounded Theory in Software Engineering
Research: A Critical Review and Guidelines’, Proceedings of the 37th International
Conference on Software Engineering (ICSE).
Table 2. NB

#  Alg/Paper              Acc. Rate  Precision  Recall  FPR    FNR    TPR
1  (Canali et al., 2011)  85         –          –       44.1   16.4   –
Table 3. KNN
Table 4. LR

#  Alg/Paper                       Acc. Rate  Precision  Recall  FPR    FNR    TPR
1  (Garera et al., 2007)           97.3       –          –       0.7    12     88
2  (Ma, L. K. Saul, et al., 2009)  99         –          –       0.1    7.6    –
3  (Canali et al., 2011)           85         –          –       17.1   25.6   –
4  (Xu et al., 2013)               90.55      –          –       5.69   22.99  –
5  (Wang et al., 2013)             56.43      –          –       52.8   65.7   –

Average accuracy: 85.66
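As a quick arithmetic check, the average accuracy reported for Table 4 can be reproduced from the five per-paper accuracy rates listed above (a minimal sketch; the variable name is illustrative):

```python
# Accuracy rates (%) reported for logistic regression in Table 4
accuracies = [97.3, 99, 85, 90.55, 56.43]

# Mean of the five values, rounded to two decimal places
average = sum(accuracies) / len(accuracies)
print(round(average, 2))  # 85.66, matching the table's reported average
```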
print("Train set {} samples, where [0]:{} and [1]:{}".format(len(y_train), tr[0], tr[1]))
print("Test set {} samples, where [0]:{} and [1]:{}".format(len(y_test), ts[0], ts[1]))

start_time = time.time()
knn = KNeighborsClassifier(n_neighbors=5)
scores_lg = cross_val_score(estimator=knn, X=X_train, y=y_train, scoring="accuracy", cv=10)
elapsed_time = time.time() - start_time
print("\nAverage accuracy score: {}, Validating time: {} sec".format(scores_lg.mean(), round(elapsed_time, 2)))

# candidate numbers of neighbours: the odd values in 1..19
neighbors = [k for k in range(1, 20) if k % 2 != 0]

# list that will hold the mean cross-validation score for each k
cv_scores = []

# perform 10-fold cross-validation for each candidate k
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# changing to misclassification error and determining the best k
MSE = [1 - x for x in cv_scores]
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)

# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.grid()
plt.show()

# re-create the classifier with the optimal k before final training
knn = KNeighborsClassifier(n_neighbors=optimal_k)
start_time = time.time()
knn.fit(X_train, y_train)  # training
training_time = time.time() - start_time

start_time = time.time()
y_pred_dt = knn.predict(X_test)  # predicting
predicting_time = time.time() - start_time

train_score_dt = knn.score(X_train, y_train)
model_rf_acc = accuracy_score(y_test, y_pred_dt)

print(classification_report(y_test, y_pred_dt))
print("-" * 30)
print("Predicting time: {} sec; Training time: {}; Accuracy score: {:.4f}".format(
    round(predicting_time), round(training_time), knn.score(X_test, y_test)))

labels = ['Benign', 'Malicious']
conf_matrix = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(7, 6))
sb.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
Notice: the source code can also be downloaded from the following URL:
<https://github.com/shokan/MaliciousURL>
Appendix C. Extracted Lexical-based Features
Notice: the raw data and preprocessed dataset can be downloaded from the following URL:
<https://www.dropbox.com/s/f7c8ijhhp4joaig/dataset.zip?dl=0>