Project Shokan
List of Figures.....................................................................................................................iv
List of Tables.......................................................................................................................v
Abstract.............................................................................................................................vi
Acknowledgements..........................................................................................................vii
Introduction.......................................................................................................................1
1.1. Background....................................................................................................................1
1.2. Research Aims and Objectives........................................................................................2
1.3. Contribution...................................................................................................................3
1.4. Risks...............................................................................................................................3
1.4.1. Failure to get data..........................................................................................................................4
1.4.2. Hardware defects or failure............................................................................................................4
1.4.3. Low quality of prediction................................................................................................................4
1.5. Dissertation structure.....................................................................................................5
Literature Review...............................................................................................................6
2.1. Malicious URL categorisation..........................................................................................6
2.1.1. Phishing URLs.................................................................................................................................7
2.1.2. Malware URLs.................................................................................................................................8
2.1.3. Spamming URLs..............................................................................................................................8
2.2. Attacks...........................................................................................................................9
2.2.1. Phishing attacks and URL obfuscation techniques..........................................................................9
2.2.2. Injection attacks............................................................................................................................11
2.2.3. Drive-by download attack.............................................................................................................13
2.2.4. Spamming attacks........................................................................................................................13
2.3. Valuable lexical-based features.....................................................................................14
2.4. Detecting malicious URLs..............................................................................................17
2.4.1. Machine Learning approach.........................................................................................................17
2.4.2. Alternative approaches.................................................................................................................26
2.5. Summary of the literature review.................................................................................27
Methodology....................................................................................................................29
3.1. Theoretical approach....................................................................................................29
3.1.1. Research Philosophy.....................................................................................................................30
3.1.2. Theory development....................................................................................................................30
3.1.3. Methodological development......................................................................................................30
3.1.4. Strategy........................................................................................................................................30
3.1.5. Time horizon.................................................................................................................................31
3.1.6. Techniques and procedures..........................................................................................................31
3.2. Practical approach........................................................................................................31
3.2.1. Experimental environment.......................................................................................31
3.2.2. Data..........................................................................................................................33
3.2.3. Model validation and optimisation...........................................................................36
3.2.4. Evaluation metrics........................................................................................................................37
3.3. Summary of the methodology......................................................................................................39
Results..............................................................................................................................41
4.1. Describing data.............................................................................................................41
4.2. Model description.........................................................................................................44
4.3. Results..........................................................................................................................45
4.4. Comparison..................................................................................................................................46
List of Figures
Figure 1. The paper’s structure.................................................................................................10
Figure 2. Example of obfuscation with JavaScript (Chiew, Yong and Tan, 2018)..................16
Figure 3. The generic URL syntax (Berners-Lee, 2005)..........................................................20
Figure 4. Sources for collecting raw data.................................................................................23
Figure 5. SVM classification....................................................................................................27
Figure 6. KNN classification....................................................................................................28
Figure 7. Layers of Onion Framework.....................................................................................33
Figure 8. Technical characteristics of the experimental host environment..............................36
Figure 9. Number and percentage of benign and malicious URLs...........................................38
Figure 10. Visual representation of data splitting and validating processes (Nelson, 2018)....40
Figure 11. Visual representation of K-Folds cross validation method (Nelson, 2018)............41
Figure 12. Structure of confusion matrix for binary classifier.................................................41
Figure 13. Distribution of classes in the testing and training subsets.......................................44
Figure 14. Spearman's correlation matrix.................................................................................45
Figure 15. Misclassification error rate vs number of neighbours.............................................47
Figure 16. Confusion matrix.....................................................................................................48
List of Tables
Table 1. Timetable of the project objectives...............................................................................9
Table 2. URL obfuscation examples.........................................................................................17
Table 3. Injection URL examples.............................................................................................19
Table 4. Spamming URL examples..........................................................................................21
Table 5. References of lexical-based features used by researchers in related studies..............23
Table 6. References of different types of machine learning algorithms used for malicious URL
detection in the last decade......................................................................................................28
Table 7. Decision making table according to MCDM method.................................................33
Table 8. Experimental tools......................................................................................................40
Table 9. List of primary sources for data collection.................................................................41
Table 10. Advanced evaluation metrics....................................................................................46
Table 11. Proportion of training and testing samples...............................................................49
Table 12. Settings of the KNN in Scikit-Learn v0.19.2 library................................................52
Table 13. Values of advanced metrics......................................................................................53
Table 14. Result comparison....................................................................................................54
Abstract
The detection of malicious URLs is one of the highest priority issues for cyber security
practitioners. Despite the large number of studies that have examined different machine
learning techniques to address the issue, the most widely used approach remains blacklisting. The
main obstacle to applying machine learning is the difficulty of data collection.
This paper examines the possibility of identifying malicious URLs through the
analysis of lexical-based features only. For the analysis, an experiment was designed;
beforehand, the known lexical characteristics of malicious URLs were examined based on
previous studies.
The classifier showed a fairly good average accuracy rate of 94%, but it also
showed a poor false positive (FP) rate, which increases the risk of encountering
malicious URLs. Additionally, correlation analysis using Spearman's coefficient showed that
the URL length and the number of special characters are the most discriminative indicators of
malicious URLs.
Key words: malicious URL, machine learning, k-nearest neighbours, lexical-based features
Chapter 1
Introduction
1.1. Background
The internet remains the main vector of attack, where an accidental visit to a malicious
website can trigger a pre-designed criminal activity. Google Inc. (2018a) reported that it
detects thousands of new unsafe web pages daily, many of which are compromised legitimate
websites. This growing threat has increased the demand for security on the internet.
Currently, there are different approaches to the detection of dangerous web pages on the
internet. The blacklisting approach is commonly used by popular online services and antivirus
software (Chen, Huang and Ou, 2015). But, in addition to other shortcomings, the blacklisting
approach is not able to detect targeted attacks and new phishing pages which are not yet
blacklisted.
Recent developments in the fields of machine learning and artificial intelligence have
led to renewed interest in their application to a wide range of cybersecurity issues.
In particular, machine learning has been used to identify malicious web pages. For example, the
Google online service Safe Browsing, in its current version, applies a machine learning
approach to identify suspicious web pages (Wen, 2017). This approach also continues to be
improved by academics, as evidenced by the large number of studies being conducted in this
area.
The available studies in this domain have shown that there are several research vectors
that aim at providing users a safe surfing experience on the internet. Due to practical
constraints, this paper cannot provide a comprehensive review of all of them. Hence, the
scope of this research is limited to the machine learning approach only. Specifically, this
paper examines how the approach performs in malicious Uniform Resource Locator (URL)
detection when only lexical-based features are analysed.
By lexical-based features, this paper refers to predictors that are extracted from
statistical properties of the URL string. For example, the features can include the length of the
URL, the length of the hostname or the top-level domain name. Moreover, the presence of certain
keywords or special characters in a URL can also be a lexical-based feature; hence, in some
literature these features are called Bag-of-Words features.
Despite the large number of studies in the field of malicious URL detection, a number of
problems and practical issues remain open to this day. The main concern is the massiveness of
the data: there are more than 30 trillion unique URLs on the internet (Sullivan, 2012; Lin et
al., 2013), and processing such a huge amount of data remains problematic.
The second concern is the difficulty of feature collection. The choice of an
appropriate set of features is very important for the quality of the classifier's performance.
However, it was found that previous studies mainly applied features such as host-based and
page content-based features. Collecting these features is time-consuming; for example, it can
take a few seconds to obtain the value of some host-based features. Given the above-
mentioned massiveness of the data, the collection of these features is an infeasible task.
Moreover, as was noted by McGrath and Gupta (2008), the majority of malicious URLs have
the property of being available for only a very short period of time. Hence, it is necessary to
find an easy and efficient collection method.
Sahoo, Liu and Hoi (2017) underlined in their survey that the most accessible features are
the lexical-based features of URLs. Additionally, very few studies (Le, Markopoulou and
Faloutsos, 2011; Sorio, Bartoli and Medvet, 2013) have investigated the impact of lexical
features on detecting malicious web pages without mixing them with other, hard-to-collect
features. Consequently, the main research question of this paper is: can a machine learning
approach focusing on only lexical-based features of URLs improve on the current state of the
art?
This study examines the effectiveness of a machine learning algorithm that uses only lexical-
based features in detecting malicious web pages. In order to answer the main research
question within the framework of this project, it is necessary to achieve seven objectives,
which are given in Table 1.
3. Identify valuable lexical-based features  |  01 Mar 2018 / 17 Jun 2018  |  Valuable lexical-based features according to URL types; valuable lexical-based features according to the recommendations of previous studies
4. Identify the state-of-the-art machine learning approach  |  01 Mar 2018 / 17 Jun 2018  |  Top related studies of the last 10 years and their results
5. Choose the most appropriate machine learning algorithm  |  01 Mar 2018 / 17 Jun 2018  |  Machine learning algorithm
6. Collect a dataset of malicious and benign URLs  |  01 Jun 2018 / 27 Jun 2018  |  Primary data or appropriate secondary data
1.3. Contribution
The study makes a major contribution to the malicious URL detection domain by
demonstrating the effectiveness of lexical-based features in malicious URL detection. For
example, taking into account the ease of obtaining these features, it would be possible to create a
system that detects malicious URLs in real time without using a blacklist.
Even in the case of poor results from the learning algorithm, the outcome can still be
considered a contribution, because further studies would receive additional evidence about
the ineffectiveness of lexical-based features for malicious URL detection.
1.4. Risks
It is important to identify the possible risks associated with the research project and mitigate
them to ensure successful completion. Therefore, the following risks were considered in this
paper due to their high or medium likelihood.
1.4.1. Failure to get data
The first risk is related to the existence of an appropriate dataset for conducting the
experiment, where the dataset can be reliably labelled into two classes, malicious and benign
URLs. At the same time, the number of these classes should be balanced.
Mitigating the risk related to the dataset is the most difficult task and requires significant
technical and administrative effort. Because of this, the search for the dataset started before
the project began.
The first preference was given to publicly available datasets that have the most positive
reputation among scholars. At the same time, a dataset was requested from a few authors
of large-scale studies, such as Ma et al. (2010) and Vanhoenshoven et al. (2016). Also, in
order to have a backup plan, it was decided to collect malicious and benign URLs with the help of
custom Python code that parses particular websites. More information about the
obtained dataset can be found in the methodology part of this paper.
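A minimal sketch of that backup collection plan is shown below. The regex and the sample page source are simplified illustrations, not the actual collection code; the page source would normally be fetched first, e.g. with urllib.request:

```python
import re

# Simplified pattern; real URL grammar (RFC 3986) is considerably richer
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(page_source):
    """Return all candidate URLs found in a page's HTML source."""
    return URL_PATTERN.findall(page_source)

sample = '<a href="http://bad.example.com/login">link</a> and https://ok.example.org/'
urls = extract_urls(sample)  # two URLs found in this sample
```

Each URL harvested this way would then be labelled according to the source it was scraped from (a blacklist feed yields 'malicious' candidates, a whitelist yields 'benign' ones).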
1.4.2. Hardware defects or failure
The second risk refers to an unexpected failure of the computer on which the experiment is
conducted. A hardware fault could also cause a significant shift in the project's timetable.
To mitigate this risk, it was decided to store the developed code on the online version control
system GitHub. Additionally, the datasets and valuable configuration files are stored on
online file hosting services. Lastly, the computer was periodically backed up (Apple Inc.,
2017) to allow the researcher to restore it from a snapshot if the experiment environment
develops defects.
1.4.3. Low quality of prediction
The last, but not least, risk associated with the project is the quality of the classification. Despite
the appropriateness of the selected machine learning algorithm and the presented features, there is
always a risk to classification accuracy known as overfitting.
Overfitting is the result 'of an analysis which corresponds too closely or exactly to a
particular set of data, and may therefore fail to fit additional data or predict future
observations reliably' (Oxford Dictionary, 1930). The consequence of overfitting is poor
classifier performance on a new dataset.
Firstly, the basic rule of this study is to perform the experiment several times in order to
ensure the consistency of the results. Secondly, the cross-validation technique is used to
evaluate a classifier against overfitting. This technique is explained in the methodology part
of the paper.
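A minimal sketch of that technique, using the scikit-learn KNN classifier employed in this project, is given below. The feature matrix here is synthetic stand-in data, not the real dataset described in the methodology chapter:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lexical feature matrix: two well-separated classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(40, 5, (50, 2)),   # benign-like samples
               rng.normal(80, 5, (50, 2))])  # malicious-like samples
y = np.array([0] * 50 + [1] * 50)            # 0 = benign, 1 = malicious

# 5-fold cross-validation: each fold is held out once for evaluation,
# which guards the accuracy estimate against overfitting to a single split
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
mean_accuracy = scores.mean()
```

Because every sample serves in a held-out fold exactly once, the mean of the five scores is a less optimistic, more stable estimate than a single train/test split.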
1.5. Dissertation structure
The overall structure of the study takes the form of six chapters, including this introductory
chapter: (1) the Introduction gives a broad view of the general research area and underlines
the research question; (2) Chapter Two, the literature review, begins by laying out the
theoretical dimensions of the research and looks at how the experiment should be
conducted; (3) the third chapter is concerned with the methodology used for this research
project; (4) the fourth chapter presents the results and main findings of the experiment, tying
up the various theoretical and empirical strands in order to answer the main question;
(5) Chapter Five discusses the results, critically evaluating the findings and examining the
limitations of the study; and (6) the Conclusion gives a brief summary and critique of the
findings, together with recommendations for future research. A graphical overview of the
detailed structure of the project is given in Figure 1.
CHAPTER 2
Literature Review
The detection of malicious URLs is an emerging issue in academia. It was found that more
recent attention has focused on the application of machine learning algorithms to tackle this
problem. Hence, the large and growing body of literature in both the fields of machine
learning and cybersecurity was investigated during this project.
What is known about the application of machine learning is largely based upon
empirical studies that investigate the performance of different classifier algorithms in an
experimental environment. However, there have been no controlled studies which
attempt to find a practical approach able to cope with the significant amount of data
in a real-world environment. Moreover, much uncertainty still exists about the ability of
machine learning algorithms to detect malicious URLs by analysing only the static properties
of URL strings.
In order to answer the main question of the research, it is first necessary
to answer a chain of sub-questions. With this background in mind, the literature review
attempts to answer the following research sub-questions: (1) How can malicious URLs be
categorised? (2) What kinds of attacks are conducted via URLs? (3) What valuable
lexical-based features can be extracted from a URL? (4) What is the state-of-the-art machine
learning approach for combating malicious URLs? (5) Which machine learning algorithm can
be effective for building a classifier? These five questions allow us to achieve the first five
research objectives by providing a conceptual theoretical framework based on the literature.
The purpose of this chapter is to review the literature to find an answer to the above-
mentioned questions by exploring primary and secondary sources. To achieve this, the
chapter is divided into four sections: (1) Malicious URL categorisation; (2) Attacks; (3)
Valuable lexical-based features; and (4) Detecting malicious URLs.
2.1. Malicious URL categorisation
As far as the term 'malicious URL' is concerned, an arguable weakness of the majority of
studies is the arbitrariness in the definition of this term. The term 'malicious' is vague; therefore,
it is often necessary to clarify the level of maliciousness for a closer understanding of the
threat. Categorising URLs would therefore provide a better understanding of the
characteristics of the existing types of malicious URLs, and would act as a stepping stone in
the experimental part towards the development of a holistic machine learning classifier.
A number of studies investigating malicious URL detection have been carried out by
scholars in the last decade. However, the majority of them did not define the term 'malicious
URL'. During their experiments, they collected phishing and spamming URLs and marked
them under a single label, 'malicious'. Conversely, Dua and Du (2015) reported that
malicious activities (spamming or phishing) have different properties and that their
identification should be handled differently.
The categorisation and separate detection of malicious URLs was first demonstrated
experimentally by Choi, Zhu and Lee (2011). In their systematic study, malicious URLs were
detected in two stages: (1) a machine learning binary classifier divided samples into benign
and malicious; (2) the malicious URLs were assigned three types of labels: phishing,
malware and spamming. The scholars noted that a URL can belong to different
categories at the same time (e.g. a URL can be both spamming and malware).
A similar perspective has been adopted by Ma et al. (2010) and Sahoo, Liu and Hoi
(2017), who argue that malicious URLs should be categorised according to the content of the
web page to which they refer. These authors applied the same three types of malicious
URLs: (1) phishing, (2) malware and (3) spamming. Therefore, it was decided to accept these
three types of malicious URL, which have been commonly mentioned by different scholars.
Below, a closer look at each of these types is presented.
2.1.1. Phishing URLs
Phishing attacks are a social engineering technique that aims to lure users into providing
confidential information by clicking on a link that looks legitimate. The term 'phishing'
came into active use in the mid-1990s in the telecommunication sector, when the
acquisition of internet service provider account information was a common cybercrime
(Zulfikar, 2010). Since then, the term has acquired a wide range of applications, and different
types of phishing attack have been invented.
The experimental studies on the effectiveness of visually identifying phishing web
pages are rather contradictory. For instance, Kumaraguru et al. (2008) and Sheng et al. (2010)
examined the ability of people to visually identify phishing web pages after training courses
and came to the conclusion that user training courses are highly effective. Conversely,
Alsharnouby, Alaca and Chiasson (2015), who also examined behavioural strategies of users,
reported that the majority of internet users failed the test on detecting phishing web resources
even after being taught to identify them.
Another study, by Aleroud and Zhou (2017), examined the trends in different phishing
attacks. The scholars researched phishing attacks along four dimensions: (1) the
communication media (e.g. social networks) where the attacks are conducted, (2) target
devices, (3) attack techniques and (4) countermeasures. The conclusion of the study was
that the identification of phishing attacks is not a trivial task; accordingly, there is perhaps
no single right approach for identifying them.
2.1.2. Malware URLs
By 'malware', this paper refers to URLs that trigger the downloading of hostile or intrusive
software. Cybercriminals design malware to compromise the integrity, confidentiality and
availability of a user's device. The most common techniques here are cross-site scripting
(XSS) (Chiew, Yong and Tan, 2018) and drive-by download attacks (Choi, Zhu and
Lee, 2011). Whereas phishing attacks rely on users' carelessness, malware URLs are
designed specifically to exploit vulnerabilities of web browsers or web applications
developed on different platforms.
A number of studies, such as Alcaide et al. (2011) and Curtsinger et al. (2011), have
examined different approaches to effectively detect malware activity, but to date none has
achieved sufficient results. There are several reasons for the difficulty of detecting malware
URLs. The main reason is that malware attacks conducted with the help of URLs are usually
developed in different programming languages, such as PHP, ASP or JSP. Hence, it is
necessary to individually develop safety requirements for web applications on the internet
(ibid.).
2.1.3. Spamming URLs
Spamming is the sending of unsolicited content for advertising purposes, and it occurs in
significant volumes (Choi, Zhu and Lee, 2011). In other words, spamming URLs intend to
promote commercial or non-commercial content. Obviously, spamming web pages themselves
are detrimental to the quality of online content and the user experience.
Spamming URLs are usually not physically harmful to a user's device.
However, this paper defines them as malicious, since they are often used for the distribution
of fake news, the fight against which has become one of the priority tasks for states.
Intentionally misleading readers currently has dangerous outcomes for society, and mainly
social networks are used for spreading fake news (Krombholz, Merkl and Weippl, 2012). In
addition, spamming URLs are often used to distribute obscene content (Gao et al., 2010) that
can damage the vulnerable minds of children. In the following section, different techniques
for distributing spamming URLs are revealed.
Garcia-Molina and Gyongyi (2005) and Jelodar et al. (2017) reviewed current
spamming techniques and applied a machine learning approach to detect spamming URLs.
The scholars attempted to detect spamming emails by analysing their lexical-based and
host-based features. The studies show that spamming servers (or spamming farms) usually have a
short lifespan. Other features, such as the structure, content and geography of spamming
URLs, do not give enough clues to distinguish spamming from non-spamming content
automatically.
2.2. Attacks
In contrast, typo-squatting attacks use visually similar domain names (Milletary, 2005;
Jelodar et al., 2017; Chiew, Yong and Tan, 2018). This technique is the most
commonly conducted phishing attack. Phishing domains are selected according to
common typing errors: the attacker relies both on grammatical mistakes and on the user
mistyping the website address by accidentally pressing an adjacent key or missing a character.
The mistyped website address can lead a user to a phishing website that may look like the
legitimate one (Jelodar et al., 2017). Examples of these and other phishing URLs are
presented in Table 2.
#  Example  |  Comment
1  https://heir-fresh.com  |  Sounds like air-fresh.com
2  https://high5.com  |  Text changed to a number, mimicking the URL www.highfive.com
4  https://wwwmybank2us.com  |  Missing-dot typo
5  https://mybankus.com  |  Character omission typo
6  https://mybank2su.com  |  Character permutation typo
7  https://mybanl2us.com  |  Character replacement typo
8  https://mybank2uss.com  |  Character insertion typo
9  https://legitimatesite.legit.com  |  Obfuscation of a legitimate website (legit.legitimatesite.com) by interchanging the domain and subdomain names
10  https://legit.legitimatesite.com.my  |  Obfuscation using a country-code top-level domain
11  http://legit.Iegitimatesite.com.my  |  Obfuscation using character substitution (capital 'I' for lowercase 'l')
12  https://legit.legitimatesite.anothersite.com  |  Obfuscation using part of the legitimate URL as a subdomain
13  https://%68%74%74%70%3a%2f%2f%77%77%77%2e%65%78%61%6d%70%6c%65%2e%63%6f%6d  |  Hexadecimal encoding of the ASCII-text domain name, e.g. http://www.example.com
14  https://192.168.1.1  |  Dotted quad notation
15  https://0xc0a80101  |  Hexadecimal format
16  https://bit.ly/2KnuCGI  |  Shortened URL of a malware website
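Several of the typo categories in Table 2 (omission, permutation and replacement) can be generated programmatically, which is how attackers enumerate candidate domains at scale. A minimal sketch, using a hypothetical domain:

```python
import string

def typo_variants(domain):
    """Generate candidate typo-squatting domains for a second-level domain."""
    name, _, tld = domain.rpartition(".")
    variants = set()
    for i in range(len(name)):
        variants.add(name[:i] + name[i + 1:] + "." + tld)      # character omission
        if i < len(name) - 1:                                  # character permutation
            variants.add(name[:i] + name[i + 1] + name[i] + name[i + 2:] + "." + tld)
        for c in string.ascii_lowercase:                       # character replacement
            variants.add(name[:i] + c + name[i + 1:] + "." + tld)
    variants.discard(domain)  # replacing a letter with itself recreates the original
    return variants

candidates = typo_variants("mybank.com")
```

Defenders use the same enumeration in reverse, registering or monitoring the generated variants of their own domains.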
B. Clickable image
However, not all phishing attacks rely on similarity to the URLs of legitimate web
resources. Another URL obfuscation technique is to use a clickable image instead of text.
Usually, this is used in emails that contain a single image in JPEG format. The image appears
to be a legitimate email from an online bank or shop, usually including official logos. As
a result, users are directed to a phishing web page when they click on this image (Milletary,
2005). This is a commonly used technique that is technically simple and highly effective
(ibid.).
Moreover, it is possible to replace the phishing URL string with an image of a legitimate
URL by using JavaScript. It is also common to add a security icon, which gives the user a
false sense of security (Anthony, 2007). The script controls the chrome part of the browser,
which contains the address bar and the status line (Milletary, 2005). An example of this type of
obfuscation is shown in Figure 2.
Figure 2. Example of obfuscation with JavaScript (Chiew, Yong and Tan, 2018)
C. Alternative encoding
Chiew, Yong and Tan (2018) revealed in detail the more advanced URL obfuscation methods
that are also actively used by phishers. For example, using an alternative encoding is another
obfuscation technique that makes a URL unrecognisable. IP addresses can be specified as
hexadecimal numbers, and alphanumeric characters can be changed to their hexadecimal
representations. Regardless of the encoding in which a URL is presented, web browsers
usually interpret most of these representations correctly. These and other methods
are presented in Table 2.
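The hexadecimal (percent-encoded) representation from Table 2 can be decoded with Python's standard urllib.parse module; a minimal sketch:

```python
from urllib.parse import unquote

# Row 13 of Table 2: each %XX pair is the hexadecimal code of an ASCII character
encoded = "%77%77%77%2e%65%78%61%6d%70%6c%65%2e%63%6f%6d"
print(unquote(encoded))  # www.example.com
```

This is the same normalisation a browser performs before resolving the address, which is why such URLs still work despite being unreadable to the user.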
2.2.2. Injection attacks
There are a number of attacks performed by injecting software code as an illicit way to
perform malicious actions. The possibility of these attacks arises when web resources have a
weak input validation policy. Acunetix (2017) categorised these attacks into nine types: 1)
code injection; 2) CRLF injection; 3) cross-site scripting; 4) email injection; 5) host header
injection; 6) LDAP injection; 7) OS command injection; 8) SQL injection; 9) XPath injection.
There are several ways to reproduce these attacks. Below, the paper gives an overview of the
different types of injection attacks, paying attention to the lexical properties of their URLs.
Examples of such custom URLs are shown in Table 3.
A. Cross-site scripting
Firstly, cross-site scripting (XSS) is perhaps the most common attack in which the attack vector
uses compromised web resources (Ollmann, 2004). This type of attack is performed by
sending a victim a link containing JavaScript or Flash code, which the browser executes
automatically. However, in some cases, attacks are also generated based on
particular browser vulnerabilities.
A study by Vogt, Nentwich and Jovanovic (2006) explored XSS attacks from different
perspectives, such as the attack vector, solutions and the appearance of the attacking script.
The study noted that the URL usually contains particular JavaScript commands such as
escape(document.cookie), alert('error'), and GetParameter('eid').
These and other key commands are useful for building a word dictionary to be used by the
machine learning classifier.
B. SQL injection
Secondly, SQL injection (SQLIA) is also a dangerous and common attack that can be
conducted with the help of URLs. It occurs when an attacker attempts to change the logic,
semantics or syntax of a legitimate SQL statement by inserting new SQL keywords or
operators into the statement (Halfond and Orso, 2005, p. 3). Such injections are usually
designed manually for particular web resources.
A survey by Halfond, Viegas and Orso (2006) reviewed commonly known SQLIA
techniques. According to the study, such URLs contain keywords used in SQL requests, such
as GROUP BY, DROP, or UPDATE. Also, characters such as ';', single quotes and double
quotes are common in SQL injection techniques (Tsai and Yu, 2009). A semicolon
allows the system to execute several consecutive SQL statements.
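The keyword dictionaries discussed above for XSS and SQL injection can be turned into simple binary lexical features. A minimal sketch, with an illustrative (not exhaustive) token list:

```python
# Illustrative dictionary of injection-related tokens; a real feature set
# would be derived from the literature cited above.
SUSPICIOUS_TOKENS = ["alert(", "document.cookie", "escape(",
                     "group by", "drop", "update", ";", "'"]

def injection_features(url):
    """Return one binary feature per suspicious token found in the URL."""
    lowered = url.lower()
    return {token: token in lowered for token in SUSPICIOUS_TOKENS}

feats = injection_features("http://site.com/page?id=1;DROP TABLE users")
# feats["drop"] and feats[";"] are True; feats["alert("] is False
```

Each dictionary entry becomes one column of the feature vector consumed by the classifier.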
2.2.3. Drive-by download attack
A drive-by download attack involves the unintended download of malware without the
user's knowledge. This attack can be triggered not only with the help of malware URLs;
the process can also start after viewing e-mail messages or clicking on a pop-up window
(Le et al., 2013). The attack is usually conducted with malware designed to exploit the
vulnerabilities of a web browser.
Detecting this attack by analysing lexical-based features can be an infeasible task for a
few reasons. Firstly, the attack is mostly conducted from a compromised legitimate website,
or from a legitimate website unknowingly distributing the attacker's content through a third-party
service such as online advertising (Provos et al., 2007). Therefore, these websites have
ordinary URLs that cannot be distinguished from benign ones. Secondly, malicious code can be
inserted into the web page content; the attack is triggered only once the compromised web
page has been downloaded.
2.2.4. Spamming attacks
The amount of web spam has been increasing significantly and has led to a degradation of
search results and users' experience on the internet. There are a number of techniques for
spreading spam URLs on the internet. The main attack vectors remain search
engines, emails and social networks (Garcia-Molina and Gyongyi, 2005; Jelodar et al., 2017).
#  Technique         Example
1  anchor text spam  <a href="page.html">sales, black friday, 90% discount, London</a>
2  url spam          london-90%-discount-black-fraday.camerasx.com
To sum up, this section explored the different attacks that are conducted with the help of URLs:
(1) phishing, (2) injection, (3) cross-site scripting, (4) drive-by download and (5)
spamming attacks. While considering each of these attacks, the lexical properties of the URL
string were taken into account. This knowledge is then used in the next stage of the study,
during feature representation.
It was found that in a few cases, detecting malicious URLs by looking at their lexical
properties is impossible: first, when phishers use images to obfuscate malicious URLs;
second, when malicious URLs are inserted into the HTML tags of compromised websites.
As was pointed out previously, feature representation is an important part of the workflow of
a data scientist, and correctly selected features give a classifier a high accuracy rate. This
section is dedicated to establishing the most valuable features, based on knowledge from
previous studies.
Before starting, it is necessary to define some terms. Firstly, to establish a common
understanding of the different parts of URLs, Figure 3 (adapted from RFC 3986,
Section 3) presents the names of these parts. Secondly, by 'feature extraction', the paper
refers to the process of deriving variables from the lexical, static properties of the URL.
These properties include bag-of-words (BoW), character counts, and n-grams.
http://domain.com:8042/over/there?name=ferret#nose
\__/   \_____________/\_________/ \_________/ \__/
scheme    authority      path        query   fragment
Analysis of the related studies showed the application of several groups of features that can be
conditionally divided into three main categories: BoW, n-grams and special
characters. These categories can also be broken down into subcategories, which are
presented below in this paper.
As far as n-gram features are concerned, these features can have low effectiveness in
the URL detection task. The method is usually applied to detect similarity between words in
the presence of multilingual data (Damashek, 1995). It works by converting text into tokens
of character size 'n' using a window of adjacent characters (Kolari, Finin and Joshi, 2006).
However, the majority of URLs are written in English, although the number of
domain names in other languages is rising. To avoid unnecessary noise caused by features
with low effectiveness, these features were excluded from consideration.
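For illustration only (these features were excluded from the experiment), character n-grams of a URL token can be derived by sliding a window of n adjacent characters:

```python
def char_ngrams(text, n=3):
    """Slide a window of n adjacent characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams("example", 3)  # ['exa', 'xam', 'amp', 'mpl', 'ple']
```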
On the other hand, particular features, such as (1) the length of the path, (2) the length of the
query part, and (3) the length of the fragment part, have appeared to be effective in previous
experiments. For example, after an analysis of the F-score measure of URL features, it was
found that these three features have high weights, which indicates a higher
potential for splitting benign and malicious web pages (Eshete, Villafiorita and Weldemariam,
2013).
Overall, 16 previous studies in which lexical-based features were applied were explored.
All these features were categorised and grouped by author in Table 5. The
recommended features were also applied in the current experiment. Additionally, a few features
were added to this list as the result of the analysis of the attacks discussed in the previous
section. Although these additional features are marked as New in the table, it is not claimed
that they were never applied previously; they were simply not found in the literature
explored within this study.
                               ... 'banking', 'confirm', 'secure',
                               'images', 'com', 'www', 'exe',
                               'account', 'swfNode.php', 'pdfNode.php'
4                              Existence of particular words used in an    New *
                               authentication page: 'username', 'password',
                               'urs', 'user', 'pass', 'pwd'
5  Special characters and      Existence of particular special characters  (Kolari, Finin and Joshi, 2006;
   numbers                     and numbers: '/', '.', '?' and '='          Eshete, Villafiorita and
                                                                           Weldemariam, 2013)
6  Static properties (integer) URL length                                  (Choi, Zhu and Lee, 2011;
                               Path length                                 Thomas et al.,
A number of authors, such as Sabhnani, Serpen and More (2003); Tsai and Yu (2009);
Vanhoenshoven et al. (2016); and Dong, Shang and Yu (2017), have reported the application of
machine learning approaches to the malicious URL detection problem with promising
results. Additionally, this paper reviews related state-of-the-art approaches that applied not
only machine learning but also other alternative techniques. After analysing the
published literature, approaches for detecting malicious URLs can be divided into the three
following categories: (1) machine learning approaches; (2) blacklisting approaches; and (3)
heuristic approaches. Below, these approaches are described.
2.4.1. Machine Learning approach
Based on the number of papers published in the last ten years, it seems that academia
increasingly sees the solution to this problem in the machine learning approach. However, it is
necessary to emphasise that despite the huge number of proposed solutions, almost none of
them currently has practical application in industry. There are a number of
trade-offs between computational cost and performance, and between accuracy and speed.
As Sahoo, Liu and Hoi (2017) emphasised, the issue of data collection is the biggest
obstacle for the machine learning approach, since it does not allow its application on a global
scale. This is because not all features, such as content-based and host-based features, can be
easily collected, due to the cost of collecting them and the significant number of unique URLs
on the internet.
As was previously reported, these heavyweight features provide more
chances of detecting malicious URLs. In this regard, to get a full picture, the
state-of-the-art machine learning approaches below are considered in three dimensions: (1) data
collection sources, (2) applied features and (3) applied machine learning algorithms.
Figure 4. Sources for collecting raw data
Features collected at point (1) are called content-based features. They require a full
download of the web page in order to be collected. Canali et al. (2011) and Eshete, Villafiorita
and Weldemariam (2013) conducted experiments analysing the HTML and JavaScript content
of web pages. These features were created based on the structure of HTML tags and the existence
of particular JavaScript commands or specific ActiveX elements. Additionally, a recent study
by Patil and Patil (2016) extracted lexical features from the content of HTML pages.
Despite the high accuracy that content-based features can ensure, there are two main
disadvantages that should be considered. The first concern is security: to extract these features,
a web page must be fully downloaded, so there is a high probability that malicious code will be
executed before the classifier labels it as malicious. The second is resource consumption:
all the mentioned features require high computational power and processing time. Hence, it is
doubtful that a classifier built with content-based features would be effective on a large scale.
Point (2) in Figure 4 gives more chances to intercept malicious web pages by using
machine learning classifiers. The features that can be extracted at this point are the lexical-based
and host-based features of URLs. Obviously, content-based features are not available at this
point.
Regarding host-based features, this information is usually requested from DNS
servers. WHOIS requests can obtain from DNS information about the domain owner's name,
location, IP address, lifetime, and registration and update dates. It was mentioned in the previous
chapter that malicious URLs tend to change location frequently and live only for a short
period of time. Therefore, host-based features are highly valuable for making an accurate
classification.
There are also other valuable host-based features that can be explicitly obtained from
a web host, for example, connection speed and IP addresses (if the URL contains only an IP
address). Sahoo, Liu and Hoi (2017) pointed out that it is difficult for attackers to change IP
addresses for each new attack. Hence, information about IP addresses can improve the accuracy
of classifiers.
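For instance, the IP address behind a domain can be resolved with a plain DNS lookup via Python's standard socket module; a minimal sketch that also models the availability problem (a failed lookup yields a missing value):

```python
import socket

def host_ip(domain):
    """Resolve a domain to an IPv4 address; None models a missing value."""
    try:
        return socket.gethostbyname(domain)
    except socket.gaierror:
        return None  # DNS unavailable or domain no longer registered

host_ip("localhost")  # typically '127.0.0.1'
```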
However, it should be mentioned that host-based features have obvious disadvantages,
such as availability and speed. According to McGrath and Gupta (2008), DNS servers may be
unavailable during data collection and prediction. Therefore, the training data would contain
missing values that can affect the quality of the classifier. Additionally, the connection speed
to both the DNS server and the web server can periodically decrease, which also affects the
prediction speed. Even under the assumption of a sufficient connection speed, some host-based
information can take several seconds to obtain, which is too long for real-world situations.
These factors make host-based features impractical in real-world environments.
Lexical-based features of URLs are obtained from URL names (strings). In other
words, a classifier learns to distinguish malicious URLs from benign ones according to their
appearance and text structure. Usually, measured features such as URL length, domain name
length, and the count of special characters are extracted. Additionally, binary features such as
the existence of particular characters or words in the given URL are also extracted. These
features are also known as bag-of-words (Vanhoenshoven et al., 2016).
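Bag-of-words features of this kind can be sketched as follows (an illustrative vocabulary, not the one used in the experiment):

```python
import re

# Hypothetical vocabulary; the experiment derives its own from Table 5.
VOCAB = ["secure", "account", "banking", "confirm", "login"]

def bow_features(url):
    """Binary presence indicators for each vocabulary word in the URL."""
    tokens = set(re.split(r"[\W_]+", url.lower()))
    return [1 if word in tokens else 0 for word in VOCAB]

bow_features("http://secure-login.example.com/confirm")
# [1, 0, 0, 1, 1]
```

Each vocabulary word contributes one binary column to the feature vector.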
These features have considerable drawbacks as well. For instance, a classifier built
only on the lexical-based features of URLs can be considered an extension of the blacklisting
approach. One of the drawbacks of URL-based features is that new URL names able to evade
classifiers can be generated algorithmically. However, a number of studies (Yadav et al., 2010;
Schulz et al., 2012) claim that it is possible to recognise algorithmically generated patterns by
analysing their alphanumeric distribution.
Returning to Figure 4, points (3) and (4) refer to IDS and HTTP proxy
servers. These servers are usually installed in a corporate network. The HTTP proxy
additionally allows the system to extract features based on HTTP requests and replies. The
IDS gives a lot of information at the network level. These directions are also
promising: Cisco has been developing a product called Umbrella that identifies
different kinds of intrusion scenarios by analysing IDS logs (Dua and Du, 2015).
Lastly, DNS servers, point (5), appear to be the most suitable place from which data
should be collected (Holz et al., 2008). As was described in Section 1, to hide malware IP
addresses, attackers change domain names every five minutes by registering domain names
with the help of botnets. The only place where such behaviour can be identified is the DNS.
Also, this approach eliminates the problem of losing a connection during data collection,
which was discussed earlier in this chapter.
Table 6. References for different types of machine learning algorithms used for malicious URL
detection in the last decade
Batch: SVM (Nepali, Wang and Alshboul, 2015), (Ma, L. K. Saul, et al., 2009),
(Kolari, Finin and Joshi, 2006), (Pao, Chou and Lee, 2012), (Marchal,
Francois, et al., 2015), (Marchal, State, et al., 2015), (Chu et al., 2013),
(Sorio, Bartoli and Medvet, 2013), (Xu et al., 2013), (Hou et al., 2010),
(Wang et al., 2013), (Bannur, Saul and Savage, 2011), (Huang, Qian and
Wang, 2012), (Ying and Xuhua, 2006), (He et al., 2011)
Batch: Naive Bayes (Canali et al., 2011), (Xu et al., 2013), (Hou et al., 2010), (Cao et al.,
2016), (Aggarwal, Rajadesingan and Kumaraguru, 2012)
Batch: Logistic (Garera et al., 2007), (Ma, L. K. Saul, et al., 2009), (Canali et al., 2011),
regression (Xu et al., 2013), (Wang et al., 2013)
Batch: K-nearest (Choi, Zhu and Lee, 2011; Vanhoenshoven, Napoles, et al., 2016)
neighbours
Online mode (Ma et al., 2010), (Ma, L. Saul, et al., 2009), (Blum, Wardman and
algorithms Warner, 2010)
C. Description of algorithms
1) The Support Vector Machine (SVM) is the most commonly applied learning algorithm for
classification and regression problems in this field. The SVM model is a representation of
examples as points in a multidimensional space, mapped in such a way that the examples of the
individual categories are divided by a clear boundary that is as wide as possible (Cortes
and Vapnik, 1995). In other words, the algorithm finds the maximal margin that separates
two (or more) classes.
Supposing that the extracted features form training vectors $x_i \in \mathbb{R}^p$,
$i = 1, \dots, n$, malicious and benign URLs are classified into two classes, labelled by the
vector $y \in \{1, -1\}^n$. SVM solves the following primal problem:

$$\min_{w, b, \zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i
\quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \geq 1 - \zeta_i, \;\; \zeta_i \geq 0 \qquad (1)$$

Its dual is

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\quad \text{subject to} \quad y^T \alpha = 0, \;\; 0 \leq \alpha_i \leq C, \; i = 1, \dots, n \qquad (2)$$

where $e$ is the vector of all ones, $C > 0$ is the upper bound, and $Q$ is an $n$-by-$n$
positive semidefinite matrix (Yadav, 2010).
To simplify, illustrative points in two-dimensional Cartesian coordinates are presented in
Figure 5. In this example, the positive (1 and 2) and negative (3 and 4) points that are closest
to the opposing class form the support vectors. The centre line of this margin is called the
hyperplane; on this basis a model makes a classification.
The distance between these vectors is a margin that always strives to attain the
maximum value. Maximising the margin is a constrained optimisation problem that can be
formulated as follows (Wenyu and Ya-Xiang, 2006):

$$\min_{x} \; f(x) \quad \text{subject to} \quad g_i(x) \leq 0, \;\; h_j(x) = 0 \qquad (3)$$

where $g_i$ and $h_j$ are the inequality and equality constraints required to be satisfied, and
$f$ is the objective function that needs to be optimised subject to the constraints (Yurkiewicz,
1985). With the help of the constraints, the tolerance of the algorithm with respect to outliers
can be regulated.
This algorithm has a number of advantages that make it attractive for binary and
multi-class classification tasks. One advantage is that the dimensionality of the data does not
limit the accuracy of the classification: a classification can still be effective even if the number
of features exceeds the number of observations.
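In Scikit-Learn, the formulation above corresponds to svm.SVC; a minimal sketch on toy data (not the experimental dataset):

```python
from sklearn import svm

# Toy 2-D feature vectors (e.g. URL length, special-character count)
X = [[10, 0], [12, 1], [80, 9], [95, 12]]
y = [0, 0, 1, 1]  # 0 = benign, 1 = malicious

clf = svm.SVC(kernel="linear", C=1.0)  # C bounds the dual variables
clf.fit(X, y)
clf.predict([[90, 10]])  # labelled malicious (1)
```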
2) K-Nearest Neighbours (KNN) is a classifier that measures the distance from a query point
to its k nearest neighbours and assigns it to the class most common among them. As
shown in Figure 6, the X query point has four negative (-) samples and one positive (+)
sample among its neighbours. As a result, this point is assigned to the negative class, because
it has more negative neighbours than positive ones.
The distances between neighbouring points are measured by the Euclidean metric. Given two
points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$, the distance between them is

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (4)$$
The most important parameter of the algorithm is the number of neighbours. Dua and Du
(2015) argue that the more neighbours a query point has, the more noise the algorithm
receives, and therefore the accuracy of classification is reduced. According to their
recommendations, the value of k should be less than the square root of the total number of
training samples. Also, in binary classification problems, the number of neighbours should be
chosen among odd numbers to avoid tied votes.
The algorithm has advantages as well as disadvantages. Many data scientists choose
this method mainly because it is easy to implement and interpret. However, KNN
classification appears to be time and memory consuming (ibid.).
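The rule of thumb above (k odd and below the square root of the sample count) can be sketched with Scikit-Learn on toy data:

```python
import math
from sklearn.neighbors import KNeighborsClassifier

X = [[10, 0], [12, 1], [14, 0], [80, 9], [95, 12], [88, 10]]
y = [0, 0, 0, 1, 1, 1]

# k: largest odd number not exceeding the square root of the sample count
k = int(math.sqrt(len(X)))
if k % 2 == 0:
    k -= 1
k = max(k, 1)

clf = KNeighborsClassifier(n_neighbors=k)  # Euclidean metric by default
clf.fit(X, y)
clf.predict([[13, 1]])  # nearest neighbours are benign (0)
```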
3) Online learning algorithms have become increasingly popular, which reflects their
practicality in industry. By online machine learning, the paper refers to the machine
learning approach in which data becomes available in sequential order and the weights of
the predictor are updated at each iteration.
Formally speaking, an online learning algorithm addresses a classification problem over a
sequence of time steps. When the model makes a mistake at time step $t$, the algorithm
memorises it and uses it to create the hypothesis for the next time step $t + 1$. This approach
to learning is also called the incremental approach (Ross et al., 2008).
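Scikit-Learn does not implement the Confidence-Weighted algorithm, but the incremental update scheme itself can be sketched with SGDClassifier, whose partial_fit method consumes data in sequential batches (toy data; a stand-in, not the CW algorithm):

```python
from sklearn.linear_model import SGDClassifier

# Online (incremental) learning: weights are updated batch by batch
clf = SGDClassifier()
batches = [([[10, 0], [90, 10]], [0, 1]),   # day 1 of collected URLs
           ([[12, 1], [85, 9]], [0, 1])]    # day 2
for X_t, y_t in batches:
    clf.partial_fit(X_t, y_t, classes=[0, 1])  # update weights at step t
clf.predict([[88, 11]])
```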
One of the largest studies in the field of identifying malicious URLs was conducted by
Ma et al. (2010) using an online machine learning algorithm. Over about 100 days, as part
of the experiment, malicious URLs were collected from the online services of Cisco, Google,
Microsoft, and Yahoo. The scholars compared three algorithms: (1) the Perceptron, (2) the
Passive-Aggressive algorithm and (3) the Confidence-Weighted (CW) algorithm. They justified
the choice of the latter by referring to the advantages of CW that make it well suited to models
with a large number of features. The experiment had significant results; the classifier
showed an accuracy rate of 99%, which made the study the most cited in the field, and the
dataset was used by other scholars for testing alternative machine learning algorithms.
Although the presented approach showed positive results in malicious
URL detection, these findings should be interpreted with caution for two reasons. Firstly,
the paper did not provide a definition of malicious URLs. As was stated earlier, the
malicious intents behind URLs differ and have different feature properties. Accordingly, the
feature vector of one type of malicious URL can be completely ineffective for another and
introduce noise. Secondly, the experiment uses host-based features, which have obvious
drawbacks that make them impractical; as mentioned earlier, in real-world industrial
environments it is impractical to collect host-based features.
D. Algorithm selection
It was decided to select the algorithm by matching its properties with the actual requirements
of the experiment. The approach known as multiple-criteria decision analysis (MCDA),
described by Antunes and Henriques (2016), helped to make this balanced
decision. MCDA is an integrated method that explicitly evaluates multiple conflicting
criteria in a decision-making process. To select the most appropriate algorithm according to
the MCDA approach, it is necessary to complete the following three steps: 1) define the
criteria; 2) prioritise them by assigning a weight to each criterion; 3) present the list of
available options. These steps are explained below in this section.
As far as the criteria for the algorithm are concerned, three main criteria were defined. The
first criterion is (1) the average accuracy rate of the algorithm, which was taken from
previous experiments; a list of the experiments from which these values were obtained can be
found in Appendix A. The second criterion, (2) ubiquity, is measured by the number of
experiments in which a certain algorithm was applied in the last 10 years. The last criterion,
(3) flexibility, relates to how far the algorithm can be tuned in the experimental tool, the
Python library Scikit-Learn. Basically, it is the number of parameters and attributes that
the particular algorithm has in Scikit-Learn v0.19.2.
According to the MCDA decision-making approach, it was also necessary to establish
the available options for the selection. As was mentioned in the literature review, a particular
set of algorithms was applied in almost all experiments during the past 10 years.
But only the top three of them were considered in this research due to practical limitations.
These three algorithms are Logistic Regression (LR), Naïve Bayes (NB), and the
Support Vector Machine (SVM). Table 7 presents the values of these algorithms in the
context of the three criteria mentioned above.
Then, these three criteria were prioritised by assigning them weights from 0 to 1. This
is necessary for determining how important the criteria are to the objective. The result of these
operations is presented in Table 7.
Criterion                                 Weight                Values
Accuracy rate (%)                         0.7     93.92    87.95    95
  average accuracy rate obtained from previous experiments
Ubiquitousness (count)                    0.1     2        15       2
  the number of experiments where an algorithm was applied
Flexibility (count)                       0.2     8        20       0
  the number of parameters and attributes for tuning a model
Scores                                            67.544   67.065   66.700
Finally, to calculate the final score, it was necessary to multiply the values by their weights.
As a result, the analysis shows that the most appropriate algorithm for this experiment is
KNN, with a score of 67.544. Accordingly, the experiment will be conducted with the help of
the KNN algorithm.
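The weighted-sum calculation behind Table 7 can be reproduced as follows (the criterion values for KNN are taken from the table):

```python
weights = {"accuracy": 0.7, "ubiquity": 0.1, "flexibility": 0.2}
knn = {"accuracy": 93.92, "ubiquity": 2, "flexibility": 8}

# MCDA weighted sum: multiply each criterion value by its weight
score = sum(weights[c] * knn[c] for c in weights)
round(score, 3)  # 67.544
```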
approach monitors incoming data, such as URL and cookie data. The distribution component
focuses on objects and operators associated with corrupted values, such as escape() and
encodeURIComponent(). The work describes six such conditions; for a page to be
considered vulnerable, all six components must satisfy their conditions.
Such a flexible approach makes the system capable of detecting malicious URLs that
were not previously on a blacklist. The main advantages of this system are its flexibility and
expandability. But this approach requires persistent human involvement. Additionally, the
system can be developed only for a limited number of common threats and cannot generalise
to all types of (new) attacks. Moreover, such heuristics are easy to bypass using obfuscation
methods.
A more specific version of the heuristic approach analyses the dynamics of the
execution of a web page, as proposed, for example, by Kolbitsch, Livshits and Seifert
(2012) and Eshete, Villafiorita and Weldemariam (2013).
Overall, this chapter helped to achieve five out of seven research objectives, which are (1) to
identify types of malicious URLs; (2) to explore different types of attacks conducted by
URLs; (3) to identify valuable lexical-based features; (4) to identify the state-of-the-art
machine learning approach; and (5) to choose the most appropriate machine learning
algorithm.
The first finding was that all malicious URLs in the explored literature can be
categorised into phishing, spamming and malware URLs. The second finding related to the
attacks conducted with the help of URLs. Together, these two findings gave an understanding
of the basic lexical parameters of malicious URLs. However, it was also found that in a few
cases, lexical-based features are of no help in detecting malicious URLs.
Next, to identify the most valuable lexical-based features, related literature was
reviewed in Section 2.3. It was found that all these features can be grouped into eight
categories. However, after critical evaluation, some features were excluded from further
consideration due to their low effectiveness. Also, based on the knowledge obtained in the
previous sections, a few new features were identified as valuable.
Lastly, to choose the state-of-the-art machine learning approach, 30 related studies
from the last ten years were explored. The results of these experiments were presented
systematically in this section. It was found that mainly three batch-mode machine learning
algorithms (SVM, NB, LR) and one online algorithm (CW) have demonstrated the best
performance. Then, backed by this information, an MCDA analysis was performed to choose
the most appropriate algorithm, which appeared to be KNN.
Chapter 3
Methodology
This chapter is dedicated to describing the methodology used to conduct this experiment.
The aim of the chapter is to establish an experimental configuration baseline and its
subsequent derivation. The chapter is divided into two main sections, covering the theoretical
and practical approaches.
To explain how this study is conducted to achieve the research aim and objectives, the
Onion framework was applied. As shown in Figure 7, the methods of each stage are
presented in the corresponding layer and are described in this section.
3.1.4. Strategy
Denzin and Lincoln (2011) defined the research strategy as the link between research
philosophy and method. Strategy is needed to plan how to collect and process data for
achieving the research objectives (Saunders, Lewis and Thornhill, 2009). There are several
options for the main strategy, such as experiments, surveys, and case studies. This study is
conducted with the help of an experiment.
3.1.5. Time horizon
The time horizon establishes the period of a study. There are two main time horizons:
longitudinal and cross-sectional. Due to the limitation of the dissertation to the university
course timeline, a cross-sectional study was selected as the approach for data collection. This
type of research is conducted at a specific point in time (Gould et al., 2015).
3.1.6. Techniques and procedures
The main technique applied in the research is conducting an experiment with the help of a
machine learning classifier. The procedures of the experiment are described in the next section
(Section 3.2).
This section describes how to practically conduct the experiment. It starts by describing the
three experimental stages. Then the experimental environment and tools used are described to
give other researchers the opportunity to repeat the experiment if necessary; next, the data is
described; this is followed by giving information about how the classification model is
optimised for the experiment; finally, it gives information about the metrics that are used for
evaluating the results.
The experiment was hosted on a MacBook Pro running the macOS High Sierra operating
system. According to the risk mitigation plan, in case of software defects, the operating
system was periodically backed up using the standard macOS functionality. The technical
characteristics of the experimental device are presented in more detail in Figure 8.
The processing of the collected data and all stages of the data mining were performed in the
Python programming language, v3.6.5. This choice is explained by the experimenter's
personal preference and also by the presence of all the necessary libraries (e.g. the Pandas
library) for data preprocessing. The full list of the Python libraries used, with descriptions,
can be found in Appendix B (lines 2 to 24).
The process of selecting the toolkit for building the classifier was the subject of
careful analysis. Taking into account the review of tools for educational data mining by
Slater et al. (2016), the shortlist of considered tools included Python's Scikit-Learn,
RapidMiner, Matlab, Weka, and R. After critical analysis, it was decided to use Scikit-Learn
in the Python environment. This toolkit was chosen for two main reasons. The first is
integrity: as mentioned above, Python is also used for the other tasks of the experiment, so it
is convenient to have a single environment for the whole experiment. The second reason is
processing speed. Pedregosa et al. (2012) examined the processing speed of several toolkits
by running a learning and cross-validation process and found that Scikit-Learn was the
fastest.
To manage releases of the experimental configuration and source code, the online service
GitHub was used. The latest version of the source code can also be found in Appendix B.
The use of this service was also part of the risk mitigation plan mentioned in the
introduction: in the case of hardware failure, it would always be possible to restore the
program code from GitHub.
To recompile the experiment's Python code, it is not necessary to use a particular
code editor. However, all code was written in Jupyter Notebook v4.4.0; hence, the source
code file is stored with the *.ipynb extension.
Additionally, the application of this set of toolkits, libraries and online services
appears to be common practice in the data science community (Stackoverflow, 2018).
The only limiting aspect of the experimental environment was the lack of computational
power for training the classifier. Training on the dataset took several hours, which
of course had a negative impact on the research experience. The full list of tools used can be
found in Table 8.
3.2.2. Data
A. Collection
At the beginning of the research, finding an appropriate dataset was a challenging task. This
was mostly because the experimental data from previous studies did not fit this research. For
example, the dataset of the large-scale studies by Ma et al. (2009), Vanhoenshoven et al.
(2016) and Dong et al. (2017) was in the SVMLight format. The application of this data was
impossible because the features had already been converted into numerical arrays. Despite
the general description of the features, in practice it was not possible to exclude host-based
features from this dataset.
For this reason, data was collected from publicly available data sources. In particular,
benign URLs were obtained from the Open Directory Project (DMOZ). The directory
consists of the largest set of URLs that are manually checked by editors. These editors have a
certain level of trust because they themselves pass a preliminary check.
For collecting malicious URLs, custom code (Appendix B, lines 26–78) was
developed that parses URLs from online services such as Vxvault, Malware Domain List and
Cybercrime-Tracker. A list of these sources can be found in Table 9.
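The actual parser is in Appendix B (lines 26–78); the following is only a minimal sketch of the harvesting idea, extracting http(s) URLs from the raw text of a fetched page. The sample page fragment and its layout are illustrative assumptions, not the real markup of the listed services.

```python
import re

# Illustrative sketch only: the real parser lives in Appendix B (lines 26-78).
# The sample page below is a made-up stand-in for a fetched source page.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls(page_text):
    """Return all http(s) URLs found in a page's raw text."""
    return URL_PATTERN.findall(page_text)

sample_page = """
<td>2018-07-01</td><td>http://malicious.example.com/payload.exe</td>
<td>2018-07-02</td><td>http://evil.example.org/drop.php?id=3</td>
"""
urls = extract_urls(sample_page)
```

In practice each source page would be fetched over HTTP before parsing; the regular expression simply stops at whitespace, quotes, or angle brackets, which is sufficient for link lists of this kind.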
The bar chart in Figure 9 shows the number of samples collected from each source, while the
ratio of benign and malicious classes is presented in the pie chart.
B. Preprocessing
In this section, the preprocessing stage, which comes after data collection, is described. By
preprocessing, this paper refers to the stage where data are converted into a form that is more
appropriate for the selected machine learning algorithm. Preprocessing includes stages such as
feature extraction, dealing with missing data and inappropriate values, and feature
engineering. These preprocessing stages are discussed below in this section.
Feature extraction is an important stage, in which raw data are transformed into
features. A list of valuable features is presented in the methodology chapter (Section 3.4). The
full list of extracted features is presented in Appendix C. Additionally, in this stage, all textual
features are converted into numerical form. This is because most machine learning
algorithms, and KNN in particular, work only on numerical (integer or real) data. The classes,
malicious and benign, are also represented in digital format, as 1 and 0 respectively.
The next stage deals with missing and inappropriate data, which usually appear during
the data collection process. There are two options for dealing with missing data, and the
appropriate choice depends on the reason for their absence. First, an entire row should be
deleted from a table if its cells are missing because of difficulties during data collection;
otherwise, these data might be misinterpreted by the learning algorithm. Second, in some
cases missing data have some meaning, and such values should be grouped into a new
category (Brink, Richards and Fetherolf, 2016).
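The two strategies above can be sketched with Pandas, the library used for preprocessing in this experiment; the column names and values here are hypothetical stand-ins for the real features.

```python
import pandas as pd

# Illustrative sketch of the two missing-data strategies; the column
# names and values are hypothetical, not the experiment's real data.
df = pd.DataFrame({
    "url_length": [54, None, 23],    # missing due to a collection glitch
    "registrar":  ["a", "b", None],  # missing value that carries meaning
    "type":       [1, 0, 0],
})

# Option 1: drop rows whose cells were lost during collection.
df = df.dropna(subset=["url_length"])

# Option 2: keep meaningful gaps by grouping them into a new category.
df["registrar"] = df["registrar"].fillna("unknown")
```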
Regarding inappropriate data, it was found that some URLs were created with the help of
URL shortening services such as bit.ly and goo.gl. These URLs carry no value
for the classifier and create unnecessary noise (Shekokar et al., 2015). To identify these
URLs, custom code was developed that uses an application programming interface (API)
provided by the online service longurl.org. This service allows one to automatically
detect and expand shortened URL addresses, and currently supports about 300
popular URL shortening services.
As far as feature engineering is concerned, the term refers to the process of applying
mathematical operations to extracted features in order to create further independent variables.
Operations such as finding the mean, normalising, or calculating ratios are commonly
applied by machine learning practitioners to boost the accuracy and computational
efficiency of classifier models (Brink, Richards and Fetherolf, 2016).
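The three operations named above (mean, ratio, normalisation) can be sketched in a few lines; the input values and the derived feature names are hypothetical, purely for illustration.

```python
# Illustrative feature-engineering sketch; values are made up.
url_lengths = [54, 23, 88, 41]
special_counts = [12, 3, 30, 9]

# Mean of an existing feature.
mean_length = sum(url_lengths) / len(url_lengths)

# Ratio feature: share of special characters in each URL string.
special_ratio = [s / l for s, l in zip(special_counts, url_lengths)]

# Min-max normalisation of url_length into [0, 1].
lo, hi = min(url_lengths), max(url_lengths)
norm_length = [(l - lo) / (hi - lo) for l in url_lengths]
```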
Lastly, the Python code that was used for data preprocessing can be found in
Appendix B (lines 80–155). Fragments of the data and references to the original
file can be found in Appendix D. This should allow other scientists to
smoothly repeat this experiment if necessary.
The validation of a classifier’s performance is an important part of data analytics,
since maximum prediction accuracy cannot be expected from the initial configuration. The
validation was performed using the model_selection module of the Scikit-Learn library.
The goal is to be sure that the model will show a stable result when new data are
received: a single training run does not give an indication of how well the learner will
generalise to a previously unseen dataset. In other words, a classifier should be low in bias
and variance, without overfitting or underfitting for a particular dataset (Freitas, 2000).
To tackle this issue, trial classification is performed, and the obtained results
are compared with the actual values. As a result of this operation, the model yields a
numerical estimate of the difference between classified and actual values, which is called the
training error. This process is also called validation.
The first validation is performed during the model training process, when the entire
dataset is divided into two parts, training and testing, as shown in Figure 10 (a). But by
dividing the data into only two parts, the model gets more chances of receiving non-randomly
distributed data. Therefore, it is necessary to cross-check classification accuracy by
additionally dividing the dataset into more parts (Platt, 2013). This stage is also called
cross-validation and is shown in Figure 10 (b).
Figure 10. Visual representation of data splitting and validating processes (Nelson, 2018)
Figure 11. Visual representation of K-Folds cross validation method (Nelson, 2018)
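The experiment itself used Scikit-Learn's model_selection utilities; the pure-Python sketch below only illustrates the splitting principle of the K-Folds method in Figure 11, with each of the k folds serving once as the test part.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists: each of the k folds serves once
    as the test part while the remaining folds form the training part.
    A didactic sketch of K-Folds, not the Scikit-Learn implementation."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

splits = list(k_fold_indices(10, 5))
```

Averaging the validation error over all k train/test pairs gives a more stable accuracy estimate than a single split.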
                Actual Positive     Actual Negative
Test Positive   True-Positive       False-Positive
Test Negative   False-Negative      True-Negative
But there are also advanced metrics that give a broader understanding of the accuracy of a
model. These metrics include the True Positive Rate (TPR), False Positive Rate (FPR),
Precision and Recall, which are calculated from the above-mentioned basic metrics. Overall,
these advanced metrics with the calculation formulas are presented in Table 10.
# Metric Formula
1 TPR (True Positive Rate or Sensitivity) TP / (TP + FN)
2 FPR (False Positive Rate or Fall-out) FP / (FP + TN)
3 Precision TP / (TP + FP)
4 Recall TP / (TP + FN)
From the table, it can be seen that TPR and Recall are identical. Consequently, the following
question arises: if they are the same, why are they named differently? This is partly
because TPR and FPR are usually used for building a receiver operating characteristic
(ROC) curve, so in the literature they are usually used together. Later, the data science
community came up with Recall and Precision, which are also used in pairs. That is, it is
common practice to measure a classifier’s accuracy using either a ROC curve (built from TPR
and FPR) or Recall with Precision.
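The formulas in Table 10 reduce to a few lines of arithmetic over the four basic confusion-matrix counts; the counts below are arbitrary illustrative numbers, not the experiment's results.

```python
# The four metrics of Table 10, computed from basic confusion-matrix
# counts. TP/FP/FN/TN values here are arbitrary illustrative numbers.
TP, FP, FN, TN = 80, 10, 20, 890

tpr = TP / (TP + FN)        # True Positive Rate (Sensitivity)
fpr = FP / (FP + TN)        # False Positive Rate
precision = TP / (TP + FP)
recall = TP / (TP + FN)     # identical to TPR by definition
```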
Drawing on a range of sources (Davis and Goadrich, 2006; Martin, 2011), different
recommendations have been set out about when these metrics should be used. From those
recommendations, several important and simple ideas can be obtained once the scholars’
mathematical arguments are demystified.
Mainly, Precision is recommended (Davis and Goadrich, 2006) where the dataset has
predominantly negative samples (benign URLs) rather than positive ones (malicious URLs).
This is because Precision is more focused on the positive class, hence there are more chances
of correctly detecting a malicious URL. This approach is applied, for example, where the
misclassification of a malicious URL has lamentable consequences.
On the other hand, FPR and TPR (the ROC metrics) measure the ability to distinguish
between two classes (ibid.). Hence, ROC curve metrics should be used when the detection of
both classes is equally important, because these metrics give equal weight to the prediction
ability for both classes. Usually, ROC curve metrics are used when the two classes are
balanced, or when the positive class is larger.
Overall, taking into account the above recommendations, it was decided to
measure the classifier’s accuracy with the help of the Precision and Recall metrics. This
decision was made according to the following factors: (1) as revealed in the methodology
chapter, the collected dataset is unbalanced – it has more negative samples than positive;
(2) the study is conducted under the assumption that a collision with a malware URL will
have serious negative consequences, hence security is a priority. Therefore, the end result of
the classifier will be evaluated with the help of the Precision and Recall metrics. In
particular, the results will be compared with similar metrics from previous works. After the
comparison, the relevant conclusions will be drawn, which can be found in the Results and
Analysis chapter of this paper.
3.3. Summary of the methodology
To sum up, the chapter explored all stages of the experiment step-by-step. The chapter
considered the methodology from two perspectives: theoretical and practical.
The theoretical part was built with the help of the Onion framework. According to this
framework, the following choices were made: positivism was chosen as the paradigm;
theory development follows the deductive method; a quantitative methodology was
developed; the strategy is to conduct an experiment; regarding the time horizon, the study is
cross-sectional; lastly, as the techniques and procedures, an experiment was conducted on a
machine learning classifier.
In the practical part, the experimental environment was systematically explained. This
section has provided a list of tools that were used during the experiment.
Next, detailed information about the experimental dataset was given. That section
described the data and examined some obstacles to obtaining it. Information about data
preprocessing was also given there.
Then the chapter moved to the model optimisation approach that was applied. This stage
was part of the risk mitigation process related to poor classifier quality. As a result, the
K-Fold cross-validation method was chosen for evaluating the classifier.
Lastly, the model evaluation metrics were chosen. After an analysis of the dataset and
algorithm, it was found that Precision and Recall would be appropriate metrics for measuring
the accuracy of the classifier. This gave an understanding of how to compare the obtained
classifier with previously conducted state-of-the-art experiments.
Chapter 4
Results
This chapter describes the experiment’s results. Detailed information about the
methodology of the experiment can be found in the previous chapter. The chapter is divided
into four parts: 1) description of the data; 2) description of the classifier; 3) performance of
the classifier; 4) comparison with other studies.
This section gives more detailed information about the preparation of the dataset before the
experiment. The number of samples in these subsets is presented in Table 11.
The bar chart above shows that the two classes were distributed almost equally between the
training and testing subsets. As mentioned previously, the given dataset is imbalanced, which
means that the total number of benign URLs is much larger than the number of malicious
URLs. For this reason, malicious URLs make up only about 11% of the dataset.
The dataset was divided into training and testing subsets with the help of Scikit-Learn’s
train_test_split function. The main reason for this splitting is the primary validation, which is
described in the methodology chapter of this paper (Section 3.5). During the experiment,
the classifier was trained on these 291 201 URLs, then prediction and validation were made
on the remaining 291 201 URLs.
The next step was to analyse the mutual dependence between all features of the
dataset. This analysis was made with the help of the Pandas library, which builds a
correlation matrix based on Spearman's rank coefficient, or Spearman's rho (Gautheir,
2001), as shown in Figure 14.
The matrix above presents the intercorrelations among all features of the given dataset.
According to the matrix, the feature is_equal has a strong correlation with is_query_part, and
the feature url_length has strong correlations with url_content_length, special_caracters
and slashes. According to Brink, Richards and Fetherolf (2016), strong correlation
between independent features is undesirable; it is therefore recommended to keep only
one of each pair of such features to reduce the size of the dataset. However, as mentioned in
the methodology, each of the extracted features previously demonstrated high performance in
the malicious URL detection task. In this regard, it was decided to keep all features for further
application.
Additionally, it is noticeable that the target feature (type) has a moderate correlation
with the features is_query_part, is_equal, is_ip_based, slashes, special_caracters, and
url_length. According to Brink, Richards and Fetherolf (2016), this means that the
mentioned features carry more weight and improve classification accuracy. In other words,
is_query_part, is_equal, is_ip_based, slashes, special_caracters, and url_length are valuable
features of the given dataset.
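In the experiment the matrix was produced by Pandas; the stand-alone sketch below computes Spearman's rho for a single pair of features to show what each matrix cell contains, using the no-ties shortcut formula (an assumption that keeps the rank computation simple).

```python
def spearman_rho(x, y):
    """Spearman's rank correlation for two equal-length sequences,
    assuming no tied values (the simple no-ties formula).
    Illustrative only; the experiment used Pandas to build the matrix."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rho = spearman_rho([10, 20, 30, 40], [1, 2, 3, 4])  # perfectly monotone pair
```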
This section describes the chosen algorithm, KNN, as implemented in the Scikit-Learn v0.19.2
library. To obtain the maximum accuracy rate, it was necessary to tune three parameters of
the algorithm (Table 12) through empirical testing and based on the recommendations
of Scikit-Learn (2018a).
The table above shows the established values of three parameters: weights, algorithm
and n_neighbors. The values of the first and second parameters were established based on
the recommendations given for binary classification tasks with a small, imbalanced
dataset. The optimal number of nearest neighbours was chosen through empirical testing
of all options between 1 and 20. The result of this test is shown in Figure 15.
The line graph above shows the misclassification error rate for each candidate number of
k neighbours. According to the plot, the algorithm shows a lower misclassification error rate
when the number of neighbours is equal to 3, 5 or 9, with the most appropriate number of
neighbours appearing to be 5. After tuning the algorithm’s parameters, the next step was to
build the classifier and obtain the values of the evaluation metrics.
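The sweep described above used Scikit-Learn's KNeighborsClassifier over k = 1..20; the self-contained sketch below re-implements the idea on a one-dimensional toy dataset purely to illustrate the selection procedure, so the data and error values are assumptions and not those of the experiment.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (1-D toy KNN)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy data: (feature value, class label); 0 = benign, 1 = malicious.
train = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
held_out = [(2.5, 0), (8.5, 1), (5.4, 0)]

# Sweep k and record the misclassification error on the held-out points.
errors = {}
for k in range(1, 6):
    wrong = sum(knn_predict(train, x, k) != y for x, y in held_out)
    errors[k] = wrong / len(held_out)

best_k = min(errors, key=errors.get)
```

In the experiment the same loop structure drives KNeighborsClassifier, and the error-versus-k values are what Figure 15 plots.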
4.3. Results
This section provides information about the classifier’s performance. As established
previously in the methodology chapter, the Precision and Recall metrics were chosen for
evaluating the performance of the classifier. The values of these metrics are shown in Table
13.
Table 13 shows that the total values of the Precision and Recall metrics are 0.93 and
0.94 respectively, while the overall accuracy rate is 0.93. However, the capability of
classifying malicious URLs is poor, with a Precision of 0.76 and a Recall of 0.59. This can
also be seen in the confusion matrix given in Figure 16.
The matrix above shows that the model misclassified malicious URLs (FN) 6899 times.
Although the classifier copes with the classification of benign URLs very well, the
misclassification rate for malicious URLs is considered to be very high.
4.4. Comparison
The values obtained from this experiment were compared with the results of the experiments
that have shown the best performance. The comparison is presented in Table 14. A full list of
the experiments with their results is available in Appendix A.
Chapter 5
Discussion and Analysis
5.1.1. Objective 1
In this work, the methodology part began with a discussion of the state-of-the-art machine
learning approaches for detecting malicious URLs. To understand this, most previous studies
conducted in the last 10 years were reviewed in the literature review. The analysis focused
primarily on the machine learning approach, although alternative approaches such as
heuristics and blacklisting were also considered. By giving a critical appraisal of the
advantages and disadvantages of the proposed approaches, a few papers were chosen for
further examination.
5.1.2. Objective 2
The next objective was dedicated to identifying the types of malicious URLs. It was also
explored in the literature review; to achieve it, recent surveys were reviewed as well.
According to these papers, malicious URLs can be categorised into the following three
groups: (1) spamming; (2) phishing; and (3) malware. After such categorisation, it became
clear which features of the URL would need attention while extracting features. Generally,
this knowledge gave a closer view of the next objective, which attempts to explore attacks
that are conducted with the help of URLs.
5.1.3. Objective 3
The objective of exploring attack types was also addressed in the literature review. The
analysis of different attacks undertaken there has extended our knowledge in this domain. It
helps to pay attention to particular features of a URL string, which in turn allows achieving
the next objective: finding the most valuable lexical-based features.
5.1.4. Objective 4
The next objective was to identify valuable lexical-based features. Using all the gathered
information about the lexical properties of malicious URLs, the most valuable features were
listed in the literature review. Additionally, the features used in previous studies were also
taken into account. Overall, 25 lexical-based features were selected for feeding the machine
learning algorithm.
5.1.5. Objective 5
The following objective was to choose the most appropriate machine learning
algorithm. The Multiple-Criteria Decision Analysis (MCDA) approach was used to make the
final decision regarding the algorithm. As a result, it was found that KNN is the most suitable
algorithm for this experiment.
5.1.6. Objective 6
The data collection process was described in the methodology part of the paper. For this
research, it was desirable to collect a wide variety of malicious and benign URLs in
order to produce rich and interesting results for the experiment. After the tremendous work of
parsing different web resources, an abundant amount of data was found. However, owing to a
limitation of computational power, the number of benign URLs was reduced from 4 million to
400 thousand.
5.1.7. Objective 7
Lastly, it was necessary to ensure the robustness of the machine learning classifier; in other
words, the classifier needed to be resilient to overfitting. To achieve this, the trained
model was cross-validated using the k-Fold technique, whose principle was explained in the
methodology part.
5.2. Findings
Overall, the classifier showed good performance. The average results of this experiment were
lower than the results obtained by Ma et al. (2010) and Dong, Shang and Yu (2017).
However, the average result of 0.93–0.94 is still considered high.
The first finding concerned the value of the selected features. Data analysis with the help of
a correlation matrix based on Spearman's rank coefficient showed that the presence of a query
part, an IP address and special characters in the URL string may indicate that the URL is
malicious.
Second, the chosen KNN binary classifier does not appear to be as fast as reported in
previous studies. It is in fact common to encounter unexpected resource overconsumption in
data analysis workflows. It is also still difficult to calculate precisely how the model will
behave when the dataset becomes several thousand times larger. Hence, an additional
calculation of computational resources is needed before designing a large-scale solution.
Finally, the main finding was that the average performance of the classifier was sufficiently
good. However, the Recall for the malicious class was equal to 0.59, which means that the
classifier is able to correctly detect only slightly more than half of the malicious URLs. This
poor Recall can be explained by some limitations of this experiment and the general
limitations of the machine learning approach, which are discussed below.
5.3. Limitations
Despite the systematic approach to study design, the results should be interpreted with
caution due to some limitations of the experiment and of the general machine learning
approach.
The main limitation of this experiment was the size of the dataset. In particular, the
relatively small number of malicious URLs may have caused the poor Recall of the classifier:
the initial data may not have been sufficient for a qualitative learning process.
The second limitation is the labelling of the training dataset. As previously stated,
the labelling of malicious URLs in the collected data was done manually by various
volunteers. In this regard, the quality of the labelling is not fully reliable.
Another limitation is compromised websites. As mentioned in the literature review,
almost one third of websites may be compromised. This creates obstacles for the machine
learning approach because compromised web resources are actually benign sites with the
lexical properties inherent to benign URLs; hence, the classifier incorrectly detects them as
benign.
Moreover, other general limitations of the machine learning approach should be
addressed as well. For example, various obfuscation techniques also impose solid limitations
on the approach. As reported, the use of obfuscation techniques based on URL shortening
services and QR code generators is becoming a new trend in phishing attacks, and detecting
these kinds of URLs is currently an infeasible task for machine learning (Sahoo, Liu and
Hoi, 2017).
Finally, poisoning attacks are a new challenge for machine learning practitioners
working in the cybersecurity industry. Such an attack is carried out by supplying carefully
designed samples that eventually compromise the learning process of a classifier; it can thus
be regarded as adversarial contamination of the training data (Jagielski et al., 2018).
In the introduction, three risks were identified. During the project, two of them needed to be
mitigated: (1) failure to get data and (2) low quality of the classifier.
Regarding the dataset, it was impossible to find a raw dataset from previously
published studies. Therefore, raw URLs were requested from other large-scale studies,
such as Ma et al. (2009) and Vanhoenshoven et al. (2016). However, it was found that the
original URLs are no longer available. In this regard, it was decided to collect data with the
help of custom software. In the end, the data was collected on time.
The second risk, related to the quality of the classifier, also materialised during the
experiment: several runs of the predictor showed different results. However, as discussed
in the methodology, the k-Fold cross-validation method was applied to avoid overfitting
and underfitting of the model. As a result, the performance of the algorithm was stable by
the end of the experiment.
Overall, the experiment was conducted smoothly. This can be seen as a result of the
mitigation actions discussed at the beginning of the project.
Chapter 6
References
Acunetix (2017) Acunetix, [online]. Available at:
<https://www.acunetix.com/blog/articles/injection-attacks> [Accessed: 1 May 2018].
Aggarwal, A., Rajadesingan, A. and Kumaraguru, P. (2012) ‘PhishAri: Automatic realtime
phishing detection on twitter’, eCrime Researchers Summit, eCrime, pp. 1–12. doi:
10.1109/eCrime.2012.6489521.
Alcaide, A., Blasco, J., Galan, E. and Orfila, A. (2011) ‘Cross-Site Scripting: An Overview’,
Innovations in SMEs and Conducting EBusiness Technologies Trends and Solutions, pp. 61–
75. doi: 10.4018/978-1-60960-765-4.ch004.
Aleroud, A. and Zhou, L. (2017) ‘Phishing environments, techniques, and countermeasures:
A survey’, Computers and Security, 68(May), pp. 160–196. doi: 10.1016/j.cose.2017.04.006.
Alsharnouby, M., Alaca, F. and Chiasson, S. (2015) ‘Why phishing still works: User
strategies for combating phishing attacks’, International Journal of Human Computer Studies.
doi: 10.1016/j.ijhcs.2015.05.005.
Anthony, E. (2007) ‘Phishing: An Analysis of a Growing Problem’. Available at:
<https://www.sans.org/reading-room/whitepapers/threats/phishing-analysis-growing-problem-
1417> [Accessed: 1 May 2018].
Antunes, C. H. and Henriques, C. O. (2016) Multiple Criteria Decision Analysis. doi:
10.1007/978-1-4939-3094-4.
Apple Inc. (2017) macOS Sierra: Back up with Time Machine. Available at:
<https://support.apple.com/kb/PH25710?locale=ru_RU&viewlocale=en_US> [Accessed: 2
June 2017].
Banday, M. T. and Qadri, J. a. (2007) ‘Phishing – A Growing Threat to E-Commerce’, The
Business Review, 12(2), pp. 76–83.
Bannur, S. N., Saul, L. K. and Savage, S. (2011) ‘Judging a site by its content: learning the
textual, structural, and visual features of malicious Web pages’, Proceedings of the 4th ACM
workshop on Security and artificial intelligence - AISec ’11, (Vm), p. 1. doi:
10.1145/2046684.2046686.
Basnet, R. B. and Sung, A. H. (2012) ‘Mining web to detect phishing URLs’, Proceedings -
2012 11th International Conference on Machine Learning and Applications, ICMLA 2012,
1(July 2015), pp. 568–573. doi: 10.1109/ICMLA.2012.104.
Berners-Lee, T. (2005) Uniform Resource Identifier (URI): Generic Syntax. Available at:
<https://tools.ietf.org/html/rfc3986#section-3> [Accessed: 1 May 2018].
Blum, A., Wardman, B. and Warner, G. (2010) ‘Lexical Feature Based Phishing URL
Detection Using Online Learning’, pp. 54–60.
Brink, H., Richards, J. and Fetherolf, M. (2016) Real-World Machine Learning. 1st edn.
Manning Publications.
Canali, D., Cova, M., Vigna, G. and Kruegel, C. (2011) ‘Prophiler : A Fast Filter for the
Large-Scale Detection of Malicious Web Pages Categories and Subject Descriptors’, Proc. of
the International World Wide Web Conference (WWW), pp. 197–206. doi:
10.1145/1963405.1963436.
Cao, J., Li, Q., Ji, Y., He, Y. and Guo, D. (2016) ‘Detection of Forwarding-Based Malicious
URLs in Online Social Networks’, International Journal of Parallel Programming. Springer
US, 44(1), pp. 163–180. doi: 10.1007/s10766-014-0330-9.
Chaudhry, J. A., Chaudhry, S. A. and Rittenhouse, R. G. (2016) ‘Phishing attacks and
defenses’, International Journal of Security and its Applications, 10(1), pp. 247–256. doi:
10.14257/ijsia.2016.10.1.23.
Chen, C. M., Huang, J. J. and Ou, Y. H. (2015) ‘Efficient suspicious URL filtering based on
reputation’, Journal of Information Security and Applications. Elsevier Ltd, 20, pp. 26–36.
doi: 10.1016/j.jisa.2014.10.005.
Chiew, K. L., Yong, K. S. C. and Tan, C. L. (2018) ‘A survey of phishing attacks: Their
types, vectors and technical approaches’, Expert Systems with Applications. doi:
10.1016/j.eswa.2018.03.050.
Chilisa, B. and Kawulich, B. (2012) ‘Selecting a Research Approach: Paradigm,
Methodology, and Methods’, Doing Social Research: A Global Context, (October), pp. 51–61.
Available at: <https://www.researchgate.net/profile/Barbara_Kawulich/publication/
257944787_Selecting_a_research_approach_Paradigm_methodology_and_methods/links/
56166fc308ae37cfe40910fc/Selecting-a-research-approach-Paradigm-methodology-and-
methods.pdf> [Accessed: 1 May 2018].
Choi, H., Zhu, B. B. and Lee, H. (2011) ‘Detecting malicious web links and identifying their
attack types’, WebApps, p. 11. doi: 10.1109/IUCS.2010.5666254.
Chu, W., Zhu, B. B., Xue, F., Guan, X. and Cai, Z. (2013) ‘Protect sensitive sites from
phishing attacks using features extractable from inaccessible phishing URLs’, IEEE
International Conference on Communications, (July), pp. 1990–1994. doi:
10.1109/ICC.2013.6654816.
Clough, P. (2012) A Student’s Guide to Methodology. 3rd edn. Sage Publications Ltd. doi:
1446208621.
Cortes, C. and Vapnik, V. (1995) ‘Support-Vector Networks’, Machine Learning, 20(3), pp.
273–297. doi: 10.1023/A:1022627411411.
Curtsinger, C., Livshits, B., Zorn, B. and Seifert, C. (2011) ‘ZOZZLE: fast and precise in-
browser JavaScript malware detection’, SEC’11 Proceedings of the 20th USENIX conference
on Security, p. 3. Available at: <http://dl.acm.org.oca.korea.ac.kr/citation.cfm?id=2028067.2028070>
[Accessed: 1 May 2018].
Damashek, M. (1995) ‘Gauging Similarity with Language-Independent Categorization of
Text’, 267(5199), pp. 843–848.
Davis, J. and Goadrich, M. (2006) ‘The relationship between Precision-Recall and ROC
curves’, University of Wisconsin-Madison, Madison, WI, pp. 233–240. doi:
10.1145/1143844.1143874.
Dong, H., Shang, J. and Yu, D. (2017) ‘Beyond the blacklists : Detecting malicious URL
through machine learning’.
Dua, S. and Du, X. (2015) Data Mining and Machine Learning in Cybersecurity, Impressoras
3D: O novo meio Produtivo. doi: 10.1017/CBO9781107415324.004.
Enbody, R. and Sood, A. (2011) ‘Fraud & security’, (April).
Eshete, B., Villafiorita, A. and Weldemariam, K. (2013) ‘BINSPECT: Holistic analysis and
detection of malicious web pages’, Lecture Notes of the Institute for Computer Sciences,
Social-Informatics and Telecommunications Engineering, 106 LNICS, pp. 149–166. doi:
10.1007/978-3-642-36883-7_10.
Freitas, A. (2000) Understanding the crucial differences between classification and discovery
of association rules: a position paper. doi: 10.1145/360402.360423.
Gao, H., Hu, J., Wilson, C., Li, Z., Chen, Y. and Zhao, B. Y. (2010) ‘Detecting and
characterizing social spam campaigns’, Proceedings of the 10th annual conference on
Internet measurement - IMC ’10, p. 35. doi: 10.1145/1879141.1879147.
Garcia-Molina, H. and Gyongyi, Z. (2005) ‘Web Spam Taxonomy’, First international
workshop on adversarial information retrieval on the web (AIRWeb 2005), pp. 1–9. Available
at: <http://ilpubs.stanford.edu:8090/771/1/2005-9.pdf> [Accessed: 1 May 2018]..
Garera, S., Provos, N., Chew, M. and Rubin, A. D. (2007) ‘A framework for detection and
measurement of phishing attacks’, Proceedings of the 2007 ACM workshop on Recurring
malcode - WORM ’07, p. 1. doi: 10.1145/1314389.1314391.
Gautheir, T. (2001) ‘Detecting Trends Using Spearman’s Rank Correlation Coefficient’,
Environmental Forensics, 2(4), pp. 359–362.
Google Inc. b (2018) Safe Browsing site status [online]. Available at:
<https://transparencyreport.google.com/safe-browsing/search?hl=en_GB> [Accessed: 1 May
2018].
Gould, S. J. J., Cox, A. L., Brumby, D. P. and Wiseman, S. (2015) ‘Home is Where the Lab
is: A Comparison of Online and Lab Data From a Time-sensitive Study of Interruption’,
Human Computation, 2(1), pp. 45–67. doi: 10.15346/hc.v2i1.4.
Halevi, T. and Lewis, J. (2013) ‘Phishing, Personality Traits and Facebook’, (January).
Available at:
<https://www.researchgate.net/publication/235357780_Phishing_Personality_Traits_and_Facebook>
[Accessed: 1 May 2018].
Halfond, W. G. J. and Orso, A. (2005) ‘AMNESIA: Analysis and Monitoring for
NEutralizing SQL-Injection Attacks’, p. 3.
Halfond, W. G. J., Viegas, J. and Orso, A. (2006) ‘A Classification of SQL Injection Attacks
and Countermeasures’.
He, M., Horng, S. J., Fan, P., Khan, M. K., Run, R. S., Lai, J. L., Chen, R. J. and Sutanto, A.
(2011) ‘An efficient phishing webpage detector’, Expert Systems with Applications. Elsevier
Ltd, 38(10), pp. 12018–12027. doi: 10.1016/j.eswa.2011.01.046.
Holz, T., Gorecki, C., Rieck, K. and Freiling, F. C. (2008) ‘Measuring and Detecting Fast-
Flux Service Networks’, Ndss, pp. 24–31. doi: 10.1.1.140.188.
Hou, Y. T., Chang, Y., Chen, T., Laih, C. S. and Chen, C. M. (2010) ‘Malicious web content
detection by machine learning’, Expert Systems with Applications. Elsevier Ltd, 37(1), pp.
55–60. doi: 10.1016/j.eswa.2009.05.023.
Huang, H., Qian, L. and Wang, Y. (2012) ‘A SVM-based technique to detect phishing URLs’,
Information Technology Journal, 11(7), pp. 921–925. doi: 10.3923/itj.2012.921.925.
Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C. and Li, B. (2018) ‘Manipulating
Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning’, (1).
doi: 10.1109/SP.2018.00057.
Jelodar, H., Wang, Y., Yuan, C. and Jiang, X. (2017) ‘A systematic framework to discover
pattern for web spam classification’, 2017 8th IEEE Annual Information Technology,
Electronics and Mobile Communication Conference, IEMCON 2017, pp. 32–39. doi:
10.1109/IEMCON.2017.8117135.
Kohavi, R. (2016) ‘A Study of Cross-Validation and Bootstrap for Accuracy Estimation and
Model Selection’, Learning, (March 2001), pp. 1137–1143.
Kolari, P., Finin, T. and Joshi, A. (2006) ‘SVMs for the blogosphere: Blog identification and
splog detection’, AAAI Spring Symposium on Computational Approaches to Analyzing
Weblogs, 4, p. 1.
Kolbitsch, C., Livshits, B. and Seifert, C. (2012) ‘Rozzle: De-cloaking internet malware’.
Krombholz, K., Merkl, D. and Weippl, E. (2012) ‘Fake identities in social media: A case
study on the sustainability of the Facebook business model’, Journal of Service Science
Research, 4(2), pp. 175–212. doi: 10.1007/s12927-012-0008-z.
Kuhn, T. S. (1962) The Structure of Scientific Revolutions. Chicago: University of Chicago
Press. doi: 10.1119/1.1969660.
Kumaraguru, P., Sheng, S., Acquisti, A., Cranor, L. F. and Hong, J. (2008) ‘Lessons from a
real world evaluation of anti-phishing training’, eCrime Researchers Summit, eCrime 2008.
doi: 10.1109/ECRIME.2008.4696970.
Lawrence, N. (2013) Research Methods: Qualitative and Quantitative Approaches. Available
at: <http://lib.hpu.edu.vn/handle/123456789/28691> [Accessed: 1 May 2018].
Le, A., Markopoulou, A. and Faloutsos, M. (2011) ‘PhishDef: URL names say it all’,
Proceedings - IEEE INFOCOM, pp. 191–195. doi: 10.1109/INFCOM.2011.5934995.
Le, V. L., Welch, I., Gao, X. and Komisarczuk, P. (2013) ‘Anatomy of drive-by download
attack’, Conferences in Research and Practice in Information Technology Series, 138(Aisc).
Lin, M. S., Chiu, C. Y., Lee, Y. J. and Pao, H. K. (2013) ‘Malicious URL filtering - A big
data application’, Proceedings - 2013 IEEE International Conference on Big Data, Big Data
2013, pp. 589–596. doi: 10.1109/BigData.2013.6691627.
Ma, J., Kulesza, A., Dredze, M., Saul, L. K. and Pereira, F. (2010) ‘Exploiting Feature
Covariance in High-Dimensional Online Learning’, Proceedings of the Artificial Intelligence
and Statistics, 9, pp. 493–500. Available at: <http://citeseerx.ist.psu.edu/viewdoc/download?
doi=10.1.1.169.3701&rep=rep1&type=pdf> [Accessed: 1 May 2018].
Ma, J., Saul, L. K., Savage, S. and Voelker, G. M. (2009) ‘Beyond Blacklists: Learning to
Detect Malicious Web Sites from Suspicious URLs’, World Wide Web Internet And Web
Information Systems, pp. 1245–1253. doi: 10.1145/1557019.1557153.
Ma, J., Saul, L., Savage, S. and Voelker, G. (2009) ‘Identifying suspicious URLs: an
application of large-scale online learning’, … on Machine Learning, pp. 681–688. Available
at: <http://dl.acm.org/citation.cfm?id=1553462> [Accessed: 1 May 2018].
Marchal, S., Francois, J., State, R. and Engel, T. (2015) ‘PhishScore: Hacking phishers’
minds’, Proceedings of the 10th International Conference on Network and Service
Management, CNSM 2014, pp. 46–54. doi: 10.1109/CNSM.2014.7014140.
Marchal, S., State, R. and Engel, T. (2015) ‘PhishStorm: Detecting Phishing with Streaming
Analytics’.
Martin, D. (2011) ‘Evaluation: from Precision, Recall and F-measure to ROC, Informedness,
Markedness and Correlation’. Available at: <http://hdl.handle.net/2328/27165> [Accessed: 1
May 2018].
McGrath, D. K. and Gupta, M. (2008) ‘Behind phishing: an examination of phisher modi
operandi’, Usenix Workshop on Large-Scale Exploits and Emergent Threats (LEET), p. 4.
Available at: http://portal.acm.org/citation.cfm?id=1387713.
Milletary, J. (2005) ‘Technical Trends in Phishing Attacks’, Technical Trends in Phishing,
pp. 1–17. Available at:
<https://resources.sei.cmu.edu/asset_files/WhitePaper/2005_019_001_50315.pdf> [Accessed:
1 May 2018].
Nepali, R. K., Wang, Y. and Alshboul, Y. (2015) ‘Detecting malicious short URLs on
Twitter’, Americas Conference on Information Systems, pp. 1–7.
Ollmann, G. (2004) ‘Second Order Code Injection Attacks’, pp. 1–11.
Oxford Dictionary (1930) Definition of overfitting in English. Available at:
<https://en.oxforddictionaries.com/definition/overfitting> [Accessed: 1 May 2018].
Pan, J. and Mao, X. (2016) ‘DomXssMicro: A micro Benchmark for evaluating DOM-based
cross-site scripting detection’, Proceedings - 15th IEEE International Conference on Trust,
Security and Privacy in Computing and Communications, 10th IEEE International
Conference on Big Data Science and Engineering and 14th IEEE International Symposium
on Parallel and Distributed Proce, pp. 208–215. doi: 10.1109/TrustCom.2016.0065.
Pao, H. K., Chou, Y. L. and Lee, Y. J. (2012) ‘Malicious URL detection based on
Kolmogorov complexity estimation’, Proceedings - 2012 IEEE/WIC/ACM International
Conference on Web Intelligence, WI 2012, pp. 380–387. doi: 10.1109/WI-IAT.2012.258.
Patil, D. R. and Patil, J. B. (2016) ‘Malicious Web Pages Detection Using Static
Analysis of URLs’, International Journal of Information Security and Cybercrime, 5(2), pp.
57–70. doi: 10.19107/IJISC.2016.02.06.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher,
M., Perrot, M. and Duchesnay, É. (2012) ‘Scikit-learn: Machine Learning in Python’, Journal
of Machine Learning Research, 12, pp. 2825–2830.
PhishTank (2017) Join the fight against phishing. Available at: <https://www.phishtank.com>
[Accessed: 1 May 2018].
Platt, J. (2013) ‘Probabilistic Outputs for Support Vector Machines and Comparisons to
Regularized Likelihood Methods’, (June 2000).
Provos, N., Mcnamee, D., Mavrommatis, P., Wang, K. and Modadugu, N. (2007) ‘The Ghost
In The Browser: Analysis of Web-based Malware’, Proceedings of the First Workshop on
Hot Topics in Understanding Botnets, 462, p. 4.
Ross, D. A., Lim, J., Lin, R.-S. and Yang, M.-H. (2008) ‘Incremental Learning for Robust
Visual Tracking’, International Journal of Computer Vision, 77(1–3), pp. 125–141. doi:
10.1007/s11263-007-0075-7.
Sabhnani, M., Serpen, G. and More, K. K. (2003) ‘Application of Machine Learning
Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context’,
Proceedings of International Conference on Machine Learning: Models, Technologies, and
Applications (MLMTA), pp. 209–215.
Sahoo, D., Liu, C. and Hoi, S. C. H. (2017) ‘Malicious URL Detection using Machine
Learning: A Survey’, pp. 1–21. Available at: <http://arxiv.org/abs/1701.07179> [Accessed: 1
May 2018].
Saunders, M., Lewis, P. and Thornhill, A. (2009) Understanding research philosophies and
approaches.
Schulz, M.-A., Schmalbach, B., Brugger, P. and Witt, K. (2012) ‘Analysing Humanly
Generated Random Number Sequences: A Pattern-Based Approach’. doi:
10.1371/journal.pone.0041531.
Scikit-learn (2018) 1.6. Nearest Neighbors. Available at:
<http://scikit-learn.org/stable/modules/neighbors.html> [Accessed: 18 July 2018].
Seifert, C., Welch, I. and Komisarczuk, P. (2008) ‘Identification of malicious web pages with
static heuristics’, Proceedings of the 2008 Australasian Telecommunication Networks and
Applications Conference, ATNAC 2008, pp. 91–96. doi: 10.1109/ATNAC.2008.4783302.
Shekokar, N. M., Shah, C., Mahajan, M. and Rachh, S. (2015) ‘An ideal approach for
detection and prevention of phishing attacks’, Procedia Computer Science. Elsevier Masson
SAS, 49(1), pp. 82–91. doi: 10.1016/j.procs.2015.04.230.
Sheng, S., Holbrook, M., Kumaraguru, P., Cranor, L. F. and Downs, J. (2010) ‘Who falls for
phish? A Demographic Analysis of Phishing Susceptibility and Effectiveness of
Interventions’, Proceedings of the 28th international conference on Human factors in
computing systems - CHI ’10, pp. 373–382. doi: 10.1145/1753326.1753383.
Slater, S., Joksimovic, S., Kovanovic, V., Baker, R. S. and Gasevic, D. (2016) ‘EDM: Tools
for educational data mining’, Journal of Educational and Behavioral Statistics, 42(1), pp.
85–106. doi: 10.3102/1076998616666808.
Sorio, E., Bartoli, A. and Medvet, E. (2013) ‘Detection of hidden fraudulent URLs within
trusted sites using lexical features’, Proceedings - 2013 International Conference on
Availability, Reliability and Security, ARES 2013, pp. 242–247. doi: 10.1109/ARES.2013.31.
Spirin, N. and Han, J. (2011) ‘Survey on Web Spam Detection: Principles and Algorithms’,
SIGKDD Explorations Newsletter, 13(2), pp. 50–64. doi: 10.1145/2207243.2207252.
Stackoverflow (2018) Developer Survey Results 2018. Available at:
<https://insights.stackoverflow.com/survey/2018> [Accessed: 1 August 2018].
Stehman, S. (1997) ‘Selecting and interpreting measures of thematic classification accuracy’,
Remote Sensing of Environment, 62(1), pp. 77–89. doi: 10.1016/S0034-4257(97)00083-7.
Stol, K.-J., Ralph, P. and Fitzgerald, B. (2015) ‘Grounded Theory in Software Engineering
Research: A Critical Review and Guidelines’, Proceedings of the 37th International
Conference on Software Engineering (ICSE).
Table 2. NB

#  Alg/Paper              Acc. Rate  Precision  Recall  FPR    FNR    TPR
1  (Canali et al., 2011)  85         –          –       44.1   16.4   –
Table 3. KNN
Table 4. LR

#  Alg/Paper                       Acc. Rate  Precision  Recall  FPR    FNR    TPR
1  (Garera et al., 2007)           97.3       –          –       0.7    12     88
2  (Ma, L. K. Saul, et al., 2009)  99         –          –       0.1    7.6    –
3  (Canali et al., 2011)           85         –          –       17.1   25.6   –
4  (Xu et al., 2013)               90.55      –          –       5.69   22.99  –
5  (Wang et al., 2013)             56.43      –          –       52.8   65.7   –

Average accuracy: 85.66
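As a quick arithmetic check, the average accuracy reported for Table 4 can be reproduced from the five per-paper accuracy rates listed above (a minimal sketch; the variable name is illustrative):

```python
# Accuracy rates (%) reported for logistic regression in Table 4
accuracies = [97.3, 99, 85, 90.55, 56.43]

# Mean of the five values, rounded to two decimal places
average = sum(accuracies) / len(accuracies)
print(round(average, 2))  # 85.66, matching the table's reported average
```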
print("Train set {} samples, where [0]:{} and [1]:{}".format(len(y_train), tr[0], tr[1]))
print("Test set {} samples, where [0]:{} and [1]:{}".format(len(y_test), ts[0], ts[1]))

start_time = time.time()
knn = KNeighborsClassifier(n_neighbors=5)
scores_lg = cross_val_score(estimator=knn, X=X_train, y=y_train, scoring="accuracy", cv=10)
elapsed_time = time.time() - start_time
print("\nAverage accuracy score: {}, Validating time: {} sec".format(scores_lg.mean(), round(elapsed_time, 2)))

# candidate numbers of neighbours: the odd values in 1..19
neighbors = [k for k in range(1, 20) if k % 2 != 0]

# list that will hold the mean cross-validation score for each k
cv_scores = []

# perform 10-fold cross-validation for each candidate k
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# changing to misclassification error and determining the best k
MSE = [1 - x for x in cv_scores]
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)

# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.grid()
plt.show()

# re-create the classifier with the optimal k before final training
knn = KNeighborsClassifier(n_neighbors=optimal_k)
start_time = time.time()
knn.fit(X_train, y_train)  # training
training_time = time.time() - start_time

start_time = time.time()
y_pred_dt = knn.predict(X_test)  # predicting
predicting_time = time.time() - start_time

train_score_dt = knn.score(X_train, y_train)
model_rf_acc = accuracy_score(y_test, y_pred_dt)

print(classification_report(y_test, y_pred_dt))
print("-" * 30)
print("Predicting time: {} sec; Training time: {}; Accuracy score: {:.4f}".format(
    round(predicting_time), round(training_time), knn.score(X_test, y_test)))

labels = ['Benign', 'Malicious']
conf_matrix = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(7, 6))
sb.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
Notice: the source code can also be downloaded from the following URL:
<https://github.com/shokan/MaliciousURL>
Appendix C. Extracted Lexical-based Features
Notice: the raw data and preprocessed dataset can be downloaded from the following URL:
<https://www.dropbox.com/s/f7c8ijhhp4joaig/dataset.zip?dl=0>