Automated Collection of Open Source Intelligence
Faculty of Informatics
Master’s Thesis
Hereby I declare that this paper is my original authorial work, which I have
worked out on my own. All sources, references, and literature used or ex-
cerpted during elaboration of this work are properly cited and listed in com-
plete reference to the due source.
Acknowledgements
I would like to thank my advisor RNDr. Lukáš Němec for his guidance
throughout the entirety of this thesis. My thanks also go to RNDr. Martin
Stehlík, Ph.D. who provided many valuable suggestions and helped shape
the tool that is the outcome of this thesis.
Huge appreciation goes to my family for the support they have given me
during all the years of my studies.
I also want to thank Cedric from CIRCL.LU for providing me free access
to their Passive DNS and Passive SSL databases, and Gregory from Spyse
for giving me a free trial for their port discovery service, which allowed me
to extend Pantomath and further test the reliability estimation model.
Abstract
With the ever-growing amount of data available on the Internet and the wide-
spread adoption of social media networks, publicly accessible websites have
grown into a goldmine of valuable information about individuals and com-
panies. Open Source Intelligence, or OSINT for short, is any information
obtainable legally and ethically from publicly available sources addressing
specific intelligence requirements. The relatively easy and cheap integration makes
OSINT a practical solution for national security, cyber threat intelligence,
and many other fields. This thesis presents a framework called Pantomath
for an automated collection of OSINT that utilizes many existing tools and
services. The framework is highly modular, provides all the functionality
needed throughout the whole process of OSINT, offers three modes of oper-
ation for different anonymity requirements, and presents the data in a struc-
tured output. The reliability of some of the collected data is estimated to
allow the user to analyze the data more efficiently and precisely. The frame-
work is compared to existing OSINT automation tools, and the most notable
advantages and disadvantages are discussed.
Keywords
Contents
1 Introduction
6 Conclusions
Bibliography
A Appendices
1 Introduction
With the exponential growth of the Internet in the last few decades, the
amount of data stored around the world has become immeasurable. It is
estimated that four of the biggest online companies, Amazon, Microsoft,
Google, and Facebook, store at least 1.2 million terabytes of data. At first,
data was thought of as a mere by-product of computing, but it has even-
tually grown into a product itself [1]. Companies sell their users’ data to
others that benefit from it, so collecting data of any value is essential for
many. A large portion of Internet data is accessible to anyone with an In-
ternet connection and often contains a lot of knowledge about individuals,
companies, or governments. All this data is commonly called Open Source
Intelligence, or OSINT for short.
The value of OSINT is increasingly getting recognized in many different
fields. According to [2], over 80% of the knowledge used for policymaking
on a national level is derived from OSINT. Cyber threat intelligence heavily
utilizes OSINT and combines it with data collected by security devices to
evaluate possible threats to companies’ infrastructures. All in all, publicly
available sources constitute an irreplaceable source of knowledge. However,
due to the immense amount of data on the Internet and its unstructured and
heterogeneous nature, the collection and processing of OSINT is a challeng-
ing task requiring non-trivial methods. Arguably one of the biggest draw-
backs of OSINT is the lack of mechanisms for verification of the collected
information [3].
To make the whole process of OSINT easier and more accessible, various
tools and services that provide useful information exist. These range from
simple websites that provide basic information about IP addresses to more
complex tools implementing state-of-the-art algorithms, such as Shodan [4].
A framework called Pantomath for an automated collection of OSINT is pre-
sented in this thesis. The framework utilizes existing tools and services that
provide valuable information about Internet identifiers, such as IP addresses
or domain names. As the number of these services is enormous, Pantomath
was designed to make the integration of new sources more straightforward
by moving the data collection to separate modules, which can be added by
merely implementing a well-defined interface.
To address the user’s anonymity requirements, Pantomath offers three
modes of operation with varying guarantees and drawbacks. The overt mode
represents a regular operation where all sources are used, and an Internet
connection is required. In the stealth mode, all requests sent to the Internet
are proxied through the Tor network. The offline mode provides the highest
guarantees for the user’s anonymity, as only a database of preprocessed data
is queried, and no Internet connection is needed. Pantomath also attempts to
tackle possibly the biggest challenge of OSINT – the validation of the gath-
ered data. A mathematical model for reliability estimation of the results is
defined and used in several modules.
The thesis is organized as follows. Chapter 2 introduces OSINT, discusses
some of the challenges, the value it provides, the fields where OSINT is often
utilized, and a few state-of-the-art techniques that improve the efficiency of
OSINT collection. Chapter 3 outlines the sources that can be used to gather
the data and some tools that aim to automate this process. Pantomath,
a tool for an automated collection of OSINT, is presented in Chapter 4.
Chapter 5 evaluates Pantomath, compares it to tools with similar goals, and
outlines possible extensions and improvements.
2 Open Source Intelligence
Publicly available data can be divided into four different categories [12],
as illustrated in Figure 2.1. Open Source Data (OSD) are any publicly avail-
able data that are not refined in any way, e.g., an image or raw social media
data. Open Source Information (OSINF) are OSD that have undergone fil-
tering, extraction of valuable information, and editing, for example, articles
and results from search engine queries. Open Source Intelligence (OSINT)
is a collection of OSINF that addresses a specific intelligence requirement,
with Validated Open Source Intelligence (OSINT-V) going a step further by
validating the OSINT using supporting information. Data from all these four
categories can be used for OSINT. However, Open Source Data and Open
Source Information require further assessment. Specific sources that can be
used for OSINT are discussed in Chapter 3.
Figure 2.1: Description of Open Source Data, Open Source Information, and
(Validated) Open Source Intelligence and how these categories differ in terms
of the level of data transformation [12].
2.1 Challenges
Besides some of the technical challenges, OSINT also brings additional ethi-
cal and legal issues. Although the collection of OSINT should be by definition
legal since only publicly accessible data are considered, the line between eth-
ical and ill-willed usage of the data is not completely clear [20]. Extremely
sensitive personal information such as sexual orientation, religion, or political
beliefs can be inferred even when these are not explicitly stated [21]. The com-
bination of multiple sources and the use of state-of-the-art techniques that
derive valuable information is what elevates mere data into a powerful tool.
According to some, the collection of OSINT is not much different from
someone reading a newspaper since the information is public in both cases.
However, it is the institutionalization of OSINT that raises concerns even
within the intelligence community [22]. As is apparent from leaks of classified
documents such as those released by Edward Snowden, the mass surveillance
performed by governments is not only focused on certain suspicious individ-
uals but rather omnipresent. The intelligence agencies of the world’s biggest
economies, such as the US, have enough resources to find virtually anything
about individual people the Internet has to offer [1].
The biggest issue that arises with such powerful knowledge is the po-
tential harm to the targeted individuals [23]. The EU’s General Data Protec-
tion Regulation (GDPR) and other emerging privacy regulations are an incen-
tive for companies and individuals to carefully handle data to prevent any
direct or indirect leakage of personal information [24]. According to GDPR,
different pieces of information that together identify a specific person also
constitute personal data. This creates a non-trivial task to handle for com-
panies that sell pseudonymized data, as irreversible anonymization remains
an unresolved issue. Having regulations such as GDPR only solves a part of
the problem since users can voluntarily publish their personal information.
Nonetheless, the push for better privacy of Internet users could potentially
decrease the value of OSINT.
Both GDPR and Privacy by Design [25], an essential concept from GDPR
addressing users’ privacy, are also partially applicable to OSINT (or any data
collection process). By adhering to these principles, the OSINT investigators
can perform the tasks at hand as ethically and safely for the targeted indi-
viduals as possible in the given scenario. The principles can be summarized
as follows [12]:
Although there are some challenges when utilizing OSINT, the sheer amount
of information available on the Internet makes OSINT a viable solution for
national security, cyber threat intelligence, and many other fields. The incre-
mental improvements in the performance of computers over the past decades
have made OSINT increasingly relevant. However, it was the emergence of
big data [29] and machine learning that made huge amounts of data available
at a much more rapid pace and with more valuable information extracted
from it. These are described later in Section 2.3.
With the widespread adoption of social media networks, the availability
of private personal information has increased considerably. Combined with
the fact that many users are unaware that large portions of the informa-
tion they share could be accessible to anyone, social media in particular are
a goldmine of OSINT [30]. Data from these websites can be used for market
research and other business-related use cases, but also for malicious activities,
such as phishing [31]. Companies also need to be aware of the information
that can be found about them publicly, as possession of a complete collec-
tion of such information could give a potential attacker a lot of valuable
knowledge as to how the company could be exploited.
The dark web offers strong anonymity and privacy guarantees, which attracts
users who wish to participate in illegal activities [32]. This fact alone makes
the dark web a very fruitful source of information, especially for govern-
ment agencies fighting against crime. The strong anonymity comes hand
in hand with the necessity to use specialized software to collect data from
the dark web [33], such as search engines and web crawlers that are able to
index hidden services. A framework called BlackWidow proposed by Schafer
et al. [34] brings together various tools for collection and analysis of the con-
tent to gather information related to cybersecurity and fraud monitoring.
There is a body of research focusing on the detection of activities of interna-
tional terrorist groups on the dark web and the education of agencies fighting
against these groups [35].
As shown in Figure 2.2, most of the use cases of OSINT can be divided
into three main categories – detection of organized crime, cybersecurity, and
social media and sentiment analysis. Many niche use cases do not necessar-
ily fall within any of these categories, such as competitive intelligence used
by companies to research potential markets for their products or cybersecu-
rity from the attacker’s perspective. Therefore, this thesis generalizes these
categories to military intelligence, cybersecurity from the viewpoint of both
attackers and defenders, and social and business intelligence.
Figure 2.2: The three main types of use cases for OSINT [36].
2.2.2 Cybersecurity
The world of OSINT provides a lot of valuable knowledge for companies con-
cerned about the security of their infrastructures, such as information about
the ever-evolving landscape of cybersecurity threats, details about new vul-
nerabilities that are constantly getting discovered, or reports about recent
security incidents. All these pieces of information constitute a goldmine of
intelligence that anyone can easily and freely utilize. Just as companies can
employ OSINT to get a better understanding of potential threats, the at-
tackers might use the publicly available information to explore the Internet
presence of the companies, find any possible loopholes for exploitation, or
shape a strategy for a phishing campaign [43].
Chapter 3 discusses some existing tools suitable for these tasks, for exam-
ple, tools that determine what software a website is using, which ports are
open on a particular IP address, including the services running on them, or
people and e-mail addresses associated with a company. Edwards et al. [44]
demonstrated that a large-scale collection of information required for a social
engineering campaign could be carried out completely automatically with no
active communication with the targets. They did so by gathering contact in-
formation of all employees publicly affiliated with a company, tracking down
other employees through social media networks, obtaining their personal in-
formation, and so forth.
Hayes and Cappa [45] performed a thorough evaluation of the critical
infrastructure of a company operating in the U.S. electrical grid using only
publicly available data. They created a complete overview of the infrastruc-
ture, including specifics about the hardware and software used on various ma-
chines, outlined potential vulnerabilities within the infrastructure, and dis-
covered the company’s employees, including their e-mail addresses. The au-
thors conclude that a continuous collection of OSINT targeted at a particular
company could provide attackers with powerful knowledge, and companies
should pay more attention to what information about them is publicly ac-
cessible. Cartagena et al. [46] performed a similar analysis and were able to
achieve comparable results.
Tanaka and Kashima [47] propose a URL blacklist based solely on OS-
INT, and they show that 75% of the blacklist’s values are unknown to Google
Safe Browsing. Additionally, 23% of the malware used in these URLs is also
unknown. Quick and Choo [48] incorporate OSINT in digital forensic analysis
to add value to the data and aid with timely extraction of required evidence.
Vacas et al. [49] propose an automated approach for the collection of new
knowledge from OSINT for detection rules in intrusion detection systems.
The method was tested on real-world network traffic, proving it can detect
malicious activities within a network. Lee et al. [50] combine OSINT with
events detected by security devices to improve the knowledge of potential
threats.
How the general population feels about specific topics has always been an im-
portant aspect when designing marketing campaigns, deciding how a product
should be built to suit its users’ needs, or even formulating a manifesto be-
fore an election. With the recent advent of sentiment analysis algorithms
that can evaluate users’ opinions just from their posts, OSINT has become
an attractive source of information for social and business intelligence. Data
from social media networks, discussion forums, and other websites where
users often share opinions on different topics can be collected and evaluated
to get a grasp of the overall public opinion [51].
Neri et al. [52] performed a sentiment analysis of 1,000 news articles related
to a public scandal of the former Italian Prime Minister Silvio Berlusconi.
The primary goal was to detect whether there was a coordinated press cam-
paign by evaluating the time and geographical distribution of the articles and
the proportion of positive and negative opinions. Fleisher [53] lays out how
OSINT affects competitive and marketing intelligence, details the biggest
challenges when utilizing it, and outlines the best practices for successful
utilization of public sources.
2.3 State-of-the-Art
The reliability of the CTI feed is determined by its independence and error, and it
is continuously updated with new values to reflect the current situation.
tions pinned directly in the posts, places mentioned in the posts themselves,
and geolocations found by CLAVIN (Cartographic Location And Vicinity
INdexer) [70], a tool that uses NLP and ML. The combination of user re-
lationships and geolocation seeds is used to determine past, present, and
possibly future physical locations of the users. The achieved accuracy is over
77%.
Ranade et al. [71] utilize deep learning for the translation of CTI data
from different languages to English. The translation is optimized specifically
for cybersecurity-related data by creating translation mappings for all key-
words that are associated with cybersecurity. Data collected from public
sources such as Twitter that are in other languages are preprocessed using
NLP and a translation framework that adopts deep learning, and the trans-
lation mappings translate all relevant parts of the data. Once the data are
translated to English, they can be fed to CTI systems, thus broadening
the amount of gained CTI. The system focuses on Russian but can be ex-
tended for other languages as well by simply creating mappings for the re-
spective language.
Alves et al. [72] present a framework for the collection and classifica-
tion of Twitter posts. The main goal is to gather security-related informa-
tion from these posts and provide them to Security Information and Event
Management (SIEM) systems, which employ event data to handle the se-
curity management of an organization. The premise is that many security
experts use Twitter to post short messages about security-related news in
near real-time. These messages are normalized, the features are extracted,
and the messages are classified by a supervised ML model and clustered.
Each cluster is analyzed using named entity recognizers [73], and the crucial
components (attack, vector, target) are retrieved.
Deliu et al. [74] investigate how ML and Neural Networks may help
with the collection of CTI from hacker forums and other social platforms.
Specifically, they employ supervised ML and Convolutional Neural Networks
(CNN) [75] to classify the posts from these platforms. The results show that
the ML classifier performs at least as well as the CNN one, with the accuracy
of both being approximately 98%. As CNN classifiers tend to be rather com-
plex and expensive when used in practical scenarios, having ML classifiers
with the same accuracy might increase the chances of companies utilizing
classifiers for CTI.
Mittal et al. [76] present a system for extraction and analysis of cyber-
security information collected from multiple sources, including national vul-
nerability databases, social networks, and dark web vulnerability markets.
The system uses NLP to remove unnecessary parts of the textual data and
3 OSINT Sources and Tools
This chapter introduces various sources that can be used to obtain
data for OSINT. These are described in Section 3.1, ranging from simple
tools giving some specific information about the queried keyword, such as
geolocation of an IP address or a list of accounts associated with a username,
to more sophisticated software that employs non-trivial algorithms, such
as Shodan [4] or Darksearch [77]. Tools for OSINT automation also exist,
utilizing many available sources and using the results to search for additional
information. A few of these are described later in Section 3.2.
There are books and articles discussing various OSINT tools and prac-
tices in more detail. Chauhan and Panda [78] explain all the theory behind
OSINT and describe some of the tools from this chapter in more detail, in-
cluding instructions on how to use them. Revell et al. [79] establish a frame-
work for the assessment of OSINT tools and best practices for their usage.
The OSINT Handbook [80] provides an exhaustive list of all available OSINT
tools and resources.
Many websites aim to make the process of OSINT more organized and
methodical. The OSINT Framework [81] (illustrated in Figure 3.1) provides
a broad overview of OSINT-related tools that are either completely free or
offer a limited usage for free. Your OSINT Graphical Analyzer (YOGA) [82]
is a simple flowchart showing what a piece of information can be trans-
formed into or used for, for example, how an IP address can be used to find
other relevant data, such as a domain name. OSINT Open Source Intelligence
Framework [83] is similar to the OSINT Framework but goes a bit further
by adding educational resources, listing notable companies and people who
contribute to the OSINT realm, and much more.
3.1 Overview
Real Names Gathering information about people using their real names
depends on their country of origin, the uniqueness of their name, and knowl-
edge of other related information, such as an address or date of birth.
Many countries keep records of various public information, including prop-
erty ownership, criminal activity, weddings, births, and deaths. Each country
has different rules and laws for what is considered public information and
what is kept secret. There are tools aiming to automatically collect data
from many public sources to find as much information about an individual
as possible. Pipl [84] is an identity resolution engine that tracks online identities.
Figure 3.1: The OSINT Framework website [81] showing a structured view of
free OSINT-related tools. These are divided into categories based on the type
of information they provide.
ited domains and IP addresses), saves the resources of these domains, takes
a screenshot, and so forth.
Metadata Metadata are very helpful for the management of files on a com-
puter. They carry a considerable amount of information about the file and
can often reveal a lot. Images can yield information about the camera that
took the picture, the date it was taken, and sometimes even the location. Dif-
ferent documents, such as PDF files, can contain information about the au-
thor or the system used to create it. FOCA [127] and Metagoofil [128] can
find all files of a specific type on a given domain, obtain metadata from a doc-
ument or a website, and find any similarities in metadata of multiple files.
ExifTool [129] is an offline tool for image metadata extraction, while Google
Images [130] can perform a reverse image search to find related images and
websites containing this image.
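As an illustration of how such metadata can be collected programmatically, the following Python sketch wraps the ExifTool command-line utility. It assumes exiftool is installed and available on the PATH; the sketch is an example rather than part of any tool listed above.

```python
import json
import subprocess


def image_metadata(path: str) -> dict:
    """Run ExifTool on a file and return its metadata as a dictionary."""
    output = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    # exiftool -json prints a JSON array with one object per processed file
    return json.loads(output)[0]
```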
Pastebin is a popular type of service used to store and share plain text,
for example, source code or any other text that is formatted or is too long to
share directly through a messaging application. Pastebins are usually public,
and anyone can access the text shared by other users. Since some users are
not aware of this or do not consider it as a problem, they might use this
service to share private information. Additionally, pastebins are often used
to share private information on purpose, such as database leaks, meaning
that they can be a good source of information for an OSINT investigation.
There are multiple ways data from pastebins can be gathered. PasteLert [132]
is a service that sends an e-mail whenever a search term appears in a new
paste. Sniff-Paste [133] scrapes pastebins, stores them in a database, and
searches for noteworthy information. There are also the so-called pastebin
dumps that provide all pastes in one place.
Even with the broad selection of OSINT tools, searching for valuable infor-
mation about the target can be overwhelming. To overcome this obstacle
and make things easier for the OSINT investigator, proprietary products
such as Intelligence X [141] and ShadowDragon [142] exist. Intelligence X
3.2.1 Recon-ng
Additionally, Recon-ng has a simple web interface. All the collected
data are stored in an SQL database and managed through the db command.
The database looks at all the information stored there as a potential new in-
put for subsequent data gathering. Snapshots of the database can be created
for simple data recovery in case of a failure. To present the results in a human-
readable format, CSV and HTML files can be generated. Workspaces create
the possibility to have multiple environments with independent configura-
tions and database instances, allowing the user to switch between them as
needed. Recon-ng can be run automatically through a so-called resources file
containing all the commands for the framework to run.
The modules are defined by an abstract class with a well-defined inter-
face and some accessory functions from which new modules need to inherit.
All the modules are available in a place called Recon-Ng Marketplace [145],
which is an independent GitHub repository. As of now, the Marketplace con-
tains around 100 different modules. Third-party modules that are not part
of the Marketplace can also be used. These are loaded directly from a local
Recon-ng directory. The abstraction allows the users to utilize any infor-
mation source by simply creating a wrapper class around that source since
the framework only requires the class to implement the interface. The mar-
ketplace and modules commands are the entry points for the administration
of the modules. Each module can define the type of input it takes, and
the database is scanned to search for any data of this type that could be
used as a new input for the module.
3.2.2 Maltego
data types, like documents and social networks. Similar to Recon-ng, entities
can be used as an input for further data collection. Entities are visualized as
nodes, and when a connection is found, they are clustered into networks
of nodes. Figure 3.2 shows an example of a graph generated by Maltego.
3.2.3 SpiderFoot
SpiderFoot [149] is another open-source framework for OSINT automation.
Like Recon-ng and Maltego, it is designed to be highly modular and provide
all the necessary functions for data manipulation and storage. A significant
advantage of SpiderFoot is the number of modules it implements, which is
more than 170. It has both a command-line interface and a web interface.
The starting points SpiderFoot can scan are domain names, IP addresses,
hostnames/subdomains, subnets, ASNs, e-mail addresses, phone numbers,
and human names. The features of the paid version SpiderFoot HX that are
not available in the open-source version include Tor browser integration for
deep web scanning, multi-target scanning, continuous monitoring with alerts
and e-mail notifications, and a correlation engine that looks for anomalies
and other notable results.
SpiderFoot’s web interface provides an easy way to configure the app and
the modules, add API keys, choose what modules to use for a scan, debug,
and visualize the results in the form of a table and a graph. The graph
representation is similar to the one in Maltego since it shows results as
nodes and displays relationships by clustering them. Selected results can be
marked as false positives, which also marks child elements and deletes them
from the graph. SpiderFoot HX can run the data collection step-by-step to
inspect how each result is discovered. Figure 3.3 shows the web interface of
SpiderFoot.
4 Pantomath: Tool for Automated OSINT Collection
The main goal of this thesis was to implement a tool for an automated collec-
tion of open-source intelligence. Pantomath¹ is a highly modular framework
that provides a complete environment for collecting and evaluating OSINT
about IP addresses, e-mail addresses, and domain names. The framework
implements all the functionality required throughout the process of OSINT,
but separate modules perform the collection itself. New modules can be inte-
grated by merely implementing a well-defined interface. Some of the gathered
data are evaluated in terms of their reliability. Finally, all the results are
presented in a structured output.
The rest of this chapter is organized as follows. Section 4.1 defines the prob-
lem at hand, describes some of the main challenges, and how Pantomath
strives to solve them. Section 4.2 outlines the high-level architecture of
the tool and the functionality it provides. Section 4.2.3 goes into more detail
about the implemented modules, and finally, Section 4.3 establishes a model
for reliability estimation of some of the modules. Section 3.2 from the previ-
ous chapter describes existing tools with similar objectives, and Section 5.2
from the following chapter compares them to Pantomath, states the major
differences, and discusses the advantages and disadvantages of each.
There are many ways a tool that automatically collects OSINT could be
implemented since OSINT is an extensive topic, and the amount of avail-
able data is enormous. The collection of OSINT also poses many challenges,
many of which were discussed in Section 2.1. If the user aims to find specific
information, such as social media network contacts and posts of a particular
person or the footprint a company has on the Internet, various advanced
methods specializing in these tasks can be utilized. Section 2.3 describes
some of the more innovative approaches aiming to tackle specific OSINT
challenges and tasks. Using social media networks as an example, one could
take advantage of web scraping, sentiment analysis, machine translation, and
deanonymization to create a complete profile of various users of these social
media networks.
However, each problem the user might need to solve requires a differ-
ent approach, and building a state-of-the-art tool for each of these would
1. Pantomath is an English word for a person that wants to know and knows everything.
be a laborious task. Instead, existing tools and services that already imple-
ment non-trivial gathering of OSINT can be utilized. Chapter 3 provides
an overview of such sources. Therefore, the objective of Pantomath is not
to actively collect data and transform them into valuable information, but
rather to automate the collection of OSINT by employing the existing tools.
Since the number of possible sources is immense and new ones can appear
in the future, one very desirable feature for the tool is an easy integration of
additional sources.
This approach brings new disadvantages that have to be taken into con-
sideration. By using existing services, Pantomath relies on their correct and
continuous operation and needs to address any changes that break the cur-
rent implementation. These services are also often built into commercial
products and, as such, provide only a limited usage for free or no free ver-
sion at all. For some types of information, no free service exists, meaning
that an API key needs to be bought or the information is
not available. One example of such a tool is BuiltWith [111], which analyzes
the technology stack of a website, thus providing very valuable information
about a domain. However, the price for access to BuiltWith’s API starts
at $295 per month. Ultimately, the selection of which paid services to choose
heavily depends on the use case.
One of the most significant challenges of OSINT that has not been ad-
dressed by any of the existing OSINT automation tools² is the evaluation of the re-
liability of the collected information, or in other words, getting as close to
Validated Open Source Intelligence described in Chapter 2 as possible. Val-
idated OSINT is defined as OSINT with a high degree of certainty, which
is very hard to achieve without a certain level of human involvement. Any
OSINT investigation will eventually require a person with some knowledge
of the context of the investigation and OSINT itself to evaluate the gathered
data. The goal of Pantomath is not to provide any guarantees of the informa-
tion reliability but rather to allow the investigator to make more informed
decisions by producing a reliability estimate.
Collecting OSINT is usually not just about finding information about
one particular target but rather a repetitive cycle where follow-up searches
are refined based on information discovered in the previous iterations. By
performing these sequential searches, relationships between different targets
can be revealed. Similarly to the information itself, the reliability of the dis-
covered targets and their relationships with other targets can be estimated.
2. To the best of our knowledge, no existing tools provide an automatic reliability estima-
tion.
The architecture of Pantomath is separated into two main parts. The base
framework provides a complete environment for automated collection of
OSINT, with independent modules performing the data gathering itself.
The framework defines an interface that each module needs to implement,
and the interface is described in more detail in Section 4.2.3. Search queries
are narrowed down to simple keywords representing entities on the Internet.
Currently, the target can be an IP address, a domain name, or an e-mail
address, but the selection can be easily extended with user names, phone
numbers, Bitcoin addresses, and other identifiers.
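To make the module interface more concrete, the following Python sketch shows one possible shape of a module and of the targets it exchanges with the framework. It is illustrative only: the class names, fields, and method signatures are assumptions made for this sketch, not Pantomath's actual interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Target:
    value: str               # e.g. "fi.muni.cz", "147.251.48.1", or "user@example.com"
    kind: str                # "ip", "domain", or "email"
    depth: int = 0           # distance from the initial target in the search tree
    reliability: float = 1.0


@dataclass
class Result:
    data: dict                                        # module-specific findings
    new_targets: list = field(default_factory=list)   # Targets for follow-up searches


class Module(ABC):
    """Interface that every data-collection module implements (illustrative)."""

    accepts = ("ip", "domain", "email")  # target kinds the module takes as input
    requires_api_key = False
    provides_offline_data = False

    @abstractmethod
    def query(self, target: Target) -> Result:
        """Collect data about the target and return findings plus any new targets."""
```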
All results that might be used as new targets, i.e., any of the IP addresses,
domains, and e-mail addresses related to the target, are added to a pool and
used for follow-up searches. The decision of what should be considered as
a new target is left for each module. Each new target keeps track of how it was
discovered, including the target used in the previous query, the relationship
between these targets, and the module that disclosed it. Therefore, a kind
of search tree forms, where the initial target serves as the root of this tree
and has a depth of 0. Figure 4.1 shows an example of a search tree with
the domain fi.muni.cz used as the seed. The nodes represent newly found
targets, and the edges represent the discovery by various modules.
Each target discovered when querying the initial value is a child node
of the root, has a depth of 1, and is queried the same way as the initial
value. Again, new targets are extracted from the results. The whole process
is repeated until the specified maximum depth is reached. The depth can
be specified before the query is started and increased if necessary. Generally,
the deeper the target is, the weaker the relationship is with the initial target,
and for some better-known targets, there might be hundreds of new targets
even at the first level.
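A minimal sketch of how the framework could expand the pool of targets breadth-first up to a configured depth is shown below, reusing the Target and Module sketches above. The real implementation may differ in details such as deduplication, error handling, and how results are recorded.

```python
from collections import deque


def run_search(seed, modules, max_depth):
    """Expand the pool of targets breadth-first up to max_depth (illustrative)."""
    pool = deque([seed])
    results = {}              # target value -> list of Results
    seen = {seed.value}

    while pool:
        target = pool.popleft()
        for module in modules:
            if target.kind not in module.accepts:
                continue
            result = module.query(target)
            results.setdefault(target.value, []).append(result)
            if target.depth >= max_depth:
                continue      # targets at the maximum depth are not expanded further
            for new in result.new_targets:
                if new.value not in seen:
                    seen.add(new.value)
                    new.depth = target.depth + 1
                    pool.append(new)
    return results
```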
3. Onion (or hidden) services are anonymous services only reachable through the Tor
network.
A configurable delay prevents the framework from sending too many requests in a short period of time to the same service and getting banned.
The delay is specified in the configuration file, and it can be different for
each service. Moreover, the framework contains functions for extraction of
IP, e-mail, and Bitcoin addresses from blocks of data and validation of these
values, including a function that attempts to fix invalid e-mail addresses (e.g.,
by removing trailing characters that are added to disguise the addresses and
make it harder to extract them).
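As an illustration of the extraction and fixing helpers described above, a simple regex-based sketch might look as follows. The pattern is deliberately simplified and is an assumption of this sketch, not the framework's actual implementation.

```python
import re

# Deliberately simple pattern; real-world extraction needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list:
    """Return all e-mail addresses found in an arbitrary block of text."""
    return EMAIL_RE.findall(text)


def fix_email(candidate: str) -> str:
    """Drop trailing characters appended to disguise an address, e.g. 'user@example.com--xyz'."""
    match = EMAIL_RE.search(candidate)
    return match.group(0) if match else ""
```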
The database software used in the offline mode described later in Sec-
tion 4.2.2 is PostgreSQL [151], an object-relational database system allow-
ing a great deal of flexibility for the schema by having a wide variety of
objects it can store. The framework provides all the necessary functions for
the database management, such as for table creation and data storage and
retrieval, and leaves the management itself to each module. This gives each
module enough flexibility in how it stores its data and how the data are re-
trieved. Specifically, each table within the database stores three attributes:
• the lookup key,
• the timestamp of when the item was added,
• a JSON object storing all the additional information.
In most cases, the lookup key is just a string representing the retrieved
target (e.g., a specific IP or e-mail address), but it is also possible to use
a CIDR block as the key. In that case, the target IP address can be searched
by checking whether it belongs to some of the CIDR blocks. Since the mod-
ules manage the data storage and retrieval, they can use the lookup key
as a regular unique ID and implement some more complex data indexing.
The timestamp is used to discard items that are expired, and it is managed
by the framework. This feature can be disabled, and the expiry time can be
specified in the configuration file.
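The following sketch illustrates the described storage layout with a hypothetical table and a CIDR-based lookup helper. The table name, column names, and functions are assumptions; the actual schema used by the modules may differ.

```python
import ipaddress

import psycopg2
from psycopg2.extras import Json

# Hypothetical table for one module, mirroring the three attributes described above.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS example_module (
    lookup_key TEXT PRIMARY KEY,
    added      TIMESTAMP NOT NULL DEFAULT NOW(),
    payload    JSONB NOT NULL
);
"""


def store(conn, key: str, payload: dict) -> None:
    """Insert or refresh one record keyed by the lookup key."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO example_module (lookup_key, payload) VALUES (%s, %s) "
            "ON CONFLICT (lookup_key) DO UPDATE "
            "SET payload = EXCLUDED.payload, added = NOW()",
            (key, Json(payload)),
        )
    conn.commit()


def matches_cidr(ip: str, cidr_key: str) -> bool:
    """Check whether a target IP address falls inside a CIDR block used as a lookup key."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr_key, strict=False)
```

Refreshing the timestamp on conflict, as in the sketch, keeps the expiry mechanism described above working when offline data are re-imported.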
Pantomath can be used either through an API or a command-line inter-
face. The framework is also built to be easily extendable with other inter-
faces, such as a web interface, a custom API, and so forth. The command-
line interface provides the commands listed in Table 4.1. The update com-
mand updates the offline data for all modules. The query command queries
the specified target in all modules, stores the results, and adds any new tar-
gets found by the modules to a pool used for follow-up searches. The overt,
stealth, and offline commands indicate which mode of operation should be
used. The query results can be exported to a JSON file or printed out to
the standard output in a structured form using the export and print com-
mands.
Command Description
exit Exit the CLI
export Export the results into a JSON file
help Display a menu with descriptions of all available commands
modules Display a list of loaded modules with a description
offline Switch to offline mode: only offline sources are queried
overt Switch to overt mode: all sources are queried
print Print the results in a structured form
query Query the target in all available modules
stealth Switch to stealth mode: Tor is used to fetch the data
update Update the offline data in all available modules
enables the stealth mode by integrating the Tor network. In this mode, all
requests sent to any of the used services go through the Tor network, making
it much more difficult to trace the request to the machine where Pantomath
is running. The use of Tor is just an illustration of how the stealth mode can
be implemented. Besides Tor, the fetch_url function could be extended to
utilize custom proxies or other forms of anonymization, or the tool could be
deployed on a cloud infrastructure.
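A minimal sketch of how the fetch_url function could route requests through Tor is shown below. It assumes a local Tor client listening on the default SOCKS port (9050) and the requests library installed with SOCKS support (PySocks); the actual implementation may differ.

```python
import requests

# "socks5h" makes Tor resolve DNS names, so lookups do not leak locally.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


def fetch_url(url: str, stealth: bool = False, timeout: int = 30) -> requests.Response:
    """Fetch a URL, optionally routing the request through the Tor network."""
    proxies = TOR_PROXIES if stealth else None
    return requests.get(url, proxies=proxies, timeout=timeout)
```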
Although the stealth mode supports the use of all modules, one important
consideration is the use of API keys. For the services that require an API
key, all requests can be easily correlated to the user even though the requests
are sent through intermediary nodes since API keys are generally tied to
a registered account. This obstacle could be bypassed by registering accounts
using only anonymous e-mail accounts and fake information. However, many
services employ non-trivial protection against this technique.
The offline mode takes the anonymity a step further by performing
queries with no access to the Internet. Instead, all data available as a whole
are downloaded, parsed, and stored in the database in advance using either
overt or stealth mode. Once the database contains fresh values, connection
to the Internet is no longer necessary, and the offline mode can be acti-
vated. Each query checks whether anything related to the target is stored
in the database. By downloading data as a whole and not requesting in-
formation for various targets separately, the data provider or anybody who
observes what was downloaded only knows about the possession of this data
and not about what exactly the data are used for. Preprocessing the data in
advance and storing them in the database also brings additional performance
advantages whenever multiple queries are executed. The data are parsed only
once, and all subsequent queries just check the database, which is a much
faster operation.
From the list of modules providing some offline data in Section 4.2.3,
it is apparent the selection is relatively small. In general, services rarely
offer complete access to their data for free, and usually not even as a paid
service. To have access to more data in the offline mode, the functionality
provided by the existing services queried by Pantomath would need to be
implemented within the tool. As discussed in Section 4.1, this would be
a very laborious task entirely out of the scope of this thesis. However, for
some of the modules, open-source tools providing similar functionality exist,
meaning that Pantomath would only need to implement the continuous data
collection. One example of such a tool is MASSCAN [115], which provides
information about open ports of IP addresses similarly to Shodan [4].
4.2.3 Modules
The implemented modules collect information about IP addresses, domain
names, and e-mail addresses. Table 4.2 shows all the implemented modules,
which target types they take as an input, and whether they require an API
key. The remainder of this section describes the modules in more detail.
Table 4.2: List of implemented modules with information about which types
of targets they take as an input, whether they provide offline data, if an API
key is required, and if the module can find new targets for follow-up searches.
• ThreatCrowd [108]
• VirusTotal [166]
• AlienVault OTX [107]
• MetaDefender [109]
• Shodan [4]
• Spyse [112]
• Censys [113]
darkweb This module parses a CSV file containing data scraped from
the dark web and used for categorization of the websites in [174]. Each
entry in the file contains a link to the website, its content, possible locations
resolved from the content using CLAVIN [70], and the website’s category.
The module looks for any IP addresses and domain names in the website’s
content, and for each discovered IP or domain, the whole entry is saved into
the database (i.e., if no IP or domain is found in the content, the entry is
skipped). This module only illustrates how some of the library functions can
be used when large volumes of data are processed because all of the steps to
create the dataset need to be performed manually.
dns This module resolves the domain or IP address using Google DNS [176].
The answer is added as a new target for possible follow-up searches.
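For illustration, such a lookup could be performed against Google's public DNS-over-HTTPS JSON API, as sketched below; whether the module uses this exact endpoint is an assumption of the sketch.

```python
import requests


def resolve(name: str, record_type: str = "A") -> list:
    """Resolve a name via Google's DNS-over-HTTPS JSON API and return the answers."""
    response = requests.get(
        "https://dns.google/resolve",
        params={"name": name, "type": record_type},
        timeout=10,
    )
    response.raise_for_status()
    return [answer["data"] for answer in response.json().get("Answer", [])]
```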
pgp This module searches for the domain name or the e-mail address in
PGP public key servers, namely The.Earth.li [182] and Key-Server.io [183],
which is used if the first server is not responsive. If anything is found for
the queried domain, all e-mail addresses associated with the domain are
retrieved and added as new targets. This module is somewhat fragile, as
both of these websites are sometimes not accessible.
including all domains sharing these IDs. These domains are again used as
possible new targets. The service requires an API key, with the free version
providing 10000 queries per month. Three paid versions are offered with
prices starting from $6 per month.
torexits This module downloads and parses a list of Tor exit nodes main-
tained by Torproject.org [186].
urlscan This module searches the domain at urlscan.io [100]. The service
visits the specified URL and records all activities happening during this pro-
cess, such as which domains and IP addresses were visited. These domains
and IP addresses are considered for follow-up searches. The results also in-
clude all resources of these domains, a screenshot, and much more. Based on
the configuration, either only links to the detailed results are attached
to the results, or the detailed results are fetched and included.
whois This module looks for the domain’s whois data on Whois XML
API [162]. The service requires an API key, where the free version provides
500 credits (searches) per month. Their database can also be downloaded,
with options to download 1 million entries for $240 or the whole database
for an undisclosed price.
whois_reverse This module looks for reverse whois data (i.e., informa-
tion about domains registered with the e-mail address) either on Whoxy [187]
or on Whoisology [188] if Whoxy is not available or does not return any
results. Both services require an API key. Search credits for Whoxy can be
bought with prices ranging between $4 and $8 per thousand credits based on
the number of credits bought. Whoisology costs $50 per month with a maxi-
mum of 2500 credits and $35 for every additional 2500 credits. The domains
that are associated with the e-mail address are added to the pool of new
targets.
Besides the collection of OSINT itself, one of the requirements for Pantomath
was to provide an estimation of the reliability of the collected data. As dis-
cussed in Section 2.1, the best way to determine the truthfulness of OSINT
is to establish the reliability of the sources the information was retrieved
from and to use multiple sources providing the same type of information
and compare the results [18]. Additionally, having context and query-specific
information is essential to avoid collecting information not relevant to the in-
quiry [19]. Pantomath narrows the inquiries down to simple keywords and
uses tools that provide a specific type of information about the given key-
word, such as a geolocation of an IP address. Therefore, the collected infor-
mation is implicitly relevant to what the user is looking for.
It is important to note that the reliability estimation does not necessarily
make sense for all the results. For example, the torexits module downloads
a list of Tor exit nodes directly from the Tor project website. Although
it is possible to obtain this information from other sources and compare
the results, it could be argued the fact that the Tor project itself provides it
gives enough confidence the result is correct. Another example of such a case
is the dns module, where multiple servers could be queried and the answers
compared. Nonetheless, an established DNS server, such as the one by Google
used in the module, has enough credibility to be trusted.
Another factor to consider is the need for multiple sources. As discussed
in Section 4.1, many services are either paid or provide only a limited num-
ber of free requests, meaning that using multiple sources for the reliability
estimation can significantly increase the cost if some or all of them are paid.
One of the cases where this applies is the whois_reverse module because
the vast majority of reverse whois APIs do not offer any free queries. Addi-
tionally, some services provide information that is too unique to be validated
as multiple services would need to be combined to produce the information,
such as the urlscan module, or there might not be other services providing
it at all.
Pantomath provides a reliability estimate for each new target that is
discovered during the search. The seed target passed to the CLI has the re-
liability set to 100%. The reliability of each new target is computed using
the previous target’s reliability and the reliability multiplier of the module
that discovered it. The multipliers are specified in the configuration file, and
they can be different for each module. Currently, all modules use the same
default multiplier, which is equal to 0.8. Figure 4.3 shows an example of
a search tree, including the reliability estimates of all targets. As the multi-
pliers are the same for all modules, targets with the same depth have equal
reliabilities, i.e., 80% for level 1, 64% for level 2, 51.2% for level 3, and so
forth.
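The propagation of reliabilities down the search tree can be expressed in a few lines. The following sketch, with assumed function names, reproduces the default behaviour described above.

```python
DEFAULT_MULTIPLIER = 0.8


def target_reliability(parent_reliability: float, module: str, multipliers: dict) -> float:
    """Reliability of a new target: the parent's reliability times the module multiplier."""
    return parent_reliability * multipliers.get(module, DEFAULT_MULTIPLIER)


# With the default multiplier everywhere, reliabilities follow the depths described above:
seed = 1.0                                               # fi.muni.cz, 100%
level1 = target_reliability(seed, "dns", {})             # 0.8  -> 80%
level2 = target_reliability(level1, "threat_intel", {})  # 0.64 -> 64%
```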
By setting different multipliers, the user can control how reliabilities for
new targets are estimated. For example, the dns module might have a higher
multiplier, as the relationship between the queried target and the newly
discovered target is well-defined and generally has a high probability of being
Figure 4.3: An illustration of the search tree that forms when querying
fi.muni.cz with the reliability of each depth in red.
correct. On the other hand, modules such as darkweb or psbdmp, where new
targets are discovered by extracting e-mail and IP addresses from blocks
of data, could have lower multipliers since the relationship between these
targets is unclear, and the extraction might not be precise. With varying
multipliers, the same depth targets could have different reliabilities, and
some targets might even have lower reliabilities than targets deeper in the tree,
depending on the discovery chain. Additionally, the reliabilities could be used
as the indicator of which targets should be queried instead of the depth, as
they better represent the strength of the connection to the initial target.
The model proposed by Gong et al. [56] described in Section 2.3 provides
a systematic approach for reliability estimation of results collected from cy-
ber threat intelligence feeds. A simplified version of this model is used to cal-
culate the reliability of data in threat_intel, geolocation, and port_discovery
modules. In each module, the reliabilities of all implemented sources are
estimated, and the results for a specific target are evaluated using these esti-
mates. The initial values are set to the ones obtained in Section 5.1 and are
continuously updated after each query, meaning that each time a target is
queried in the module, the reliability is recalculated.
The values can be reset and calculated from scratch using data provided
by the user. The more values are collected and added to the reliability es-
timation, the more these estimates reflect how different sources perform in
the scenarios the user is interested in. For example, some websites used in
the geolocation module could provide accurate results for IP ranges owned by
large companies but be less precise when resolving the location of indepen-
dent addresses. If the user mostly investigates individuals, the initial data
where commercial IP ranges were also considered might distort the sources’
precision. The same goes for the port_discovery module, where the portion
of IP addresses with no open ports significantly influences the reliability of
different services.
Symbol Description
n number of CTI feeds
Fn n-th CTI feed
risk(Fi ) risk value returned by CTI feed Fi
\[ \text{weight}(F_i) = 1 - \frac{\text{independence}(F_i)}{\max_{j,k=1}^{n} \text{dist}(F_j, F_k)} \tag{4.5} \]
The risk values collected for a particular target T and the reliabilities of
the CTI feeds are used to estimate the reliability of the results, as shown
in Equation 4.7. In this case, the final value is the risk associated with
the queried target and it is attached to the results returned by the module.
\[ \text{reliability}(T) = \frac{\sum_{k=1}^{n} \text{risk}(F_k)\,\text{reliability}(F_k)}{\sum_{k=1}^{n} \text{reliability}(F_k)} \tag{4.7} \]
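For illustration, Equation 4.7 can be computed directly from the per-feed risk values and reliability estimates, as in the following sketch (function and parameter names are assumptions of the sketch):

```python
def target_risk(risks: dict, reliabilities: dict) -> float:
    """Reliability-weighted average of the risk values, following Equation 4.7."""
    feeds = [f for f in risks if f in reliabilities]
    total = sum(reliabilities[f] for f in feeds)
    if total == 0:
        return 0.0
    return sum(risks[f] * reliabilities[f] for f in feeds) / total
```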
4.3.2 Geolocation
The geolocation module uses many IP geolocation services that either return
GPS coordinates (i.e., two numbers – latitude and longitude) or an address
that is resolved to coordinates using the OpenStreetMap API [152]. The co-
ordinates provide a convenient way to compare results from different services
and estimate the reliability. This section describes how the reliability of each
geolocation service is computed and how the results for a particular IP ad-
dress are evaluated. Table 4.4 describes the symbols used in the equations.
Symbol Description
n number of geolocation sites
Sn n-th geolocation site
latn latitude resolved by the n-th site
lonn longitude resolved by the n-th site
m number of clusters
Cm m-th cluster
p set of sites in m-th cluster
Firstly, Equation 4.8 computes the distance between two sets of coor-
dinates resolved by sites S1 and S2 . As these coordinates correspond to
a point in a two-dimensional Euclidean space, it is calculated the same way
as the Euclidean distance.
\[ \text{dist}(S_i, S_j) = \sqrt{(lat_i - lat_j)^2 + (lon_i - lon_j)^2} \tag{4.8} \]
Equations 4.9 and 4.10 compute the expected coordinates, which are
equal to the average of coordinates resolved by all sites.
\[ lat_{\text{expected}} = \frac{\sum_{k=1}^{n} lat_k}{n} \tag{4.9} \]
\[ lon_{\text{expected}} = \frac{\sum_{k=1}^{n} lon_k}{n} \tag{4.10} \]
The error of the coordinates resolved by site Si is computed as the dis-
tance to the expected value Sexpected , as shown in Equation 4.11. The ex-
pected value is defined by the expected coordinates latexpected and lonexpected .
Finally, Equation 4.14 computes the reliability of site Si . Just like the reli-
ability of CTI feeds, it is inversely proportional to the error and proportional
to the weight, and the result is a fraction between 0 and 1.
\[ \text{reliability}(S_i) = \left(1 - \frac{\text{error}(S_i)}{\max_{j,k=1}^{n} \text{dist}(S_j, S_k)}\right) \text{weight}(S_i) \tag{4.14} \]
When coordinates from all sites are collected, they are clustered using
a clustering algorithm with a threshold defined in the configuration file (the
default value is set to 0.2). The clusters are mutually exclusive, and the num-
ber of clusters m can be anywhere between 1 and n. For each cluster, the ex-
pected location is computed, and the reliability of the location of cluster
Ci is equal to the sum of reliabilities of sites that constitute the cluster di-
vided by the sum of all reliabilities, as shown in Equation 4.15. In the end,
the module returns one or more pairs of coordinates and their reliabilities.
\[ \text{reliability}(C_i) = \frac{\sum_{S_j \in C_i} \text{reliability}(S_j)}{\sum_{k=1}^{n} \text{reliability}(S_k)} \tag{4.15} \]
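To illustrate the clustering step and Equation 4.15, the following sketch uses a simple greedy, threshold-based clustering of the per-site coordinates. The actual clustering algorithm used by the module is not specified here, so the grouping strategy is an assumption.

```python
import math


def dist(a, b):
    """Euclidean distance between two (lat, lon) pairs, as in Equation 4.8."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def cluster_sites(coords: dict, threshold: float = 0.2) -> list:
    """Greedy clustering of per-site coordinates; the real algorithm may differ."""
    clusters = []                      # list of sets of site names
    for site, point in coords.items():
        for group in clusters:
            if any(dist(point, coords[other]) <= threshold for other in group):
                group.add(site)
                break
        else:
            clusters.append({site})
    return clusters


def cluster_reliability(group: set, reliability: dict) -> float:
    """Equation 4.15: the cluster's share of the total site reliability."""
    return sum(reliability[s] for s in group) / sum(reliability.values())
```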
Symbol Description
n number of port discovery services
Pn n-th port discovery service
m total number of resolved ports
pm m-th resolved port
Rn set of ports resolved by n-th service
Qm set of services that resolved m-th port
qm number of services that resolved m-th port
5 Evaluation and Discussion
The reliability estimation model defined in Section 4.3 compares all sources
used in each module to compute their reliability, and the values are continu-
ously updated when new targets are queried. The estimates reflect the pre-
cision of each source for the type of data that was used for the estimation,
and more data generally yields better accuracy. The model was evaluated
using large datasets to provide base reliabilities for the user and measure
how different sources perform.
Table 5.2: The average distance between all feeds. The lower the value,
the closer the risk values of the two feeds.
The model by Gong et al. [56] compares many different features to es-
timate the reliability of the CTI feeds, such as hashes of malicious files
associated with the target or IP addresses used in the same attack. As these
values represent distinct entities that are much more comparable, the com-
parison of feeds using this model is more methodical and better illustrates
the differences between them. By using multiple features, the reliability esti-
mation in the threat_intel module would be more reliable than the current
one. However, implementing such a model is non-trivial and requires a lot of
parsing to bring various pieces of information together.
5.1.2 Geolocation
The reliability of the websites used in the geolocation module was estimated using a dataset of 1000 randomly generated IP addresses, from which all multicast, reserved, private, and loopback addresses were filtered out. The IP addresses were resolved by all geolocation services, and the values defined in Section 4.3.2 were calculated. Table 5.3 shows the measured values and the null ratio, i.e., the percentage of IP addresses for which no geolocation was resolved. The null ratio of the Utrace website was 90% due to its inconsistent operation. To determine how dependent the websites are on each other, the pairwise distances between them were computed and are shown in Table 5.4.
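A minimal sketch of how such a dataset could be generated with Python's standard ipaddress module follows; the exact procedure used for the evaluation is not specified here, so the function below is only an illustration under that assumption.

import ipaddress
import random

def random_public_ipv4(count=1000, seed=None):
    """Draw `count` random IPv4 addresses, skipping multicast, reserved,
    private, and loopback ranges."""
    rng = random.Random(seed)
    addresses = []
    while len(addresses) < count:
        candidate = ipaddress.IPv4Address(rng.getrandbits(32))
        if (candidate.is_multicast or candidate.is_reserved
                or candidate.is_private or candidate.is_loopback):
            continue
        addresses.append(str(candidate))
    return addresses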
A group of five websites (FreeGeoIP, IPdata, Geoplugin, IPlocate, and Maxmind) all have relatively small mutual distances. These are
Website    | Null ratio | Independence | Error  | Weight | Reliability | Adjusted rel. | Known
FreeGeoIP  | 0.3%       | 7.2528       | 5.8982 | 0.6520 | 0.5056      | 0.4822        | 5.5557
IPdata     | 0.2%       | 6.8334       | 5.4831 | 0.6651 | 0.5208      | -             | 1.3827
Extreme-IP | 0.5%       | 10.2773      | 8.8586 | 0.5923 | 0.4419      | 0.4493        | 8.9512
Geoplugin  | 0.4%       | 7.1811       | 5.8395 | 0.6539 | 0.5068      | 0.4846        | 10.5209
IPregistry | 0.1%       | 6.9512       | 5.5843 | 0.6732 | 0.5243      | 0.5245        | 1.7526
IPlocate   | 0.2%       | 6.8296       | 5.4639 | 0.6581 | 0.5125      | 0.4879        | 4.6368
IPinfo     | 0.2%       | 7.7042       | 6.2717 | 0.6090 | 0.4582      | 0.4642        | 0.7729
IPwhois    | 0.2%       | 10.8974      | 9.2205 | 0.5631 | 0.4158      | 0.4231        | 9.0029
IPify      | 0.1%       | 7.4466       | 6.0745 | 0.6340 | 0.4818      | 0.4926        | 4.8343
IP-API     | 0%         | 7.1182       | 5.7016 | 0.6669 | 0.5182      | 0.5200        | 0.7706
IPgeoloc   | 0%         | 9.4214       | 7.8840 | 0.5494 | 0.3997      | 0.4110        | 8.7674
WhoisXML   | 0%         | 7.4479       | 6.0814 | 0.6174 | 0.4693      | 0.4684        | 3.3321
Maxmind    | 0.2%       | 6.8275       | 5.4938 | 0.6641 | 0.5195      | 0.4878        | 1.1600
Utrace     | 90%        | 5.9583       | 5.1341 | 0.6182 | 0.4658      | -             | 37.4484
Table 5.3: Measurements of the values defined in Section 4.3.2. The column Adjusted rel. contains the reliabilities when IPdata and Utrace are removed from the computation. The column Known represents the average distance between the results given by each service and the known location for a particular IP address. The lower the values in this column, the closer the results are to the known locations.
Website FreeGeoIP IPdata Extreme-IP Geoplugin IPregistry IPlocate IPinfo IPwhois IPify IP-API IPgeoloc WhoisXML Maxmind Utrace
FreeGeoIP - 2.0388 11.7106 1.3410 8.7615 1.5776 9.7614 11.5476 10.1220 9.4695 11.3314 7.2650 1.9323 5.6141
IPdata 2.0388 - 11.6694 2.8688 7.5709 1.5855 8.2876 12.3284 9.3585 8.0196 11.4249 6.4100 0.2460 5.6424
Extreme-IP 11.7106 11.6694 - 11.2273 7.2691 11.1018 8.7193 11.2132 8.9421 8.4749 10.2314 11.0694 11.7252 4.9061
Geoplugin 1.3410 2.8688 11.2273 - 8.6757 1.5546 9.6862 10.8867 9.7167 9.3281 10.8621 7.1132 2.7437 5.4482
IPregistry 8.7615 7.5709 7.2691 8.6757 - 7.8255 4.0415 8.9490 4.6105 4.4684 7.1167 6.4388 7.5250 5.7971
IPlocate 1.5776 1.5855 11.1018 1.5546 7.8255 - 8.7850 11.7350 9.5077 8.3859 11.5172 6.6954 1.5009 5.4266
IPinfo 9.7614 8.2876 8.7193 9.6862 4.0415 8.7850 - 9.9856 5.0676 4.5488 7.5439 7.1109 8.4720 5.0361
IPwhois 11.5476 12.3284 11.2132 10.8867 8.9490 11.7350 9.9856 - 9.1594 9.1486 9.9889 11.2326 12.1958 8.4533
IPify 10.1220 9.3585 8.9421 9.7167 4.6105 9.5077 5.0676 9.1594 - 2.3990 5.8669 4.5707 9.4721 6.3511
IP-API 9.4695 8.0196 8.4749 9.3281 4.4684 8.3859 4.5488 9.1486 2.3990 - 6.8925 5.7068 8.0620 6.2440
Table 5.4: The average distance between all services. The lower the value, the closer the results from the two services are. The values in bold mark the pairs of websites that belong to the group of five websites with small mutual distances.
Table 5.6: The average distance between all services. The lower the value, the closer the results from the two services are.
Section 3.2 discusses some existing OSINT automation tools and goes into
more detail about the three most notable ones – Recon-ng, SpiderFoot, and
Maltego. Table 5.7 summarizes the main differences between these tools and
Pantomath. Like Pantomath, all of these tools are designed to be modular
to allow a straightforward integration of new sources. The most well-known
sources are implemented in all the tools, meaning that a significant portion
of the results will be the same.
Unlike SpiderFoot and Maltego, Recon-ng is entirely open-source, and new modules can be added to the Recon-ng marketplace by any developer. The free version of SpiderFoot is open-source as well, but it lacks a great deal of the functionality offered in the paid version (SpiderFoot HX). Maltego provides a free community version that lacks many features and is limited in the number of queries and in the integration of additional modules. Additionally, it has to run on Maltego's cloud infrastructure, meaning that all the traffic has to go through their servers. The most significant advantage of Maltego is its state-of-the-art visualization capabilities. The results from
• Many services provide information about Bitcoin addresses, e.g., the bal-
ance of the address, whether a scammer or a hacker used the address,
and much more.
• For user names and real names as targets, websites such as Pipl [84],
CheckUserNames [88], or Social Searcher [95] could be utilized.
• To collect offline data and add another source for reliability estimation in the port_discovery module, Nmap [114] or MASSCAN [115] can be used; a minimal sketch of wrapping Nmap for this purpose is shown after this list. However, actively scanning ports in bulk would require additional safeguards to avoid being banned by many IP ranges.
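The following sketch shows one way Nmap could be wrapped as an additional port_discovery source, assuming the nmap binary is installed; the function name, the restriction to the first 1024 TCP ports, and the use of the greppable output format are illustrative choices, not part of Pantomath.

import subprocess

def nmap_open_ports(target, ports="1-1024"):
    """Run nmap against `target` and return the list of open TCP ports,
    parsed from the greppable (-oG) output written to stdout."""
    result = subprocess.run(
        ["nmap", "-Pn", "-p", ports, "--open", "-oG", "-", target],
        capture_output=True, text=True, check=True,
    )
    open_ports = []
    for line in result.stdout.splitlines():
        if "Ports:" not in line:
            continue
        ports_field = line.split("Ports:", 1)[1]
        for entry in ports_field.split(","):
            fields = entry.strip().split("/")
            # Each entry looks like "80/open/tcp//http///".
            if len(fields) >= 2 and fields[1] == "open":
                open_ports.append(int(fields[0]))
    return open_ports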
The selection of data in the offline mode could be improved either by
implementing the existing services directly in Pantomath and collecting
6 Conclusions
The main goal of this thesis was to implement a tool for the automated collection of Open Source Intelligence. Pantomath is a highly modular framework providing all the functionality necessary for data collection, processing, and storage, with a straightforward way to add new data sources. The selection of implemented modules covers the essential services providing information about IP addresses, domain names, and e-mail addresses, including port discovery services, IP geolocation websites, cyber threat intelligence feeds, blacklists, whois data, and much more. The framework can be used through a command-line interface or a simple API, with the possibility of adding other interfaces, such as a web interface.
Pantomath offers three modes of operation with varying anonymity guarantees. The overt mode represents the regular operation of the tool: it uses all modules and sends requests directly to the implemented sources. The stealth mode routes all queries through the Tor network, which acts as an intermediary between the user and the Internet. The offline mode does not require an Internet connection, as the target is looked up in a database of preprocessed data, allowing users to query any target completely anonymously. The selection of data in the offline mode is smaller than in the overt and stealth modes, but additional data can be incorporated in the future.
The reliability estimation model attempts to evaluate the reliability of the collected data, which is one of the biggest challenges of OSINT. The model calculates the reliability of each source used in a module by comparing the results the sources return when specific targets are queried. The reliability estimation is currently implemented in three modules, but the approach can be applied to other modules as well. The reliabilities of the sources in all three modules were estimated using datasets of various targets, and the results can serve as a baseline for future use, since the estimates are updated continuously.
There are many possible extensions and additional sources that can be added to the framework. The reliability estimation model can serve as a blueprint for other modules where the results from different sources are comparable. The modes of operation explore how higher anonymity requirements affect the usability of the tool and the information one can find when the confidentiality of the targets is critical. Pantomath lags behind existing OSINT automation tools in the user interface and the number of implemented sources, but it lays the foundations for concepts that are not yet explored in the existing tools.
Bibliography
[1] B. Schneier. Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W. W. Norton & Company, 2016. isbn: 978-0393352177. url: https://www.schneier.com/books/data_and_goliath/.
[2] A. Hulnick. “The Downside of Open Source Intelligence”. In: International Journal of Intelligence and CounterIntelligence 15 (Nov. 2002), pp. 565–579. doi: 10.1080/08850600290101767.
[3] S. Gibson. “Open source intelligence”. In: The RUSI Journal 149.1 (2004), pp. 16–22. doi: 10.1080/03071840408522977. url: https://doi.org/10.1080/03071840408522977.
[4] Shodan. Shodan. [online], cit. [2020-7-10]. url: https://www.shodan.io.
[5] T. Fingar. Reducing Uncertainty: Intelligence Analysis and National Security. Stanford University Press, 2011. isbn: 9780804775946. url: https://books.google.cz/books?id=wmakl6eGkwYC.
[6] L. Johnson. Handbook of Intelligence Studies. Taylor & Francis, 2007. isbn: 9781135986889. url: https://books.google.cz/books?id=U2yUAgAAQBAJ.
[7] C. Burke. Freeing knowledge, telling secrets: Open source intelligence and development. Bond University, 2007. url: https://research.bond.edu.au/en/publications/freeing-knowledge-telling-secrets-open-sourceintelligence-and-dev.
[8] C. Hobbs, M. Moran, and D. Salisbury. Open Source Intelligence in the Twenty-First Century. Palgrave Macmillan, London, 2014. url: https://link.springer.com/book/10.1057/9781137353320.
[9] K. J. Riley et al. State and Local Intelligence in the War on Terrorism. RAND Corporation, 2005. isbn: 0-8330-3859-1. url: https://www.rand.org/pubs/monographs/MG394.html.
[10] Intelligence Community Information Sharing Executive. U.S. National Intelligence: An Overview. Tech. rep. 2013.
[11] J. Assange. WikiLeaks. [online], cit. [2020-7-30]. url: https://wikileaks.org.
[12] H. Gibson. “Acquisition and Preparation of Data for OSINT Investigations”. In: Jan. 2016, pp. 69–93. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_6.
[22] H. Bean. “Is open source intelligence an ethical issue?” In: Research in Social Problems and Public Policy 19 (Jan. 2011), pp. 385–402. doi: 10.1108/S0196-1152(2011)0000019024.
[23] C. Kopp et al. “Chapter 8 - Ethical Considerations When Using Online Datasets for Research Purposes”. In: Automating Open Source Intelligence. Ed. by R. Layton and P. A. Watters. Boston: Syngress, 2016, pp. 131–157. isbn: 978-0-12-802916-9. doi: 10.1016/B978-0-12-802916-9.00008-7. url: http://www.sciencedirect.com/science/article/pii/B9780128029169000087.
[24] J. Simola. “Privacy issues and critical infrastructure protection”. In: 2020. isbn: 9780128165942. doi: 10.1016/b978-0-12-816203-3.00010-1.
[25] A. Cavoukian. Privacy by Design – The 7 Foundational Principles. [online], cit. [2020-12-17]. 2010. url: https://www.ipc.on.ca/wp-content/uploads/Resources/7foundationalprinciples.pdf.
[26] B.-J. Koops, J.-H. Hoepman, and R. Leenes. “Open-source intelligence and privacy by design”. In: Computer Law & Security Review 29 (Dec. 2013), pp. 676–688. doi: 10.1016/j.clsr.2013.09.005.
[27] P. Casanovas. “Cyber Warfare and Organised Crime. A Regulatory Model and Meta-Model for Open Source Intelligence (OSINT)”. In: Dec. 2017, pp. 139–167. isbn: 978-3-319-45299-9. doi: 10.1007/978-3-319-45300-2_9.
[28] J. Rajamäki and J. Simola. “How to apply privacy by design in OSINT and big data analytics?” In: ECCWS 2019 - Proceedings of the 18th European Conference on Cyber Warfare and Security. June 2019, pp. 364–371. isbn: 9781912764280.
[29] A. Gandomi and M. Haider. “Beyond the hype: Big data concepts, methods, and analytics”. In: International Journal of Information Management 35.2 (2015), pp. 137–144. issn: 0268-4012. doi: 10.1016/j.ijinfomgt.2014.10.007. url: http://www.sciencedirect.com/science/article/pii/S0268401214001066.
[30] A. Powell and C. Haynes. “Social Media Data in Digital Forensics Investigations”. In: Jan. 2020, pp. 281–303. isbn: 978-3-030-23546-8. doi: 10.1007/978-3-030-23547-5_14.
[31] G. Bello-Orgaz, J. J. Jung, and D. Camacho. “Social big data: Recent achievements and new challenges”. In: Information Fusion 28 (2016), pp. 45–59. issn: 1566-2535. doi: 10.1016/j.inffus.2015.08.005. url: http://www.sciencedirect.com/science/article/pii/S1566253515000780.
[32] G. Kalpakis et al. “OSINT and the Dark Web”. In: Jan. 2016, pp. 111–132. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_8.
[33] B. Nafziger. “Data Mining in the Dark: Darknet Intelligence Automation”. In: 2017.
[34] M. Schäfer et al. “BlackWidow: Monitoring the Dark Web for Cyber Security Information”. In: May 2019, pp. 1–21. doi: 10.23919/CYCON.2019.8756845.
[35] H. Chen. Dark Web - Exploring and Data Mining the Dark Side of the Web. Springer-Verlag New York, 2012. isbn: 978-1-4614-1557-2. url: https://www.springer.com/gp/book/9781461415565.
[36] J. Pastor-Galindo et al. “The not yet exploited goldmine of OSINT: Opportunities, open challenges and future trends”. In: IEEE Access PP (Jan. 2020), pp. 1–1. doi: 10.1109/ACCESS.2020.2965257.
[37] R. A. Best Jr. and A. Cumming. Open Source Intelligence (OSINT): Issues for Congress. Congressional Research Service, 2007. url: https://fas.org/sgp/crs/intel/RL34270.pdf.
[38] T. Day, H. Gibson, and S. Ramwell. “Fusion of OSINT and Non-OSINT Data”. In: Jan. 2016, pp. 133–152. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_9.
[39] R. Scrivens et al. “Searching for Extremist Content Online Using The Dark Crawler and Sentiment Analysis”. In: Aug. 2019, pp. 179–194. isbn: 978-1-78769-866-6. doi: 10.1108/s1521-613620190000024016.
[40] E. Susnea. “A Real-Time Social Media Monitoring System as an Open Source Intelligence (Osint) Platform for Early Warning in Crisis Situations”. In: International conference KNOWLEDGE-BASED ORGANIZATION 24 (June 2018), pp. 427–431. doi: 10.1515/kbo-2018-0127.
[41] L. Ball. “Automating social network analysis: A power tool for counter-terrorism”. In: Security Journal 29 (Feb. 2013). doi: 10.1057/sj.2013.3.
[42] M. Dawson, M. Lieble, and A. Adeboje. “Open Source Intelligence: Performing Data Mining and Link Analysis to Track Terrorist Activities”. In: Information Technology - New Generations. Ed. by S. Latifi. Cham: Springer International Publishing, 2018, pp. 159–163. isbn: 978-3-319-54978-1.
[43] S. Carruthers. Social Engineering - A Proactive Security. [online], cit. [2020-8-14]. 2018. url: https://www.mindpointgroup.com/wp-content/uploads/2018/08/Social-Engineering-Whitepaper-Part-Three-Phishing.pdf.
[123] E. Maor. Kilos: The Dark Web’s Newest – and Most Extensive – Search Engine. [online], cit. [2020-7-13]. url: https://intsights.com/blog/kilos-the-dark-webs-newest-and-most-extensive-search-engine.
[124] PublicWWW. PublicWWW. [online], cit. [2020-7-14]. url: https://publicwww.com.
[125] B. Boyter. Searchcode. [online], cit. [2020-7-14]. url: https://searchcode.com.
[126] M. Fagan. Fagan Finder. [online], cit. [2020-8-31]. url: https://www.faganfinder.com/.
[127] ElevenPaths. FOCA. [online], cit. [2020-7-29]. url: https://github.com/ElevenPaths/FOCA.
[128] Edge-Security. Metagoofil. [online], cit. [2020-7-29]. url: http://www.edge-security.com/metagoofil.php.
[129] P. Harvey. ExifTool. [online], cit. [2020-7-29]. url: https://exiftool.org.
[130] Google LLC. Google Images. [online], cit. [2020-7-29]. url: https://images.google.com.
[131] Flickr. Flickr Map. [online], cit. [2020-8-31]. url: https://www.flickr.com/map.
[132] A. Mohawk. PasteLert. [online], cit. [2020-7-30]. url: https://www.andrewmohawk.com/pasteLert/.
[133] A. Musciano. Sniff-Paste: OSINT Pastebin Harvester. [online], cit. [2020-7-30]. url: https://github.com/needmorecowbell/sniff-paste.
[134] Internet Archive. Wayback Machine. [online], cit. [2020-7-31]. url: https://web.archive.org.
[135] Web Scraper. Web Scraper. [online], cit. [2020-7-31]. url: https://webscraper.io.
[136] ScraperAPI. ScraperAPI. [online], cit. [2020-7-31]. url: https://www.scraperapi.com.
[137] ScrapeSimple. ScrapeSimple. [online], cit. [2020-7-31]. url: https://www.scrapesimple.com.
[138] Scrapinghub. Scrapy. [online], cit. [2020-7-31]. url: https://scrapy.org.
[139] Ahmia. Ahmia Crawler. [online], cit. [2020-7-23]. url: https://github.com/ahmia/ahmia-crawler.
[140] Ahmia. Ahmia Index. [online], cit. [2020-7-23]. url: https://github.com/ahmia/ahmia-index.
A Appendices
[pantomath][stealth]$
After the search is over, the results can either be printed to the standard
output or exported to a JSON file using the print and export commands.
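For illustration, a session might look as follows; the exact argument syntax, such as passing the output filename to export, is an assumption made for this example.

[pantomath][stealth]$ print
[pantomath][stealth]$ export results.json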