Automated Collection of Open Source Intelligence
Faculty of Informatics
Master’s Thesis
Hereby I declare that this paper is my original authorial work, which I have
worked out on my own. All sources, references, and literature used or ex-
cerpted during elaboration of this work are properly cited and listed in com-
plete reference to the due source.
Acknowledgements
I would like to thank my advisor RNDr. Lukáš Němec for his guidance
throughout the entirety of this thesis. My thanks also go to RNDr. Martin
Stehlík, Ph.D. who provided many valuable suggestions and helped shape
the tool that is the outcome of this thesis.
Huge appreciation goes to my family for the support they have given me
during all the years of my studies.
I also want to thank Cedric from CIRCL.LU for providing me free access
to their Passive DNS and Passive SSL databases, and Gregory from Spyse
for giving me a free trial for their port discovery service, which allowed me
to extend Pantomath and further test the reliability estimation model.
Abstract
With the ever-growing amount of data available on the Internet and the wide-
spread adoption of social media networks, publicly accessible websites have
grown into a goldmine of valuable information about individuals and com-
panies. Open Source Intelligence, or OSINT for short, is any information
obtainable legally and ethically from publicly available sources addressing
specific intelligence requirements. The relatively easy and cheap integration makes
OSINT a practical solution for national security, cyber threat intelligence,
and many other fields. This thesis presents a framework called Pantomath
for an automated collection of OSINT that utilizes many existing tools and
services. The framework is highly modular, provides all the functionality
needed throughout the whole process of OSINT, offers three modes of oper-
ation for different anonymity requirements, and presents the data in a struc-
tured output. The reliability of some of the collected data is estimated to
allow the user to analyze the data more efficiently and precisely. The frame-
work is compared to existing OSINT automation tools, and the most notable
advantages and disadvantages are discussed.
Keywords
Contents
1 Introduction
6 Conclusions
Bibliography
A Appendices
1 Introduction
With the exponential growth of the Internet in the last few decades, the
amount of data stored around the world has become immeasurable. It is
estimated that four of the biggest online companies, Amazon, Microsoft,
Google, and Facebook, store at least 1.2 million terabytes of data. At first,
data was thought of as a mere by-product of computing, but it has even-
tually grown into a product itself [1]. Companies sell their users’ data to
others that benefit from it, so collecting data of any value is essential for
many. A large portion of Internet data is accessible to anyone with an In-
ternet connection and often contains a lot of knowledge about individuals,
companies, or governments. All this data is commonly called Open Source
Intelligence, or OSINT for short.
The value of OSINT is increasingly getting recognized in many different
fields. According to [2], over 80% of the knowledge used for policymaking
on a national level is derived from OSINT. Cyber threat intelligence heavily
utilizes OSINT and combines it with data collected by security devices to
evaluate possible threats to companies’ infrastructures. All in all, publicly
available sources constitute an irreplaceable source of knowledge. However,
due to the immense amount of data on the Internet and its unstructured and
heterogeneous nature, the collection and processing of OSINT is a challeng-
ing task requiring non-trivial methods. Arguably one of the biggest draw-
backs of OSINT is the lack of mechanisms for verification of the collected
information [3].
To make the whole process of OSINT easier and more accessible, various
tools and services that provide useful information exist. These range from
simple websites that provide basic information about IP addresses to more
complex tools implementing state-of-the-art algorithms, such as Shodan [4].
A framework called Pantomath for an automated collection of OSINT is pre-
sented in this thesis. The framework utilizes existing tools and services that
provide valuable information about Internet identifiers, such as IP addresses
or domain names. As the number of these services is enormous, Pantomath
was designed to make the integration of new sources more straightforward
by moving the data collection to separate modules, which can be added by
merely implementing a well-defined interface.
To address the user’s anonymity requirements, Pantomath offers three
modes of operation with varying guarantees and drawbacks. The overt mode
represents a regular operation where all sources are used, and an Internet
connection is required. In the stealth mode, all requests sent to the Internet
are proxied through the Tor network. The offline mode provides the highest
guarantees for the user’s anonymity, as only a database of preprocessed data
is queried, and no Internet connection is needed. Pantomath also attempts to
tackle possibly the biggest challenge of OSINT – the validation of the gath-
ered data. A mathematical model for reliability estimation of the results is
defined and used in several modules.
The thesis is organized as follows. Chapter 2 introduces OSINT, discusses
some of the challenges, the value it provides, the fields where OSINT is often
utilized, and a few state-of-the-art techniques that improve the efficiency of
OSINT collection. Chapter 3 outlines the sources that can be used to gather
the data and some tools that aim to automate this process. Pantomath,
a tool for an automated collection of OSINT, is presented in Chapter 4.
Chapter 5 evaluates Pantomath, compares it to tools with similar goals, and
outlines possible extensions and improvements.
2 Open Source Intelligence
Publicly available data can be divided into four different categories [12],
as illustrated in Figure 2.1. Open Source Data (OSD) are any publicly avail-
able data that are not refined in any way, e.g., an image or raw social media
data. Open Source Information (OSINF) are OSD that have undergone fil-
tering, extraction of valuable information, and editing, for example, articles
and results from search engine queries. Open Source Intelligence (OSINT)
is a collection of OSINF that addresses a specific intelligence requirement,
with Validated Open Source Intelligence (OSINT-V) going a step further by
validating the OSINT using supporting information. Data from all these four
categories can be used for OSINT. However, Open Source Data and Open
Source Information require further assessment. Specific sources that can be
used for OSINT are discussed in Chapter 3.
Figure 2.1: Description of Open Source Data, Open Source Information, and
(Validated) Open Source Intelligence and how these categories differ in terms
of the level of data transformation [12].
2.1 Challenges
Besides some of the technical challenges, OSINT also brings additional ethi-
cal and legal issues. Although the collection of OSINT should be by definition
legal since only publicly accessible data are considered, the line between eth-
ical and ill-willed usage of the data is not completely clear [20]. Extremely
sensitive personal information such as sexual orientation, religion, or political
beliefs can be inferred even when these are not explicitly stated [21]. The com-
bination of multiple sources and the use of state-of-the-art techniques that
derive valuable information is what elevates mere data into a powerful tool.
According to some, the collection of OSINT is not much different from
someone reading a newspaper since the information is public in both cases.
However, it is the institutionalization of OSINT that raises concerns even
within the intelligence community [22]. As is apparent from leaks of classified
documents such as those released by Edward Snowden, the mass surveillance
performed by governments is not only focused on certain suspicious individ-
uals but rather omnipresent. The intelligence agencies of the world’s biggest
economies, such as the US, have enough resources to find virtually anything
about individual people the Internet has to offer [1].
The biggest issue that arises with such powerful knowledge is the po-
tential harm to the targeted individuals [23]. The EU’s General Data Protec-
tion Regulation (GDPR) and other emerging privacy regulations are an incen-
tive for companies and individuals to carefully handle data to prevent any
direct or indirect leakage of personal information [24]. According to GDPR,
different pieces of information that together identify a specific person also
constitute personal data. This creates a non-trivial task to handle for com-
panies that sell pseudonymized data, as irreversible anonymization remains
an unresolved issue. Having regulations such as GDPR only solves a part of
the problem since users can voluntarily publish their personal information.
Nonetheless, the push for better privacy of Internet users could potentially
decrease the value of OSINT.
Both GDPR and Privacy by Design [25], an essential concept from GDPR
addressing users’ privacy, are also partially applicable to OSINT (or any data
collection process). By adhering to these principles, the OSINT investigators
can perform the tasks at hand as ethically and safely for the targeted indi-
viduals as possible in the given scenario. The principles can be summarized
as follows [12]:
Although there are some challenges when utilizing OSINT, the sheer amount
of information available on the Internet makes OSINT a viable solution for
national security, cyber threat intelligence, and many other fields. The incre-
mental improvements in the performance of computers over the past decades
have made OSINT increasingly relevant. However, it was the emergence of
big data [29] and machine learning that made huge amounts of data available
at a much more rapid pace and with more valuable information extracted
from it. These are described later in Section 2.3.
With the widespread adoption of social media networks, the availability
of private personal information has increased considerably. Combined with
the fact that many users are unaware that large portions of the informa-
tion they share could be accessible to anyone, social media in particular are
a goldmine of OSINT [30]. Data from these websites can be used for market
research and other business-related use cases, but also for malicious activities,
such as phishing [31]. Companies also need to be aware of the information
that can be found about them publicly, as possession of a complete collec-
tion of such information could give a potential attacker a lot of valuable
knowledge as to how the company could be exploited.
The dark web offers strong anonymity and privacy guarantees, which attracts
users who wish to participate in illegal activities [32]. This fact alone makes
the dark web a very fruitful source of information, especially for govern-
ment agencies fighting against crime. The strong anonymity comes hand
in hand with the necessity to use specialized software to collect data from
the dark web [33], such as search engines and web crawlers that are able to
index hidden services. A framework called BlackWidow proposed by Schafer
et al. [34] brings together various tools for collection and analysis of the con-
tent to gather information related to cybersecurity and fraud monitoring.
There is a body of research focusing on the detection of activities of interna-
tional terrorist groups on the dark web and the education of agencies fighting
against these groups [35].
As shown in Figure 2.2, most of the use cases of OSINT can be divided
into three main categories – detection of organized crime, cybersecurity, and
social media and sentiment analysis. Many niche use cases do not necessar-
ily fall within any of these categories, such as competitive intelligence used
by companies to research potential markets for their products or cybersecu-
rity from the attacker’s perspective. Therefore, this thesis generalizes these
categories to military intelligence, cybersecurity from the viewpoint of both
attackers and defenders, and social and business intelligence.
Figure 2.2: The three main types of use cases for OSINT [36].
2.2.2 Cybersecurity
The world of OSINT provides a lot of valuable knowledge for companies con-
cerned about the security of their infrastructures, such as information about
the ever-evolving landscape of cybersecurity threats, details about new vul-
nerabilities that are constantly getting discovered, or reports about recent
security incidents. All these pieces of information constitute a goldmine of
intelligence that anyone can easily and freely utilize. Just as companies can
employ OSINT to get a better understanding of potential threats, the at-
tackers might use the publicly available information to explore the Internet
presence of the companies, find any possible loopholes for exploitation, or
shape a strategy for a phishing campaign [43].
Chapter 3 discusses some existing tools suitable for these tasks, for exam-
ple, tools that determine what software a website is using, which ports are
open on a particular IP address, including the services running on them, or
people and e-mail addresses associated with a company. Edwards et al. [44]
demonstrated that a large-scale collection of information required for a social
engineering campaign could be carried out completely automatically with no
active communication with the targets. They did so by gathering contact in-
formation of all employees publicly affiliated with a company, tracking down
other employees through social media networks, obtaining their personal in-
formation, and so forth.
Hayes and Cappa [45] performed a thorough evaluation of the critical
infrastructure of a company operating in the U.S. electrical grid using only
publicly available data. They created a complete overview of the infrastruc-
ture, including specifics about the hardware and software used on various ma-
chines, outlined potential vulnerabilities within the infrastructure, and dis-
covered the company’s employees, including their e-mail addresses. The au-
thors conclude that a continuous collection of OSINT targeted at a particular
company could provide attackers with powerful knowledge, and companies
should pay more attention to what information about them is publicly ac-
cessible. Cartagena et al. [46] performed a similar analysis and were able to
achieve comparable results.
Tanaka and Kashima [47] propose a URL blacklist based solely on OS-
INT, and they show that 75% of the blacklist’s values are unknown to Google
Safe Browsing. Additionally, 23% of the malware used in these URLs is also
unknown. Quick and Choo [48] incorporate OSINT in digital forensic analysis
to add value to the data and aid with timely extraction of required evidence.
Vacas et al. [49] propose an automated approach for the collection of new
knowledge from OSINT for detection rules in intrusion detection systems.
The method was tested on real-world network traffic, proving it can detect
malicious activities within a network. Lee et al. [50] combine OSINT with
events detected by security devices to improve the knowledge of potential
threats.
How the general population feels about specific topics has always been an im-
portant aspect when designing marketing campaigns, deciding how a product
should be built to suit its users’ needs, or even formulating a manifesto be-
fore an election. With the recent advent of sentiment analysis algorithms
that can evaluate users’ opinions just from their posts, OSINT has become
an attractive source of information for social and business intelligence. Data
from social media networks, discussion forums, and other websites where
users often share opinions on different topics can be collected and evaluated
to get a grasp of the overall public opinion [51].
Neri et al. [52] performed a sentiment analysis of 1,000 news articles related
to a public scandal of the former Italian Prime Minister Silvio Berlusconi.
The primary goal was to detect whether there was a coordinated press cam-
paign by evaluating the time and geographical distribution of the articles and
the proportion of positive and negative opinions. Fleisher [53] lays out how
OSINT affects competitive and marketing intelligence, details the biggest
challenges when utilizing it, and outlines the best practices for successful
utilization of public sources.
2.3 State-of-the-Art
The reliability of the CTI feed is determined by its independence and error, and it
is continuously updated with new values to reflect the current situation.
tions pinned directly in the posts, places mentioned in the posts themselves,
and geolocations found by CLAVIN (Cartographic Location And Vicinity
INdexer) [70], a tool that uses NLP and ML. The combination of user re-
lationships and geolocation seeds is used to determine past, present, and
possibly future physical locations of the users. The achieved accuracy is over
77%.
Ranade et al. [71] utilize deep learning for the translation of CTI data
from different languages to English. The translation is optimized specifically
for cybersecurity-related data by creating translation mappings for all key-
words that are associated with cybersecurity. Data collected from public
sources such as Twitter that are in other languages are preprocessed using
NLP and a translation framework that adopts deep learning, and the trans-
lation mappings translate all relevant parts of the data. Once the data are
translated to English, they can be fed to CTI systems, thus broadening
the amount of gained CTI. The system focuses on Russian but can be ex-
tended for other languages as well by simply creating mappings for the re-
spective language.
Alves et al. [72] present a framework for the collection and classifica-
tion of Twitter posts. The main goal is to gather security-related informa-
tion from these posts and provide them to Security Information and Event
Management (SIEM) systems, which employ event data to handle the se-
curity management of an organization. The premise is that many security
experts use Twitter to post short messages about security-related news in
near real-time. These messages are normalized, the features are extracted,
and the messages are classified by a supervised ML model and clustered.
Each cluster is analyzed using named entity recognizers [73], and the crucial
components (attack, vector, target) are retrieved.
Deliu et al. [74] investigate how ML and Neural Networks may help
with the collection of CTI from hacker forums and other social platforms.
Specifically, they employ supervised ML and Convolutional Neural Networks
(CNN) [75] to classify the posts from these platforms. The results show that
the ML classifier performs at least as well as the CNN one, with the accuracy
of both being approximately 98%. As CNN classifiers tend to be rather com-
plex and expensive when used in practical scenarios, having ML classifiers
with the same accuracy might increase the chances of companies utilizing
classifiers for CTI.
Mittal et al. [76] present a system for extraction and analysis of cyber-
security information collected from multiple sources, including national vul-
nerability databases, social networks, and dark web vulnerability markets.
The system uses NLP to remove unnecessary parts of the textual data and
3 OSINT Sources and Tools
This chapter introduces various sources that can be used to obtain
data for OSINT. These are described in Section 3.1, ranging from simple
tools giving some specific information about the queried keyword, such as
geolocation of an IP address or a list of accounts associated with a username,
to more sophisticated software that employs non-trivial algorithms, such
as Shodan [4] or Darksearch [77]. Tools for OSINT automation also exist,
utilizing many available sources and using the results to search for additional
information. A few of these are described later in Section 3.2.
There are books and articles discussing various OSINT tools and prac-
tices in more detail. Chauhan and Panda [78] explain all the theory behind
OSINT and describe some of the tools from this chapter in more detail, in-
cluding instructions on how to use them. Revell et al. [79] establish a frame-
work for the assessment of OSINT tools and best practices for their usage.
The OSINT Handbook [80] provides an exhaustive list of all available OSINT
tools and resources.
Many websites aim to make the process of OSINT more organized and
methodical. The OSINT Framework [81] (illustrated in Figure 3.1) provides
a broad overview of OSINT-related tools that are either completely free or
offer a limited usage for free. Your OSINT Graphical Analyzer (YOGA) [82]
is a simple flowchart showing what a piece of information can be trans-
formed into or used for, for example, how an IP address can be used to find
other relevant data, such as a domain name. OSINT Open Source Intelligence
Framework [83] is similar to the OSINT Framework but goes a bit further
by adding educational resources, listing notable companies and people who
contribute to the OSINT realm, and much more.
3.1 Overview
Real Names Gathering information about people using their real names
depends on their country of origin, the uniqueness of their name, and knowl-
edge of other related information, such as an address or date of birth.
Many countries keep records of various public information, including prop-
erty ownership, criminal activity, weddings, births, and deaths. Each country
has different rules and laws for what is considered public information and
what is kept secret. There are tools aiming to automatically collect data
from many public sources to find as much information about an individual
as possible. Pipl [84] is an identity resolution engine that tracks online identities.
Figure 3.1: The OSINT Framework website [81] showing a structured view of
free OSINT-related tools. These are divided into categories based on the type
of information they provide.
ited domains and IP addresses), saves the resources of these domains, takes
a screenshot, and so forth.
Metadata Metadata are very helpful for the management of files on a com-
puter. They carry a considerable amount of information about the file and
can often reveal a lot. Images can yield information about the camera that
took the picture, the date it was taken, and sometimes even the location. Dif-
ferent documents, such as PDF files, can contain information about the au-
thor or the system used to create it. FOCA [127] and Metagoofil [128] can
find all files of a specific type on a given domain, obtain metadata from a doc-
ument or a website, and find any similarities in metadata of multiple files.
ExifTool [129] is an offline tool for image metadata extraction, while Google
Images [130] can perform a reverse image search to find related images and
websites containing this image.
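As an illustration of how such metadata can be collected programmatically, the following Python sketch wraps the ExifTool command-line utility. It assumes exiftool is installed and available on the PATH; the sketch is an example rather than part of any tool listed above.

```python
import json
import subprocess


def image_metadata(path: str) -> dict:
    """Run ExifTool on a file and return its metadata as a dictionary."""
    output = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    # exiftool -json prints a JSON array with one object per processed file
    return json.loads(output)[0]
```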
Pastebin is a popular type of service used to store and share plain text,
for example, source code or any other text that is formatted or is too long to
share directly through a messaging application. Pastebins are usually public,
and anyone can access the text shared by other users. Since some users are
not aware of this or do not consider it as a problem, they might use this
service to share private information. Additionally, pastebins are often used
to share private information on purpose, such as database leaks, meaning
that they can be a good source of information for an OSINT investigation.
There are multiple ways data from pastebins can be gathered. PasteLert [132]
is a service that sends an e-mail whenever a search term appears in a new
paste. Sniff-Paste [133] scrapes pastebins, stores them in a database, and
searches for noteworthy information. There are also the so-called pastebin
dumps that provide all pastes in one place.
Even with the broad selection of OSINT tools, searching for valuable infor-
mation about the target can be overwhelming. To overcome this obstacle
and make things easier for the OSINT investigator, proprietary products
such as Intelligence X [141] and ShadowDragon [142] exist. Intelligence X
3.2.1 Recon-ng
Additionally, Recon-ng has a simple web interface. All the collected
data are stored in an SQL database and managed through the db command.
The database looks at all the information stored there as a potential new in-
put for subsequent data gathering. Snapshots of the database can be created
for simple data recovery in case of a failure. To present the results in a human-
readable format, CSV and HTML files can be generated. Workspaces create
the possibility to have multiple environments with independent configura-
tions and database instances, allowing the user to switch between them as
needed. Recon-ng can be run automatically through a so-called resources file
containing all the commands for the framework to run.
The modules are defined by an abstract class with a well-defined inter-
face and some accessory functions from which new modules need to inherit.
All the modules are available in a place called Recon-Ng Marketplace [145],
which is an independent GitHub repository. As of now, the Marketplace con-
tains around 100 different modules. Third-party modules that are not part
of the Marketplace can also be used. These are loaded directly from a local
Recon-ng directory. The abstraction allows the users to utilize any infor-
mation source by simply creating a wrapper class around that source since
the framework only requires the class to implement the interface. The mar-
ketplace and modules commands are the entry points for the administration
of the modules. Each module can define the type of input it takes, and
the database is scanned to search for any data of this type that could be
used as a new input for the module.
3.2.2 Maltego
data types, like documents and social networks. Similar to Recon-ng, entities
can be used as an input for further data collection. Entities are visualized as
nodes, and when a connection is found, they are clustered into networks
of nodes. Figure 3.2 shows an example of a graph generated by Maltego.
3.2.3 SpiderFoot
SpiderFoot [149] is another open-source framework for OSINT automation.
Like Recon-ng and Maltego, it is designed to be highly modular and provide
all the necessary functions for data manipulation and storage. A significant
advantage of SpiderFoot is the number of modules it implements, which is
more than 170. It has both a command-line interface and a web interface.
The starting points SpiderFoot can scan are domain names, IP addresses,
hostnames/subdomains, subnets, ASNs, e-mail addresses, phone numbers,
and human names. The features of the paid version SpiderFoot HX that are
not available in the open-source version include Tor browser integration for
deep web scanning, multi-target scanning, continuous monitoring with alerts
and e-mail notifications, and a correlation engine that looks for anomalies
and other notable results.
SpiderFoot’s web interface provides an easy way to configure the app and
the modules, add API keys, choose what modules to use for a scan, debug,
and visualize the results in the form of a table and a graph. The graph
representation is similar to the one in Maltego since it shows results as
nodes and displays relationships by clustering them. Selected results can be
marked as false positives, which also marks child elements and deletes them
from the graph. SpiderFoot HX can run the data collection step-by-step to
inspect how each result is discovered. Figure 3.3 shows the web interface of
SpiderFoot.
4 Pantomath: Tool for Automated OSINT Collection
The main goal of this thesis was to implement a tool for an automated collec-
tion of open-source intelligence. Pantomath¹ is a highly modular framework
that provides a complete environment for collecting and evaluating OSINT
about IP addresses, e-mail addresses, and domain names. The framework
implements all the functionality required throughout the process of OSINT,
but separate modules perform the collection itself. New modules can be inte-
grated by merely implementing a well-defined interface. Some of the gathered
data are evaluated in terms of their reliability. Finally, all the results are
presented in a structured output.
The rest of this chapter is organized as follows. Section 4.1 defines the prob-
lem at hand, describes some of the main challenges, and how Pantomath
strives to solve them. Section 4.2 outlines the high-level architecture of
the tool and the functionality it provides. Section 4.2.3 goes into more detail
about the implemented modules, and finally, Section 4.3 establishes a model
for reliability estimation of some of the modules. Section 3.2 from the previ-
ous chapter describes existing tools with similar objectives, and Section 5.2
from the following chapter compares them to Pantomath, states the major
differences, and discusses the advantages and disadvantages of each.
There are many ways a tool that automatically collects OSINT could be
implemented since OSINT is an extensive topic, and the amount of avail-
able data is enormous. The collection of OSINT also poses many challenges,
many of which were discussed in Section 2.1. If the user aims to find specific
information, such as social media network contacts and posts of a particular
person or the footprint a company has on the Internet, various advanced
methods specializing in these tasks can be utilized. Section 2.3 describes
some of the more innovative approaches aiming to tackle specific OSINT
challenges and tasks. Using social media networks as an example, one could
take advantage of web scraping, sentiment analysis, machine translation, and
deanonymization to create a complete profile of various users of these social
media networks.
However, each problem the user might need to solve requires a differ-
ent approach, and building a state-of-the-art tool for each of these would
1. Pantomath is an English word for a person that wants to know and knows everything.
be a laborious task. Instead, existing tools and services that already imple-
ment non-trivial gathering of OSINT can be utilized. Chapter 3 provides
an overview of such sources. Therefore, the objective of Pantomath is not
to actively collect data and transform them into valuable information, but
rather to automate the collection of OSINT by employing the existing tools.
Since the number of possible sources is immense and new ones can appear
in the future, one very desirable feature for the tool is an easy integration of
additional sources.
This approach brings new disadvantages that have to be taken into con-
sideration. By using existing services, Pantomath relies on their correct and
continuous operation and needs to address any changes that break the cur-
rent implementation. These services are also often built into commercial
products and, as such, provide only a limited usage for free or no free ver-
sion at all. For some types of information, no free service exists, meaning
that an API key needs to be bought or the information is
not available. One example of such a tool is BuiltWith [111], which analyzes
the technology stack of a website, thus providing very valuable information
about a domain. However, the price for access to BuiltWith’s API starts
at $295 per month. Ultimately, the selection of which paid services to choose
heavily depends on the use case.
One of the most significant challenges of OSINT that has not been ad-
dressed by any of the existing OSINT automation tools² is the evaluation of the re-
liability of the collected information, or in other words, getting as close to
Validated Open Source Intelligence described in Chapter 2 as possible. Val-
idated OSINT is defined as OSINT with a high degree of certainty, which
is very hard to achieve without a certain level of human involvement. Any
OSINT investigation will eventually require a person with some knowledge
of the context of the investigation and OSINT itself to evaluate the gathered
data. The goal of Pantomath is not to provide any guarantees of the informa-
tion reliability but rather to allow the investigator to make more informed
decisions by producing a reliability estimate.
Collecting OSINT is usually not just about finding information about
one particular target but rather a repetitive cycle where follow-up searches
are refined based on information discovered in the previous iterations. By
performing these sequential searches, relationships between different targets
can be revealed. Similarly to the information itself, the reliability of the dis-
covered targets and their relationships with other targets can be estimated.
2. To the best of our knowledge, no existing tools provide an automatic reliability estima-
tion.
The architecture of Pantomath is separated into two main parts. The base
framework provides a complete environment for automated collection of
OSINT, with independent modules performing the data gathering itself.
The framework defines an interface that each module needs to implement,
and the interface is described in more detail in Section 4.2.3. Search queries
are narrowed down to simple keywords representing entities on the Internet.
Currently, the target can be an IP address, a domain name, or an e-mail
address, but the selection can be easily extended with user names, phone
numbers, Bitcoin addresses, and other identifiers.
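To make the module interface more concrete, the following Python sketch shows one possible shape of a module and of the targets it exchanges with the framework. It is illustrative only: the class names, fields, and method signatures are assumptions made for this sketch, not Pantomath's actual interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Target:
    value: str               # e.g. "fi.muni.cz", "147.251.48.1", or "user@example.com"
    kind: str                # "ip", "domain", or "email"
    depth: int = 0           # distance from the initial target in the search tree
    reliability: float = 1.0


@dataclass
class Result:
    data: dict                                        # module-specific findings
    new_targets: list = field(default_factory=list)   # Targets for follow-up searches


class Module(ABC):
    """Interface that every data-collection module implements (illustrative)."""

    accepts = ("ip", "domain", "email")  # target kinds the module takes as input
    requires_api_key = False
    provides_offline_data = False

    @abstractmethod
    def query(self, target: Target) -> Result:
        """Collect data about the target and return findings plus any new targets."""
```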
All results that might be used as new targets, i.e., any of the IP addresses,
domains, and e-mail addresses related to the target, are added to a pool and
used for follow-up searches. The decision of what should be considered as
a new target is left for each module. Each new target keeps track of how it was
discovered, including the target used in the previous query, the relationship
between these targets, and the module that disclosed it. Therefore, a kind
of search tree forms, where the initial target serves as the root of this tree
and has a depth of 0. Figure 4.1 shows an example of a search tree with
the domain fi.muni.cz used as the seed. The nodes represent newly found
targets, and the edges represent the discovery by various modules.
Each target discovered when querying the initial value is a child node
of the root, has a depth of 1, and is queried the same way as the initial
value. Again, new targets are extracted from the results. The whole process
is repeated until the specified maximum depth is reached. The depth can
be specified before the query is started and increased if necessary. Generally,
the deeper the target is, the weaker the relationship is with the initial target,
and for some better-known targets, there might be hundreds of new targets
even at the first level.
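A minimal sketch of how the framework could expand the pool of targets breadth-first up to a configured depth is shown below, reusing the Target and Module sketches above. The real implementation may differ in details such as deduplication, error handling, and how results are recorded.

```python
from collections import deque


def run_search(seed, modules, max_depth):
    """Expand the pool of targets breadth-first up to max_depth (illustrative)."""
    pool = deque([seed])
    results = {}              # target value -> list of Results
    seen = {seed.value}

    while pool:
        target = pool.popleft()
        for module in modules:
            if target.kind not in module.accepts:
                continue
            result = module.query(target)
            results.setdefault(target.value, []).append(result)
            if target.depth >= max_depth:
                continue      # targets at the maximum depth are not expanded further
            for new in result.new_targets:
                if new.value not in seen:
                    seen.add(new.value)
                    new.depth = target.depth + 1
                    pool.append(new)
    return results
```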
3. Onion (or hidden) services are anonymous services only reachable through the Tor
network.
A configurable delay prevents the framework from sending too many requests in a short period of time to the same service and getting banned.
The delay is specified in the configuration file, and it can be different for
each service. Moreover, the framework contains functions for extraction of
IP, e-mail, and Bitcoin addresses from blocks of data and validation of these
values, including a function that attempts to fix invalid e-mail addresses (e.g.,
by removing trailing characters that are added to disguise the addresses and
make it harder to extract them).
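As an illustration of the extraction and fixing helpers described above, a simple regex-based sketch might look as follows. The pattern is deliberately simplified and is an assumption of this sketch, not the framework's actual implementation.

```python
import re

# Deliberately simple pattern; real-world extraction needs more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list:
    """Return all e-mail addresses found in an arbitrary block of text."""
    return EMAIL_RE.findall(text)


def fix_email(candidate: str) -> str:
    """Drop trailing characters appended to disguise an address, e.g. 'user@example.com--xyz'."""
    match = EMAIL_RE.search(candidate)
    return match.group(0) if match else ""
```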
The database software used in the offline mode described later in Sec-
tion 4.2.2 is PostgreSQL [151], an object-relational database system allow-
ing a great deal of flexibility for the schema by having a wide variety of
objects it can store. The framework provides all the necessary functions for
the database management, such as for table creation and data storage and
retrieval, and leaves the management itself to each module. This gives each
module enough flexibility in how it stores its data and how the data are re-
trieved. Specifically, each table within the database stores three attributes:
• the lookup key,
• the timestamp of when the item was added,
• a JSON object storing all the additional information.
In most cases, the lookup key is just a string representing the retrieved
target (e.g., a specific IP or e-mail address), but it is also possible to use
a CIDR block as the key. In that case, the target IP address can be searched
by checking whether it belongs to some of the CIDR blocks. Since the mod-
ules manage the data storage and retrieval, they can use the lookup key
as a regular unique ID and implement some more complex data indexing.
The timestamp is used to discard items that are expired, and it is managed
by the framework. This feature can be disabled, and the expiry time can be
specified in the configuration file.
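The following sketch illustrates the described storage layout with a hypothetical table and a CIDR-based lookup helper. The table name, column names, and functions are assumptions; the actual schema used by the modules may differ.

```python
import ipaddress

import psycopg2
from psycopg2.extras import Json

# Hypothetical table for one module, mirroring the three attributes described above.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS example_module (
    lookup_key TEXT PRIMARY KEY,
    added      TIMESTAMP NOT NULL DEFAULT NOW(),
    payload    JSONB NOT NULL
);
"""


def store(conn, key: str, payload: dict) -> None:
    """Insert or refresh one record keyed by the lookup key."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO example_module (lookup_key, payload) VALUES (%s, %s) "
            "ON CONFLICT (lookup_key) DO UPDATE "
            "SET payload = EXCLUDED.payload, added = NOW()",
            (key, Json(payload)),
        )
    conn.commit()


def matches_cidr(ip: str, cidr_key: str) -> bool:
    """Check whether a target IP address falls inside a CIDR block used as a lookup key."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr_key, strict=False)
```

Refreshing the timestamp on conflict, as in the sketch, keeps the expiry mechanism described above working when offline data are re-imported.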
Pantomath can be used either through an API or a command-line inter-
face. The framework is also built to be easily extendable with other inter-
faces, such as a web interface, a custom API, and so forth. The command-
line interface provides the commands listed in Table 4.1. The update com-
mand updates the offline data for all modules. The query command queries
the specified target in all modules, stores the results, and adds any new tar-
gets found by the modules to a pool used for follow-up searches. The overt,
stealth, and offline commands indicate which mode of operation should be
used. The query results can be exported to a JSON file or printed out to
the standard output in a structured form using the export and print com-
mands.
Command Description
exit Exit the CLI
export Export the results into a JSON file
help Display a menu with descriptions of all available commands
modules Display a list of loaded modules with a description
offline Switch to offline mode: only offline sources are queried
overt Switch to overt mode: all sources are queried
print Print the results in a structured form
query Query the target in all available modules
stealth Switch to stealth mode: Tor is used to fetch the data
update Update the offline data in all available modules
enables the stealth mode by integrating the Tor network. In this mode, all
requests sent to any of the used services go through the Tor network, making
it much more difficult to trace the request to the machine where Pantomath
is running. The use of Tor is just an illustration of how the stealth mode can
be implemented. Besides Tor, the fetch_url function could be extended to
utilize custom proxies or other forms of anonymization, or the tool could be
deployed on a cloud infrastructure.
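A minimal sketch of how the fetch_url function could route requests through Tor is shown below. It assumes a local Tor client listening on the default SOCKS port (9050) and the requests library installed with SOCKS support (PySocks); the actual implementation may differ.

```python
import requests

# "socks5h" makes Tor resolve DNS names, so lookups do not leak locally.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


def fetch_url(url: str, stealth: bool = False, timeout: int = 30) -> requests.Response:
    """Fetch a URL, optionally routing the request through the Tor network."""
    proxies = TOR_PROXIES if stealth else None
    return requests.get(url, proxies=proxies, timeout=timeout)
```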
Although the stealth mode supports the use of all modules, one important
consideration is the use of API keys. For the services that require an API
key, all requests can be easily correlated to the user even though the requests
are sent through intermediary nodes since API keys are generally tied to
a registered account. This obstacle could be bypassed by registering accounts
using only anonymous e-mail accounts and fake information. However, many
services employ non-trivial protection against this technique.
The offline mode takes the anonymity a step further by performing
queries with no access to the Internet. Instead, all data available as a whole
are downloaded, parsed, and stored in the database in advance using either
overt or stealth mode. Once the database contains fresh values, connection
to the Internet is no longer necessary, and the offline mode can be acti-
vated. Each query checks whether anything related to the target is stored
in the database. By downloading data as a whole and not requesting in-
formation for various targets separately, the data provider or anybody who
observes what was downloaded only knows about the possession of this data
and not about what exactly the data are used for. Preprocessing the data in
advance and storing them in the database also brings additional performance
advantages whenever multiple queries are executed. The data are parsed only
once, and all subsequent queries just check the database, which is a much
faster operation.
From the list of modules providing some offline data in Section 4.2.3,
it is apparent the selection is relatively small. In general, services rarely
offer complete access to their data for free, and usually not even as a paid
service. To have access to more data in the offline mode, the functionality
provided by the existing services queried by Pantomath would need to be
implemented within the tool. As discussed in Section 4.1, this would be
a very laborious task entirely out of the scope of this thesis. However, for
some of the modules, open-source tools providing similar functionality exist,
meaning that Pantomath would only need to implement the continuous data
collection. One example of such a tool is MASSCAN [115], which provides
information about open ports of IP addresses similarly to Shodan [4].
4.2.3 Modules
The implemented modules collect information about IP addresses, domain
names, and e-mail addresses. Table 4.2 shows all the implemented modules,
which target types they take as an input, and whether they require an API
key. The remainder of this section describes the modules in more detail.
Table 4.2: List of implemented modules with information about which types
of targets they take as an input, whether they provide offline data, if an API
key is required, and if the module can find new targets for follow-up searches.
• ThreatCrowd [108]
• VirusTotal [166]
• AlienVault OTX [107]
• MetaDefender [109]
• Shodan [4]
• Spyse [112]
• Censys [113]
darkweb This module parses a CSV file containing data scraped from
the dark web and used for categorization of the websites in [174]. Each
entry in the file contains a link to the website, its content, possible locations
resolved from the content using CLAVIN [70], and the website’s category.
The module looks for any IP addresses and domain names in the website’s
content, and for each discovered IP or domain, the whole entry is saved into
the database (i.e., if no IP or domain is found in the content, the entry is
skipped). This module only illustrates how some of the library functions can
be used when large volumes of data are processed because all of the steps to
create the dataset need to be performed manually.
dns This module resolves the domain or IP address using Google DNS [176].
The answer is added as a new target for possible follow-up searches.
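For illustration, such a lookup could be performed against Google's public DNS-over-HTTPS JSON API, as sketched below; whether the module uses this exact endpoint is an assumption of the sketch.

```python
import requests


def resolve(name: str, record_type: str = "A") -> list:
    """Resolve a name via Google's DNS-over-HTTPS JSON API and return the answers."""
    response = requests.get(
        "https://dns.google/resolve",
        params={"name": name, "type": record_type},
        timeout=10,
    )
    response.raise_for_status()
    return [answer["data"] for answer in response.json().get("Answer", [])]
```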
pgp This module searches for the domain name or the e-mail address in
PGP public key servers, namely The.Earth.li [182] and Key-Server.io [183],
which is used if the first server is not responsive. If anything is found for
the queried domain, all e-mail addresses associated with the domain are
retrieved and added as new targets. This module is somewhat fragile, as
both of these websites are sometimes not accessible.
including all domains sharing these IDs. These domains are again used as
possible new targets. The service requires an API key, with the free version
providing 10000 queries per month. Three paid versions are offered with
prices starting from $6 per month.
torexits This module downloads and parses a list of Tor exit nodes main-
tained by Torproject.org [186].
urlscan This module searches the domain at urlscan.io [100]. The service
visits the specified URL and records all activities happening during this pro-
cess, such as which domains and IP addresses were visited. These domains
and IP addresses are considered for follow-up searches. The results also in-
clude all resources of these domains, a screenshot, and much more. Based on
the configuration, either only links to the detailed results are attached
to the results, or the detailed results are fetched and included.
whois This module looks for the domain’s whois data on Whois XML
API [162]. The service requires an API key, where the free version provides
500 credits (searches) per month. Their database can also be downloaded,
with options to download 1 million entries for $240 or the whole database
for an undisclosed price.
whois_reverse This module looks for reverse whois data (i.e., informa-
tion about domains registered with the e-mail address) either on Whoxy [187]
or on Whoisology [188] if Whoxy is not available or does not return any
results. Both services require an API key. Search credits for Whoxy can be
bought with prices ranging between $4 and $8 per thousand credits based on
the number of credits bought. Whoisology costs $50 per month with a maxi-
mum of 2500 credits and $35 for every additional 2500 credits. The domains
that are associated with the e-mail address are added to the pool of new
targets.
Besides the collection of OSINT itself, one of the requirements for Pantomath
was to provide an estimation of the reliability of the collected data. As dis-
cussed in Section 2.1, the best way to determine the truthfulness of OSINT
is to establish the reliability of the sources the information was retrieved
from and to use multiple sources providing the same type of information
and compare the results [18]. Additionally, having context and query-specific
information is essential to avoid collecting information not relevant to the in-
quiry [19]. Pantomath narrows the inquiries down to simple keywords and
uses tools that provide a specific type of information about the given key-
word, such as a geolocation of an IP address. Therefore, the collected infor-
mation is implicitly relevant to what the user is looking for.
It is important to note that the reliability estimation does not necessarily
make sense for all the results. For example, the torexits module downloads
a list of Tor exit nodes directly from the Tor project website. Although
it is possible to obtain this information from other sources and compare
the results, it could be argued the fact that the Tor project itself provides it
gives enough confidence the result is correct. Another example of such a case
is the dns module, where multiple servers could be queried and the answers
compared. Nonetheless, an established DNS server, such as the one by Google
used in the module, has enough credibility to be trusted.
Another factor to consider is the need for multiple sources. As discussed
in Section 4.1, many services are either paid or provide only a limited num-
ber of free requests, meaning that using multiple sources for the reliability
estimation can significantly increase the cost if some or all of them are paid.
One of the cases where this applies is the whois_reverse module because
the vast majority of reverse whois APIs do not offer any free queries. Addi-
tionally, some services provide information that is too unique to be validated
as multiple services would need to be combined to produce the information,
such as the urlscan module, or there might not be other services providing
it at all.
Pantomath provides a reliability estimate for each new target that is
discovered during the search. The seed target passed to the CLI has the re-
liability set to 100%. The reliability of each new target is computed using
the previous target’s reliability and the reliability multiplier of the module
that discovered it. The multipliers are specified in the configuration file, and
they can be different for each module. Currently, all modules use the same
default multiplier, which is equal to 0.8. Figure 4.3 shows an example of
a search tree, including the reliability estimates of all targets. As the multi-
pliers are the same for all modules, targets with the same depth have equal
reliabilities, i.e., 80% for level 1, 64% for level 2, 51.2% for level 3, and so
forth.
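The propagation of reliabilities down the search tree can be expressed in a few lines. The following sketch, with assumed function names, reproduces the default behaviour described above.

```python
DEFAULT_MULTIPLIER = 0.8


def target_reliability(parent_reliability: float, module: str, multipliers: dict) -> float:
    """Reliability of a new target: the parent's reliability times the module multiplier."""
    return parent_reliability * multipliers.get(module, DEFAULT_MULTIPLIER)


# With the default multiplier everywhere, reliabilities follow the depths described above:
seed = 1.0                                               # fi.muni.cz, 100%
level1 = target_reliability(seed, "dns", {})             # 0.8  -> 80%
level2 = target_reliability(level1, "threat_intel", {})  # 0.64 -> 64%
```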
By setting different multipliers, the user can control how reliabilities for
new targets are estimated. For example, the dns module might have a higher
multiplier, as the relationship between the queried target and the newly
discovered target is well-defined and generally has a high probability of being
Figure 4.3: An illustration of the search tree that forms when querying
fi.muni.cz with the reliability of each depth in red.
correct. On the other hand, modules such as darkweb or psbdmp, where new
targets are discovered by extracting e-mail and IP addresses from blocks
of data, could have lower multipliers since the relationship between these
targets is unclear, and the extraction might not be precise. With varying
multipliers, the same depth targets could have different reliabilities, and
some targets might even have lower reliabilities than targets deeper in the tree,
depending on the discovery chain. Additionally, the reliabilities could be used
as the indicator of which targets should be queried instead of the depth, as
they better represent the strength of the connection to the initial target.
The model proposed by Gong et al. [56] described in Section 2.3 provides
a systematic approach for reliability estimation of results collected from cy-
ber threat intelligence feeds. A simplified version of this model is used to cal-
culate the reliability of data in threat_intel, geolocation, and port_discovery
modules. In each module, the reliabilities of all implemented sources are
estimated, and the results for a specific target are evaluated using these esti-
mates. The initial values are set to the ones obtained in Section 5.1 and are
continuously updated after each query, meaning that each time a target is
queried in the module, the reliability is recalculated.
The values can be reset and calculated from scratch using data provided
by the user. The more values are collected and added to the reliability es-
timation, the more these estimates reflect how different sources perform in
the scenarios the user is interested in. For example, some websites used in
the geolocation module could provide accurate results for IP ranges owned by
large companies but be less precise when resolving the location of indepen-
dent addresses. If the user mostly investigates individuals, the initial data
where commercial IP ranges were also considered might distort the sources’
precision. The same goes for the port_discovery module, where the portion
of IP addresses with no open ports significantly influences the reliability of
different services.
Symbol Description
n number of CTI feeds
Fn n-th CTI feed
risk(Fi ) risk value returned by CTI feed Fi
\[ \text{weight}(F_i) = 1 - \frac{\text{independence}(F_i)}{\max_{j,k=1}^{n} \text{dist}(F_j, F_k)} \tag{4.5} \]
The risk values collected for a particular target T and the reliabilities of
the CTI feeds are used to estimate the reliability of the results, as shown
in Equation 4.7. In this case, the final value is the risk associated with
the queried target and it is attached to the results returned by the module.
\[ \text{reliability}(T) = \frac{\sum_{k=1}^{n} \text{risk}(F_k)\,\text{reliability}(F_k)}{\sum_{k=1}^{n} \text{reliability}(F_k)} \tag{4.7} \]
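For illustration, Equation 4.7 can be computed directly from the per-feed risk values and reliability estimates, as in the following sketch (function and parameter names are assumptions of the sketch):

```python
def target_risk(risks: dict, reliabilities: dict) -> float:
    """Reliability-weighted average of the risk values, following Equation 4.7."""
    feeds = [f for f in risks if f in reliabilities]
    total = sum(reliabilities[f] for f in feeds)
    if total == 0:
        return 0.0
    return sum(risks[f] * reliabilities[f] for f in feeds) / total
```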
4.3.2 Geolocation
The geolocation module uses many IP geolocation services that either return
GPS coordinates (i.e., two numbers – latitude and longitude) or an address
that is resolved to coordinates using the OpenStreetMap API [152]. The co-
ordinates provide a convenient way to compare results from different services
and estimate the reliability. This section describes how the reliability of each
geolocation service is computed and how the results for a particular IP ad-
dress are evaluated. Table 4.4 describes the symbols used in the equations.
Symbol Description
n number of geolocation sites
Sn n-th geolocation site
latn latitude resolved by the n-th site
lonn longitude resolved by the n-th site
m number of clusters
Cm m-th cluster
p set of sites in m-th cluster
Firstly, Equation 4.8 computes the distance between two sets of coor-
dinates resolved by sites S1 and S2 . As these coordinates correspond to
a point in a two-dimensional Euclidean space, it is calculated the same way
as the Euclidean distance.
\[ \text{dist}(S_i, S_j) = \sqrt{(lat_i - lat_j)^2 + (lon_i - lon_j)^2} \tag{4.8} \]
Equations 4.9 and 4.10 compute the expected coordinates, which are
equal to the average of coordinates resolved by all sites.
\[ lat_{\text{expected}} = \frac{\sum_{k=1}^{n} lat_k}{n} \tag{4.9} \]
\[ lon_{\text{expected}} = \frac{\sum_{k=1}^{n} lon_k}{n} \tag{4.10} \]
The error of the coordinates resolved by site Si is computed as the dis-
tance to the expected value Sexpected , as shown in Equation 4.11. The ex-
pected value is defined by the expected coordinates latexpected and lonexpected .
Finally, Equation 4.14 computes the reliability of site Si . Just like the reli-
ability of CTI feeds, it is inversely proportional to the error and proportional
to the weight, and the result is a fraction between 0 and 1.
\[ \text{reliability}(S_i) = \left(1 - \frac{\text{error}(S_i)}{\max_{j,k=1}^{n} \text{dist}(S_j, S_k)}\right) \text{weight}(S_i) \tag{4.14} \]
When coordinates from all sites are collected, they are clustered using
a clustering algorithm with a threshold defined in the configuration file (the
default value is set to 0.2). The clusters are mutually exclusive, and the num-
ber of clusters m can be anywhere between 1 and n. For each cluster, the ex-
pected location is computed, and the reliability of the location of cluster
Ci is equal to the sum of reliabilities of sites that constitute the cluster di-
vided by the sum of all reliabilities, as shown in Equation 4.15. In the end,
the module returns one or more pairs of coordinates and their reliabilities.
\[ \text{reliability}(C_i) = \frac{\sum_{S_j \in C_i} \text{reliability}(S_j)}{\sum_{k=1}^{n} \text{reliability}(S_k)} \tag{4.15} \]
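To illustrate the clustering step and Equation 4.15, the following sketch uses a simple greedy, threshold-based clustering of the per-site coordinates. The actual clustering algorithm used by the module is not specified here, so the grouping strategy is an assumption.

```python
import math


def dist(a, b):
    """Euclidean distance between two (lat, lon) pairs, as in Equation 4.8."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def cluster_sites(coords: dict, threshold: float = 0.2) -> list:
    """Greedy clustering of per-site coordinates; the real algorithm may differ."""
    clusters = []                      # list of sets of site names
    for site, point in coords.items():
        for group in clusters:
            if any(dist(point, coords[other]) <= threshold for other in group):
                group.add(site)
                break
        else:
            clusters.append({site})
    return clusters


def cluster_reliability(group: set, reliability: dict) -> float:
    """Equation 4.15: the cluster's share of the total site reliability."""
    return sum(reliability[s] for s in group) / sum(reliability.values())
```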
Symbol Description
n number of port discovery services
Pn n-th port discovery service
m total number of resolved ports
pm m-th resolved port
Rn set of ports resolved by n-th service
Qm set of services that resolved m-th port
qm number of services that resolved m-th port
5 Evaluation and Discussion
The reliability estimation model defined in Section 4.3 compares all sources
used in each module to compute their reliability, and the values are continu-
ously updated when new targets are queried. The estimates reflect the pre-
cision of each source for the type of data that was used for the estimation,
and more data generally yields better accuracy. The model was evaluated
using large datasets to provide base reliabilities for the user and measure
how different sources perform.
Table 5.2: The average distance between all feeds. The lower the value,
the closer the risk values of the two feeds.
The model by Gong et al. [56] compares many different features to es-
timate the reliability of the CTI feeds, such as hashes of malicious files
associated with the target or IP addresses used in the same attack. As these
values represent distinct entities that are much more comparable, the com-
parison of feeds using this model is more methodical and better illustrates
the differences between them. By using multiple features, the reliability esti-
mation in the threat_intel module would be more reliable than the current
one. However, implementing such a model is non-trivial and requires a lot of
parsing to bring various pieces of information together.
5.1.2 Geolocation
The reliability of the websites used in the geolocation module was estimated using a dataset of 1000 randomly generated IP addresses, from which all multicast, reserved, private, and loopback addresses were filtered out. The IP addresses were resolved by all geolocation services, and the values defined in Section 4.3.2 were calculated. Table 5.3 shows the measured values and the null ratio, i.e., the percentage of IP addresses for which no geolocation was resolved. The null ratio of the Utrace website was 90% due to its inconsistent operation. To determine how dependent the websites are on each other, the pairwise distances between them were computed and are shown in Table 5.4.
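A minimal sketch of how such a dataset could be generated with Python's standard ipaddress module follows; the exact procedure used for the evaluation is not specified here, so the function below is only an illustration under that assumption.

import ipaddress
import random

def random_public_ipv4(count=1000, seed=None):
    """Draw `count` random IPv4 addresses, skipping multicast, reserved,
    private, and loopback ranges."""
    rng = random.Random(seed)
    addresses = []
    while len(addresses) < count:
        candidate = ipaddress.IPv4Address(rng.getrandbits(32))
        if (candidate.is_multicast or candidate.is_reserved
                or candidate.is_private or candidate.is_loopback):
            continue
        addresses.append(str(candidate))
    return addresses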
A group of five websites (FreeGeoIP, IPdata, Geoplugin, IPlocate, and Maxmind) all have relatively small mutual distances. These are
Website    | Null ratio | Independence | Error  | Weight | Reliability | Adjusted rel. | Known
FreeGeoIP  | 0.3%       | 7.2528       | 5.8982 | 0.6520 | 0.5056      | 0.4822        | 5.5557
IPdata     | 0.2%       | 6.8334       | 5.4831 | 0.6651 | 0.5208      | -             | 1.3827
Extreme-IP | 0.5%       | 10.2773      | 8.8586 | 0.5923 | 0.4419      | 0.4493        | 8.9512
Geoplugin  | 0.4%       | 7.1811       | 5.8395 | 0.6539 | 0.5068      | 0.4846        | 10.5209
IPregistry | 0.1%       | 6.9512       | 5.5843 | 0.6732 | 0.5243      | 0.5245        | 1.7526
IPlocate   | 0.2%       | 6.8296       | 5.4639 | 0.6581 | 0.5125      | 0.4879        | 4.6368
IPinfo     | 0.2%       | 7.7042       | 6.2717 | 0.6090 | 0.4582      | 0.4642        | 0.7729
IPwhois    | 0.2%       | 10.8974      | 9.2205 | 0.5631 | 0.4158      | 0.4231        | 9.0029
IPify      | 0.1%       | 7.4466       | 6.0745 | 0.6340 | 0.4818      | 0.4926        | 4.8343
IP-API     | 0%         | 7.1182       | 5.7016 | 0.6669 | 0.5182      | 0.5200        | 0.7706
IPgeoloc   | 0%         | 9.4214       | 7.8840 | 0.5494 | 0.3997      | 0.4110        | 8.7674
WhoisXML   | 0%         | 7.4479       | 6.0814 | 0.6174 | 0.4693      | 0.4684        | 3.3321
Maxmind    | 0.2%       | 6.8275       | 5.4938 | 0.6641 | 0.5195      | 0.4878        | 1.1600
Utrace     | 90%        | 5.9583       | 5.1341 | 0.6182 | 0.4658      | -             | 37.4484
Table 5.3: Measurements of the values defined in Section 4.3.2. The column Adjusted rel. contains the reliabilities when IPdata and Utrace are removed from the computation. The column Known represents the average distance between the results given by each service and the known location for a particular IP address. The lower the values in this column, the closer the results are to the known locations.
Website FreeGeoIP IPdata Extreme-IP Geoplugin IPregistry IPlocate IPinfo IPwhois IPify IP-API IPgeoloc WhoisXML Maxmind Utrace
FreeGeoIP - 2.0388 11.7106 1.3410 8.7615 1.5776 9.7614 11.5476 10.1220 9.4695 11.3314 7.2650 1.9323 5.6141
IPdata 2.0388 - 11.6694 2.8688 7.5709 1.5855 8.2876 12.3284 9.3585 8.0196 11.4249 6.4100 0.2460 5.6424
Extreme-IP 11.7106 11.6694 - 11.2273 7.2691 11.1018 8.7193 11.2132 8.9421 8.4749 10.2314 11.0694 11.7252 4.9061
Geoplugin 1.3410 2.8688 11.2273 - 8.6757 1.5546 9.6862 10.8867 9.7167 9.3281 10.8621 7.1132 2.7437 5.4482
IPregistry 8.7615 7.5709 7.2691 8.6757 - 7.8255 4.0415 8.9490 4.6105 4.4684 7.1167 6.4388 7.5250 5.7971
IPlocate 1.5776 1.5855 11.1018 1.5546 7.8255 - 8.7850 11.7350 9.5077 8.3859 11.5172 6.6954 1.5009 5.4266
IPinfo 9.7614 8.2876 8.7193 9.6862 4.0415 8.7850 - 9.9856 5.0676 4.5488 7.5439 7.1109 8.4720 5.0361
IPwhois 11.5476 12.3284 11.2132 10.8867 8.9490 11.7350 9.9856 - 9.1594 9.1486 9.9889 11.2326 12.1958 8.4533
IPify 10.1220 9.3585 8.9421 9.7167 4.6105 9.5077 5.0676 9.1594 - 2.3990 5.8669 4.5707 9.4721 6.3511
IP-API 9.4695 8.0196 8.4749 9.3281 4.4684 8.3859 4.5488 9.1486 2.3990 - 6.8925 5.7068 8.0620 6.2440
Table 5.4: The average distance between all services. The lower the value, the closer the results from the two services are. The values in bold mark the pairs of websites that belong to the group of five websites with small mutual distances.
Table 5.6: The average distance between all services. The lower the value, the closer the results from the two services are.
Section 3.2 discusses some existing OSINT automation tools and goes into
more detail about the three most notable ones – Recon-ng, SpiderFoot, and
Maltego. Table 5.7 summarizes the main differences between these tools and
Pantomath. Like Pantomath, all of these tools are designed to be modular
to allow a straightforward integration of new sources. The most well-known
sources are implemented in all the tools, meaning that a significant portion
of the results will be the same.
Unlike SpiderFoot and Maltego, Recon-ng is entirely open-source, and new modules can be added to the Recon-ng marketplace by any developer. The free version of SpiderFoot is open-source as well, but it lacks a great deal of the functionality offered in the paid version (SpiderFoot HX). Maltego provides a free community version that lacks many features and is limited in the number of queries and in the integration of additional modules. Additionally, it has to run on Maltego's cloud infrastructure, meaning that all the traffic has to go through their servers. The most significant advantage of Maltego is its state-of-the-art visualization capabilities. The results from
• Many services provide information about Bitcoin addresses, e.g., the bal-
ance of the address, whether a scammer or a hacker used the address,
and much more.
• For user names and real names as targets, websites such as Pipl [84],
CheckUserNames [88], or Social Searcher [95] could be utilized.
• To collect offline data and add another source for reliability estimation in the port_discovery module, Nmap [114] or MASSCAN [115] can be used; a minimal sketch of wrapping Nmap for this purpose is shown after this list. However, actively scanning ports in bulk would require additional safeguards to avoid being banned by many IP ranges.
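The following sketch shows one way Nmap could be wrapped as an additional port_discovery source, assuming the nmap binary is installed; the function name, the restriction to the first 1024 TCP ports, and the use of the greppable output format are illustrative choices, not part of Pantomath.

import subprocess

def nmap_open_ports(target, ports="1-1024"):
    """Run nmap against `target` and return the list of open TCP ports,
    parsed from the greppable (-oG) output written to stdout."""
    result = subprocess.run(
        ["nmap", "-Pn", "-p", ports, "--open", "-oG", "-", target],
        capture_output=True, text=True, check=True,
    )
    open_ports = []
    for line in result.stdout.splitlines():
        if "Ports:" not in line:
            continue
        ports_field = line.split("Ports:", 1)[1]
        for entry in ports_field.split(","):
            fields = entry.strip().split("/")
            # Each entry looks like "80/open/tcp//http///".
            if len(fields) >= 2 and fields[1] == "open":
                open_ports.append(int(fields[0]))
    return open_ports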
The selection of data in the offline mode could be improved either by
implementing the existing services directly in Pantomath and collecting
6 Conclusions
The main goal of this thesis was to implement a tool for the automated collection of Open Source Intelligence. Pantomath is a highly modular framework providing all the functionality necessary for data collection, processing, and storage, with a straightforward way to add new data sources. The selection of implemented modules covers the essential services providing information about IP addresses, domain names, and e-mail addresses, including port discovery services, IP geolocation websites, cyber threat intelligence feeds, blacklists, whois data, and much more. The framework can be used through a command-line interface or a simple API, with the possibility of adding other interfaces, such as a web interface.
Pantomath offers three modes of operation with varying anonymity guarantees. The overt mode represents the regular operation of the tool: it uses all modules and sends requests directly to the implemented sources. The stealth mode routes all queries through the Tor network, which acts as an intermediary between the user and the Internet. The offline mode does not require an Internet connection, as the target is looked up in a database of preprocessed data, allowing users to query any target completely anonymously. The selection of data in the offline mode is smaller than in the overt and stealth modes, but additional data can be incorporated in the future.
The reliability estimation model attempts to evaluate the reliability of the collected data, which is one of the biggest challenges of OSINT. The model calculates the reliability of each source used in a module by comparing the results the sources return when specific targets are queried. The reliability estimation is currently implemented in three modules, but the approach can be applied to other modules as well. The reliabilities of the sources in all three modules were estimated using datasets of various targets, and the results can serve as a baseline for future use, since the estimates are updated continuously.
There are many possible extensions and additional sources that can be added to the framework. The reliability estimation model can serve as a blueprint for other modules where the results from different sources are comparable. The modes of operation explore how higher anonymity requirements affect the usability of the tool and the information one can find when the confidentiality of the targets is critical. Pantomath lags behind existing OSINT automation tools in the user interface and the number of implemented sources, but it lays the foundations for concepts that are not yet explored in the existing tools.
Bibliography
[1] B. Schneier. Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W. W. Norton & Company, 2016. isbn: 978-0393352177. url: https://www.schneier.com/books/data_and_goliath/.
[2] A. Hulnick. “The Downside of Open Source Intelligence”. In: International Journal of Intelligence and CounterIntelligence 15 (Nov. 2002), pp. 565–579. doi: 10.1080/08850600290101767.
[3] S. Gibson. “Open source intelligence”. In: The RUSI Journal 149.1 (2004), pp. 16–22. doi: 10.1080/03071840408522977. url: https://doi.org/10.1080/03071840408522977.
[4] Shodan. Shodan. [online], cit. [2020-7-10]. url: https://www.shodan.io.
[5] T. Fingar. Reducing Uncertainty: Intelligence Analysis and National Security. Stanford University Press, 2011. isbn: 9780804775946. url: https://books.google.cz/books?id=wmakl6eGkwYC.
[6] L. Johnson. Handbook of Intelligence Studies. Taylor & Francis, 2007. isbn: 9781135986889. url: https://books.google.cz/books?id=U2yUAgAAQBAJ.
[7] C. Burke. Freeing knowledge, telling secrets: Open source intelligence and development. Bond University, 2007. url: https://research.bond.edu.au/en/publications/freeing-knowledge-telling-secrets-open-sourceintelligence-and-dev.
[8] C. Hobbs, M. Moran, and D. Salisbury. Open Source Intelligence in the Twenty-First Century. Palgrave Macmillan, London, 2014. url: https://link.springer.com/book/10.1057/9781137353320.
[9] K. J. Riley et al. State and Local Intelligence in the War on Terrorism. RAND Corporation, 2005. isbn: 0-8330-3859-1. url: https://www.rand.org/pubs/monographs/MG394.html.
[10] Intelligence Community Information Sharing Executive. U.S. National Intelligence: An Overview. Tech. rep. 2013.
[11] J. Assange. WikiLeaks. [online], cit. [2020-7-30]. url: https://wikileaks.org.
[12] H. Gibson. “Acquisition and Preparation of Data for OSINT Investigations”. In: Jan. 2016, pp. 69–93. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_6.
[22] H. Bean. “Is open source intelligence an ethical issue?” In: Research in Social Problems and Public Policy 19 (Jan. 2011), pp. 385–402. doi: 10.1108/S0196-1152(2011)0000019024.
[23] C. Kopp et al. “Chapter 8 - Ethical Considerations When Using Online Datasets for Research Purposes”. In: Automating Open Source Intelligence. Ed. by R. Layton and P. A. Watters. Boston: Syngress, 2016, pp. 131–157. isbn: 978-0-12-802916-9. doi: 10.1016/B978-0-12-802916-9.00008-7. url: http://www.sciencedirect.com/science/article/pii/B9780128029169000087.
[24] J. Simola. “Privacy issues and critical infrastructure protection”. In: 2020. isbn: 9780128165942. doi: 10.1016/b978-0-12-816203-3.00010-1.
[25] A. Cavoukian. Privacy by Design – The 7 Foundational Principles. [online], cit. [2020-12-17]. 2010. url: https://www.ipc.on.ca/wp-content/uploads/Resources/7foundationalprinciples.pdf.
[26] B.-J. Koops, J.-H. Hoepman, and R. Leenes. “Open-source intelligence and privacy by design”. In: Computer Law & Security Review 29 (Dec. 2013), pp. 676–688. doi: 10.1016/j.clsr.2013.09.005.
[27] P. Casanovas. “Cyber Warfare and Organised Crime. A Regulatory Model and Meta-Model for Open Source Intelligence (OSINT)”. In: Dec. 2017, pp. 139–167. isbn: 978-3-319-45299-9. doi: 10.1007/978-3-319-45300-2_9.
[28] J. Rajamäki and J. Simola. “How to apply privacy by design in OSINT and big data analytics?” In: ECCWS 2019 - Proceedings of the 18th European Conference on Cyber Warfare and Security. June 2019, pp. 364–371. isbn: 9781912764280.
[29] A. Gandomi and M. Haider. “Beyond the hype: Big data concepts, methods, and analytics”. In: International Journal of Information Management 35.2 (2015), pp. 137–144. issn: 0268-4012. doi: 10.1016/j.ijinfomgt.2014.10.007. url: http://www.sciencedirect.com/science/article/pii/S0268401214001066.
[30] A. Powell and C. Haynes. “Social Media Data in Digital Forensics Investigations”. In: Jan. 2020, pp. 281–303. isbn: 978-3-030-23546-8. doi: 10.1007/978-3-030-23547-5_14.
[31] G. Bello-Orgaz, J. J. Jung, and D. Camacho. “Social big data: Recent achievements and new challenges”. In: Information Fusion 28 (2016), pp. 45–59. issn: 1566-2535. doi: 10.1016/j.inffus.2015.08.005. url: http://www.sciencedirect.com/science/article/pii/S1566253515000780.
[32] G. Kalpakis et al. “OSINT and the Dark Web”. In: Jan. 2016, pp. 111–132. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_8.
[33] B. Nafziger. “Data Mining in the Dark: Darknet Intelligence Automation”. In: 2017.
[34] M. Schäfer et al. “BlackWidow: Monitoring the Dark Web for Cyber Security Information”. In: May 2019, pp. 1–21. doi: 10.23919/CYCON.2019.8756845.
[35] H. Chen. Dark Web - Exploring and Data Mining the Dark Side of the Web. Springer-Verlag New York, 2012. isbn: 978-1-4614-1557-2. url: https://www.springer.com/gp/book/9781461415565.
[36] J. Pastor-Galindo et al. “The not yet exploited goldmine of OSINT: Opportunities, open challenges and future trends”. In: IEEE Access PP (Jan. 2020), pp. 1–1. doi: 10.1109/ACCESS.2020.2965257.
[37] R. A. Best Jr. and A. Cumming. Open Source Intelligence (OSINT): Issues for Congress. Congressional Research Service, 2007. url: https://fas.org/sgp/crs/intel/RL34270.pdf.
[38] T. Day, H. Gibson, and S. Ramwell. “Fusion of OSINT and Non-OSINT Data”. In: Jan. 2016, pp. 133–152. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_9.
[39] R. Scrivens et al. “Searching for Extremist Content Online Using The Dark Crawler and Sentiment Analysis”. In: Aug. 2019, pp. 179–194. isbn: 978-1-78769-866-6. doi: 10.1108/s1521-613620190000024016.
[40] E. Susnea. “A Real-Time Social Media Monitoring System as an Open Source Intelligence (Osint) Platform for Early Warning in Crisis Situations”. In: International conference KNOWLEDGE-BASED ORGANIZATION 24 (June 2018), pp. 427–431. doi: 10.1515/kbo-2018-0127.
[41] L. Ball. “Automating social network analysis: A power tool for counter-terrorism”. In: Security Journal 29 (Feb. 2013). doi: 10.1057/sj.2013.3.
[42] M. Dawson, M. Lieble, and A. Adeboje. “Open Source Intelligence: Performing Data Mining and Link Analysis to Track Terrorist Activities”. In: Information Technology - New Generations. Ed. by S. Latifi. Cham: Springer International Publishing, 2018, pp. 159–163. isbn: 978-3-319-54978-1.
[43] S. Carruthers. Social Engineering - A Proactive Security. [online], cit. [2020-8-14]. 2018. url: https://www.mindpointgroup.com/wp-content/uploads/2018/08/Social-Engineering-Whitepaper-Part-Three-Phishing.pdf.
[123] E. Maor. Kilos: The Dark Web’s Newest – and Most Extensive – Search Engine. [online], cit. [2020-7-13]. url: https://intsights.com/blog/kilos-the-dark-webs-newest-and-most-extensive-search-engine.
[124] PublicWWW. PublicWWW. [online], cit. [2020-7-14]. url: https://publicwww.com.
[125] B. Boyter. Searchcode. [online], cit. [2020-7-14]. url: https://searchcode.com.
[126] M. Fagan. Fagan Finder. [online], cit. [2020-8-31]. url: https://www.faganfinder.com/.
[127] ElevenPaths. FOCA. [online], cit. [2020-7-29]. url: https://github.com/ElevenPaths/FOCA.
[128] Edge-Security. Metagoofil. [online], cit. [2020-7-29]. url: http://www.edge-security.com/metagoofil.php.
[129] P. Harvey. ExifTool. [online], cit. [2020-7-29]. url: https://exiftool.org.
[130] Google LLC. Google Images. [online], cit. [2020-7-29]. url: https://images.google.com.
[131] Flickr. Flickr Map. [online], cit. [2020-8-31]. url: https://www.flickr.com/map.
[132] A. Mohawk. PasteLert. [online], cit. [2020-7-30]. url: https://www.andrewmohawk.com/pasteLert/.
[133] A. Musciano. Sniff-Paste: OSINT Pastebin Harvester. [online], cit. [2020-7-30]. url: https://github.com/needmorecowbell/sniff-paste.
[134] Internet Archive. Wayback Machine. [online], cit. [2020-7-31]. url: https://web.archive.org.
[135] Web Scraper. Web Scraper. [online], cit. [2020-7-31]. url: https://webscraper.io.
[136] ScraperAPI. ScraperAPI. [online], cit. [2020-7-31]. url: https://www.scraperapi.com.
[137] ScrapeSimple. ScrapeSimple. [online], cit. [2020-7-31]. url: https://www.scrapesimple.com.
[138] Scrapinghub. Scrapy. [online], cit. [2020-7-31]. url: https://scrapy.org.
[139] Ahmia. Ahmia Crawler. [online], cit. [2020-7-23]. url: https://github.com/ahmia/ahmia-crawler.
[140] Ahmia. Ahmia Index. [online], cit. [2020-7-23]. url: https://github.com/ahmia/ahmia-index.
A Appendices
[pantomath][stealth]$
After the search is over, the results can either be printed to the standard
output or exported to a JSON file using the print and export commands.
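For illustration, a session might look as follows; the exact argument syntax, such as passing the output filename to export, is an assumption made for this example.

[pantomath][stealth]$ print
[pantomath][stealth]$ export results.json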