Paper accepted: ICACDS 2020
All content following this page was uploaded by Bipin Kumar Rai on 06 June 2021.
1 Introduction
Security researcher Mark M. Lowenthal defines OSINT as “any and all information
that can be obtained from the overt collection: all media types, government reports,
and other files, scientific research and reports, business information providers, the
Internet, etc.” [1]
The main benefit of Open Source Intelligence is the variety of information it yields. It
is not restricted to a single format such as text or images; data in any available
format, including audio and video, can be extracted from publicly accessible sources.
In this paper, we propose two digital solutions that collect information about a targeted
entity through a single platform and, most importantly, save time.
The first solution, OSINTEI (Open Source Intelligence for Efficient Investigation),
supports the investigation of a target and the extraction of data about it. It focuses
on administrative officers, since government officials are responsible for
administrative duties and for the development of the nation and its citizens. But what
if the officials entrusted with these responsibilities own illegal assets, show abnormal
expenses, or display suspicious work behavior? For such officials, information gathering
starts by looking through records and files available on governmental and other web
portals. The process takes a lot of time, and that time is wasted when nothing
suspicious is found; it could have been spent on other tasks.
Investigation is the most crucial phase and also the most demanding in time, effort, and
cost. It is done manually and has been done the same way for ages. This traditional
approach costs more than just money [2]; it can even put the lives of investigation team
members at risk. This has created the need for an automated investigation platform that
reduces these costs and provides a digital-age information-gathering protocol for more
efficient investigation.
OSINT already underlies various tools for searching, data extraction, and compilation,
which have gained popularity and wide user demand because of their compatibility,
efficiency, and ease of use [3].
The second solution, OSINTSF (Open Source Intelligence as Social Finder), aims to make
searching for details about an entity easy and accessible. It is also intended to
benefit businesses by allowing content on social networking sites to be searched in
real time and by providing in-depth analytical data.
2 Related Work
Investigation [6] and policing methodologies have evolved with time, but neither has
been fully integrated with technology.
The first proposed solution aims to provide a single platform for the entire
information-extraction stage of an investigation, focusing on the targeted individual
and delivering both miscellaneous and sorted data to the researcher. Whether the data is
decades-old news coverage or current financial status, the product would search the
publicly reachable web and find the information about the target that is relevant to the
investigation. Once complete, it would help gather the data needed to find a direction
for the investigation, saving the time that manual information gathering consumes.
Instead of manually searching different online sources, compiling a report, and taking
days over a single task, investigators can use our solution to find relevant data about
the target in a few seconds and receive a templated, data-filled report.
Data is extracted from public domains that can be legally used, such as government
websites like Supremo, where the data is updated each year and is accurate. Only such
rich data would be extracted, making the primary stage of investigation easier, with
efficient and reliable results.
The second solution aims to help businesses that keep records of their clients and seek
new prospects. It would also help individuals keep track of their social and web
presence. Social media is becoming a crucial part of digital communication strategies.
It is now an effective tool not only for improving brand loyalty and winning new clients
but also for strengthening customer service, since it lets businesses use social media
networks to establish relationships and widen their interactions.
It would also help businesses in insurance, banking, or other investment industries
build peer-to-peer networks to reach non-contactable customers whose renewal, incentive,
or maturity programs remain unclaimed. They can use this solution to find details about
those customers and contact them.
Both solutions have minimal resource requirements and need minimal input data about the
target. A single laptop or other device with an Internet connection is enough to get
results immediately.
OSINT [7] is not yet widely used, and it has much more potential that can be drawn on to
create a software product with greater usability [8], [9]. The solution, being a
software product, would use machine learning and artificial intelligence, with
classification and regression algorithms, for example the Naïve Bayes classifier.
Selenium [10], an efficient and portable framework for crawling and testing web
applications, would also be used. The solution uses JWT (JSON Web Tokens) for secure
session management and microservices to improve the scalability of the product. The
technologies used and their roles are described in detail below.
Both solutions use a similar technology stack but differ in functionality and use cases.
A facial image can also be used as an alternative input for investigation, but only if
the name (the primary input) is unavailable. This feature increases ease of use and
broadens the functionality of the solution.
The proposed solution follows an algorithm called Global Search (GS). GS maintains a
global data structure that contains all the details of a particular person, a foreign
agent, or a group.
The GS algorithm is divided into four smaller sub-algorithms:
1. GS-Crawl
2. GS-Extraction
3. GS-Reinforce
4. GS-Template
3.1.1. OSIGS-Crawl
• GS-Crawl(Pi)
    driver unit.
    Searcht ← Pi.
    If (!Searcht)
        cw ← driver instance
        for each i in S:
            temp ← S[i] ∀ i ∈ S
            cw ∀ temp.
            lst ← cw
        call insertIntoGS.
        db_init(JSONF).
• insertIntoGS(cwi):
    retrieve cwi.
    Push into Global Stack (GS).
    if (GS_count < 0)
        Return null
    else
        Return GS(r1, r2, r3, ..., rn)
    G_Stack ← GS(r1, r2, ..., rn)
    If G_Stack is null:
        goto 1.
    else
        itemi ← pop(G_Stack).
        serialize(itemi).
        JSONs ← serialized(itemi)
        goto 4.
    return JSONs.
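The crawl, push, and serialize steps above can be read roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `gs_crawl`, `drain_to_json`, and `fake_fetch`, the example URLs, and the record fields are all hypothetical, and the real page visit (which the paper performs with a Selenium-driven crawler) is replaced by a stub so the example runs without a browser or network.

```python
import json
from typing import Callable, List, Optional


def gs_crawl(person: str,
             sources: List[str],
             fetch: Callable[[str, str], Optional[dict]]) -> List[dict]:
    """Visit every source for one target and collect the non-empty results."""
    global_stack: List[dict] = []           # plays the role of G_Stack
    if not person:                          # no search term, nothing to crawl
        return global_stack
    for url in sources:                     # "for each i in S"
        record = fetch(url, person)         # one crawler call per link
        if record:                          # skip sources with no hit
            global_stack.append({"source": url, **record})
    return global_stack


def drain_to_json(global_stack: List[dict]) -> Optional[str]:
    """Pop every item off the stack and serialize the batch as JSON."""
    if not global_stack:
        return None
    items = []
    while global_stack:
        items.append(global_stack.pop())    # last in, first out
    return json.dumps(items, sort_keys=True)


def fake_fetch(url: str, person: str) -> Optional[dict]:
    """Hypothetical stand-in for a browser-driven page visit."""
    canned = {
        "https://example.org/records": {"name": person, "post": "officer"},
        "https://example.org/news": None,   # no match on this source
    }
    return canned.get(url)


stack = gs_crawl("A. Kumar",
                 ["https://example.org/records", "https://example.org/news"],
                 fake_fetch)
print(drain_to_json(stack))
```

Passing the fetch step in as a parameter keeps the collection logic independent of the crawler, so a real crawler can be substituted later without changing the stack handling.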
3.1.2. OSIGS-Extraction
3.1.3. OSIGS-Reinforce
The OSIGS-Reinforce algorithm takes its input from the extraction maintenance service as
JSON and builds a data set after deserializing the response. OSIGS-Reinforce (OSIGSR)
acts as a REST endpoint consumer that consumes JSONF in order to train the model.
The results gathered from the public domains for the target are used as parameters to
train the learning model.
• OSIGS-Reinforce(JSONF)
    init ɸ(JSONF, t).
    ∀ Ji ∈ JSONF:
        temp = Ji
        for each (i in j):
            Select t from JSONF
            Do trigger t,
            watch output O and next state Ji+1
            ɸ(Ji, t) ← ɸ(Ji, t) + ß [O + ρ · max_t' ɸ(Ji', t') − ɸ(Ji, t)].
            Ji ← Ji'
    Push JSON[Ji1', Ji2', ..., Jin'] in db.
Initialize the ɸ value, i.e., ɸ(JSON, trigger), then observe the current state Ji and
choose a trigger t belonging to Ji. Apply the trigger, observe the output O, and watch
for the next state Ji+1. Update the ɸ values until all entries of the JSON are
exhausted. After this, the new and approved results JSON[Ji1', Ji2', Ji3', Ji4', ..., Jin']
are returned by the reinforcement learning service and passed on to the DB service.
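The update step in OSIGS-Reinforce has the shape of a tabular Q-learning update, so it can be illustrated as below. This is a hedged sketch rather than the paper's code. It assumes that ɸ is a table indexed by (state, trigger), that the output O acts as the reward, that ß is the learning rate (as listed in the Appendix), and that ρ is a discount factor, which the paper does not define explicitly. The state names, trigger names, and numeric values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

State = str
Trigger = str


def q_update(phi: Dict[Tuple[State, Trigger], float],
             state: State, trigger: Trigger, reward: float,
             next_state: State, next_triggers: Iterable[Trigger],
             beta: float = 0.5, rho: float = 0.9) -> float:
    """Apply phi(s,t) <- phi(s,t) + beta * (O + rho * max phi(s',t') - phi(s,t))."""
    # Best value reachable from the next state; 0.0 if it has no triggers.
    best_next = max((phi[(next_state, t)] for t in next_triggers), default=0.0)
    old = phi[(state, trigger)]
    phi[(state, trigger)] = old + beta * (reward + rho * best_next - old)
    return phi[(state, trigger)]


# Unseen (state, trigger) pairs start at 0.0.
phi = defaultdict(float)
phi[("J2", "keep")] = 1.0                   # pretend J2/keep was already learned

new_value = q_update(phi, "J1", "keep", reward=1.0,
                     next_state="J2", next_triggers=["keep", "drop"],
                     beta=0.5, rho=0.5)
print(new_value)                            # 0.5 * (1.0 + 0.5 * 1.0 - 0.0) = 0.75
```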
3.2.2. Kafka
Kafka is used here because it provides high throughput, speed, scalability, reliability,
and replication for real-time streaming data architectures and big data collection, and
it can support real-time analytics [11].
The data extracted from public and authentic government sources would be converted into
data sets used to train our model, which in turn would help us identify the most
relevant new link or piece of information to add to the template.
3.2.5. Relevancy Factor and data classification via Naïve Bayes Theorem
Naive Bayes is a simple technique for building classifiers: models that assign class
labels to problem instances represented as vectors of feature values, where the class
labels are drawn from some finite set [15]. There is no single algorithm for training
such classifiers, but a family of algorithms based on a common principle: all naive
Bayes classifiers assume that, given the class variable, the value of a particular
feature is independent of the value of any other feature.
The Naïve Bayes classifier is used to decide which link should be given priority in the
template. The crawler extracts various links whose relevancy has yet to be checked.
Each link is compared against the trained data to estimate the probability that it is
relevant to the query.
The link with the highest probability is then chosen to be shown in the template.
Fig. 5. Relevancy factor identification (From the above graph, it is clear that
Query 2 has a higher probability of relevancy than Query 1 or 3, and so Query 2
will be added to the template.)
After all the links are compared, each of the above graphs shows one link with a higher
relevancy probability than the others. These links will be used in the template, which
would ensure higher efficiency and information relevancy.
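The ranking described in this section can be sketched with a small multinomial Naive Bayes classifier over the words of each link's text. The sketch makes several assumptions that are not in the paper: links are represented by short text snippets, there are two classes named "relevant" and "irrelevant", Laplace (add-one) smoothing is used, and the four training snippets and three query snippets are invented purely for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]


def train(samples: List[Tuple[str, str]]) -> Model:
    """Count tokens per class from (text, label) pairs."""
    token_counts: Dict[str, Counter] = {}
    class_counts: Counter = Counter()
    for text, label in samples:
        class_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, class_counts, vocab


def posterior(text: str, model: Model, target: str = "relevant") -> float:
    """Return P(target | text), treating tokens as independent given the class."""
    token_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        counts = token_counts[label]
        denom = sum(counts.values()) + len(vocab)      # Laplace smoothing
        score = math.log(n_docs / total)               # class prior
        for tok in text.lower().split():
            if tok in vocab:                           # ignore unseen words
                score += math.log((counts[tok] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())                     # normalize stably
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return math.exp(log_scores[target] - top) / norm


# Hypothetical training snippets, not data from the paper.
model = train([
    ("officer asset declaration record", "relevant"),
    ("tribunal order officer posting", "relevant"),
    ("celebrity gossip photo gallery", "irrelevant"),
    ("discount shopping offer gallery", "irrelevant"),
])

queries = {
    "Query 1": "shopping offer photo",
    "Query 2": "officer asset record",
    "Query 3": "gallery officer",
}
scores = {name: posterior(text, model) for name, text in queries.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

With this toy data the link labeled Query 2 receives the highest relevancy probability, so it is the one that would be placed in the template.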
Spring Boot, a Java-based open source framework, helps create microservices [16]. It
would make our solution more scalable.
It offers a flexible way to configure Java beans, XML configuration, and database
transactions. It also offers efficient batch processing and management of REST
endpoints. Most configuration is handled automatically by Spring Boot, so little manual
setup is required. It supports annotation-based Spring applications, eases dependency
management, and includes an embedded servlet container [17].
Microservices is an architecture that allows developers to develop and deploy services
independently. Each service runs in its own process, which enables a lightweight model
for supporting business applications.
Spring Boot provides Java developers with a good platform for developing stand-alone,
production-grade Spring applications that they can just run. With minimal configuration,
one can get started without having to set up a full Spring configuration [18].
Selenium is an open source tool designed to automate web browsers. It provides a single
interface that allows test scripts to be written in various programming languages such
as Ruby, Java, NodeJS, PHP, Perl, Python, and C#.
Selenium's versatility is part of the reason it is so popular. Anyone who codes for the
web may use Selenium to check their code or app, from individual freelance developers
running a short series of debugging tests to UI engineers conducting visual regression
tests after a new integration process [19].
3.2.10. JWT
JWT (JSON Web Token) is an Internet standard for creating JSON access tokens
which assert a number of claims. A server could, for example, generate a token that
has the claim "logged in as an admin" and provide it to a client. Then the client could
use the token to show it's signed in as admin. The tokens are signed by the private key
of one party (usually the server's), so that both parties (the other being already in
control of the respective public key by some appropriate and trustworthy means) can
check that the token is valid. The tokens are designed to be lightweight, URL-safe and
especially usable in a single-sign-on (SSO) web browser setting.
JWT [23] claims are typically used to pass the identity of authenticated users between
an identity provider and a service provider, or to carry any other type of claim that
business processes require.
Amazon Rekognition offers fast and accurate face recognition, enabling us to use our
private repository of face images to identify individuals in a photo or video. We can
also verify identity by comparing a facial image against photographs that we have
stored [24].
With Amazon Rekognition we can detect when faces appear in images and videos and get
attributes such as gender, age range, eyes open, glasses, and facial hair for each. We
can also measure how these facial attributes change over time in a video, for example
to build a timeline of an actor's expressed emotions.
As a result, even older images can be used to recognize the individual and help our
software search for data related to that particular entity.
4 Future Work
Based on the proposed solution, we can create a platform that makes investigation a
less stressful and hassle-free process and provides a business-friendly solution for
crucial business use cases. With the adoption of such a digital platform, national
security and efficient investigation become more attainable. The success rate of
investigations could be maintained by making efficient use of time and resources.
Tracking and staying updated on one's own or a client's web and social presence would
be much easier, and estimating the buzz around a brand would be convenient.
OSINT can prove to be a good choice for data extraction and processing. It helps in
obtaining the best data from the abundant stocks available on legal and accessible
public domains. This solution, when ready, can bring a new wave to the field of
investigation and real-time searching, with substantial benefits to the nation.
5 Conclusion
To establish efficient and reliable law-enforcement solutions that solve real-world
problems related to the use of data, records, and information, OSINT can be considered
a better option, since it saves time, money, and resources. In the digital age,
investigating manually and traditionally by visiting record rooms and searching through
document files is inefficient, so a software product can shorten these efforts and make
investigation a lot easier and broader in perspective. The application of Open Source
Intelligence in the investigation process can help in obtaining better, optimized,
accurate, and reliable results on a single platform, with only minimal input
information and little physical effort. It can also help track the online presence and
fame of an individual, opening up many commercial project ideas that use the same
technology in the future.
References
[1] Roger Z George, Robert D Kline, Mark M. Lowenthal. “Intelligence and the national
security strategist: enduring issues and challenges”, Rowman and
Littlefield, ISBN 9780742540392, vol. 58, pp.273-284, 2005.
[2] James Byrne, Gary Marx. “Technological Innovations in Crime Prevention and Policing.
A Review of the Research on Implementation and Impact”. Maklu-Uitgevers. ISBN 978-
90-466-0412-0, pp.17-40, 2011.
[3] Ricardo Andrés Pinto Rico, Martin José Hernández Medina, Cristian Camilo Pinzón
Hernández, Daniel Orlando Díaz López, Juan Carlos Camilo García Ruíz. Open source
intelligence (OSINT) as support of cybersecurity operations. “Use of OSINT in a
colombian context and sentiment Analysis.” Revista Vínculos: Ciencia, Tecnología y
Sociedad. Vol 15, pp.195-214, 2018.
[4] J. Pastor-Galindo, P. Nespoli, F. Gómez Mármol and G. Martínez Pérez, "The Not Yet
Exploited Goldmine of OSINT: Opportunities, Open Challenges and Future Trends," in
IEEE Access, vol. 8, pp. 10282-10304, 2020.
[5] Florian Schaurer, Jan Störger. “Guide to the Study of Intelligence. The Evolution of Open
Source Intelligence (OSINT)”. Intelligencer: Journal of U.S. Intelligence Studies. Vol 19
No 3, pp.53-56, 2010.
[6] Richard Adderley & Peter Musgrove. “Police crime recording and investigation systems –
A user’s view”. Policing: An International Journal of Police Strategies & Management.
Emerald. 24(1), pp.100-114, 2001.
[7] Clive Best, "Web Mining for Open Source Intelligence," IEEE 12th International
Conference Information Visualisation, London, 2008, pp.321-325.
[8] Giovanni Nacci. “The General Theory for Open Source Intelligence in brief. A proposal”.
Intelli|sfèra. pp.1-3, 2019.
[9] Nihad A. Hassan, Rami Hijazi. “Open Source Intelligence Methods and Tools”. Apress
Media LLC. ISBN-13 (pbk): 978-1-4842-3212-5 ISBN-13 (electronic): 978-1-4842-3213-
2. 15-18, 2018.
[10] Arjun Satheesh, Monisha Singh. “Comparative Study of Open Source Automated
Web Testing Tools: Selenium and Sahi”. Vol 10(13), ISSN (Print): 0974-6846. ISSN
(Online): 0974-5645, 2017.
[11] Philippe Dobbelaere, Kyumars Sheykh Esmaili. “Kafka versus RabbitMQ: A comparative
study of two industry reference publish/subscribe implementations: Industry Paper”,
pp.227-238, 2017.
[12] Bell, Jason. (2020). Machine Learning Streaming with Kafka. O’Reilly, ch12, pp.239-303,
2020.
[13] R. Shree, T. Choudhury, S. C. Gupta and P. Kumar, "KAFKA: The modern platform for
data management and analysis in big data domain,". 2nd International Conference on
Telecommunication and Networks (TEL-NET), 2017, pp. 1-5.
[14] X. Wang and D. Loguinov, "Load-Balancing Performance of Consistent Hashing:
Asymptotic Analysis of Random Node Join," in IEEE/ACM Transactions on Networking,
vol. 15, no. 4, pp. 892-905, Aug. 2007.
[15] Z. Zi-qiong, Y. Qiang and Li Yi-jun, "Using Naïve Bayes Classifier to Distinguish
Reviews from Non-review Documents in Chinese," 2007 International Conference on
Management Science and Engineering, Harbin, 2007, pp. 115-121.
[16] P. D. Francesco, I. Malavolta and P. Lago, "Research on Architecting Microservices:
Trends, Focus, and Potential for Industrial Adoption," 2017 IEEE International
Conference on Software Architecture (ICSA), Gothenburg, 2017, pp. 21-30.
[17] B. Christudas, “Spring Boot, Practical Microservices Architectural Patterns”, pp.147-182,
2019.
[18] K. Reddy, “Web Applications with Spring Boot - Beginning Spring Boot 2: Applications
and Microservices with the Spring Framework”, pp.107-132, 2017.
[19] R. Chen and H. Miao, "A Selenium based approach to automatic test script generation for
refactoring JavaScript code," 2013 IEEE/ACIS 12th International Conference on
Computer and Information Science (ICIS), Niigata, 2013, pp. 341-346.
[20] Iuliana Cosmina, “Building Reactive Applications Using Spring”. Pivotal Certified
Professional Core Spring 5 Developer Exam, 2020, pp.903-955.
[21] S. S. Abdhullah, K. Jyoti, S. Sharma and U. S. Pandey, "Review of recent load balancing
techniques in cloud computing and BAT algorithm variants," 2016 3rd International
Conference on Computing for Sustainable Global Development (INDIACom), New Delhi,
2016, pp. 2428-2431.
[22] S. W. Prakash and P. Deepalakshmi, "Server-based Dynamic Load Balancing," 2017
International Conference on Networks & Advances in Computational Technologies
(NetACT), Thiruvanthapuram, 2017, pp. 25-28.
[23] P. Wehner, C. Piberger and D. Göhringer, "Using JSON to manage communication
between services in the Internet of Things," 2014 9th International Symposium on
Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), Montpellier,
2014, pp. 1-4.
[24] Abhishek Mishra, “Amazon Rekognition- Machine Learning in the AWS Cloud”, John
Wiley & Sons, ch18, 2019, pp.421-444.
Appendix
f = frequency of letter in the document.
d = JSONF document.
D = total number of JSONF documents.
N = number of d in which t occurs
pi = ith person
cw = crawler
S = set of links to be searched
lst = local storage of each crawl result
G_Stack = Global stack
JSONF = final JSON
t = triggers
ß = learning rate