Cyber Forensic Analysis - A Machine Learning Prediction
Group Members:
Mridula C M
Jimna M
Parvathy Chandran T
Rinsha N
ABSTRACT:
Cyber forensic specialists use computing facilities and analytics capabilities to gain insights and
reach conclusions faster. Data analytics is transforming many fields and offers a faster route through cyber
forensic investigations. In this study, we examine the importance of Machine Learning techniques for cyber
forensics. The objective of this study is to understand the different applications of Machine Learning to
forensic analysis.
INTRODUCTION:
Earlier, cyber forensics concentrated mainly on workstations and their connected networks. Now
almost anything can be connected to the internet, and ubiquitous connectivity has extended forensics to any
device that can communicate. Data science and analytics have attracted attention across fields including
business, industry, academia, research, health, transportation and more. Analytics has proven effective at
solving computing research problems and has started to inspire other areas as well. The many devices and
applications in use produce vast amounts of data during operation, and they continue to do so.
Analysing this data during an investigation plays a major role in gathering evidence in cyber forensics.
Technology has a direct influence on our daily lives, and therefore the extraction and analysis of data
produced by devices is a serious issue in the field of cyber forensics. With the growing use of technology and
IoT devices, the number of cybercrimes is increasing. Cyber forensics deals with tackling
cybercrimes and has two models: the McKemmish model and the Kent/NIST model. It is the process of
identification, extraction, examination and analysis of data while maintaining its integrity so that it is
admissible in a court of law. The goal of digital forensics is to perform an investigation while maintaining a
documented chain of evidence, to find out exactly what happened on a computing device, when, and who was
responsible for it. Forensic investigators follow a set of procedures, and they require specialised expertise and
tools for collecting and storing data available to end users. Computer forensics covers electronic evidence
discovery, mobile forensics, cell site analysis, cloud forensics, drone forensics, Windows forensics, mac
FORENSIC PROCESS
The digital forensic process is a recognized scientific and forensic process used in digital forensics
investigations. The process is mostly used in computer and mobile forensic investigations and consists of
Diagram 1: Data Acquisition, Analysis, Reporting
Data Acquisition: Data acquisition begins with data seizure: collecting digital evidence and identifying the
suspected media using procedures that preserve the integrity of the data. To avoid losing volatile data, such as
the list of current network connections or data on battery-powered devices, we should collect the data in a
timely manner. After acquiring the digital data, create an exact duplicate image of the original data and
validate it with hash values. A hash function such as MD5, SHA-1 or SHA-256 applies a mathematical algorithm
to the digital data and returns a fixed-length bit string, the hash value. Data with identical hash values are
treated as identical. If the values match, then we can show that the evidence is still in its original state.
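The validation step can be sketched in Python as follows. This is only a minimal illustration, assuming the original media and the duplicate image are both readable as ordinary files; the function names `file_digest` and `image_matches_original` are hypothetical and not part of any forensic tool.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def image_matches_original(original_path, image_path, algorithm="sha256"):
    """True when the duplicate image hashes to the same value as the original."""
    return file_digest(original_path, algorithm) == file_digest(image_path, algorithm)


if __name__ == "__main__":
    # Demo with throwaway files standing in for the seized media and its image.
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, "original.bin")
        image = os.path.join(folder, "image.bin")
        for path in (original, image):
            with open(path, "wb") as handle:
                handle.write(b"example evidence bytes")
        print(file_digest(original))
        print(image_matches_original(original, image))
```

SHA-256 is used as the default here because MD5 and SHA-1 are no longer collision resistant; the other algorithms can still be selected by name when an existing procedure requires them.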
Data Examination and Analysis: After the duplicate image is created, the examination and analysis stage
begins on the duplicate image while preserving its integrity. Depending on the forensic request, an analyst
reports findings about different types of information such as email, log files, documents, images etc. Results of
Reporting: The analysis results should be reported. The report includes a description of the actions taken, an
explanation of how tools and procedures were used, any other actions that were performed, suggestions to
improve the existing system etc.
The rapid advancement of technology, and the increased data generation that comes with it, has created
problems in cybercrime investigations. New devices, technologies and protocols make crime analysis harder.
The veracity of the data generated by various devices and its heterogeneity demand new methodologies
and tools for data analysis. Technology is now the greatest source of big data, and analytics needs to work
effectively to produce intelligence. In the forensic methodology, once data has been identified for analysis,
investigators face problems due to real-time and big data. It is not easy for analysts to examine the given collected
A variety of tools exist for finding evidence, but the majority of them fail to solve correlation
problems and maintain consistency, which is very important for the report to be accepted by the court. The lack
of standard techniques for examining and analysing large volumes of data from multiple heterogeneous sources
creates diversity problems in analysis. Nowadays the number of devices seized from crime scenes has increased,
and many of them are potentially evidence-rich with plenty of associations, which seriously impacts the
timeliness of investigations and causes associated delays in prosecutions. An analyst also needs to answer the
1. Who or what application generated the data for analysis? This raises the identity challenge.
4. Is it related to any other offences, and what did the user do with the data?
It will be beneficial only if we utilise intelligent automated evidence collection and processing
approaches to examine the data. Data mining techniques like pre-processing and dimensionality reduction
will be useful in making data suitable for investigations. Machine learning algorithms have great potential
for biometric estimation, location tracking, detecting anomalies and deviations in digital data including
Analysis can be considered a procedure in which we first gather the required data, then analyse it
using tools and statistical models, and finally interpret the results. This is also known as knowledge discovery
from the data set or records. Data mining helps discover knowledge from any data set like clocking the identity of
a particular piece of information. This can be successfully utilised during crime investigations. We can
classify any digital information collected to reduce the effort put forth by the investigators and to gain
LITERATURE REVIEW
To handle large volumes of real-time data, we need to combine different types of forensic
technologies such as network forensics, computer (device) forensics and cloud forensics. Text
summarization, a technique for shortening long content, is one solution for handling large amounts of
information and can relieve information overload. If automatic summarization technology is supported,
then an analyst can analyse IoT data within a limited time span.
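To make the idea of automatic summarization concrete, the following is a minimal sketch of frequency-based extractive summarization, one simple technique among many and not necessarily the one used in the cited work. The function name `summarize` and the short stop-word list are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "was", "were",
    "it", "on", "for", "with", "at", "by", "from", "that", "this",
}


def _content_words(text):
    """Lower-case the text and keep the words that are not stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    frequency = Counter(_content_words(text))

    def score(sentence):
        # Average corpus frequency of the sentence's content words.
        tokens = _content_words(sentence)
        if not tokens:
            return 0.0
        return sum(frequency[t] for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Sentences whose words recur often across the whole text score highest, so repeated events in device logs tend to surface first while one-off remarks are dropped.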
Computing devices that may have been used for criminal purposes can provide forensic
evidence. The data from these devices can be used to establish, or rule out, a motive for a crime. One example
is the CCTV footage used by investigating officers to prove a crime. It is challenging when data is
collected from a variety of devices. As more and more devices are connected, identifying potentially
relevant data is critical. With this situation in mind, Darren Quick suggested data reduction and
semi-automatic subset analysis. This helps in the timely analysis of large volumes of data.
P. Rizwan et al. used a machine learning algorithm to predict traffic density using data collected
from CCTV cameras. The system can be used as a low-cost alternative for traffic congestion control.
Named entity extraction is very useful in cyber forensic analysis because it recovers meaningful
entities such as names, addresses, narcotic dealing details, vehicle specifications etc. Commonly used entity
extraction techniques such as lexical lookup, rule-based extractors, statistical techniques and machine
learning based models are quite relevant to IoT data because they improve the speed of analysis.
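Two of the techniques named above, rule-based extraction and lexical lookup, can be sketched with Python's standard `re` module. The patterns, the `PATTERNS` table and the `extract_entities` function are hypothetical and deliberately simple; real casework would need locale-specific rules or a trained model.

```python
import re

# Hypothetical rule-based patterns, kept loose for illustration.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    # Does not check that each octet is in the range 0-255.
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def extract_entities(text, known_names=()):
    """Return a dict mapping each entity label to the list of matches found in text."""
    # Rule-based extraction: one regular expression per entity type.
    entities = {label: pattern.findall(text) for label, pattern in PATTERNS.items()}
    # Lexical lookup: keep the words that appear in a supplied list of names.
    lookup = {name.lower() for name in known_names}
    entities["name"] = [w for w in re.findall(r"[A-Za-z]+", text) if w.lower() in lookup]
    return entities
```

A call such as `extract_entities(message_text, known_names=["Alice"])` returns one list per entity type, which an analyst can then compare against names or numbers gathered elsewhere in the investigation.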
Outlier analysis in data mining supports the identification of differences among files located in the same
directory of a system, which is helpful in detecting any deviation of a particular item from the others. This
also helps the investigator understand potential intrusions into the system. Discriminant analysis is one of the
techniques used to assign an incident to a matching incident, thus providing a mechanism for event
reconstruction to prove a case. Plenty of visualisation techniques are available which, along with data mining
approaches, help investigators easily spot outliers and deviations. This visualization helps a lot to
MOTIVATION OF STUDY
The internet is an essential component of today’s life and connects anyone without restrictions.
The number of internet users is growing at a rate of one billion every year. Business, banking, industry,
academia, corporate organizations, health care etc. have also moved their operations online. Some
organizations allow their employees to bring their own devices and connect them for communication. The
internet is surely an opportunity for society, but it is not safe from cyber terrorists and attacks.
Cyber forensics is a field of computer science that systematically analyses computer-related
information when crime-related activities are reported. With the advancement of technology, the number
of filed cases has crossed into the millions, and the lack of resources and skilled experts delays sentencing.
Also, because of the connected physical world and cyber-physical systems, the volume of data collected for
analysis is inevitably huge. The investigation procedure relies completely on the experience of investigators
and on expensive tools. Sometimes the reliability of the tools is also questioned in court, leading to the
dismissal of the case. It is very difficult for the analyst to conclude the report if two different tools produce
different conclusions. The cost of investigation also becomes too high when it depends on commercial software
for analysis. In this situation an analyst can rely on Machine Learning and Artificial Intelligence to gain
The increasing use of technology such as navigation facilities, automatic vehicles, cell phones, home
automation systems, smart surveillance systems etc. has increased the types of evidence available. Because of
this world of technology, conventional crimes now also need to be tied to cyber forensics.
MACHINE LEARNING FOR CYBER FORENSICS – USE CASE
1. Text Forensics
Text documents and email messages are a primary source of information and may be the main source of
digital evidence. Manual analysis of a huge amount of data is impractical, so our first task is to use
machine learning to extract information from email using Python. The main task is to find the
owner and attributes of the text document or email; we used named entity extraction to extract the
authorship of the text document or email message. The entity extraction helps to understand whether
Our project applies data mining techniques to extract information from damaged
media using the Python language. In the analysis phase we perform entity extraction, identify correlations,
sort forensic data into groups, and collect keywords through interviews. Finding keywords lets us
proceed with the case in a better way. A digital forensic investigator will be interested in gathering information
and conducting interviews regarding computer crime, child pornography, fraud, hacking, and other digital
crimes.
For the study, we downloaded a text document such as a PDF (Portable Document Format) file from
Google. To retrieve the authorship information, we applied the entity extraction approach to the text document.
To gain information about a particular keyword, we used a program written in Python to retrieve all the
matching words. The program is useful for finding particular words that are relevant to the investigation in a
large file. Extracting text from PDFs lets us parse through hundreds of PDF files and extract keywords in
order to make them searchable. Part of solving the problem was figuring out how to extract textual
data from all these PDF files. This study can provide a useful view of unknown data sets by immediately
revealing, at a minimum, who and what the information contains. As a result, an analyst would be able to
see a structured representation of all of the names of people, companies, brands, cities or countries, and
even phone numbers in a corpus, which could serve as a point of departure for further analysis and
investigation.
Step 1: Import the required libraries:
1. PyPDF2 (To convert simple, text-based PDF files into text readable by Python)
2. textract (To convert non-trivial, scanned PDF files into text readable by Python)
3. re ("re" module included with Python primarily used for string searching and manipulation)
Step 2: Read the PDF file and convert the PDF content into text.
Step 3: Find keywords in the text and get the count of each keyword.
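Steps 2 and 3 can be sketched as below. This is not the program used in the study, only a minimal illustration. It assumes PyPDF2 2.x or later (the `PdfReader` class with `pages` and `extract_text()`), and the function names `read_pdf_text` and `count_keywords` are hypothetical. Scanned PDFs would additionally need OCR, for example through textract, which is not shown.

```python
import re
from collections import Counter


def read_pdf_text(path):
    """Step 2: return the text of a simple, text-based PDF file."""
    # Imported lazily so that the keyword counting below works without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may return an empty value for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def count_keywords(text, keywords):
    """Step 3: count whole-word, case-insensitive matches of each keyword."""
    counts = Counter()
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        counts[keyword] = len(pattern.findall(text))
    return counts
```

Given the text of a document, `count_keywords(text, ["fraud", "hacking"])` returns one count per keyword. Matching is on whole words, so "hackers" does not count as "hacking".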
Output:
Figure 1
From Figure 1, investigators can get an idea of how many times the victim was involved in the crime.
Figure 2
Figure 2 may help the analyst check directly whether the person is involved in the crime or
not.
Extracting this type of information from text is very useful for the investigator to check whether the
information matches the information collected through the interview. This would be helpful to obtain
CONCLUSION
With new technologies, existing forensic procedures are not adequate, and the field of
digital forensics needs to be extended to data analytics and machine learning. This paper identifies the use of data mining and
machine learning for crime investigations in big data processing. Analysing some of the works available in
the literature has also shown that data mining is useful in analysing digital evidence. This paper explores the
incorporation of analytic techniques into forensic investigation. The presented use cases give an overview of the
possibilities of Machine Learning in forensic analysis. Future work will enquire into the possibilities of
REFERENCES
1. Ovie L. Carroll, Stephen K. Brannon, Thomas Song, “Computer Forensics: Digital Forensic
2. Rami Mustafa A Mohammad, Mohammed Alq, “A comparison of machine learning techniques
for file system forensics analysis”, Journal of Information Security and Applications, March 2019.
3. Deepti Sehrawat, Nasib Singh Gill, “Data Mining in IoT and its Challenges”, International Journal
4. Francesco Servida, Eoghan Casey, “IoT forensic challenges and opportunities for digital traces”,
Science Direct, Digital Investigation, Volume 28, Supplement, April 2019, Pages S22-S29.
5. Darren Quick et al., “IoT Device Forensics and Data Reduction”, IEEE Access, 2018.
6. P. Rizwan, K. Suresh, M. R. Babu, Real time smart traffic management system for smart cities by
7. Raburu George, Omollo Richard, Okumu Daniel, “Applying Data Mining Principles in the
Extraction of Digital Evidence”, International Journal of Computer Science and Mobile Computing.
8. George Forman, Kave Eshghi, and Stephane Chiocchetti (2005), “Finding similar files in large
document repositories”, in KDD ’05: Proceedings of the Eleventh ACM SIGKDD International
Conference on Knowledge Discovery in Data Mining, pages 394–400, ACM, New York, NY, USA,
ISBN 1-59593-135-X.
9. Kumar Shanu Singh, Annie Irfan and Neelam Dayal, “Cyber Forensics and Comparative Analysis of
10. Asaf Varol and Yeşim Ülgen Sönmez, “Review of evidence analysis and reporting phases in digital
forensics process”, 2017 International Conference on Computer Science and Engineering (UBMK)