PBL-2 Report File

The document outlines a project on phishing URL detection using machine learning techniques, specifically focusing on algorithms like Decision Tree, Random Forest, and Autoencoder Neural Network. The project aims to develop a model that can effectively identify malicious URLs to protect users from online scams. The results indicate that the Random Forest algorithm achieved the highest accuracy of 87% in detecting fake URLs.

Uploaded by

Samarth Agrawal
Copyright
© All Rights Reserved
PROJECT BASED LEARNING (PBL-2) LAB


(CSP297)
PHISHING : Fake URL Detection

B.TECH 2nd YEAR


SEMESTER: 4th
SESSION: 2023-2024
Submitted By:

Abhinav Thapliyal (2022354445)


Kashish (2022484080)
Samarth Agrawal (2022004067)

SECTION: J

Submitted To

Ms. Saptadeepa Kalita


Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

SHARDA SCHOOL OF ENGINEERING & TECHNOLOGY

SHARDA UNIVERSITY, GREATER NOIDA


Table of Contents
Project Title
Team / Group Formation
Technologies to be used
Tools
Problem Statement
Literature Survey
Project Description
Project Modules: Design/Algorithm
Implementation Methodology
Result & Conclusion
Future Scope and further enhancement of the Project
Advantages of this Project
Outcome
References
Project Title
Phishing : Fake URL Detection

Team / Group Formation:

S. No   Student Name        Roll Number   System ID    Role
1.      Abhinav Thapliyal   2201010021    2022354445   Implementation
2.      Kashish             2201010322    2022484080   Implementation
3.      Samarth Agrawal     2201010624    2022004067   Implementation

Technologies to be used
Software Platform

a) Programming Language: Python


b) Machine learning models and utilities: train_test_split, Random Forest, Decision Tree,
Autoencoder neural network.
c) Data handling and visualization libraries: pandas, Seaborn and Matplotlib

Hardware Platform

a) Operating System
b) Computing Resources: GPUs (Graphics Processing Units)
c) Memory and Storage: RAM
d) Networking: Reliable internet connectivity is essential for data collection, model
training, and deployment.

Tools
1. Machine Learning:

• Tool Name: scikit-learn

• Vendor Name: Scikit-learn developers

• Version No.: Version 0.24.2

• Purpose: Scikit-learn is a machine learning library used for building and training the
phishing URL detection models.
2. Data Visualization:

• Tool Name: Matplotlib

• Vendor Name: Matplotlib development team

• Version No.: Version 3.4.3

• Purpose: Matplotlib is used for creating visualizations that present the model results
graphically.

3. Jupyter Notebook:

• Tool Name: Jupyter Notebook

• Vendor Name: Project Jupyter

• Version No.: Varies with the Jupyter distribution version

• Purpose: Jupyter Notebook is an interactive environment for documenting and sharing the
project's code and analysis

Problem Statement
The problem addressed by this project is the development of algorithms capable of
detecting malicious URLs, which online attackers use to scam people. These URLs
closely imitate genuine ones and are used to obtain users' sensitive information.
In this project, we test different machine learning and neural network models and
evaluate which is best suited to the problem. The models analyze various features of
a URL, such as its domain name, content and behaviour. The main purpose of the
project is to provide an effective tool that detects phishing websites and helps users
avoid accessing them.
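
As an illustration of what "features of a URL" can look like in practice, here is a minimal sketch that derives a few lexical features using only the Python standard library. The feature set, the function name extract_url_features and the example URL are our own assumptions for illustration; they are not the features used in the project's dataset.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a few simple lexical features of a URL (illustrative choice only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        # An '@' anywhere in the URL is often used to disguise the real host.
        "has_at_symbol": "@" in url,
        # A raw IPv4 address in place of a domain name is a common warning sign.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
    }


# Hypothetical URL, made up for this example.
features = extract_url_features("http://192.168.0.1/login@secure-bank.example.com")
for name, value in features.items():
    print(f"{name}: {value}")
```

Each URL in a dataset could be turned into one such row of values, which a classifier can then consume.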

Literature Survey
Project Description
The phishing URL detection model is a data-driven solution designed to help
individuals avoid being scammed by fraudulent websites. Using machine learning
algorithms and a dataset of URL features, the model predicts whether a website is a
phishing site.

Phishing attacks pose a significant threat to cybersecurity. Because attackers are
constantly developing new techniques to evade detection, traditional detection
techniques must also evolve. The aim of our project is to detect and stop this threat
so that users do not become victims.

Project Modules: Design/Algorithm


Fake URL detection is the task of determining whether a given URL is legitimate or
illegitimate. The following is a high-level design for performing fake URL
detection. There are multiple approaches and algorithms; we outline a common one
based on machine learning techniques.

These are the steps that we carried out in our project:

1) Dataset Collection

2) Data Preprocessing

3) Training Model

4) Model Testing

5) Deployment of model

Implementation Methodology
A. DATASET COLLECTION
Data collection is the process of gathering, recording and organizing information
for reference or analysis. It involves systematically collecting data from various
sources such as interviews, observations and sensors. Effective data collection is
essential for research, business intelligence and many other fields in which data
plays a crucial role.
B. DATA PREPROCESSING

Preprocessing is the technique of converting raw data into an ordered form. Processed
data is easier to interpret and easier to use. Preprocessing techniques are applied so
that the data yields accurate, high-quality results. The essential tasks in data
preprocessing are cleaning, integration, transformation and reduction. Real-world data
usually contains missing values, noise, and sometimes extraneous information or an
impractical format, so models cannot use it as it is. Data preprocessing is therefore
required to handle missing values, clean the data and make it suitable for training
the model. Data can be cleaned using a technique called binning, in which we sort the
data and then partition it into equal-frequency bins.
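
The binning step can be sketched in a few lines of plain Python. The sketch below sorts the values and partitions them into equal-frequency bins, as described above. The second function, which replaces each value by its bin mean to smooth out noise, is a common follow-up that we add as an assumption; the report does not say how the bins were used. The url_lengths values are made up for the example.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_bin_mean(bins):
    """Replace every value with the mean of its bin (assumed smoothing step)."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]


# Made-up feature values, e.g. URL lengths in characters.
url_lengths = [23, 54, 31, 120, 45, 67, 29, 88, 40]
bins = equal_frequency_bins(url_lengths, n_bins=3)
smoothed = smooth_by_bin_mean(bins)
print(bins)  # [[23, 29, 31], [40, 45, 54], [67, 88, 120]]
```

With pandas available, pandas.qcut offers a library route to quantile-based bins; the hand-written version is shown only to make the sort-then-partition idea explicit.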

Fig - Flow Diagram for the Training and Testing of the Model

C. MODEL TRAINING

First, the preprocessed dataset is divided into two parts: training data and testing
data. 70-80 percent of the data is used to train our ML and neural models, and the
remainder is used for model testing. Model training means fitting the various
algorithms to the training dataset; testing then checks the accuracy and correctness
of the trained model. If the model is trained correctly and on a large quantity of
data, its prediction accuracy on the test data generally increases.
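
The split-then-train procedure can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the data is a synthetic stand-in produced by make_classification rather than the project's URL dataset, and the 80/20 ratio, random_state values and n_estimators setting are illustrative choices, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed URL feature table (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the rows for testing, i.e. an 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # training uses only the training part
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the held-out part
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

Fixing random_state makes the split and the trained models reproducible from run to run, which helps when comparing algorithms on equal terms.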
D. MODEL TESTING

In machine learning, model testing refers to the operation in which the performance of
a trained model is evaluated on the testing dataset that we set aside earlier. Once a
model has been trained on the training data, it produces predictions for the processed
test dataset, and its accuracy is measured on them.

Testing explicitly identifies which part of the code fails and provides a relatively
coherent coverage measure. In machine learning testing, the programmer supplies input
and observes the behaviour and logic of the machine. Hence, the purpose of testing in
machine learning is to verify that the logic learned by the machine remains consistent.
The logic should not vary even if the program is called multiple times.

Result & Conclusion


Result
In this research paper we mainly used three techniques: Decision Tree, Random Forest
and Autoencoder Neural Network. Random Forest gives the best accuracy of the three,
around 0.87, which means that the model correctly classifies about 87% of the URLs it
is evaluated on.
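
To make the meaning of an accuracy figure concrete, the sketch below computes accuracy, precision and recall from true and predicted labels in plain Python. The labels are made up for illustration (1 = phishing, 0 = legitimate) and are not the project's results. Accuracy counts correct decisions on both classes, whereas recall measures the share of phishing URLs that are actually caught.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def summarize(y_true, y_pred):
    """Return accuracy, precision and recall as a dictionary."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = summarize(y_true, y_pred)
print(metrics)  # {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75}
```

Reporting precision and recall next to accuracy is useful when phishing and legitimate URLs are not equally frequent in the data.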

Table - Training and Testing Accuracy of the techniques used

Conclusion

In summary, this research paper concentrates on techniques that can differentiate
between real and fake URLs. Autoencoder Neural Network, Random Forest and Decision
Tree algorithms were used. After comparing them, we concluded that Random Forest gives
better accuracy than the other two techniques. Our research also highlighted the
importance of fairness and ethics in using this technology: it is crucial to make sure
our predictions are unbiased and transparent. Although our work is successful, there
is still much to explore to make our model more reliable so that it can detect fake
URLs more accurately.

Future Scope and further enhancement of the Project


Future Scope: The future scope of the phishing URL detection project is wide and
promising, as cyberthreats are continuously evolving and the number of users on the
internet is increasing day by day all around the world.

Here are some potential areas of development:

1. Improved Models and Algorithms: As machine learning techniques continue to
advance, more sophisticated algorithms and models will be developed to better
capture the depth and behaviour of URLs. This could include the use of deep
learning, reinforcement learning, and other advanced techniques.

2. Cross-Platform Protection: As the number of users increases day by day, the number
of devices and browsing platforms increases as well. In future URL detection,
cross-platform protection will play a crucial role, requiring real-time detection in
web browsers, applications, emails, etc.

3. Integration with Real-Time Data Streams: Machine learning models may be
integrated with real-time data streams to provide real-time detection based on URL
behaviour and other factors.

4. User Education and Awareness: We also need to empower users with the knowledge
and the tools used to detect phishing URLs, so that they can avoid attackers and
protect themselves from becoming victims of a cyberattack.

Enhancement of the Project –

To enhance the phishing URL detection project, we will implement multiple machine
learning models as well as one neural network model, and use good-quality data, which
will help determine the most suitable algorithm/model for this project.

Additionally, we can implement deep learning algorithms for better results. A
continuous monitoring and feedback mechanism will be established to support ongoing
model refinement, and comprehensive documentation will be maintained for
transparency and reproducibility. This approach aims to deliver a highly accurate,
transparent and user-friendly phishing URL detection tool.

Advantages of this Project


Phishing URL detection using machine learning offers several advantages:

1. Improved Accuracy: ML models can process large amounts of data and identify
patterns that may not be immediately apparent to humans. This can lead to more
accurate predictions compared to traditional methods.

2. Automation and Efficiency: Once an ML model is trained, it can automate the
prediction process, saving time and effort compared to manual calculations or
analysis.

3. Scalability: ML models can be scaled to handle a large number of URLs and a wide
range of features, so that users can browse websites without worry.

4. Prevention of Data Breaches: It helps prevent data breaches by checking and
blocking access to any harmful or malicious website/URL crafted to obtain sensitive
information such as users' addresses, bank details and email passwords.

5. Minimized Incident Cost: The model helps reduce the cost of incident response,
such as investigations and other activities.

Outcome
NA
References

[1] Huang, Y., Qin, J., & Wen, W. (2019). Phishing URL detection via capsule-based neural
network. 2019 IEEE 13th International Conference on Anti-Counterfeiting, Security,
and Identification (ASID). https://doi.org/10.1109/icasid.2019.8925000

[2] Gajera, K., Jangid, M., Mehta, P., & Mittal, J. (2019). A novel approach to detect
phishing attack using artificial neural networks combined with pharming detection.
2019 3rd International Conference on Electronics, Communication and Aerospace
Technology (ICECA). https://doi.org/10.1109/iceca.2019.8822053

[3] Rahmadeyan, A., Mustakim, M., Ahmad, I., Alexander, A. D., & Rahman, A. (2023).
Phishing website detection with Ensemble Learning Approach using artificial neural
network and AdaBoost. 2023 International Conference on Information Technology
Research and Innovation (ICITRI). https://doi.org/10.1109/icitri59340.2023.10249799

[4] McGinley, C., & Monroy, S. A. (2021). Convolutional neural network optimization for
phishing email classification. 2021 IEEE International Conference on Big Data (Big
Data). https://doi.org/10.1109/bigdata52589.2021.9671531

[5] S, J., & Eliyas, S. (2023). Detecting phishing attacks using Convolutional Neural Network
and LSTM. 2023 3rd International Conference on Advance Computing and Innovative
Technologies in Engineering (ICACITE).
https://doi.org/10.1109/icacite57410.2023.10183234

[6] Abutaha, M., Ababneh, M., Mahmoud, K., & Baddar, S. A.-H. (2021). URL phishing
detection using machine learning techniques based on urls lexical analysis. 2021 12th
International Conference on Information and Communication Systems (ICICS).
https://doi.org/10.1109/icics52457.2021.9464539

[7] Zhang, L., Zhang, P., Liu, L., & Tan, J. (2021). Multiphish: Multi-modal features fusion
networks for phishing detection. ICASSP 2021 - 2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
https://doi.org/10.1109/icassp39728.2021.9415016

[8] Sujatha, G., Ayyannan, M., Priya, S. G., Arun, V., Arularasan, A. N., & Kumar, M. J.
(2023). Hybrid optimization algorithm to mitigate phishing URL attacks in Smart
Cities. 2023 3rd International Conference on Innovative Practices in Technology and
Management (ICIPTM). https://doi.org/10.1109/iciptm57143.2023.10118171

[9] Chen, Y., Zhou, Y., Dong, Q., & Li, Q. (2020). A malicious URL detection method based
on CNN. 2020 IEEE Conference on Telecommunications, Optics and Computer
Science (TOCS). https://doi.org/10.1109/tocs50858.2020.9339761

[10] S Tambe, Y., & Mohammad, S. (2023). Phishing URL detection using machine learning.
Journal of Advanced Research in Production and Industrial Engineering, 10(01), 1–5.
https://doi.org/10.24321/2456.429x.202301

[11] Al-Milli, N., & Hammo, B. H. (2020). A convolutional neural network model to detect
illegitimate urls. 2020 11th International Conference on Information and
Communication Systems (ICICS). https://doi.org/10.1109/icics49469.2020.239536

[12] D, N., B, G. S., Poongodi, C., K, J., P, N., & J, J. (2022). Autoencoder based feature
selection for phishing URL attack detection in IOT using stacked Autoencoder (AFS-SAE).
2022 13th International Conference on Computing Communication and
Networking Technologies (ICCCNT).
https://doi.org/10.1109/icccnt54827.2022.9984574

[13] Kankrale, Prof. R. (2021). Phishing website detection using Machine Learning.
International Journal for Research in Applied Science and Engineering Technology,
9(VI), 3216–3220. https://doi.org/10.22214/ijraset.2021.35671

[14] Nabila, O. G., Wicaksono, H. R., Girinoto, A., Yasa, R. N., & Setiawan, H. (2023).
Benchmarking model URL features and image based for phishing URL detection. 2023
International Conference on Informatics, Multimedia, Cyber and Informations System
(ICIMCIS). https://doi.org/10.1109/icimcis60089.2023.10349059

[15] Swathi, G., Shwetha, M., Potluri, P., Murthy Raju, K., Kumar, Y., & Rajchandar, K.
(2023). Smart cities hybridized to prevent phishing URL attacks. 2023 Second
International Conference on Electronics and Renewable Systems (ICEARS).
https://doi.org/10.1109/icears56392.2023.10085315

Signature

Student Name        Student Sign   Faculty Name            Faculty Sign
Abhinav Thapliyal                  Ms. Saptadeepa Kalita
Kashish
Samarth Agrawal
