PBL-2 Report File
SECTION: J
Submitted To
Technologies to be used
Software Platform
Hardware Platform
a) Operating System
b) Computing Resources: GPUs (Graphics Processing Units)
c) Memory and Storage: RAM
d) Networking: Reliable internet connectivity is essential for data collection, model
training, and deployment.
Tools
1. Machine Learning:
• Purpose: Scikit-learn is a machine learning library for building and training the phishing
URL classification models.
2. Data Visualization:
• Purpose: Matplotlib is used to create visualizations that present the model results
graphically.
3. Jupyter Notebook:
• Purpose: Jupyter Notebook is an interactive environment for documenting and sharing the
project's code and analysis.
Problem Statement
This project involves developing models capable of detecting malicious URLs, which online
attackers use to scam people. These URLs closely imitate genuine URLs and are used to
obtain sensitive information from the user. We test different machine learning and neural
network models and evaluate which is best suited to this problem. The models analyze
various features of a URL, such as its domain name, content and behaviour. The main
purpose of this project is to provide an effective tool that detects phishing URLs and helps
users avoid accessing phishing websites.
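To make the idea of "features of a URL" concrete, the following is a minimal sketch of lexical feature extraction using only the Python standard library. The function name `extract_url_features` and the specific features chosen are illustrative assumptions, not the feature set used in this project.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a few simple lexical features of a URL.

    The feature list is a hypothetical example, not the project's actual one.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # A dotted-quad host (raw IP address) instead of a domain name.
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "num_digits": sum(ch.isdigit() for ch in url),
    }


if __name__ == "__main__":
    for name, value in extract_url_features("http://192.168.0.1/login-update@secure").items():
        print(f"{name}: {value}")
```

Each URL becomes one row of numeric or boolean values, which is the form a classifier can be trained on.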
Literature Survey
Project Description
The phishing URL detection model is a data-driven solution designed to help individuals
avoid being scammed by fraudulent websites. Using machine learning algorithms and a
dataset of URL features, the model predicts whether a website is a phishing site.
Phishing attacks pose a significant threat to cybersecurity. Because hackers constantly
develop new techniques to evade detection, traditional detection techniques must also
evolve to detect and stop the threat. That is the aim of our project: to prevent users from
becoming victims.
The project follows these steps:
1) Dataset Collection
2) Data Preprocessing
3) Training Model
4) Model Testing
5) Deployment of model
Implementation Methodology
A. DATASET COLLECTION
Data collection is the process of gathering, recording and organizing information for
reference or analysis. Its purpose is to systematically collect data from various sources such
as interviews, observations and sensors. Effective data collection is essential for research,
business intelligence and many other fields where data is crucial.
B. DATA PREPROCESSING
Data preprocessing is the technique of converting raw data into ordered data. Once the data
is processed, it becomes easier to interpret and use. Preprocessing is applied so that the data
yields accurate, high-quality results. The essential tasks in data preprocessing are cleaning,
integration, transformation and reduction. Real-world data usually contains missing values,
noise and sometimes extra information or an impractical format, so models cannot use it as
it is. Data preprocessing is therefore required to handle missing values, clean the data and
make it suitable for model training. Data can be cleaned using a technique called binning, in
which we sort the data and then partition it into equal-frequency bins.
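Equal-frequency binning as described above can be sketched in plain Python. The function names and the sample values are hypothetical. Smoothing each bin by its mean is one common way to use the bins for noise reduction and is an assumption here, since the report names only the binning step.

```python
def equal_frequency_bins(values, num_bins):
    """Sort the values and split them into num_bins bins of (nearly) equal size."""
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for i in range(num_bins):
        start = i * n // num_bins
        end = (i + 1) * n // num_bins
        bins.append(ordered[start:end])
    return bins


def smooth_by_bin_means(bins):
    """Replace every value with the mean of its bin (assumed smoothing step)."""
    return [[sum(b) / len(b)] * len(b) for b in bins if b]


if __name__ == "__main__":
    # Hypothetical noisy feature values, e.g. URL lengths.
    url_lengths = [21, 4, 34, 8, 25, 15, 28, 21, 24]
    bins = equal_frequency_bins(url_lengths, 3)
    print(bins)
    print(smooth_by_bin_means(bins))
```

With nine values and three bins, each bin holds three of the sorted values, so outliers only influence the mean of their own bin.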
Fig - Flow Diagram for the Training and Testing of the Model
C. MODEL TRAINING
First, the pre-processed dataset is divided into two parts: training data and testing data.
70-80 percent of the data is used to train our ML and neural models, and the remainder is
used for model testing. Model training fits the various algorithms to the training dataset,
and testing then checks the accuracy and correctness of the resulting model. If the model is
trained correctly and on a large quantity of data, its prediction accuracy on the test data
generally improves.
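The split-and-train step can be sketched with scikit-learn, which is listed under Tools. This is a minimal sketch under stated assumptions: the data is synthetic (generated by `make_classification`) as a stand-in for the pre-processed URL feature table, the 80/20 ratio is one choice within the 70-80 percent range, and the hyperparameters are illustrative rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pre-processed URL feature table (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out 20% of the rows for testing; stratify keeps the class ratio
# the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit a Random Forest on the training part only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Accuracy on rows the model has never seen.
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Fixing `random_state` makes the split and the trained model reproducible between runs, which helps when comparing several algorithms on the same data.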
D. MODEL TESTING
In machine learning, model testing refers to the operation in which the performance of a
trained model is evaluated on the testing dataset that we set aside earlier. Once a model has
been trained on the training dataset, it produces its output for the processed test dataset,
from which its accuracy is measured.
Testing explicitly identifies which part of the code fails and provides a relatively coherent
coverage measure. In machine learning testing, the programmer supplies input and observes
the behaviour and logic of the model. Hence, the purpose of testing a machine learning
model is to demonstrate that the logic it has learned remains consistent. The logic should not
vary even if the program is called multiple times.
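The evaluation step can be illustrated by comparing predicted labels with the known labels of the test set. The sketch below uses plain Python; the function names and the label lists are made up for illustration (1 = phishing, 0 = legitimate) and are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall as a dictionary."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Illustrative labels only, not project results.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
    print(evaluate(y_true, y_pred))
```

For phishing detection, recall shows how many phishing URLs were caught, while precision shows how many flagged URLs were really phishing, so both are worth reporting alongside accuracy.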
Conclusion
In summary, this report concentrates on techniques that can differentiate between real and
fake URLs. Autoencoder Neural Network, Random Forest and Decision Tree algorithms
were used. After comparing them, we concluded that Random Forest gives better accuracy
than the other two techniques. Our work also highlighted the importance of fairness and
ethics in using this technology: it is crucial to make sure our predictions are unbiased and
transparent. Although our work was successful, there is still much to explore to make our
model more reliable so that it can detect fake URLs more accurately.
4. User education and awareness: We also need to equip our users with the knowledge
and tools used to detect phishing URLs, so that they can avoid hackers and protect
themselves from becoming victims of cyberattacks.
To enhance the Phishing URL detection project, we will implement multiple machine
learning models as well as one neural network model, and use good-quality data, to
determine which algorithm/model is best suited to this project.
Additionally, we can implement deep learning algorithms for better results.
A continuous monitoring and feedback mechanism will be established to support
ongoing model refinement, and comprehensive documentation will be maintained for
transparency and reproducibility. This comprehensive approach aims to deliver a highly
accurate, transparent and user-friendly Phishing URL detection tool.
1. Improved Accuracy: ML models can process large amounts of data and identify
patterns that may not be immediately apparent to humans. This can lead to more
accurate predictions compared to traditional methods.
3. Scalability: ML models can be scaled to handle a large number of URLs and a wide
range of features, so that users can browse sites without worry.
4. Prevention of data breaches: The model helps prevent data breaches by checking and
blocking access to any harmful or malicious website/URL crafted to obtain users'
sensitive information, such as their addresses, bank details and email passwords.
5. Minimize the cost of incidents: The model helps reduce the cost of incident
response, such as investigations and other activities.
Outcome
NA
Signature