Communications on Applied Nonlinear Analysis
ISSN: 1074-133X
Vol 32 No. 6s (2025)
Automated Phishing Detection Through URL Analysis and Machine
Learning
Vijaykumar1, Basavaraj G N2, Mohan Bangalore Anjaneyalu3, Swetha M S4, E G Satish5
1,2,3,4 Department of Information Science, BMS Institute of Technology and Management, Bengaluru, India
5 Department of Computer Science and Engineering, Nitte Meenakshi Institute of Technology, Bangalore, India
Abstract
Phishing attacks are among the greatest threats to cybersecurity. They use deception to make
users provide important personal information via fake websites or emails. This paper presents a
machine learning-based phishing detection tool that classifies URLs as “phishing,” “suspicious,” or
“safe.” Utilizing a Random Forest classifier, the system examines URL-based characteristics,
including URL length, special symbols, and the usage of HTTPS, to distinguish real URLs from
fake (phishing) URLs with high accuracy. The model was trained and validated on a dataset of
labeled URLs, achieving a classification accuracy of 95.2%, higher than that of the other models
compared. For usability, the detection tool is implemented as a web application with a
user-friendly interface that performs real-time classification and displays the results. The
effectiveness of the model is assessed using measures such as precision, recall, and F1-score.
This paper aims to improve online security by providing an automated approach that reduces
dependence on human judgment and can efficiently detect phishing threats.
Keywords: phishing detection, machine learning, URL classification, Random Forest,
cybersecurity, real-time detection, Web application
1. Introduction
As the web has become part of almost every aspect of life, cybersecurity has become a basic
necessity for individuals, businesses, and governments. As more transactions and information
sharing move online, threats such as phishing have become more advanced and dangerous. This type
of attack tricks users into providing important information by pretending to originate from a
trusted source, and it relies more on human failings than on technological ones. This paper
addresses phishing by constructing an adaptive online tool based on machine learning that
categorizes websites as phishing, suspicious, or safe and protects against fast-changing threats.
A. Motivation
Due to the sophistication and frequency of phishing targeting businesses, a number of ML-based
anti-phishing tools have been developed. In contrast with other approaches, ML models can
adaptively learn the features peculiar to real phishing URLs, including the URL structure, the
domains, and the HTTPS usage that are typically involved. ML-based systems learn from detected phishing and
legitimate URLs, so they can increase the chances of detecting new types of phishing attacks in the future.
This paper is motivated by the lack of an efficient and convenient way to prevent or identify
phishing URLs in real time. A machine learning approach can perform real-time detection that
counters the new techniques a phisher may devise.
In addition, the tool is deployed as a web application in which users enter a URL and receive an
immediate classification, which is critical for the timely prevention of threats.
B. Objectives
The primary objective of this paper is to develop a robust machine learning model that can accurately
classify URLs as “phishing,” “suspicious,” or “safe.” Specific objectives include:
1. Design and Train the Model: Develop a machine learning model using a Random Forest classifier,
which is well-suited for classification tasks, to analyse URL-based features indicative of phishing
behaviour.
2. Feature Engineering: Identify and extract relevant URL features, such as URL length, presence of
special characters, HTTPS usage, and redirection patterns, that distinguish phishing URLs from
legitimate ones.
3. Model Evaluation: Evaluate the performance of the model using key metrics like accuracy,
precision, recall, and F1-score to ensure high reliability in detecting phishing URLs. Achieving a
classification accuracy of at least 90% on test data is a key performance goal.
4. Deployment as a Web Application: Build a user-friendly, web-based interface for the tool, allowing
users to submit URLs and receive real-time classification results. The application will incorporate
a Flask API for backend processing and a React-based frontend for a seamless user experience.
5. Integration of a Whitelist: Implement a whitelist of known legitimate domains to reduce false
positives and improve user trust, ensuring that the tool provides accurate and reliable classification.
6. Enhance Cybersecurity Awareness: Ultimately, the paper aims to support users by providing a tool
that proactively helps them identify and avoid phishing threats, thereby contributing to safer online
interactions.
C. Background
Over the past two decades, the rollout of innovative technologies has led to a new wave of online
trading and communication that affects every facet of people’s lives, whether personal, business,
or political. While this shift has brought numerous advantages, it has also brought new forms of
risk. Of these, phishing is one of the most widespread and threatening. Phishing scams deceive
people into handing over personal data such as usernames and passwords. Most often, phishing is
carried out through bogus emails, websites, or instant messages that imitate a legitimate
organization to deceive the user into providing personal details. Security organizations, including
the Anti-Phishing Working Group (APWG), report that phishing attacks are becoming more numerous and
diverse, increasing the dangers for internet users and businesses. The advanced techniques employed
today include domain obfuscation, use of HTTPS, and URL shortening, to name but a few, to disguise
a fake URL. As a result, these attacks evade many standard protection mechanisms and succeed by
exploiting human error.
Fig. 1. Flow of the proposed work
2. Literature Review
The thesis Phishing Website Detection explores a machine learning solution for identifying phishing
sites, using Logistic Regression and Naive Bayes models for URL analysis. This approach, supported
by PhishTank data and deployed with FastAPI, achieves real-time detection but has limitations,
including a narrow algorithm set and minimal focus on user education. Future work suggests
expanding the set of algorithms, increasing user awareness, and incorporating transfer learning for
improved, broader detection. [1]
This cybersecurity solution targets the growing threat of phishing by proposing a browser extension
that uses a Random Forest model trained on 26 URL and content features, achieving 98.8% accuracy.
The model offers real-time, highly accurate detection and carefully selected features. However, it
depends on feature quality, may face computational demands, and requires updates to handle new
phishing tactics. Future improvements could include integration with other security tools, user
feedback mechanisms, expanded feature sets, and cross-platform support. [2]
This report addresses phishing detection by using machine learning for URL classification. Compared
with other algorithms, the Random Forest algorithm achieved a higher accuracy: 94.8% on the
PhishTank training data set and 95.87% on the UCI training data set. However, despite its high
performance, the Random Forest algorithm has high computation and training-time costs. Future work
aims to improve the algorithm’s performance by employing deep learning or ensemble methods to
filter out false positives. [3]
Another study offers a machine learning system for detecting phishing websites as an effective
means of combating deceptive websites. The model deals with URL and HTML features
together with algorithms such as Random Forest, SVM, and Logistic Regression; Random Forest
achieves an accuracy of 94%. The main advantages include a very high detection rate and the ability
to respond automatically; the drawbacks are that feature selection is decisive and the algorithm may
struggle with large amounts of data. Further work seeks to add deep learning features and make the
models more responsive to changes in phishing methods. [4]
To reduce the rate of phishing attacks that are currently difficult to detect, the report outlines
a three-phase model that integrates DNS blacklists, heuristics, and web crawlers. Neural networks
(NN), SVM, and random forests (RF) predict phishing sites using URL and web traffic characteristics
as the input vector. NN recorded 93.18% accuracy, followed by SVM and RF. Strengths include high
accuracy and real-time detection, while the main limitation is the inability to handle zero-day
phishing threats. Future work plans to improve these models while expanding the feature set to
combat new kinds of phishing. [5]
Another study uses machine learning to identify phishing websites because typical blacklists and
heuristics cannot cope with evolving attacks and may generate many false positives. Decision Tree,
Random Forest, and SVM models examine key features, including the use of IP addresses, symbols in
links, and URL length. Random Forest offered the best trade-off, with the highest accuracy (95.14%)
and the fewest false positives. Despite its effectiveness and ease of application, this strategy
requires large datasets and struggles with complex phishing methods. Future developments may
combine blacklists with deep learning to obtain higher accuracy. [6]
A phishing detection system that uses the SVM algorithm to analyse URL features is described, with
a high accuracy of 95.66% and the ability to detect newly emerged phishing sites with very few
false positives. Its main advantage is high accuracy with fewer features. As a drawback, relying
only on feature extraction can sometimes overlook some indicators of phishing. Further enhancements
could examine additional approaches to feature selection and the use of other classifiers to
improve the detection rate. [7]
This work concerns the detection of phishing, with specific consideration of zero-day attacks that
cannot be detected by blacklists. The solution involves seven classifiers with feature extraction
done by NLP; of the seven, the Random Forest classifier gave the best result, with 95.98% accuracy.
Significant advantages include real-time capability, language neutrality, and the ability to detect
new phishing sites without external tools. However, it has some drawbacks, such as problems with
very short URLs and heavy computation. The research identifies future directions, including
improving the feature selection process and using combined approaches to improve detection. [8]
This research focuses on identifying phishing sites, particularly zero-hour attacks, using machine
learning and deep learning approaches such as CNN, with a reported accuracy of 99.98%. The paper
reviews 80 articles, tracing trends in datasets, algorithms, and performance measures. The main
advantages of machine learning-based solutions are high detection rates and flexibility, while
their main drawbacks are that they either rely on datasets that should be updated regularly or
have issues related to zero-day attacks. Future work will refine deep learning models and consider a
combination of approaches in order to improve their performance. [9][10]
This survey examines new and advanced phishing methods and reviews AI-based detection approaches,
including machine learning (ML), deep learning (DL), hybrid learning (HL), and scenario-based (SB)
techniques. It compares these approaches, stressing that, while methods such as neural networks
provide high accuracy, they are computationally intensive and data sensitive. The study implies the
need for further research on integrated approaches that combine several methods to increase the
effectiveness and flexibility of threat detection in the context of new threats. [11][12]
3. Problem Statement
Phishing attacks have become one of the most common and dangerous cybersecurity threats, exploiting
human vulnerabilities to steal sensitive information by impersonating trusted entities. Traditional
phishing detection methods, such as blacklists and heuristic-based approaches, are increasingly
inadequate in the face of modern, sophisticated phishing tactics. These conventional methods struggle
to keep up with the rapid creation and deactivation of phishing websites, which often have short
lifespans, and fail to detect novel phishing techniques that use URL obfuscation, HTTPS encryption,
and domain manipulation to appear legitimate. Given the growing scale and complexity of phishing
attacks, there is a critical need for an adaptive, automated detection system that can analyse URLs in
real-time to accurately classify them as “phishing,” “suspicious,” or “safe.” This paper addresses this
need by developing a machine learning-based phishing detection tool, leveraging a Random Forest
classifier to dynamically detect phishing URLs based on URL-specific features, thus providing a
reliable solution to enhance user security in online environments.
4. Proposed Method
The proposed method involves developing a phishing detection tool that uses a machine learning model
to classify URLs as “phishing,” “suspicious,” or “safe.” The method combines feature extraction,
model training, and deployment as a web-based tool for real-time detection. The primary steps in the
proposed method are as follows:
A. Data Collection and Preprocessing
The first step is to gather a labeled dataset containing both phishing and legitimate URLs. This dataset
serves as the foundation for training and evaluating the machine learning model. The following actions
are undertaken in this phase:
i. Data Source: Collect URLs from reputable sources, such as PhishTank and OpenPhish, for phishing
URLs, and Alexa’s top sites list for legitimate URLs.
ii. Data Cleaning: Remove duplicate entries and handle any missing values to ensure consistency and
reliability in the data.
iii. Labelling: Assign labels to each URL as either “phishing” or “legitimate,” based on their source.
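The cleaning and labelling steps above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `normalize` and `build_dataset`, the normalization rule (lower-casing the host), and the choice to keep the first occurrence of a duplicate are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Strip whitespace and lower-case the host so trivial duplicates match.

    urlsplit already lower-cases the scheme; the path is left untouched.
    """
    parts = urlsplit(url.strip())
    return parts._replace(netloc=parts.netloc.lower()).geturl()


def build_dataset(phishing_urls, legitimate_urls):
    """Return (url, label) pairs with missing values and duplicates removed.

    Labels come from the source list, as in step iii. If a URL appears in
    both lists, the first occurrence (the phishing list) wins; this is an
    assumption of the sketch, not a rule stated in the paper.
    """
    seen = set()
    rows = []
    for label, urls in (("phishing", phishing_urls),
                        ("legitimate", legitimate_urls)):
        for raw in urls:
            if raw is None or not raw.strip():  # missing value: skip it
                continue
            url = normalize(raw)
            if url in seen:                     # duplicate: keep the first
                continue
            seen.add(url)
            rows.append((url, label))
    return rows
```

In practice the two input lists would be filled from the feeds named in step i; the sketch deliberately leaves the download step out.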
B. Feature Extraction
Feature extraction is critical for model performance, as it helps identify distinguishing characteristics
of phishing URLs. The proposed system will use the following types of URL-based features:
• Lexical Features: Analyse structural properties of the URL, including:
o URL length
o Presence of special characters, e.g., “@”
o Number of subdomains
o Use of HTTPS protocol
• Host-based Features: Assess the reputation and reliability of the URL’s host, including:
o Domain registration length
o Presence of IP address in the URL
o WHOIS information (when available)
• Content-based Features: Extract content features related to the website’s metadata, if
accessible, including title and description tags, to improve classification accuracy.
A feature extraction function is implemented to automate the processing of these features for each
URL in the dataset, resulting in a structured dataset for model training.
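A feature extraction function of the kind described could look like the sketch below, which covers only the lexical features and the IP-address check because they need no network access. The function name, the feature names, and the `SPECIAL_CHARS` set are hypothetical; the paper names only “@” explicitly. WHOIS, registration-length, and content-based features are omitted.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical character set; the paper gives "@" as its only example.
SPECIAL_CHARS = "@-_%=?&"


def host_is_ip(host: str) -> bool:
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_lexical_features(url: str) -> dict:
    """Map one URL to a flat dictionary of numeric features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_ip = host_is_ip(host)
    labels = [p for p in host.split(".") if p]
    # Naive subdomain count: labels in front of "domain.tld". It ignores
    # multi-part suffixes such as "co.uk", which need a public-suffix list.
    num_subdomains = 0 if is_ip else max(len(labels) - 2, 0)
    return {
        "url_length": len(url),
        "num_special_chars": sum(url.count(c) for c in SPECIAL_CHARS),
        "has_at_symbol": int("@" in url),
        "num_subdomains": num_subdomains,
        "uses_https": int(parts.scheme == "https"),
        "host_is_ip": int(is_ip),
    }
```

Applying the function to every URL in the dataset yields one row per URL, which is the structured form a classifier expects.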
C. Model Selection and Training
The Random Forest classifier is chosen for its high accuracy and robustness in classification tasks.
This model selection is based on prior research indicating its effectiveness for phishing detection. The
training process includes:
• Train-Test Split: Divide the dataset into a training set (80%) and a testing set (20%) to evaluate
the model’s performance.
• Model Training: Train the Random Forest model on the training dataset, using hyperparameter
tuning to optimize the model’s performance.
• Cross-Validation: Apply k-fold cross-validation to ensure the model’s generalization
capabilities and minimize the risk of overfitting.
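The three training steps can be sketched with scikit-learn as shown below. This is an assumption-laden illustration: the paper does not name its library, its hyperparameter grid, or its value of k, so the grid, `cv=5`, and the synthetic feature matrix are placeholders chosen only to make the example runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the extracted URL features (NOT real phishing data).
rng = np.random.default_rng(0)
X = rng.random((400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)  # 1 = phishing, 0 = legitimate

# Train-test split: 80% training, 20% testing, stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameter tuning on the training set only (hypothetical grid).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
model = search.best_estimator_

# k-fold cross-validation (k = 5 here) on the training data.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final check on the held-out 20%.
test_accuracy = model.score(X_test, y_test)
print(search.best_params_, cv_scores.mean(), test_accuracy)
```

Tuning and cross-validation both use only the training portion, so the test set remains unseen until the final score.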
D. Deployment as a Web-Based Application
To make the phishing detection tool accessible, it is deployed as a web-based application with the
following components:
• Backend (Flask API): The trained machine learning model is served using a Flask API,
allowing the frontend to communicate with the model for real-time predictions.
• Frontend (React): A user-friendly web interface, developed using React, enables users to input
URLs and receive classification results immediately.
• Whitelist Integration: Incorporate a whitelist of known legitimate domains to reduce false
positives and increase user confidence. The tool checks URLs against the whitelist before
classification to avoid unnecessary processing of safe URLs.
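One way the whitelist check might be written is sketched below. The whitelist entries and the function name are hypothetical. The sketch compares the parsed host name with each whitelisted domain instead of searching the raw URL string, because a substring test would accept a URL such as `http://example.com.evil.test/`.

```python
from urllib.parse import urlsplit

# Hypothetical entries; a deployment would load its own list.
WHITELIST = {"example.com", "example.org"}


def is_whitelisted(url: str, whitelist=WHITELIST) -> bool:
    """True if the URL's host is a whitelisted domain or one of its subdomains.

    Inputs without a scheme (e.g. "example.com/login") have no parsed host
    and are treated as not whitelisted in this sketch.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in whitelist)
```

Matching on the host also handles the `user@host` trick: in `https://example.com@evil.test/` the real host is `evil.test`, so the URL is not treated as safe.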
E. System Workflow
The workflow of the proposed phishing detection system is as follows:
• User Input: A user submits a URL through the web interface.
• URL Processing: The backend receives the URL, performs feature extraction, and checks the
URL against the whitelist.
• Model Prediction: If the URL is not on the whitelist, it is passed to the Random Forest classifier,
which returns a classification result.
• Result Display: The frontend displays the classification result (phishing, suspicious, or safe) to
the user, with visual cues to enhance user experience.
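The workflow above can be sketched as a single function. The paper does not say how the “suspicious” class is derived from a model trained on “phishing” and “legitimate” labels, so the sketch assumes one possible scheme: thresholding the model's phishing probability. The thresholds `low` and `high`, the function names, and the toy scoring function are all hypothetical.

```python
from urllib.parse import urlsplit


def classify_url(url, whitelist, phishing_probability, low=0.3, high=0.7):
    """Return "phishing", "suspicious", or "safe" for one URL.

    phishing_probability is any callable mapping a URL to a value in [0, 1],
    for example a wrapper around feature extraction plus a trained model.
    """
    host = (urlsplit(url).hostname or "").lower()
    # Whitelisted hosts skip the classifier entirely.
    if any(host == d or host.endswith("." + d) for d in whitelist):
        return "safe"
    p = phishing_probability(url)
    if p >= high:
        return "phishing"
    if p >= low:
        return "suspicious"
    return "safe"


def toy_probability(url):
    """Stand-in for the trained model; NOT a real detector."""
    if "@" in url:
        return 0.9
    return 0.1 if url.startswith("https://") else 0.5


print(classify_url("http://plain.test/", {"example.com"}, toy_probability))
```

Passing the scoring function as an argument keeps the routing logic separate from the model, so the same function can be exercised with a stub during testing.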
5. Performance Evaluation
The performance of the phishing detection model is evaluated on several key metrics to ensure its
effectiveness in accurately classifying URLs as “phishing,” “suspicious,” or “safe.” The evaluation
involves testing the model on a hold-out test dataset, and the results are measured across multiple
dimensions.
A. Evaluation Metrics
The following metrics are commonly used to evaluate classification models and are applicable to the
proposed phishing detection tool:
• Accuracy: This metric represents the proportion of correctly classified URLs (both phishing and
legitimate) out of the total number of URLs in the test set. Accuracy is calculated as:
$$\text{Accuracy} = \frac{\text{TruePositive} + \text{TrueNegative}}{\text{Total number of samples}}$$
• Precision: Precision measures the model’s accuracy in identifying true phishing URLs out of all
URLs classified as phishing. A high precision rate indicates low false positive rates. Precision is
calculated as:
$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}$$
• Recall: Recall indicates how well the model captures all phishing URLs out of the actual phishing
URLs in the dataset. A high recall rate means fewer false negatives. Recall is calculated as:
$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}$$
• F1-Score: The F1-Score is the harmonic mean of precision and recall, providing a balanced measure
when there is an uneven class distribution or when both precision and recall are important. It is
calculated as:
$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
• Confusion Matrix: The confusion matrix visualizes the model’s performance by showing the true
positives, true negatives, false positives, and false negatives. This matrix helps in identifying any biases
in the model’s predictions and in understanding where misclassifications occur.
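The four formulas can be checked against a confusion matrix with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical and the counts in the usage line are invented for the example, not taken from the paper's experiments.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    "Positive" means phishing. Ratios with a zero denominator are reported
    as 0.0, which is a convention of this sketch.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts: 90 phishing URLs caught, 10 missed, 5 false alarms,
# 95 legitimate URLs correctly passed.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(metrics)
```

With these invented counts the accuracy is 185/200 and the recall is 90/100, which shows how a model can miss one phishing URL in ten while still reporting an accuracy above 92%.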
B. Model Testing and Results
The model is tested on a separate test dataset, which was not used in training, to assess its
generalization ability. The dataset is split into a training set (80%) and a test set (20%) to ensure that
the model’s performance is evaluated on unseen data. After training, the model is evaluated using the
test set, and the following results are typically reported:
• Confusion Matrix Visualization: A visual representation of the true positives, true negatives, false
positives, and false negatives to understand the model’s classification patterns.
• Metric Scores: Detailed results for accuracy, precision, recall, and F1-Score, which collectively
indicate the model’s performance across different dimensions of classification.
C. Expected Performance Goals
The primary objective is to achieve a minimum of 90% classification accuracy, with balanced precision
and recall scores to ensure the model’s reliability in real-world applications. Ideally, precision and
recall should be above 90% to minimize false positives and false negatives, which are critical for a
phishing detection tool.
D. Comparative Analysis
Optionally, the performance of the Random Forest model can be compared with other models (such as
Support Vector Machine, Logistic Regression, or Neural Networks) to determine if the chosen model
is the most effective for the phishing detection task. This comparative analysis can help justify the use
of Random Forest based on its superior accuracy and interpretability.
Model                     Accuracy (%)   Recall (%)   F1-Score (%)
Random Forest             95.2           90.5         91.1
Support Vector Machine    88.7           84.9         85.5
Logistic Regression       87.4           ?2.3         83.7
Performance Metrics of Different Models
6. Results and Discussions
The phishing detection model achieved an accuracy of 95.2%, with a precision of 91.8%, recall of
90.5%, and F1-score of 91.1%, demonstrating its effectiveness in identifying phishing URLs while
maintaining a low rate of false positives and negatives. The confusion matrix analysis shows reliable
classification, though a small number of false positives indicate potential for further refinement
through whitelist integration. Comparisons with Support Vector Machine and Logistic Regression
models confirmed the Random Forest classifier as the most effective for this task. While limitations
exist, including occasional false positives and the need for regular updates to handle evolving phishing
tactics, the model’s high precision and recall suggest strong potential for real-world application. Future
enhancements, such as browser extension development and user feedback integration, could further
improve accuracy and user trust, making this phishing detection tool a robust solution for enhancing
cybersecurity.
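As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall using the harmonic-mean definition. The short sketch below uses only the percentages stated in this section; the variable names are illustrative.

```python
# Percentages reported in the text for the Random Forest model.
precision = 91.8
recall = 90.5

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 1))  # 91.1
```

The recomputed value agrees with the reported F1-score of 91.1%.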
7. Conclusion
This paper successfully developed a machine learning-based phishing detection tool using a Random
Forest classifier to classify URLs as “phishing,” “suspicious,” or “safe.” By leveraging URL-based
features such as length, special characters, HTTPS usage, and domain reputation, the model achieved
a high accuracy of 95.2%, demonstrating its effectiveness in identifying phishing URLs in real time.
The results show that the Random Forest model is both reliable and robust, offering a balanced
performance with low rates of false positives and negatives. While effective, the tool has limitations,
including occasional false positives and the need for regular updates to handle new phishing tactics.
Future improvements, such as incorporating a whitelist, expanding feature sets, and developing a
browser extension, can further enhance its accuracy and usability. Overall, this paper highlights the
potential of machine learning in combating phishing attacks, providing a practical solution that
contributes to the safety and security of online users.
References
[1] A. Soni and P. Abrol, “Phishing Website Detection,” Bachelor of Technology thesis, Jaypee University of
Information Technology, Waknaghat, 2022, supervised by Mr. Prateek Thakral.
[2] H. H. Nguyen and D. T. Nguyen, “An Intelligent System for Detecting Phishing Websites Using Machine Learning,”
in Proceedings of the 2019 IEEE International Conference on Cybersecurity and Resilience (ICCSR), pp. 1-6, IEEE,
2019. DOI: 10.1109/CAIS.2019.8769571.
[3] T. Choudhary, S. Mhapankar, R. Buddha, A. Kharuk, and R. Patil, “A machine learning approach for phishing attack
detection,” Journal of Artificial Intelligence and Technology, vol. 3, no. 3, pp. 108-113, 2023.
https://doi.org/10.37965/jait.2023.0197.
[4] S. Hossain, D. Sarma, and R. Chakma, “Machine Learning-Based Phishing Attack Detection,” International Journal
of Advanced Computer Science and Applications (IJACSA), vol. 11, no. 9, pp. 378-388, 2020.
https://doi.org/10.14569/IJACSA.2020.0110949.
[5] G. Mohamed, J. Visumathi, M. Mahdal, J. Anand, and M. Elangovan, “An effective and secure mechanism for
phishing attacks using a machine learning approach,” Processes, vol. 10, no. 7, p. 1356, 2022.
https://doi.org/10.3390/pr10071356.
[6] R. Mahajan and I. Siddavatam, “Phishing website detection using machine learning algorithms,” International Journal
of Computer Applications, vol. 181, no. 23, 2018. https://doi.org/10.5120/ijca2018918026.
[7] J. Rashid, T. Nazir, T. Mahmood, and M. W. Nisar, “Phishing detection using machine learning technique,”
in 2020 First International Conference of Smart Systems and Emerging Technologies (SMART-TECH), 2020.
https://doi.org/10.1109/SMART-TECH49988.2020.00026.
[8] O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learning based phishing detection from URLs,”
Expert Systems with Applications, vol. 117, pp. 345-357, 2019. https://doi.org/10.1016/j.eswa.2018.09.029.
[9] A. Safi and S. Singh, “A systematic literature review on phishing website detection techniques,” Journal of
King Saud University - Computer and Information Sciences, vol. 35, no. 5, pp. 590-611, 2023.
https://doi.org/10.1016/j.jksuci.2023.01.004.
[10] G. N. Basavaraj, K. Lavanya, Y. S. Reddy, and B. S. Rao, “Reliability-driven time series data analysis in
multiple-level deep learning methods utilizing soft computing methods,” Measurement: Sensors, vol. 24, Dec. 2022.
doi: 10.1016/j.measen.2022.100501.
[11] A. Basit, M. Zafar, X. Liu, A. R. Javed, Z. Jalil, and K. Kifayat, “A comprehensive survey of AI-enabled
phishing attacks detection techniques,” Telecommunication Systems, vol. 76, no. 2, pp. 139-154, 2021.
https://doi.org/10.1007/s11235-020-00733-2.
[12] B. A. Mohan, B. Harshavardhan, S. Karan, M. J. Shariff, and M. G. Pranav, “Demand forecasting and
route optimization in supply chain industry using data analytics,” in 2021 Asian Conference on Innovation in
Technology (ASIANCON), pp. 1-7, IEEE, 2021.
https://internationalpubls.com