Machine Learning
on
BACHELOR OF
TECHNOLOGY
In
INFORMATION TECHNOLOGY
Submitted by
S. SAI PRANEETH 22K81A1255
V.SRINIJAREDDY 22K81A1263
B.MANIKANAND 22K81A1206
JUNE- 2025
St. MARTIN'S ENGINEERING COLLEGE
UGC Autonomous
Affiliated to JNTUH, Approved by AICTE
NBA & NAAC A+ Accredited
Dhulapally, Secunderabad - 500 100
www.smec.ac.in
Certificate
Declaration
We, the students of ‘Bachelor of Technology in Department of Information Technology’,
session: 2022 - 2026, St. Martin’s Engineering College, Dhulapally, Kompally,
Secunderabad, hereby declare that the work presented in this Project Work entitled
“ML-DRIVEN POTENTIAL THEFT IDENTIFICATION FOR ENHANCING INTEGRITY
AND EFFICIENCY OF SMART GRID SYSTEMS” is the outcome of our own bonafide work and
is correct to the best of our knowledge, and that this work has been undertaken with due care for
Engineering Ethics. The results embodied in this project report have not been submitted to any
university for the award of any degree.
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the successful completion of any task would
be incomplete without the mention of the people who made it possible and whose encouragement
and guidance have crowned our efforts with success.
First and foremost, we would like to express our deep sense of gratitude and indebtedness to
our College Management for their kind support and permission to use the facilities available in the
Institute.
We especially would like to express our deep sense of gratitude and indebtedness to Dr. P.
SANTOSH KUMAR PATRA, Professor and Group Director, St. Martin’s Engineering College
Dhulapally, for permitting us to undertake this project.
We wish to record our profound gratitude to Dr. M. SREENIVAS RAO, Principal, St.
Martin’s Engineering College, for his motivation and encouragement.
We are also thankful to Dr. N. KRISHNAIAH, Head of the Department, Information
Technology, St. Martin’s Engineering College, Dhulapally, Secunderabad, for his support and
guidance throughout our project, as well as to our Project Coordinator, MRS. G. GOUTHAMI,
Assistant Professor, Information Technology department, for her valuable support.
We would like to express our sincere gratitude and indebtedness to our project supervisor,
MRS. K. SURYA KANTHI, Assistant Professor, Information Technology, St. Martin’s Engineering
College, Dhulapally, for her support and guidance throughout our project.
Finally, we express thanks to all those who have helped us in successfully completing this
project. Furthermore, we would like to thank our family and friends for their moral support and
encouragement.
ABSTRACT
The security and efficiency of smart grid systems are increasingly being threatened by potential thefts,
cyber-attacks, and unauthorized access, leading to significant financial and operational losses.
Traditional approaches for detecting theft and ensuring grid integrity rely heavily on manual
monitoring, which is both time-consuming and prone to human error. This research introduces a
machine learning (ML)-driven system designed to identify potential thefts and optimize the efficiency
of smart grid systems. By leveraging data such as energy consumption patterns, grid operations, and
external factors, the system can detect anomalies in real-time, predicting suspicious activities that may
indicate theft or fraud. Using machine learning models, the system analyzes vast amounts of data to
identify subtle patterns that would otherwise be overlooked by conventional methods. This proactive
approach enables the early detection of theft, reducing its impact and improving the overall efficiency
of the smart grid. Moreover, the system can be fine-tuned and adapted to different regions and grid
configurations, providing a scalable and flexible solution for grid operators. With the increasing
adoption of smart meters and IoT devices in modern grids, the availability of real-time data allows for
more accurate and timely decision-making. This research highlights the potential of machine learning
in enhancing grid security, optimizing resource distribution, and reducing operational costs.
LIST OF FIGURES
Figure no. Figure Title Page no.
LIST OF TABLES
4.1 Database 15
4.5.1 Hardware Requirements 24
4.5.2 Software Requirements 25
LIST OF ACRONYMS AND DEFINITIONS
4. BC Bagging Classifier
6. IF Isolation Forest
CONTENTS
ACKNOWLEDGEMENT i
ABSTRACT ii
LIST OF TABLES iv
CHAPTER 1 INTRODUCTION 1
1.1 Overview 1
4.1 Database 12
4.3 Design 15
4.4 Modules 23
4.6 Testing 26
7.1 Conclusion 37
REFERENCES 40
Patent/Publication
CHAPTER 1
INTRODUCTION
1.1. Overview
Smart grid systems, which integrate advanced communication technologies with traditional electrical
grids, aim to optimize energy distribution, improve grid reliability, and enable real-time monitoring of
energy usage. However, as these systems become more complex and interconnected, they become
vulnerable to potential theft, fraud, and cyber-attacks. According to the U.S. Department of Energy,
energy theft alone costs utilities billions of dollars annually. Smart grids are particularly susceptible to
these threats because of their reliance on distributed sensors, smart meters, and IoT devices, which can
be manipulated or hacked. Traditional theft detection systems are often reactive, relying on physical
inspections, customer complaints, and audits to identify discrepancies. These methods are time-
consuming and inefficient, especially when large volumes of data from multiple sources need to be
processed. To address these issues, this research proposes an ML-driven solution that continuously
monitors energy consumption patterns, identifies irregularities, and flags potential thefts in real-time.
By analyzing the vast datasets generated by smart grids, this system will help utility companies
proactively detect and prevent theft, thereby improving both grid security and operational efficiency.
The motivation behind this research stems from the growing need for more secure and efficient smart
grid systems in the face of increasing theft and fraud. As smart grids become integral to the
functioning of modern energy systems, ensuring their integrity becomes paramount. Theft, whether
through illegal connections, meter tampering, or cyber intrusion, undermines the goals of smart grid
technology by reducing revenue, compromising system performance, and raising operational costs.
Traditional methods of detecting theft are not equipped to handle the volume and complexity of data
produced by modern grids. Machine learning offers a promising solution by enabling the automated
detection of anomalies in real-time, identifying patterns that could indicate potential theft or
inefficiency. The research aims to harness the power of machine learning to enhance the security and
efficiency of smart grids, providing utility operators with the tools needed to mitigate losses and
optimize energy distribution.
1.3. Problem Statement
As the adoption of smart grid technologies grows, theft and unauthorized access pose serious risks to
the reliability, security, and efficiency of the system. Traditional theft detection mechanisms are
reactive and cannot keep pace with the evolving techniques used by those attempting to steal energy.
The problem lies in the ability of utility companies to monitor and detect suspicious activities across
vast, decentralized networks that generate enormous amounts of data. This massive influx of data
makes manual inspection and traditional algorithms inefficient and impractical. A more automated,
proactive approach is necessary to identify potential thefts early on, preventing further damage and
optimizing the grid's overall performance. This research focuses on addressing the gap by developing
an ML-driven system capable of detecting anomalies and potential theft in smart grid systems in real-
time.
Energy Theft Detection: The primary application of the proposed system is in identifying energy
theft in smart grids, helping utility companies minimize losses and maintain operational integrity.
Grid Optimization: By analyzing energy usage patterns, the system can optimize resource
distribution, ensuring that energy is efficiently allocated where it is needed most.
Predictive Maintenance: The ML models can help predict when equipment is likely to fail,
allowing for proactive maintenance and reducing downtime.
Customer Billing Accuracy: By detecting irregularities in energy usage, the system helps keep
customer billing accurate.
CHAPTER 2
LITERATURE SURVEY
Gunduz et al. [1] presented a detailed overview of the Internet of Things (IoT), discussing its
evolution, core components, and diverse application areas. Their work emphasizes the foundational
role of IoT in sectors such as healthcare, transportation, and especially smart grids. The paper
highlights how IoT integration facilitates real-time monitoring and efficient resource management.
They also touch upon the communication challenges and the need for robust infrastructure in smart
systems. Das et al. [2] investigated the vulnerabilities and threats posed by cyber-attacks on IoT-based
critical infrastructure. They focused on identifying attack vectors and emphasized the potential damage
to essential systems such as smart grids and industrial control environments. Their study outlines the
importance of security frameworks and recommends advanced detection mechanisms. It also illustrates
specific cases of attacks, proposing mitigation strategies through secure IoT architectures. Emmanuel
et al. [3] explored various communication technologies applicable to smart grid environments,
including ZigBee, Wi-Fi, and LTE. Their survey discussed the trade-offs between latency, range, and
data throughput for each technology. They highlighted how the choice of communication protocol
directly influences grid performance, scalability, and cyber-security. This work provides a
foundational comparison essential for designing communication layers in smart infrastructure. Kimani
et al. [4] examined the cybersecurity challenges faced by IoT-based smart grid networks. They
identified major threats such as data breaches, denial-of-service attacks, and unauthorized access. The
paper also delved into security requirements like authentication, encryption, and intrusion detection
systems. Their findings emphasize the urgent need for robust security models tailored to the unique
characteristics of smart grid networks. Gunduz et al. [5] provided an analysis of the communication
infrastructure and cyber-security aspects specific to smart grids. Their study categorizes network layers
and identifies vulnerabilities inherent in each communication tier. They propose integrated solutions
involving blockchain and encryption techniques. The authors argue for a layered security model to
defend against sophisticated threats in critical grid systems. Qays et al. [6] conducted a comprehensive
review of communication technologies, applications, and protocols in IoT-assisted smart grid systems.
The paper categorizes existing techniques based on functionality and security capabilities. It also
provides insights into future research directions such as 6G and edge computing for smart grids.
Sahoo et al. [7] proposed a data-driven approach for electricity theft detection using smart meter data.
By analyzing consumption patterns and comparing them with expected norms, their method identifies
anomalies suggestive of theft. They validated the approach using real-world datasets and demonstrated
significant detection accuracy. This work underlines the role of advanced analytics in securing energy
distribution. Althobaiti et al. [8] presented a survey on energy theft in smart grids, focusing on various
attack strategies and detection methods. They reviewed machine learning, statistical, and heuristic-
based techniques for fraud identification. The paper discusses challenges such as data imbalance, false
positives, and evolving theft tactics. It concludes with recommendations for adaptive and scalable
detection frameworks. Takiddin et al. [9] addressed the problem of false data injection attacks in smart
grids, proposing detection algorithms based on signal processing. Their model identifies abnormal data
patterns and signals injected to disrupt grid operation. They tested the method on benchmark datasets,
showing effectiveness in minimizing error rates. Their contribution emphasizes real-time detection to
maintain grid integrity. Badr et al. [10] reviewed existing data-driven methods for fraud detection in
smart metering systems. The study encompasses supervised and unsupervised learning models,
discussing their strengths and limitations. It also considers the impact of feature selection, training data
quality, and model interpretability. Their findings guide researchers in choosing suitable AI techniques
for electricity fraud scenarios. Wang et al. [11] introduced a deep learning approach to infer socio-
demographic information from smart meter data. Their model can predict household attributes like
occupancy and income based on electricity consumption behavior. They highlight privacy concerns
but also demonstrate the potential for targeted energy policies. This research shows how smart grid
data can reveal insights beyond utility usage. Reda et al. [12] surveyed false data injection attacks in
smart grids, offering taxonomies based on models, targets, and consequences. They categorized attacks
by their entry points and assessed impacts on reliability and safety. The paper also presents defenses
such as blockchain, anomaly detection, and secure protocols. Their work aids in developing
comprehensive protection strategies. Javaid et al. [13] employed GANCNN and ERNET models to
detect non-technical losses in smart grids. Their hybrid framework enhances detection accuracy
through feature learning and error correction. The system was tested on multiple datasets and showed
robust performance in identifying fraud. Their research confirms the utility of deep learning in energy
theft scenarios. Habib et al. [14] investigated false data injection attacks in smart grid cyber-physical
systems, highlighting current issues and future challenges. They reviewed technical and regulatory
gaps in existing defense mechanisms.
CHAPTER 3
SYSTEM ANALYSIS AND DESIGN
Traditional theft detection systems in smart grids rely on physical inspections, periodic audits, and
customer complaints. These systems are reactive, only identifying theft after it has occurred or been
reported. With the increasing complexity and scale of smart grids, these methods have become less
effective in dealing with the volume of data generated. Physical inspections are labor-intensive and
time-consuming, while audits rely on incomplete or delayed data, making it difficult to detect theft in
real-time. Furthermore, these methods fail to capture subtle or sophisticated theft techniques, such as
meter tampering or cyber intrusions, which can go unnoticed for extended periods. The traditional
systems also struggle to scale, especially in large, decentralized smart grids, and often cannot identify
inefficiencies or potential threats until they cause significant disruptions.
Although these methods can uncover theft once it has been reported or physically observed,
they have notable limitations when it comes to scalability, adaptability, real-time response,
and, most importantly, privacy.
Reactive Nature of Traditional Systems.
These systems are not proactive. They typically detect theft only after it has occurred or been reported
by consumers, leading to delayed responses. This delay often translates to prolonged revenue losses,
unmonitored grid instability, and in some cases, permanent damage to infrastructure.
Scalability Challenges.
Smart grids are massively distributed and generate vast amounts of real-time data from thousands or
millions of smart meters and edge devices. Traditional systems are not designed to handle this scale.
Manual audits or periodic checks cannot keep up with the frequency and volume of data, rendering
them ineffective in identifying emerging threats in real time.
Physical inspections require considerable manpower and time, making them inefficient, especially in
urban areas with high consumer density or rural areas with hard-to-reach locations. Furthermore,
audits often depend on incomplete historical data that may not reflect current patterns or real-time
anomalies.
Modern smart grids require real-time analytics to quickly respond to abnormal consumption patterns.
Traditional systems lack real-time monitoring and are not adaptive. They cannot evolve or learn from
new patterns of theft and inefficiencies. This static nature makes them ineffective in dynamic
environments where consumption behavior and attack methods constantly evolve.
Rule-based systems can only detect known, pre-defined patterns. They fail to recognize novel or
low-profile anomalies, especially those crafted to mimic normal behavior. Sophisticated thieves often
exploit these blind spots by staying just within acceptable thresholds.
Limitations:
Reactive approach to theft detection.
Labor-intensive and time-consuming physical inspections.
Inability to scale efficiently across large smart grid networks.
3.2. Proposed System and its Advantages
The proposed ML-driven system offers a proactive and scalable approach to detecting theft in smart
grid systems. By continuously analyzing energy consumption data, the system can identify anomalies
and flag potential thefts in real-time. Using advanced machine learning algorithms, the system can
detect patterns that indicate suspicious activities, such as unusual consumption spikes or tampered
meters, and alert grid operators immediately. This enables faster response times and prevents further
losses. Additionally, the system can scale to handle large volumes of data from smart meters and IoT
devices, ensuring its applicability across diverse grid configurations. The ML model can be trained and
fine-tuned over time to improve detection accuracy and adapt to new theft techniques.
The proposed ML-driven system presents a proactive, intelligent, and scalable solution for detecting
electricity theft in modern smart grid environments. Unlike conventional methods that rely heavily on
physical inspections and customer complaints, this system leverages real-time data analytics and
machine learning to monitor and assess energy consumption patterns continuously. Through advanced
algorithms such as Random Forests, LSTM networks, Autoencoders, and Isolation Forests, the system
can identify anomalies in electricity usage that suggest fraudulent activity—such as unexpected spikes,
prolonged low usage, or tampered meter signals. These anomalies are flagged immediately, allowing
utility providers to act swiftly and prevent further losses.
Working of the Proposed Federated Learning-Based System:
The proposed system utilizes Federated Learning (FL) to enable intelligent, decentralized theft
detection across smart grid systems without compromising user privacy. Unlike traditional
centralized machine learning approaches, FL allows smart meters and local devices to
collaboratively train a global model without sharing raw user data, making the system secure,
scalable, and privacy-preserving.
Working Process:
Initialization:
Each smart meter trains the model on its own historical data to learn unique
consumption patterns.
At scheduled intervals (e.g., daily or weekly), each smart meter sends encrypted model
updates—not the data—to the central server.
Global Model Aggregation:
The server aggregates updates using algorithms like FedAvg (Federated Averaging).
Model Distribution:
With each iteration, smart meters become more effective at detecting real-time anomalies or suspicious
behaviors.
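The aggregation step can be illustrated with a minimal sketch of Federated Averaging. It assumes each smart meter reports a flat list of model weights together with the number of local samples it trained on. The function name fed_avg and the toy numbers are hypothetical, and the encryption and transport of updates are left out.

```python
from typing import List, Sequence, Tuple


def fed_avg(updates: Sequence[Tuple[Sequence[float], int]]) -> List[float]:
    """Weighted average of client weight vectors (FedAvg-style aggregation).

    Each update is (weights, n_samples); clients with more local samples
    contribute proportionally more to the global model.
    """
    if not updates:
        raise ValueError("at least one client update is required")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0][0])
    global_weights = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all clients must send vectors of equal length")
        for i, w in enumerate(weights):
            global_weights[i] += w * n / total
    return global_weights


# Hypothetical updates from three smart meters: (local weights, local sample count)
client_updates = [
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
    ([5.0, 6.0], 100),
]
global_model = fed_avg(client_updates)
print(global_model)
```

Weighting by sample count means a meter with a longer history influences the global model more than one with only a few readings. Only the weight vectors and counts reach the server; the raw readings never do.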
Privacy-Preserving: No raw data leaves the consumer’s premises, satisfying privacy regulations.
Scalable: Efficiently operates across millions of smart meters in a distributed manner.
Adaptive Learning: Models continuously evolve based on local and global insights.
Reduced Bandwidth Usage: Only model updates are transmitted, not bulky raw data.
Real-Time Detection: Enables timely identification of theft or tampering.
Advantages Of Proposed System:
Real-time anomaly detection, enabling early identification of theft and fraud.
Scalable to handle large datasets and complex grid structures.
Automated analysis of energy consumption patterns, reducing the need for manual inspections.
Ability to detect sophisticated theft techniques, including cyber intrusions and meter tampering.
Improved efficiency and cost-effectiveness compared to traditional methods.
1. Functional Requirements
Upload Dataset: Import the dataset containing energy consumption data, grid operations, and
external factors such as weather or demand fluctuations.
Data Preprocessing: Clean and preprocess the data, including handling missing values,
normalizing consumption patterns, and encoding categorical variables.
EDA (Exploratory Data Analysis): Visualize the data to identify trends, correlations, and
potential outliers that could indicate theft or inefficiency.
Data Splitting: Split the data into training, validation, and test sets to ensure the robustness of the
model and prevent overfitting.
Model Building: Build and train machine learning models to detect anomalies and predict
potential theft or inefficiency.
Model Testing: Evaluate the model's performance on the test data to assess its accuracy in
detecting theft and predicting grid performance issues.
Performance Evaluation: Evaluate the model using metrics like accuracy, precision, recall, and
F1 score to measure its effectiveness in identifying anomalies.
Model Prediction on Test Data: Use the trained model to predict potential thefts and anomalies
on new data, enabling real-time monitoring of the smart grid system.
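The preprocessing and splitting requirements above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the project's code: the readings are invented, the helper names are hypothetical, and a real pipeline would more likely use a library scaler and splitter.

```python
import random
import statistics
from typing import List, Optional, Tuple


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing readings (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Scale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def split_indices(n: int, test_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_size)
    return indices[n_test:], indices[:n_test]


# Hypothetical daily consumption readings in kWh, with two missing values
readings = [12.0, None, 15.0, 9.0, None, 14.0, 10.0, 11.0, 13.0, 16.0]
clean = impute_mean(readings)
scaled = standardize(clean)
train_idx, test_idx = split_indices(len(scaled), test_size=0.30, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible, so results from different models can be compared on the same held-out rows.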
2. Hardware Requirements:
Connectivity: Wi-Fi / Zigbee / LoRa / LTE (for communication with the central server)
RAM: Minimum 32 GB
GPU (Optional): NVIDIA Tesla / RTX 3060+ (for deep learning models)
3. Software Requirements:
Operating Systems:
Database and Storage
CHAPTER 4
SYSTEM REQUIREMENTS & SPECIFICATIONS
4.1. Database
SmartGridTest.csv | A dataset with the same structure but without labels, used for predictions | Same features as training set (excluding Theft_Type) | Used to test trained models and predict theft cases on new, unseen data.
4.2. Algorithms
What is KNN?
K-Nearest Neighbors is a supervised machine learning algorithm that classifies data points based on
the majority class among their 'k' closest neighbors in the feature space. It’s a lazy learner and works
well with smaller datasets.
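To make the majority-vote idea concrete, here is a small from-scratch sketch. It is not the scikit-learn classifier used later in this section. The two features, their values, and the labels are hypothetical and assumed to be already scaled.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def knn_predict(
    train: Sequence[Tuple[Sequence[float], str]],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Return the majority label among the k training points closest to query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-scaled features: (mean daily load, night-time load ratio)
training_points = [
    ((0.9, 0.30), "Normal"),
    ((1.0, 0.35), "Normal"),
    ((1.1, 0.32), "Normal"),
    ((0.2, 0.90), "Theft"),
    ((0.3, 0.85), "Theft"),
    ((0.25, 0.95), "Theft"),
]
suspicious = knn_predict(training_points, (0.28, 0.88), k=3)
typical = knn_predict(training_points, (1.05, 0.33), k=3)
print(suspicious, typical)
```

There is no training step: all of the work happens at prediction time, which is what "lazy learner" means. Because the vote depends on Euclidean distance, features on very different scales should be normalized first.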
Load SmartGridTheftData.csv
Normalize the feature data if needed using StandardScaler
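import os
import joblib
from sklearn.neighbors import KNeighborsClassifier

# Reuse a previously trained model if one was saved; otherwise train and save it
if os.path.exists('KNNClassifier.pkl'):
    knn_model = joblib.load('KNNClassifier.pkl')
else:
    knn_model = KNeighborsClassifier(n_neighbors=5)
    knn_model.fit(x_train, y_train)
    joblib.dump(knn_model, 'KNNClassifier.pkl')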
Step 4: Prediction
y_pred_knn = knn_model.predict(x_test)
Step 6: Visualization
What is RandomForestClassifier?
Random Forest Classifier is an ensemble learning algorithm that builds multiple decision trees
and combines their results to make accurate and robust predictions. It works well for both
classification and regression tasks and helps reduce overfitting.
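The ensemble idea can be sketched with a toy forest of depth-one trees (decision stumps), each fitted on a bootstrap sample and a randomly chosen feature, with the final label decided by majority vote. This is a deliberately simplified stand-in for a real random forest, which grows deeper trees and samples features at every split. The data, feature meanings, and function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]
Stump = Tuple[int, float, int, int]  # (feature, threshold, left label, right label)


def fit_stump(rows: Sequence[Row], feature: int) -> Stump:
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for x, _ in rows:
        threshold = x[feature]
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if xi[feature] <= threshold else right) != yi
                for xi, yi in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return feature, threshold, left, right


def predict_stump(stump: Stump, x: Sequence[float]) -> int:
    feature, threshold, left, right = stump
    return left if x[feature] <= threshold else right


def fit_forest(rows: Sequence[Row], n_trees: int, seed: int) -> List[Stump]:
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sampling with replacement
        forest.append(fit_stump(sample, rng.randrange(n_features)))
    return forest


def predict_forest(forest: Sequence[Stump], x: Sequence[float]) -> int:
    """Majority vote over the individual stump predictions."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical features where higher values are more suspicious; 0 = normal, 1 = theft
rows = [
    ((0.1, 0.2), 0), ((0.3, 0.1), 0), ((0.2, 0.4), 0), ((0.4, 0.3), 0),
    ((5.0, 5.2), 1), ((5.4, 5.0), 1), ((5.1, 5.6), 1), ((5.8, 5.3), 1),
]
forest = fit_forest(rows, n_trees=15, seed=42)
print(predict_forest(forest, (0.0, 0.0)), predict_forest(forest, (5.0, 5.0)))
```

Because every stump sees a different resample of the data, an unlucky sample affects only one vote, which is the intuition behind the reduced overfitting mentioned above.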
Step 1: Data Preprocessing
• Load the BoTNeTIoT-L01-v2.csv dataset.
• Remove unnecessary columns like Device_Name and Attack_subType.
• Define feature set (X) and label set (y).
Step 2: Train-Test Split
• Divide the dataset into training and testing sets:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
Step 3: Model Loading or Training
• Check if the model is already trained and saved:
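import os
import joblib
from sklearn.ensemble import RandomForestClassifier

# Load the saved classifier if present; otherwise train it and save it to disk
if os.path.exists('RandomForest_weights.pkl'):
    classifier = joblib.load('RandomForest_weights.pkl')
else:
    classifier = RandomForestClassifier(random_state=42)
    classifier.fit(x_train, y_train)
    joblib.dump(classifier, 'RandomForest_weights.pkl')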
Step 4: Prediction
• Predict on the test data using the trained model:
y_pred = classifier.predict(x_test)
Step 5: Model Evaluation
• Evaluate the performance using:
• accuracy_score
• precision_score
• recall_score
• f1_score
• confusion_matrix
• classification_report
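As a reminder of what these scores measure, the sketch below derives accuracy, precision, recall, and F1 from the confusion-matrix counts for a two-class case. The labels are invented for illustration (1 = theft, 0 = normal) and the helper name is hypothetical. A multi-class problem would need per-class or averaged versions of the same quantities.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # theft correctly flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # normal correctly passed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarm
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed theft
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: 4 theft cases and 6 normal cases
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

Theft cases are usually far rarer than normal ones, so precision and recall tend to say more than accuracy alone: a model that never flags theft can still score high accuracy.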
Step 6: Visualization
• Display confusion matrix using seaborn.heatmap() to analyze true vs predicted classes
visually.
Step 7: Prediction on New Data
• Load and preprocess the external test dataset (test.csv).
• Predict and display labels such as Normal, BASHLITE, or Mirai based on prediction
output values (0,1,2).
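Turning numeric predictions into readable labels is a small lookup. The sketch below assumes the codes map in the order the labels are listed above (0, 1, 2); that ordering is an assumption and should be checked against the label encoding used during training.

```python
from typing import Iterable, List

# Assumed code-to-label mapping; verify against the training label encoding
LABELS = {0: "Normal", 1: "BASHLITE", 2: "Mirai"}


def decode_predictions(codes: Iterable[int]) -> List[str]:
    """Map numeric class codes to label names, flagging unexpected codes."""
    return [LABELS.get(int(code), "Unknown") for code in codes]


decoded = decode_predictions([0, 2, 1, 0, 7])
print(decoded)
```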
4.3 Design
The system for "ML-Driven Potential Theft Identification for Enhancing Integrity and
Efficiency of Smart Grid Systems" is designed as a modular, intelligent, and scalable architecture.
It integrates components for smart meter data acquisition, data preprocessing, anomaly detection
using machine learning models, real-time alerting, and visualization. The architecture supports both
batch and real-time energy data analysis, making it suitable for modern decentralized smart grid
environments.
4.3.2 Data Flow Diagram
The flowchart illustrates a complete machine learning workflow starting from the user’s
interaction with the system. The user initiates the process by uploading a dataset through a user-
friendly interface. This uploaded data then enters the preprocessing stage, where it is cleaned,
normalized, and transformed to handle any missing values or noise, ensuring consistency and
quality. Once preprocessing is complete, the data is split into training and testing sets to allow
unbiased evaluation of model performance. The training set is then used in the model training
phase, where algorithms such as K-Nearest Neighbors (KNN) and Random Forest Classifier are
applied to learn patterns from the data. After training, the model is evaluated using the testing set
in the model testing stage to verify its generalization ability. The outcomes of the model testing
are then used to generate predictions for new data inputs. Simultaneously, performance metrics
such as accuracy, precision, recall, and F1-score are calculated to assess how well the model is
performing. These predictions and evaluation results are then made accessible to the user. This
structured and iterative process ensures both the reliability of the predictions and transparency in
model evaluation.
4.3.3 UML Diagrams
UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling
language in the field of object-oriented software engineering. The standard was created, and is
managed, by the Object Management Group. The goal is for UML to become a common language
for creating models of object-oriented computer software. In its current form, UML comprises
two major components: a meta-model and a notation. In the future, some form of method or
process may also be added to, or associated with, UML.
GOALS: The primary goals in the design of the UML are as follows:
i) Provide users with a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
ii) Provide extensibility and specialization mechanisms to extend the core concepts.
iii) Be independent of particular programming languages and development processes.
iv) Provide a formal basis for understanding the modeling language.
v) Encourage the growth of the OO tools market.
vi) Support higher-level development concepts such as collaborations, frameworks, patterns and
components.
vii) Integrate best practices.
4.3.3.1 Class Diagram
The class diagram represents an object-oriented design for a machine learning pipeline. The Dataset
class manages core data operations such as uploading, preprocessing, and splitting the dataset. It
connects to the Preprocessing class, which performs essential tasks like exploratory data analysis
(EDA), removing null values, handling missing values, and normalizing the data. After
preprocessing, the clean data is passed to the Model class, which defines general machine learning
functions such as training, testing, evaluating, and predicting. Two specific model classes, K-
Nearest Neighbors (KNN) Model and RandomForest_Model, inherit from the Model class and
implement algorithms for K-Nearest Neighbors (KNN) and Random Forest respectively. These
subclasses override the train and test methods with model-specific logic. This structure promotes
modularity and code reuse while keeping the workflow organized. It also makes it easy to extend
the system with new models by simply creating subclasses of Model. Such an architecture is
effective for building scalable and maintainable machine learning applications.
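The inheritance structure described here might look roughly like the sketch below. It is an illustration under assumptions rather than the project's code: the class and method names follow the description, the KNN logic is a from-scratch toy, and a trivial majority-class baseline stands in for the second subclass so the example stays short and self-contained.

```python
import math
from abc import ABC, abstractmethod
from collections import Counter
from typing import List, Sequence

Features = Sequence[float]


class Model(ABC):
    """Base class: subclasses supply train() and test()."""

    @abstractmethod
    def train(self, X: Sequence[Features], y: Sequence[int]) -> None:
        """Fit the model on labelled rows."""

    @abstractmethod
    def test(self, X: Sequence[Features]) -> List[int]:
        """Return one predicted label per row of X."""

    def evaluate(self, X: Sequence[Features], y: Sequence[int]) -> float:
        """Accuracy of test() against the known labels."""
        predictions = self.test(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)

    def predict(self, x: Features) -> int:
        """Label for a single unseen row."""
        return self.test([x])[0]


class KNNModel(Model):
    def __init__(self, k: int = 3) -> None:
        self.k = k
        self._X: List[Features] = []
        self._y: List[int] = []

    def train(self, X: Sequence[Features], y: Sequence[int]) -> None:
        # Lazy learner: training only stores the data
        self._X, self._y = list(X), list(y)

    def test(self, X: Sequence[Features]) -> List[int]:
        results = []
        for row in X:
            order = sorted(range(len(self._X)), key=lambda i: math.dist(self._X[i], row))
            votes = Counter(self._y[i] for i in order[: self.k])
            results.append(votes.most_common(1)[0][0])
        return results


class MajorityClassModel(Model):
    """Trivial baseline standing in for a second subclass such as a forest model."""

    def train(self, X: Sequence[Features], y: Sequence[int]) -> None:
        self._label = Counter(y).most_common(1)[0][0]

    def test(self, X: Sequence[Features]) -> List[int]:
        return [self._label for _ in X]


# Hypothetical scaled features; 0 = normal, 1 = theft
X_train = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (5.0, 5.1), (5.2, 4.9)]
y_train = [0, 0, 0, 1, 1]
X_test = [(0.0, 0.0), (5.1, 5.0)]
y_test = [0, 1]

scores = {}
for model in (KNNModel(k=3), MajorityClassModel()):
    model.train(X_train, y_train)
    scores[type(model).__name__] = model.evaluate(X_test, y_test)
print(scores)
```

Adding a new algorithm only requires a new subclass with its own train and test methods; the evaluation loop stays unchanged.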
4.3.3.2 Activity Diagram
The flowchart illustrates a complete machine learning pipeline starting from data acquisition
to prediction. The process begins with uploading the dataset, followed by performing
Exploratory Data Analysis (EDA) to understand the structure and insights of the data. Next,
any missing values are handled appropriately to ensure data quality, and the data is
normalized to bring all features to a common scale. After preprocessing, the dataset is split
into training and testing sets. Machine learning models, specifically K-Nearest Neighbors
(KNN) and Random Forest Classifier, are then trained on the training data. These models
are subsequently tested on the test data to evaluate their performance. The evaluation step
involves analyzing metrics such as accuracy, precision, recall, or F1-score to determine
model effectiveness. Finally, the trained models are used to make predictions on unseen
data, completing the pipeline in a structured and modular manner.
4.3.3.3 Use Case Diagram
This use case diagram depicts a Federated Learning System workflow utilizing the K-Nearest
Neighbors (KNN) and Random Forest Classifier models. The user initiates the process by
uploading the dataset, after which the data is preprocessed—this includes tasks such as handling
missing values and normalization. The models (K-Nearest Neighbors (KNN) and Random Forest)
are then trained on local data without centralizing it, adhering to the principles of federated
learning. These models are subsequently tested to verify their performance, followed by an
evaluation phase that involves measuring metrics such as accuracy or F1-score. Finally, the system
generates predictions based on the trained models. Throughout this process, the system allows
performance metrics to be viewed, enabling transparent assessment and model comparison. This
approach enhances privacy by keeping data distributed and supports robust, secure machine
learning across multiple nodes.
4.3.3.4 Sequence Diagram
The sequence diagram depicts the flow of a machine learning application utilizing K-Nearest
Neighbors (KNN) and Random Forest models. The process starts with the User uploading a
dataset to the System, which then sends it to the Preprocessing module for exploratory data
analysis (EDA) and cleaning. After data preprocessing, the cleaned data is sent back to the
System, which then initiates model training using K-Nearest Neighbors (KNN) and Random
Forest classifiers. These models are trained within the Model module and returned to the System
once completed. The System proceeds to test the trained models by sending them along with test
data to the Model module, which returns the test results. These results are used by the System to
display performance metrics (e.g., accuracy, precision, recall) to the User. If the User requests
predictions, the System forwards this request to the Model module, which makes predictions using
the K-Nearest Neighbors (KNN) and Random Forest models and sends the predicted outputs back
to the System. Finally, the System displays the prediction results to the User.
4.3.3.5 Deployment Diagram
The architecture shows a clear division between the User Device and the Server, where the core
processing takes place. The User Interface on the User Device allows the user to upload data and
request results. This data is sent to the Preprocessing Module on the Server, where it is cleaned
and transformed: numerical features are normalized so that the distance-based K-Nearest Neighbors
model is not dominated by large-valued features, and categorical fields, if any, are encoded as
numerical values for both models. The processed data then
moves into the Model Training Module, where both the K-Nearest Neighbors and Random
Forest models are trained in parallel. After training, the models are evaluated on unseen data
within the Model Testing Module. The results, including performance metrics like accuracy,
precision, and recall, are aggregated in the Evaluation Module. These metrics and any prediction
outputs are then sent back to the User Interface, allowing the user to review and interact with the
outcomes of both classifiers. This modular architecture supports efficient deployment and
comparison of both distance-based and ensemble learning models.
4.2 Modules
1. Upload Dataset:
The user initiates the process by uploading a dataset of smart meter readings, containing labeled data
indicating normal and theft-related activities. This dataset may include features such as energy
consumption (kWh), voltage, current, power factor, and timestamps, along with labels such as
normal or theft for supervised training.
2. Data Preprocessing:
The uploaded dataset undergoes systematic preprocessing. This includes handling missing values
using imputation, removing outliers, converting categorical fields (if any) to numerical values (e.g.,
theft type encoding), and normalizing continuous features. This ensures the data is clean, structured,
and ready for effective machine learning model training.
3. Exploratory Data Analysis (EDA):
Visualization techniques are employed to understand patterns in the energy consumption data. Plots
such as line graphs, distribution plots, and correlation heatmaps are generated to detect trends,
outliers, seasonality, or unusual spikes in energy usage—helpful in feature selection and hypothesis
building.
4. Data Splitting:
The cleaned dataset is partitioned into training and testing subsets using train_test_split, typically
following a 70-30 or 80-20 ratio. This division allows the model to learn from historical patterns and
be tested on unseen data to evaluate its generalization ability.
5. Model Building:
K-Nearest Neighbors (KNN): A distance-based classifier that assigns theft or normal labels
based on the proximity of similar energy usage profiles.
Random Forest Classifier (RFC): An ensemble-based model that constructs multiple decision
trees to enhance accuracy and reduce overfitting, ideal for capturing non-linear relationships in
smart grid data.
6. Model Testing:
The trained KNN and RFC models are applied to the test dataset to generate predictions. These
predictions are then compared to actual labels to determine whether energy theft is correctly
identified.
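7. Performance Evaluation:
The predictions of both models are assessed using accuracy, precision, recall, F1-score, and a
confusion matrix, which together show how well each model separates normal consumers from stealers.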
8. Model Prediction:
The model with the best evaluation performance is selected for deployment. It is then used to predict
theft on real-time or batch smart meter data, enabling timely detection of anomalies. Detected theft
instances can be flagged for further inspection, reported to operators, or used to trigger automated
alerts.
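The preprocessing described in module 2 could look roughly like the sketch below. The column names, the sample values, the median imputation, the 1.5 x IQR outlier rule, and the min-max scaling are all assumptions chosen for illustration; the report does not state which exact techniques were used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

NUMERIC_COLS = ["consumption_kwh", "voltage"]  # hypothetical feature names

# Tiny illustrative frame with missing values and one extreme reading.
raw = pd.DataFrame({
    "consumption_kwh": [12.0, 11.5, np.nan, 13.2, 250.0, 12.8, 11.9, 12.4],
    "voltage": [230.0, 229.0, 231.0, np.nan, 230.5, 228.5, 230.2, 229.8],
    "label": ["normal", "normal", "theft", "normal", "theft", "normal", "theft", "normal"],
})


def preprocess(df):
    df = df.copy()

    # 1. Impute missing numeric values with the column median.
    df[NUMERIC_COLS] = df[NUMERIC_COLS].fillna(df[NUMERIC_COLS].median())

    # 2. Remove rows whose consumption lies outside 1.5 * IQR of the quartiles.
    q1 = df["consumption_kwh"].quantile(0.25)
    q3 = df["consumption_kwh"].quantile(0.75)
    iqr = q3 - q1
    in_range = df["consumption_kwh"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    df = df[in_range].copy()

    # 3. Encode the categorical label as integers.
    df["label"] = df["label"].map({"normal": 0, "theft": 1})

    # 4. Scale the numeric features to the [0, 1] range.
    df[NUMERIC_COLS] = MinMaxScaler().fit_transform(df[NUMERIC_COLS])
    return df.reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

In theft detection, extreme readings can themselves be the signal of interest, so an outlier rule like this one should be applied with care. In a real workflow the scaler would also be fitted on the training split only.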
The hardware requirements for running the theft detection system are influenced by factors
such as the size of the smart meter dataset, the complexity of preprocessing steps, and the
computational needs of the machine learning models (K-Nearest Neighbors and Random Forest).
Below are the minimum recommended specifications:
Component              Specification
System                 Intel i3 Processor or equivalent (dual-core minimum)
RAM                    4 GB (8 GB recommended for larger datasets)

Component              Specification
Operating System       Windows 10 / Linux Ubuntu 18.04+ / macOS 10.13+
Programming Language   Python 3.7 or higher
IDE / Editor           Jupyter Notebook / VS Code / PyCharm
Libraries              scikit-learn, pandas, numpy, matplotlib, seaborn
Package Manager        pip / conda
Browser                Chrome / Firefox (for notebook interface)
4.6 Testing
4.6.5 White Box Testing:
White box testing focuses on the internal logic, paths, and control flows in the code. It involves
checking the implementation of algorithms, conditions, loops, and data transformations.
Developers use this to ensure all logical paths are tested, for example by confirming that
preprocessing conditions adapt to different data types. This ensures code correctness and improves
coverage.
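As an illustration of white box testing, the sketch below exercises each branch of a small preprocessing helper. The function `fill_missing` is hypothetical and does not appear in the project code; it only stands in for a preprocessing condition that depends on the data type.

```python
import numpy as np
import pandas as pd


def fill_missing(column: pd.Series) -> pd.Series:
    """Hypothetical preprocessing helper with one branch per data type."""
    if pd.api.types.is_numeric_dtype(column):
        # Numeric branch: replace missing values with the median.
        return column.fillna(column.median())
    # Non-numeric branch: fall back to the most frequent value.
    return column.fillna(column.mode().iloc[0])


def test_numeric_branch():
    result = fill_missing(pd.Series([1.0, np.nan, 3.0]))
    assert result.tolist() == [1.0, 2.0, 3.0]


def test_categorical_branch():
    result = fill_missing(pd.Series(["a", None, "a", "b"]))
    assert result.tolist() == ["a", "a", "a", "b"]


def test_no_missing_values_unchanged():
    original = pd.Series([5, 6, 7])
    assert fill_missing(original).equals(original)


# Run every test so that both branches of fill_missing are covered.
for test in (test_numeric_branch, test_categorical_branch, test_no_missing_values_unchanged):
    test()
print("all branches covered")
```

Each test targets one path through the function, which is the defining trait of white box testing: the test cases are derived from the code's internal structure rather than only from its external behavior.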
CHAPTER 5
SOURCE CODE
import os

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.cluster import DBSCAN, KMeans
from sklearn.ensemble import BaggingClassifier, IsolationForest, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# Load the full dataset
data = pd.read_csv("Datasets/AllData.csv")
data

# Create a count plot of the target column 'IsStealer'
sns.set(style="darkgrid")  # Set the style of the plot
plt.figure(figsize=(8, 6))  # Set the figure size
ax = sns.countplot(x=data['IsStealer'])
plt.title("Count Plot")  # Add a title to the plot
plt.xlabel("Categories")  # Add label to x-axis
plt.ylabel("Count")  # Add label to y-axis

# Annotate each bar with its count value
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()
# data splitting
# X (features) and y ('IsStealer' labels) come from a preprocessing step not shown in this listing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
target_names = ["Non-Stealer", "Stealer"]

# existing model
# Define model filename (assumed path; the original listing omits this definition)
MODEL_PATH = "model/knn_model.pkl"

if os.path.exists(MODEL_PATH):
    print("Loading existing model...")
    clf = joblib.load(MODEL_PATH)  # Load model
else:
    print("Training new model...")
    clf = KNeighborsClassifier(n_neighbors=5)  # You can tweak n_neighbors
    clf.fit(X_train, y_train)
    os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)  # Make sure the folder exists
    joblib.dump(clf, MODEL_PATH)  # Save model
    print("Model saved.")

y_pred = clf.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred) * 100
print(f'Accuracy: {accuracy:.2f}%')
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=target_names,
            yticklabels=target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("knn_model Confusion Matrix")
plt.show()
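# proposed model
# Define model filename
MODEL_PATH = "model/random_forest.pkl"

# The original listing omits the training step, so the load-or-train block below
# mirrors the KNN block; the hyperparameters are assumptions
if os.path.exists(MODEL_PATH):
    clf = joblib.load(MODEL_PATH)  # Load model
else:
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)  # Make sure the folder exists
    joblib.dump(clf, MODEL_PATH)  # Save model

# Predictions
y_pred = clf.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred) * 100
print(f'Accuracy: {accuracy:.2f}%')
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=target_names,
            yticklabels=target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("random_forest_model Confusion Matrix")
plt.show()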
# Load the unseen test data
test = pd.read_csv("Datasets/test.csv")
test

# Drop 'UserId' as it is not a feature
test = test.drop(columns=['UserId'])
test

# Make predictions on the selected test data
predict = clf.predict(test)

# Loop through each prediction and print the corresponding row
for i, p in enumerate(predict):
    print(test.iloc[i])  # Print the row
    print(f"Row {i}:************************************************** {target_names[p]}")
CHAPTER 6
EXPERIMENTAL RESULTS
Data Analysis
This figure demonstrates a set of fundamental exploratory data analysis operations applied to the
dataset. The null values check identifies if any missing or incomplete records exist in the dataset. The
nunique function shows the number of unique values for each feature, giving insight into data
variability. The info summary displays the data types and memory usage of each column, which is
essential for understanding the structure of the dataset before processing. The describe function
provides statistical measures such as mean, standard deviation, minimum, and maximum values for
each feature, enabling an assessment of distribution and potential anomalies.
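The four operations named above can be reproduced with a few pandas calls. The sketch below uses a tiny made-up frame; `UserId` and `IsStealer` are column names from the project code, while `consumption_kwh` and all values are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up sample; only the UserId and IsStealer column names come from the project.
data = pd.DataFrame({
    "UserId": [1, 2, 3, 4],
    "consumption_kwh": [12.5, np.nan, 9.8, 30.1],  # hypothetical feature
    "IsStealer": [0, 0, 0, 1],
})

null_counts = data.isnull().sum()   # Missing values per column
unique_counts = data.nunique()      # Distinct values per column
summary = data.describe()           # Count, mean, std, min, quartiles, max

print(null_counts)
print(unique_counts)
data.info()                         # Data types and memory usage
print(summary)
```

Note that `describe` counts only non-missing values, so a `count` lower than the number of rows is another quick indicator of missing data.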
Fig.6.3: Count Plot of the Dataset.
This figure visualizes the distribution of the target classes within the dataset using a count plot. It
shows the frequency of instances labeled as 'Stealer' and 'Non-Stealer'. This visualization is critical to
assess whether the dataset is balanced or imbalanced in terms of class representation. A balanced
dataset ensures fair training of the model, while an imbalanced one requires handling strategies such as
resampling or weighted loss during model training.
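One way to apply the weighting strategy mentioned above is through scikit-learn's class weights. The sketch below is an assumption-laden illustration on synthetic imbalanced data, not the project's own approach (the source listing imports SMOTE from imbalanced-learn, which is a resampling alternative).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1 (assumed ratio).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Inspect the class distribution, as the count plot does visually.
counts = np.bincount(y)
print("samples per class:", counts)

# "balanced" weights are inversely proportional to class frequency.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print("class weights:", class_weight)

# Train a forest that penalizes mistakes on the minority class more heavily.
clf = RandomForestClassifier(n_estimators=50, class_weight=class_weight, random_state=0)
clf.fit(X, y)
```

Passing `class_weight="balanced"` directly to the classifier has the same effect; the weights are computed explicitly here only to make them visible.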
Fig.6.5: Performance Metrics of RFC, KNN Models
Metric         Class         Existing Model (KNN)   Proposed Model (RFC)
Accuracy       -             82.95%                 94.98%
Precision      Non-Stealer   0.98                   0.95
               Stealer       0.75                   0.95
Recall         Non-Stealer   0.67                   0.95
               Stealer       0.99                   0.95
F1-Score       Non-Stealer   0.80                   0.95
               Stealer       0.85                   0.95
Macro Avg      Precision     0.87                   0.95
               Recall        0.83                   0.95
               F1-Score      0.83                   0.95
Weighted Avg   Precision     0.87                   0.95
               Recall        0.83                   0.95
               F1-Score      0.83                   0.95
This figure presents the evaluation metrics (precision, recall, and F1-score) for both the RFC and
KNN models in a comparative format. Each metric is shown for both the Stealer and Non-Stealer
classes, along with overall accuracy, macro average, and weighted average. The RFC model achieves
higher accuracy (94.98% versus 82.95%) and higher F1-scores for both classes, with precision and
recall balanced at 0.95. KNN scores higher only on Non-Stealer precision (0.98) and Stealer recall
(0.99), and it does so at the cost of low Non-Stealer recall (0.67) and low Stealer precision
(0.75), so the RFC offers the more balanced performance.
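The reported F1-scores can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does this for the KNN column using the rounded values from the table; because the inputs are already rounded, agreement is only expected to within rounding error.

```python
def f1(precision, recall):
    """F1-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs for the KNN model, taken from the table.
knn = {"Non-Stealer": (0.98, 0.67), "Stealer": (0.75, 0.99)}
reported_f1 = {"Non-Stealer": 0.80, "Stealer": 0.85}

computed_f1 = {label: f1(p, r) for label, (p, r) in knn.items()}
for label, value in computed_f1.items():
    print(f"{label}: computed F1 = {value:.4f}, reported F1 = {reported_f1[label]:.2f}")

# The macro average is the unweighted mean over the classes.
macro_recall = sum(r for _, r in knn.values()) / len(knn)
print(f"macro recall = {macro_recall:.2f}")
```

The computed values round to the reported 0.80 and 0.85. The gap between precision and recall within each KNN class shows why its F1-scores fall below those of the RFC despite the high individual figures.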
This figure shows how the trained models predict outcomes on unseen test data. It includes a
side-by-side comparison of the actual vs. predicted values for each entry in the test dataset. The results visually
validate how closely the model’s predictions match the actual labels. The RFC model produces
predictions with higher accuracy and consistency, demonstrating its reliability and generalization
capabilities in practical deployment scenarios.
Fig.6.7: Average Daily Electricity Consumption Across All Users.
CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENT
7.1. Conclusion
The proposed ML-driven system provides a scalable approach to identifying potential electricity
theft within modern smart grid infrastructures. By applying the K-Nearest Neighbors (KNN) and
Random Forest Classifier (RFC) algorithms to smart meter data, the system detects anomalies in
energy consumption patterns; in our experiments, the RFC reached 94.98% accuracy compared with
82.95% for KNN. Unlike traditional theft detection methods that rely heavily on manual inspections
and reactive auditing, this solution offers a proactive mechanism that can process real-time or
batch meter data and flag suspicious behavior promptly.
Through systematic preprocessing, effective model training, and rigorous evaluation, the system
ensures reliable performance even in complex and dynamic grid environments. The integration of
smart metering, automated analysis, and visualization tools enhances transparency, supports grid
integrity, and minimizes losses due to unauthorized usage. This work not only demonstrates the
potential of machine learning in addressing energy theft but also lays the groundwork for future
enhancements such as federated learning and deep learning techniques for even greater adaptability
and security.
Ultimately, this project contributes toward building a more secure, efficient, and intelligent smart grid
ecosystem that benefits both utility providers and consumers alike.
The system also supports scalability, making it suitable for implementation across a wide range of
urban and rural grid setups. Furthermore, it has the potential to integrate federated learning
frameworks, thereby preserving data privacy while enabling distributed training on edge devices or
substations.
Overall, this work makes a significant contribution to the field of smart grid cybersecurity and
operational efficiency. It not only minimizes energy loss and financial damage due to theft but also
builds trust and transparency between energy providers and consumers. The project serves as a
foundational model for future enhancements involving deep learning, edge computing, blockchain
integration for secure logging, and real-time alert automation, thus paving the way toward a smarter
and more resilient energy infrastructure.
Blockchain Integration for Data Integrity:
Integrating blockchain technology can provide an immutable log of energy transactions, model
updates, and system alerts. This ensures transparency, prevents tampering, and reinforces trust in the
theft detection process, especially when combined with federated learning.
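The tamper-evidence idea behind such a log can be illustrated with a simple hash chain, in which every entry stores the hash of the previous one. This is a deliberately simplified sketch, not a real blockchain (there is no network, consensus, or signing), and the entry fields and meter identifiers are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # Placeholder hash preceding the first entry


def make_entry(previous_hash, payload):
    """Build a log entry whose hash covers both the payload and the previous hash."""
    body = json.dumps({"previous_hash": previous_hash, "payload": payload}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"previous_hash": previous_hash, "payload": payload, "hash": digest}


def append(log, payload):
    """Append a new entry linked to the current end of the log."""
    previous_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append(make_entry(previous_hash, payload))


def is_valid(log):
    """Recompute every hash and check that each entry links to its predecessor."""
    previous_hash = GENESIS_HASH
    for entry in log:
        expected = make_entry(previous_hash, entry["payload"])
        if entry["previous_hash"] != previous_hash or entry["hash"] != expected["hash"]:
            return False
        previous_hash = entry["hash"]
    return True


log = []
append(log, {"meter_id": "M-001", "alert": "possible theft"})  # hypothetical alert
append(log, {"meter_id": "M-002", "alert": "normal"})
valid_before = is_valid(log)

# Tampering with an earlier entry breaks the chain.
log[0]["payload"]["alert"] = "normal"
valid_after = is_valid(log)

print(valid_before, valid_after)
```

Because each hash depends on the one before it, altering any past alert invalidates the stored hashes from that point on, which is the property the text refers to as preventing tampering.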