
A

Mini Project Report

on

ML-DRIVEN POTENTIAL THEFT IDENTIFICATION FOR ENHANCING
INTEGRITY AND EFFICIENCY OF SMART GRID SYSTEMS

Submitted for partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY

in

INFORMATION TECHNOLOGY

Submitted by

S. SAI PRANEETH 22K81A1255
V. SRINIJA REDDY 22K81A1263
B. MANIKANAND 22K81A1206

Under the Guidance of

Mrs. K. Surya Kanthi
Assistant Professor

DEPARTMENT OF INFORMATION TECHNOLOGY

St. MARTIN'S ENGINEERING COLLEGE


UGC Autonomous
Affiliated to JNTUH, Approved by AICTE,
Accredited by NBA & NAAC A+, ISO 9001-2008 Certified
Dhulapally, Secunderabad - 500 100
www.smec.ac.in

JUNE- 2025

Certificate

This is to certify that the project entitled “ML-DRIVEN POTENTIAL THEFT
IDENTIFICATION FOR ENHANCING INTEGRITY AND EFFICIENCY OF SMART
GRID SYSTEMS” is being submitted by S. Sai Praneeth (22K81A1255), V. Srinija Reddy (22K81A1263),
and B. Manikanand (22K81A1206) in fulfilment of the requirement for the award of the degree of
BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY and is a record of
bonafide work carried out by them. The results embodied in this report have been verified and found
satisfactory.

Signature of Guide Signature of HOD


Mrs. K. SURYA KANTHI Dr. N. KRISHNAIAH
Assistant Professor Professor and Head of Department
Department of IT Department of IT

Internal Examiner External Examiner

Place: Dhulapally, Secunderabad


Date:

DEPARTMENT OF INFORMATION TECHNOLOGY

Declaration
We, the students of ‘Bachelor of Technology in Department of Information Technology’,
session: 2022 - 2026, St. Martin’s Engineering College, Dhulapally, Kompally,
Secunderabad, hereby declare that the work presented in this Project Work entitled “ML-DRIVEN
POTENTIAL THEFT IDENTIFICATION FOR ENHANCING INTEGRITY
AND EFFICIENCY OF SMART GRID SYSTEMS” is the outcome of our own bonafide work and
is correct to the best of our knowledge, and that this work has been undertaken with due regard for
engineering ethics. The results embodied in this project report have not been submitted to any
university for the award of any degree.

S. Sai Praneeth 22K81A1255


V. Srinija Reddy 22K81A1263
B. Manikanand 22K81A1206
ACKNOWLEDGEMENT

The satisfaction and euphoria that accompany the successful completion of any task would
be incomplete without the mention of the people who made it possible and whose encouragement
and guidance have crowned our efforts with success.
First and foremost, we would like to express our deep sense of gratitude and indebtedness to
our College Management for their kind support and permission to use the facilities available in the
Institute.
We especially would like to express our deep sense of gratitude and indebtedness to Dr. P.
SANTOSH KUMAR PATRA, Professor and Group Director, St. Martin’s Engineering College,
Dhulapally, for permitting us to undertake this project.
We wish to record our profound gratitude to Dr. M. SREENIVAS RAO, Principal, St.
Martin’s Engineering College, for his motivation and encouragement.
We are also thankful to Dr. N. KRISHNAIAH, Head of the Department, Information
Technology, St. Martin’s Engineering College, Dhulapally, Secunderabad, for his support and
guidance throughout our project, as well as to the Project Coordinator, MRS. G. GOUTHAMI, Assistant
Professor, Information Technology department, for her valuable support.
We would like to express our sincere gratitude and indebtedness to our project supervisor
MRS. K. SURYA KANTHI, Assistant Professor, Information Technology, St. Martin’s Engineering
College, Dhulapally, for her support and guidance throughout our project.
Finally, we express thanks to all those who have helped us in successfully completing this
project. Furthermore, we would like to thank our family and friends for their moral support and
encouragement.

S. Sai Praneeth 22K81A1255


V. Srinija Reddy 22K81A1263
B. Manikanand 22K81A1206

ABSTRACT

The security and efficiency of smart grid systems are increasingly being threatened by potential thefts,
cyber-attacks, and unauthorized access, leading to significant financial and operational losses.
Traditional approaches for detecting theft and ensuring grid integrity rely heavily on manual
monitoring, which is both time-consuming and prone to human error. This research introduces a
machine learning (ML)-driven system designed to identify potential thefts and optimize the efficiency
of smart grid systems. By leveraging data such as energy consumption patterns, grid operations, and
external factors, the system can detect anomalies in real-time, predicting suspicious activities that may
indicate theft or fraud. Using machine learning models, the system analyzes vast amounts of data to
identify subtle patterns that would otherwise be overlooked by conventional methods. This proactive
approach enables the early detection of theft, reducing its impact and improving the overall efficiency
of the smart grid. Moreover, the system can be fine-tuned and adapted to different regions and grid
configurations, providing a scalable and flexible solution for grid operators. With the increasing
adoption of smart meters and IoT devices in modern grids, the availability of real-time data allows for
more accurate and timely decision-making. This research highlights the potential of machine learning
in enhancing grid security, optimizing resource distribution, and reducing operational costs.

LIST OF FIGURES

Figure no. Figure Title

3.2 Block Diagram of Proposed System
4.3.1 System Architecture
4.3.2 Data Flow Diagram
4.3.3.1 Class Diagram
4.3.3.2 Activity Diagram
4.3.3.3 Use Case Diagram
4.3.3.4 Sequence Diagram
4.3.3.5 Deployment Diagram
6.1 Display Sample Dataset
6.2 Preprocessing
6.3 CountPlot of Dataset
6.4 Confusion Matrix
6.5 Performance Evaluation
6.6 Model Prediction
6.7 Average Consumption
6.8 Stealer Vs Non-Stealer

LIST OF TABLES

Table no. Table Name

4.1 Database
4.5.1 Hardware Requirements
4.5.2 Software Requirements

LIST OF ACRONYMS AND DEFINITIONS

S. No. Acronym Definition

1. KNN K-Nearest Neighbors

2. LSTM Long Short-Term Memory

3. EDA Exploratory Data Analysis

4. BC Bagging Classifier

5. RFC Random Forest Classifier

6. IF Isolation Forest

7. SVM Support Vector Machine

8. UML Unified Modelling Language

CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS AND DEFINITIONS

CHAPTER 1 INTRODUCTION
1.1 Overview
1.2 Research Motivation
1.3 Problem Statement
1.4 Need and Significance
1.5 Applications

CHAPTER 2 LITERATURE SURVEY

CHAPTER 3 SYSTEM ANALYSIS AND DESIGN
3.1 Existing System
3.2 Proposed System
3.3 System Configuration

CHAPTER 4 SYSTEM REQUIREMENTS AND SPECIFICATIONS
4.1 Database
4.2 K-Nearest Neighbors
4.2.1 Random Forest Classifier
4.3 Design
4.3.1 System Architecture
4.3.2 Data Flow Diagram
4.3.3 UML Diagram
4.3.3.1 Class Diagram
4.3.3.2 Activity Diagram
4.3.3.3 Use Case Diagram
4.3.3.4 Sequence Diagram
4.3.3.5 Deployment Diagram
4.4 Modules
4.5 System Requirements
4.5.1 Software Requirements
4.5.2 Hardware Requirements
4.6 Testing
4.6.1 Unit Testing
4.6.2 Integration Testing
4.6.3 Functional Testing
4.6.4 System Testing
4.6.5 White Box Testing
4.6.6 Black Box Testing
4.6.7 Acceptance Testing

CHAPTER 5 SOURCE CODE

CHAPTER 6 EXPERIMENTAL RESULTS

CHAPTER 7 CONCLUSION & FUTURE ENHANCEMENT
7.1 Conclusion
7.2 Future Enhancement

REFERENCES

Patent/Publication

CHAPTER 1
INTRODUCTION

1.1. Overview

Smart grid systems, which integrate advanced communication technologies with traditional electrical
grids, aim to optimize energy distribution, improve grid reliability, and enable real-time monitoring of
energy usage. However, as these systems become more complex and interconnected, they become
vulnerable to potential theft, fraud, and cyber-attacks. According to the U.S. Department of Energy,
energy theft alone costs utilities billions of dollars annually. Smart grids are particularly susceptible to
these threats because of their reliance on distributed sensors, smart meters, and IoT devices, which can
be manipulated or hacked. Traditional theft detection systems are often reactive, relying on physical
inspections, customer complaints, and audits to identify discrepancies. These methods are time-
consuming and inefficient, especially when large volumes of data from multiple sources need to be
processed. To address these issues, this research proposes an ML-driven solution that continuously
monitors energy consumption patterns, identifies irregularities, and flags potential thefts in real-time.
By analyzing the vast datasets generated by smart grids, this system will help utility companies
proactively detect and prevent theft, thereby improving both grid security and operational efficiency.

1.2. Research Motivation

The motivation behind this research stems from the growing need for more secure and efficient smart
grid systems in the face of increasing theft and fraud. As smart grids become integral to the
functioning of modern energy systems, ensuring their integrity becomes paramount. Theft, whether
through illegal connections, meter tampering, or cyber intrusion, undermines the goals of smart grid
technology by reducing revenue, compromising system performance, and raising operational costs.
Traditional methods of detecting theft are not equipped to handle the volume and complexity of data
produced by modern grids. Machine learning offers a promising solution by enabling the automated
detection of anomalies in real-time, identifying patterns that could indicate potential theft or
inefficiency. The research aims to harness the power of machine learning to enhance the security and
efficiency of smart grids, providing utility operators with the tools needed to mitigate losses and
optimize energy distribution.

1.3. Problem Statement

As the adoption of smart grid technologies grows, theft and unauthorized access pose serious risks to
the reliability, security, and efficiency of the system. Traditional theft detection mechanisms are
reactive and cannot keep pace with the evolving techniques used by those attempting to steal energy.
The problem lies in the limited ability of utility companies to monitor and detect suspicious activities
across vast, decentralized networks that generate enormous amounts of data. This massive influx of data
makes manual inspection and traditional algorithms inefficient and impractical. A more automated,
proactive approach is necessary to identify potential thefts early on, preventing further damage and
optimizing the grid's overall performance. This research addresses the gap by developing
an ML-driven system capable of detecting anomalies and potential theft in smart grid systems in
real time.

1.4. Need and Significance

• Early Detection of Theft: ML algorithms enable the real-time identification of abnormal
consumption patterns, allowing utilities to detect theft and unauthorized access before they escalate.
• Improved Grid Efficiency: By identifying and addressing inefficiencies in energy distribution, the
system helps optimize grid performance and reduce operational costs.
• Scalability: The proposed system can be adapted to a wide range of grid configurations, ensuring
its applicability to both small and large-scale smart grid systems.
• Reduced Operational Costs: Detecting theft early prevents revenue losses, reducing the need for
costly investigations and manual inspections.
• Enhanced Security: By continuously monitoring the grid and predicting potential threats, the
system helps improve overall grid security and resilience against cyber-attacks.

1.5. Applications

• Energy Theft Detection: The primary application of the proposed system is in identifying energy
theft in smart grids, helping utility companies minimize losses and maintain operational integrity.
• Grid Optimization: By analyzing energy usage patterns, the system can optimize resource
distribution, ensuring that energy is efficiently allocated where it is needed most.
• Predictive Maintenance: The ML models can help predict when equipment is likely to fail,
allowing for proactive maintenance and reducing downtime.
• Customer Billing Accuracy: By detecting irregularities in energy usage, the system helps keep
customer bills accurate.

CHAPTER 2
LITERATURE SURVEY

Gunduz et al. [1] presented a detailed overview of the Internet of Things (IoT), discussing its
evolution, core components, and diverse application areas. Their work emphasizes the foundational
role of IoT in sectors such as healthcare, transportation, and especially smart grids. The paper
highlights how IoT integration facilitates real-time monitoring and efficient resource management.
They also touch upon the communication challenges and the need for robust infrastructure in smart
systems.

Das et al. [2] investigated the vulnerabilities and threats posed by cyber-attacks on IoT-based
critical infrastructure. They focused on identifying attack vectors and emphasized the potential damage
to essential systems such as smart grids and industrial control environments. Their study outlines the
importance of security frameworks and recommends advanced detection mechanisms. It also illustrates
specific cases of attacks, proposing mitigation strategies through secure IoT architectures.

Emmanuel et al. [3] explored various communication technologies applicable to smart grid environments,
including ZigBee, Wi-Fi, and LTE. Their survey discussed the trade-offs between latency, range, and
data throughput for each technology. They highlighted how the choice of communication protocol
directly influences grid performance, scalability, and cyber-security. This work provides a
foundational comparison essential for designing communication layers in smart infrastructure.

Kimani et al. [4] examined the cybersecurity challenges faced by IoT-based smart grid networks. They
identified major threats such as data breaches, denial-of-service attacks, and unauthorized access. The
paper also delved into security requirements like authentication, encryption, and intrusion detection
systems. Their findings emphasize the urgent need for robust security models tailored to the unique
characteristics of smart grid networks.

Gunduz et al. [5] provided an analysis of the communication infrastructure and cyber-security aspects
specific to smart grids. Their study categorizes network layers and identifies vulnerabilities inherent
in each communication tier. They propose integrated solutions involving blockchain and encryption
techniques. The authors argue for a layered security model to defend against sophisticated threats in
critical grid systems.

Qays et al. [6] conducted a comprehensive review of communication technologies, applications, and
protocols in IoT-assisted smart grid systems. The paper categorizes existing techniques based on
functionality and security capabilities. It also provides insights into future research directions such
as 6G and edge computing for smart gradient boosting and deep neural networks.

Sahoo et al. [7] proposed a data-driven approach for electricity theft detection using smart meter data.
By analyzing consumption patterns and comparing them with expected norms, their method identifies
anomalies suggestive of theft. They validated the approach using real-world datasets and demonstrated
significant detection accuracy. This work underlines the role of advanced analytics in securing energy
distribution.

Althobaiti et al. [8] presented a survey on energy theft in smart grids, focusing on various
attack strategies and detection methods. They reviewed machine learning, statistical, and
heuristic-based techniques for fraud identification. The paper discusses challenges such as data
imbalance, false positives, and evolving theft tactics. It concludes with recommendations for adaptive
and scalable detection frameworks.

Takiddin et al. [9] addressed the problem of false data injection attacks in smart
grids, proposing detection algorithms based on signal processing. Their model identifies abnormal data
patterns and signals injected to disrupt grid operation. They tested the method on benchmark datasets,
showing effectiveness in minimizing error rates. Their contribution emphasizes real-time detection to
maintain grid integrity.

Badr et al. [10] reviewed existing data-driven methods for fraud detection in
smart metering systems. The study encompasses supervised and unsupervised learning models,
discussing their strengths and limitations. It also considers the impact of feature selection, training data
quality, and model interpretability. Their findings guide researchers in choosing suitable AI techniques
for electricity fraud scenarios.

Wang et al. [11] introduced a deep learning approach to infer socio-demographic information from
smart meter data. Their model can predict household attributes like
occupancy and income based on electricity consumption behavior. They highlight privacy concerns
but also demonstrate the potential for targeted energy policies. This research shows how smart grid
data can reveal insights beyond utility usage.

Reda et al. [12] surveyed false data injection attacks in
smart grids, offering taxonomies based on models, targets, and consequences. They categorized attacks
by their entry points and assessed impacts on reliability and safety. The paper also presents defenses
such as blockchain, anomaly detection, and secure protocols. Their work aids in developing
comprehensive protection strategies.

Javaid et al. [13] employed GANCNN and ERNET models to
detect non-technical losses in smart grids. Their hybrid framework enhances detection accuracy
through feature learning and error correction. The system was tested on multiple datasets and showed
robust performance in identifying fraud. Their research confirms the utility of deep learning in energy
theft scenarios.

Habib et al. [14] investigated false data injection attacks in smart grid cyber-physical
systems, highlighting current issues and future challenges. They reviewed technical and regulatory
gaps in existing defense mechanisms.

CHAPTER 3
SYSTEM ANALYSIS AND DESIGN

3.1. Existing System and Its Limitations

Traditional theft detection systems in smart grids rely on physical inspections, periodic audits, and
customer complaints. These systems are reactive, only identifying theft after it has occurred or been
reported. With the increasing complexity and scale of smart grids, these methods have become less
effective in dealing with the volume of data generated. Physical inspections are labor-intensive and
time-consuming, while audits rely on incomplete or delayed data, making it difficult to detect theft in
real-time. Furthermore, these methods fail to capture subtle or sophisticated theft techniques, such as
meter tampering or cyber intrusions, which can go unnoticed for extended periods. The traditional
systems also struggle to scale, especially in large, decentralized smart grids, and often cannot identify
inefficiencies or potential threats until they cause significant disruptions.
Although these methods can uncover known forms of theft, they have notable limitations when it
comes to scalability, adaptability, real-time response, and, most importantly, privacy.

• Reactive Nature of Traditional Systems.

These systems are not proactive. They typically detect theft only after it has occurred or been reported
by consumers, leading to delayed responses. This delay often translates to prolonged revenue losses,
unmonitored grid instability, and in some cases, permanent damage to infrastructure.

• Scalability Challenges.

Smart grids are massively distributed and generate vast amounts of real-time data from thousands or
millions of smart meters and edge devices. Traditional systems are not designed to handle this scale.
Manual audits or periodic checks cannot keep up with the frequency and volume of data, rendering
them ineffective in identifying emerging threats in real time.

• Time and Resource Intensive.

Physical inspections require considerable manpower and time, making them inefficient, especially in
urban areas with high consumer density or rural areas with hard-to-reach locations. Furthermore,
audits often depend on incomplete historical data that may not reflect current patterns or real-time
anomalies.

• Lack of Real-Time Detection and Adaptability.

Modern smart grids require real-time analytics to quickly respond to abnormal consumption patterns.
Traditional systems lack real-time monitoring and are not adaptive. They cannot evolve or learn from
new patterns of theft and inefficiencies. This static nature makes them ineffective in dynamic
environments where consumption behavior and attack methods constantly evolve.

• Limited Anomaly Recognition.

Rule-based systems can only detect known, pre-defined patterns. They fail to recognize novel or
low-profile anomalies, especially those crafted to mimic normal behavior. Sophisticated thieves often
exploit these blind spots by staying just within acceptable thresholds.

Limitations:

• Reactive approach to theft detection.
• Labor-intensive and time-consuming physical inspections.
• Inability to scale efficiently across large smart grid networks.

3.2. Proposed System and its Advantages

The proposed ML-driven system offers a proactive and scalable approach to detecting theft in smart
grid systems. By continuously analyzing energy consumption data, the system can identify anomalies
and flag potential thefts in real-time. Using advanced machine learning algorithms, the system can
detect patterns that indicate suspicious activities, such as unusual consumption spikes or tampered
meters, and alert grid operators immediately. This enables faster response times and prevents further
losses. Additionally, the system can scale to handle large volumes of data from smart meters and IoT
devices, ensuring its applicability across diverse grid configurations. The ML model can be trained and
fine-tuned over time to improve detection accuracy and adapt to new theft techniques.
The proposed ML-driven system presents a proactive, intelligent, and scalable solution for detecting
electricity theft in modern smart grid environments. Unlike conventional methods that rely heavily on
physical inspections and customer complaints, this system leverages real-time data analytics and
machine learning to monitor and assess energy consumption patterns continuously. Through advanced
algorithms such as Random Forests, LSTM networks, Autoencoders, and Isolation Forests, the system
can identify anomalies in electricity usage that suggest fraudulent activity—such as unexpected spikes,
prolonged low usage, or tampered meter signals. These anomalies are flagged immediately, allowing
utility providers to act swiftly and prevent further losses.
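
As a rough illustration of the anomaly-flagging idea described above, the following sketch fits an Isolation Forest (one of the algorithms named in this section) to synthetic consumption readings and checks whether an injected spike is flagged. The readings, the contamination value, and all variable names are assumptions made for this example; they are not settings taken from the project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic readings (kWh): 300 ordinary values plus one injected spike.
rng = np.random.default_rng(0)
normal_usage = rng.normal(loc=5.0, scale=1.0, size=(300, 1))
spike = np.array([[25.0]])
readings = np.vstack([normal_usage, spike])

# contamination is an assumed share of anomalous readings (about 1%).
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(readings)  # -1 = anomaly, 1 = normal

flagged = np.where(labels == -1)[0]
spike_flagged = labels[-1] == -1
print("flagged indices:", flagged, "| spike flagged:", spike_flagged)
```

In a deployment the flagged indices would map back to meters and timestamps for an operator to review; a flag indicates unusual usage, not proof of theft.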
Working of the Proposed Federated Learning-Based System:

The proposed system utilizes Federated Learning (FL) to enable intelligent, decentralized theft
detection across smart grid systems without compromising user privacy. Unlike traditional
centralized machine learning approaches, FL allows smart meters and local devices to
collaboratively train a global model without sharing raw user data, making the system secure,
scalable, and privacy-preserving.
Working Process:

• Initialization:

A pre-trained anomaly detection model is distributed to all smart meters.

• Local Model Training:

Each smart meter trains the model on its own historical data to learn unique
consumption patterns.

• Model Update Sharing:

At scheduled intervals (e.g., daily or weekly), each smart meter sends encrypted model
updates, not the data, to the central server.

• Global Model Aggregation:

The server aggregates updates using algorithms like FedAvg (Federated Averaging).

• Model Distribution:

The updated global model is redistributed to all smart meters.

• Real-Time Monitoring and Detection:

With each iteration, smart meters become more effective at detecting real-time anomalies or suspicious
behaviors.
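
The aggregation step in the process above can be sketched as a weighted average of the client updates, which is the core of FedAvg. This is a minimal sketch under stated assumptions: each meter's model is represented as a flat parameter vector, the function name fed_avg and the sample counts are hypothetical, and local training and encryption are left out.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg aggregation step)."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)            # shape: (clients, params)
    # Clients with more local samples contribute proportionally more.
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()

# Hypothetical updates from three smart meters with different amounts of local data.
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
samples = [100, 100, 200]

global_weights = fed_avg(updates, samples)
print(global_weights)  # [3.5 4.5]
```

Weighting by sample count means a meter with a longer history influences the global model more than one with only a few readings.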

Advantages of the Federated Learning-Based System:

• Privacy-Preserving: No raw data leaves the consumer’s premises, satisfying privacy regulations.
• Scalable: Efficiently operates across millions of smart meters in a distributed manner.
• Adaptive Learning: Models continuously evolve based on local and global insights.
• Reduced Bandwidth Usage: Only model updates are transmitted, not bulky raw data.
• Real-Time Detection: Enables timely identification of theft or tampering.

Advantages of the Proposed System:
• Real-time anomaly detection, enabling early identification of theft and fraud.
• Scalable to handle large datasets and complex grid structures.
• Automated analysis of energy consumption patterns, reducing the need for manual inspections.
• Ability to detect sophisticated theft techniques, including cyber intrusions and meter tampering.
• Improved efficiency and cost-effectiveness compared to traditional methods.

3.3. System Configuration


The implementation of the proposed ML-driven theft detection system requires a blend of hardware
and software components, both at the edge (smart meter level) and central/server level. The system
is designed to support scalability, real-time processing, and secure data handling.

1. Functional Requirements

• Upload Dataset: Import the dataset containing energy consumption data, grid operations, and
external factors such as weather or demand fluctuations.
• Data Preprocessing: Clean and preprocess the data, including handling missing values,
normalizing consumption patterns, and encoding categorical variables.
• EDA (Exploratory Data Analysis): Visualize the data to identify trends, correlations, and
potential outliers that could indicate theft or inefficiency.
• Data Splitting: Split the data into training, validation, and test sets to ensure the robustness of the
model and prevent overfitting.
• Model Building: Build and train machine learning models to detect anomalies and predict
potential theft or inefficiency.
• Model Testing: Evaluate the model's performance on the test data to assess its accuracy in
detecting theft and predicting grid performance issues.
• Performance Evaluation: Evaluate the model using metrics like accuracy, precision, recall, and
F1 score to measure its effectiveness in identifying anomalies.
• Model Prediction on Test Data: Use the trained model to predict potential thefts and anomalies
on new data, enabling real-time monitoring of the smart grid system.
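
The Data Splitting requirement above mentions separate training, validation, and test sets. One common way to obtain them is to call train_test_split twice, as in the sketch below. The 70/15/15 proportions, the synthetic data, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and theft labels (assumed shapes).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# First hold out 30%, then divide that part equally into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Passing stratify keeps the share of theft and normal records similar in every subset, which matters when theft cases are rare.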
2. Hardware Requirements:

• Edge Devices (Smart Meters / IoT Nodes):

Processor: ARM Cortex-A series / Raspberry Pi 4 or equivalent

RAM: Minimum 2 GB (for light ML processing)


Storage: 16 GB or higher (for local data logging)

Connectivity: Wi-Fi / Zigbee / LoRa / LTE (for communication with the central server)

Sensors: Voltage, current, power factor, energy consumption sensors

Power Supply: Battery-backed or grid-connected with UPS

Optional: Tamper detection sensors (magnet, lid open, shock sensors)

• Central Server / Cloud Node:

Processor: Intel Xeon or AMD EPYC, multi-core (8 cores or higher)

RAM: Minimum 32 GB

Storage: SSD with at least 1 TB space (HDD for backups)

GPU (Optional): NVIDIA Tesla / RTX 3060+ (for deep learning models)

Network: High-speed internet or internal grid communication backbone

3. Software Requirements:
• Operating Systems:

Edge Devices: Raspbian OS / Ubuntu Core / Embedded Linux

Central Server: Ubuntu Server 20.04+ / CentOS / Debian

• Development & ML Frameworks:

Python 3.8+ – Primary language for ML model development

TensorFlow / PyTorch – For building and training ML models

Scikit-learn – For classical ML algorithms and preprocessing

Pandas / NumPy / Matplotlib – For data handling and visualization

Keras – High-level API for neural networks

PySyft – For privacy-preserving ML (optional)

• Database and Storage:

Edge Level: SQLite or local JSON/CSV logging

Central Level: PostgreSQL / MySQL for structured data

NoSQL (Optional): MongoDB for time-series or unstructured data

CHAPTER 4
SYSTEM REQUIREMENTS & SPECIFICATIONS

4.1. Database

Dataset Name: SmartGridTheftData.csv
Description: A labeled dataset generated from smart meters containing both normal and theft
consumption data
Key Features: Customer_ID, Timestamp, Energy_Usage, Voltage, Current, Power_Factor, Theft_Type
Application: Used to train and evaluate ML models for electricity theft detection.

Dataset Name: SmartGridTest.csv
Description: A dataset with the same structure but without labels, used for predictions
Key Features: Same features as training set (excluding Theft_Type)
Application: Used to test trained models and predict theft cases on new, unseen data.

4.2. Algorithms

K-Nearest Neighbors (KNN):

What is KNN?
K-Nearest Neighbors is a supervised machine learning algorithm that classifies data points based on
the majority class among their 'k' closest neighbors in the feature space. It’s a lazy learner and works
well with smaller datasets.

Step 1: Data Preprocessing

Load SmartGridTheftData.csv

Drop unnecessary columns like Customer_ID

Encode the target column Theft_Type (e.g., Normal = 0, Tampering = 1, Bypass = 2)

Normalize the feature data if needed using StandardScaler

Separate features X and labels y

Step 2: Train-Test Split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 3: Model Loading or Training

if os.path.exists('KNNClassifier.pkl'):
    knn_model = joblib.load('KNNClassifier.pkl')
else:
    knn_model = KNeighborsClassifier(n_neighbors=5)
    knn_model.fit(x_train, y_train)
    joblib.dump(knn_model, 'KNNClassifier.pkl')

Step 4: Prediction

y_pred_knn = knn_model.predict(x_test)

Step 5: Model Evaluation

Evaluate model with accuracy_score, precision_score, recall_score, f1_score

Generate confusion matrix and classification report

Step 6: Visualization

Plot confusion matrix using seaborn.heatmap()
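
Steps 1 and 5 above are given only in prose. A minimal sketch of the full KNN sequence might look like the following. It assumes a small synthetic stand-in for SmartGridTheftData.csv: the column names follow the dataset table (Timestamp is left out for brevity), but every value, and the way theft shifts Energy_Usage, is invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for SmartGridTheftData.csv (values are illustrative only).
rng = np.random.default_rng(42)
n = 300
theft_type = rng.choice(["Normal", "Tampering", "Bypass"], size=n)
# Assumption for the demo: theft lowers the recorded usage, so classes are separable.
shift = pd.Series(theft_type).map({"Normal": 0.0, "Tampering": -3.0, "Bypass": -6.0}).to_numpy()
df = pd.DataFrame({
    "Customer_ID": np.arange(n),
    "Energy_Usage": 10.0 + shift + rng.normal(0, 0.5, n),
    "Voltage": rng.normal(230, 2, n),
    "Current": rng.normal(5, 0.5, n),
    "Power_Factor": rng.uniform(0.8, 1.0, n),
    "Theft_Type": theft_type,
})

# Step 1: drop the identifier, encode the target, separate X and y.
label_map = {"Normal": 0, "Tampering": 1, "Bypass": 2}
X = df.drop(columns=["Customer_ID", "Theft_Type"])
y = df["Theft_Type"].map(label_map)

# Step 2: split first, then fit the scaler on the training rows only.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
x_train_s = scaler.fit_transform(x_train)
x_test_s = scaler.transform(x_test)

# Steps 3-5: train, predict, evaluate.
knn_model = KNeighborsClassifier(n_neighbors=5).fit(x_train_s, y_train)
y_pred_knn = knn_model.predict(x_test_s)
acc = accuracy_score(y_test, y_pred_knn)
macro_f1 = f1_score(y_test, y_pred_knn, average="macro")
print(f"accuracy={acc:.3f} macro-F1={macro_f1:.3f}")
```

Fitting StandardScaler on the training rows only is a design choice of this sketch, made so that test-set statistics do not influence the distances KNN relies on.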

4.2.1 Random Forest Classifier (RandomForestClassifier)

What is RandomForestClassifier?

Random Forest Classifier is an ensemble learning algorithm that builds multiple decision trees
and combines their results to make accurate and robust predictions. It works well for both
classification and regression tasks and helps reduce overfitting.

Step 1: Data Preprocessing
• Load the SmartGridTheftData.csv dataset.
• Remove unnecessary columns like Customer_ID.
• Define feature set (X) and label set (y).
Step 2: Train-Test Split
• Divide the dataset into training and testing sets:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
Step 3: Model Loading or Training
• Check if the model is already trained and saved:
if os.path.exists('RandomForest_weights.pkl'):
    classifier = joblib.load('RandomForest_weights.pkl')
else:
    classifier = RandomForestClassifier(random_state=42)
    classifier.fit(x_train, y_train)
    joblib.dump(classifier, 'RandomForest_weights.pkl')
Step 4: Prediction
• Predict on the test data using the trained model:
y_pred = classifier.predict(x_test)
Step 5: Model Evaluation
• Evaluate the performance using:
• accuracy_score
• precision_score
• recall_score
• f1_score
• confusion_matrix
• classification_report
Step 6: Visualization
• Display confusion matrix using seaborn.heatmap() to analyze true vs predicted classes
visually.
Step 7: Prediction on New Data
• Load and preprocess the external test dataset (SmartGridTest.csv).
• Predict and display labels such as Normal, Tampering, or Bypass based on prediction
output values (0, 1, 2).
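
The description at the start of this section says a random forest "combines" the results of its trees. In scikit-learn this combination is an average of the per-tree class probabilities, which the sketch below reproduces by hand. The data comes from make_classification and is only a synthetic stand-in for the project dataset; the tree count of 25 is an arbitrary choice for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data standing in for the theft dataset (illustrative only).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Average the class probabilities of the individual trees by hand ...
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_proba = per_tree.mean(axis=0)

# ... and compare with what the forest itself reports.
forest_proba = forest.predict_proba(X)
print("max difference:", np.abs(manual_proba - forest_proba).max())
```

Because single trees tend to overfit in different ways, averaging their probability estimates smooths out individual errors, which is the overfitting reduction the description refers to.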
4.3 Design
The system for "ML-Driven Potential Theft Identification for Enhancing Integrity and
Efficiency of Smart Grid Systems" is designed as a modular, intelligent, and scalable architecture.
It integrates components for smart meter data acquisition, data preprocessing, anomaly detection
using machine learning models, real-time alerting, and visualization. The architecture supports both
batch and real-time energy data analysis, making it suitable for modern decentralized smart grid
environments.

4.3.1 System Architecture


The system architecture begins with the user, who interacts with the application through a web
interface designed to capture input and display results. This interface communicates with the
backend server, which coordinates the internal operations. Upon receiving data, the data
preprocessing module is activated to clean, normalize, and encode the input data, making it
suitable for model training. The processed data is then passed to the federated learning engine,
which ensures that learning happens locally across nodes without centralizing raw data, thereby
preserving privacy. Within this engine, two models are utilized—K-Nearest Neighbors (KNN)
and Random Forest Classifier (RFC)—each trained on local data segments. These models
forward their outputs to the prediction system, which integrates and interprets results. Finally, a
performance evaluation module assesses the accuracy and effectiveness of the predictions. The
evaluated outcomes are then returned to the user via the web interface, completing the cycle. This
decentralized approach not only improves scalability and response time but also protects sensitive
data from being exposed to a central server.

4.3.2 Data Flow Diagram

The flowchart illustrates a complete machine learning workflow starting from the user’s
interaction with the system. The user initiates the process by uploading a dataset through a user-
friendly interface. This uploaded data then enters the preprocessing stage, where it is cleaned,
normalized, and transformed to handle any missing values or noise, ensuring consistency and
quality. Once preprocessing is complete, the data is split into training and testing sets to allow
unbiased evaluation of model performance. The training set is then used in the model training
phase, where algorithms such as K-Nearest Neighbors (KNN) and Random Forest Classifier are
applied to learn patterns from the data. After training, the model is evaluated using the testing set
in the model testing stage to verify its generalization ability. The outcomes of the model testing
are then used to generate predictions for new data inputs. Simultaneously, performance metrics
such as accuracy, precision, recall, and F1-score are calculated to assess how well the model is
performing. These predictions and evaluation results are then made accessible to the user. This
structured and iterative process ensures both the reliability of the predictions and transparency in
model evaluation.

4.3.3 UML Diagrams

UML stands for Unified Modeling Language. UML is a standardized, general-purpose modeling
language in the field of object-oriented software engineering. The standard was adopted, and is
managed, by the Object Management Group. The goal is for UML to become a common language
for creating models of object-oriented computer software. In its current form, UML comprises
two major components: a meta-model and a notation. In the future, some form of method or
process may also be added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing, constructing,
and documenting the artifacts of software systems, as well as for business modeling and other
non-software systems. UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems. It is an important part of developing
object-oriented software and of the software development process, and it uses mostly graphical
notations to express the design of software projects.

GOALS: The primary goals in the design of the UML are as follows:

i) Provide users with a ready-to-use, expressive visual modeling language so that they can
develop and exchange meaningful models.
ii) Provide extensibility and specialization mechanisms to extend the core concepts.
iii) Be independent of particular programming languages and development processes.
iv) Provide a formal basis for understanding the modeling language.
v) Encourage the growth of the OO tools market.
vi) Support higher-level development concepts such as collaborations, frameworks, patterns, and
components.
vii) Integrate best practices.

4.3.3.1 Class Diagram

The class diagram represents an object-oriented design for a machine learning pipeline. The Dataset
class manages core data operations such as uploading, preprocessing, and splitting the dataset. It
connects to the Preprocessing class, which performs essential tasks like exploratory data analysis
(EDA), removing null values, handling missing values, and normalizing the data. After
preprocessing, the clean data is passed to the Model class, which defines general machine learning
functions such as training, testing, evaluating, and predicting. Two specific model classes, a
K-Nearest Neighbors (KNN) model class and RandomForest_Model, inherit from the Model class and
implement the K-Nearest Neighbors and Random Forest algorithms respectively. These
subclasses override the train and test methods with model-specific logic. This structure promotes
modularity and code reuse while keeping the workflow organized. It also makes it easy to extend
the system with new models by simply creating subclasses of Model. Such an architecture is
effective for building scalable and maintainable machine learning applications.
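
The inheritance structure described above can be sketched in a few lines of Python. This is an illustration of the idea rather than the project's actual classes: the name KNN_Model, the method signatures, and the choice to return accuracy from test() are assumptions, and the Dataset and Preprocessing classes are omitted.

```python
from abc import ABC, abstractmethod

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier


class Model(ABC):
    """Base class holding the workflow shared by all models."""

    def __init__(self):
        self.estimator = None

    @abstractmethod
    def train(self, X, y):
        """Fit a model-specific estimator; implemented by each subclass."""

    def predict(self, X):
        return self.estimator.predict(X)

    def test(self, X, y):
        """Return accuracy on the given data (assumed metric for this sketch)."""
        return accuracy_score(y, self.predict(X))


class KNN_Model(Model):  # Hypothetical name for the KNN subclass.
    def train(self, X, y):
        self.estimator = KNeighborsClassifier(n_neighbors=5).fit(X, y)
        return self


class RandomForest_Model(Model):
    def train(self, X, y):
        self.estimator = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
        return self


# Usage on two well-separated synthetic clusters (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
scores = {type(m).__name__: m.train(X, y).test(X, y)
          for m in (KNN_Model(), RandomForest_Model())}
print(scores)
```

Adding another algorithm then only requires a new subclass with its own train method; the calling code that loops over models stays unchanged.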

4.3.3.2 Activity Diagram

The flowchart illustrates a complete machine learning pipeline starting from data acquisition
to prediction. The process begins with uploading the dataset, followed by performing
Exploratory Data Analysis (EDA) to understand the structure and insights of the data. Next,
any missing values are handled appropriately to ensure data quality, and the data is
normalized to bring all features to a common scale. After preprocessing, the dataset is split
into training and testing sets. Machine learning models, specifically K-Nearest Neighbors
(KNN) and Random Forest Classifier, are then trained on the training data. These models
are subsequently tested on the test data to evaluate their performance. The evaluation step
involves analyzing metrics such as accuracy, precision, recall, or F1-score to determine
model effectiveness. Finally, the trained models are used to make predictions on unseen
data, completing the pipeline in a structured and modular manner.

4.3.3.3 Use Case Diagram

This use case diagram depicts a Federated Learning System workflow utilizing the K-Nearest
Neighbors (KNN) and Random Forest Classifier models. The user initiates the process by
uploading the dataset, after which the data is preprocessed—this includes tasks such as handling
missing values and normalization. The models (K-Nearest Neighbors (KNN) and Random Forest)
are then trained on local data without centralizing it, adhering to the principles of federated
learning. These models are subsequently tested to verify their performance, followed by an
evaluation phase that involves measuring metrics such as accuracy or F1-score. Finally, the system
generates predictions based on the trained models. Throughout this process, the system allows
performance metrics to be viewed, enabling transparent assessment and model comparison. This
approach enhances privacy by keeping data distributed and supports robust, secure machine
learning across multiple nodes.

4.3.3.4 Sequence Diagram

The sequence diagram depicts the flow of a machine learning application utilizing K-Nearest
Neighbors (KNN) and Random Forest models. The process starts with the User uploading a
dataset to the System, which then sends it to the Preprocessing module for exploratory data
analysis (EDA) and cleaning. After data preprocessing, the cleaned data is sent back to the
System, which then initiates model training using K-Nearest Neighbors (KNN) and Random
Forest classifiers. These models are trained within the Model module and returned to the System
once completed. The System proceeds to test the trained models by sending them along with test
data to the Model module, which returns the test results. These results are used by the System to
display performance metrics (e.g., accuracy, precision, recall) to the User. If the User requests
predictions, the System forwards this request to the Model module, which makes predictions using
the K-Nearest Neighbors (KNN) and Random Forest models and sends the predicted outputs back
to the System. Finally, the System displays the prediction results to the User.

4.3.3.5 Deployment Diagram

The architecture shows a clear division between the User Device and the Server, where the core
processing takes place. The User Interface on the User Device allows the user to upload data and
request results. This data is sent to the Preprocessing Module on the Server, where it is cleaned
and transformed: features are scaled for K-Nearest Neighbors, which is distance-based, while
general numerical and categorical features are readied for Random Forest. The processed data then
moves into the Model Training Module, where both the K-Nearest Neighbors and Random
Forest models are trained in parallel. After training, the models are evaluated on unseen data
within the Model Testing Module. The results, including performance metrics like accuracy,
precision, and recall, are aggregated in the Evaluation Module. These metrics and any prediction
outputs are then sent back to the User Interface, allowing the user to review and interact with the
outcomes of both classifiers. This modular architecture supports efficient deployment and
comparison of distance-based and ensemble learning models.

4.4 Modules

1. Upload Dataset:

The user initiates the process by uploading a dataset of smart meter readings, containing labeled data
indicating normal and theft-related activities. This dataset may include features such as energy
consumption (kWh), voltage, current, power factor, and timestamps, along with labels such as
normal or theft for supervised training.

2. Data Preprocessing:
The uploaded dataset undergoes systematic preprocessing. This includes handling missing values
using imputation, removing outliers, converting categorical fields (if any) to numerical values (e.g.,
theft type encoding), and normalizing continuous features. This ensures the data is clean, structured,
and ready for effective machine learning model training.
3. Exploratory Data Analysis (EDA):

Visualization techniques are employed to understand patterns in the energy consumption data. Plots
such as line graphs, distribution plots, and correlation heatmaps are generated to detect trends,
outliers, seasonality, or unusual spikes in energy usage—helpful in feature selection and hypothesis
building.

4. Data Splitting:

The cleaned dataset is partitioned into training and testing subsets using train_test_split, typically
following a 70-30 or 80-20 ratio. This division allows the model to learn from historical patterns and
be tested on unseen data to evaluate its generalization ability.

5. Model Building:

Two machine learning models are developed:

K-Nearest Neighbors (KNN): A distance-based classifier that assigns theft or normal labels
based on the proximity of similar energy usage profiles.

Random Forest Classifier (RFC): An ensemble-based model that constructs multiple decision
trees to enhance accuracy and reduce overfitting, ideal for capturing non-linear relationships in
smart grid data.

6. Model Testing:

The trained KNN and RFC models are applied to the test dataset to generate predictions. These
predictions are then compared to actual labels to determine whether energy theft is correctly
identified.

7. Performance Evaluation:

Both models are evaluated using key classification metrics:

Accuracy: Measures overall correctness of predictions.

Precision: Indicates how many flagged thefts were truly thefts.

Recall: Measures how well actual thefts are detected.

F1 Score: Provides a balance between precision and recall.


Confusion matrices and detailed classification reports are generated to assess strengths,
weaknesses, and possible misclassification patterns.

8. Model Prediction:

The model with the best evaluation performance is selected for deployment. It is then used to predict
theft on real-time or batch smart meter data, enabling timely detection of anomalies. Detected theft
instances can be flagged for further inspection, reported to operators, or used to trigger automated
alerts.
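
The four metrics listed under Performance Evaluation can all be derived from the confusion matrix counts. The sketch below does this by hand for a small set of made-up labels (1 = theft, 0 = normal), so the arithmetic can be followed and compared against scikit-learn's own functions; the label arrays are illustrative and not taken from the project data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Illustrative labels only: 1 = theft, 0 = normal.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall correctness
precision = tp / (tp + fp)                   # flagged thefts that were truly thefts
recall = tp / (tp + fn)                      # actual thefts that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a theft detection setting, a false negative is an undetected theft and a false positive is a wrongly flagged customer, so precision and recall are usually read together rather than relying on accuracy alone.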

4.5 System Requirements

4.5.1 Hardware Requirements:

The hardware requirements for running the electricity theft detection system are influenced by
factors such as the size of the smart meter dataset, the complexity of preprocessing steps, and the
computational needs of the machine learning models (K-Nearest Neighbors and Random Forest).
Below are the minimum recommended specifications:

Component | Specification
System | Intel i3 Processor or equivalent (dual-core minimum)
RAM | 4 GB (8 GB recommended for larger datasets)
Hard Disk | 1 TB HDD (or 256 GB SSD for faster performance)
Monitor | 14” Color Monitor
Mouse | Optical Mouse
Keyboard | Standard USB Keyboard

4.5.2 Software Requirements:


The software environment must support data preprocessing, model training, and evaluation using
machine learning libraries. The following tools and libraries are required for building and running
the electricity theft detection system:

Component | Specification
Operating System | Windows 10 / Linux Ubuntu 18.04+ / macOS 10.13+
Programming Language | Python 3.7 or higher
IDE / Editor | Jupyter Notebook / VS Code / PyCharm
Libraries | scikit-learn, imbalanced-learn, joblib, pandas, numpy, matplotlib, seaborn
Package Manager | pip / conda
Browser | Chrome / Firefox (for notebook interface)

4.6 Testing

4.6.1 Unit Testing


Unit testing involves testing individual functions like data preprocessing, feature encoding, or
evaluation metric calculations. Each function is validated in isolation to ensure it behaves
correctly for expected inputs. For example, testing if missing values are handled as intended or if
accuracy is calculated accurately. These tests help catch small bugs early in development. They
also ensure core components remain reliable after code changes.
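
The two examples mentioned above (missing-value handling and accuracy calculation) could be unit tested roughly as follows. The helper impute_mean is hypothetical and stands in for whatever preprocessing function the project exposes; the test functions follow the pytest naming convention but are also called directly so the snippet runs on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score


def impute_mean(features):
    """Hypothetical helper: replace NaNs with the column mean."""
    return SimpleImputer(strategy="mean").fit_transform(features)


def test_impute_mean_fills_missing_values():
    raw = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
    filled = impute_mean(raw)
    assert not np.isnan(filled).any()
    assert filled[1, 0] == 2.0   # mean of 1.0 and 3.0
    assert filled[2, 1] == 15.0  # mean of 10.0 and 20.0


def test_accuracy_is_fraction_of_correct_predictions():
    # Three of the four predictions match the true labels.
    assert accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]) == 0.75


# The functions can be collected by pytest or called directly.
test_impute_mean_fills_missing_values()
test_accuracy_is_fraction_of_correct_predictions()
```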

4.6.2 Integration Testing


Integration testing ensures that different modules like preprocessing, model training, and
evaluation work together as expected. It verifies the correct flow of data between stages, such as
from cleaned data into model training. Any mismatches in data format or logic between modules
can be caught here. This testing ensures smooth end-to-end interaction. It helps confirm that
combined modules produce meaningful results.

4.6.3 Functional Testing


Functional testing checks that the system performs all intended user-level operations correctly.
This includes uploading a dataset, training the models, generating predictions, and displaying
evaluation metrics. Tests are based on specified functional requirements. They focus on what the
system does rather than how it does it. Successful functional testing confirms that key features
meet user expectations.

4.6.4 System Testing


System testing evaluates the entire machine learning pipeline as one unified system. It includes
dataset handling, model training with KNN and Random Forest, and result visualization.
This test simulates a real-world scenario to ensure all components work together effectively. It
confirms that the system handles complete workflows from start to finish. System testing is
typically done before user testing or deployment.

4.6.5 White Box Testing:
White box testing focuses on the internal logic, paths, and control flows in the code. It involves
checking the implementation of algorithms, conditions, loops, and data transformations.
Developers use this to ensure all logical paths are tested. For example, confirming that
preprocessing conditions adapt to different data types. This ensures code correctness and improves
coverage.

4.6.6 Black Box Testing:


Black box testing evaluates the system without any knowledge of its internal logic. It involves
providing various inputs (valid, invalid, edge cases) and checking the outputs. This is used to
ensure the system handles unexpected or extreme inputs gracefully. It focuses on user interactions
and output correctness. It’s particularly useful for detecting bugs users might encounter.

4.6.7 Acceptance Testing:


Acceptance testing validates that the system meets business and user requirements. It simulates
real-world usage scenarios, such as detecting theft in a new dataset. The system must deliver
accurate predictions and clear metrics as expected. This testing is typically done by stakeholders
or users before final approval. A passed acceptance test means the system is ready for
deployment.

CHAPTER 5
SOURCE CODE

import os

import joblib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the labeled smart meter dataset
data = pd.read_csv("Datasets/AllData.csv")
data  # Display the dataset (notebook cell output)
# Count plot of the target classes before resampling
sns.set(style="darkgrid")  # Set the style of the plot
plt.figure(figsize=(8, 6))  # Set the figure size
ax = sns.countplot(x=data['IsStealer'])
plt.title("Count Plot")  # Add a title to the plot
plt.xlabel("Categories")  # Add label to x-axis
plt.ylabel("Count")  # Add label to y-axis
# Annotate each bar with its count value
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')

plt.show()  # Display the plot


# Preprocessing
# Drop 'UserId' as it is not a feature
data = data.drop(columns=['UserId'])

# Separate features (X) and target (y)
X = data.drop(columns=['IsStealer'])
y = data['IsStealer']

# Handle missing values
imputer = SimpleImputer(strategy='mean')  # 'median' or 'most_frequent' are alternatives
X = imputer.fit_transform(X)  # Impute missing values in features

# Balance the classes with SMOTE. Note: resampling before the train/test split
# means synthetic samples can also end up in the test set.
smote = SMOTE(sampling_strategy='auto', random_state=42)
X, y = smote.fit_resample(X, y)
X.shape
# Count plot of the target classes after SMOTE resampling
sns.set(style="darkgrid")  # Set the style of the plot
plt.figure(figsize=(8, 6))  # Set the figure size
ax = sns.countplot(x=y)
plt.title("Count Plot")  # Add a title to the plot
plt.xlabel("Categories")  # Add label to x-axis
plt.ylabel("Count")  # Add label to y-axis
# Annotate each bar with its count value
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')

plt.show()  # Display the plot

# data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
target_names=["Non-Stealer", "Stealer"]
# Existing model: K-Nearest Neighbors
MODEL_PATH = "model/knn_model.pkl"
os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)  # Ensure the model folder exists

if os.path.exists(MODEL_PATH):
    print("Loading existing model...")
    clf = joblib.load(MODEL_PATH)  # Load model
else:
    print("Training new model...")
    clf = KNeighborsClassifier(n_neighbors=5)  # n_neighbors can be tuned
    clf.fit(X_train, y_train)
    joblib.dump(clf, MODEL_PATH)  # Save model
    print("Model saved.")

y_pred = clf.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred) * 100
print(f'Accuracy: {accuracy:.2f}')
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=target_names,
            yticklabels=target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("knn_model Confusion Matrix")
plt.show()
# Proposed model: Random Forest
MODEL_PATH = "model/random_forest.pkl"

# Load the model if it was saved earlier; otherwise train and save it
if os.path.exists(MODEL_PATH):
    print("Loading existing model...")
    clf = joblib.load(MODEL_PATH)  # Load model
else:
    print("Training new model...")
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    joblib.dump(clf, MODEL_PATH)  # Save model
    print("Model saved.")

# Predictions
y_pred = clf.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred) * 100
print(f'Accuracy: {accuracy:.2f}')
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=target_names,
            yticklabels=target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("random_forest_model Confusion Matrix")
plt.show()
# Predict on new, unlabeled data
test = pd.read_csv("Datasets/test.csv")
test  # Display the test data (notebook cell output)

# Drop 'UserId' as it is not a feature
test = test.drop(columns=['UserId'])

# Impute missing values with the imputer fitted on the training features
test_features = imputer.transform(test)

# Make predictions with the most recently trained model (Random Forest)
predict = clf.predict(test_features)

# Print each row followed by its predicted class
for i, p in enumerate(predict):
    print(test.iloc[i])  # Print the row
    print(f"Row {i}:************************************************** {target_names[p]}")

CHAPTER 6
EXPERIMENTAL RESULTS

Data Analysis

Fig.6.1: Display Sample Dataset

Fig.6.2: Preprocessing of the dataset

This figure demonstrates a set of fundamental exploratory data analysis operations applied to the
dataset. The null values check identifies if any missing or incomplete records exist in the dataset. The
nunique function shows the number of unique values for each feature, giving insight into data
variability. The info summary displays the data types and memory usage of each column, which is
essential for understanding the structure of the dataset before processing. The describe function
provides statistical measures such as mean, standard deviation, minimum, and maximum values for
each feature, enabling an assessment of distribution and potential anomalies.

Fig.6.3: Count Plot of the Dataset.
This figure visualizes the distribution of the target classes within the dataset using a count plot. It
shows the frequency of instances labeled as 'Stealer' and 'Non-Stealer'. This visualization is critical to
assess whether the dataset is balanced or imbalanced in terms of class representation. A balanced
dataset ensures fair training of the model, while an imbalanced one requires handling strategies such as
resampling or weighted loss during model training.
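
The source code in Chapter 5 handles imbalance by resampling with SMOTE. The weighted alternative mentioned above can be sketched with scikit-learn's compute_class_weight; the 90/10 class split below is an invented example and does not reflect the actual dataset proportions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalance: 90 Non-Stealer (0) rows and 10 Stealer (1) rows.
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

A dictionary like this can be passed as the class_weight argument of RandomForestClassifier. KNeighborsClassifier has no such parameter, so for KNN the imbalance has to be addressed by resampling instead.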

Fig.6.4: Prediction Confusion Matrix of RFC, KNN Models


This figure displays the confusion matrices of the Random Forest Classifier and the K-Nearest
Neighbors model after making predictions on the test dataset. Each matrix includes four quadrants:
True Positives, True Negatives, False Positives, and False Negatives. The confusion matrix helps in
understanding the performance of each model in correctly identifying both Stealer and Non-Stealer
classes. It provides a clear visualization of errors made by the models and highlights areas where one
model outperforms the other.

Fig.6.5: Performance Metrics of RFC, KNN Models
Metric | Class | Existing Model (KNN) | Proposed Model (RFC)
Accuracy | - | 82.95% | 94.98%
Precision | Non-Stealer | 0.98 | 0.95
Precision | Stealer | 0.75 | 0.95
Recall | Non-Stealer | 0.67 | 0.95
Recall | Stealer | 0.99 | 0.95
F1-Score | Non-Stealer | 0.80 | 0.95
F1-Score | Stealer | 0.85 | 0.95
Macro Avg | Precision | 0.87 | 0.95
Macro Avg | Recall | 0.83 | 0.95
Macro Avg | F1-Score | 0.83 | 0.95
Weighted Avg | Precision | 0.87 | 0.95
Weighted Avg | Recall | 0.83 | 0.95
Weighted Avg | F1-Score | 0.83 | 0.95

This figure presents the evaluation metrics (precision, recall, and F1-score) for both the RFC and
KNN models in a comparative format. Each metric is shown for both Stealer and Non-Stealer classes,
along with overall accuracy, macro average, and weighted average. The RFC model achieves higher
accuracy (94.98% versus 82.95%) and higher macro and weighted averages, with a balanced 0.95 on
every per-class metric. KNN scores higher only on Non-Stealer precision (0.98) and Stealer recall
(0.99), at the cost of a Non-Stealer recall of 0.67 and a Stealer precision of 0.75.
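
The per-class KNN figures reported for Fig. 6.5 can be cross-checked with a few lines of arithmetic, since F1 is the harmonic mean of precision and recall and the macro average is the unweighted mean over classes. The sketch below uses only the rounded values from the table, so small differences in the third decimal place are expected.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class KNN (precision, recall) pairs copied from the table in Fig. 6.5.
knn = {"Non-Stealer": (0.98, 0.67), "Stealer": (0.75, 0.99)}

knn_f1 = {name: f1_from(p, r) for name, (p, r) in knn.items()}
macro_precision = sum(p for p, _ in knn.values()) / len(knn)
macro_recall = sum(r for _, r in knn.values()) / len(knn)

for name, value in knn_f1.items():
    print(f"{name}: F1 = {value:.3f}")
print(f"macro precision = {macro_precision:.3f}, macro recall = {macro_recall:.3f}")
```

The recomputed values agree with the reported F1-scores (0.80 and 0.85) and macro averages (0.87 and 0.83) to within rounding.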

Fig.6.6: Model Prediction on Test Data

This figure shows how the trained models predict outcomes on unseen test data. It includes a
side-by-side comparison of the actual vs. predicted values for each entry in the test dataset. The results visually
validate how closely the model’s predictions match the actual labels. The RFC model produces
predictions with higher accuracy and consistency, demonstrating its reliability and generalization
capabilities in practical deployment scenarios.

Fig.6.7: Average Daily Electricity Consumption Across All Users

Fig.6.8: Consumption Distribution for Stealers vs. Non-Stealers

CHAPTER 7
CONCLUSION AND FUTURE ENHANCEMENT

7.1. Conclusion

The proposed ML-driven system provides a scalable approach to identifying potential electricity theft
within modern smart grid infrastructures. Using the K-Nearest Neighbors (KNN) and Random Forest
Classifier (RFC) algorithms, the system detects anomalies in energy consumption patterns, with RFC
reaching 94.98% accuracy on the test split compared with 82.95% for KNN. Unlike traditional theft
detection methods that rely heavily on manual inspections and reactive auditing, this solution offers
a proactive mechanism that analyzes smart meter data and flags suspicious behavior promptly.

Through systematic preprocessing, effective model training, and rigorous evaluation, the system
ensures reliable performance even in complex and dynamic grid environments. The integration of
smart metering, automated analysis, and visualization tools enhances transparency, supports grid
integrity, and minimizes losses due to unauthorized usage. This work not only demonstrates the
potential of machine learning in addressing energy theft but also lays the groundwork for future
enhancements such as federated learning and deep learning techniques for even greater adaptability
and security.

Ultimately, this project contributes toward building a more secure, efficient, and intelligent smart grid
ecosystem that benefits both utility providers and consumers alike.

The system also supports scalability, making it suitable for implementation across a wide range of
urban and rural grid setups. Furthermore, it has the potential to integrate federated learning
frameworks, thereby preserving data privacy while enabling distributed training on edge devices or
substations.

Overall, this work makes a significant contribution to the field of smart grid cybersecurity and
operational efficiency. It not only minimizes energy loss and financial damage due to theft but also
builds trust and transparency between energy providers and consumers. The project serves as a
foundational model for future enhancements involving deep learning, edge computing, blockchain
integration for secure logging, and real-time alert automation, thus paving the way toward a smarter
and more resilient energy infrastructure.

7.2. Future Enhancement

Integration of Deep Learning Techniques:


Future versions of the system can incorporate deep learning models such as Long Short-Term Memory
(LSTM) networks or Convolutional Neural Networks (CNNs) to analyze time-series electricity
consumption data more effectively. These models can better capture complex temporal patterns and
improve detection accuracy for subtle or evolving theft behaviors.

Real-time Monitoring and Detection:


Currently focused on batch processing, the system can be enhanced to support real-time theft
detection using streaming technologies like Apache Kafka, Apache Flink, or Spark Streaming. This
will enable continuous monitoring of energy data and instant response to suspicious activity,
minimizing losses and improving grid security.

Adaptive and Incremental Learning Models:


To ensure ongoing accuracy, machine learning models can be adapted to support incremental learning,
where models update automatically with new data. This approach reduces the need for full retraining
and helps the system stay effective against emerging and unknown theft strategies.
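
Incremental learning of this kind can be sketched with scikit-learn's partial_fit interface. Neither KNeighborsClassifier nor RandomForestClassifier offers partial_fit, so the example swaps in SGDClassifier, a linear model that does; this is a substitution made for illustration, not the project's model. The batches are synthetic, with class 1 given a lower mean in the first feature.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # All classes must be declared on the first partial_fit call.


def make_batch(n=200):
    """Synthetic batch: class 1 rows have a lower first feature (illustrative only)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0, 1, size=(n, 4))
    X[:, 0] -= 4.0 * y
    return X, y


model = SGDClassifier(random_state=0)

# Each new batch updates the existing weights instead of retraining from scratch.
for _ in range(5):
    X_batch, y_batch = make_batch()
    model.partial_fit(X_batch, y_batch, classes=classes)

X_new, y_new = make_batch()
score = model.score(X_new, y_new)
print(f"accuracy on a fresh batch: {score:.3f}")
```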

Federated Learning Across Smart Meters:


Introducing federated learning can decentralize model training by allowing smart meters and edge
devices to learn collaboratively without sharing raw data. This not only preserves user privacy but also
increases the system's ability to scale across large, distributed smart grid infrastructures.
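
The aggregation step at the heart of federated averaging can be sketched as follows. Each device is assumed to send only its locally trained weight vector and its sample count, and the server combines them into a weighted average. The function name `federated_average` and the example values are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]], client_sizes: Sequence[int]
) -> List[float]:
    """Average client model weights, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical weight vectors from two smart meters; raw readings stay local.
meter_updates = [[1.0, 2.0], [3.0, 4.0]]
meter_samples = [100, 300]
global_weights = federated_average(meter_updates, meter_samples)
```

Only model parameters cross the network in this scheme, which is what allows raw consumption data to remain on the device.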

Advanced Feature Engineering Techniques:


Future versions may benefit from automated or semi-automated feature selection and dimensionality
reduction techniques such as Recursive Feature Elimination (RFE), Principal Component Analysis (PCA),
or AutoML pipelines. These enhancements can lead to more accurate models and faster training times.

Blockchain Integration for Data Integrity:
Integrating blockchain technology can provide an immutable log of energy transactions, model
updates, and system alerts. This supports transparency, makes tampering detectable, and reinforces
trust in the theft detection process, especially when combined with federated learning.
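
The tamper-evidence property can be illustrated with a simple hash-chained log, in which each entry stores the hash of the previous one. This is only a sketch of the linking idea, not a distributed ledger with consensus. The function names `append_entry` and `verify_chain` and the record fields are illustrative assumptions.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry


def _digest(payload: Dict[str, Any], previous_hash: str) -> str:
    """Hash an entry's payload together with the previous entry's hash."""
    body = json.dumps(
        {"payload": payload, "previous_hash": previous_hash}, sort_keys=True
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    """Append a record linked to the hash of the last record in the chain."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {
            "payload": payload,
            "previous_hash": previous_hash,
            "hash": _digest(payload, previous_hash),
        }
    )


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Return True only if every link and every stored hash is intact."""
    previous_hash = GENESIS_HASH
    for entry in chain:
        if entry["previous_hash"] != previous_hash:
            return False
        if _digest(entry["payload"], entry["previous_hash"]) != entry["hash"]:
            return False
        previous_hash = entry["hash"]
    return True


# Hypothetical log entries for an alert and a model update.
log: List[Dict[str, Any]] = []
append_entry(log, {"meter": "M-001", "event": "theft_alert"})
append_entry(log, {"meter": "M-002", "event": "model_update"})
```

Changing any stored record afterwards alters its recomputed hash, so `verify_chain` returns False and the modification is exposed.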

Behavioral Pattern Analysis:


By analyzing customer-specific usage profiles, the system can be extended to include user behavior
analytics. This will help detect anomalies caused by insider threats, fraud, or unauthorized alterations,
adding a further layer of intelligence to theft detection.

Multi-class Theft Classification:


The current system may be extended to classify different types of electricity theft (e.g., meter
tampering, illegal connections, cyber manipulation) using multi-class classification models. This
enables grid operators to respond with more specific, actionable measures.

Energy-efficient Learning for Edge Devices:


To accommodate low-power IoT environments, future work could include the deployment of
lightweight models using TinyML, model pruning, or quantization techniques. These optimizations
allow accurate inference on resource-constrained smart meters and edge devices.
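
Quantization, one of the techniques named above, can be sketched in a few lines. The example assumes symmetric 8-bit quantization of a weight vector with a single scale factor. The function names and sample weights are hypothetical, and a real deployment would normally rely on a framework's own conversion tools.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Fall back to a scale of 1.0 when every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from their integer codes."""
    return [q * scale for q in quantized]


# Hypothetical model weights.
weights = [0.82, -1.27, 0.05, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each weight is stored in one byte instead of four or eight, at the cost of a rounding error of at most half the scale factor per weight.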

Interactive Analytics and Visualization Tools:


Data visualization dashboards built with tools such as Grafana, Kibana, or Power BI can help
operators visualize usage trends, detect anomalies, and monitor model performance in real time.
These tools improve operational transparency and assist in decision-making.

REFERENCES

[1]. Gunduz, M.Z.; Das, R. Internet of things (IoT): Evolution, components and applications
fields. Pamukkale Univ. J. Eng. Sci. 2018, 24, 327–335.
[2]. Das, R.; Gunduz, M.Z. Analysis of cyber-attacks in IoT-based critical infrastructures. Int. J.
Inf. Secur. Sci. 2019, 8, 122–133.
[3]. Emmanuel, M.; Rayudu, R. Communication technologies for smart grid applications: A
survey. J. Netw. Comput. Appl. 2016, 74, 133–148.
[4]. Kimani, K.; Oduol, V.; Langat, K. Cyber security challenges for IoT-based smart grid
networks. Int. J. Crit. Infrastruct. Prot. 2019, 25, 36–49.
[5]. Gunduz, M.Z.; Das, R. Communication Infrastructure and Cyber-Security in Smart Grids. J.
Inst. Sci. Technol. 2020, 10, 970–984.
[6]. Qays, M.O.; Ahmad, I.; Abu-Siada, A.; Hossain, M.L.; Yasmin, F. Key communication
technologies, applications, protocols and future guides for IoT-assisted smart grid systems: A
review. Energy Rep. 2023, 9, 2440–2452.
[7]. Sahoo, S.; Nikovski, D.; Muso, T.; Tsuru, K. Electricity theft detection using smart meter
data. In Proceedings of the 2015 IEEE Power & Energy Society Innovative Smart Grid
Technologies Conference (ISGT), Washington, DC, USA, 18–20 February 2015; pp. 1–5.
[8]. Althobaiti, A.; Jindal, A.; Marnerides, A.K.; Roedig, U. Energy Theft in Smart Grids: A
Survey on Data-Driven Attack Strategies and Detection Methods. IEEE Access 2021, 9,
159291–159312.
[9]. Takiddin, A.; Ismail, M.; Serpedin, E. Detection of Electricity Theft False Data Injection
Attacks in Smart Grids. In Proceedings of the 2022 30th European Signal Processing
Conference (EUSIPCO), Belgrade, Serbia, 29 August–2 September 2022; pp. 1541–1545.
[10]. Badr, M.M.; Ibrahem, M.I.; Kholidy, H.A.; Fouda, M.M.; Ismail, M. Review of the Data-
Driven Methods for Electricity Fraud Detection in Smart Metering Systems. Energies 2023, 16,
2852.
[11]. Wang, Y.; Chen, Q.; Gan, D.; Yang, J.; Kirschen, D.S.; Kang, C. Deep Learning-Based
Socio-Demographic Information Identification From Smart Meter Data. IEEE Trans. Smart
Grid 2019, 10, 2593–2602.
[12]. Reda, H.T.; Anwar, A.; Mahmood, A. Comprehensive survey and taxonomies of false data
injection attacks in smart grids: Attack models, targets, and impacts. Renew. Sustain. Energy
Rev. 2022, 163, 112423.
[13]. Javaid, N.; Gul, H.; Baig, S.; Shehzad, F.; Xia, C.; Guan, L.; Sultana, T. Using GANCNN
and ERNET for Detection of Non Technical Losses to Secure Smart Grids. IEEE
Access 2021, 9, 98679–98700.
[14]. Habib, A.A.; Hasan, M.K.; Alkhayyat, A.; Islam, S.; Sharma, R.; Alkwai, L.M. False data
injection attack in smart grid cyber physical system: Issues, challenges, and future
direction. Comput. Electr. Eng. 2023, 107, 108638.
[15]. El-Toukhy, A.T.; Badr, M.M.; Mahmoud, M.M.E.A.; Srivastava, G.; Fouda, M.M.;
Alsabaan, M. Electricity Theft Detection Using Deep Reinforcement Learning in Smart Power
Grids. IEEE Access 2023, 11, 59558–59574.
[16]. Berghout, T.; Benbouzid, M.; Muyeen, S.M. Machine learning for cybersecurity in smart
grids: A comprehensive review-based study on methods, solutions, and prospects. Int. J. Crit.
Infrastruct. Prot. 2022, 38, 100547.
[17]. Buzau, M.M.; Tejedor-Aguilera, J.; Cruz-Romero, P.; Gómez-Expósito, A. Detection of
Non-Technical Losses Using Smart Meter Data and Supervised Learning. IEEE Trans. Smart
Grid 2019, 10, 2661–2670.
[18]. Abdulaal, M.J.; Ibrahem, M.I.; Mahmoud, M.M.E.A.; Khalid, J.; Aljohani, A.J.; Milyani,
A.H.; Abusorrah, A.M. Real-Time Detection of False Readings in Smart Grid AMI Using
Deep and Ensemble Learning. IEEE Access 2022, 10, 47541–47556.
