Enhancing SQL Injections
Abstract—SQL Injection (SQLi) continues to pose a significant threat to the security of web applications, enabling attackers to manipulate databases and access sensitive information without authorisation. Although advancements have been made in detection techniques, traditional signature-based methods still struggle to identify sophisticated SQL injection attacks that evade predefined patterns. As SQLi attacks evolve, the need for more adaptive detection systems becomes crucial. This paper introduces an innovative approach that leverages generative models to enhance SQLi detection and prevention mechanisms. By incorporating Variational Autoencoders (VAE), Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP), and U-Net, synthetic SQL queries were generated to augment training datasets for machine learning models. The proposed method demonstrated improved accuracy in SQLi detection systems by reducing both false positives and false negatives. Extensive empirical testing further illustrated the ability of the system to adapt to evolving SQLi attack patterns, resulting in enhanced precision and robustness.

Index Terms—SQL Injection, Machine Learning, Generative Models, Variational Autoencoder (VAE), U-Net, CWGAN-GP, Data Augmentation, Cybersecurity.

I. INTRODUCTION

SQL Injection (SQLi) remains one of the most critical security vulnerabilities affecting web applications today. As cyber threats evolve, attackers continuously exploit input handling weaknesses, injecting malicious SQL commands into legitimate queries. These attacks, often launched through input fields such as login forms or URL parameters, enable unauthorised access to sensitive data or, in severe cases, complete control over the database.

The Open Web Application Security Project (OWASP) continues to rank SQLi among the top security risks, reinforcing its prevalence and severity in the landscape of web vulnerabilities [1]. The impact of SQLi attacks is often severe, leading to data breaches, financial loss, and reputational damage to affected organisations.

Traditional defence mechanisms, such as input validation and signature-based detection systems, have been widely employed to combat SQLi attacks. However, these methods often fall short when confronting the evolving techniques used by attackers. Signature-based systems, in particular, struggle with false positives and false negatives, especially when attackers use obfuscation or innovative variations of SQLi that deviate from known patterns [2], [3], [4].

Several major challenges have made SQLi detection difficult: the lack of data diversity, static detection systems, and the high error rates of existing solutions. These limitations prevent traditional methods from adapting to the broad range of SQL Injection Attack (SQLIA) types, particularly the more complex ones.

To address these limitations, this research introduces a dynamic approach that combines synthetic data generation with advanced deep learning models. Specifically, Variational Autoencoders (VAE), U-Net, and Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP) are leveraged to generate diverse synthetic SQL data. This enriched dataset helps improve generalisation and provides better identification of both traditional and modern SQLi attacks [5], [6], [7].

The aim of this research is to enhance SQLIA detection by integrating synthetic data generation with advanced machine learning models to improve accuracy, adaptability, and overall system robustness. By preprocessing and embedding SQL queries and using synthetic data to diversify the training set, the study focuses on optimising key performance metrics such as accuracy, precision, recall, and F1-score. The remainder of this paper is organised as follows. Section II covers the literature review; Section III presents the implementation to develop the SQLi detection solution. In Section IV, results and analysis are discussed, followed by the conclusion in Section V. Finally, Section VI outlines the limitations and future scope of the work.

II. LITERATURE REVIEW

A. SQL Injection

Traditional SQL Injection (SQLi) prevention methods primarily focused on fundamental coding practices such as input validation and parameterised queries, which aimed to mitigate attacks by sanitising user inputs. Although these methods were effective for basic attacks, more sophisticated techniques, such as time-based, blind, and second-order SQL injections, enabled malicious inputs to bypass traditional validation mechanisms and execute the payload at a later stage [2]. As SQLi threats evolved, signature-based detection systems were introduced, relying on known attack patterns to identify malicious queries in real time. However, these systems encountered significant difficulties in handling novel and obfuscated attacks that deviated from predefined patterns, resulting in high false-positive and false-negative rates [3].
To address these challenges, rule-based systems were developed to analyse query structures more deeply. Yet, they continued to experience high false-positive rates and struggled to detect subtle attacks [4].

With the continuous advancement of SQLi techniques, behavioural detection systems were developed to identify anomalies in query behaviour. These systems aimed to detect deviations from normal query patterns but often produced high false positives, particularly in dynamic environments [8]. Hybrid models that combined static code analysis with dynamic execution traces improved detection by analysing both code structure and runtime behaviour. Nonetheless, their dependency on labelled data reduced their effectiveness in real-world scenarios, where such datasets are often limited [9]. Heuristic-based systems, such as V1p3R, attempted to overcome these issues by leveraging error message feedback to adapt detection in real time, but complex and obfuscated attacks remained difficult to detect [10].

To overcome these persistent limitations, machine learning approaches have been increasingly applied to SQLi detection. Early models, such as Naïve Bayes combined with Role-Based Access Control (RBAC), improved detection accuracy by classifying queries probabilistically. However, these models faced difficulties in handling obfuscated attacks and required manually crafted features, which limited their adaptability to novel attack patterns [5]. Support Vector Machines (SVMs) offered further improvements by enhancing scalability and handling more complex SQLi patterns. Yet, their reliance on manual feature engineering rendered them less effective in detecting rapidly evolving attacks [6]. Ensemble methods, including LightGBM and Gradient Boosting Machines (GBM), achieved high detection accuracy by combining multiple weak learners. Despite these advancements, their dependence on hand-crafted features hindered their ability to generalise to unseen queries [7].

More recently, deep learning models, such as Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs), have pushed SQLi detection forward by automatically extracting complex patterns from SQL queries. These models reduced false positives and enhanced the detection of obfuscated attacks. However, they required large, labelled datasets and high computational resources, which limits their scalability in practical applications [11]. SQLNN, a deep learning model that utilised TF-IDF for feature extraction, demonstrated high accuracy but faced challenges in interpretability and struggled to detect highly obfuscated queries [12].

Despite these advancements, several challenges remain. Adapting to evolving SQLi techniques, managing limited labelled datasets, and addressing the computational costs of deep learning models continue to present significant hurdles. Traditional detection systems remain vulnerable to sophisticated attacks, while machine learning models are still dependent on manual feature engineering. These limitations have driven the exploration of adaptive solutions, such as the generation of synthetic data to augment limited datasets and improve model generalisation.

B. Text Data Synthesis

1) Rule-based Text Synthesis: Text data synthesis is crucial for enhancing the performance of machine learning models, offering a range of techniques to generate synthetic data. Model-based techniques, such as those explored in the work of Panagiotis et al. [13], generate diverse data by rephrasing content while preserving its meaning, though they are computationally demanding and can introduce semantic drift. On the other hand, rule-based augmentation methods, such as synonym replacement, random insertion, and swapping [14], offer computational efficiency but often fail to maintain contextual meaning, leading to distorted outputs. These limitations make rule-based methods unsuitable for complex tasks such as SQL query augmentation.

Feature-Space Augmentation, introduced in the work of Shorten et al. [15], applies transformations to latent embeddings to improve generalisation by modifying intermediate representations. Graph-Structured Augmentation preserves syntactic relationships by leveraging knowledge graphs or syntax trees, while MixUp Augmentation blends text samples and labels to expand decision boundaries and reduce overfitting. However, these methods can reduce interpretability and introduce inconsistencies, particularly in structured data such as SQL queries, where maintaining syntactic and semantic relationships is critical. Minor changes in SQL queries can disrupt the query logic, making rule-based approaches less suitable for augmenting SQL data. Consequently, model-based synthesis provides a more context-aware and accurate approach for SQL data augmentation.
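To make this fragility concrete, the short sketch below (the query and tokenisation are purely illustrative and not drawn from the study's dataset) applies a single random token swap of the kind used by rule-based augmentation; even one swap can displace a keyword and break the query's syntax or silently change its logic.

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """Rule-based augmentation: randomly swap token positions (EDA-style)."""
    rng = random.Random(seed)
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

query = "SELECT name FROM users WHERE id = 1 OR 1 = 1".split()
print(" ".join(random_swap(query, n_swaps=1, seed=0)))
# A single swap can move a keyword (e.g. FROM or OR) out of place,
# producing a query that no longer parses or that means something else.
```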
2) Model-based Text Synthesis: Large Language Models (LLMs), as highlighted in Lovelace et al. [16], capture both short- and long-term dependencies, making them effective for tasks such as SQL query augmentation. However, LLMs require substantial computational resources and large datasets for training, which limits their utility in resource-constrained environments. Labs [17] discusses the use of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), with VAEs providing flexibility by manipulating latent space and GANs excelling in generating realistic data through adversarial training. However, GANs demand careful tuning to prevent mode collapse, which can limit their application in certain scenarios.

Transformer-based models, such as BERT, T5, and BART, utilise attention mechanisms to capture long-range dependencies and are effective for tasks such as text generation and translation. However, these models are computationally expensive to deploy at scale. Recurrent Neural Networks (RNNs), including LSTM and GRU, remain valuable for sequence modelling involving temporal dependencies but are increasingly being replaced by transformers in many scenarios. Additionally, Diffusion Models, such as Denoising Diffusion Probabilistic Models (DDPM), introduced in the work of Labs [17], iteratively refine noisy data to improve sample quality and efficiency, though they also require significant computational resources. Seq-U-Net, introduced in the work of Stoller et al. [18], provides a more efficient alternative for sequence modelling. By using causal convolutions, Seq-U-Net captures sequential dependencies at a substantially lower computational and memory cost than comparable convolutional sequence models.
The dataset incorporated a broad range of attack categories, including blind SQL injection attacks, to ensure a comprehensive representation of various SQLi attack types. This integration aimed to improve the detection capabilities of the model for a broader range of SQL injection patterns, including those identified in the OWASP Top 10 A03:2021.

B. Tokenisation & Embedding

A custom tokeniser was developed to convert SQL queries into structured tokens, ensuring the capture of essential syntactic and semantic features. Various embedding methods, including FastText, Character-level embeddings, Byte Pair Encoding (BPE), and BERT, were evaluated to determine the optimal approach. As shown in Fig. 2, FastText emerged as the most efficient, offering a strong balance between accuracy and training time, making it the best option for transforming SQL queries into vector representations for subsequent model training.
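As a rough illustration of this preprocessing step, the sketch below tokenises a few example queries with a simple regular expression and trains FastText embeddings with gensim; the tokeniser, vector size, and sample queries are stand-ins rather than the exact configuration used in the study.

```python
import re
import numpy as np
from gensim.models import FastText

def tokenise_sql(query: str) -> list[str]:
    """Split a query into keywords, identifiers, numbers, and punctuation."""
    return re.findall(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]", query.lower())

queries = [
    "SELECT name FROM users WHERE id = 42",
    "SELECT * FROM users WHERE name = '' OR '1'='1'",
    "UNION SELECT username, password FROM accounts --",
]
corpus = [tokenise_sql(q) for q in queries]

# Subword-aware embeddings: FastText builds vectors from character n-grams,
# so obfuscated or previously unseen tokens still receive useful representations.
model = FastText(sentences=corpus, vector_size=64, window=3, min_count=1, epochs=50)

# One fixed-length vector per query, here simply the mean of its token vectors.
query_vec = np.mean([model.wv[t] for t in corpus[1]], axis=0)
print(query_vec.shape)  # (64,)
```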
To sample the latent vector z, the reparameterisation trick was applied, as represented by the following equation:

z = \mu + \epsilon \cdot \exp\left(\frac{\sigma^2}{2}\right), \quad \epsilon \sim \mathcal{N}(0, 1)

The VAE's loss function combines two key components:

1. Reconstruction loss, which measures how accurately the decoder reconstructs the original SQL queries:

L_{\text{reconstruction}} = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - f_{\text{dec}}(z_i) \rVert^2

2. Kullback-Leibler (KL) divergence, which regularises the latent space by ensuring the learned distribution is close to a unit Gaussian:

L_{\text{KL}} = -\frac{1}{2} \sum \left(1 + \log(\sigma^2) - \mu^2 - \sigma^2\right)

The total VAE loss is expressed as:

L_{\text{VAE}} = L_{\text{reconstruction}} + \beta \cdot L_{\text{KL}}

where \beta controls the trade-off between reconstruction quality and regularisation.

Fig. 4 illustrates the convergence of training and validation losses during VAE training, demonstrating stable learning and model generalisation.
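A minimal PyTorch sketch of the sampling step and loss terms described above is given below; the layer sizes, input dimension, and the value of β are placeholder choices rather than the paper's actual configuration (σ² is carried as a log-variance, so exp(0.5·logσ²) reproduces the scaling in the equation above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SQLVAE(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=128, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(hidden_dim, latent_dim)  # log sigma^2 of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.mu(h), self.log_var(h)
        eps = torch.randn_like(mu)
        z = mu + eps * torch.exp(0.5 * log_var)   # reparameterisation trick
        return self.dec(z), mu, log_var

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    recon = F.mse_loss(x_hat, x, reduction="mean")                   # reconstruction term
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())  # KL term
    return recon + beta * kl

vae = SQLVAE()
x = torch.randn(8, 64)            # stand-in for embedded SQL queries
x_hat, mu, log_var = vae(x)
print(vae_loss(x, x_hat, mu, log_var).item())
```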
D. Synthetic Data Generation

To enhance the diversity of the dataset, two generative models, U-Net and CWGAN-GP, were utilised to generate synthetic SQL queries that closely mimic real-world SQL injection patterns. The following subsections discuss each model in detail, outlining their architecture and adaptations for SQL query data generation.

1) U-Net Model: In this study, the U-Net architecture was adapted for generating synthetic SQL queries to augment the dataset used for SQL injection detection. The U-Net model was chosen due to its ability to capture both local and global dependencies, which is essential for preserving the hierarchical structure of SQL queries.

Model Architecture: The U-Net model retained its core encoder-decoder architecture, but was adapted for 1D sequential data, as shown in Fig. 5. The encoder consists of convolutional layers followed by batch normalisation, ReLU activation, and max-pooling to capture abstract features and reduce dimensionality. The decoder mirrors the encoder but performs up-sampling, restoring the original sequence structure of the SQL queries while retaining critical low-level details through skip connections.
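One possible PyTorch rendering of such a 1D encoder-decoder with skip connections is sketched below; the depth, filter counts, and sequence length are illustrative and do not correspond to the tuned values reported next.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv1d -> BatchNorm -> ReLU, the basic encoder/decoder unit."""
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.BatchNorm1d(out_ch), nn.ReLU())

class UNet1D(nn.Module):
    def __init__(self, base=32):
        super().__init__()
        self.enc1, self.enc2 = conv_block(1, base), conv_block(base, base * 2)
        self.pool = nn.MaxPool1d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up = nn.Upsample(scale_factor=2)
        self.dec2 = conv_block(base * 4 + base * 2, base * 2)  # concat skip from enc2
        self.dec1 = conv_block(base * 2 + base, base)          # concat skip from enc1
        self.out = nn.Conv1d(base, 1, kernel_size=1)

    def forward(self, x):                 # x: (batch, 1, seq_len)
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.out(d1)               # reconstructed sequence, same length as input

x = torch.randn(4, 1, 128)                # stand-in for embedded SQL query sequences
print(UNet1D()(x).shape)                  # torch.Size([4, 1, 128])
```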
Hyperparameter Tuning: Hyperparameter optimisation was conducted using the Optuna framework. The Optuna Tree-structured Parzen Estimator (TPE) was used to explore the hyperparameter space. The optimisation aimed to minimise the Mean Squared Error (MSE) between the original and reconstructed SQL queries. The key hyperparameters optimised were:

• Base Filters: ranging from 32 to 128 filters.
• Learning Rate: 1e-5 to 1e-2.
• Dropout Rate: 0.1 to 0.5.
• Depth: 3 to 5 layers.

The best configuration included a base filter size of 704, a learning rate of 4.61e-5, and a dropout rate of 0.03. These hyperparameters ensured a balance between model capacity and generalisation, minimising overfitting.
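The search loop itself can be sketched with Optuna's TPE sampler as follows; the scoring function here is an explicit stand-in that returns a synthetic value, whereas in the study each trial would build and train the U-Net with the sampled hyperparameters and return its validation MSE.

```python
import math
import optuna

def train_unet_and_score(base_filters, lr, dropout, depth):
    """Stand-in for a real training run: returns a synthetic 'validation MSE'.
    In the actual pipeline this would train the U-Net and score reconstruction."""
    return ((math.log10(lr) + 4.0) ** 2) * 0.01 + dropout * 0.05 \
           + abs(depth - 4) * 0.02 + abs(base_filters - 96) * 1e-4

def objective(trial):
    # Search space mirroring the ranges listed above.
    base_filters = trial.suggest_int("base_filters", 32, 128)
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    dropout = trial.suggest_float("dropout_rate", 0.1, 0.5)
    depth = trial.suggest_int("depth", 3, 5)
    return train_unet_and_score(base_filters, lr, dropout, depth)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```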
Training Process: The U-Net model was trained using the Adam optimiser, with a learning rate decay schedule that gradually reduced the learning rate. Early stopping was employed to prevent overfitting. The final loss function was defined as:

L_{\text{U-Net}} = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - f_{\text{dec}}(f_{\text{enc}}(x_i)) \right)^2

where x_i represents the input SQL query, and f_{\text{enc}} and f_{\text{dec}} are the encoder and decoder functions, respectively.
2) CWGAN-GP Model: The CWGAN-GP conditions generation on class labels, allowing synthetic SQL queries to be produced as benign or malicious queries. By incorporating a gradient penalty, the model enforces Lipschitz continuity, ensuring smoother training and more realistic query generation.

Model Architecture: The CWGAN-GP architecture consists of two primary components, the generator and the critic (discriminator), as illustrated in Figure 7. The generator produces synthetic SQL queries by combining a random noise vector z with a one-hot encoded label y, which is used to condition the output. The critic, on the other hand, evaluates both real and synthetic SQL queries, using the Wasserstein distance to distinguish between real and fake data, while a gradient penalty regularises the critic to ensure Lipschitz continuity.

Mathematical Formulation: The generator G(z, y) takes a noise vector z sampled from a normal distribution and a one-hot encoded label y. These inputs are concatenated and passed through several fully connected layers, which use ReLU activations to generate synthetic SQL queries. The generator can be mathematically formulated as [29]:

G(z, y) = \text{Dense}_k(\text{ReLU}(\text{Concat}(z, y))) \rightarrow \dots \rightarrow \text{Output Layer}

where z is the latent noise vector and y is the label. The output layer produces a vector of the same dimensionality as the original SQL queries.

The critic D(x, y) receives real or generated SQL queries and their corresponding labels. The critic uses dense layers with ReLU activations to estimate the Wasserstein distance, a real-valued score that differentiates between real and fake queries. The critic is defined, following Atienza [29], as:

D(x, y) = \text{Dense}_k(\text{ReLU}(\text{Concat}(x, y))) \rightarrow \dots \rightarrow \text{Output Layer}

where x is either the real or the generated SQL query.

To ensure the critic satisfies the Lipschitz constraint, a gradient penalty term is added to the loss function. The gradient penalty is computed as follows:

L_{\text{GP}} = \lambda \, \mathbb{E}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]

where \hat{x} is an interpolation between real and fake data, and \lambda is a regularisation parameter controlling the contribution of the gradient penalty.
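An illustrative PyTorch version of the conditional generator, critic, and gradient-penalty term follows; the noise dimension, layer widths, and query-vector dimension are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

NOISE_DIM, NUM_CLASSES, QUERY_DIM = 32, 2, 64   # assumed dimensions

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, QUERY_DIM))            # output layer: synthetic query vector

    def forward(self, z, y):                      # y: one-hot label (benign / malicious)
        return self.net(torch.cat([z, y], dim=1))

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(QUERY_DIM + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1))                    # real-valued Wasserstein score

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

def gradient_penalty(critic, real, fake, y, lam=10.0):
    """L_GP = lam * E[(||grad_xhat D(xhat)||_2 - 1)^2] on interpolated samples."""
    alpha = torch.rand(real.size(0), 1)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(x_hat, y)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```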
Loss Functions: The CWGAN-GP uses the Wasserstein loss with gradient penalty for both the generator and the critic:

• Critic loss:

L_{\text{Critic}} = \mathbb{E}[D(x_{\text{real}})] - \mathbb{E}[D(x_{\text{fake}})] + \lambda \, \mathbb{E}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]

The critic maximises the difference between its evaluation of real queries x_real and generated queries x_fake, while minimising the gradient penalty term to enforce stability.

• Generator loss:

L_{\text{Generator}} = -\mathbb{E}[D(x_{\text{fake}})]

The generator minimises this loss to create synthetic queries that the critic struggles to differentiate from real queries.

Fig. 8. CWGAN-GP Generator and Critic Losses with Gradient Penalty

Training and Optimisation: The training process alternates between updating the critic and the generator (a short code sketch follows these steps):

• Critic update: The critic is updated using real and fake SQL queries, with the gradient penalty applied to enforce Lipschitz continuity. For each generator update, the critic is trained multiple times (in this case, n_critic = 2) to ensure stability.

• Generator update: The generator is updated to minimise the score of the critic on the generated SQL queries.
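The alternating update described above can be sketched as follows, reusing the Generator, Critic, and gradient_penalty definitions from the previous snippet; the random batches stand in for embedded real queries and their labels.

```python
import torch
import torch.nn.functional as F

G, D = Generator(), Critic()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))
n_critic = 2                                     # critic steps per generator step

for step in range(100):
    for _ in range(n_critic):                    # --- critic update ---
        real = torch.randn(64, QUERY_DIM)        # placeholder batch of real query vectors
        y = F.one_hot(torch.randint(0, NUM_CLASSES, (64,)), NUM_CLASSES).float()
        fake = G(torch.randn(64, NOISE_DIM), y).detach()
        # Standard WGAN-GP critic objective: widen the real/fake score margin
        # while keeping the gradient penalty small.
        d_loss = D(fake, y).mean() - D(real, y).mean() + gradient_penalty(D, real, fake, y)
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    y = F.one_hot(torch.randint(0, NUM_CLASSES, (64,)), NUM_CLASSES).float()
    fake = G(torch.randn(64, NOISE_DIM), y)      # --- generator update ---
    g_loss = -D(fake, y).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```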
To fine-tune the CWGAN-GP model, two approaches were employed:

• Bayesian Optimisation: Initially, Bayesian Optimisation was used to explore the hyperparameter space, resulting in minimal reconstruction loss. Hyperparameters such as the number of layers, dropout rates, and learning rate were tuned.

• Optuna Fine-Tuning: After the Bayesian phase, Optuna was used to further fine-tune the model, exploring a narrower and higher-potential search space. The dynamic exploration provided by Optuna, combined with its pruning mechanism, enabled efficient fine-tuning by halting underperforming trials early.

Evaluation Metrics: The performance of the CWGAN-GP model was evaluated using various metrics, including Mean Squared Error (MSE), R² score, BLEU score, Cosine similarity, and Levenshtein distance.
These metrics were used to assess how closely the generated SQL queries resembled real SQL queries. Furthermore, Principal Component Analysis (PCA) was employed to visualise the overlap between real and synthetic data, confirming the CWGAN-GP model's ability to generate realistic, high-quality SQL queries.

The results demonstrated that the CWGAN-GP model significantly improved the diversity and quality of the synthetic SQL queries, providing a robust solution for SQL injection detection systems.
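The sketch below shows one way such token- and vector-level comparisons can be computed (NLTK's sentence_bleu for BLEU, NumPy for cosine similarity, and a small dynamic-programming routine for Levenshtein distance); the example queries and the random embedding vectors are illustrative only.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

real = "SELECT name FROM users WHERE id = 1".split()
synthetic = "SELECT name FROM users WHERE id = 2 OR 1 = 1".split()

# BLEU: n-gram overlap between a synthetic query and its real reference.
bleu = sentence_bleu([real], synthetic, smoothing_function=SmoothingFunction().method1)

# Cosine similarity between (stand-in) embedding vectors of the two queries.
a, b = np.random.rand(64), np.random.rand(64)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def levenshtein(s, t):
    """Minimum number of token edits needed to turn s into t."""
    dp = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
    dp[:, 0], dp[0, :] = np.arange(len(s) + 1), np.arange(len(t) + 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1, dp[i - 1, j - 1] + cost)
    return int(dp[len(s), len(t)])

print(bleu, cosine, levenshtein(real, synthetic))
```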
E. Pseudo-Labelling of Synthetic Data

To refine the synthetic SQL data generated by U-Net and CWGAN-GP, pseudo-labelling was employed. This method involved reducing the dimensionality of the high-dimensional data using Principal Component Analysis (PCA) and applying KMeans clustering to assign pseudo-labels.

PCA for Dimensionality Reduction: Principal Component Analysis (PCA) was applied to reduce the dimensions of the synthetic data to two principal components for better visual representation and easier clustering. Mathematically, the transformation can be described as:

Z = XW

where X represents the original high-dimensional data and W is the projection matrix consisting of the top two eigenvectors of the covariance matrix of the data. This transformation enabled a clear separation of the data into clusters, facilitating the next step of clustering and pseudo-labelling.

KMeans Clustering for Pseudo-Labelling: Once the data was reduced to two dimensions, KMeans clustering was performed to assign pseudo-labels. The KMeans algorithm minimised the Within-Cluster Sum of Squares (WCSS), defined as:

WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where k is the number of clusters (in this case, k = 2), C_i represents the set of points assigned to cluster i, \mu_i is the centroid of cluster i, and x is each data point. The KMeans algorithm assigned pseudo-labels corresponding to benign (Class 0) and malicious (Class 1) SQL queries. The labelling was based on the spread of the data: benign queries exhibited a lower spread, while malicious queries displayed a higher spread in the feature space. This distinction in the data distribution enabled the clustering algorithm to effectively separate benign from malicious queries. By leveraging these differences in data spread, the KMeans algorithm enabled the accurate classification of synthetic data into the relevant categories, facilitating its use for training machine learning models.
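A compact scikit-learn sketch of this two-step procedure is shown below; the synthetic feature matrix is random stand-in data, and the cluster with the smaller spread is mapped to the benign class, following the spread-based assignment described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_synthetic = rng.normal(size=(1000, 64))          # stand-in for generated query vectors

# Step 1: project to two principal components (Z = XW).
Z = PCA(n_components=2).fit_transform(X_synthetic)

# Step 2: cluster in the reduced space and assign pseudo-labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Spread-based assignment: the cluster with the smaller variance is treated
# as benign (Class 0), the more dispersed cluster as malicious (Class 1).
spread = [Z[clusters == c].var() for c in (0, 1)]
benign_cluster = int(np.argmin(spread))
pseudo_labels = (clusters != benign_cluster).astype(int)
print(np.bincount(pseudo_labels))
```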
F. Evaluation of Models on Hybrid Data

To further enhance model performance, a hybrid dataset was created by combining real SQL data with pseudo-labelled synthetic data generated by the U-Net and CWGAN-GP models. The following steps were performed to evaluate the performance of the model:

Hybrid Dataset Composition: The combined dataset D_combined was created by mixing real data with synthetic data from U-Net and CWGAN-GP in different proportions. The combined dataset is formulated as follows:

D_{\text{combined}} = D_{\text{real}} \cup (D_{\text{U-Net}} \times p_1) \cup (D_{\text{CWGAN-GP}} \times p_2)

where D_real represents the real dataset, D_U-Net and D_CWGAN-GP are the synthetic datasets generated by U-Net and CWGAN-GP, and p_1 and p_2 are the proportions of synthetic data from each model. By adjusting p_1 and p_2, different hybrid dataset compositions were tested to optimise the training data balance between real and synthetic data.

Cross-Validation of Dataset Combinations: Stratified K-Fold Cross-Validation was employed to evaluate different combinations of real and synthetic data while preserving class distribution across all folds. This method ensured that the performance of the model was evaluated consistently across various splits of the data. The performance was measured using two key metrics:

\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}}

\text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

These metrics provided insight into the ability of the model to correctly classify SQLi attacks while minimising false negatives. The cross-validation process helped identify the optimal proportion of real and synthetic data for maximising model performance, ensuring a robust balance between precision and recall.
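One way to assemble such mixtures and score them with stratified cross-validation is sketched below; the arrays, proportions, and the gradient-boosting stand-in classifier are illustrative rather than the study's exact setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X_real, y_real = rng.normal(size=(500, 16)), rng.integers(0, 2, 500)
X_unet, y_unet = rng.normal(size=(500, 16)), rng.integers(0, 2, 500)
X_gan, y_gan = rng.normal(size=(500, 16)), rng.integers(0, 2, 500)

def mix(p1, p2):
    """Combine all real data with fractions p1 / p2 of each synthetic set."""
    n1, n2 = int(p1 * len(X_unet)), int(p2 * len(X_gan))
    X = np.vstack([X_real, X_unet[:n1], X_gan[:n2]])
    y = np.concatenate([y_real, y_unet[:n1], y_gan[:n2]])
    return X, y

for p1, p2 in [(0.25, 0.25), (0.5, 0.5), (1.0, 0.5)]:
    X, y = mix(p1, p2)
    accs, sens = [], []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        clf = GradientBoostingClassifier().fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        accs.append(accuracy_score(y[test_idx], pred))
        sens.append(recall_score(y[test_idx], pred))   # sensitivity = recall on class 1
    print(p1, p2, np.mean(accs), np.mean(sens))
```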
G. Final Model Evaluation

After identifying the best dataset combination, the XGBoost classifier was trained on the combined dataset. XGBoost was selected for its high efficiency and scalability, especially in dealing with structured data such as SQL queries. The final model used logistic loss as the objective function, defined as:

L_{\text{XGBoost}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]

where N is the total number of samples, y_i is the true label for sample i, and \hat{y}_i is the predicted probability for sample i. This loss function optimises the classification model by minimising the error in predicting the correct labels for both benign and malicious queries.

The trained XGBoost model was evaluated on the test set using multiple metrics, including accuracy, sensitivity, precision, recall, and F1-score.

The results demonstrated that the combination of real and pseudo-labelled synthetic data improved the ability of the model to generalise to new, unseen SQL queries.
The final XGBoost model achieved high accuracy, sensitivity, and precision, indicating its effectiveness in SQL injection detection across diverse attack types.
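A minimal sketch of this final training and evaluation step is given below, assuming the xgboost scikit-learn wrapper (XGBClassifier) and random stand-in data in place of the selected hybrid dataset.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 16)), rng.integers(0, 2, 2000)   # stand-in hybrid dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# binary:logistic corresponds to the logistic (cross-entropy) loss given above.
model = XGBClassifier(objective="binary:logistic", n_estimators=200,
                      max_depth=6, learning_rate=0.1, eval_metric="logloss")
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy ", accuracy_score(y_test, pred))
print("precision", precision_score(y_test, pred))
print("recall   ", recall_score(y_test, pred))   # recall on the malicious class = sensitivity
print("f1       ", f1_score(y_test, pred))
```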
IV. RESULTS AND ANALYSIS

In this section, the performance of several machine learning models trained on VAE-encoded SQL data is evaluated for detecting SQL Injection Attacks (SQLIA). The models tested include XGBoost, LightGBM, Random Forest, K-Nearest Neighbors (KNN), Neural Networks, Logistic Regression, Support Vector Classifier (SVC), and Naive Bayes. The evaluation focuses on metrics such as accuracy, precision, recall, F1-score, and sensitivity for both benign (Class 0) and malicious (Class 1) queries.

As depicted in Figure 9, XGBoost achieved the highest accuracy, outperforming other models with an impressive score of 99.40%. LightGBM followed closely with 99.35%, while Random Forest and KNN achieved 99.15% and 98.65%, respectively. Neural Networks, while often strong performers in complex tasks, registered 95.39% in this specific SQLIA detection task. Logistic Regression, Support Vector Classifier (SVC), and Naive Bayes were outperformed by the others, with Naive Bayes being the least accurate at 72.82%.

The results underscore the suitability of XGBoost for SQLIA detection tasks, making it the preferred choice for further experiments. Its balance of speed, precision, and handling of imbalanced data makes it ideal for this application, particularly when combined with synthetic data generation techniques such as those explored in this study.
These metrics were selected to ensure the reliability of the generated data for SQL injection detection.

Other Metrics and Justifications

Several other metrics commonly used in text generation, such as perplexity and compression ratio, were considered but deemed unsuitable for this structured dataset:

• Perplexity: This metric is typically used in language modelling to measure uncertainty in word predictions. However, because SQL queries are deterministic and do not involve probabilistic word choices, perplexity is not applicable in this context [30].

• Compression Ratio: Commonly used to evaluate text summarisation, this metric measures the reduction in text length. In SQL query generation, the goal is to preserve accuracy and completeness rather than conciseness, making this metric inappropriate for the task.

By selecting the metrics most relevant to structured SQL data, this evaluation ensures that the synthetic queries generated are reliable and useful for training machine learning models to detect SQL injection attacks.

In addition, Principal Component Analysis (PCA) and K-Means Clustering were used to visually inspect the alignment between real and synthetic data distributions. These techniques provide additional insights into the structural similarities of both datasets. The combination of these metrics ensures a robust evaluation of the synthetic data's quality, supporting its use in training models for SQL Injection detection systems.

C. Evaluation of Synthetic Data

1) U-Net Model: Results and Discussion: The performance of the U-Net model in generating synthetic SQL queries was evaluated using the above key metrics.

A Levenshtein distance of 1.4565 supports the high similarity between the datasets, requiring minimal changes for alignment. The Principal Component Analysis (PCA) results in Figure 11 further validate the consistency of the U-Net model in generating synthetic queries that align closely with the real SQL data. Minimal distributional differences, as indicated by the Mean Difference of 0.0062 and Variance Difference of 0.0045, emphasise the strong generalisation capabilities of the model in mimicking real-world query structures.

Fig. 11. PCA of Real vs Synthetic Data for U-Net Model

2) CWGAN-GP Results and Discussion: Similar to the U-Net model, the synthetic data generated by CWGAN-GP was evaluated using key performance metrics to assess its effectiveness.
Figure 13 shows the Principal Component Analysis (PCA), where the real and synthetic data exhibit partial overlap, with the synthetic data displaying greater dispersion. This suggests that while CWGAN-GP effectively captures overall patterns, there remains some variability.

In summary, the CWGAN-GP model demonstrates effective token-level and vector-based similarities with real data, though with some structural deviations, as evidenced by the PCA and higher Levenshtein distance.

Fig. 14. KMeans Clustering of U-Net and CWGAN-GP Synthetic Data with Pseudo Labels. U-Net results on the left and CWGAN-GP on the right.

E. XGBoost Performance on Various Data Combinations

In this section, the performance of the XGBoost model trained on a combination of original data and synthetic data generated by the U-Net and CWGAN-GP models is analysed. Figure 15 illustrates the comparative results of XGBoost across various evaluation metrics.