Detection of SQL Injection Using A Genetic Fuzzy Classifier System
Abstract—SQL Injection (SQLI) is one of the most popular vulnerabilities of web applications. The consequences of an SQL injection attack include the possibility of stealing sensitive information or bypassing authentication procedures. SQL injection attacks have different forms and variations. One difficulty in detecting malicious attacks is that such attacks do not have a specific pattern. A new fuzzy rule-based classification system (FRBCS) can tackle the requirements of the current stage of security measures. This paper proposes a genetic fuzzy system for detection of SQLI where not only the accuracy is a priority, but also the learning and the flexibility of the obtained rules. To create rules with high generalization capabilities, our algorithm builds on initial rules, data-dependent parameters, and an enhancing function that modifies the rule evaluation measures. The enhancing function helps to assess the candidate rules more effectively based on decision subspace. The proposed system has been evaluated using a number of well-known data sets. Results show a significant enhancement in the detection procedure.

Keywords—SQL injection; web security; genetic fuzzy system; fuzzy rule learning

I. INTRODUCTION

Web applications are vulnerable to numerous attacks. SQL injection is a widely common threat, which remains on top of the list of web application attacks as ranked by OWASP (the Open Web Application Security Project) [1]. Various techniques of SQL injection are used by hackers to achieve different purposes: bypassing a login system, modifying a table in a database, shutting down an SQL server, getting database information from the returned error message, or executing stored procedures [2].

SQL injection attacks are a type of vulnerability that is ultimately caused by insufficient input validation. Such attacks occur when data provided by the user is not properly validated and is included directly in an SQL query. By leveraging these vulnerabilities, an attacker can submit SQL commands directly to the database. Web applications are threatened by this kind of vulnerability that uses user input to form SQL queries to access an underlying database [3]. Generally, SQL injection attacks are classified into seven types: tautologies, illegal/logically incorrect queries, piggy-backed queries, stored queries, inference and alternate encodings [2] [4] [5].

Attackers continuously develop new ways to bypass controls added by developers. In recent years, hackers started to use different styles to perform SQLI. Hackers developed techniques to bypass web application firewalls (WAF bypassing). The security agents started to use buffer overflow methods and applied new bypassing methods like special characters bypassing. The various types of injections at different levels require a solution that can cope with such changes.

A number of approaches address detection of SQLI attacks. Such approaches include static analysis, dynamic analysis, and a combined approach. Researchers developed other approaches such as the mutation-based approach, query tokenization and applying regular expressions. These approaches suffer from a number of problems preventing them from being the optimal solutions [6]. Those techniques lack flexibility and scalability; they cannot deal with unknown types or larger ranges of injections [7]. Lack of learning capabilities is a vital problem. Most solutions parse user input and confirm a match limited to fixed and very small patterns, which are modeled by reference to existing malicious web code. However, new malicious web code can deliberately be developed to avoid being matched with the registered patterns [8]. The available parsing techniques can also cause high computational overhead affecting real-time detection [9].

Recently, machine learning techniques have been adopted to overcome the previously mentioned problems, as they can cover a broader range of malicious web code and can be adapted to variations and changes [8]. Machine learning techniques explore the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions [10]. Some existing machine learning techniques suffer from high computational overhead, because the training of classifiers in those techniques is time-consuming. Furthermore, a number of existing solutions lack the adaptation capability to detect new attacks [9].

Uncertainty and fuzziness are common phenomena in applications of machine learning. Different types of uncertainty can be observed: (i) Noise, outliers, and errors affect the input data. A machine learning method has to deal with this type of fuzzy information, showing robustness with respect to such disturbances. (ii) Distribution and fuzziness influence representation of information within a machine learning system. According to these different locations and goals of fuzzy information, a variety of different models exist which allow machine learning to deal with uncertain information as
129 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 7, No. 6, 2016
input, output, or internal representation [11]. Fuzzy rule-based systems (FRBSs) are well-known methods within soft computing, based on fuzzy concepts that address complex real-world problems. They are powerful methods to address uncertainty, imprecision, and non-linearity [12].

Fuzzy rule-based classification systems (FRBCSs) are specialized in handling classification tasks. A main characteristic of classification is that the outputs are categorical data. Therefore, in this model type, we preserve the antecedent part of linguistic variables and change the consequent part to be a class Cj from a pre-specified class set C = {C1, ..., CM}. FRBCSs aim at representing the knowledge of human experts in a set of fuzzy IF-THEN rules. Instead of using crisp sets as in classical rules, fuzzy rules use fuzzy sets. Rules were initially derived from human experts through knowledge engineering processes. However, this approach may not be feasible when facing complex tasks or when human experts are not available. An effective alternative is to generate the FRBCS model automatically from data by using learning methods. FRBCSs have demonstrated their ability to handle control problems, modeling, classification or data mining in a huge number of applications [13].

The automatic definition of FRBCS rules can be seen as an optimization problem. Genetic Algorithms (GAs) are global search techniques with the ability to explore a large search space for suitable solutions, requiring only a performance measure. In addition to their ability to find near-optimal solutions in complex search spaces, the generic code structure and independent performance features of GAs qualify them to incorporate a priori knowledge. In the case of FRBCSs, this a priori knowledge may be in the form of linguistic variables, fuzzy membership function parameters, fuzzy rules, number of rules (genetic rule learning), etc. These capabilities extended the use of GAs in the development of a wide range of approaches for designing FRBSs over the last few years. Therefore, GAs remain today one of the few knowledge schemes available to design and optimize FRBCSs with respect to the design decisions. According to the performance measures, decision makers decide which components are fixed and which need to change [13].

In this work, we investigate the FRBCS technique for detection of SQLI; we suggest a new technique to address the uncertainty, fuzziness and adaptation problems associated with existing machine learning techniques. The rule selection mechanism in FRBCS induces competition among rules by only considering the quality of matching performed by each rule. To increase the generalization power of the classifier, we propose a genetic fuzzy approach that creates more cooperative rules in the final population. The proposed system uses a genetic algorithm (GA) for optimizing the FRBCS technique to enhance its learning and adaptation capabilities.

The rest of the paper is organized as follows: Section 2 discusses related work reported in the literature. An overview of the proposed fuzzy genetic system is given in Section 3. The experimental results and evaluation of the proposed system are discussed in Section 4. Finally, Section 5 presents the conclusion and future research directions.

II. LITERATURE REVIEW AND RELATED WORK

In general, SQL injection attacks can be divided into three main categories: in-band, out-of-band and inferential [14]. In in-band attacks, the information is extracted from the same channel that is used for the attack. For example, the list of users will appear in the current page. In an out-of-band attack, the extracted information is sent back to the attacker using another channel such as email. In an inferential attack, which is also known as a blind injection, no data is sent back directly to the attacker. However, the attacker can reconstruct the data by trying different attacks and observing the behavior of the web application.

In the literature, SQLI detection techniques can be classified into dynamic analysis, static analysis, combined approach, machine learning, and other approaches (e.g. Hash technique, Black Box Testing) [3][15-19]. Static analysis checks whether every flow from a source to a sink is subject to an input validation and/or input sanitizing routine [20], whereas dynamic analysis is based on dynamically mining the programmer's intended query structure on any input and detects attacks by comparing it against the structure of the actual query issued [21].

AMNESIA, as a combined approach, is a model-based technique that combines static and dynamic analysis for detection and prevention of SQLI attacks [3]. In the static phase, AMNESIA uses static analysis to build models of the SQL queries that are generated at points of access to the database. In the dynamic phase, AMNESIA intercepts all the SQL queries before they are sent to the database and checks each query against the statically built models. Queries that violate the model are identified as SQLI attacks. The accuracy of AMNESIA depends on the static analysis stage. Unfortunately, certain types of complicated code and/or query generation techniques make this step less precise and generate both false positives and negatives [22].

As mentioned above, several approaches for detection of SQL injection were developed. The literature survey focuses on the machine learning techniques that are relevant to our proposed system. Valeur et al. [23] proposed an intrusion detection system capable of detecting a variety of SQL injection attacks. Profiles of normal access to the database are built using statistical methods. At runtime, queries that do not match any built model are identified as a possible attack. As with most learning-based anomaly detection techniques, the system requires a training phase prior to detection. The main problem of this technique, besides the false positives and negatives, is its execution and storage overhead, due to the difficulty of training on all the possible benign queries with normal behavior [24].

In [9], the authors proposed an SQLI detection technique in adversarial environments by K-centers. They introduced a new online learning technique in which samples are learned one by one, and as a result, the number and centers of the clusters are adjusted accordingly. Therefore, the K-centers technique can adapt to different kinds of attacks. The experimental results show that their method has a satisfying result on SQLI attack detection in the adversarial environment. The main
drawback of their method is that it must receive a true label of each statement after classification [25]. The concept of pattern classifiers to detect injection attacks and protect web applications is introduced in [24]. HTTP requests are captured and converted into numeric attributes. Numeric attributes include the length and the number of keywords of parameters. Using these attributes, the system classifies the parameters with a Bayesian classifier to judge whether the parameters are injection patterns or not. The main drawback is that the system depends on limited types of features.

The major contributions of the work in [26] are the proposal of a novel method based on the genetic algorithm applied to the SQLI attack detection task and the correlation of a number of detection tools together with the novel method. In this work, the authors show that correlating several sources of information and then performing reasoning on the correlated information can improve the results of attack detection. The main disadvantage of this algorithm is the overhead in performance and storage caused by the correlation approach.

The implementation of Artificial Neural Networks (ANN) as a biologically inspired computing technique is investigated in [2] to detect SQLI attacks. Multilayer Feed forward Networks (MLN) were used in the implemented system. Its advantages are the ability to learn and store empirical knowledge, the nonlinear nature of neural networks, the ability to generalize the solutions and to adapt when the context changes, and suitable computational performance. The limitations include depending on the appearance of certain SQL keywords along with suspicious characters without considering the relative order between them. For this reason, despite the different order of the keywords, if a normal signature contains many keywords and suspicious characters that often appear together in an SQLI, it is highly likely to be misclassified. Another work related to ANN-based SQLI detection is introduced in [27, 28]. It depends on limited SQL patterns for training, so it is susceptible to generating false positives.

TF-IDF has been used in [8] for weight calculation of tokens to evaluate the performance of three machine learning approaches: SVM, Naive-Bayes, and K-NN. This method has low computation time complexity but is susceptible to generating false positives [9]. Furthermore, Gene Expression Programming (GEP) for detection of SQLI is discussed in [29]. At the beginning, chromosomes are generated randomly. Then, in each iteration of GEP, a linear chromosome is expressed in the form of an expression tree and executed. The fitness value is calculated and the termination condition is checked. The best individual is preserved through the next iteration. Afterward, the populations are subjected to genetic operators with defined probability. New individuals in the temporary population constitute the current population. Classification accuracy received from GEP shows great efficiency for SQL queries constituted from 10 to 15 tokens. For longer statements, the averaged FP and FN is approximately 23%.

Among the approaches, a genetic algorithm for detection of SQLI is proposed in [30]. In this technique, levels of SQLI are detected using template matching. The ultimate goal of the genetic algorithm is to optimize the matching rules of the SQLI queue in the template library. These rules are in the form of IF (condition) THEN (execution), where conditions refer to attack sequence matches. However, the algorithm relies on the template sequence to define SQLIA. Therefore, the system fails to detect attacks of different sequences that are not included in the template library.

The main objective of this paper is to propose a combined approach where FRBS and the genetic algorithm are used together to improve the accuracy of the system for detection of SQLI, so that new SQLI attacks can be processed and detected. To the best of our knowledge, there is no previous work that uses FRBS for detecting SQLI attacks. To enhance the accuracy of the learning capability, we extend FRBS with the genetic algorithm to find the most suitable rules for FRBCS.

III. PROPOSED SQL INJECTION DETECTION SYSTEM USING FUZZY GENETIC

This paper introduces a GA-based method to generate a fuzzy rule base for SQLI detection. With the specific structure of the chromosome, the GA operations and an adequate fitness function, the proposed method produces a fuzzy rule base (FRB) with proper rules. Designers usually cannot guarantee that a fuzzy control system whose fuzzy rules are built by trial and error has reliable performance. Fig. 1 illustrates the flow diagram of the proposed system.

In this work, the fuzzy rule base is tuned automatically by a GA, an arrangement known as a Genetic Fuzzy System (GFS). First, fuzzy logic produces controllers that are suitable for dealing with uncertainty and imprecision. Second, fuzzy behaviors can be conveniently synthesized by a set of IF-THEN rules using easy-to-understand linguistic terms to encode expert knowledge. Finally, the interpolative nature of fuzzy systems helps express partial and simultaneous simulations of SQLI features, and the smooth transitions between these features [30].

A GA starts with a population of randomly generated chromosomes and advances towards better chromosomes by applying genetic operators inspired by the genetic process occurring in nature. The population undergoes evolution in a form of natural selection. During successive iterations, called generations, chromosomes in the population are evaluated for their adaptation as solutions, and on the basis of this evaluation, a new population of chromosomes is formed using a selection mechanism, crossover, and mutation operators. A fitness function must be devised for each problem to be solved. Each chromosome is evaluated using the fitness function, which returns a single numerical value. The probability of selection of a certain chromosome is directly proportional to its fitness value [31]. A GA-tuned fuzzy system with seven inputs and one output will be illustrated to explain the SQLI detection process.
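The statement that a chromosome's selection probability is directly proportional to its fitness corresponds to what is commonly called roulette-wheel selection. The following is a minimal sketch of that idea only, not the authors' implementation; the function name `roulette_select` and the toy population are assumptions introduced for illustration.

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Return one chromosome, chosen with probability proportional to fitness.

    `population` and `fitnesses` are parallel sequences; fitness values are
    assumed to be non-negative with a positive sum (an assumption of this sketch).
    """
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    # Spin the wheel: a point in [0, total) lands inside exactly one slice.
    threshold = rng.random() * total
    running = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        running += fitness
        if running > threshold:
            return chromosome
    return population[-1]  # only reached through floating-point round-off


if __name__ == "__main__":
    rng = random.Random(42)
    rules = ["rule_a", "rule_b", "rule_c"]   # hypothetical chromosomes
    scores = [1.0, 3.0, 6.0]                 # hypothetical fitness values
    print([roulette_select(rules, scores, rng) for _ in range(5)])
```

A chromosome with zero fitness is never selected in this sketch, and one with three times the fitness of another is drawn about three times as often over many draws.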
5) Recalculate the distance between each data point and the newly obtained cluster centers. If no data point is reassigned then stop, otherwise repeat from step 3.

This algorithm aims at minimizing an objective function, in this case, a squared error function:

$$J(V) = \sum_{i=1}^{k} \sum_{j=1}^{c_i} \left( \lVert x_i - v_j \rVert \right)^2, \quad (1)$$

where $c_i$ is the number of data points in the ith cluster, and k is the number of cluster centers.

C. Fuzzy Logic System

Owing to its low computational requirement and capability of modeling human perception, fuzzy logic (FL) is an efficient and flexible method for managing degrees of uncertainty in attack detection. Problems can be described in natural descriptions, i.e. linguistic terms, rather than numerical values. The FL system consists of (i) a fuzzifier that takes input values and determines the degree to which they belong to each of the fuzzy sets via membership functions (MFs); (ii) a fuzzy inference system that defines a non-linear mapping of the input data vector into a scalar output, using fuzzy rules; and (iii) a defuzzifier that maps output fuzzy sets into a crisp number [34]. A fuzzy set [35] is defined as [2]:

$$D = \{ (x, \mu_D(x)) \mid x \in X,\ \mu_D(x) \in [0, 1] \}, \quad (2)$$

where X represents the universal set, x is an element of X, D is a fuzzy subset in X and $\mu_D(x)$ is the membership function of fuzzy set D. A membership function is a curve that defines how each point in the input space is mapped to a membership value (or degree of membership) between 0 and 1 [35].

Next we will fuzzify the input (features of SQLI) and the output (probability of injection), i.e. input and output are mapped into a set of fuzzy partitions. Here, a seven-input single-output fuzzy system is used, which is given by $f : U \subset R^m \to Z \subset R^n$, where $U = U_1 \times \dots \times U_7$ is the input space and Z is the output space. Three fuzzy variables describe the features. Their respective MFs ($\mu_A$) [36] are triangular functions calculated as:

$$f(x; a, b, c) = \max\left( \min\left( \frac{x - a}{b - a}, \frac{c - x}{c - b} \right), 0 \right), \quad (3)$$

where a, b and c are the outputs of the k-means clustering that represent the lower, center and upper limits of a cluster respectively. To achieve overlap between the membership functions (overlapped fuzzy sets) of each feature, the system makes an intersection of 15% to 20% between consecutive MFs.

Once the system acquires the fuzzy descriptions of the features distance, the Mamdani rule base (fuzzy reasoning) can be built to make an inference of detection of SQLI. Fuzzy reasoning, which is formulated by the group of fuzzy IF-THEN rules, presents a degree of presence or absence of association or interaction between the elements of two or more sets. In the proposed system, reasoning is carried out through the following rules:

• If more than half of the input variables are 'H', the output variable is set to 'H'.
• If both $f_7$ and $f_2$ are 'H', the output variable is set to 'H'.
• If both $f_7$ and $f_1$ are 'H', the output variable is set to 'H'.
• If both $f_1$ and $f_2$ are 'H', the output variable is set to 'H'.
• If $f_7$, $f_2$ and $f_1$ are 'H', the output variable is set to 'H'.
• If any of $f_7$, $f_2$ and $f_1$ is 'H', the output variable is set to 'M'.
• If both $f_1$ and $f_2$ are 'M', the output variable is set to 'M'.
• Other rules are obtained using the Cartesian product method of the seven features, which is to consider all the combinations of antecedent linguistic values and generate a fuzzy rule for each combination. The output variable of each case depends on the nature of the dataset.

The rules altogether deal with the weight assignments implicitly, in the same way that humans think. The fuzzy inference processes all of the cases in a parallel manner, which makes the decision more reasonable.

The output of the fuzzy system is the probability of SQLI (PSQLI), and it is also described by three fuzzy variables, 'high', 'medium' and 'low', with triangular MFs. The outputs of fuzzy values are then defuzzified to generate a crisp value for the variable. The most popular defuzzification method is the centroid, which calculates and returns the center of gravity of the aggregated fuzzy set [36] and is given by

$$PSQLI = \frac{\sum_{r=1}^{n} y^{(r)} \mu^{(r)}}{\sum_{r=1}^{n} \mu^{(r)}}, \quad (4)$$

where $y^{(r)}$ is the center of the suggested output at rule r, n is the number of rules and $\mu^{(r)}$ is the MF at rule r. The obtained crisp value is then mapped to its range (low, medium, high) to indicate the potential of an SQLI attack.

D. Rule Induction using Genetic Algorithm

In general, a rule base can be constructed by human experts or by machine learning techniques from datasets. The machine learning approach is useful where it is desired to extract rules from the analysis that can be related to conceivable human behavior. The essential feature of a GA is that a population of proposed solutions (coded using a "chromosome") is modified using biologically inspired operators (especially crossover and mutation), and incorporating a random component, to explore a solution space [37]. Formally, let P(g) and S(g) be parents and offspring in generation g; the GA is working as follows:
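As a generic illustration only, and not the authors' procedure, a generational GA over parents P(g) and offspring S(g) might be sketched as below. The binary chromosome of seven genes, the parameter values, and the function name `evolve` are all assumptions of this sketch; the paper's actual chromosome encodes fuzzy rules and its fitness function is not reproduced here.

```python
import random


def evolve(fitness, n_genes=7, pop_size=20, generations=30,
           p_cross=0.8, p_mut=0.05, seed=0):
    """Toy generational GA: returns the best chromosome seen (hypothetical setup)."""
    rng = random.Random(seed)
    # P(0): randomly generated parents, each a list of 0/1 genes.
    parents = [[rng.randint(0, 1) for _ in range(n_genes)]
               for _ in range(pop_size)]
    best = max(parents, key=fitness)
    for _ in range(generations):
        # Small offset keeps every weight positive so random.choices accepts them.
        weights = [fitness(c) + 1e-9 for c in parents]
        offspring = []  # S(g)
        while len(offspring) < pop_size:
            # Fitness-proportional selection of two parents.
            mom, dad = rng.choices(parents, weights=weights, k=2)
            child = list(mom)
            if rng.random() < p_cross:
                cut = rng.randint(1, n_genes - 1)  # one-point crossover
                child = mom[:cut] + dad[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            offspring.append(child)
        parents = offspring  # P(g+1) = S(g)
        best = max(parents + [best], key=fitness)
    return best


if __name__ == "__main__":
    # Toy fitness: number of 1-genes in the chromosome.
    print(evolve(sum))
```

Here the toy fitness simply counts 1-genes; in a rule-learning setting the fitness would instead score how well the rule base encoded by a chromosome classifies the training queries.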
[24] E. H. Cheon, Z. Huang, and Y. S. Lee, "Preventing SQL injection attack based on machine learning," International Journal of Advancements in Computing Technology, vol. 5, issue 9, pp. 967-974, 2013.
[25] M. Kaushik and G. Ojha, "SQL injection attack detection and prevention methods: a critical review," International Journal of Innovative Research in Science, Engineering and Technology, vol. 3, issue 4, pp. 11370-11377, April 2014.
[26] M. Choraś, R. Kozik, D. Puchalski, and W. Hołubowicz, "Correlation approach for sql injection attacks detection," Advances in Intelligent Systems and Computing, Springer, vol. 189, pp. 177-185, 2013.
[27] N. M. Sheykhkanloo, "Employing neural networks for the detection of sql injection attack," Proceedings of the 7th International Conference on Security of Information and Networks, pp. 318-323, UK, 2014.
[28] N. M. Sheykhkanloo, "SQL-IDS: evaluation of SQLi attack detection and classification based on machine learning techniques," Proceedings of the 8th International Conference on Security of Information and Networks, pp. 258-266, USA, 2015.
[29] J. Skaruz, J. P. Nowacki, A. Drabik, F. Seredynski and P. Bouvry, "Soft computing techniques for intrusion detection of SQL-based attacks," Lecture Notes in Computer Science, Springer, vol. 5990, pp. 33-42, 2010.
[30] J. Chen, L. Yang, H. Zhang, and Y. Liu, "A GA-based approach for SQL-injection detection," Future Information Engineering, vol. 49, p. 291, 2014.
[31] A. Adriansyah and S. H. M. Amin, "Knowledge base tuning using genetic algorithm for fuzzy behavior-based autonomous mobile robot," Proceeding of 9th International Conference on Mechatronics Technology, pp. 120-125, Malaysia, 2005.
[32] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: analysis and implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, 2002.
[33] E. Forgey, "Cluster analysis of multivariate data: Efficiency vs. interpretability of classification," Biometrics, vol. 21, no. 3, pp. 768-769, 1965.
[34] S. M. Saad, "Application of fuzzy logic and genetic algorithm in biometric text-independent writer identification," IET Information Security, vol. 5, no. 1, pp. 1-9, 2011.
[35] I. Elamvazuthi, P. Vasant, and J. F. Webb, "The application of mamdani fuzzy model for auto zoom function of a digital camera," arXiv preprint arXiv:1001.2279, 2010.
[36] M. Abdulghafour, "Image segmentation using fuzzy logic and genetic algorithms," Journal of WSCG, vol. 11, no. 1, pp. 1-8, 2003.
[37] J. Ricketts, "Tuning a modified Mamdani fuzzy rulebase system with a genetic algorithm for travel decisions," 18th World IMACS / MODSIM Congress, Australia, pp. 768-774, 2009.
[38] M. E. Cintra and H. D. A. Camargo, "Fuzzy rules generation using genetic algorithms with self-adaptive selection," IEEE International Conference on Information Reuse and Integration, pp. 261-266, USA, 2007.
[39] R. B. Jadhav and M. B. B. Gite, "Real time intrusion detection with fuzzy, genetic and apriori algorithm," International Journal of Advance Foundation and Research in Computer (IJAFRC), vol. 1, issue 11, pp. 34-40, 2014.
[40] Willian Halfond, 'Testbed', [Online]. Available: http://www-bcf.usc.edu/~halfond/testbed.html. [Accessed: 16-June-2016]