
Volume 8, Issue 5, May 2023 - International Journal of Innovative Science and Research Technology
ISSN No: 2456-2165

A Tutorial on Principal Component Analysis for Dimensionality Reduction in Machine Learning

Jasmin Praful Bharadiya
Doctor of Philosophy, Information Technology
University of the Cumberlands, USA

Abstract:- Anomaly detection has become a crucial technology in several application fields, most notably network security. This paper describes the classification challenge of anomaly detection using machine learning techniques on network data. Using the KDD99 dataset for a network IDS, dimensionality reduction and classification techniques are investigated and assessed. For the application on network data, Principal Component Analysis (PCA) for dimensionality reduction and the Support Vector Machine (SVM) for classification are considered, and the results are examined. The results show a decrease in execution time for classification as the dimension of the input data is reduced, and the precision and recall values of the classification algorithm show that SVM with PCA is more accurate, as the number of misclassifications decreases. Large-scale data in health research is of great interest because data-driven studies can move more quickly than hypothesis-based research, even though such enormous databases are becoming common and are therefore challenging to interpret. Principal Component Analysis (PCA) reduces the dimensionality of such datasets, enhancing interpretability while retaining most of the information. It does this by constructing new variables that are uncorrelated with one another.

Keywords:- Machine Learning, Principal Component Analysis, Dimensionality Reduction, Intrusion Detection, Anomaly Detection, Support Vector Machine.

I. INTRODUCTION

Machine Learning (ML) is the automatic training of a computer for specific tasks through algorithms. Sets of algorithms are used to mine data, automatically learning user preferences and discovering and filtering general rules in large data sets. Big data is an environment in which information is regularly collected or stored in data sets that are too large or complex to be managed by standard data-processing applications. Current use of the term "big data" refers to predictive analytics, user-behaviour analytics, or other advanced methods of data analysis that extract value from data, and rarely to a particular data-set size. In many ways, big data deposits have been constructed by corporations with special data-processing needs; as systems have gradually expanded in size and in the number of available data sets, massively parallel software running on tens, hundreds, or even thousands of servers has become necessary.

Through the provision of personalised healthcare and prescriptive analysis, clinical risk response and predictive analysis, reduction of duplication and care variability, automatic external and internal patient reporting, structured medical terminology, and patient registries, big data technology has continued to improve medical treatment. According to particular specified requirements, dimension-reduction algorithms convert data from higher dimensions to lower dimensions. Principal Component Analysis (PCA) is a dimensionality-reduction (DR) approach that is primarily used to condense a huge set of variables into a manageable number while retaining the majority of their information.

As computer networks become more prevalent in today's society, so do the dangers that they face. Intrusion detection systems are required to identify numerous threats. Both host-based and network-based intrusion detection systems are available. Host-based technology looks at events such as which files were accessed and which programmes were run. Network-based technologies examine events as information packets transferred between computers. Building useful behaviour models to discriminate between normal and pathological activities by watching data is one of the primary challenges for NIDSs.

There are two types of intrusion detection approaches: misuse detection, where attack behaviour or features are modelled using intrusion audit data, and anomaly detection, where normal usage behaviour is modelled. Commercial NIDSs usually follow the signature- or misuse-based approach, but an anomaly-based approach is efficient when machine learning methods are used. Many data mining and machine learning methods are used for network intrusion detection: unsupervised methods such as clustering, and supervised methods such as Naïve Bayes and the Support Vector Machine. However, comparing the results of an unsupervised dimensionality reduction method combined with a supervised SVM against an SVM without dimensionality reduction has not received much attention.

II. SYSTEM DESIGN

The first intrusion detection model based on data mining was proposed by Denning [1], and many research works have been devoted to the construction of effective intrusion detection models. The KDD99 dataset for intrusion detection is meant for data mining algorithms and was established by the Third International Knowledge Discovery and Data Mining Tools Competition [4]. In the KDD99 data set, each record corresponds to a set of derived features of a connection in the network data. Each connection is labelled either as normal or as an attack, with exactly one specific attack type. In this paper, we compare various machine learning algorithms that can be used for anomaly detection. The technical challenges in NIDSs based on machine learning methods are dimensionality reduction and classification. The framework in Fig 1 shows the three main parts of an intrusion or anomaly detection tool: pre-processing of network data, feature extraction, and classification. After dimension reduction using PCA, a reduced set of features that are linear combinations of the original features is obtained. The classifier solves the anomaly detection problem by applying the Support Vector Machine algorithm to the output of the PCA algorithm. The classification output is also compared against that of a support vector machine (SVM) trained on the original data without dimension reduction.

[Fig 1: System Design]

 Algorithm (a code sketch follows this list)
 Consider the network data corresponding to each connection record after mapping, so that each column represents a dimension of the input data.
 Compute the mean for each dimension and subtract it from each data value. Compute the covariance matrix C of the input data matrix.
 Calculate the eigenvalues and the corresponding eigenvectors of this covariance matrix; the principal components are obtained by solving the eigenvalue problem of the covariance matrix.
 To find the principal components, choose the eigenvectors corresponding to the K largest eigenvalues, where K << N. The dimensionality reduction step keeps only the terms corresponding to the K largest eigenvalues, yielding a new feature vector consisting of the eigenvectors of the principal components. The final data is computed from this feature vector and the mean-adjusted original input data using the equation
FinalData = RowFeatureVector × RowDataAdjust
 RowFeatureVector is the matrix whose rows are the transposed eigenvectors, and RowDataAdjust is the mean-adjusted input data. The obtained subspace is spanned by the orthogonal set of eigenvectors, which reveal the maximum variance in the data space.
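As a concrete illustration, the following is a minimal NumPy sketch of the steps above; the synthetic input X and the choice k = 5 are our own illustrative stand-ins for the mapped KDD99 connection records (KDD99 has 41 features after mapping).

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce an (n_samples, n_features) matrix X to k dimensions
    via the covariance/eigenvector procedure described above."""
    # Mean-adjust: subtract the per-dimension mean.
    X_adj = X - X.mean(axis=0)
    # Covariance matrix C of the input data (features x features).
    C = np.cov(X_adj, rowvar=False)
    # Eigen-decomposition; eigh is appropriate because C is symmetric.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort eigenvectors by decreasing eigenvalue and keep the top k.
    order = np.argsort(eigvals)[::-1][:k]
    row_feature_vector = eigvecs[:, order].T        # (k, n_features)
    # FinalData = RowFeatureVector x RowDataAdjust, in row-major layout.
    return X_adj @ row_feature_vector.T             # (n_samples, k)

# Example: project 1000 synthetic 41-dimensional records onto
# their first 5 principal components.
X = np.random.default_rng(0).random((1000, 41))
X_reduced = pca_reduce(X, k=5)
print(X_reduced.shape)  # (1000, 5)
```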

 Dimensionality Reduction Techniques (DR)
A Dimensionality Reduction (DR) algorithm aims to reduce the distance in a latent space between distributions of different data sets, allowing efficient transfer learning. The results indicate that, for each device individually, the findings with Dimensionality Reduction (DR) are much preferable to those without reduced dimensionality. The low-dimensional representation of the initial data tends to overcome the curse of dimensionality and can be easily analyzed, processed, and visualized. Advantages of dimensionality reduction techniques applied to a dataset:

 Decreases the number of dimensions and the data storage space.
 Requires less computation time.
 Irrelevant, noisy, and redundant data can be deleted.
 Data quality may well be improved.
 Helps an algorithm to work efficiently and improves accuracy.
 Allows data to be visualized.
 Simplifies classification and increases performance as well.

Generally, DR techniques are classified into two main families: Feature Selection (FS) and Feature Extraction (FE). Feature Selection (FS) is considered an important method since data is constantly produced at an ever-increasing rate; with this method, some significant dimensionality concerns can be minimized, such as effectively decreasing redundancy, eliminating unnecessary data, and enhancing the comprehensibility of results. Feature Extraction (FE), in turn, addresses the issue of finding the most distinctive, informative, and reduced set of features to improve the efficiency of both data processing and storage.
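To make the Fig 1 pipeline concrete (pre-processing, PCA feature extraction, then SVM classification), the following is a minimal sketch using scikit-learn; the synthetic data and parameters such as n_components=5 are illustrative stand-ins, not the paper's actual experimental setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for pre-processed KDD99 connection records:
# 41 numeric features, binary label (0 = normal, 1 = attack).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 41))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVM on PCA-reduced features vs. SVM on the original features.
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=5), SVC())
svm_raw = make_pipeline(StandardScaler(), SVC())

for name, model in [("SVM + PCA", svm_pca), ("SVM only", svm_raw)]:
    model.fit(X_tr, y_tr)
    print(name)
    print(classification_report(y_te, model.predict(X_te)))
```

Timing the two `fit` calls reproduces, on synthetic data, the kind of execution-time comparison reported in the abstract.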

 PCA in a streaming setting
A streaming setting in PCA is characterized by sequentially arriving data points, during which the parameters describing the subspace are repeatedly updated. Over time, the covariance matrix or the subspace can vary, so tracking and reacting to such changes is necessary to maintain the best possible approximation. PCA algorithms capable of updating their parameters continuously, without knowledge of the history of the data, are referred to as online PCA. Popular families of algorithms that fall under the term online PCA are incremental PCA (or incremental SVD) and neural network-based PCA; the focus of this work lies on the latter.

 Neural network-based PCA
Neural network-based PCA refers to typically unsupervised methods that estimate the eigenvalues λi and eigenvectors wi online from the input data stream. These methods are particularly useful for high-dimensional data streams since they avoid computing the large covariance matrix. In addition, they can track non-stationary data (i.e. data with a slowly changing covariance matrix). While the development of neural network-based PCA is described in the previous section, the focus of this section is to provide a more technical view of the neural network-based PCA that is extended and benchmarked in this work. The PCA extended in this work by an adaptive dimensionality adjustment is based on a robust recursive least squares algorithm (RRLSA) with interlocking of learning and Gram-Schmidt orthonormalization. In this method, the eigenvectors are updated hierarchically: the eigenvector with the largest eigenvalue is obtained using a single-unit learning rule applied to the original data.
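A classic instance of such a single-unit learning rule is Oja's rule [16], which converges to the eigenvector with the largest eigenvalue. The following is a brief sketch; the learning rate eta and the synthetic two-dimensional stream are our own illustrative choices.

```python
import numpy as np

def oja_first_component(stream, n_features, eta=0.01):
    """Estimate the leading eigenvector online with Oja's rule:
    w <- w + eta * y * (x - y * w), where y = w . x."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=n_features)
    w /= np.linalg.norm(w)
    for x in stream:
        y = w @ x                       # neuron output
        w += eta * y * (x - y * w)      # Hebbian update with decay term
    return w / np.linalg.norm(w)

# Synthetic stream whose dominant variance lies along (1, 1)/sqrt(2).
rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])
data = data @ np.array([[1, 1], [-1, 1]]) / np.sqrt(2)
w = oja_first_component(iter(data), n_features=2)
print(w)  # approximately +/-(0.707, 0.707)
```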
III. APPLICATIONS OF PCA

 Image and Video Processing: PCA is widely used in image and video processing tasks such as face recognition, image compression, and denoising. By reducing the dimensionality of image data, PCA can effectively capture the most important features and patterns.

 Signal Processing: In signal processing, PCA can be used for feature extraction, noise reduction, and signal classification. It helps in identifying the underlying structure and relevant features in signals.

 Genomics and Bioinformatics: PCA finds applications in genomics and bioinformatics for analyzing gene expression data, identifying disease subtypes, and understanding the relationships between genes. It aids in identifying significant features and reducing noise in high-dimensional biological datasets.

 Financial Data Analysis: PCA is applied in financial data analysis to identify patterns and reduce the dimensionality of financial time series. It helps in portfolio optimization, risk assessment, and identifying influential factors in financial markets.

 Text Mining: PCA can be used in text mining to analyze large text datasets, such as document collections or social media data. It aids in feature extraction, topic modeling, and sentiment analysis.

IV. ADVANCEMENTS IN PCA:

 Kernel PCA: Kernel PCA extends PCA to nonlinear dimensionality reduction by employing a kernel function to map the data into a higher-dimensional feature space. It allows capturing complex relationships between variables and is particularly useful when the data has nonlinear structures.

 Sparse PCA: Sparse PCA incorporates sparsity constraints into the PCA framework, promoting the identification of a sparse set of principal components. This is beneficial when dealing with high-dimensional data where only a few variables contribute significantly to the data structure.

 Incremental PCA: Incremental PCA enables the application of PCA to large datasets that cannot fit into memory. It processes data in batches or incrementally updates the principal components, making it more computationally efficient and scalable.

 Robust PCA: Robust PCA is designed to handle outliers and noise in the data. It separates the data into low-rank and sparse components, effectively extracting the underlying structure even in the presence of outliers or corrupted data.

 Online PCA: Online PCA adapts PCA for streaming data where new observations arrive continuously. It updates the principal components incrementally, allowing real-time analysis and adaptive dimensionality reduction.

These advancements in PCA techniques have expanded its applicability and improved its performance in various domains, addressing specific challenges and requirements of different datasets and scenarios.
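Several of these variants are available off the shelf; for instance, scikit-learn provides KernelPCA, SparsePCA, and IncrementalPCA. A brief usage sketch with illustrative parameters (not a benchmark of the methods above):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA, KernelPCA, SparsePCA

X = np.random.default_rng(0).normal(size=(1000, 20))

# Kernel PCA: nonlinear projection via an RBF kernel.
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)

# Sparse PCA: components with many exactly-zero loadings.
X_spca = SparsePCA(n_components=2, alpha=1.0).fit_transform(X)

# Incremental PCA: fit batch-by-batch when data cannot fit in memory.
ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)
X_ipca = ipca.transform(X)
```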
V. BENEFITS OF PRINCIPAL COMPONENT ANALYSIS (PCA):

 Dimensionality Reduction: PCA helps in reducing the dimensionality of high-dimensional datasets by identifying a smaller set of principal components that capture the most important information in the data. This reduces computational complexity and memory requirements and improves algorithm efficiency.

 Feature Extraction: PCA can extract meaningful features from complex datasets, allowing for better understanding and interpretation of the underlying data structure. It helps in identifying the most influential variables or features contributing to the variation in the data.

 Noise Reduction: PCA can effectively filter out noise and irrelevant variations in the data by focusing on the components with the highest eigenvalues. It helps in improving the signal-to-noise ratio and enhances the performance of subsequent analysis or modeling tasks.

 Visualization: PCA enables the visualization of high-dimensional data in a lower-dimensional space. By projecting the data onto a reduced set of principal components, it allows for the visualization of clusters, patterns, and relationships in the data, aiding in exploratory data analysis (see the sketch after this list).

 Multicollinearity Detection: PCA can identify and address multicollinearity issues in datasets where variables are highly correlated. It helps in identifying linear dependencies among variables and provides a more independent set of components.
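As a small illustration of the visualization benefit, the following sketch projects a labelled dataset onto its first two principal components and plots the result; the choice of the Iris dataset and the plot styling are our own.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # 4-dimensional feature space
X_2d = PCA(n_components=2).fit_transform(X)  # project onto 2 components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis", s=15)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Iris data projected onto its first two principal components")
plt.show()
```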
VI. LIMITATIONS OF PRINCIPAL COMPONENT ANALYSIS (PCA):

 Linearity Assumption: PCA assumes that the data is linearly related to the principal components. If the underlying data has complex nonlinear relationships, PCA may not capture all the relevant information, and other, nonlinear dimensionality reduction methods may be more appropriate.

 Loss of Interpretability: While PCA reduces the dimensionality of the data, the resulting principal components are usually combinations of the original variables, making their interpretation less straightforward. The interpretability of the transformed features may be challenging, especially when dealing with a large number of components.

 Sensitivity to Outliers: PCA is sensitive to outliers in the data, as outliers can disproportionately influence the estimation of the principal components. Outliers can distort the resulting variance-covariance structure and affect the quality of dimensionality reduction.

 Information Loss: PCA aims to capture the most important information in the data, but there is inevitably some loss of information during the dimensionality reduction process. The lower-dimensional representation may not fully retain all the details and nuances present in the original data.

 Selecting the Number of Components: Determining the optimal number of principal components to retain is a subjective decision. Choosing too few components may result in significant information loss, while retaining too many may lead to overfitting or unnecessary complexity in the data representation.

It is important to consider these limitations and assess the suitability of PCA for specific datasets and analysis goals. In some cases, alternative dimensionality reduction techniques or customized approaches may be more appropriate to address specific challenges or requirements.

 Considerations and Future Directions:
This article emphasizes important considerations when applying PCA, such as selecting the appropriate number of principal components, assessing the quality of the dimensionality reduction, and evaluating the impact on downstream tasks. It also identifies open research challenges, such as incorporating domain knowledge into PCA, handling high-dimensional and streaming data, and addressing the interpretability of the reduced feature space.
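A common heuristic for the component-selection consideration is to retain the smallest number of components whose cumulative explained variance crosses a threshold. A brief sketch follows; the 95% threshold is an illustrative convention, not a prescription from this paper.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images

pca = PCA().fit(X)                    # fit with all components
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.95) + 1)
print(f"{k} of {X.shape[1]} components explain 95% of the variance")

# Equivalently, scikit-learn accepts a variance fraction directly:
X_reduced = PCA(n_components=0.95).fit_transform(X)
```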

VII. CONCLUSION

In conclusion, Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction in machine learning and data analysis. It offers several benefits, including dimensionality reduction, feature extraction, noise reduction, visualization, and multicollinearity detection. PCA can help in simplifying complex datasets, improving computational efficiency, and enhancing the interpretability of data.

However, PCA also has certain limitations that need to be considered. It assumes linearity, which may not hold in all cases, and its interpretability can be challenging because the transformed components are combinations of the original variables. PCA is sensitive to outliers and may lead to information loss during the dimensionality reduction process. Additionally, determining the optimal number of components to retain requires subjective decision-making.

Overall, PCA is a valuable tool for dimensionality reduction, particularly in cases where linearity is a reasonable assumption and interpretability is not the primary concern. It is important to understand the benefits and limitations of PCA and to consider alternative methods when necessary to address specific challenges or requirements in data analysis.

REFERENCES

[1]. Aoying Zhou, Zhiyuan Cai, Li Wei, Weining Qian. M-kernel merging: towards density estimation over data streams. In: Eighth International Conference on Database Systems for Advanced Applications (DASFAA 2003), Proceedings; 2003. p. 285–292.
[2]. Artac M, Jogan M, Leonardis A. "Incremental PCA for on-line visual learning and recognition." Object recognition supported by user interaction for service robots. 2002;3:781–784.
[3]. Bannour S, Azimi-Sadjadi MR. Principal component extraction using recursive least squares learning. IEEE Transactions on Neural Networks. 1995;6(2):457–469. pmid:18263327
[4]. Bharadiya JP, Tzenios NT, Reddy M. Forecasting of crop yield using remote sensing data, agrarian factors and machine learning approaches. Journal of Engineering Research and Reports. 2023;24(12):29–44. https://doi.org/10.9734/jerr/2023/v24i12858
[5]. Bharadiya J. Artificial intelligence in transportation systems: a critical review. American Journal of Computing and Engineering. 2023;6(1):34–45. https://doi.org/10.47672/ajce.1487
[6]. Bharadiya J. A comprehensive survey of deep learning techniques for natural language processing. European Journal of Technology. 2023;7(1):58–66. https://doi.org/10.47672/ejt.1473
[7]. Bharadiya J. Convolutional neural networks for image classification. International Journal of Innovative Science and Research Technology. 2023;8(5):673–677. https://doi.org/10.5281/zenodo.7952031
[8]. Bharadiya J. Machine learning in cybersecurity: techniques and challenges. European Journal of Technology. 2023;7(2):1–14.
[9]. Bharadiya J. The impact of artificial intelligence on business processes. European Journal of Technology. 2023;7(2):15–25. https://doi.org/10.47672/ejt.1488
[10]. Bharadiya JP. A review of Bayesian machine learning principles, methods, and applications. International Journal of Innovative Science and Research Technology. 2023;8(5):2033–2038. https://doi.org/10.5281/zenodo.8002438
[11]. Bharadiya JP. Exploring the use of recurrent neural networks for time series forecasting. International Journal of Innovative Science and Research Technology. 2023;8(5):2023–2027. https://doi.org/10.5281/zenodo.8002429
[12]. Chen Bo, Ma Wu. "Research of intrusion detection based on principal components analysis." Information Engineering Institute, Dalian University, China. Second International Conference on Information and Computing Science; 2009. https://doi.org/10.47672/ejt.1486
[13]. Lee JA, Verleysen M. Nonlinear Dimensionality Reduction. Springer-Verlag GmbH; 2007.
[14]. Sebring MM, Shellhouse E, Hanna ME, Whitehurst RA. "Expert systems in intrusion detection: a case study." In: Proceedings of the 11th National Computer Security Conference, Baltimore, Maryland.
[15]. Nallamothu PT, Bharadiya JP. Artificial intelligence in orthopedics: a concise review. Asian Journal of Orthopaedic Research. 2023;6(1):17–27. https://journalajorr.com/index.php/AJORR/article/view/164
[16]. Oja E. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology. 1982;15(3):267–273. pmid:7153672
[17]. Lee WK, et al. "Mining audit data to build intrusion detection models." In: Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD'98); 1998. p. 66–72.
[18]. Zhang T, Yang B. Big data dimension reduction using PCA. In: 2016 IEEE International Conference on Smart Cloud (SmartCloud); 2016. p. 152–157.
[19]. Zhang Xue-qin, Gu Chun-hua, Lin Jia-jun. "Intrusion detection system based on feature selection and support vector machine." East China University of Science and Technology. Proceedings of IEEE; 2006.
