A Tutorial On Principal Component Analysis For Dimensionality Reduction in Machine Learning
ISSN No:-2456-2165

Abstract:- Anomaly detection has become a crucial technology in several application fields, most notably network security. This paper describes the classification challenge of anomaly detection using machine learning techniques on network data. Using the KDD99 dataset for network IDS, dimensionality reduction and classification techniques are investigated and assessed. For application to network data, Principal Component Analysis for dimensionality reduction and the Support Vector Machine for classification are considered, and the results are examined. The results show a decrease in execution time for classification as the dimension of the input data is reduced, and the precision and recall values of the classification algorithm show that the SVM with PCA method is more accurate, as the number of misclassifications decreases. Big data in health research is especially interesting because data-based studies can move more quickly than hypothesis-based research, even though such enormous databases are becoming common and are therefore challenging to interpret. Using Principal Component Analysis (PCA), one may reduce the dimensionality of such datasets, which enhances interpretability while retaining most of the information. It does this by introducing fresh variables that are uncorrelated with one another.

Keywords:- Machine Learning, Principal Component Analysis, Dimensionality Reduction, Intrusion Detection, Anomaly Detection, Support Vector Machine.

I. INTRODUCTION

Machine Learning (ML) is the automatic training of a computer for specific tasks through algorithms. Sets of algorithms are used to mine data, automatically learning user preferences and discovering general rules in large data sets. Big data is an environment in which information is regularly collected or otherwise stored in data sets that are too large or too complex to be managed by standard data-processing applications. Current use of the term "big data" usually refers to predictive analytics, user-behaviour analytics, or other advanced methods of data analysis that extract value from big data, and rarely to a particular data set size. In many ways, big data repositories have been constructed by corporations with special data-processing needs; as operating systems have gradually expanded in size and in the number of available data sets, they need massively parallel software running on tens, hundreds, or even thousands of servers.

Through the provision of personalised healthcare and prescriptive analysis, clinical risk response and predictive analysis, reduction of duplication and care variability, automatic external and internal patient reporting, structured medical terminology, and patient registers, big data technology has continued to improve medical treatment. According to particular specified requirements, dimension reduction algorithms convert data from higher dimensions to lower dimensions. Principal Component Analysis (PCA) is a dimensionality-reduction (DR) approach that is primarily used to condense a huge set of variables into a manageable number while retaining the majority of their information.

As computer networks become more prevalent in today's society, so too do the dangers that they face. Intrusion detection systems are required to identify numerous threats. Both host-based and network-based intrusion detection systems are available. Host-based technology looks at things like which files were accessed and which programmes were run. Network-based technologies examine events as information packets transferred from computer to computer. Building useful behaviour models that discriminate between normal and pathological activities by watching data is one of the primary challenges for NIDSs.

There are two types of intrusion detection approaches: misuse detection, where attack behaviour or features are modelled using intrusion audit data, and anomaly detection, where normal usage behaviour is modelled. Commercial NIDSs usually follow the signature (misuse-based) approach, but an anomaly-based approach is efficient when machine learning methods are used. There are many data mining and machine learning methods used for network intrusion detection: unsupervised methods such as clustering, and supervised methods such as Naïve Bayes and the Support Vector Machine. But comparisons of the results of using an unsupervised dimensionality reduction method along with the supervised
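As a concrete illustration of the dimensionality reduction step described above, the following sketch implements PCA directly with NumPy via the eigendecomposition of the covariance matrix. This is our own minimal sketch, not the paper's implementation; the synthetic data, the function name `pca_reduce`, and the choice of k = 2 components are illustrative assumptions.

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal
    components, obtained from the eigendecomposition of the
    covariance matrix of the centred data."""
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    components = eigvecs[:, order[:k]]       # top-k eigenvectors
    return X_centred @ components            # reduced representation

# Toy data: 200 samples, 5 features, with most variance in 2 latent directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

X_reduced = pca_reduce(X, k=2)
print(X_reduced.shape)  # (200, 2)
```

On real network data such as KDD99, the reduced matrix `X_reduced` would then be fed to a classifier (e.g. an SVM) in place of the full feature set, which is the source of the execution-time savings reported in the abstract.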
Noise Reduction: PCA can effectively filter out noise and irrelevant variations in the data by focusing on the components with the highest eigenvalues. It helps in improving the signal-to-noise ratio and enhances the performance of subsequent analysis or modeling tasks.

Visualization: PCA enables the visualization of high-dimensional data in a lower-dimensional space. By projecting the data onto a reduced set of principal components, it allows for the visualization of clusters, patterns, and relationships in the data, aiding in exploratory data analysis.

Multicollinearity Detection: PCA can identify and address multicollinearity issues in datasets, where variables are highly correlated. It helps in identifying linear dependencies among variables and provides a more independent set of components.
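The noise-reduction benefit above can be sketched numerically: keeping only the component with the largest eigenvalue and reconstructing from it should bring the data closer to the underlying signal. This is a minimal illustration assuming a synthetic rank-one signal; the noise level (0.1) and variable names are our illustrative choices, not values from the paper.

```python
import numpy as np

# Noisy observations of a one-dimensional signal spread across 10 features
rng = np.random.default_rng(1)
signal = rng.normal(size=(500, 1)) @ rng.normal(size=(1, 10))  # rank-1 signal
X = signal + 0.1 * rng.normal(size=(500, 10))                  # additive noise

X_centred = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centred, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

# Fraction of total variance captured by the top component
explained = eigvals[0] / eigvals.sum()

# Denoise: project onto the top component, then map back to feature space
top = eigvecs[:, :1]
X_denoised = (X_centred @ top) @ top.T + X.mean(axis=0)

err_noisy = np.linalg.norm(X - signal)
err_denoised = np.linalg.norm(X_denoised - signal)
print(f"explained variance: {explained:.3f}")
print(err_denoised < err_noisy)  # expected: True under this synthetic setup
```

The same projection (`X_centred @ top`, here one column) is what would be plotted for visualization, and the eigendecomposition exposes the linear dependencies that underlie the multicollinearity benefit.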
VI. LIMITATIONS OF PRINCIPAL COMPONENT ANALYSIS (PCA):

Linearity Assumption: PCA assumes that the data is linearly related to the principal components. If the underlying data has complex nonlinear relationships, PCA may not capture all the relevant information, and other nonlinear dimensionality reduction methods may be more appropriate.

Loss of Interpretability: While PCA reduces the dimensionality of the data, the resulting principal components are usually combinations of the original variables, making their interpretation less straightforward. The interpretability of the transformed features may be challenging, especially when dealing with a large number of components.

Sensitivity to Outliers: PCA is sensitive to outliers in the data, as outliers can disproportionately influence the estimation of principal components. Outliers can distort the resulting variance-covariance structure and affect the quality of dimensionality reduction.

It is important to consider these limitations and assess the suitability of PCA for specific datasets and analysis goals. In some cases, alternative dimensionality reduction techniques or customized approaches may be more appropriate to address specific challenges or requirements.

Considerations and Future Directions:
The article emphasizes important considerations when applying PCA, such as selecting the appropriate number of principal components, assessing the quality of dimensionality reduction, and evaluating the impact on downstream tasks. It also identifies open research challenges, such as incorporating domain knowledge into PCA, handling high-dimensional and streaming data, and addressing the interpretability of the reduced feature space.

VII. CONCLUSION

In conclusion, Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction in machine learning and data analysis. It offers several benefits, including dimensionality reduction, feature extraction, noise reduction, visualization, and multicollinearity detection. PCA can help in simplifying complex datasets, improving computational efficiency, and enhancing the interpretability of data.

However, PCA also has certain limitations that need to be considered. It assumes linearity, which may not hold in all cases, and its interpretability can be challenging due to the combination of original variables in the transformed components. PCA is sensitive to outliers and may lead to information loss during the dimensionality reduction process. Additionally, determining the optimal number of components to retain requires subjective decision-making.

Overall, PCA is a valuable tool for dimensionality reduction, particularly in cases where linearity is a reasonable assumption and interpretability is not the primary concern. It is important to understand the benefits and limitations of PCA and consider alternative methods when necessary to address specific challenges or requirements in data analysis.