Multivariate Analysis Techniques For Exploring Data
Most real-world problems involve multiple variables. Before these variables are used to
train a machine learning model, the data needs to be explored analytically. A fast and easy
way to do this is bivariate analysis, in which we compare two variables against each other,
for example with simple two-dimensional plots or t-tests.
However, comparing only two variables at a time gives limited insight into the nature of the
variables and how they interact with each other. Consider NASA's Curiosity rover, which uses
laser-induced breakdown spectroscopy (LIBS) to analyze the chemical composition of rocks in
the Gale Crater region of Mars. This data is highly multivariate, with over 6000 variables
per sample. Imagine plotting a two-dimensional graph for every pair of variables to
understand the patterns in the data! This is where multivariate analysis techniques come in.
Pairwise Plots
Pairwise plots are a great way to look at multi-dimensional data while maintaining the
simplicity of a two-dimensional plot. As shown in the figure below, they allow analysts to
view every pairwise combination of the variables, each in its own two-dimensional plot. In
this way, analysts can visualize the relations and interactions among the variables on a
single screen.
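As a rough illustration of the idea, the sketch below builds a small grid of pairwise plots in Python. The data set (height, weight, age) is synthetic and purely hypothetical, and the use of pandas' scatter_matrix helper is an assumption about tooling rather than something this article prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical synthetic data: weight depends on height, age is independent.
rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * height - 90 + rng.normal(0, 8, n)
age = rng.uniform(18, 65, n)
df = pd.DataFrame({"height": height, "weight": weight, "age": age})

# One scatter plot per pair of variables, with histograms on the diagonal.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")

# To inspect the result, save the figure, e.g.:
# axes[0, 0].figure.savefig("pairwise.png")
```

With three variables the result is a 3 x 3 grid of panels; the height/weight panels should show a clear trend while the panels involving age should look like shapeless clouds.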
Spider Plots
While there are various ways of visualizing multi-dimensional data, spider plots are one of
the easiest ways to decipher the meaning of data. From the figure below, we can see how
easily we can compare three mobile phones based on attributes such as their speed, screen,
camera, memory and apps.
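A spider plot of this kind can be sketched with a polar axis in matplotlib. The phone names and scores below are invented for illustration only; the attribute names come from the example above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

attributes = ["speed", "screen", "camera", "memory", "apps"]

# Hypothetical scores on a 0-10 scale, one list per phone.
phones = {
    "Phone A": [8, 6, 7, 5, 9],
    "Phone B": [6, 9, 8, 7, 6],
    "Phone C": [7, 7, 5, 9, 8],
}

# One angle per attribute, evenly spaced around the circle.
angles = np.linspace(0, 2 * np.pi, len(attributes), endpoint=False).tolist()
closed_angles = angles + angles[:1]  # repeat the first angle to close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for name, scores in phones.items():
    closed_scores = scores + scores[:1]
    ax.plot(closed_angles, closed_scores, label=name)
    ax.fill(closed_angles, closed_scores, alpha=0.15)

ax.set_xticks(angles)
ax.set_xticklabels(attributes)
ax.set_ylim(0, 10)
ax.legend(loc="upper right")

# To inspect the result, save the figure, e.g. fig.savefig("spider.png")
```

Each phone becomes one polygon, so a phone that is strong on every attribute encloses a larger area than one that is strong on only a few.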
Correlation Analysis
Often, data sets contain variables that are either related to each other or derived from each
other. It is important to understand these relations that exist in the data. In statistical terms,
correlation can be defined as the degree to which a pair of variables are linearly related.
In some cases the analyst can easily tell that two variables are related, but in most cases
this is not obvious. Performing a correlation analysis is therefore critical when examining
any data. Furthermore, feeding a model variables that are strongly correlated with one
another is not good statistical practice, since the same information is effectively given
multiple weight. Correlation analysis helps detect such redundancy before modeling.
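A minimal sketch of such a check is given here, assuming the data sits in a pandas DataFrame. The columns (income, spending, age) are synthetic, and the 0.9 cut-off is an arbitrary illustrative threshold, not a recommended value.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: spending is almost a linear function of income.
rng = np.random.default_rng(42)
n = 500
income = rng.normal(50_000, 12_000, n)
spending = 0.6 * income + rng.normal(0, 2_000, n)
age = rng.uniform(20, 70, n)
df = pd.DataFrame({"income": income, "spending": spending, "age": age})

# Pearson correlation matrix (pandas' default method).
corr = df.corr()

# List every pair of distinct variables whose absolute correlation
# exceeds the (assumed) threshold.
threshold = 0.9
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]

for a, b, r in high_pairs:
    print(f"{a} and {b} are strongly correlated (r = {r:.2f})")
```

Pairs flagged this way are candidates for dropping one variable or combining the two. Note that Pearson correlation only captures linear relationships, so a low value does not rule out a non-linear dependence.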
Cluster Analysis
In many business scenarios, the data belongs to different types of entities, and fitting all
of them into a single model might not be the best thing to do. For example, in a bank
dataset, the customers might belong to multiple income groups, which leads to different
spending behaviors. If we feed the data for all these customers into a single model, we
would be comparing apples to oranges. Clustering gives analysts a good way to segment their
data and thereby avoid this problem. It also allows us to visualize and compare the
attributes of the segments formed.
K-means clustering is a well-known approach used by many data analysts and scientists. It
separates the data points into k clusters such that the within-cluster distances (the
distances from each point to the centroid of its cluster) are minimized. What this means is
that points in a particular cluster tend to be similar to one another, and dissimilar from
the points in other clusters. Other popular approaches for clustering include hierarchical
clustering, DBSCAN and Partitioning Around Medoids (PAM).
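To make the k-means idea concrete, here is a bare-bones sketch of the standard iterative procedure (often called Lloyd's algorithm) written with NumPy only. The function name, its parameters and the two synthetic blobs are assumptions made for illustration; production work would normally rely on a tested library implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means: returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares for the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Hypothetical data: two well-separated groups of 50 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=1.0, size=(50, 2))
X = np.vstack([blob_a, blob_b])

labels, centroids, inertia = kmeans(X, k=2)
print("centroids:\n", centroids)
print("within-cluster sum of squares:", inertia)
```

The quantity reported as inertia is the within-cluster sum of squared distances. The result depends on the random starting centroids, which is why library implementations typically run several restarts and keep the best one.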
Multivariate Analysis of Variance (MANOVA)
This technique is best suited for situations with one or more categorical independent
variables and two or more metric dependent variables. While the simple ANOVA (Analysis of
Variance) examines differences between groups on a single dependent variable, using a t-test
for two means and an F-test otherwise, MANOVA assesses differences across groups on a set of
dependent variables considered jointly. For example, this technique is suitable when we want
to compare two or more dishes in a restaurant in terms of their level of spiciness, cooking
time and value for money.
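One common MANOVA test statistic is Wilks' lambda, the ratio det(W) / det(W + B), where W is the within-group and B the between-group sum-of-squares-and-cross-products matrix. The sketch below computes it with NumPy for the restaurant example. The ratings are synthetic, the function name is hypothetical, and the code stops at the statistic itself without deriving a p-value.

```python
import numpy as np

def wilks_lambda(Y, groups):
    """Wilks' lambda for dependent variables Y (n x p) and group labels (n,)."""
    Y = np.asarray(Y, dtype=float)
    groups = np.asarray(groups)
    grand_mean = Y.mean(axis=0)
    p = Y.shape[1]
    W = np.zeros((p, p))  # within-group scatter
    B = np.zeros((p, p))  # between-group scatter
    for g in np.unique(groups):
        Yg = Y[groups == g]
        mean_g = Yg.mean(axis=0)
        centered = Yg - mean_g
        W += centered.T @ centered
        diff = (mean_g - grand_mean)[:, None]
        B += len(Yg) * (diff @ diff.T)
    return np.linalg.det(W) / np.linalg.det(W + B)

# Hypothetical ratings: columns are spiciness, cooking time (minutes), value for money.
rng = np.random.default_rng(7)
dish_means = {
    "curry": [7.0, 25.0, 5.0],
    "salad": [2.0, 10.0, 6.0],
    "pasta": [5.0, 15.0, 8.0],
}
Y = np.vstack([rng.normal(m, 1.0, size=(30, 3)) for m in dish_means.values()])
groups = np.repeat(list(dish_means.keys()), 30)

lam = wilks_lambda(Y, groups)
print(f"Wilks' lambda = {lam:.4f}")
```

Values close to 0 indicate that the group means differ strongly on the dependent variables taken together, while values close to 1 indicate little difference. For a full analysis with significance tests, a statistics package such as statsmodels (which provides a MANOVA class) is the more practical route.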