
M. Sathya Sundaram
PA2213003013027
Assignment
MA3067 - Linear Algebra & Statistics for ML

Topics:
1. Explain the applications of PCA in Machine Learning with suitable examples
2. What are visualization tools? Explain with Python code how to draw suitable graphs, charts, and
bar plots, with suitable examples
1. Explain the applications of PCA in Machine Learning with suitable examples
Principal Component Analysis
 Principal Component Analysis (PCA) is an unsupervised learning algorithm that is used for
dimensionality reduction in machine learning.
 It is a statistical process that converts observations of correlated features into a set
of linearly uncorrelated features with the help of an orthogonal transformation. These
new transformed features are called the Principal Components. PCA is one of the
popular tools used for exploratory data analysis and predictive modeling. It draws out
strong patterns from a dataset by reducing the number of dimensions while retaining as
much of the variance as possible.
 PCA generally tries to find the lower-dimensional surface to project the high-
dimensional data.
 Correlated features are very common in data analysis and data mining. Correlation
among a set of features can cause significant problems when fitting a model. Data with
uncorrelated features has many benefits, such as:
1. The learning algorithm will be faster
2. Interpretability will be higher
3. Bias will be lower
 PCA works by considering the variance of each attribute, because attributes with high
variance tend to give a good split between the classes; keeping only those directions reduces the dimensionality.
 Some real-world applications of PCA are image processing, movie recommendation
systems, and optimizing power allocation in various communication channels.
 The PCA algorithm is based on mathematical concepts such as:
o Variance and covariance
o Eigenvalues and eigenvectors
HOW DO YOU DO A PRINCIPAL COMPONENT ANALYSIS?
1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
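The five steps above can be sketched directly in NumPy. The dataset here is synthetic and the choice of k = 2 components is purely illustrative:

```python
import numpy as np

# Toy data: 100 observations of 3 correlated variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))

# 1. Standardize each variable to mean 0, variance 1
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(Xs, rowvar=False)

# 3. Eigenvectors and eigenvalues (eigh handles symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort from highest to lowest eigenvalue and keep k components
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]          # the "feature vector" of kept components

# 5. Recast the data along the principal component axes
scores = Xs @ W
print(scores.shape)         # (100, 2)
```

Note that after step 5 the covariance matrix of the scores is diagonal: the principal components are uncorrelated by construction.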
Imagine several points plotted on a 2-D plane. There are two principal
components: PC1, the primary principal component, points in the direction that explains the
maximum variance in the data; PC2 is orthogonal to PC1 and explains the remaining variance.
The mathematical representation of dimensionality reduction in the context of PCA is as
follows:
Given a dataset with n observations and p variables represented by the n x p data matrix X,
the goal of PCA is to transform the original variables into a new set of k variables called
principal components that capture the most significant variation in the data. The principal
components are defined as linear combinations of the original variables given by:
PC_1 = a_11 * x_1 + a_12 * x_2 + ... + a_1p * x_p
PC_2 = a_21 * x_1 + a_22 * x_2 + ... + a_2p * x_p
...
PC_k = a_k1 * x_1 + a_k2 * x_2 + ... + a_kp * x_p

How Does Principal Component Analysis Work?


1. Normalize the Data
Standardize the data before performing PCA. This will ensure that each feature has a mean =
0 and variance = 1.

2. Build the Covariance Matrix


Construct a square matrix to express the correlation between two or more features in a
multidimensional dataset.

3. Find the Eigenvectors and Eigenvalues


Calculate the eigenvectors (unit vectors) and eigenvalues of the covariance matrix. An
eigenvalue is the scalar by which the covariance matrix stretches its eigenvector; it measures
the variance captured along that direction.
4. Sort the Eigenvectors in Highest to Lowest Order and Select the Number of Principal
Components.
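In practice these four steps are usually delegated to a library. A minimal sketch using scikit-learn's PCA on the Iris dataset (the choice of two components is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 flowers, 4 measurements each

Xs = StandardScaler().fit_transform(X)    # step 1: normalize
pca = PCA(n_components=2)                 # steps 2-4 happen inside fit()
scores = pca.fit_transform(Xs)

print(scores.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)      # share of variance per component
```

`explained_variance_ratio_` shows how much of the total variance each kept component captures, which is how the number of components is usually chosen.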
Applications of PCA in Machine Learning

 PCA is used to visualize multidimensional data.


 It is used to reduce the number of dimensions in healthcare data.
 PCA can help resize an image.
 It can be used in finance to analyze stock data and forecast returns.
 PCA helps to find patterns in high-dimensional datasets.

1. Neuroscience:
 A technique known as spike-triggered covariance analysis uses a variant of
Principal Component Analysis in neuroscience to identify the specific properties of
a stimulus that increase a neuron's probability of generating an action potential.
 PCA is also used to find the identity of a neuron from the shape of its action
potential.

 PCA as a dimension-reduction technique is used to detect coordinated activities of large
neuronal ensembles. It has been used to determine collective variables, that is, order
parameters, during phase transitions in the brain.
2. Quantitative Finance
PCA is a methodology to reduce the dimensionality of a complex problem. Say a fund
manager has 200 stocks in his portfolio. To analyze these stocks quantitatively, he would
require a correlation matrix of size 200 x 200, which makes the problem very complex.

However, if he were to extract 10 principal components that best represent the variance in
the stocks, this would reduce the complexity of the problem while still explaining the
movement of all 200 stocks. Some other applications of PCA include:

 Analyzing the shape of the yield curve

 Hedging fixed income portfolios

 Implementation of interest rate models

 Forecasting portfolio returns

 Developing asset allocation algorithms

 Developing long short equity trading algorithms


3. Image Compression
PCA is also used for image compression: an image can be projected onto its top principal components, stored using far fewer numbers, and reconstructed with only a small loss of detail.
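A rough sketch of the idea, using scikit-learn's small built-in digits images (keeping 16 of 64 components is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                       # 1797 8x8 images, 64 pixels each

pca = PCA(n_components=16).fit(X)            # keep 16 of 64 components
X_small = pca.transform(X)                   # each image now stored as 16 numbers
X_restored = pca.inverse_transform(X_small)  # approximate reconstruction

mse = np.mean((X - X_restored) ** 2)
kept = pca.explained_variance_ratio_.sum()   # fraction of variance retained
print(X_small.shape, round(kept, 3))
```

The storage drops to a quarter of the original, while most of the variance in the images is retained.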

2. What are visualization tools? Explain with Python code how to draw suitable graphs, charts, and
bar plots, with suitable examples
Data visualization is a crucial aspect of machine learning that enables analysts to understand
and make sense of data patterns, relationships, and trends. Through data visualization,
insights and patterns in data can be easily interpreted and communicated to a wider audience,
making it a critical component of machine learning.
1. Line Chart
2. Scatter Plot
3. Bar Chart
4. Pie Chart
5. Box Plot
6. Histogram
 Matplotlib and Seaborn are Python libraries that are used for data visualization.
Line Charts
A Line chart is a graph that represents information as a series of data points connected by a
straight line. In line charts, each data point or marker is plotted and connected with a line or
curve. 
Let's consider the apple yield (tons per hectare) in Kanto. Let's plot a line graph using this
data and see how the yield of apples changes over time. We start by importing Matplotlib and
Seaborn.

Using Matplotlib
We are using random data points to represent the yield of apples. 

To better understand the graph and its purpose, we can add the x-axis values too.
Let's add labels to the axes so that we can show what each axis represents.  

  
To plot multiple datasets on the same graph, just use the plt.plot function once for each
dataset. Let's use this to compare the yields of apples vs. oranges on the same graph.
We can add a legend which tells us what each line in our graph means. To understand what
we are plotting, we can add a title to our graph.

  
To show each data point on our graph, we can highlight it with a marker using the marker
argument. Matplotlib provides many different marker shapes, such as circle, cross, square,
and diamond.

You can use the plt.figure function to change the size of the figure.
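Putting these steps together, a minimal sketch follows. The yield numbers are made up for illustration, standing in for the random data points mentioned above:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit these two lines in a notebook
import matplotlib.pyplot as plt

years = list(range(2010, 2017))
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934]    # made-up yields
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.907, 0.904]  # made-up yields

plt.figure(figsize=(9, 5))                 # change the size of the figure
plt.plot(years, apples, marker="o")        # markers highlight each data point
plt.plot(years, oranges, marker="x")
plt.xlabel("Year")                         # label the axes
plt.ylabel("Yield (tons per hectare)")
plt.title("Crop Yields in Kanto")          # title explains what we are plotting
plt.legend(["Apples", "Oranges"])          # legend tells us what each line means
plt.savefig("yields.png")
```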
Using Seaborn
An easy way to make your charts look beautiful is to use some default styles from the
Seaborn library. These can be applied globally using the sns.set_style function.
We can also use the darkgrid option to change the background color to a darker shade.
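A short sketch of applying the darkgrid style before plotting:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")        # grey background with white grid lines, set globally
plt.plot([2010, 2011, 2012, 2013], [0.91, 0.92, 0.93, 0.93])
plt.savefig("styled.png")
```

Once `set_style` is called, every subsequent Matplotlib figure picks up the Seaborn styling.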
Bar Graphs
When you have categorical data, you can represent it with a bar graph. A bar graph plots data
with the help of bars, which represent value on the y-axis and category on the x-axis. Bar
graphs use bars with varying heights to show the data which belongs to a specific category.
We can also stack bars on top of each other. Let's plot the data for apples and oranges.
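A sketch of a stacked bar chart, reusing the made-up yield numbers from the line chart example; the `bottom` argument places one set of bars on top of the other:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(2010, 2017)
apples = np.array([0.895, 0.91, 0.919, 0.926, 0.929, 0.931, 0.934])
oranges = np.array([0.962, 0.941, 0.930, 0.923, 0.918, 0.907, 0.904])

plt.bar(years, apples)
plt.bar(years, oranges, bottom=apples)   # stack oranges on top of apples
plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")
plt.legend(["Apples", "Oranges"])
plt.savefig("stacked_bars.png")
```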

Let’s use the tips dataset in Seaborn next. The dataset consists of :
 Information about the sex (gender)
 Time of day
 Total bill
 Tips given by customers visiting the restaurant for a week

We can draw a bar chart to visualize how the average bill amount varies across different days
of the week. We can do this by computing the day-wise averages and then using plt.bar. The
Seaborn library also provides a barplot function that can automatically compute averages.
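A sketch with Seaborn's barplot follows. The DataFrame here is a small hand-made stand-in for the real tips dataset, since `sns.load_dataset("tips")` downloads it from the internet:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made stand-in for Seaborn's `tips` dataset (same column names)
tips = pd.DataFrame({
    "day":        ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [17.5,   20.1,   15.0,  13.4,  25.3,  30.1,  22.4,  28.8],
    "sex":        ["Male", "Female"] * 4,
})

# barplot computes the day-wise average bill automatically
sns.barplot(data=tips, x="day", y="total_bill")
plt.ylabel("Average total bill")
plt.savefig("bills.png")
```

With `plt.bar` we would have had to call `tips.groupby("day")["total_bill"].mean()` ourselves first; `sns.barplot` does that aggregation for us.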
Histograms
A Histogram is a bar representation of data that varies over a range. It plots the number of
data points that fall within each range (bin) along the y-axis, with the ranges along the
x-axis. Let's again use the ‘Iris’ data, which contains information about
flowers, to plot histograms.

Now, let’s plot a histogram using the hist() function.


We can control the number or size of bins too.

We can change the number and size of bins using numpy too.
We can create bins of unequal size too.

Similar to line charts, we can draw multiple histograms in a single chart. We can reduce each
histogram's opacity so that one histogram's bars don't hide the others'. Let's draw separate
histograms for each species of flowers.
Multiple histograms can be stacked on top of one another by setting the stacked parameter to
True.
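The histogram variations above can be sketched as follows, loading Iris through scikit-learn so the example is self-contained:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
sepal_width = iris.data[:, 1]    # column 1 holds sepal width (cm)

plt.hist(sepal_width, bins=10)   # control the number of bins
plt.xlabel("Sepal width (cm)")
plt.savefig("hist.png")

# bins of unequal size, given as explicit edges
plt.figure()
plt.hist(sepal_width, bins=[2.0, 2.5, 2.75, 3.0, 3.5, 4.5])

# one semi-transparent histogram per species, overlaid in a single chart
plt.figure()
for sp in range(3):
    plt.hist(iris.data[iris.target == sp, 1], alpha=0.5)
plt.legend(iris.target_names)
plt.savefig("hist_species.png")
```

Passing the three per-species arrays as a list to a single `plt.hist` call with `stacked=True` would stack them instead of overlaying them.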

Scatter Plots
Scatter plots are used when we have to plot two or more variables present at different
coordinates. The data is scattered all over the graph and is not confined to a range. Two or
more variables are plotted in a Scatter Plot, with each variable being represented by a
different color. Let's use the ‘Iris’ dataset to plot a Scatter Plot.

First, let’s see how many different species of flowers we have.

Let’s try plotting the data with the help of a line chart first. This is not very informative;
we cannot figure out the relationship between different data points.

A scatter plot works much better, but we still cannot differentiate data points belonging to
different categories. We can color the dots using the flower species as a hue.
Since Seaborn uses Matplotlib's plotting functions internally, we can use functions like
plt.figure and plt.title to modify the figure.
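A sketch of the colored scatter plot, again loading Iris through scikit-learn so no download is needed:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
iris = data.frame                                    # measurements + target code
iris["species"] = data.target_names[iris["target"]]  # readable species names

plt.figure(figsize=(8, 5))                           # Matplotlib functions still work
plt.title("Sepal Length vs. Sepal Width")
sns.scatterplot(data=iris, x="sepal length (cm)", y="sepal width (cm)",
                hue="species")                       # one color per species
plt.savefig("scatter.png")
```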

Heat Maps
Heatmaps are used to see changes in behavior or gradual changes in data. They use different
colors to represent different values; the way the colors range in hue and intensity tells us
how the phenomenon varies. Let's use heatmaps to visualize monthly passenger
footfall at an airport over 12 years, using the flights dataset in Seaborn.
The above dataset, flights_df shows us the monthly footfall in an airport for each year, from
1949 to 1960. The values represent the number of passengers (in thousands) that passed
through the airport. Let’s use a heatmap to visualize the above data.

 
The brighter the color, the higher the footfall at the airport. By looking at the graph, we can
infer that:
1. The footfall in any given year is highest around July and August.
2. The footfall grows annually: any month of a year has a higher footfall than the same
month in previous years.
Let's display the actual values in our heatmap and change the hue to blue.           
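A sketch of the annotated heatmap follows. The pivoted table here is a synthetic stand-in for the real flights data (monthly passengers in thousands, 1949-1960), since `sns.load_dataset("flights")` requires internet access; the growth and seasonality numbers are invented to mimic its shape:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the pivoted `flights` dataset: one row per month,
# one column per year, values = passengers in thousands
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
years = list(range(1949, 1961))
growth = np.linspace(100, 450, len(years))[:, None]            # annual growth
season = 1 + 0.3 * np.exp(-((np.arange(12) - 6.5) ** 2) / 8)   # Jul/Aug peak
flights_df = pd.DataFrame((growth * season).astype(int),
                          index=years, columns=months)

# annot=True prints each value in its cell; cmap="Blues" changes the hue to blue
sns.heatmap(flights_df.T, annot=True, fmt="d", cmap="Blues")
plt.title("Passengers (thousands)")
plt.savefig("heatmap.png")
```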
