
MATHEMATICS IN DATA SCIENCE
INTRODUCTION – DATA SCIENCE
• Data science is a multidisciplinary field that combines various techniques, tools, and algorithms to extract meaningful insights and knowledge from structured and unstructured data.
• It involves using statistical analysis, machine learning, data visualization, and other methods to uncover patterns, trends, and correlations that can inform decision-making and drive business outcomes.
• Data scientists employ a combination of skills from mathematics, statistics, programming, and domain knowledge to solve complex problems. They work with large datasets, often referred to as "big data," and leverage advanced computational techniques to process and analyze the information contained within the data.
• Data science has applications in various fields, such as business, healthcare, finance, marketing, social sciences, and many others. It has become increasingly important in today's data-driven world, as organizations strive to extract valuable insights from their data to gain a competitive edge and make informed decisions.
LINEAR ALGEBRA
• Linear algebra is a branch of mathematics that studies the properties of matrices and vector spaces.
• Linear algebra is the "mathematics" of data science, providing the structure and powerful theory needed to work with big data sets.
• Linear algebra is used in data science as follows:
APPLICATIONS OF MATHEMATICS IN DATA SCIENCE
LOSS FUNCTION
• A loss function is an application of the vector norm in linear algebra. The norm of a vector is simply its magnitude, and there are many types of vector norms.
• L1 Norm: Also known as the Manhattan Distance or Taxicab Norm. The L1 norm is the distance you would travel from the origin to the vector if the only permitted directions are parallel to the axes of the space.
• In this 2D space, you could reach the vector (3, 4) by traveling 3 units along the x-axis and then 4 units parallel to the y-axis, or 4 units along the y-axis first and then 3 units parallel to the x-axis. In either case, you travel a total of |3| + |4| = 7 units.
• L2 Norm: Also known as the Euclidean Distance. The L2 norm is the shortest distance of the vector from the origin.
• This distance is calculated using the Pythagorean theorem. For the vector (3, 4), it is the square root of (3^2 + 4^2), which is equal to 5.
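As a quick check of the two norms above, here is a minimal Python sketch using NumPy on the same vector (3, 4):

```python
# Verify the L1 and L2 norms of the vector (3, 4) discussed above.
import numpy as np

v = np.array([3, 4])

l1 = np.linalg.norm(v, ord=1)  # Manhattan distance: |3| + |4| = 7
l2 = np.linalg.norm(v, ord=2)  # Euclidean distance: sqrt(3^2 + 4^2) = 5

print(l1)  # 7.0
print(l2)  # 5.0
```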
REGULARIZATION
• Regularization is a very important concept in data science. It is a technique we use to prevent models from overfitting, and it is another application of the norm.
• A model is said to overfit when it fits the training data too well. Such a model does not perform well on new data because it has learned even the noise in the training data, so it cannot generalize to data it has not seen before.
• Regularization penalizes overly complex models by adding the norm of the weight vector to the cost function. Since we want to minimize the cost function, we also need to minimize this norm. This shrinks unnecessary components of the weight vector toward zero and prevents the prediction function from becoming overly complex.
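To make this concrete, here is a minimal Python sketch of a cost function with an L2 (ridge) penalty; the feature matrix X, target vector y, weight vector w, and strength lam are hypothetical placeholders, not from the slides:

```python
# A mean-squared-error cost with an L2 (ridge) penalty on the weights.
import numpy as np

def ridge_cost(w, X, y, lam=0.1):
    """MSE loss plus lam times the squared L2 norm of the weights."""
    residuals = X @ w - y            # prediction errors on the training data
    mse = np.mean(residuals ** 2)    # the ordinary (unregularized) loss
    penalty = lam * np.sum(w ** 2)   # squared L2 norm of the weight vector
    return mse + penalty
```

Using the L1 norm as the penalty instead gives lasso regularization, which drives some weights exactly to zero.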
COVARIANCE MATRIX
• We often want to study the relationship between pairs of variables. Covariance and correlation are measures used to study the relationship between two continuous variables.
• Covariance indicates the direction of the linear relationship between the variables. A positive covariance indicates that an increase or decrease in one variable is accompanied by the same in the other. A negative covariance indicates that an increase or decrease in one is accompanied by the opposite in the other.
• Correlation is the standardized value of covariance. A correlation value tells us both the strength and direction of the linear relationship and ranges from -1 to 1.
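Here is a minimal Python sketch computing both measures with NumPy; the two variables are made-up illustrative data:

```python
# Covariance and correlation of two made-up variables.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly y = 2x

cov = np.cov(x, y)[0, 1]        # off-diagonal entry of the 2x2 covariance matrix
corr = np.corrcoef(x, y)[0, 1]  # standardized to the range [-1, 1]

print(cov)   # positive: x and y increase together
print(corr)  # close to 1: strong linear relationship
```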
SINGULAR VALUE DECOMPOSITION
• SVD is used in dimensionality reduction; when only the largest singular values are kept, this is known as Truncated SVD.
• We start with a large m x n numerical data matrix A, where m is the number of rows and n is the number of features.
• We decompose it into 3 matrices: A = U Σ V^T, where U is m x m, Σ is an m x n diagonal matrix of singular values, and V^T is n x n.
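A minimal Python sketch of truncated SVD, using NumPy on a random matrix as stand-in data:

```python
# Rank-k approximation of a data matrix via truncated SVD.
import numpy as np

m, n, k = 100, 20, 5      # keep only the k largest singular values
A = np.random.rand(m, n)  # stand-in for a real data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of A
print(A_k.shape)  # (100, 20): same shape as A, but only rank k
```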
NATURAL LANGUAGE PROCESSING (NLP)
WORD EMBEDDINGS
• Machine learning algorithms cannot work with raw textual data. We need to convert the text into some numerical and statistical
features to create model inputs. There are many ways for engineering features from text data, such as:
• Meta attributes of a text, like word count, special character count, etc.
• NLP attributes of text using Parts-of-Speech tags and Grammar Relations like the number of proper nouns
• Word Vector Notations or Word Embeddings
• Word embeddings are a way of representing words as low-dimensional vectors of numbers while preserving their context in the document. These representations are obtained by training neural networks on a large amount of text, called a corpus. They also help in analyzing syntactic similarity among words.
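As an illustration, here is a minimal Python sketch of training word embeddings, assuming the gensim library (not mentioned in the slides); the tiny corpus is purely illustrative:

```python
# Train small word embeddings on a toy corpus with gensim's Word2Vec.
from gensim.models import Word2Vec

corpus = [
    ["data", "science", "uses", "linear", "algebra"],
    ["word", "embeddings", "represent", "words", "as", "vectors"],
    ["vectors", "preserve", "context", "in", "the", "document"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

print(model.wv["vectors"].shape)                 # (50,) — one vector per word
print(model.wv.most_similar("vectors", topn=2))  # nearest words in vector space
```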
IMAGE REPRESENTATION AS TENSORS
• How do you account for the ‘vision’ in Computer Vision? Obviously, a computer does not process images as
humans do. Machine learning algorithms need numerical features to work with.
• A digital image is made up of small, indivisible units called pixels.
• A grayscale image of the digit zero, for example, might be made of 8 x 8 = 64 pixels. Each pixel has a value in the range 0 to 255: a value of 0 represents a black pixel and 255 represents a white pixel.
• Conveniently, an m x n grayscale image can be represented as a 2D matrix with m rows and n columns, with the cells containing the respective pixel values.
• But what about a colored image? A colored image is generally stored in the RGB system: each image can be thought of as three 2D matrices, one for each of the R, G, and B channels. A pixel value of 0 in the R channel represents zero intensity of red, and 255 represents full intensity of red.
• Each pixel value is then a combination of the corresponding values in the three channels.
• In reality, instead of using 3 matrices to represent an image, a tensor is used. A tensor is a generalized n-dimensional matrix. For an RGB image, a 3rd-order tensor is used: imagine it as three 2D matrices stacked one behind another.
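A minimal Python sketch of these representations, using made-up random pixel values:

```python
# Grayscale image as a 2D matrix, RGB image as a 3rd-order tensor.
import numpy as np

# An 8 x 8 grayscale image: a 2D matrix of values in [0, 255].
gray = np.random.randint(0, 256, size=(8, 8), dtype=np.uint8)
print(gray.shape)  # (8, 8)

# An RGB image of the same size: three 8 x 8 channel matrices
# stacked one behind another.
rgb = np.random.randint(0, 256, size=(8, 8, 3), dtype=np.uint8)
print(rgb.shape)   # (8, 8, 3)

# Each pixel combines the corresponding values from the three channels.
print(rgb[0, 0])   # [R, G, B] values of the top-left pixel
```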
REAL-LIFE APPLICATIONS
STATISTICS
• Statistics is an inherently necessary component of data science.
• Statistics is used to predict the weather, restock retail shelves, estimate the condition of the economy, and much more.
• Data scientists use statistics to gather, review, analyze, and draw conclusions from data, as well as apply quantified mathematical models to appropriate variables.
ROBOTICS
• Traditionally, reprogramming a robot for a new function or preparing it for a new real-time, vision-oriented task was time-consuming.
• Data scientists who rely on AI and machine learning have learned to work with robots that evolve: acquiring new behavior through labeled data, learning to identify errors in existing data, and so on. As a result, the scientist's task becomes easier, and robots can evolve with little human intervention.
CHEMICAL SCIENCES AND ENGINEERING
• Chemical sciences and engineering have also used data science tools to, for example, monitor and control chemical processes, predict activity depending on chemical structures or properties, and inform business and research decisions.
• Data-driven science is an iterative process: (1) identify a database; (2) eliminate redundancies, reduce large uncertainties, and describe or annotate the data; (3) use data science methods to develop and validate a data-driven model that can examine correlations.
GENOMICS
• Genomic data science is a field of study that enables researchers to use powerful computational and statistical methods to decode the functional information hidden in DNA sequences.
• Genomic data science emerged as a field in the 1990s to bring together two laboratory activities:
• Experimentation: generating genomic information by studying the genomes of living organisms.
• Data analysis: using statistical and computational tools to analyze and visualize genomic data, which includes processing and storing data and using algorithms and software to make predictions based on available genomic data.

OTHER FIELDS
• IMAGE PROCESSING
• QUANTUM PHYSICS
• NEURAL NETWORKS
• PRINCIPAL COMPONENT ANALYSIS (PCA)
• SUPPORT VECTOR MACHINE CLASSIFICATION
THANK YOU
