Lecture2 Math ML Review

- The document provides an overview of probability, statistics, and linear algebra concepts that are important foundations for machine learning. - It discusses basic probability concepts like probability spaces, random variables, distributions, and expectations. It also covers common distributions like Bernoulli, Poisson, and Gaussian. - Linear algebra concepts covered include matrix operations, properties of matrices like identity and diagonal matrices, and matrix norms. - The document is intended as a review of math and foundational concepts for a machine learning course.

Uploaded by

LishanZhu

Deep Learning

Math and ML Review


Shih Yu Chang
Today’s Lecture
• Probability & Statistics

• Linear Algebra

• ML Basic Review
Why Probability in ML
• What is probability?
• The mathematical study of uncertainty.
• Probability plays a key role in machine learning, as the design of
learning algorithms often relies on probabilistic assumptions about the
data.
• This review covers the basic probability theory that is a
prerequisite for this course.
Basic Concepts: Probability Space
• When we speak about probability, we often refer to the probability of an
event of uncertain nature taking place, e.g., the chance of each outcome
when tossing a coin.
Basic Concepts: Probability Measure
Probability Space, Example
• Let’s throw a fair die:
• What are the possible outcomes?
• If we are interested in odd or even events, what is the event space?
• What is the probability of getting an odd number?
• What is the probability of getting an even number?
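The questions above can be answered by enumeration. The following is a minimal sketch, assuming a six-sided die whose faces are equally likely; the names `outcomes`, `event_space`, and `prob` are illustrative and not from the lecture.

```python
from fractions import Fraction

# Outcome space of one fair six-sided die (assumed: all faces equally likely).
outcomes = [1, 2, 3, 4, 5, 6]

# Events are subsets of the outcome space.
odd = {i for i in outcomes if i % 2 == 1}
even = {i for i in outcomes if i % 2 == 0}

# Event space for the odd/even question: empty set, odd, even, whole space.
event_space = [set(), odd, even, set(outcomes)]

def prob(event):
    """Probability of an event under the uniform measure on `outcomes`."""
    return Fraction(len(event), len(outcomes))

print(prob(odd), prob(even))  # 1/2 1/2
```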
Basic Concepts: Random Variables
• A random variable (RV) is not a variable.
• It is a function that maps outcomes (in the outcome space) to real
values.
• Why? Abstracting events into numbers lets us apply a mathematical
framework to random events.
Random Variables : Examples
• For a fair die, define the random variable X as X(i) = 0, where “i” is the
number on the face that comes up. What is the probability of
X = 0, denoted by P(X = 0) or PX(0)?
• For a fair die, define the random variable Y as Y(i) = 0 if “i” is even and
Y(i) = 1 if “i” is odd. What is the probability of Y = 1?
• For two fair dice, define the random variable Z as Z(i, j) = i + j. What is
the probability of Z = 3?
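The three examples above can be worked out by treating each random variable as a function on the outcome space. This is a sketch under the assumption that all outcomes are equally likely; the helper `pmf` is a hypothetical name introduced here.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def pmf(rv, outcomes):
    """Probability mass function of rv when all outcomes are equally likely."""
    outcomes = list(outcomes)
    table = {}
    for w in outcomes:
        value = rv(w)
        table[value] = table.get(value, 0) + Fraction(1, len(outcomes))
    return table

# X maps every face to 0, so P(X = 0) = 1.
p_x = pmf(lambda i: 0, faces)
# Y is 0 on even faces and 1 on odd faces.
p_y = pmf(lambda i: 0 if i % 2 == 0 else 1, faces)
# Z is the sum of two dice; outcomes are ordered pairs (i, j).
p_z = pmf(lambda ij: ij[0] + ij[1], product(faces, repeat=2))

print(p_x[0], p_y[1], p_z[3])  # 1 1/2 1/18
```

Only the pairs (1, 2) and (2, 1) give Z = 3, which is why the last value is 2/36.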
Basic Concepts: Joint and Marginal Distributions
• What if your problem involves more than one RV?
Basic Concepts: Conditional Distributions
• They specify the distribution of a random variable when the value of
another random variable is known.
• In other words, when some events are known to be true.

• Suppose we know that a die throw was odd, and want to know the
probability that a “one” was thrown. Can you compute it?
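The die question can be answered with the definition of conditional probability, P(A | B) = P(A and B) / P(B). A minimal sketch, assuming a fair six-sided die; the variable names are illustrative.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
odd = [i for i in faces if i % 2 == 1]

# P(one | odd) = P(one and odd) / P(odd)
p_odd = Fraction(len(odd), len(faces))
p_one_and_odd = Fraction(1, len(faces))  # "one" is itself an odd face
p_one_given_odd = p_one_and_odd / p_odd

print(p_one_given_odd)  # 1/3
```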
Basic Concepts: Independence
• Independence means that the distribution of a random variable does
not change on learning the value of another random variable.
• In machine learning, we often make such assumptions about our
data, for example, training samples are assumed to be drawn
independently from some underlying space.

• (Conditional independence): X and Y are conditionally independent
given Z. This is the assumption used in Naïve Bayes.
Basic Concepts: Chain Rule and Bayes Rule
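• How do we build a joint distribution from the distributions of single
RVs? By chaining conditionals (chain rule):
p(x1, ..., xn) = p(x1) p(x2 | x1) ... p(xn | x1, ..., xn-1).

• A conditional probability can be inverted with Bayes rule:
p(y | x) = p(x | y) p(y) / p(x).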


Why Bayes Rule, a Diagnosis story

• We would like to determine the probability p(measles|red spots).
• There are too many different possible causes of red spots in a
person to estimate this probability directly.
Why Bayes Rule, cont
• Here H is the hypothesis (measles) and E is the evidence (red spots).
• p(E|H) is the probability that one has
red spots given that one has
measles. An expert in infectious
diseases may well know this value.
• p(H) is simply the probability that
someone has measles, without
considering any evidence; that’s just
the prevalence of measles in the
population.
• p(E) is the probability of the
evidence: what’s the probability that
someone has red spots—again,
simply the prevalence of red spots in
the population.
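The three quantities above combine through Bayes rule, p(H|E) = p(E|H) p(H) / p(E). The sketch below uses made-up numbers purely for illustration (they are assumptions, not real epidemiological data), and `posterior` is a hypothetical helper name.

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes rule: p(H | E) = p(E | H) * p(H) / p(E)."""
    return p_e_given_h * p_h / p_e

# Hypothetical numbers for illustration only, not real epidemiology.
p_spots_given_measles = 0.9   # p(E | H)
p_measles = 0.001             # p(H), prevalence of measles
p_spots = 0.05                # p(E), prevalence of red spots

p_measles_given_spots = posterior(p_spots_given_measles, p_measles, p_spots)
print(round(p_measles_given_spots, 3))  # 0.018
```

With these assumed numbers, measles stays unlikely even after seeing red spots, because the prior p(H) is so small.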
Probability Distribution
• A probability distribution is a mathematical function that describes
the probabilities of occurrence of different possible outcomes in an
experiment.
• In more technical terms, the probability distribution is a description of
a random phenomenon in terms of the probabilities of events.
• There are two basic categories: discrete and continuous.
Probability Distribution : Discrete
• By a discrete distribution, we mean that the random variable of the
underlying distribution can take on only finitely or countably many
different values.
• The outcome space is finite or countably infinite.
• A probability mass function is used to define a discrete distribution.
Probability Distribution : Continuous
• By a continuous distribution, we mean that the random variable of
the underlying distribution can take on uncountably many different
values, e.g. any real number in an interval.
• The outcome space is uncountably infinite.
Expectations and Variance : Expectations
• One of the most common operations we perform on a random variable
is to compute its expectation, also known as its mean, expected
value, or first moment.

• Let X be the outcome of rolling a fair die. What is its expectation?
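For a discrete random variable, the expectation is the probability-weighted sum of its values, E[X] = Σ x P(X = x). A minimal sketch for the die question, assuming each face has probability 1/6:

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # probability of each face of a fair die

# E[X] = sum over x of x * P(X = x)
expectation = sum(x * p for x in faces)
print(expectation)  # 7/2
```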


Expectations, cont
• Linearity of expectation: E[aX + bY] = a E[X] + b E[Y] for any constants a and b.

• Let X and Y be independent RVs, then E[XY] = E[X] E[Y].


Expectations and Variance : Variance
• The variance of a distribution is a measure of the “spread” of a
distribution. It is also called the second central moment.

• The covariance of two random variables is a measure of how “closely
related” the two random variables are.
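Variance and covariance can be computed directly from their definitions, Var[X] = E[(X − E[X])²] and Cov[X, Y] = E[(X − E[X])(Y − E[Y])]. The sketch below assumes two independent fair dice; the helper `expect` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
pairs = list(product(faces, repeat=2))   # outcomes of two independent fair dice
p = Fraction(1, len(pairs))

def expect(f):
    """E[f(X, Y)] under the uniform distribution on `pairs`."""
    return sum(f(x, y) * p for x, y in pairs)

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
mean_s = expect(lambda x, y: x + y)

# Var[X] = E[(X - E[X])^2]
var_x = expect(lambda x, y: (x - mean_x) ** 2)
# Cov[X, Y] is zero here because the two dice are independent.
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))
# Cov[X, X + Y] is positive: the sum moves together with the first die.
cov_xs = expect(lambda x, y: (x - mean_x) * (x + y - mean_s))

print(var_x, cov_xy, cov_xs)  # 35/12 0 35/12
```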
Popular Distributions : Bernoulli
• Only two possible outcomes, {0, 1}, with P(X = 1) = p and P(X = 0) = 1 − p.
• It is often used to indicate whether a trial is successful or not.
• In binary classification, labels are distributed as Bernoulli.
Popular Distributions : Poisson
• It gives the probability of a given number of events happening over a
fixed period of time, given a fixed average rate of occurrence, and
that the events take place independently of the time since the last
event.
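The Poisson probability mass function is P(X = k) = e^(−λ) λ^k / k!, where λ is the average number of events per period. A minimal sketch; the rate `lam = 3.0` is an arbitrary assumption chosen for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with average rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0  # hypothetical average number of events per period
probs = [poisson_pmf(k, lam) for k in range(50)]

# The probabilities sum to (almost exactly) 1 and the mean equals the rate.
total = sum(probs)
mean = sum(k * pk for k, pk in enumerate(probs))
```

Truncating the sum at k = 49 is safe here because the tail beyond that point is negligible for λ = 3.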
Popular Distributions : Gaussian
• The Gaussian distribution, also known as the normal distribution,
appears in a wide variety of contexts.
• It is closely tied to the Central Limit Theorem: suitably normalized
sums of many independent random variables are approximately Gaussian.
• For many problems, we also often assume that the noise in the
system is Gaussian distributed.
Popular Distributions : Gaussian
Log Trick
• In machine learning, we often assume the independence of different samples.
• Therefore, we often have to deal with the product of a large number of
distributions. It is often easier if we first work with the logarithm of such
functions.
• As the logarithm is a strictly increasing function, it does not affect where
the maximum is located.

log ∏i p(xi) = Σi log p(xi). Now we can work on ONE sample at a time.
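One practical reason for the log trick is numerical: a product of many small probabilities underflows in floating point, while the sum of their logarithms does not. A minimal sketch, assuming 100 independent samples that each have a made-up likelihood of 1e-5:

```python
import math

# Hypothetical per-sample likelihoods p(x_i); each is a small positive number.
likelihoods = [1e-5] * 100

# The direct product underflows to 0.0 in double precision.
product = 1.0
for p in likelihoods:
    product *= p

# The log turns the product into a sum, one term per sample.
log_likelihood = sum(math.log(p) for p in likelihoods)

print(product)         # 0.0
print(log_likelihood)  # about -1151.29
```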


Jensen’s Inequality
• We often need to transform an RV by another function.
• It is often hard, or even impossible, to compute the expectation of the
transformed RV exactly.
• However, if the transform is convex or concave, Jensen’s inequality helps us
to derive a bound by evaluating the value of the function at the
expectation of the random variable itself.

If f is convex (∪ shape): E[f(X)] ≥ f(E[X])

If f is concave (∩ shape): E[f(X)] ≤ f(E[X])
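Jensen’s inequality states that E[f(X)] ≥ f(E[X]) for convex f and E[f(X)] ≤ f(E[X]) for concave f. The sketch below checks both directions numerically for a fair die, using x² as an example convex function and log as an example concave one; these choices are illustrative assumptions.

```python
import math
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)
mean = sum(x * p for x in faces)                 # E[X] = 7/2

# Convex f(x) = x^2: E[f(X)] >= f(E[X]).
e_square = sum(x * x * p for x in faces)         # 91/6
assert e_square >= mean ** 2                     # 91/6 >= 49/4

# Concave f(x) = log(x): E[f(X)] <= f(E[X]).
e_log = sum(math.log(x) for x in faces) / 6
assert e_log <= math.log(mean)
```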


Statistical Hypothesis: quick go through, I
A principal at a certain school claims that the students in his school are
of above-average intelligence. A random sample of thirty students’ IQ
scores has a mean score of 112.5. Is there sufficient evidence to support
the principal’s claim? The mean population IQ is 100 with a standard
deviation of 15.
• Step 1: State the Null hypothesis. The accepted fact is that the
population mean is 100, so: H0: μ=100.
• Step 2: State the Alternate Hypothesis. The claim is that the students
have above average IQ scores, so: H1: μ > 100. “greater than” means
that this is a one-tailed test.

http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/
Statistical Hypothesis: quick go through, II

• Step 3: Draw a picture to help you visualize the problem.
• Step 4: State the alpha level. If you aren’t given an
alpha level, use 5% (0.05).
• Step 5: Find the rejection region area (given by your
alpha level above) from the z-table. An area of .05
is equal to a z-score of 1.645.
• Step 6: Find the test statistic using this formula:
z = (x̄ − μ0) / (σ/√n), where x̄ is the sample mean, μ0 the mean under
H0, σ the population standard deviation, and n the sample size.
For this set of data: z = (112.5 − 100) / (15/√30) = 4.56.
• Step 7: If Step 6 is greater than Step 5, reject the
null hypothesis. If it’s less than Step 5, you cannot
reject the null hypothesis. In this case, it is greater
(4.56 > 1.645), so you can reject the null.
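Steps 4 to 7 can be reproduced in a few lines. This sketch uses the numbers from Step 6 and the standard library’s `statistics.NormalDist` in place of a printed z-table; the variable names are illustrative.

```python
import math
from statistics import NormalDist

sample_mean = 112.5   # mean IQ of the sample
mu0 = 100.0           # population mean under H0
sigma = 15.0          # population standard deviation
n = 30                # sample size
alpha = 0.05          # significance level

# Test statistic: z = (sample mean - mu0) / (sigma / sqrt(n))
z = (sample_mean - mu0) / (sigma / math.sqrt(n))

# One-tailed critical value: the z-score with area alpha to its right.
z_crit = NormalDist().inv_cdf(1 - alpha)

reject_h0 = z > z_crit
print(round(z, 2), round(z_crit, 3), reject_h0)  # 4.56 1.645 True
```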

http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/
Today’s Lecture
• Probability & Statistics

• Linear Algebra
Basic Concepts and Notations: What is LA
• Linear algebra is the branch of mathematics concerning linear
equations and linear functions (mappings).

• Many linear ML models, such as SVMs, linear models (LM), and logistic
regression, are based on linear algebra.
Basic Concepts and Notations : Notations, I
Basic Concepts and Notations : Notations, II
Matrix Product
Matrix Product : Properties

• Matrix multiplication is associative: (AB)C = A(BC).

• Matrix multiplication is distributive: A(B + C) = AB + AC.

• Matrix multiplication is, in general, not commutative; that is, it can be
the case that AB ≠ BA.
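The three properties can be checked on small examples. The sketch below uses plain Python lists of rows rather than a numerical library; `matmul` and `matadd` are hypothetical helper names, and the matrices are arbitrary integer examples.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matadd(a, b):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Small integer matrices chosen for illustration.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

associative = matmul(matmul(A, B), C) == matmul(A, matmul(B, C))
distributive = matmul(A, matadd(B, C)) == matadd(matmul(A, B), matmul(A, C))
commutes = matmul(A, B) == matmul(B, A)

print(associative, distributive, commutes)  # True True False
```

Here AB swaps the columns of A while BA swaps its rows, which is why the two products differ.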
Matrix Properties : Identity and Diagonal Matrices

• An identity matrix is a square matrix with ones on the diagonal and
zeros everywhere else.

• A diagonal matrix is a matrix where all non-diagonal elements are 0.
Matrix Properties : Transpose

• The transpose of a matrix results from “flipping” the rows and columns.
Matrix Properties : Symmetric Matrices
Matrix Properties : Trace
Matrix Properties : Norms
• The norm of a matrix is a length-like measure of a matrix.
Matrix Properties : Linear Independence and Rank
• A set of vectors {x1, x2, . . . xn} is said to be linearly independent if no
vector can be represented as a linear combination of the remaining
vectors.

• The column rank of a matrix A is the largest number of columns of A that
constitute a linearly independent set.
• In the same way, the row rank is the largest number of rows of A that
constitute a linearly independent set.
• For any matrix, the column rank equals the row rank, and this common
value is called the rank of A.
Matrix Properties : Inverse
Matrix Properties : Orthogonal Matrices

Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will
not change its Euclidean norm
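The norm-preservation property can be checked numerically with a 2-D rotation matrix, which is a standard example of an orthogonal matrix. This is a sketch in plain Python; the angle and the vector are arbitrary assumptions.

```python
import math

theta = 0.7  # arbitrary rotation angle in radians
# A 2-D rotation matrix is orthogonal: U^T U = I.
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

x = [3.0, 4.0]
print(norm(x))                                    # 5.0
print(math.isclose(norm(matvec(U, x)), norm(x)))  # True
```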
Matrix Properties : Determinant, I
The absolute value of the determinant of an n × n matrix is the n-dimensional
volume of the parallelepiped spanned by its n rows. 1-dim case: |[a]| = a,
where |·| denotes the determinant.

Relation with matrix inversion:


Matrix Properties : Determinant, II

More Properties about Determinant and Their Proofs

http://linear.ups.edu/html/section-PDM.html
Matrix Properties : Quadratic Forms and PD
Matrix Properties : Eigenvalues and Eigenvectors
Matrix Calculations : Gradient
Gradient and Hessian for Quadratic Functions, I
Gradient and Hessian for Quadratic Functions, II

Recap: this will be used soon in the Linear Model lecture…


Gradient of Determinant

Remember the log trick; why not A^{-T}?


Eigenvalues as Optimization
Take-Home Message
• Probability & Statistics :
Basic Probability Concepts, Probability Distribution, Expectations and
Variance, Popular Distributions, Statistical Hypothesis.
(https://www.quora.com/What-are-some-good-websites-where-I-can-learn-probability-and-statistics-at-advanced-levels)

• Linear Algebra :
Basic LA Concepts, Matrix Products, Matrix Properties, Matrix Calculations.
(https://www.quora.com/Books-What-is-the-best-book-for-learning-Linear-Algebra)
Today’s Lecture
• Probability & Statistics

• Linear Algebra

• ML Basic Review
Simple Questionnaire
How many people have heard about Machine Learning?

How many people know about Machine Learning?

How many people are using Machine Learning?


What is ML
• The name derives from the idea that it deals with the “construction and
study of systems that can learn from data”.
• It can be considered a set of building blocks for making computers behave
more intelligently.
• It is a theoretical concept; there are various techniques with different
implementations.
• The process of learning begins with observations or data, such as examples,
direct experience, or instruction.
• The primary aim is to allow computers to learn automatically, without
human intervention or assistance, and to adjust their actions accordingly.
In Summary….
Machine learning is the design of artificial systems that can learn
from data and improve themselves automatically through experience.
Terminology
• Features
– Features are the distinct traits that can be used to describe each item in a
quantitative or indicator manner.
• Samples
– Samples are the items to process (e.g. classify). A sample can be a document, a
picture, a sound, a video, a row in a database or CSV file, or whatever you can
describe with a fixed set of quantitative or indicator traits.
• Feature vector
– An n-dimensional vector of quantitative or indicator traits that represents some
object.
• Feature extraction
– Preparation of the feature vector.
– Transforms the data in a high-dimensional space to a space of fewer dimensions.
• Training/Evaluation set
– Set of data used to discover potential relationships.
Intuition Example
What do you mean

Apple
Learning (Training)
What is Model in ML
How Machine Learn
ML Types
• Supervised Learning

• Unsupervised Learning

• Semi-Supervised Learning

• Reinforcement Learning
Supervised Learning
• the correct classes of the training data are known

http://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/
Unsupervised Learning
• the correct classes of the training data are not known

http://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/
Semi-Supervised Learning
• A Mix of Supervised and Unsupervised learning
Reinforcement Learning
• Learn online based on feedback

http://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/
Supervised vs. Unsupervised
• Supervised:
• Try to predict a specific quantity
• Have training samples with labels
• Can measure accuracy directly
• Unsupervised:
• Try to understand the data
• Look for patterns rather than a specified target (as in supervised learning)
• Evaluation is usually indirect or qualitative
• Semi-supervised:
• Use unsupervised methods to improve supervised algorithms
• Few labeled samples with lots of unlabeled ones
ML Techniques
• classification: assign a class from observations

• clustering: group observations into groups by pre-defined metrics

• regression (prediction): predict a value from observations
Classification
• Assign a piece of information to a predefined category, where the
information can be text, images, music, any meaningful signs, etc.
• Steps:
– Step 1: Train the program (build a model) using a set of
training samples with labeled categories.
– Step 2: The classifier computes, for each sample, the
probability that it belongs to each of the
considered categories.
– Step 3: Test the model against a test data set.
• A popular classifier is the Naive Bayes classifier.
• For text classification, samples will be words, and categories can be
the text’s categories, e.g., sports news, politics news, etc.
Clustering
• Clustering is the task of grouping a set of samples in such a way that
samples in the same group (called a cluster) are more similar to each
other than to those in other groups.
• Classes are not predefined.
• For example, these keywords
– “man’s watch”
– “women’s watch”
– “women’s t-shirt”
– “man’s t-shirt”
– can be clustered into 2 categories, “watch” and “t-shirt” or
“man” and “women”
• Popular methods are K-means clustering and hierarchical
clustering.
K-means Clustering
• Partitions n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype
of the cluster.
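The definition above can be sketched with Lloyd’s algorithm, a common way to compute a k-means partition (the lecture does not say which algorithm it has in mind). The example assumes 2-D points, squared Euclidean distance, and hand-picked initial centers; `kmeans` is a hypothetical helper name.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on 2-D points with given initial centers."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(clusters[0])  # [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
```

In practice the result depends on the initial centers, so implementations usually run several random initializations and keep the best one.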
Hierarchical Clustering
• A method of cluster analysis which seeks to build a
hierarchy of clusters.
• There are two strategies:
– Agglomerative: "bottom up" approach, each
observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy.
– Divisive: "top down" approach, all observations start
in one cluster, and splits are performed recursively as
one moves down the hierarchy.
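The agglomerative strategy can be sketched as repeated merging of the two closest clusters. This example assumes 1-D observations and single linkage (cluster distance is the distance between the closest members), which is one of several possible linkage choices; `agglomerative` is a hypothetical helper name.

```python
def agglomerative(values, num_clusters):
    """Bottom-up clustering of 1-D values with single linkage."""
    clusters = [[v] for v in values]          # every observation starts alone
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the pair
        del clusters[j]
    return clusters

print(agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], 3))
# [[1.0, 1.5], [5.0, 5.2], [9.0]]
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the full hierarchy.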
Regression
• Regression measures the relation
between the mean value of one
variable (e.g. output) and
associated values of other
variables (e.g. time and cost).
• Regression analysis is a statistical
process for estimating the
relationships among variables.
• Popular methods are linear regression and
logistic regression (binary regression).
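As an illustration of linear regression, the sketch below fits a straight line by ordinary least squares using the closed-form slope and intercept. The data are made up and noise-free, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = Cov(x, y) / Var(x), written with unnormalized sums
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 1 + 2x, no noise
intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # 1.0 2.0
```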
Classification vs Regression
• Classification:
– Classification means grouping the output into a class.
– Example: predicting the type of a tumor, i.e. harmful or not harmful,
using training data.
– If the output is a discrete/categorical variable, then it is a
classification problem.
• Regression:
– Regression means predicting the output value using training data.
– Example: predicting a house price from training data.
– If the output is a real number/continuous, then it is a regression problem.
ML History
ML Applications, I
• Spam Email Detection
• Machine Translation (Language Translation)
• Image Search (Similarity)
• Clustering (KMeans) : Amazon Recommendations
• Classification : Google News
ML Applications, II
• Text Summarization - Google News
• Rating a Review/Comment: Yelp
• Fraud detection : Credit card Providers
• Decision Making : e.g. Bank/Insurance sector
• Sentiment Analysis
• Speech Understanding – iPhone with Siri
• Face Detection – Facebook’s Photo tagging
Spam Detection

Not a Spam
Auto Driving

https://www.slideshare.net/JunliGu/autonomous-driving-revolution-trends-challenges-and-machine-learning
Content Generation (Your Individual Project)

https://www.boredpanda.com/computer-deep-learning-algorithm-painting-masters/
Why ML is Challenging

• Can your problem be framed as an ML problem?
• How to get enough qualified data
• How to extract useful features
• How to select a proper model
• How to tune and evaluate the proposed model
• How to adapt your model for new data

Figure: the depiction of choices in designing a checker-playing learning system.
Taken from “Machine Learning”, 1997.
Quiz
