TensorFlow Regression
Neural networks are the most common type of deep learning algorithm
Estimators
Understanding the Foundations of TensorFlow
Mammals: members of the infraorder Cetacea
Fish: look like fish, swim like fish, move with fish
Rule-based Binary Classifier
[Diagram: human experts supply the rules; the rule-based classifier labels the whale a mammal]
“Traditional” ML-based Binary Classifier
[Diagram: a corpus of labelled examples trains the classifier; the output is a label]
The attributes that the ML algorithm focuses on are called features.
Feature Vectors: each data point is a list - or vector - of such features.
“Representation” ML-based Binary Classifier
[Diagram: the classifier works out for itself which features of the corpus matter]
Deep learning: algorithms that learn what features matter
Neural networks: the most common class of deep learning algorithms
Neurons: simple building blocks that actually “learn”
Deep Learning Book - Chapter 1 (intro), page 6
“Deep Learning”-based Binary Classifier
[Diagram: a corpus of images flows through the layers of a neural network - pixels → edges → corners → object parts - across Layer 1, Layer 2, …, Layer N, into the ML-based classifier]
Neural Networks Introduced
Performance: accuracy in classification, residual variance in regression
Experience: training using a corpus of labelled instances
Learning Algorithms
A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E
Deep Learning and Neural Networks
Deep learning: algorithms that learn what features matter
Neural networks: the most common class of deep learning algorithms
Neurons: simple building blocks that actually “learn”
“Deep Learning”-based Binary Classifier
[Diagram: representations are built up layer by layer - pixels → edges → corners → object parts]
Deep Learning: the more complex the graph, the more relationships it can “learn”.
“Learning” Regression
Regression can be reverse-engineered by a single neuron
Regression: The Simplest Neural Network
[Diagram: a set of points → a single neuron → the regression line]

def XOR(x1, x2):
    if x1 == x2:
        return 0
    return 1
“Learning” XOR
The XOR function can be reverse-engineered using 3 neurons arranged in 2 layers
XOR: 3 Neurons, 2 Layers
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
The Computational Graph
The nodes in the computation graph are simple entities called neurons (simple building blocks).
The edges in the computation graph are data items called tensors.
Neural Networks
- Convolutional
- Recurrent
Groups of neurons that perform similar functions are aggregated into layers.
Layers in the Computation Graph
[Diagram: a corpus of images flows through Layer 1, Layer 2, …, Layer N - pixels → edges → corners → object parts - into the ML-based classifier]
Neurons
Each layer consists of units called neurons.
…But each neuron only applies two simple functions to its inputs (as sketched below):
- A linear (affine) transformation
- An activation function
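As a rough illustration (not from the course materials), a single neuron with a ReLU activation can be sketched in a few lines of NumPy; the inputs, weights and bias below are made-up values:

import numpy as np

def neuron(x, W, b):
    """One neuron: an affine transformation followed by a ReLU activation."""
    affine = np.dot(W, x) + b          # Wx + b
    return np.maximum(affine, 0)       # max(Wx + b, 0)

# hypothetical inputs, weights and bias
x = np.array([0.5, -1.2, 3.0])
W = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron(x, W, b))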
Operation of a Single Neuron
[Diagram: inputs X1 … Xn with weights W1 … Wn and a bias b feed an affine transformation Wx + b, followed by an activation function max(Wx + b, 0)]
Different types of neural networks wire up neurons in different ways.
These interconnections can get very sophisticated…
During training, the output of deeper layers may be “fed back” to find the best W, b.
This is called backpropagation.
Operation of a Single Neuron
[Diagram: the affine transformation Wx + b feeds the activation function max(Wx + b, 0)]
ReLU(x) = max(0, x)
Regression: The Simplest Neural Network

def doSomethingReallyComplicated(x1, x2, …):
    …
    return complicatedResult
“Learning” Regression
Regression can be reverse-engineered by a single neuron.
Regression: The Simplest Neural Network
[Diagram: a set of points → a single neuron → the regression line]
Operation of a Single Neuron
[Diagram: inputs X1 … Xn, weights W1 … Wn and a bias b; the affine transformation produces y = Wx + b]
Here the neuron is an entity that finds the “best fit” line through a set of points.
The affine transformation is just a weighted sum of the inputs with a bias added.
The output from the neuron is y = Wx + b.
y1 = A + Bx1
y2 = A + Bx2
y3 = A + Bx3
…
yn = A + Bxn
Simple Regression
Regression Equation:
y = A + Bx
y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
…
yn = A + Bxn + en
In matrix form:
[y1, y2, …, yn]ᵀ = A·[1, 1, …, 1]ᵀ + B·[x1, x2, …, xn]ᵀ + [e1, e2, …, en]ᵀ
Minimising Least Square Error
[Diagram: the residual ei = yi - y'i is the vertical distance between the actual point (xi, yi) and the fitted point (xi, y'i) on the regression line y = A + Bx]
Residuals of a regression are the difference between actual and
fitted values of the dependent variable
The “Best” Regression Line
[Diagram: two candidate lines through the same points - Line 1: y = A1 + B1x and Line 2: y = A2 + B2x. A1 and A2 are the intercepts; when x increases by 1, y changes by B1 along Line 1 and by B2 along Line 2]
Minimising Least Square Error
[Diagram: dotted vertical lines mark the error from each point to each candidate line]
The “best fit” line is the one where the sum of the squares of the lengths of the errors is minimum.
Minimising Least Square Error
[Diagram: the fitted regression line y = A + Bx]
Ways of estimating the line:
- Method of moments
- Maximum likelihood estimation
- Method of least squares
Operation of a Single Neuron
[Diagram: the affine transformation produces y = Wx + b]
“Learning” Regression
Regression can be learnt by a single neuron using an affine transformation alone
Regression: The Simplest Neural Network
[Diagram: a set of points → a single neuron → the regression line]
def XOR(x1, x2):
    if x1 == x2:
        return 0
    return 1
“Learning” XOR
Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function
XOR: Not Linearly Separable
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
[Diagram: the four points plotted in the X1-X2 plane; (0, 0) and (1, 1) have Y = 0, while (0, 1) and (1, 0) have Y = 1]
No one straight line neatly divides the points into disjoint regions where Y = 0 and Y = 1.
XOR: 3 Neurons, 2 Layers
[Diagram: X1 and X2 feed two neurons in the first layer, whose outputs feed a single neuron in the second layer]
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
“Learning” XOR
Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function
1-Neuron Regression
[Diagram: a single neuron applies the affine transformation Wx + b, so y = Wx + b]
ReLU(x) = max(0, x)
XOR: 3 Neurons, 2 Layers
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
3-Neuron XOR
[Diagram: Neuron #1 (weights W1, W3, bias b1) and Neuron #2 (weights W2, W4, bias b2) each apply an affine transformation followed by an activation function to the inputs X1 and X2; Neuron #3 (weights W5, W6, bias b3) applies an affine transformation followed by an activation function to their outputs]
The most common form of the activation function is the ReLU (Rectified Linear Unit):
ReLU(x) = max(0, x)
[Diagram: the activation function maps Wx + b to max(Wx + b, 0)]
3-Neuron XOR
[Diagram: X1 and X2 are the inputs. Neurons #1 and #2 form Layer 1: each applies an affine transformation followed by a ReLU, i.e. x → max(x, 0). Neuron #3 forms Layer 2: it applies an affine transformation (weights W5 and W6, bias b3) followed by the identity function, and its value is the output.]
Information only “feeds forward”.
“2-Layer Feed-forward Neural Network”
[Diagram: the same three neurons - two ReLU neurons in the first layer feeding one identity-activation neuron in the second layer]
XOR: 3 Neurons, 2 Layers
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
Each neuron has weights and a bias that must be calculated by the training algorithm (done for us by TensorFlow).
Weights and Bias of Neurons #1, #2 and #3
The weights and biases of individual neurons are determined during the training process. For XOR they come out as:
- Neuron #1: W1 = 1, W3 = 1, b1 = 0
- Neuron #2: W2 = 1, W4 = 1, b2 = -1
- Neuron #3: W5 = 1, W6 = -2, b3 = 0
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
“Learning” XOR
Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function.
3-Neuron XOR: working through the network with these weights
- Inputs X1 = 0, X2 = 0: Neuron #1 affine = 0, ReLU = 0; Neuron #2 affine = -1, ReLU = 0; Neuron #3 output = 1·0 + (-2)·0 + 0 = 0
- Inputs X1 = 0, X2 = 1: Neuron #1 affine = 1, ReLU = 1; Neuron #2 affine = 0, ReLU = 0; Neuron #3 output = 1·1 + (-2)·0 + 0 = 1
- Inputs X1 = 1, X2 = 0: Neuron #1 affine = 1, ReLU = 1; Neuron #2 affine = 0, ReLU = 0; Neuron #3 output = 1·1 + (-2)·0 + 0 = 1
- Inputs X1 = 1, X2 = 1: Neuron #1 affine = 2, ReLU = 2; Neuron #2 affine = 1, ReLU = 1; Neuron #3 output = 1·2 + (-2)·1 + 0 = 0
The outputs match the XOR truth table:
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
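A quick way to check this arithmetic (a sketch, not code from the course) is to wire the same weights up in NumPy:

import numpy as np

def relu(z):
    return np.maximum(z, 0)

def xor_net(x1, x2):
    # Layer 1: Neuron #1 (W1=1, W3=1, b1=0) and Neuron #2 (W2=1, W4=1, b2=-1), both ReLU
    h1 = relu(1 * x1 + 1 * x2 + 0)
    h2 = relu(1 * x1 + 1 * x2 - 1)
    # Layer 2: Neuron #3 (W5=1, W6=-2, b3=0), identity activation
    return 1 * h1 - 2 * h2 + 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0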
Choice of Activation Function
[Diagram: the activation-function block in the neuron, e.g. SoftMax]
“Learning” XOR
Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function
XOR: 3 Neurons, 2 Layers
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
def doSomethingReallyComplicated(x1, x2, …):
    …
    return complicatedResult
Baseline: regular Python code · Cost function: quantifying goodness-of-fit · Training: batch size for each epoch
Cause → Effect
Baseline: non-TensorFlow implementation
Forward indices run from 0 to n-1; backward indices run from -n to -1.
Prices to Returns
returns = prices[1:] / prices[:-1] - 1
Each return divides a price by the previous price and subtracts 1, so n prices yield n-1 returns.
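A small NumPy sketch of this slicing (the price values are made up):

import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])
# prices[1:]  -> [102.0, 101.0, 105.0]   (drops the first element)
# prices[:-1] -> [100.0, 102.0, 101.0]   (drops the last element)
returns = prices[1:] / prices[:-1] - 1
print(returns)   # [ 0.02       -0.00980392  0.03960396]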
[Diagram: the points (x1, y1) … (xn, yn) are used to estimate (A, B); the x values are first reshaped into single-column form [x1], [x2], …, [xn]]
Reshaping in NumPy
reshape(-1, 1) turns the flat array x1, x2, …, xn into a column of one-element rows [x1], [x2], …, [xn].
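For instance (a sketch with made-up values):

import numpy as np

x = np.array([1.5, 2.0, 2.5, 3.0])     # shape (4,)  - a flat array
x_col = x.reshape(-1, 1)               # shape (4, 1) - one column, row count inferred from -1
print(x_col)
# [[1.5]
#  [2. ]
#  [2.5]
#  [3. ]]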
Implementing Regression in TensorFlow
Baseline: non-TensorFlow implementation
Computation Graph: neural network of 1 neuron
[Diagram: a set of points → a single neuron → the regression line]
Regression: The Simplest Neural Network
[Diagram: a single neuron whose affine transformation Wx + b is followed by the identity activation, so the output is Wx + b]
Minimising Least Square Error
[Diagram: candidate lines y = A1 + B1x and y = A2 + B2x against the data, and the fitted regression line y = A + Bx]
The “best fit” line is the one where the sum of the squares of the lengths of the errors is minimum.
What we would like to achieve · What slows us down · What we really control
y = Wx + b
Minimizing MSE
[Diagram: the MSE surface as a function of W and b]
Make the MSE as small as possible: the best value of W and the best value of b give the smallest value of MSE.
Start Somewhere
Initial values - you have to start somewhere: an initial value of W and an initial value of b give an initial value of MSE.
“Gradient Descent”
Converging on the “best” values of W and b from those initial values, using an optimization algorithm driven by the training data.
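A minimal sketch of this loop, assuming the TensorFlow 1.x low-level API and made-up training points (not the course's exact code):

import numpy as np
import tensorflow as tf

x_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # hypothetical points
y_train = np.array([[2.0], [4.1], [5.9], [8.2]])

x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])
W = tf.Variable(tf.zeros([1, 1]))                  # initial value of W
b = tf.Variable(tf.zeros([1]))                     # initial value of b

y_pred = tf.matmul(x, W) + b                       # the affine transformation Wx + b
mse = tf.reduce_mean(tf.square(y_pred - y))        # cost: mean squared error
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(mse)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):                          # each step moves W, b downhill on the MSE surface
        sess.run(train_step, feed_dict={x: x_train, y: y_train})
    print(sess.run([W, b, mse], feed_dict={x: x_train, y: y_train}))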
Baseline: regular Python code · Cost function: quantifying goodness-of-fit · Training: batch size for each epoch
[Diagram: cause → effect, e.g. government bond yields and the NASDAQ share index]
Linear regression and logistic regression are similar, yet quite different
Starting Five Minutes in Advance
Probability of meeting the deadline: 0%; probability of getting other important work done: 100%.
The Goldilocks Solution
Start very late and hope for the best · Start as late as possible to be sure to make it · Start very early and do little else
Aim for about a 95% probability of meeting the deadline while keeping a 95% probability of getting other important work done.
Working Hard, Fast, Smart
[Graph: probability of meeting the deadline against time to deadline. Starting 1 year ahead gives (1 year, 100%); starting 5 minutes ahead gives (5 mins, 0%). Working hard, fast and smart shifts the curve; a 95% probability is reached at (11 days, 95%).]
Logistic Regression helps find how probabilities
are changed by actions
Working Smart with Logistic Regression
[Graph: an S-shaped curve of probability (0% to 100%) against time to deadline]
Start too late, and you’ll definitely miss.
Start too early, and you’ll definitely make it.
Working smart is knowing when to start - the point where the probability crosses 50%.
Y-axis: probability of meeting deadline
- floor of 0
- ceiling of 1
y: hit or miss? (0 or 1?)
p(y) : probability of y = 1
Logistic regression involves finding the “best fit” such curve:
p(yi) = 1 / (1 + e^-(A + Bxi))
- A is the intercept
- B is the regression coefficient
As x → -∞, p(y) → 0; as x → +∞, p(y) → 1.
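A tiny sketch of this curve and its limits (illustrative values of A and B, not from the course):

import numpy as np

def p(x, A=0.0, B=1.0):
    """Logistic curve p(y) = 1 / (1 + e^-(A + Bx))."""
    return 1.0 / (1.0 + np.exp(-(A + B * x)))

print(p(np.array([-10.0, 0.0, 10.0])))   # ~0 at the far left, 0.5 in the middle, ~1 at the far right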
Probability of the outcome is very sensitive to changes in the cause.
p(yi) = 1 / (1 + e^-(A + Bxi))
Categorical and Continuous Variables
Continuous: can take an infinite set of values (height, weight, income…)
Categorical: can take a finite set of values (male/female, day of week…)
Categorical variables that can take just two values are called
binary variables
Logistic Regression helps estimate how
probabilities of categorical variables are
influenced by causes
Logistic Regression in Classification
Whales: Fish or Mammals
Mammal: member of the infraorder Cetacea
Fish: looks like a fish, swims like a fish, moves like a fish
Rule-based Binary Classifier
[Diagram: human experts supply the rules; the rule-based classifier labels the whale a mammal]
ML-based Binary Classifier
[Diagram: an ML-based classifier - or predictor - is trained from a corpus of labelled examples]
Applying Logistic Regression
[Diagram: the estimated probability of an animal being a fish, given features such as “lives in water, breathes with gills, lays eggs” - e.g. 95%, 80%, 60%]
Rule of 50%: classify as Fish when the probability is above 50%, and as Mammal otherwise.
Cause → Effect
[Diagram: data points (x1, y1) … (xn, yn) with a fitted regression line y = A + Bx, alongside binary points (y = 0 or 1) with a fitted regression curve p(y) = 1 / (1 + e^-(A + Bx))]
Similar, yet Different
[Diagram: linear regression fits y against x; logistic regression fits p(y) against x]
- The objective of both is to find the A, B that “best fit” the data.
- Linear regression: the relationship is already linear (by assumption). Logistic regression: the relationship can be made linear, by the log transformation logit(p) = ln(p / (1 - p)).
- Both then solve the regression problem using cookie-cutter solvers.
Logistic Regression
[Diagram: binary outcomes y = 0 or 1 plotted against x, with the fitted regression curve p(y) = 1 / (1 + e^-(A + Bx))]
Linear regression fits
y1 = A + Bx1
y2 = A + Bx2
y3 = A + Bx3
…
yn = A + Bxn
Logistic regression instead fits
p(y1) = 1 / (1 + e^-(A + Bx1))
…
p(yn) = 1 / (1 + e^-(A + Bxn))
Regression Equation:
p(yi) = 1 / (1 + e^-(A + Bxi))
Odds of an Event
Odds(p) = p / (1 - p)
p = 1 / (1 + e^-(A + Bx)) = e^(A + Bx) / (1 + e^(A + Bx))
1 - p = 1 - e^(A + Bx) / (1 + e^(A + Bx)) = (1 + e^(A + Bx) - e^(A + Bx)) / (1 + e^(A + Bx)) = 1 / (1 + e^(A + Bx))
Odds(p) = p / (1 - p) = e^(A + Bx)
Logit Is Linear
logit(p) = ln Odds(p) = A + Bx
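A quick numerical check of this derivation (a sketch, with arbitrary values of A and B):

import numpy as np

A, B = 0.5, 2.0                          # arbitrary coefficients
x = np.linspace(-3, 3, 7)

p = 1.0 / (1.0 + np.exp(-(A + B * x)))   # logistic curve
logit = np.log(p / (1 - p))              # ln Odds(p)

print(np.allclose(logit, A + B * x))     # True: the logit is exactly linear in x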
Cause → Effect
x = returns on the S&P 500 (S&P500) · y = returns on Google stock (GOOG)
Logistic Regression
p(yi) = 1 / (1 + e^-(A + Bxi))
Set up the Problem
GOOG returns > 0% → Up → label 1
GOOG returns <= 0% → Down → label 0
Predicted probability < 0.5 → Down → predicted label False
Example rows: 2017-01-01: 1, 1 · 2017-02-01: 1, 1
Baseline: regular Python code · Cost function: quantifying goodness-of-fit (linear) or similarity of distribution (logistic) · Training: batch size for each epoch
Linear Regression with One Neuron
[Diagram: a set of points feeds a single neuron; the affine transformation Wx + b followed by the identity activation produces the regression line]
Logistic Regression with One Neuron
[Diagram: the inputs feed an affine transformation x' = W1x + b1; a softmax stage with its own weights W2 and bias b2 then converts x' into
P(Y = True) = 1 / (1 + e^-(W2x' + b2)) and
P(Y = False) = 1 / (1 + e^(W2x' + b2)) = 1 - P(Y = True)]
SoftMax for True/False Classification
[Diagram: the softmax function maps x to p(Y = True) = 1 / (1 + e^-(Wx + B)) and p(Y = False) = 1 / (1 + e^(Wx + B))]
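A small sketch (not the course's code) showing that a two-way softmax over the logits [0, Wx + B] reduces to exactly these two expressions:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

W, B, x = 1.5, -0.3, 0.8             # arbitrary values
z = np.array([0.0, W * x + B])       # logits for the two classes False, True

p_false, p_true = softmax(z)
print(p_true, 1 / (1 + np.exp(-(W * x + B))))    # the same number
print(p_false, 1 / (1 + np.exp(W * x + B)))      # the same number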
Linear Regression with One Neuron: a 1-dimensional feature vector with Shape(W) = [1, 1] produces the regression line.
Logistic Regression with One Neuron: a 1-dimensional feature vector with Shape(W) = [1, 2] and Shape(b) = [2] produces the S-curve.
SoftMax for Digit Classification
[Diagram: the softmax function outputs P(Y = 0), P(Y = 1), …, P(Y = 9)]
1-dimensional feature vector: Shape(W) = [1, 10]
SoftMax N-category Classification
[Diagram: the softmax function outputs P(Y = Y1), P(Y = Y2), …, P(Y = YN)]
1-dimensional feature vector: Shape(W) = [1, N], Shape(b) = [N]
M-dimensional feature vector: Shape(W) = [M, N]
Baseline: regular Python code · Cost function: similarity of distribution · Training: batch size for each epoch
Set up the Problem
GOOG returns > 0% → Up → label 1
GOOG returns <= 0% → Down → label 0
Predicted probability < 0.5 → Down → predicted label False
Labels and predicted labels by month:
2005-01-01: NA, NA
2005-02-01: 0, 1
2005-03-01: 0, 0
…
2017-01-01: 1, 1
2017-02-01: 1, 1
Cross Entropy
Intuition: low cross entropy - the distribution of Ypredicted closely matches the distribution of Yactual.
Intuition: high cross entropy - the distribution of Ypredicted differs greatly from the distribution of Yactual.
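A quick illustration of that intuition (made-up distributions, not course data):

import numpy as np

def cross_entropy(y_actual, y_predicted):
    """H(actual, predicted) = -sum(actual * log(predicted))."""
    return -np.sum(y_actual * np.log(y_predicted))

y_actual = np.array([0.0, 1.0])                      # the true label, one-hot
print(cross_entropy(y_actual, np.array([0.1, 0.9]))) # ~0.105: distributions agree -> low
print(cross_entropy(y_actual, np.array([0.9, 0.1]))) # ~2.303: distributions disagree -> high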
Logistic Regression in TensorFlow
Baseline: regular Python code · Cost function: similarity of distribution · Training: batch size for each epoch
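A minimal sketch of this setup, assuming the TensorFlow 1.x API (shapes and names are illustrative, not the course's exact code):

import tensorflow as tf

x  = tf.placeholder(tf.float32, [None, 1])   # 1-dimensional feature
y_ = tf.placeholder(tf.float32, [None, 2])   # actual labels, one-hot encoded (e.g. Down, Up)
W  = tf.Variable(tf.zeros([1, 2]))           # Shape(W) = [1, 2]
b  = tf.Variable(tf.zeros([2]))              # Shape(b) = [2]

y = tf.nn.softmax(tf.matmul(x, W) + b)       # predicted probabilities for the two classes

# cost: cross entropy between the actual and predicted label distributions
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), axis=1))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)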
tf.argmax(y, 1)
[Example: for a column of values 5, 15, 12, 100, 74, 33 at indices 0 to 5, tf.argmax returns 3 - the index of the largest value, 100]
In general, tf.argmax(y, 1) returns, for each row of y, the index (0 … M) of the largest value.
tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1))
[Example: the actual labels y_ are one-hot encoded - TRUE → (1, 0), FALSE → (0, 1); tf.argmax recovers the index of the 1 in each row, and tf.equal compares actual and predicted indices row by row]
[Example: the predicted output for each row is a pair of probabilities P(TRUE), P(FALSE) - e.g. 0.70/0.30, 0.44/0.56, 0.34/0.66, …, 0.84/0.16 - to be compared against the actual labels (e.g. Mammal vs. Fish)]
[Example: for digit classification, the labels 0 … 9 are one-hot encoded as 10-element vectors, and the predicted output for each row is a vector of 10 probabilities]
tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1))
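Continuing the sketch above (with y and y_ as defined there), the row-by-row comparison is typically turned into an accuracy figure like this (again assuming the TensorFlow 1.x API):

# a vector of booleans, one per row: did the predicted class match the actual class?
correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1))
# cast the booleans to 0.0 / 1.0 and average them: the fraction classified correctly
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))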
How Estimators Work
[Diagram: the estimator is configured with feature columns - each with a name, a type and dimensions - and an input function]
The input function supplies the feature names, the y variable, the batch size and the number of epochs.
The estimator runs the optimisation and returns the trained model.
Complex Neural Networks
[Diagram: a densely connected network]
All those interconnections represent intermediate feature vector data!
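A minimal sketch of that workflow, assuming the TensorFlow 1.x estimator API (the feature name, data values and step count are made up):

import numpy as np
import tensorflow as tf

# one numeric feature named 'x': its name, type and dimensions
feature_columns = [tf.feature_column.numeric_column('x', shape=[1])]

estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

x_train = np.array([1.0, 2.0, 3.0, 4.0])           # hypothetical training data
y_train = np.array([2.0, 4.1, 5.9, 8.2])

# the input function supplies the features, the y variable, the batch size and the epochs
input_fn = tf.estimator.inputs.numpy_input_fn(
    {'x': x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)

estimator.train(input_fn=input_fn, steps=1000)      # run the optimisation, return a trained model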
Linear Regression in TensorFlow
Baseline: regular Python code
Cost function: quantifying goodness-of-fit (linear regression) · similarity of distribution (logistic regression)
Training: batch size for each epoch
Estimators: set up X, Y, batch_size, num_epochs · abstract away cost and optimizer choices · can re-specify the number of training steps · predict new points (test data)
Learning using Neurons
Estimators
“Representation” ML-based systems figure out by
themselves what features to pay attention to
y = Wx + b
“Learning” Regression
Regression can be reverse-engineered by a single neuron.
Regression: The Simplest Neural Network
[Diagram: a set of points → a single neuron → the regression line]
def XOR(x1, x2):
    if x1 == x2:
        return 0
    return 1
“Learning” XOR
The XOR function can be reverse-engineered using 3 neurons arranged in 2 layers.
XOR: 3 Neurons, 2 Layers
X1 X2 Y
0  0  0
0  1  1
1  0  1
1  1  0
Baseline: regular Python code · Cost function: quantifying goodness-of-fit / similarity of distribution · Training: batch size for each epoch