TensorFlow Regression

The document provides an overview of building regression models using TensorFlow, emphasizing the role of neural networks and neurons in deep learning. It outlines prerequisites for learning, including Python programming and TensorFlow installation, and discusses various machine learning concepts such as linear and logistic regression. Additionally, it explains the operation of neurons, the training process, and the significance of weights and biases in neural networks.

Building Regression Models using TensorFlow

LEARNING USING NEURONS


Overview

The most common use of TensorFlow is in deep learning

Neural networks are the most common type of deep learning algorithm

The basic building block of a neural network is a neuron

Linear regression can be “learnt” using a single neuron

Deep learning extends this idea to more complex, non-linear functions


Prerequisites and Course Outline

Course Outline
- Learning using Neurons
- Linear Regression in TensorFlow
- Logistic Regression in TensorFlow
- Estimators

Related Courses
- Understanding the Foundations of TensorFlow
- Understanding and Applying Linear Regression
- Understanding and Applying Logistic Regression

Software and Skills
- Have TensorFlow installed
- Know some Python programming


Understanding Deep Learning

Whales: Fish or Mammals?

- Mammals: members of the infraorder Cetacea
- Fish: look like fish, swim like fish, move like fish
Rule-based Binary Classifier

[Diagram: Whale → rule-based classifier, built by human experts → “Mammal”]

“Traditional” ML-based Binary Classifier

[Diagram: Input (“breathes like a mammal, gives birth like a mammal”) → ML-based classifier, learnt from a corpus → “Mammal”]

[Diagram: Corpus → classification algorithm → ML-based classifier]
“Traditional” ML-based vs. Rule-based Binary Classifiers

“Traditional” ML-based     Rule-based
Dynamic                    Static
Experts optional           Experts required
Corpus required            Corpus optional
Training step              No training step


“Traditional” ML-based Binary Classifier

[Diagram: The input to the ML-based classifier is a feature vector; the output is a label. The feature vector “breathes like a mammal, gives birth like a mammal” produces the label “Mammal”; the feature vector “moves like a fish, looks like a fish” produces the label “Fish”. In both cases the classifier itself is learnt from the corpus.]
Feature Vectors

- The attributes that the ML algorithm focuses on are called features
- Each data point is a list - or vector - of such features
- Thus, the input into an ML algorithm is a feature vector

“Traditional” ML-based systems still rely on
experts to decide what features to pay
attention to
“Representation” ML-based systems figure
out by themselves what features to pay
attention to
“Traditional” ML-based Binary Classifier

[Diagram: Corpus → feature selection by experts → classification algorithm → ML-based classifier]

“Representation” ML-based Binary Classifier

[Diagram: Corpus → feature selection algorithm → classification algorithm → ML-based classifier]
“Traditional” ML-based Binary Classifier

[Diagram: Hand-picked features (“breathes like a mammal, gives birth like a mammal”) → ML-based classifier → “Mammal”]

“Representation” ML-based Binary Classifier

[Diagram: A raw picture or video of a whale → ML-based classifier → “Mammal”. Both the feature selection algorithm and the classification algorithm are learnt from the corpus.]
“Deep Learning” systems are one type of
representation systems
Deep Learning and Neural Networks

- Deep Learning: algorithms that learn what features matter
- Neural Networks: the most common class of deep learning algorithms
- Neurons: simple building blocks that actually “learn”
Deep Learning Book - Chapter 1 (intro), page 6
“Deep Learning”-based Binary Classifier

[Diagram: Corpus of images → feature selection & classification algorithm → ML-based classifier. The network builds up a hierarchy of features: pixels → edges → corners → object parts. The pixels form the “visible layer”; the intermediate representations form the “hidden layers”.]
Neural Networks Introduced

[Diagram: Corpus of images → layers in a neural network (Layer 1, Layer 2, …, Layer N) → ML-based classifier. Pixels enter the first layer; each subsequent layer works on processed groups of pixels. Each layer consists of individual, interconnected neurons.]
Neurons as Learning Units

A machine learning algorithm is an algorithm that is able to learn from data

Learning Algorithms

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E

- Most common tasks (T) in ML: classification, regression
- Typical performance measures (P): accuracy in classification, residual variance in regression
- Typical experience (E): training using a corpus of labelled instances
Deep Learning and Neural Networks

- Deep Learning: algorithms that learn what features matter
- Neural Networks: the most common class of deep learning algorithms
- Neurons: simple building blocks that actually “learn”
“Deep Learning”-based Binary Classifier

[Diagram: Corpus of images → feature selection & classification algorithm → ML-based classifier, with the feature hierarchy pixels → edges → corners → object parts]

Layers in the Computation Graph

[Diagram: Groups of neurons that perform similar functions are aggregated into layers.]
Neural Networks: Networks of Neurons

[Diagram: Corpus of images → neural network → ML-based classifier. Each layer consists of individual, interconnected neurons; pixels go in and processed groups of pixels come out.]

Deep Learning

- Directed computation graphs “learn” relationships between data
- The more complex the graph, the more relationships it can “learn”
- “Deep” learning refers to the depth of the computation graph


“Learning” Regression

y = Wx + b

Regression can be reverse-engineered by a single neuron

Regression: The Simplest Neural Network

[Diagram: Set of points → single neuron → regression line]
def XOR(x1, x2):
    if x1 == x2:
        return 0
    return 1

“Learning” XOR
The XOR function can be reverse-engineered using 3 neurons arranged in 2 layers
XOR: 3 Neurons, 2 Layers

[Diagram: 2 inputs (X1, X2) → Layer 1 → Layer 2 → output]

Truth Table:
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0
def doSomethingReallyComplicated(x1, x2, …):
    # … arbitrarily complex logic …
    return complicatedResult

“Learning” Arbitrarily Complex Functions

By adding layers, a neural network can “learn” (reverse-engineer) pretty much anything

Arbitrarily Complex Function

[Diagram: Corpus of images → operations (nodes) on data (edges) → ML-based classifier]
The Computational Graph

[Diagram: Corpus of images → computation graph → ML-based classifier]

- The graph performs operations (nodes) on data (edges)
- The nodes in the computation graph are neurons (simple building blocks)
- The edges in the computation graph are data items called tensors
Neurons

- The nodes in the computation graph are simple entities called neurons
- Each neuron performs very simple operations on data
- The neurons are connected in very complex, sophisticated ways
Neural Networks

The complex interconnections between simple neurons define the network:

- Different network configurations => different types of neural networks
  - Convolutional
  - Recurrent

Groups of neurons that perform similar functions are aggregated into layers
Layers in the Computation Graph

[Diagram: Corpus of images → layers of a neural network (Layer 1, Layer 2, …, Layer N) → ML-based classifier. Groups of neurons that perform similar functions are aggregated into layers: the pixels form the “visible layer”, while edges, corners and object parts are learnt in the “hidden layers”.]
Neurons

Each layer consists of units called neurons

Neural Networks

Neurons in a neural network can be connected in very complex ways…

…But each neuron only applies two simple functions to its inputs:

- A linear (affine) transformation
- An activation function
Operation of a Single Neuron

[Diagram: Inputs X1, X2, …, Xn enter the neuron with weights W1, W2, …, Wn and a bias b. The neuron applies an affine transformation, Wx + b, followed by an activation function, max(Wx + b, 0), which produces the output.]

- Each neuron only applies two simple functions to its inputs
- The affine transformation is just a weighted sum with a bias added
- The values W1, W2, …, Wn are called the weights
- The value b is called the bias
- Where do the values of W and b come from?

The weights and biases of individual neurons are
determined during the training process
The actual training of a neural network is
managed by TensorFlow
Finding the “best” values of W and b for each neuron is crucial

The “best” values are found using the cost function, optimizer and corpus…

…and the process of finding them is called the training process

Different types of neural networks wire up neurons in different ways

These interconnections can get very sophisticated…

During training, the output of deeper layers may be “fed back” to find the best W, b

This is called back propagation

Back propagation is the standard algorithm for training neural networks
Operation of a Single Neuron

- The training algorithm will use the weights to tell the neuron which inputs matter, and which do not…
- …and apply a corrective bias if needed
- The linear output can only be used to learn linear functions, but we can easily generalize this…
- The activation function is a non-linear function, very often simply the max(0, …) function
The output of the affine transformation, Wx + b, is chained into an activation function, max(Wx + b, 0)

The activation function is needed for the neural network to predict non-linear functions

The most common form of the activation function is the ReLU

ReLU: Rectified Linear Unit

ReLU(x) = max(0, x)
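To make the two operations concrete, here is a minimal NumPy sketch of a single neuron; the weights, bias and inputs are arbitrary illustrative values, not something taken from the course.

import numpy as np

def neuron(x, W, b):
    # Affine transformation: weighted sum of the inputs plus a bias
    affine = np.dot(W, x) + b
    # Activation function: ReLU, max(Wx + b, 0)
    return np.maximum(affine, 0.0)

# Arbitrary illustrative weights, bias and inputs
W = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 3.0])
print(neuron(x, W, b))   # ReLU(0.5*1 - 1*2 + 2*3 + 0.1) = 4.6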
Regression: The Simplest Neural Network

def doSomethingReallyComplicated(x1, x2, …):
    # … arbitrarily complex logic …
    return complicatedResult

“Learning” Arbitrarily Complex Functions

By adding layers, a neural network can “learn” (reverse-engineer) pretty much anything

y = Wx + b

“Learning” Regression

Regression can be reverse-engineered by a single neuron

Regression: The Simplest Neural Network

[Diagram: Set of points → single neuron → regression line]
Operation of a Single Neuron

[Diagram: Inputs X1, …, Xn with weights W1, …, Wn and a bias b feed an affine transformation, producing the output y = Wx + b.]

- Here the neuron is an entity that finds the “best fit” line through a set of points
- We are instructing the neuron to learn a linear function - so no activation function is required at all
- The affine transformation is just a weighted sum of the inputs with a bias added
- The values W1, W2, …, Wn are called the weights; the value b is called the bias
- Where do the weights W and the bias b come from?
The weights and biases of individual neurons are
determined during the training process
The actual training of a neural network is
managed by TensorFlow
Simple Regression
Regression Equation:
y = A + Bx

y1 = A + Bx1
y2 = A + Bx2
y3 = A + Bx3
… …
yn = A + Bxn
Simple Regression
Regression Equation:
y = A + Bx

y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
… … …
yn = A + Bxn + en
Simple Regression
Regression Equation:
y = A + Bx
[y1]       [1]       [x1]       [e1]
[y2]       [1]       [x2]       [e2]
[y3]  = A  [1]  + B  [x3]  +    [e3]
[… ]       [… ]      [… ]       [… ]
[yn]       [1]       [xn]       [en]
Minimising Least Square Error

[Plot: a data point (xi, yi) lies above the regression line y = A + Bx; the vertical distance to the fitted point (xi, y'i) is the error ei = yi - y'i.]

Residuals of a regression are the difference between actual and fitted values of the dependent variable
The “Best” Regression Line
Y

Linear Regression involves finding the “best fit” line


The “Best” Regression Line
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

Let’s compare two lines, Line 1 and Line 2


The “Best” Regression Line
Y

A1 Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

The first line has y-intercept A1


The “Best” Regression Line
Y

x increases by 1

y decreases by B1

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

In the first line, if x increases by 1 unit, y decreases by B1 units


The “Best” Regression Line
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x
A2 X

The second line has y-intercept A2


The “Best” Regression Line
Y

Line 1: y = A1 + B1x
y decreases by B2
Line 2: y = A2 + B2x
x increases by 1
X

In the second line, if x increases by 1 unit, y decreases by B2 units


Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

Drop vertical lines from each point to the lines A and B


Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

Drop vertical lines from each point to the lines A and B


Simple Regression
Regression Equation:
y = A + Bx
[y1]       [1]       [x1]       [e1]
[y2]       [1]       [x2]       [e2]
[y3]  = A  [1]  + B  [x3]  +    [e3]
[… ]       [… ]      [… ]       [… ]
[yn]       [1]       [xn]       [en]
Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

The “best fit” line is the one where the sum of the squares of
the lengths of these dotted lines is minimum
Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

The “best fit” line is the one where the sum of the squares of
the lengths of these dotted lines is minimum
Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

The “best fit” line is the one where the sum of the
squares of the lengths of the errors is minimum
Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

The “best fit” line is the one where the sum of the
squares of the lengths of the errors is minimum
Simple Regression
Regression Equation:
y = A + Bx
[y1]       [1]       [x1]       [e1]
[y2]       [1]       [x2]       [e2]
[y3]  = A  [1]  + B  [x3]  +    [e3]
[… ]       [… ]      [… ]       [… ]
[yn]       [1]       [xn]       [en]
The “best fit” line is the one where the sum of the squares of
the lengths of the errors is minimum
Minimising Least Square Error
Y

Regression Line: y = A
+ Bx

The “best fit” line is called the regression line


Optimizers for the “Best-fit”

- Method of moments
- Method of least squares
- Maximum likelihood estimation
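As a concrete illustration of the method of least squares (the approach used throughout this course), here is a minimal NumPy sketch; the data points are made up for illustration.

import numpy as np

# Made-up data roughly following y = 2x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# np.polyfit with degree 1 minimises the sum of squared errors,
# returning the slope B and intercept A of the best-fit line y = A + Bx
B, A = np.polyfit(x, y, 1)
print(A, B)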
Operation of a Single Neuron

[Diagram: inputs X1, …, Xn, weights W1, …, Wn, bias b → affine transformation → y = Wx + b]

Where do the weights W and the bias b come from?

They are determined during the training process

This optimization is not carried out by the individual neuron; rather, a training algorithm takes care of it
A Slightly More Complex Neural Network

def doSomethingReallyComplicated(x1, x2, …):
    # … arbitrarily complex logic …
    return complicatedResult

“Learning” Arbitrarily Complex Functions

By adding layers, a neural network can “learn” (reverse-engineer) pretty much anything

An Arbitrarily Complex Function

[Diagram: Corpus of images → layers of interconnected neurons → ML-based classifier]

y = Wx + b

“Learning” Regression

Regression can be learnt by a single neuron using an affine transformation alone

Regression: The Simplest Neural Network

[Diagram: Set of points → single neuron → regression line]
def XOR(x1, x2):
    if x1 == x2:
        return 0
    return 1

“Learning” XOR
Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function
XOR: Not Linearly Separable

[Plot: the four XOR points in the (X1, X2) plane, with Y = 1 at (0, 1) and (1, 0), and Y = 0 at (0, 0) and (1, 1).]

Truth Table:
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0

No one straight line neatly divides the points into disjoint regions where Y = 0 and Y = 1
XOR: 3 Neurons, 2 Layers

[Diagram: 2 inputs (X1, X2) → Layer 1 → Layer 2 → output, alongside the XOR truth table]

“Learning” XOR

Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function
1-Neuron Regression

[Diagram: inputs → affine transformation → y = Wx + b]

Regression could be learnt by a single neuron using a single, linear operation

Adding an Activation Function

[Diagram: inputs → affine transformation Wx + b → activation function max(Wx + b, 0)]

XOR, a simple non-linear function, can be learnt by 3 neurons if we add an appropriate activation function

The activation function is needed for the neural network to predict non-linear functions

The most common form of the activation function is the ReLU

ReLU: Rectified Linear Unit

ReLU(x) = max(0, x)
XOR: 3 Neurons, 2 Layers

[Diagram: 2 inputs (X1, X2) → Layer 1 → Layer 2 → output, alongside the XOR truth table]
3-Neuron XOR

[Diagram: Inputs X1 and X2 feed two neurons in Layer 1. Neuron #1 applies an affine transformation (weights W1 and W3, bias b1) followed by a ReLU activation, max(x, 0). Neuron #2 applies an affine transformation (weights W2 and W4, bias b2) followed by a ReLU. Their outputs feed Neuron #3 in Layer 2, which applies an affine transformation (weights W5 and W6, bias b3) followed by the identity activation function. Information only “feeds forward” from the inputs, through Layer 1 and Layer 2, to the output.]

The most common form of the activation function is the ReLU

ReLU: Rectified Linear Unit

ReLU(x) = max(0, x)

“2-Layer Feed-forward Neural Network”
XOR: 3 Neurons, 2 Layers

[Diagram: 2 inputs (X1, X2) → Layer 1 → Layer 2 → output, alongside the XOR truth table]
Operation of a Single Neuron

[Diagram: inputs X1, …, Xn, weights W1, …, Wn, bias b → affine transformation Wx + b → activation function max(Wx + b, 0)]

Each neuron has weights and a bias that must be calculated by the training algorithm (done for us by TensorFlow)
Weights and Bias of Neurons #1, #2 and #3

[Diagram: the same 3-neuron network, highlighting in turn the weights and bias of Neuron #1 (W1, W3, b1), Neuron #2 (W2, W4, b2) and Neuron #3 (W5, W6, b3).]

The weights and biases of individual neurons are determined during the training process
Weights and Bias of Neuron #1: W1 = 1, W3 = 1, b1 = 0

Weights and Bias of Neuron #2: W2 = 1, W4 = 1, b2 = -1

Weights and Bias of Neuron #3: W5 = 1, W6 = -2, b3 = 0

Truth Table:
X1  X2  Y
0   0   0
0   1   1
1   0   1
1   1   0

“Learning” XOR

Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function
3-Neuron XOR: Working Through the Truth Table

With W1 = W2 = W3 = W4 = W5 = 1, W6 = -2, b1 = b3 = 0 and b2 = -1:

- X1 = 0, X2 = 0: Neuron #1 outputs ReLU(0) = 0, Neuron #2 outputs ReLU(-1) = 0, Neuron #3 outputs 1·0 - 2·0 = 0
- X1 = 0, X2 = 1: Neuron #1 outputs ReLU(1) = 1, Neuron #2 outputs ReLU(0) = 0, Neuron #3 outputs 1·1 - 2·0 = 1
- X1 = 1, X2 = 0: Neuron #1 outputs ReLU(1) = 1, Neuron #2 outputs ReLU(0) = 0, Neuron #3 outputs 1·1 - 2·0 = 1
- X1 = 1, X2 = 1: Neuron #1 outputs ReLU(2) = 2, Neuron #2 outputs ReLU(1) = 1, Neuron #3 outputs 1·2 - 2·1 = 0

This reproduces the XOR truth table exactly.
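A minimal NumPy sketch of this 2-layer feed-forward network, hard-coding the weights and biases from the slides (W1 = W2 = W3 = W4 = W5 = 1, W6 = -2, b1 = b3 = 0, b2 = -1), reproduces the truth table:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def xor_net(x1, x2):
    # Layer 1: two neurons, affine transformation followed by ReLU
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)    # Neuron #1: W1=1, W3=1, b1=0
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)    # Neuron #2: W2=1, W4=1, b2=-1
    # Layer 2: one neuron, affine transformation with identity activation
    return 1.0 * h1 - 2.0 * h2 + 0.0        # Neuron #3: W5=1, W6=-2, b3=0

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))   # prints 0, 1, 1, 0 - the XOR truth table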
Choice of Activation Function

- Input layers use the identity function as activation: f(x) = x
- Inner hidden layers typically use ReLU as the activation function
- The output layer in our XOR example used the identity function
- The output layer in classification will often use SoftMax
Another very common form of the activation function is the SoftMax

SoftMax(x) outputs a number between 0 and 1

This output can be interpreted as a probability

def XOR(x1, x2):
    if x1 == x2:
        return 0
    return 1

“Learning” XOR
Reverse-engineering XOR requires 3 neurons (arranged in 2 layers) as well as a non-linear activation function
XOR: 3 Neurons, 2 Layers

[Diagram: 2 inputs (X1, X2) → Layer 1 → Layer 2 → output, alongside the XOR truth table]

3-Neuron XOR

[Diagram: the 2-layer feed-forward network of three neurons described above]
def doSomethingReallyComplicated(x1, x2, …):
    # … arbitrarily complex logic …
    return complicatedResult

“Learning” Arbitrarily Complex Functions

By adding layers, a neural network can “learn” (reverse-engineer) pretty much anything

Summary

- A neuron is the smallest entity in a neural network
- Linear regression can be learnt by a single neuron
- A more complex function such as XOR requires more neurons
- Combinations of interconnected neurons can “learn” virtually anything
- Training such networks to use the “best” parameter values is vital
Building Linear Regression Models Using TensorFlow
Implementing Regression in TensorFlow

- Baseline: non-TensorFlow implementation, regular Python code
- Computation Graph: neural network of 1 neuron, affine transformation suffices
- Cost Function: Mean Square Error (MSE), quantifying goodness-of-fit
- Optimizer: Gradient Descent optimizers, improving goodness-of-fit
- Training: invoke optimizer in epochs, batch size for each epoch
- Converged Model: values of W and b, compare to baseline


Simple Regression

Cause (independent variable) → Effect (dependent variable)


Implementing Regression in TensorFlow

Baseline

Non-TensorFlow implementation

Regular python code


Regression in Python

- Pandas for dataframes
- NumPy for arrays
- Statsmodels for regression
- Matplotlib for plots
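A minimal sketch of such a non-TensorFlow baseline using statsmodels; the data here is synthetic, standing in for the course's dataset.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data standing in for the course's returns dataset
df = pd.DataFrame({"x": np.arange(10, dtype=float)})
df["y"] = 1.0 + 2.0 * df["x"] + np.random.normal(scale=0.5, size=len(df))

# Ordinary least squares: add_constant supplies the intercept A in y = A + Bx
X = sm.add_constant(df["x"])
model = sm.OLS(df["y"], X).fit()
print(model.params)   # intercept (A) and slope (B)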


Negative Indices in Python

- Forward indices run from 0 to n-1
- Backward indices run from -n to -1

Prices to Returns

[Diagram: an array of n+1 prices is sliced into Prices[:-1] (the first n elements) and Prices[1:] (the last n elements); dividing them element-wise and subtracting 1 gives the n returns.]

Returns = Prices[:-1] / Prices[1:] - 1
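A small NumPy sketch of this prices-to-returns step. Note that the slicing shown assumes the price series is ordered newest-first; for an oldest-first series the two slices swap roles (Prices[1:] / Prices[:-1] - 1).

import numpy as np

# Prices ordered newest-first, as the slide's slicing assumes
prices = np.array([105.0, 102.0, 100.0, 98.0])

returns = prices[:-1] / prices[1:] - 1
print(returns)   # each period's return relative to the period before it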


ML-based Regression Model

[Diagram: Corpus of points (x1, y1), (x2, y2), …, (xn, yn) → regression algorithm → (A, B), i.e. the regression line y = A + Bx]

[Diagram: Corpus of feature vectors [x1], [x2], …, [xn] with labels y1, y2, …, yn → NumPy linear regression → (A, B), i.e. the regression line y = A + Bx]


Reshaping in NumPy

reshape(-1, 1) turns a flat array of n elements, [x1, x2, …, xn], into an n x 1 column of single-element rows, [[x1], [x2], …, [xn]]
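A quick NumPy illustration of this reshape:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,): a flat array of n elements
x_col = x.reshape(-1, 1)             # shape (4, 1): one single-element row per data point
print(x_col)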
Implementing Regression in TensorFlow
Baseline

Non-TensorFlow implementation

Regular python code

Computation Graph
Neural network of 1 neuron

Affine transformation suffices


Regression: The Simplest Neural Net work

Set of Points
Single Neuron Regression Line
Regression: The Simplest Neural Net work
W1
X1
X2
W2 Affine Activation
Wi Transformation Wx + b Function Wx+b

Xi

Wn
Xn b
Regression: The Simplest Neural Net work
W1
X1
X2
W2 Affine Identity
Wi Transformation Wx + b Function Wx+b

Xi

Wn
Xn b
Implementing Regression in TensorFlow

Baseline Cost Function

Non-TensorFlow implementation Mean Square Error (MSE)

Regular python code Quantifying goodness-of-fit

Computation Graph
Neural network of 1 neuron

Affine transformation suffices


The “Best” Regression Line
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

Let’s compare two lines, Line 1 and Line 2


Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

Drop vertical lines from each point to the lines A and B


Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

Drop vertical lines from each point to the lines A and B


Simple Regression
Regression Equation:
y = A + Bx
[y1]       [1]       [x1]       [e1]
[y2]       [1]       [x2]       [e2]
[y3]  = A  [1]  + B  [x3]  +    [e3]
[… ]       [… ]      [… ]       [… ]
[yn]       [1]       [xn]       [en]
Minimising Least Square Error
Y

Line 1: y = A1 + B1x

Line 2: y = A2 + B2x

The “best fit” line is the one where the sum of the squares of
the lengths of these dotted lines is minimum
Minimising Least Square Error
Y

Regression Line: y = A
+ Bx

The “best fit” line is called the regression line


Implementing Regression in TensorFlow

Baseline Cost Function

Non-TensorFlow implementation Mean Square Error (MSE)

Regular python code Quantifying goodness-of-fit

Computation Graph Optimizer


Neural network of 1 neuron Gradient Descent optimizers

Affine transformation suffices Improving goodness-of-fit


Why Choosing Is Complicated

- What do we really want to achieve?
- What is slowing us down?
- What do we really control?

Choosing involves answering complicated questions

Why Optimization Helps

Optimization forces us to mathematically pin down answers to these questions
Framing the Optimization Problem

- Objective Function: what we would like to achieve
- Constraints: what slows us down
- Decision Variables: what we really control

Collectively, these answers constitute the optimization problem

Linear Regression as an Optimization Problem

- Objective Function: minimize the variance of the residuals (MSE)
- Constraints: express the relationship as a straight line, y = Wx + b
- Decision Variables: the values of W and b
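A small NumPy sketch of the objective function (MSE) evaluated for candidate values of W and b; the data and the candidate values are illustrative only.

import numpy as np

def mse(y_actual, y_predicted):
    # Mean Square Error: the variance of the residuals around the fitted line
    return np.mean((y_actual - y_predicted) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

W, b = 2.0, 1.0                  # candidate values of the decision variables
print(mse(y, W * x + b))         # the objective the optimizer drives down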
Minimizing MSE

[Plot: MSE as a function of the parameter b - a bowl-shaped curve. We want the MSE to be as small as possible; the smallest value of MSE sits at the bottom of the bowl.]
“Gradient Descent”
Converging on the “best” value
MSE
using an optimization algorithm

Initial value of MSE

Smallest value of MSE


Minimizing MSE
MSE

Smallest value of MSE


“Training” the Algorithm
“Training Process” = Finding these best
MSE
values

Best value of W

Best value of b
Smallest value of MSE
Start Somewhere Initial values - have to start
somewhere
MSE

Initial value of W
Initial value of b
Initial value of MSE
“Gradient Descent”
Converging on the “best” value
MSE
using an optimization algorithm

Initial value of MSE

Smallest value of MSE


“Learning Rate”
MSE

Change in value of MSE in each


epoch = Learning Rate

Initial value of MSE

Smallest value of MSE


Gradient Descent Optimizers

- tf.train.GradientDescentOptimizer
- tf.train.AdamOptimizer
- tf.train.FtrlOptimizer
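As a sketch of how these pieces fit together, here is a one-neuron regression graph trained with tf.train.GradientDescentOptimizer. This assumes the TensorFlow 1.x API that these optimizer names come from, and uses synthetic data; it is an illustration, not the course's exact script.

import numpy as np
import tensorflow as tf   # assumes the TensorFlow 1.x API

# Synthetic training data roughly on the line y = 2x + 1
x_train = np.linspace(0, 1, 50)
y_train = 2.0 * x_train + 1.0 + np.random.normal(scale=0.05, size=50)

# Computation graph: a single neuron, affine transformation only
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
W = tf.Variable(0.0)
b = tf.Variable(0.0)
y_pred = W * x + b

# Cost function (MSE) and optimizer
loss = tf.reduce_mean(tf.square(y_pred - y))
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(500):                         # invoke the optimizer in epochs
        sess.run(train_step, {x: x_train, y: y_train})
    print(sess.run([W, b]))                          # converged values of W and b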
Start Somewhere
Initial values - have to start
MSE
somewhere

Initial value of W
Initial value of b

Initial value of MSE


Minimizing MSE
MSE

Initial value of MSE

Smallest value of MSE


Regression: The Simplest Neural Net work

Training data
Single Neuron Regression Line
Minimizing MSE
MSE

Training data

Initial value of MSE

Smallest value of MSE


Minimizing MSE
MSE

Batch of training data

Initial value of MSE

Smallest value of MSE


“Epoch”
MSE

1 step towards optimal = 1 epoch

Batch of training data

New value of MSE

Smallest value of MSE


“Batch Size”
MSE

Number of training data points


considered = batch size
Batch of training data

New value of MSE

Smallest value of MSE


“Batch Size”

- Stochastic Gradient Descent: 1 point at a time
- Mini-batch Gradient Descent: some subset in each batch
- Batch Gradient Descent: all training data in each batch
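A small sketch of how the batch size carves the training data into the batches fed to successive optimizer steps (plain NumPy, independent of any particular optimizer):

import numpy as np

x_train = np.arange(10, dtype=float)
batch_size = 4   # mini-batch: some subset per step; 1 -> stochastic, len(x_train) -> batch

for start in range(0, len(x_train), batch_size):
    batch = x_train[start:start + batch_size]
    print(batch)   # each batch would be fed to one optimizer step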
Implementing Regression in TensorFlow

Baseline Cost Function Training


Non-TensorFlow implementation Mean Square Error (MSE) Invoke optimizer in epochs

Regular python code Quantifying goodness-of-fit Batch size for each epoch

Computation Graph Optimizer


Neural network of 1 neuron Gradient Descent optimizers

Affine transformation suffices Improving goodness-of-fit


Minimizing MSE
MSE

Batch of training data

New value of MSE

Smallest value of MSE


Decisions in Training

- Initial values
- Type of optimizer
- Number of epochs
- Batch size

Simple Regression

Cause (independent variable) → Effect (dependent variable)

Example: oil prices → government bond yields. One cause, one effect.

Multiple Regression

Causes (independent variables) → Effect (dependent variable)

Example: oil price changes and the NASDAQ share index → XOM returns. Many causes, one effect.


Implementing Regression in TensorFlow
Baseline Cost Function Training

Non-TensorFlow implementation Mean Square Error (MSE) Invoke optimizer in epochs

Regular python code Quantifying goodness-of-fit Batch size for each epoch

Computation Graph Optimizer Converged Model


Neural network of 1 neuron Gradient Descent optimizers Values of W and b

Affine transformation suffices Improving goodness-of-fit Compare to baseline


Summary

- Implement linear regression without using TensorFlow
- Define a computation graph of just one neuron
- Set up the cost function
- Use various gradient descent optimizers
- Understand the gradient descent process
- Train the model to get a converged regression model


Building Logistic Regression Models Using TensorFlow
Overview

Given causes, predict probability of effects - that’s logistic regression

Linear regression and logistic regression are similar, yet quite different

Logistic regression can be used for categorical y-variables

Logistic regression in TensorFlow differs from linear regression in two ways

- Softmax as the activation function

- cross-entropy as the cost function


Two Approaches to Deadlines

- Start 5 minutes before the deadline: good luck with that
- Start 1 year before the deadline: maybe overkill

Neither approach is optimal


Starting a Year in Advance
Probability of meeting the deadline

100%
Probability of getting other important work done

0%
Starting Five Minutes in Advance
Probability of meeting the deadline

0%
Probability of getting other important work done

100%
The Goldilocks Solution

- Work fast: start very late and hope for the best
- Work smart: start as late as possible while still being sure to make it
- Work hard: start very early and do little else

As usual, the middle path is best


Working Smart
Probability of meeting the deadline

95%
Probability of getting other important work done

95%
Working Hard, Fast, Smart
Probability of (1 year,100%)

meeting
deadline Start 1 year before deadline

100% probability of meeting deadline

Start 5 minutes before deadline

0% probability of meeting deadline


(5 mins,0%)

Time to deadline
Working Hard, Fast, Smart
(1 year,100%)
Probability of
meeting (?,95%)
Work hard
deadline
Work smart

Work fast
(5 mins,0%)

Time to deadline
Working Hard, Fast, Smart
(1 year,100%)
Probability of
meeting (?,95%)
Work hard
deadline
Work smart

Work fast
(5 mins,0%)

Time to deadline
Working Hard, Fast, Smart
(1 year,100%)
Probability of
meeting (?,95%)
deadline Work hard
95% Work smart

Work fast
(5 mins,0%) 11 days

Time to deadline
Working Hard, Fast, Smart
(1 year,100%)
Probability of
meeting (11 days,95%)
deadline Work hard
95%
Work smart

Work fast
(5 mins,0%)

Time to deadline
Working Hard, Fast, Smart
Probability of
meeting
deadline Work hard

Work smart

Work fast

Time to deadline
Logistic Regression helps find how probabilities
are changed by actions
Working Smart with Logistic Regression
Y
100%

Probability

0%
X
Time to deadline
Working Smart with Logistic Regression
Y
100%

Probability
0%

0%
X
Time to deadline
Start too late, and you’ll definitely miss
Working Smart with Logistic Regression
Y
100%

Probability
100%

0%
X
Time to deadline
Start too early, and you’ll definitely make it
Working Smart with Logistic Regression
Y
100%

Probability
<50% >50%

0%
X

Time to deadline
Working smart is knowing when to start
Y-axis: probability of meeting deadline

X-axis: time to deadline

Meeting or missing deadline is binary

Probability curve flattens at ends

- floor of 0

- ceiling of 1
y: hit or miss? (0 or 1?)

x: start time before deadline

p(y) : probability of y = 1
Logistic regression involves finding the “best fit” such
curve
p(yi) = 1 / (1 + e^-(A + Bxi))

- A is the intercept
- B is the regression coefficient
- e is the constant 2.71828…

S-curves are widely studied, well understood

y = 1 / (1 + e^-(A + Bx))

Logistic regression uses the S-curve to estimate probabilities
p(y) = 1 / (1 + e^-(A + Bx))

If A and B are positive:
- as x → +∞, p(y) → 1
- as x → -∞, p(y) → 0

If A and B are negative:
- as x → -∞, p(y) → 1
- as x → +∞, p(y) → 0


Working Hard, Fast, Smart
Probability of (1 year,100%)
meeting (11 days,95%)
Work hard
deadline
Work smart

Work fast
(5 mins,0%)

Time to deadline
Working Hard, Fast, Smart

Minimum value of p(yi)


Working Hard, Fast, Smart

Maximum value of p(yi)


Working Hard, Fast, Smart
Probability of outcome changes with every change in value
of the independent variable
p(y2)

p(y1)

X1
X2

Between maximum and minimum values of p(yi)


Logistic Regression

0 1
Probability of outcome is very sensitive
to changes in cause

p(yi) = 1 / (1 + e^-(A + Bxi))
Categorical and Continuous Variables

- Continuous: can take an infinite set of values (height, weight, income…)
- Categorical: can take a finite set of values (male/female, day of week…)

Categorical variables that can take just two values are called binary variables
Logistic Regression helps estimate how
probabilities of categorical variables are
influenced by causes
Logistic Regression in Classification
Whales: Fish or Mammals

Mammal Fish

Member of the infraorder Cetacea Looks like a fish, swims like a fish, moves
like a fish
Rule-based Binary Classifier

Whale Mammal

Rule-based Classifier

Human Experts
ML-based Binary Classifier

Corpus Classification Algorithm ML-based Classifier


ML-based Binary Classifier

Moves like a fish,


ML-based Predictor Fish
Looks like a fish

Corpus
ML-based Binary Classifier

Breathes like a mammal


ML-based Classifier Mammal
Gives birth like a mammal

Corpus
ML-based Predictor

Corpus Logistic Regression ML-based Predictor


p(yi) = 1 / (1 + e^-(A + Bxi))
ML-based Predictor

Lives in water, breathes


P(mammal) = 0.55
with lungs,does not lay
eggs

Corpus
Applying Logistic Regression
Probability of
animal being
(95%)
fish Lives in water, breathes with gills,
lays eggs
(60%)

Lives in water, breathes with lungs,does not


lay eggs
Lives on land, breathes with lungs,does not lay
eggs
(5%) (40%)

Whales: Fish or Mammals?


Applying Logistic Regression
(50%)
Probability of
animal being
(95%)
fish (80%)

(60%)

(5%) (20%) (40%)

Rule of 50%
Applying Logistic Regression
(50%)
Probability of
animal being (95%)
fish (80%)

(60%)

(5%) (20%) (40%)

If probability < 50%, it’s a mammal


Applying Logistic Regression
(50%)
Probability of
animal being (95%)
fish (80%)

(60%)

(5%) (20%) (40%)

If probability < 50%, it’s a mammal


Applying Logistic Regression

Mammal Fish

Probability of whales being fish < 50%


Applying Logistic Regression

Mammal Fish

Probability of whales being fish > 50%


Logistic Regression and Linear Regression
X Causes Y

Cause Effect

Independent variable Dependent variable


X Causes Y

Cause Effect

Explanatory variable Dependent variable


Linear Regression
Y

Represent all n points as (xi,yi), where i = 1 to n


Linear Regression
Y (x1, y1)

(x2, y2)

(x3, y3)
Regression Line: y =
(xn, yn) A + Bx

Represent all n points as (xi,yi), where i = 1 to n


Logistic Regression
p(y)

y=1

y=0

Represent all n points as (xi,yi), where i = 1 to n


Logistic Regression
p(y)

(x3, y3) (xn, yn)

Regression Curve
1
p(y) =
1 + e-(A+Bx)
(x1, y1)
(x2, y2)
X

Represent all n points as (xi,yi), where i = 1 to n


Similar, yet Different

Linear Regression Logistic Regression


Given causes, predict effect Given causes, predict probability of effect

y
p(y)

x x
Similar, yet Different

Linear Regression Logistic Regression


Effect variable (y) must be continuous Effect variable (y) must be categorical

y p(y)

x x
Similar, yet Different

Linear Regression Logistic Regression


Cause variables (x) can be continuous or Cause variables (x) can be continuous or
categorical categorical

y p(y)

x x
Similar, yet Different

Linear Regression Logistic Regression


Connect the dots with a straight line Connect the dots with an S-curve

y p(y)

x x
Similar, yet Different

- Linear Regression: yi = A + Bxi
- Logistic Regression: p(yi) = 1 / (1 + e^-(A + Bxi))

In both cases, the objective of regression is to find the A and B that “best fit” the data
Similar, yet Different

- Linear Regression: yi = A + Bxi. The relationship is already linear (by assumption).
- Logistic Regression: ln( p(yi) / (1 - p(yi)) ) = A + Bxi. The relationship can be made linear (by log transformation).

- Linear Regression: yi = A + Bxi
- Logistic Regression: logit(p) = A + Bxi, where logit(p) = ln( p / (1 - p) )

In both cases, the regression problem can be solved using cookie-cutter solvers
Logistic Regression
p(y)

y=1

y=0

Represent all n points as (xi,yi), where i = 1 to n


Logistic Regression
p(y)
(x3, y3) (xn, yn)

Regression Curve
1
p(y) =
1+e -(A+Bx)

(x1, y1)

(x2, y2)
X

Represent all n points as (xi,yi), where i = 1 to n


Linear Regression
y = A + Bx

y1 = A + Bx1
y2 = A + Bx2
y3 = A + Bx3
… …
yn = A + Bxn
Logistic Regression

p(y) = 1 / (1 + e^-(A + Bx))

p(yi) = 1 / (1 + e^-(A + Bxi))

p(y1) = 1 / (1 + e^-(A + Bx1))
…
p(yn) = 1 / (1 + e^-(A + Bxn))
Logistic Regression
p(y)

y=1

y=0

Represent all n points as (xi,yi), where i = 1 to n


Logistic Regression
p(y)

(x3, y3) (xn, yn)

Regression Curve
1
p(y) =
1+e -(A+Bx)

(x1, y1)
(x2, y2)
X

Represent all n points as (xi,yi), where i = 1 to n


Logistic Regression

Regression Equation:

p(yi) = 1 / (1 + e^-(A + Bxi))

Solve for A and B that “best fit” the data


Odds from Probabilities

Odds(p) = p / (1 - p)

Odds of an Event

p = 1 / (1 + e^-(A + Bx))

p = e^(A + Bx) / (1 + e^(A + Bx))

1 - p = 1 - e^(A + Bx) / (1 + e^(A + Bx))

1 - p = (1 + e^(A + Bx) - e^(A + Bx)) / (1 + e^(A + Bx))

1 - p = 1 / (1 + e^(A + Bx))

Odds(p) = p / (1 - p) = e^(A + Bx)

Logit Is Linear

ln(Odds(p)) is called the logit function

logit(p) = ln Odds(p) = ln(p) - ln(1 - p) = A + Bx

This is a linear function!


Logistic Regression can be solved via linear
regression on the logit function (log of the odds
function)
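A small NumPy sketch of the S-curve and the logit transformation that linearises it; A and B are illustrative values.

import numpy as np

A, B = 0.5, 2.0                      # illustrative intercept and regression coefficient

def p_of_y(x):
    # The S-curve: p(y) = 1 / (1 + e^-(A + Bx))
    return 1.0 / (1.0 + np.exp(-(A + B * x)))

def logit(p):
    # logit(p) = ln(p / (1 - p)); applying it to the S-curve recovers A + Bx
    return np.log(p / (1.0 - p))

x = np.array([-2.0, 0.0, 2.0])
p = p_of_y(x)
print(p)             # probabilities between 0 and 1
print(logit(p))      # equals A + Bx: a linear function of x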
Logistic Regression in TensorFlow
Logistic Regression

Cause Effect

Changes in S&P 500 Changes in price of Google Stock


Logistic Regression

y = Returns on x = Returns on
Google stock S&P 500
(GOOG) (S&P500)
Logistic Regression

p(yi) = 1 / (1 + e^-(A + Bxi))

p(yi) = probability of Google going up in month i; xi = return on the S&P 500 for that month
Logistic Regression

Predicted labels: probability >= 0.5 → Up → True; probability < 0.5 → Down → False

Set up the Problem

GOOG returns → labels: return > 0% → Up → 1; return <= 0% → Down → 0

Label GOOG returns as binary (1, 0)


Prediction Accuracy
DATE ACTUAL PREDICTED
2005-01-01 NA NA
2005-02-01 0 1
2005-03-01 0 0

2017-01-01 1 1
2017-02-01 1 1

Compare GOOG’s actual labels vs. predicted labels


Linear Regression in TensorFlow

- Baseline: non-TensorFlow implementation, regular Python code
- Computation Graph: neural network of 1 neuron, affine transformation suffices
- Cost Function: Mean Square Error (MSE), quantifying goodness-of-fit
- Optimizer: Gradient Descent optimizers, improving goodness-of-fit
- Training: invoke optimizer in epochs, batch size for each epoch
- Converged Model: values of W and b, compare to baseline

Logistic Regression in TensorFlow

- Baseline: non-TensorFlow implementation, regular Python code
- Computation Graph: neural network of 1 neuron, Softmax activation required
- Cost Function: Cross Entropy, similarity of distributions
- Optimizer: Gradient Descent optimizers, improving goodness-of-fit
- Training: invoke optimizer in epochs, batch size for each epoch
- Converged Model: values of W and b, compare to baseline


Linear Regression with One Neuron

[Diagram: Set of points → single neuron → regression line. Inputs X1, …, Xn with weights W1, …, Wn and bias b pass through an affine transformation Wx + b and an identity activation function, so the output is simply Wx + b.]
Logistic Regression with One Neuron

[Diagram: Inputs X1, …, Xn with weights W1 and bias b1 pass through an affine transformation, x' = W1x + b1. x' then feeds a Softmax function with its own parameters W2 and b2, which outputs p(Y = True) = 1 / (1 + e^-(W2x' + b2)) and p(Y = False) = 1 / (1 + e^(W2x' + b2)) = 1 - p(Y = True).]
SoftMax for True/False Classification

[Diagram: x → Softmax function → p(Y = True) = 1 / (1 + e^-(Wx + B)) and p(Y = False) = 1 / (1 + e^(Wx + B))]
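A sketch of how a two-output softmax produces this pair of probabilities (plain NumPy; the logit value is illustrative). With logits [z, 0], the softmax reduces to exactly the sigmoid form shown above.

import numpy as np

def softmax(logits):
    # Exponentiate and normalise so the outputs sum to 1
    e = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return e / e.sum()

z = 1.2
print(softmax(np.array([z, 0.0])))   # [p(Y = True), p(Y = False)], summing to 1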
Linear Regression with One Neuron

- 1-dimensional feature vector: Shape(W) = [1, 1], Shape(b) = [1] → regression line

Logistic Regression with One Neuron

- 1-dimensional feature vector: Shape(W) = [1, 2], Shape(b) = [2] → S-curve

SoftMax for Digit Classification

[Diagram: Softmax function → P(Y = 0), P(Y = 1), …, P(Y = 9)]

- 1-dimensional feature vector: Shape(W) = [1, 10], Shape(b) = [10]

SoftMax N-category Classification

[Diagram: Softmax function → P(Y = Y1), P(Y = Y2), …, P(Y = YN)]

- 1-dimensional feature vector: Shape(W) = [1, N], Shape(b) = [N]
- M-dimensional feature vector: Shape(W) = [M, N], Shape(b) = [N]


Logistic Regression in TensorFlow

Baseline Cost Function Training

Non-TensorFlow implementation Cross Entropy Invoke optimizer in epochs

Regular python code Similarity of distribution Batch size for each epoch

Computation Graph Optimizer Converged Model


Neural network of 1 neuron Gradient Descent optimizers Values of W and b

Softmax activation required Improving goodness-of-fit Compare to baseline


Logistic Regression in TensorFlow

Baseline Cost Function Training

Non-TensorFlow implementation Cross Entropy Invoke optimizer in epochs

Regular python code Similarity of distribution Batch size for each epoch

Computation Graph Optimizer Converged Model


Neural network of 1 neuron Gradient Descent optimizers Values of W and b

Softmax activation required Improving goodness-of-fit Compare to baseline


Linear Regression and MSE
Y

Regression Line: y =
A + Bx

The “best fit” line is called the regression line


Logistic Regression

>,= 0.5 Up True

Predicted
Predicted labels
Labels
< 0.5 Down False
Set up the Problem

> 0% Up 1

GOOG
Labels
Returns

<= 0% Down 0

Label GOOG returns as binary (1,0)


Prediction Accuracy
DATE ACTUAL PREDICTED

2005-01-01 NA NA

2005-02-01 0 1
2005-03-01 0 0

2017-01-01 1 1
2017-02-01 1 1

Compare GOOG’s actual labels vs. predicted labels


Intuition: Low Cross Entropy

[Diagram: the Yactual and Ypredicted label series line up.]

The labels of the two series are in-synch

Cross Entropy = -Sum( Yactual * log[ Ypredicted ] ) will be small

Intuition: High Cross Entropy

[Diagram: the Yactual and Ypredicted label series do not line up.]

The labels of the two series are out-of-synch

Cross Entropy = -Sum( Yactual * log[ Ypredicted ] ) will be large
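A small NumPy sketch of this quantity for an in-synch and an out-of-synch set of predictions; the one-hot labels and predicted probabilities are illustrative.

import numpy as np

def cross_entropy(y_actual, y_predicted):
    # -Sum( Yactual * log[ Ypredicted ] ), averaged over the rows
    return -np.mean(np.sum(y_actual * np.log(y_predicted), axis=1))

y_actual = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)   # one-hot actual labels

in_synch  = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]])   # agrees with labels -> small
out_synch = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.8]])   # disagrees with labels -> large

print(cross_entropy(y_actual, in_synch))
print(cross_entropy(y_actual, out_synch))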
Logistic Regression in TensorFlow

Baseline Cost Function Training

Non-TensorFlow implementation Cross Entropy Invoke optimizer in epochs

Regular python code Similarity of distribution Batch size for each epoch

Computation Graph Optimizer Converged Model


Neural network of 1 neuron Gradient Descent optimizers Values of W and b

Softmax activation required Improving goodness-of-fit Compare to baseline


Logistic Regression in TensorFlow

Baseline Cost Function Training

Non-TensorFlow implementation Cross Entropy Invoke optimizer in epochs

Regular python code Similarity of distribution Batch size for each epoch

Computation Graph Optimizer Converged Model


Neural network of 1 neuron Gradient Descent optimizers Values of W and b

Softmax activation required Improving goodness-of-fit Compare to baseline


tensorflow.argmax(y, 1)

Finding the index of the largest element

Return the index of the largest element of tensor y along dimension k - here the tensor is y and the dimension is 1
tf.argmax example

Tensor y, with its values laid out along dimension 1:

Index (dimension 1):   0    1    2    3     4    5
Value:                 5    15   12   100   74   33

The largest value, 100, sits at index 3, so tf.argmax(y, 1) returns 3.

In general, for a 2-dimensional tensor y whose rows run along dimension 0 (index 0 … M) and whose columns run along dimension 1, tf.argmax(y, 1) returns, for each row, the index of that row's largest element.
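A minimal TensorFlow 1.x sketch of the same example (the tensor values mirror the table above):

import tensorflow as tf

y = tf.constant([[5, 15, 12, 100, 74, 33]])   # one row of six values: shape [1, 6]

largest_index = tf.argmax(y, 1)               # index of the largest element along dimension 1

with tf.Session() as sess:
    print(sess.run(largest_index))            # [3] -- the largest value, 100, sits at index 3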
tf.equal(tf.argmax(y_,1), tf.argmax(y,1))

Two invocations of tf.argmax: once on the actual labels y_ (one-hot vectors), once on the predicted values y.
One-hot Representation of the Actual Labels (y_)

Label Vector    One-hot Label Vector (TRUE, FALSE)
TRUE            1  0
FALSE           0  1
FALSE           0  1
TRUE            1  0

argmax(y_, 1): the index of the one-hot element

One-hot Label Vector    argmax(y_, 1)
1  0                    0
0  1                    1
0  1                    1
1  0                    0
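A small sketch of that round trip, assuming TRUE is encoded as class index 0 and FALSE as class index 1 (tf.one_hot builds the one-hot vectors, tf.argmax recovers the indices):

import tensorflow as tf

labels    = tf.constant([0, 1, 1, 0])     # TRUE, FALSE, FALSE, TRUE (assumed encoding)
y_        = tf.one_hot(labels, depth=2)   # [[1,0], [0,1], [0,1], [1,0]]
recovered = tf.argmax(y_, 1)              # back to [0, 1, 1, 0]

with tf.Session() as sess:
    print(sess.run(y_))
    print(sess.run(recovered))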
y: Predicted Probabilities (Softmax Output)

Probability        P(TRUE)   P(FALSE)
P(TRUE) = 0.70     0.70      0.30
P(TRUE) = 0.44     0.44      0.56
P(TRUE) = 0.34     0.34      0.66
P(TRUE) = 0.84     0.84      0.16

Each row of the softmax output sums to 1.
Rule of 50% in Binary Classification

If the probability of a whale being a Fish is below 50%, the whale is classified as a Mammal.
argmax(y, 1) on the Softmax Output

P(TRUE)   P(FALSE)   argmax(y, 1)
0.70      0.30       0
0.44      0.56       1
0.34      0.66       1
0.84      0.16       0
y_: One-hot Vectors with Digit Classes

Actual Digit    One-hot Label Vector (columns 0, 1, …, 9)
0               1  0  …  0
1               0  1  …  0
…
9               0  0  …  1
argmax(y_, 1)

One-hot Label Vector (columns 0, 1, …, 9)    argmax(y_, 1)
1  0  …  0                                   0
0  1  …  0                                   1
0  0  …  1                                   9
y: Predicted Probabilities (Digit Classification)

Each row of the softmax output holds P(X=0), P(X=1), …, P(X=9). argmax(y, 1) returns the column index of the largest probability in each row, i.e. the predicted digit (0, 1, …, 9).
tf.equal(tf.argmax(y_,1), tf.argmax(y,1))

Two invocations of tf.argmax: one on the tensor of actual labels y_, one on the tensor of predicted values y.
tf.equal compares the two resulting index tensors element by element and returns a list of True/False values:
True: correct prediction
False: incorrect prediction
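Putting the pieces together, a minimal TensorFlow 1.x sketch of the accuracy calculation; y_ stands for the one-hot actual labels and y for the softmax output, both filled with illustrative values:

import tensorflow as tf

y_ = tf.constant([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])                  # one-hot actual labels
y  = tf.constant([[0.70, 0.30], [0.44, 0.56], [0.66, 0.34], [0.84, 0.16]])  # softmax output

correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1))    # [True, True, False, True]
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))  # fraction of correct predictions

with tf.Session() as sess:
    print(sess.run(accuracy))   # 0.75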
Building Generalized Linear Models Using Estimators

Overview

Estimators are cookie-cutter TensorFlow APIs for many standard problems
These high-level APIs reside in tf.learn and tf.contrib.learn
Estimators can be extended by plugging custom models into a base class
That extension relies on composition rather than inheritance
How Estimators Work

Feature Vector: each feature is described by its name, type, and dimensions
Estimator Object: built on the tf.contrib.learn.Estimator base class
Input Function: supplies the feature names, the y variable, the batch size, and the number of epochs

Inside the estimator: instantiate the optimizer, fetch the training data, define the cost function, run the optimisation, and return the trained model.
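A minimal sketch of such an input function using tf.contrib.learn.io.numpy_input_fn; the single feature name "x" and the data values are assumptions for illustration:

import numpy as np
import tensorflow as tf

x_train = np.array([1., 2., 3., 4.])      # illustrative feature data
y_train = np.array([0., -1., -2., -3.])   # illustrative y variable

# The input function bundles the feature names, the y variable, the batch size and the epochs
input_fn = tf.contrib.learn.io.numpy_input_fn(
    {"x": x_train},    # feature name -> feature data
    y_train,           # y variable
    batch_size=4,
    num_epochs=1000)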
Complex Neural Networks

A corpus of images flows through the network as pixels and processed groups of pixels, with operations as the nodes and data as the edges, ending in an ML-based classifier.
Neurons in a neural network can be connected in very complex ways…
All those interconnections represent intermediate feature vector data!
Linear Regression in TensorFlow

Baseline: Non-TensorFlow implementation; regular Python code
Cost Function: Mean Square Error (MSE); quantifying goodness-of-fit
Training: Invoke optimizer in epochs; batch size for each epoch
Computation Graph: Neural network of 1 neuron; affine transformation suffices
Optimizer: Gradient Descent optimizers; improving goodness-of-fit
Converged Model: Values of W and b; compare to baseline

Logistic Regression in TensorFlow

Baseline: Non-TensorFlow implementation; regular Python code
Cost Function: Cross Entropy; similarity of distribution
Training: Invoke optimizer in epochs; batch size for each epoch
Computation Graph: Neural network of 1 neuron; softmax activation required
Optimizer: Gradient Descent optimizers; improving goodness-of-fit
Converged Model: Values of W and b; compare to baseline
Linear Regression with an Estimator

Baseline: Non-TensorFlow implementation; regular Python code
Input Function: tf.contrib.learn.io.numpy_input_fn; set up X, Y, batch_size, num_epochs
Instantiate Estimator: tf.contrib.learn.LinearRegressor; abstracts cost and optimizer choices
Fit: returns the trained model; can re-specify the number of training steps
Evaluate: use the trained model; predict new points (test data)
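A minimal end-to-end sketch of that workflow with the TensorFlow 1.x contrib APIs; the feature name "x" and the data are illustrative assumptions:

import numpy as np
import tensorflow as tf

# Baseline data: one feature "x" and a target y
x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])

# Instantiate the estimator: LinearRegressor abstracts the cost function and optimizer choices
features = [tf.contrib.layers.real_valued_column("x", dimension=1)]
estimator = tf.contrib.learn.LinearRegressor(feature_columns=features)

# Input function: sets up X, Y, batch_size and num_epochs
input_fn = tf.contrib.learn.io.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=1000)

# Fit: returns the trained model; the number of training steps can be re-specified here
estimator.fit(input_fn=input_fn, steps=1000)

# Evaluate: use the trained model (here, on the training data itself)
print(estimator.evaluate(input_fn=input_fn))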
Course Outline

Learning using Neurons
Linear Regression in TensorFlow
Logistic Regression in TensorFlow
Estimators
“Representation”: ML-based systems figure out by themselves what features to pay attention to.

“Learning” Regression: y = Wx + b
Regression can be reverse-engineered by a single neuron.

Regression: The Simplest Neural Network
A set of points feeds a single neuron, which learns the regression line.
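A minimal TensorFlow 1.x sketch of that single neuron: an affine transformation y = Wx + b trained with mean square error and gradient descent (the data points are illustrative):

import numpy as np
import tensorflow as tf

# Illustrative set of points lying roughly on the line y = 2x + 1
x_data = np.array([[0.], [1.], [2.], [3.]], dtype=np.float32)
y_data = np.array([[1.1], [2.9], [5.2], [6.8]], dtype=np.float32)

x  = tf.placeholder(tf.float32, [None, 1])
y_ = tf.placeholder(tf.float32, [None, 1])

W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b                          # the single neuron: an affine transformation

cost = tf.reduce_mean(tf.square(y - y_))         # mean square error
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(2000):
        sess.run(train_step, feed_dict={x: x_data, y_: y_data})
    print(sess.run([W, b]))                      # converged values of W and b, near 2 and 1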
def XOR(x1, x2):
    if x1 == x2:
        return 0
    return 1

“Learning” XOR
The XOR function can be reverse-engineered using 3 neurons arranged in 2 layers.
XOR: 3 Neurons, 2 Layers

2 Inputs (X1, X2) feed Layer 1, whose outputs feed Layer 2, which produces the output.

Truth Table
X1   X2   Y
0    0    0
0    1    1
1    0    1
1    1    0
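One hand-built arrangement that satisfies this claim is sketched below; the specific weights, thresholds, and step activation are assumptions for illustration, not the network the course trains:

def step(z):
    # Step activation: the neuron fires (outputs 1) once its weighted sum is above zero
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    h1 = step(1 * x1 + 1 * x2 - 0.5)     # Layer 1, neuron 1: OR(x1, x2)
    h2 = step(1 * x1 + 1 * x2 - 1.5)     # Layer 1, neuron 2: AND(x1, x2)
    return step(1 * h1 - 1 * h2 - 0.5)   # Layer 2, neuron 3: OR but not AND -> XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))   # reproduces the truth table above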
def doSomethingReallyComplicated(x1, x2, …):
    …
    return complicatedResult

“Learning” Arbitrarily Complex Functions
By adding layers, a neural network can “learn” (reverse-engineer) pretty much anything.
Arbitrarily Complex Function

A corpus of images flows through the network as pixels and processed groups of pixels: the operations are the nodes, the data flowing between them are the edges, and the result is an ML-based classifier.
Logistic Regression Using Estimators