
Notes on Neural Network Programming: Logistic Regression and Notation

Setup

1. Introduction to Neural Network Programming Basics

Neural network programming involves specific techniques that simplify computations and
optimize performance.
Training Set Processing: Avoid explicit for-loops over all examples; instead, process the
training set as a whole (using matrix operations).
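As a minimal sketch of the idea (assuming NumPy; array names and sizes are illustrative), the explicit loop and the matrix version below compute the same values, but only the second processes the training set as a whole:

import numpy as np

n_x, m = 4, 5                        # illustrative feature and example counts
X = np.random.randn(n_x, m)          # each column is one training example
w = np.random.randn(n_x, 1)          # weight vector
b = 0.0                              # bias

# Explicit for-loop over examples (the pattern to avoid)
z_loop = np.zeros((1, m))
for i in range(m):
    z_loop[0, i] = np.dot(w[:, 0], X[:, i]) + b

# Whole-training-set version: one matrix operation
z_vec = np.dot(w.T, X) + b

assert np.allclose(z_loop, z_vec)    # identical results, no per-example loop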

2. Forward and Backward Propagation

Forward Propagation: Computes predictions based on the input features.


Backward Propagation: Updates the model’s parameters (e.g., weights) to reduce
prediction error.

3. Logistic Regression for Binary Classification

Goal: Classify an image as a cat (1) or not a cat (0).


Image Representation: Store RGB pixel intensities in a feature vector ( x ).

Example: A 64×64 image with three color channels has (n_x = 64 × 64 × 3 = 12,288) features.

4. Key Notations and Definitions (from attached notation guide)

(m): Number of examples in the dataset.

(n_x): Input feature dimension.
(n_y): Output dimension (1 for binary classification).
Superscript Notations:

(x^(i)): The i-th training example's feature vector.
(y^(i)): Label for the i-th training example (0 or 1).

5. Notation for Dataset and Matrices

Input Matrix (X): Defined by stacking all feature vectors (x^(i)) as columns, forming (X ∈ R^(n_x × m)).

In Python, X.shape would output (n_x, m).

Label Matrix (Y): Stack the labels (y^(i)) into a single row vector, forming (Y ∈ R^(1 × m)).

In Python, Y.shape outputs (1, m).


Notation for Training and Test Sets:

(m_train): Number of training examples.

(m_test): Number of test examples.
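A small sketch of these conventions in NumPy (the dimensions are illustrative, not from the notes):

import numpy as np

n_x, m = 12288, 209                        # illustrative sizes
X = np.random.randn(n_x, m)                # x^(i) vectors stacked as columns
Y = np.random.randint(0, 2, size=(1, m))   # labels y^(i) in a single row

print(X.shape)   # (12288, 209), i.e. (n_x, m)
print(Y.shape)   # (1, 209), i.e. (1, m)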

6. Forward Propagation Equation (from notation guide)


General Activation Formula: a^[l] = g^[l](W^[l] x^(i) + b^[l]) = g^[l](z^[l])

Cost Function for Binary Classification: J_CE(y, ŷ) = −∑_{i=1}^{m} y^(i) log(ŷ^(i))
Notes on Logistic Regression for Binary Classification

Overview of Logistic Regression

Logistic Regression is used for binary classification (outputs are either 0 or 1).
Example: Given an image as input (represented as a feature vector (x)), we want to classify it as either a "cat" (1) or "not a cat" (0).

Objective

Given an input (x), we aim to predict the probability (ŷ) that (y = 1) (i.e., the probability that the image is a "cat").
(ŷ) should be in the range [0, 1].

Model Parameters

Input Feature Vector (x): An (n_x)-dimensional vector.

Parameters:

Weights (W): An (n_x)-dimensional vector.

Bias (b): A single real number.

Prediction Function for Logistic Regression

Linear regression-like prediction, (W^T x + b), does not constrain (ŷ) between 0 and 1, which is necessary for a probability.
Sigmoid Activation: Logistic regression applies the sigmoid function to convert the linear output into a probability: (ŷ = σ(z) = 1/(1 + e^(−z))), where (z = W^T x + b).

Sigmoid Function Characteristics:

For large positive ( z ): (σ(z) ≈ 1 ).


For large negative ( z ): (σ(z) ≈ 0 ).
For (z = 0) : (σ(z) = 0.5 ).
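These three characteristics are easy to check with a direct implementation (a minimal sketch in plain Python):

import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(10.0))    # ~0.99995: large positive z gives sigma(z) close to 1
print(sigmoid(-10.0))   # ~0.00005: large negative z gives sigma(z) close to 0
print(sigmoid(0.0))     # exactly 0.5 at z = 0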

Alternative Notational Convention

Some approaches use an augmented feature vector (X) with an additional component (X_0 = 1), making (X ∈ R^(n_x + 1)).

This way, a single parameter vector (θ) represents both (W) and (b), where:

(θ_0) represents the bias term (b).

(θ_1, θ_2, …, θ_(n_x)) represent the weights (W).

Current Convention: We separate (W) and (b) to simplify implementation, especially in neural networks.

Notes on Logistic Regression: Loss and Cost Functions


Recap on Logistic Regression Model

Logistic regression predicts a binary outcome (y ) (1 for cat, 0 for not-cat) based on input
features (x ).
Prediction: (ŷ = σ(z) = σ(W^T x + b) = 1/(1 + e^(−z))), where (σ(z)) is the sigmoid function.

Goal

Train parameters (W) and (b) to make each prediction (ŷ^(i)) as close as possible to the actual label (y^(i)) for every training example.

Notation for Training Examples

(ŷ^(i)): Prediction for the i-th example.

(z^(i) = W^T x^(i) + b): Linear combination of inputs and parameters for the i-th example.

Superscript ((i)) denotes the i-th training example.

Loss Function for Logistic Regression


The loss function (L(ŷ, y)) measures the error for a single example.
We avoid using squared error, as it makes optimization difficult in logistic regression due to non-convexity, which leads to multiple local optima.
Binary Cross-Entropy Loss: L(ŷ, y) = −(y log(ŷ) + (1 − y) log(1 − ŷ))
Intuition:

When (y = 1):

Loss reduces to (−log(ŷ)), encouraging (ŷ) to be close to 1.

When (y = 0):

Loss reduces to (−log(1 − ŷ)), encouraging (ŷ) to be close to 0.
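Both cases can be verified with a direct implementation of the loss (a minimal sketch; the probe values 0.99 and 0.01 are illustrative):

import math

def loss(y_hat, y):
    # L(y_hat, y) = -(y*log(y_hat) + (1 - y)*log(1 - y_hat))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(loss(0.99, 1))   # ~0.01: y = 1 and y_hat near 1, small loss
print(loss(0.01, 1))   # ~4.61: y = 1 but y_hat near 0, large loss
print(loss(0.99, 0))   # ~4.61: y = 0 but y_hat near 1, large loss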

Cost Function for Logistic Regression


Cost Function (J(W, b)): Measures the model's average error across the entire training set:
J(W, b) = −(1/m) ∑_{i=1}^{m} (y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)))

This cost function averages the individual losses from all training examples.
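In code, that averaging is a single line over vectors of labels and predictions (a sketch with made-up numbers):

import numpy as np

Y     = np.array([[1,   0,   1,   1  ]])     # illustrative labels, shape (1, m)
Y_hat = np.array([[0.9, 0.2, 0.7, 0.6]])     # illustrative predictions

# J = -(1/m) * sum of per-example losses
cost = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
print(cost)   # average binary cross-entropy over the m = 4 examples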

Summary of Terms and Usage


Loss Function (L(ŷ, y)): Error on a single training example.
Cost Function (J(W, b)): Overall error across the entire training set.

Training Objective

Find parameters (W ) and (b ) that minimize the cost function (J (W , b) ), thus improving
prediction accuracy for binary classification.
Notes on Gradient Descent for Training Logistic Regression

Overview

To train the logistic regression model, we aim to learn the parameters (W ) and (b ) that
minimize the cost function (J (W , b) ).
Cost Function (J (W , b) ): Measures how well the model's predictions align with actual
labels across the training set.

Gradient Descent for Logistic Regression

1. Goal: Minimize (J (W , b) ) to find the optimal parameters (W ) and (b ).

2. Gradient Descent Visualization:

The horizontal axes represent (W ) and (b ).


The surface above these axes represents the cost function (J (W , b) ).
Convexity: The cost function (J (W , b) ) is convex (a single "bowl" shape), so
gradient descent is guaranteed to converge to a global minimum.

3. Algorithm Initialization:

Start with initial values for (W ) and (b ) (e.g., zero).


Take steps in the direction of steepest descent (downhill).

4. Gradient Descent Iteration:

Each iteration moves the parameters towards the minimum.


The learning rate (α ) controls the step size in each iteration.

Gradient Descent Update Equations


For a single parameter (w), the gradient descent update is: w := w − α · dJ(w)/dw

(α): Learning rate.
(dJ(w)/dw): Derivative of (J) with respect to (w), indicating the slope.

5. Applying Gradient Descent to Multiple Parameters

Since (J(W, b)) is a function of both (W) and (b), the updates are:
W := W − α · ∂J(W, b)/∂W
b := b − α · ∂J(W, b)/∂b

Here, the symbol (∂) (partial derivative) is used because (J(W, b)) depends on multiple variables.

Coding Convention

In code, we use:

(dW): Represents (∂J(W, b)/∂W).
(db): Represents (∂J(W, b)/∂b).
Updates are implemented as:

W = W - alpha * dW
b = b - alpha * db
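To see the convention in action, here is a toy run on a one-parameter convex cost (a stand-in for J, not the logistic cost itself):

# Gradient descent on the toy cost J(w) = (w - 3)^2, with dJ/dw = 2*(w - 3);
# the global minimum is at w = 3.
alpha = 0.1
w = 0.0
for _ in range(100):
    dw = 2 * (w - 3)      # positive slope -> step left, negative -> step right
    w = w - alpha * dw
print(w)                  # ~3.0: converges to the global minimum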

Intuition Behind Gradient Descent Steps


The derivative (∂J(W, b)/∂W) gives the slope of (J) at the current (W) and (b) values:

Positive Slope: Move left (decrease (W )).


Negative Slope: Move right (increase (W )).

Gradient descent iteratively adjusts (W ) and (b) to reduce (J ), ideally converging to the
global minimum.

Key Takeaways
Gradient Descent minimizes the cost function by adjusting parameters in the steepest
descent direction.
Convexity of (J (W , b)) guarantees convergence to a global minimum, making logistic
regression a stable training algorithm.
Learning Rate (α ): A crucial hyperparameter that controls step size; tuning (α ) affects
convergence speed and stability.

Notes on Intuitive Understanding of Derivatives in Calculus

Purpose of Derivatives in Deep Learning

Derivatives are crucial in understanding gradient descent, which is used to optimize


parameters in neural networks.
A deep knowledge of calculus isn’t necessary for deep learning; a basic, intuitive
understanding of derivatives (slopes) suffices.

Derivative as Slope

Definition: A derivative represents the slope of a function at a specific point.


Slope (or derivative) indicates how much the function (f (a)) changes when you make a
small change in the input (a).

Example with a Straight Line

Consider the function (f (a) = 3a ).

At (a = 2 ):

(f (a) = 6 ).
If (a) is nudged to (2.001), then (f (a)) becomes (6.003).
The increase in (f(a)) (0.003) is three times the increase in (a) (0.001), indicating a slope of 3.
At (a = 5) :

(f (a) = 15) .
If (a) is nudged to (5.001) , (f (a)) becomes (15.003).
The slope is still 3, showing a constant rate of change for this linear function.

Calculating the Derivative (Slope)


For (f(a) = 3a), the derivative at any point is: df(a)/da = 3

This notation means that for any tiny increase in (a), (f(a)) will increase by 3 times that amount.

Formal Definition of Derivative

The formal definition considers an "infinitesimal" nudge in (a) (even smaller than 0.001)
and calculates the resulting change in (f (a)) as the ratio approaches an infinitely small
limit.

Constant Slope in Linear Functions

For the function (f (a) = 3a ), the slope remains 3 everywhere on the line.
Ratio Consistency: Wherever you draw a triangle (change in (a) vs. change in (f (a))), the
ratio is always 3 to 1.

Notes on Understanding Derivatives in Non-Linear Functions

Derivatives in Non-Linear Functions

1. Function Example: (f(a) = a²)

At (a = 2 ):

(f(a) = 2² = 4)

Nudging (a) to (2.001) yields (f(a) ≈ 4.004).

The change in (f(a)) (0.004) is 4 times the change in (a) (0.001).
Slope at (a = 2): (df(a)/da = 4).

At (a = 5 ):

(f(a) = 5² = 25)

Nudging (a) to (5.001) yields (f(a) ≈ 25.010).

Here, (f(a)) changes 10 times more than (a).
Slope at (a = 5): (df(a)/da = 10).

General Formula:

The derivative of (a²) is (2a).

This formula means that at any point (a), the slope of (f(a) = a²) is (2a).

2. Function Example: (f(a) = a³)

The derivative of (f(a) = a³) is (3a²).
At (a = 2):

(f(a) = 8) (since (2³ = 8))

Nudging (a) to (2.001) results in (f(a) ≈ 8.012).
This shows (f(a)) increases 12 times as much as (a).
Slope at (a = 2): (df(a)/da = 12).

3. Function Example: (f(a) = log(a))

The derivative of (f(a) = log(a)) (natural logarithm) is (1/a).
At (a = 2):

(f(a) = log(2) ≈ 0.69315)

Increasing (a) to (2.001) makes (f(a) ≈ 0.69365).
Slope at (a = 2): (df(a)/da = 1/2 = 0.5)

This shows that the function rises by half the change in (a).
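All four slope claims can be checked numerically with the same 0.001 nudge used above (a sketch; the one-sided difference is only an approximation of the true derivative):

import math

# (function, exact derivative, point) triples from the examples above
cases = [(lambda a: 3 * a,       lambda a: 3.0,        2.0),
         (lambda a: a ** 2,      lambda a: 2 * a,      5.0),
         (lambda a: a ** 3,      lambda a: 3 * a ** 2, 2.0),
         (lambda a: math.log(a), lambda a: 1 / a,      2.0)]

for f, df, a in cases:
    slope = (f(a + 0.001) - f(a)) / 0.001   # finite-difference estimate
    print(round(slope, 3), df(a))           # estimate vs. exact derivative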

Notes on Computation Graphs in Neural Networks

Purpose of Computation Graphs

In neural networks, computations are organized into:

Forward Pass: Calculates the output by moving from input to output.


Backward Pass: Calculates gradients (derivatives) for optimization by moving from
output to input.

Example Function

Consider a simple function (J = 3(a + bc) ) of three variables (a), (b), and (c).
This function can be broken down into three distinct steps:

1. Compute (u = bc )
2. Compute (v = a + u )
3. Compute (J = 3v )

Computation Graph Structure

We represent the function as a graph where each step is a node:

Inputs: Nodes (a), (b), and (c).


Intermediate Calculations:

(u = bc ): Node takes inputs (b) and (c).


(v = a + u ): Node takes inputs (a) and (u).
(J = 3v ): Node takes input (v).

Concrete Example

Suppose (a = 5 ), (b = 3 ), and (c = 2 ):

Step 1: (u = bc = 3 × 2 = 6 )
Step 2: (v = a + u = 5 + 6 = 11 )
Step 3: (J = 3v = 3 × 11 = 33 )
This calculation confirms that (J = 3(a + bc) = 33).
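The three steps translate directly into code (a minimal sketch of the left-to-right pass):

# Forward pass through the computation graph, one node at a time
a, b, c = 5, 3, 2
u = b * c     # step 1: u = bc = 6
v = a + u     # step 2: v = a + u = 11
J = 3 * v     # step 3: J = 3v = 33
print(J)      # 33, matching J = 3(a + bc)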

Why Use a Computation Graph?

Forward Pass (left-to-right): Computes the output value by moving through each
calculation in sequence.
Backward Pass (right-to-left): Computes derivatives with respect to each input for
optimization. This step is essential in training neural networks as it provides the necessary
gradients for parameter updates.

Notes on Using a Computation Graph for Derivative Calculations

Purpose of Computation Graphs in Neural Networks

Computation graphs help organize both forward and backward passes:

Forward Pass (left-to-right): Calculates the output, e.g., cost function (J ).


Backward Pass (right-to-left): Computes gradients (derivatives) efficiently, crucial for
backpropagation.

Example Computation Graph


Function: (J = 3(a + bc) )

Steps:

1. (u = bc )
2. (v = a + u )
3. (J = 3v )

Graph Nodes:

Inputs: (a), (b), and (c).


Intermediate Nodes: (u), (v).
Output: (J ).

Computing Derivatives Using Backpropagation


1. Derivative of (J) with Respect to (v): (dJ/dv)

Since (J = 3v), (dJ/dv = 3).

This means that if (v) changes by a small amount, (J) will change three times that amount.

2. Derivative of (J) with Respect to (a): (dJ/da)

Since (v = a + u), a small increase in (a) directly affects (v).

We apply the chain rule: dJ/da = (dJ/dv) · (dv/da)

(dJ/dv = 3) and (dv/da = 1), so (dJ/da = 3).

3. Derivative of (J) with Respect to (u): (dJ/du)

Using the chain rule again: dJ/du = (dJ/dv) · (dv/du)

Since (dv/du = 1), we find (dJ/du = 3).

4. Derivative of (J) with Respect to (b): (dJ/db)

Using the chain rule: dJ/db = (dJ/du) · (du/db)

(du/db = c) (from (u = bc)), and since (c = 2), (du/db = 2).
So (dJ/db = 3 × 2 = 6).

5. Derivative of (J) with Respect to (c): (dJ/dc)

Similarly, using the chain rule: dJ/dc = (dJ/du) · (du/dc)

(du/dc = b), and since (b = 3), (du/dc = 3).
So (dJ/dc = 3 × 3 = 9).
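The same derivation can be written as code in exactly this right-to-left order (a sketch with the same values a = 5, b = 3, c = 2):

# Backward pass: apply the chain rule right-to-left through the graph
a, b, c = 5, 3, 2
u = b * c
v = a + u

dJ_dv = 3             # J = 3v
dJ_da = dJ_dv * 1     # v = a + u  =>  dv/da = 1
dJ_du = dJ_dv * 1     # v = a + u  =>  dv/du = 1
dJ_db = dJ_du * c     # u = bc     =>  du/db = c = 2
dJ_dc = dJ_du * b     # u = bc     =>  du/dc = b = 3

print(dJ_da, dJ_du, dJ_db, dJ_dc)   # 3 3 6 9, matching the derivation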

Notes on Implementing Gradient Descent for Logistic Regression Using a Computation Graph

Objective

Compute derivatives for logistic regression to implement gradient descent.

Recap on Logistic Regression Model


1. Prediction: (ŷ = σ(z) = 1/(1 + e^(−z))), where (z = W_1 X_1 + W_2 X_2 + B).

2. Loss Function (for a single example): L(ŷ, y) = −(y log(ŷ) + (1 − y) log(1 − ŷ))

(ŷ): Predicted probability.

(y): Ground truth label.

Goal:

Update parameters (W) and (B) to minimize the loss (L).


Computation Graph for Logistic Regression

1. Forward Pass:

Step 1: Compute (z = W_1 X_1 + W_2 X_2 + B).
Step 2: Compute (ŷ = σ(z)).
Step 3: Compute the loss (L(ŷ, y)).

2. Backward Pass:

Compute gradients by going backward through each step in the graph.

Derivatives for Backpropagation


1. Derivative of Loss (L) with Respect to (ŷ) (denoted (A)):

(dL/dA = −y/A + (1 − y)/(1 − A))

This gradient tells us how much the loss (L) changes with respect to (ŷ).

2. Derivative of Loss (L) with Respect to (z):

Using the chain rule: dL/dz = (dL/dA) × (dA/dz)

Result: (dL/dz = A − y)

Explanation: Since (A = σ(z)), we have (dA/dz = A(1 − A)), and the product simplifies to (A − y).

3. Derivatives with Respect to Parameters (W_1), (W_2), and (B):

For (W_1): (dL/dW_1 = X_1 · dL/dz = X_1 · (A − y))

For (W_2): (dL/dW_2 = X_2 · dL/dz = X_2 · (A − y))

For (B): (dL/dB = dL/dz = A − y)

Gradient Descent Update Rule for a Single Example


For a single example, update parameters as follows:
W_1 := W_1 − α · dW_1
W_2 := W_2 − α · dW_2
B := B − α · dB
where (α) is the learning rate.
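Putting the forward pass, backward pass, and update together for one example (a minimal sketch; the input values and initial parameters are illustrative):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One training example and initial parameters (illustrative numbers)
x1, x2, y = 1.0, 2.0, 1
w1, w2, b = 0.1, -0.2, 0.0
alpha = 0.01

# Forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)                        # a is y_hat

# Backward pass, using the derivatives above
dz = a - y                            # dL/dz = A - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Gradient descent update for this single example
w1 -= alpha * dw1
w2 -= alpha * dw2
b  -= alpha * db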

Notes on Computing Gradient Descent for Logistic Regression on (m) Training Examples

Objective

To compute gradients for multiple training examples and update parameters in logistic
regression using batch gradient descent.

Cost Function Definition


For (m) training examples, the cost function (J) is: J(w, b) = (1/m) ∑_{i=1}^{m} L(a^(i), y^(i))

where:

(a^(i) = σ(z^(i)))
(z^(i) = w^T x^(i) + b)
(L(a^(i), y^(i))) is the loss for each training example.

Gradient Computation for Multiple Examples


1. Individual Example Derivatives:

For each training example (i), compute: dz^(i) = a^(i) − y^(i)

Then compute: dw_1^(i) = x_1^(i) · dz^(i), dw_2^(i) = x_2^(i) · dz^(i), db^(i) = dz^(i)

These derivatives capture the impact of a single example on the parameters.

2. Averaging Over All Examples:

Compute the average gradient over all (m) examples:
dw_1 = (1/m) ∑_{i=1}^{m} dw_1^(i), dw_2 = (1/m) ∑_{i=1}^{m} dw_2^(i), db = (1/m) ∑_{i=1}^{m} db^(i)

Gradient Descent Update Rule


Initialize accumulators for the gradients: (dw_1 = 0), (dw_2 = 0), (db = 0).
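A sketch of one full pass over the (m) examples with explicit loops and the accumulators above (the data is illustrative; this per-example loop is exactly what vectorization later removes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: two features per example, m = 4 examples
X = np.array([[1.0,  2.0, -1.0, 0.5],
              [0.5, -1.0,  2.0, 1.5]])   # shape (2, m)
Y = np.array([1, 0, 1, 1])               # shape (m,)
w1, w2, b, alpha = 0.0, 0.0, 0.0, 0.1
m = X.shape[1]

# One step of batch gradient descent
J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):
    z_i = w1 * X[0, i] + w2 * X[1, i] + b
    a_i = sigmoid(z_i)
    J  += -(Y[i] * np.log(a_i) + (1 - Y[i]) * np.log(1 - a_i))
    dz_i = a_i - Y[i]
    dw1 += X[0, i] * dz_i
    dw2 += X[1, i] * dz_i
    db  += dz_i
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m   # average over m examples

w1 = w1 - alpha * dw1    # parameter updates
w2 = w2 - alpha * dw2
b  = b  - alpha * db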
