W2 Ann
Setup
Neural network programming involves specific techniques that simplify computations and
optimize performance.
Training Set Processing: Avoid explicit for-loops over all examples; instead, process the
training set as a whole (using matrix operations).
(x^{(i)}): The (i)-th training example's feature vector.
(y^{(i)}): Label for the (i)-th training example (0 or 1).
Input Matrix (X): Defined by stacking all feature vectors (x^{(i)}) as columns, forming (X ∈ R^{n_x × m}).
Cost Function for Binary Classification: (J_{CE}(y, ŷ) = −∑_{i=1}^{m} y^{(i)} log ŷ^{(i)})
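As a rough illustration of this setup (a sketch, not from the notes: the sizes n_x and m and the helper name cross_entropy_cost are illustrative), the input matrix stacks examples as columns and the cost is computed over the whole training set without an explicit for-loop:

import numpy as np

n_x, m = 4, 100                                 # illustrative: 4 features, 100 examples
X = np.random.randn(n_x, m)                     # column i is the feature vector x^(i)
Y = np.random.randint(0, 2, size=(1, m))        # labels y^(i) in {0, 1}
Y_hat = np.clip(np.random.rand(1, m), 1e-8, 1)  # stand-in predictions in (0, 1]

def cross_entropy_cost(Y, Y_hat):
    # J_CE(y, y_hat) = -sum_i y^(i) * log(y_hat^(i)), vectorized over all examples
    return -np.sum(Y * np.log(Y_hat))

print(cross_entropy_cost(Y, Y_hat))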
Notes on Logistic Regression for Binary Classification
Logistic Regression is used for binary classification (outputs are either 0 or 1).
Example: Given an image as input (represented as a feature vector (x)), we want to classify it as either a "cat" (1) or "not a cat" (0).
Objective
Given an input (x), we aim to predict the probability (y^ ) that (y = 1 ) (i.e., the probability
that the image is a "cat").
(y^ ) should be in the range [0, 1].
Model Parameters
Logistic regression predicts a binary outcome (y ) (1 for cat, 0 for not-cat) based on input
features (x ).
Prediction: (ŷ = σ(z)) with (z = W^T x + b), where (σ(z)) is the sigmoid function:
(σ(z) = 1 / (1 + e^{−z}))
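A minimal sketch of this prediction step (NumPy assumed; the feature count and parameter values below are illustrative):

import numpy as np

def sigmoid(z):
    # logistic function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

n_x = 4                         # illustrative number of features
W = np.zeros((n_x, 1))          # parameters W
b = 0.0                         # bias b
x = np.random.randn(n_x, 1)     # one input feature vector

z = np.dot(W.T, x) + b          # z = W^T x + b
y_hat = sigmoid(z)              # predicted probability that y = 1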
Goal
Train parameters (W) and (b) to make predictions (ŷ^{(i)}) as close as possible to the actual labels (y^{(i)}).
(z^{(i)} = W^T x^{(i)} + b): Linear combination of inputs and parameters for the (i)-th example.
Superscript ((i)) denotes the (i)-th training example.
When (y = 1): the loss (L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ))) reduces to (−log ŷ), so we want (ŷ) close to 1; when (y = 0) it reduces to (−log(1 − ŷ)), so we want (ŷ) close to 0.
This cost function averages the individual losses from all training examples.
Training Objective
Find parameters (W ) and (b ) that minimize the cost function (J (W , b) ), thus improving
prediction accuracy for binary classification.
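As a small sketch of this objective (NumPy assumed; the helper name cost is illustrative), the cost averages the per-example losses:

import numpy as np

def cost(Y, A):
    # J(W, b): average cross-entropy loss over m examples
    # Y: true labels, shape (1, m); A: predictions y_hat = sigmoid(z), shape (1, m)
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return np.sum(losses) / m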
Notes on Gradient Descent for Training Logistic Regression
Overview
To train the logistic regression model, we aim to learn the parameters (W ) and (b ) that
minimize the cost function (J (W , b) ).
Cost Function (J (W , b) ): Measures how well the model's predictions align with actual
labels across the training set.
Algorithm: Initialize the parameters, then repeatedly update (w := w − α · dJ(w)/dw), where:
(α): Learning rate.
(dJ(w)/dw): Derivative of (J) with respect to (w), indicating the slope.
Coding Convention
In code, we use:
(dW): Represents (∂J(W, b)/∂W).
(db): Represents (∂J(W, b)/∂b).
Updates are implemented as:
W = W - alpha * dW
b = b - alpha * db
Gradient descent iteratively adjusts (W ) and (b) to reduce (J ), ideally converging to the
global minimum.
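A rough sketch of this iteration (the helper compute_gradients is hypothetical, standing in for whatever routine returns dW and db at the current parameters):

def gradient_descent(X, Y, W, b, alpha, num_iters, compute_gradients):
    # Repeatedly apply W = W - alpha * dW and b = b - alpha * db
    for _ in range(num_iters):
        dW, db = compute_gradients(X, Y, W, b)   # gradients of J(W, b)
        W = W - alpha * dW
        b = b - alpha * db
    return W, b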
Key Takeaways
Gradient Descent minimizes the cost function by adjusting parameters in the steepest
descent direction.
Convexity of (J (W , b)) guarantees convergence to a global minimum, making logistic
regression a stable training algorithm.
Learning Rate (α ): A crucial hyperparameter that controls step size; tuning (α ) affects
convergence speed and stability.
Derivative as Slope
Consider the linear function (f(a) = 3a). At (a = 2):
(f (a) = 6 ).
If (a) is nudged to (2.001), then (f (a)) becomes (6.003).
The increase in (f (a)) (0.003) is three times the increase in (a) (0.001) ,
indicating a slope of 3.
At (a = 5) :
(f (a) = 15) .
If (a) is nudged to (5.001) , (f (a)) becomes (15.003).
The slope is still 3, showing a constant rate of change for this linear function.
The derivative notation (df(a)/da = 3) means that for any tiny increase in (a), (f(a)) will increase by 3 times that amount.
The formal definition considers an "infinitesimal" nudge in (a) (even smaller than 0.001)
and calculates the resulting change in (f (a)) as the ratio approaches an infinitely small
limit.
For the function (f (a) = 3a ), the slope remains 3 everywhere on the line.
Ratio Consistency: Wherever you draw a triangle (change in (a) vs. change in (f (a))), the
ratio is always 3 to 1.
Now consider (f(a) = a²):
At (a = 2): (f(a) = 2² = 4).
At (a = 5): (f(a) = 5² = 25).
General Formula: for (f(a) = a²) the derivative is (df(a)/da = 2a), so the slope changes with (a): it is 4 at (a = 2) and 10 at (a = 5).
For (f(a) = log a) the derivative is (1/a); at (a = 2) the slope is (1/2), which shows that the function rises by half the change in (a).
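A small numerical check of these slopes in plain Python (the nudge of 0.001 mirrors the examples above):

import math

def numerical_slope(f, a, eps=0.001):
    # approximate df/da by nudging a and taking the ratio of changes
    return (f(a + eps) - f(a)) / eps

print(numerical_slope(lambda a: 3 * a, 2))    # ~3, f(a) = 3a
print(numerical_slope(lambda a: a ** 2, 2))   # ~4, f(a) = a^2
print(numerical_slope(lambda a: a ** 2, 5))   # ~10, f(a) = a^2
print(numerical_slope(math.log, 2))           # ~0.5, f(a) = log a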
Example Function
Consider a simple function (J = 3(a + bc) ) of three variables (a), (b), and (c).
This function can be broken down into three distinct steps:
1. Compute (u = bc )
2. Compute (v = a + u )
3. Compute (J = 3v )
Concrete Example
Suppose (a = 5 ), (b = 3 ), and (c = 2 ):
Step 1: (u = bc = 3 × 2 = 6 )
Step 2: (v = a + u = 5 + 6 = 11 )
Step 3: (J = 3v = 3 × 11 = 33 )
This calculation confirms that (J = 3(a + bc) = 33).
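The same three steps as a direct sketch in plain Python (values taken from the concrete example):

a, b, c = 5, 3, 2

u = b * c      # Step 1: u = bc = 6
v = a + u      # Step 2: v = a + u = 11
J = 3 * v      # Step 3: J = 3v = 33
print(J)       # 33, matching J = 3(a + bc)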
Forward Pass (left-to-right): Computes the output value by moving through each
calculation in sequence.
Backward Pass (right-to-left): Computes derivatives with respect to each input for
optimization. This step is essential in training neural networks as it provides the necessary
gradients for parameter updates.
Steps:
1. (u = bc )
2. (v = a + u )
3. (J = 3v )
Graph Nodes: (u = bc), (v = a + u), and (J = 3v) are the nodes of the computation graph.
(dJ/dv = 3): if (v) changes by a small amount, (J) will change by three times that amount.
(dJ/dv = 3) and (dv/da = 1), so by the chain rule (dJ/da = 3 × 1 = 3).
Similarly (dJ/du = dJ/dv · dv/du = 3) and (du/dc = b = 3), thus (dJ/dc = 3 × 3 = 9).
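These derivatives can be checked numerically with small nudges (a sketch; the nudge size is arbitrary):

def J_of(a, b, c):
    u = b * c
    v = a + u
    return 3 * v

a, b, c, eps = 5, 3, 2, 1e-6
base = J_of(a, b, c)
print((J_of(a + eps, b, c) - base) / eps)   # dJ/da ~ 3
print((J_of(a, b + eps, c) - base) / eps)   # dJ/db ~ 3c = 6
print((J_of(a, b, c + eps) - base) / eps)   # dJ/dc ~ 3b = 9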
Objective
Goal: Compute the derivatives of the loss with respect to the parameters (W1), (W2), and (B) for a single training example by moving through the computation graph forward and then backward.
1. Forward Pass:
Step 1: Compute (z = W1 X1 + W2 X2 + B ).
Step 2: Compute (y^ = σ(z) ).
Step 3: Compute the loss (L(y^, y) ).
2. Backward Pass:
First compute (dL/dA), the derivative of the loss with respect to the activation (A = ŷ): this gradient tells us how much the loss (L) changes with respect to (ŷ).
Result: (dL/dz = A − y).
Explanation: Since (A = σ(z)), we use the chain rule and find that (dA/dz = A(1 − A)), which simplifies the calculation to (dL/dz = A − y).
For (W1): (dL/dW1 = X1 · dL/dz = X1 · (A − y))
For (W2): (dL/dW2 = X2 · dL/dz = X2 · (A − y))
For (B): (dL/dB = dL/dz = A − y)
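A minimal sketch of this single-example forward and backward pass (NumPy assumed; the input and label values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y = 1.0, 2.0, 1.0          # one training example (illustrative values)
w1, w2, b = 0.0, 0.0, 0.0          # parameters

# Forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)                      # A = y_hat
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Backward pass
dz = a - y                          # dL/dz = A - y
dw1 = x1 * dz                       # dL/dW1
dw2 = x2 * dz                       # dL/dW2
db = dz                             # dL/dB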
Objective
To compute gradients for multiple training examples and update parameters in logistic
regression using batch gradient descent.
The cost is the average of the per-example losses,
(J(w, b) = (1/m) ∑_{i=1}^{m} L(a^{(i)}, y^{(i)}))
where:
(a^{(i)} = σ(z^{(i)}))
(z^{(i)} = w^T x^{(i)} + b)
(L(a^{(i)}, y^{(i)})) is the loss for each training example.
These derivatives calculate the impact of a single example on the parameters.
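A sketch of one batch gradient descent step over all (m) examples, averaging these per-example derivatives; following the earlier advice, it is vectorized rather than written as an explicit for-loop (NumPy assumed; shapes and names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_step(X, Y, w, b, alpha):
    # One update of (w, b) using the average gradients over m examples.
    # X: inputs of shape (n_x, m); Y: labels of shape (1, m)
    m = X.shape[1]
    Z = np.dot(w.T, X) + b        # z^(i) = w^T x^(i) + b for all i at once
    A = sigmoid(Z)                # a^(i) = sigma(z^(i))
    dZ = A - Y                    # dL/dz for every example
    dw = np.dot(X, dZ.T) / m      # average of x^(i) * dz^(i)
    db = np.sum(dZ) / m           # average of dz^(i)
    w = w - alpha * dw
    b = b - alpha * db
    return w, b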