MLP, Gradient Descent, Activation Functions

The multilayer perceptron (MLP) is a commonly used neural network. An MLP is composed of multiple layers, including an input layer, one or more hidden layers, and an output layer, where each layer contains a set of perception elements known as neurons. Fig. 1 illustrates an MLP with an input layer, two hidden layers, and an output layer.

Multilayer Perceptron (MLP):

● Structure: An MLP consists of multiple layers of neurons: an input layer, hidden layers, and an output layer.
● Operation: Information flows forward from the input layer, through the hidden layers, to the output layer.
● Training: The network adjusts its weights with the backpropagation algorithm, iterating over training data to minimize prediction error.
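The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the layer sizes and weight values are made up for the example, and sigmoid is used as the activation throughout.

```python
import math

def sigmoid(x):
    """Squash a value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One dense layer: each neuron computes sigmoid(w . x + b)."""
    return [sigmoid(sum(w * v for w, v in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def mlp_forward(x, layers):
    """Feed x forward through each (weights, biases) layer in turn."""
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x

# Toy 2-2-1 network with hand-picked (illustrative) weights.
hidden = ([[0.5, -0.5], [0.3, 0.8]], [0.0, -0.1])
output = ([[1.0, -1.0]], [0.2])
y = mlp_forward([1.0, 2.0], [hidden, output])
```

Training would then compare `y` against a target and backpropagate the error to update the weights; only the forward direction is shown here.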

Example Use Cases:

● Fault Detection in Wind Turbines: MLPs can analyze sensor data to identify when a turbine might fail.
● Prediction of NAICS Codes: MLPs can classify business types by predicting six-digit North American Industry Classification System (NAICS) codes based on input data.
Types of Nonlinear Activation Functions:

1. ReLU (Rectified Linear Unit)
○ Definition: f(x) = max(0, x); outputs the input when it is positive and zero otherwise.
○ Purpose: Introduces non-linearity, enabling the model to learn complex patterns and relationships.
○ Benefits: Simple and computationally efficient; helps avoid the vanishing gradient problem.
○ Limitations: Can cause "dead neurons" that never activate during training, since negative inputs always give zero.
2. Sigmoid Function
○ Definition: σ(x) = 1 / (1 + e^(−x)); squashes any input into the range (0, 1).
○ Purpose: Maps the input into a probability-like output, making it suitable for binary classification tasks.
○ Benefits: Useful for interpreting outputs as probabilities.
○ Limitations: Can cause the vanishing gradient problem, making it difficult for deeper layers to learn.
3. Tanh (Hyperbolic Tangent)
○ Definition: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); squashes inputs into the range (−1, 1).
○ Purpose: Provides stronger gradients than the sigmoid function, helping the network learn more effectively.
○ Benefits: Zero-centered output helps convergence during training.
○ Limitations: Still susceptible to the vanishing gradient problem, though less severely than sigmoid.

4. Leaky ReLU
○ Definition: f(x) = x for x > 0, and f(x) = αx for x ≤ 0, where α is a small positive constant (e.g. 0.01).
○ Purpose: Addresses the "dying ReLU" problem by allowing a small, non-zero gradient for negative inputs.
○ Benefits: Prevents dead neurons while maintaining simplicity and computational efficiency.
○ Limitations: The choice of α is crucial and can affect performance.
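The four activation functions above are simple enough to write out directly. A minimal sketch in plain Python (the default α of 0.01 for Leaky ReLU is a common convention, not mandated by the notes):

```python
import math

def relu(x):
    """max(0, x): passes positives through, zeroes out negatives."""
    return max(0.0, x)

def sigmoid(x):
    """1 / (1 + e^-x): squashes x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes x into (-1, 1), zero-centered."""
    return math.tanh(x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but leaks a small gradient alpha for x <= 0."""
    return x if x > 0 else alpha * x
```

Note how `leaky_relu(-2.0)` returns a small negative value (−0.02) instead of the hard zero that `relu(-2.0)` produces, which is exactly what keeps its neurons from "dying".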
Gradient Descent Summary:

What is Gradient Descent?

● An optimization algorithm used to train machine learning models and neural networks.
● It minimizes the error between predicted and actual results, helping the model learn over time.
● The cost function measures this error and guides the model to adjust its parameters to reduce it.

How Does Gradient Descent Work?

● Starts at an arbitrary point and calculates the slope (derivative) to find the direction to move in to reduce the error.
● Uses a learning rate (step size) to determine how big each step should be.
● Continues adjusting parameters (weights and biases) until it reaches the lowest point (minimum error).
● The goal is to find the optimal parameters that minimize the cost function.
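The loop described above can be sketched in a few lines. This toy example minimizes f(x) = (x − 3)², whose gradient is 2(x − 3); the learning rate and step count are arbitrary choices for illustration.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step opposite the gradient until steps run out."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)  # move against the slope, scaled by lr
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2(x - 3), minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each step shrinks the distance to the minimum by a constant factor (here 1 − 2·lr = 0.8), so `x_min` converges to 3. The same loop applied to a cost function over weights and biases is exactly how a neural network is trained.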

Types of Gradient Descent:

1. Batch Gradient Descent:
○ Updates parameters after evaluating all training examples.
○ Produces stable, accurate gradient estimates, but is slow for large datasets.
2. Stochastic Gradient Descent (SGD):
○ Updates parameters after each training example.
○ Faster and can escape local minima, but less stable.
3. Mini-Batch Gradient Descent:
○ Updates parameters in small batches.
○ Balances efficiency and speed.
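The three variants differ only in how many examples feed each update. A mini-batch sketch (with batch_size equal to the dataset size it becomes batch gradient descent; with batch_size 1 it becomes SGD). The data, learning rate, and epoch count are made up for the example:

```python
import random

def minibatch_sgd(data, w0, lr=0.05, batch_size=2, epochs=200, seed=0):
    """Fit y = w*x by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling doesn't mutate the caller's list
    w = w0
    for _ in range(epochs):
        rng.shuffle(data)  # fresh example order each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of mean squared error over the batch w.r.t. w.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# Data generated from y = 2x, so the fitted w should approach 2.
pairs = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w_fit = minibatch_sgd(pairs, w0=0.0)
```

Averaging the gradient over a small batch keeps each update cheap (unlike full-batch) while smoothing out the noise of single-example SGD.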

Challenges with Gradient Descent:

● Local Minima and Saddle Points: Can get stuck in local minima or
saddle points where learning stops prematurely.
● Vanishing and Exploding Gradients: In deep networks, gradients
can become too small (vanishing) or too large (exploding),
hindering learning.
Key Concepts:

● Learning Rate: Controls the step size during parameter updates; too high can overshoot the minimum, too low slows learning.
● Cost Function: Measures the error over the whole training set; distinct from the loss function, which applies to a single training example.
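The learning-rate trade-off is easy to see numerically. Running gradient steps on f(x) = x² (gradient 2x) from the same starting point with three illustrative learning rates:

```python
def descend(lr, steps=50, x0=10.0):
    """Distance from the minimum after gradient steps on f(x) = x^2."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x^2 is 2x
    return abs(x)

too_small = descend(lr=0.01)  # creeps toward 0, still far after 50 steps
well_tuned = descend(lr=0.4)  # converges to ~0 quickly
too_large = descend(lr=1.1)   # each step overshoots; |x| blows up
```

Each step multiplies x by (1 − 2·lr), so convergence requires that factor to have magnitude below 1; at lr = 1.1 the factor is −1.2 and the iterate diverges, which is the overshoot failure mode described above.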
