Deep Learning

Deep neural networks are difficult to train compared to shallow networks due to having many layers, which can lead to problems like vanishing or exploding gradients that hinder effective training. Common issues in training deep neural networks include overfitting, computational complexity, and getting stuck in local optima. Greedy layer-wise training aims to simplify training by building the network up layer-by-layer rather than end-to-end, potentially helping address issues like vanishing gradients.


1. Why are deep neural networks difficult to train compared to shallow networks?

Answer- Deep neural networks are harder to train than shallow ones because they have
many layers, making it difficult to adjust all the weights effectively. This depth leads to
problems like vanishing or exploding gradients and increases the risk of overfitting, where
the network learns the training data too well but generalizes poorly to new data.

2. What are some common issues encountered during training deep neural networks?

Answer- Common issues encountered during training deep neural networks include:

1. Vanishing or exploding gradients: Gradients become too small or too large,
making it difficult to update the weights effectively.
2. Overfitting: The network learns the training data too well but struggles to
generalize to new, unseen data.
3. Computational complexity: Training deep networks requires significant
computational resources, which can be time-consuming and expensive.
4. Hyperparameter tuning: Selecting the right architecture and parameters for the
network can be challenging and requires experimentation.
5. Data pre-processing: Ensuring the data is properly cleaned, normalized, and
prepared for training is crucial for effective learning.
6. Local optima: Networks can get stuck in suboptimal solutions, hindering
convergence to the best solution.
7. Limited interpretability: Understanding how the network makes decisions can
be difficult due to its complex structure and high dimensionality.

3. What is greedy layer-wise training in the context of deep learning?

Answer- Greedy layer-wise training in deep learning involves training one layer at a time,
gradually building up the network. Each layer is trained independently, learning simpler
representations before they are combined in deeper layers. This approach simplifies
training by breaking the problem into smaller, more manageable parts, potentially aiding
convergence and mitigating issues such as vanishing or exploding gradients.
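
The following is a minimal PyTorch sketch of one common form of greedy layer-wise pretraining, using stacked autoencoders; the layer sizes, learning rate, step count, and synthetic data are illustrative assumptions, not a prescribed recipe.

```python
# Greedy layer-wise pretraining with stacked autoencoders (illustrative sketch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 64)          # toy unlabeled data: 256 samples, 64 features
layer_sizes = [64, 32, 16]        # input dim followed by two hidden dims (assumed)

encoders = []
inputs = X
for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    # Train one autoencoder layer at a time on the previous layer's codes.
    enc = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())
    dec = nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        recon = dec(enc(inputs))
        loss = nn.functional.mse_loss(recon, inputs)  # reconstruct this layer's input
        loss.backward()
        opt.step()
    encoders.append(enc)
    inputs = enc(inputs).detach()  # codes become the next layer's training data

# Stack the pretrained encoders and add a task head for supervised fine-tuning.
model = nn.Sequential(*encoders, nn.Linear(layer_sizes[-1], 10))
```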

4. What are the advantages and disadvantages of greedy layer-wise training?

Answer- Advantages of greedy layer-wise training:

1. Simplifies Training: Training one layer at a time simplifies the learning
process, making it easier to handle complex networks.
2. Addresses Vanishing/Exploding Gradients: Helps mitigate issues like
vanishing or exploding gradients by allowing each layer to learn simpler
representations independently.
3. Efficient Use of Resources: Can be computationally efficient, especially for
large datasets, as it breaks the problem down into smaller parts.

Disadvantages of greedy layer-wise training:

1. May Not Find Global Optimum: Since each layer is optimized independently,
there is no guarantee that the final combination of layers will lead to the best
overall solution.
2. Potentially Suboptimal Solutions: It might get stuck in suboptimal solutions,
particularly if the layers don't interact well when combined.
3. Additional Complexity: Requires careful hyperparameter tuning and might
introduce additional complexity to the training process compared to end-to-end
training.

5. What is optimization in the context of training deep neural networks?

Answer- Optimization in the context of training deep neural networks refers to the process of
finding the set of parameters (weights and biases) that minimizes a loss function measuring the
difference between the network's predictions and the actual targets. The parameters are adjusted
iteratively using optimization algorithms to improve the network's performance on tasks such as
classification, regression, or generation. The goal is to find a configuration that reduces errors
and enhances the network's ability to generalize to unseen data.
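
As a concrete illustration of this iterative adjustment, here is a minimal gradient descent loop on a toy quadratic loss; the loss function, learning rate, and step count are illustrative assumptions.

```python
# Minimal gradient descent loop on a toy loss (illustrative sketch).
import numpy as np

def loss(w):                 # toy loss, minimized at w = [3, -1]
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad(w):                 # analytic gradient of the toy loss
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.zeros(2)              # initial parameters (weights)
lr = 0.1                     # learning rate
for step in range(100):
    w -= lr * grad(w)        # move parameters against the gradient
print(w, loss(w))            # w approaches [3, -1], loss approaches 0
```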

6. What are AdaGrad, RMSProp, and Adam optimization algorithms?


Answer- AdaGrad, RMSProp, and Adam are optimization algorithms commonly used in training
deep neural networks:

1. AdaGrad (Adaptive Gradient Algorithm): AdaGrad adapts the learning rates of
individual parameters by scaling them inversely proportional to the square root of
the sum of historical squared gradients. It effectively boosts the learning rate for
infrequent parameters and decreases it for frequent ones, aiming to converge
faster.
2. RMSProp (Root Mean Square Propagation): RMSProp addresses the
diminishing learning rates problem of AdaGrad by using an exponentially
decaying average of squared gradients. It scales the learning rates differently for
each parameter based on the magnitude of their gradients, allowing for faster
convergence and better generalization.
3. Adam (Adaptive Moment Estimation): Adam combines the benefits of both
AdaGrad and RMSProp by using adaptive learning rates and momentum. It
maintains exponentially decaying averages of past gradients and squared
gradients, adjusting the learning rates accordingly. Adam also includes bias
correction terms to account for the initialization bias in the estimates, resulting in
improved performance and robustness across different types of deep learning
tasks.
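
To make the update rules concrete, here is a minimal NumPy sketch of a single update step for each of the three optimizers; the hyperparameter values shown are the commonly used defaults and are assumptions for this sketch.

```python
# One update step for AdaGrad, RMSProp, and Adam (illustrative sketch).
import numpy as np

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache += g ** 2                                  # accumulate squared gradients
    w -= lr * g / (np.sqrt(cache) + eps)             # per-parameter scaled update
    return w, cache

def rmsprop_step(w, g, cache, lr=0.001, rho=0.9, eps=1e-8):
    cache = rho * cache + (1 - rho) * g ** 2         # exponentially decaying average
    w -= lr * g / (np.sqrt(cache) + eps)
    return w, cache

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                        # first moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2                   # second moment
    m_hat = m / (1 - b1 ** t)                        # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```
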
7. What are some examples of second order optimization algorithms?

Answer- Examples of second-order optimization algorithms include:

1. Newton's Method: Newton's method uses second-order derivatives (the Hessian matrix)
to update parameters. It provides more information about the curvature of the loss
function, allowing for faster convergence towards the minimum. However, computing
and inverting the Hessian matrix can be computationally expensive, especially for
large networks.
2. Quasi-Newton Methods (e.g., BFGS, L-BFGS): Quasi-Newton methods approximate
the inverse of the Hessian matrix without explicitly computing it. BFGS (Broyden–
Fletcher–Goldfarb–Shanno) and L-BFGS (Limited-memory BFGS) are popular
variants that iteratively update an approximation of the inverse Hessian matrix based
on gradients and parameter updates. They offer faster convergence than first-order
methods like gradient descent, with reduced computational cost compared to Newton's
method.
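
A minimal sketch contrasting the two approaches on a toy one-dimensional quadratic: an explicit Newton step using the second derivative, and L-BFGS via SciPy, which approximates curvature from gradients. The toy function and starting point are illustrative assumptions.

```python
# Newton step vs. L-BFGS on a toy quadratic (illustrative sketch).
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x - 2.0) ** 2          # toy loss, minimum at x = 2
grad = lambda x: 2.0 * (x - 2.0)      # first derivative
hess = lambda x: 2.0                  # second derivative (1-D "Hessian")

# Newton's method: x_new = x - f'(x) / f''(x); converges in one step on a quadratic.
x = 10.0
x = x - grad(x) / hess(x)
print("Newton:", x)                   # 2.0

# L-BFGS approximates curvature from gradient history instead of forming the Hessian.
res = minimize(lambda v: f(v[0]), x0=[10.0],
               jac=lambda v: np.array([grad(v[0])]), method="L-BFGS-B")
print("L-BFGS:", res.x)               # close to [2.0]
```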

8. What are the computational challenges associated with using second order methods?

Answer- Computational challenges associated with using second-order optimization methods
include:

1. High Memory Requirements: Second-order methods often require storing and
manipulating large matrices (such as the Hessian matrix or its approximation), which
can consume significant memory resources, particularly for large-scale deep neural
networks with millions of parameters.
2. Computational Complexity: Computing and updating the Hessian matrix or its
approximation involve substantial computational overhead, making second-order
methods computationally expensive, especially when dealing with large datasets and
high-dimensional parameter spaces.
3. Numerical Stability: Inverting or approximating the Hessian matrix can lead to
numerical instability, especially when dealing with ill-conditioned or highly non-linear
optimization problems. This instability can affect the convergence behaviour and the
overall performance of second-order methods.
4. Limited Scalability: Second-order methods may not scale well to very large neural
networks or datasets due to their high computational and memory requirements. As a
result, these methods may not be practical for training deep neural networks in certain
scenarios where computational resources are limited.

9. What is regularization and why is it important in training deep neural networks?

Answer- Regularization is a technique used in training deep neural networks to prevent
overfitting, which occurs when the model memorizes the training data too closely and
performs poorly on unseen data. The goal of regularization is to impose constraints on the
network's parameters during training, encouraging simpler and more generalizable
representations.

Regularization is important in training deep neural networks for several reasons:

1. Preventing Overfitting: Deep neural networks have a large number of parameters,
making them prone to overfitting, especially when trained on limited data.
Regularization helps mitigate overfitting by penalizing complex models that fit the
training data too closely, thereby improving the model's ability to generalize to new,
unseen data.
2. Improving Generalization: By encouraging simpler model configurations,
regularization helps prevent the network from memorizing noise or irrelevant patterns
in the training data, leading to better generalization performance on unseen data.
3. Controlling Model Complexity: Regularization techniques introduce constraints on
the network's parameters, such as weight decay or dropout, which control the
complexity of the model. This helps prevent the network from becoming too complex
and ensures that it learns meaningful features that are relevant to the task at hand.
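
For illustration, here is a minimal PyTorch sketch of two regularization techniques mentioned above, weight decay (an L2 penalty applied through the optimizer) and dropout; the architecture and hyperparameter values are assumptions for the example.

```python
# Weight decay and dropout as regularizers (illustrative sketch).
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zeroes 50% of activations during training
    nn.Linear(128, 10),
)

# weight_decay adds an L2 penalty on the weights, discouraging overly large
# parameter values and thus overly complex models.
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```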

10. Explain the concept of batch normalization and its role in training deep neural
networks.

Answer- Batch normalization is a technique used in training deep neural networks to stabilize
and speed up the learning process. It works by normalizing the activations of each layer
within a mini-batch, transforming the inputs to have a mean of zero and a standard deviation
of one. This normalization is applied independently to each feature dimension.

The main steps involved in batch normalization are:


1. Compute Batch Statistics: For each feature dimension, calculate the mean and standard
deviation of the activations within the mini-batch.
2. Normalize Activations: Subtract the mean and divide by the standard deviation for each
activation in the mini-batch.
3. Scale and Shift: Introduce learnable parameters (gamma and beta) to scale and shift the
normalized activations, allowing the network to learn the optimal representation for each
layer.

Batch normalization helps in training deep neural networks by addressing several key
challenges:
1. Internal Covariate Shift: By normalizing the activations within each mini-batch, batch
normalization reduces the internal covariate shift, making the training process more stable
and allowing for higher learning rates.
2. Addressing Vanishing/Exploding Gradients: Batch normalization helps mitigate the
vanishing and exploding gradient problems by stabilizing the activations throughout the
network, enabling smoother and more consistent gradient flow during backpropagation.
3. Regularization: Batch normalization acts as a form of regularization by adding noise to
the activations, similar to dropout, which helps prevent overfitting and improves the
generalization performance of the network.
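
The three steps listed above can be written out directly; the following NumPy sketch implements the training-mode forward pass of batch normalization, with the toy mini-batch and the initial gamma and beta values as illustrative assumptions.

```python
# Batch normalization forward pass, training mode (illustrative sketch).
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # step 1: per-feature batch mean
    var = x.var(axis=0)                    # step 1: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 2: normalize activations
    return gamma * x_hat + beta            # step 3: learnable scale and shift

x = np.random.randn(32, 4) * 5 + 10        # mini-batch: 32 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)      # learnable parameters (initial values)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1
```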
