Unit-4 (NLP)
4.1.1 Dependency Parsing
4.1.2 Gradient checks, Overfitting
4.1.3 Regularization
4.1.4 Activation functions
4.1.5 Multitask and Semi-supervised Learning
4.2 Text Embedding
4.2.1 Word Vector representations: word2vec, GloVe
4.2.2 Advanced word vector representations
4.2.3 Sequence-to-sequence model
Named Entity Recognition (NER)
Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organisations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
It adds a wealth of semantic knowledge to your content and helps you to promptly understand the subject of any given text.
Named entity recognition (NER) is one of the most popular data preprocessing tasks. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is basically the thing that is consistently talked about or referred to in the text.
NER is a form of NLP.
At its core, NER is just a two-step process:
1. Detecting the entities in the text
2. Classifying them into different categories
Applications:
A few applications of NER include extracting important named entities from legal, financial, and medical documents, classifying content for news providers, and improving search algorithms.
Ambiguity in NE
For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in classification. Let's look at some ambiguous examples:
England (Organisation) won the 2019 world cup vs. the 2019 world cup happened in England (Location).
Washington (Location) is the capital of the US vs. the first president of the US was Washington (Person).
Approaches to NER
1. Dictionary-based systems
This is the simplest NER approach. Here we have a dictionary that contains a collection of vocabulary. Basic string-matching algorithms are used to check whether an entity occurring in the given text matches an item in the vocabulary. The method has limitations because the dictionary needs to be continuously updated and maintained.
2. Rule-based systems
Here, the model uses a pre-defined set of rules for information extraction. Mainly two types of rules are used,
Pattern-based rules, which depend upon the morphological pattern of the words used, and context-based rules,
which depend upon the context of the word used in the given text document. A simple example for a context-
based rule is “If a person’s title is followed by a proper noun, then that proper noun is the name of a person”.
3. Machine learning-based systems
The ML-based systems use statistical models for detecting entity names. These models try to build a feature-based representation of the observed data. This approach overcomes many limitations of the dictionary- and rule-based approaches, for example by recognizing an existing entity name even with small spelling variations.
There are mainly two phases when we use an ML-based solution for NER. The first phase involves training the ML model on annotated documents. The time taken to train the model will vary depending on the complexity of the model being built. In the next phase, the trained model can be used to annotate new, raw documents.
4. Deep learning-based systems
In recent years, deep learning-based models have been used to build state-of-the-art systems for NER. There are many advantages to using DL techniques over the previously discussed approaches. With the DL approach, the input data is mapped to a non-linear representation, which helps the model learn complex relations present in the input data. Another advantage is that we can avoid much of the time and resources spent on the feature engineering required by the other, more traditional approaches.
Stanford NER Tagger
It is one of the standard tools used for Named Entity Recognition. Mainly, there are three types of models:
1. The three-class model, which recognizes organizations, persons, and locations.
2. The four-class model, which recognizes persons, organizations, locations, and miscellaneous entities.
3. The seven-class model, which recognizes persons, organizations, locations, money, time, percents, and dates.
While testing the model, we take a news extract from the Indian Express newspaper. A minimal usage sketch (assuming NLTK is installed and the Stanford NER jar plus the four-class classifier have been downloaded locally; the file paths below are placeholders to adjust):

from nltk.tag.stanford import StanfordNERTagger

# Placeholder paths: point these at your local Stanford NER classifier and jar.
st_4class = StanfordNERTagger('english.conll.4class.distsim.crf.ser.gz', 'stanford-ner.jar')

example_document = '''Deepak Jasani, Head of retail research, HDFC Securities, said: “Investors will look to the European Central Bank later Thursday for reassurance that surging prices are just transitory, and not about to spiral out of control. In addition to the ECB policy meeting, investors are awaiting a report later Thursday on US economic growth, which is likely to show a cooling recovery, as well as weekly jobs data.”'''

st_4class.tag(example_document.split())
Spacy Pipelines for NER
Spacy has mainly three English pipelines that are optimized for CPU for Named Entity Recognition. They are
a) en_core_web_sm
b) en_core_web_md
c) en_core_web_lg
The above models are listed in ascending order according to their size where SM, MD, and LG denote small,
medium, and large models respectively. Let us try out NER using the small model.
import spacy
import spacy.cli

spacy.cli.download("en_core_web_sm")
sp_sm = spacy.load('en_core_web_sm')

def spacy_sm_ner(document):
    # Return the set of (entity text, entity label) pairs found by the pipeline.
    return {(ent.text.strip(), ent.label_) for ent in sp_sm(document).ents}

spacy_sm_ner(example_document)
4.1.1 Dependency Parsing
The term Dependency Parsing (DP) refers to the process of examining the dependencies between the words of a sentence in order to determine its grammatical structure. A sentence is divided into many sections based mostly on this. The process rests on the assumption that there is a direct relationship between the linguistic units in a sentence; these relationships are called dependencies.
Consider the following statement: "I prefer the morning flight through Denver."
In a dependency structure, the relationships between the linguistic units, or words, in the sentence are expressed by directed arcs. The root of the tree is "prefer", which forms the head of the sentence.
A dependency tag indicates the relationship between two words. For example, the word "Denver" modifies the meaning of the noun "flight". As a result, you can identify a dependency flight -> Denver, where "flight" is the head and "Denver" is the child or dependent. It is represented by the tag nmod, which stands for nominal modifier.
This is the general pattern for a dependency between two words, where one serves as the head and the other as the dependent. Currently, the Universal Dependencies v2 taxonomy consists of 37 universal syntactic relations.
The Natural Language Toolkit (NLTK) package can be used for Dependency Parsing; it is a set of libraries and code used for statistical Natural Language Processing (NLP) of human language. NLTK offers, among others, the following parsers:
1. Probabilistic, projective dependency parser: These parsers predict new sentences by using human language data acquired from hand-parsed sentences. They are known to make mistakes and to work with a limited set of training data.
2. Stanford parser: This is a Java-based natural language parser. You need the Stanford CoreNLP parser to perform dependency parsing. The parser supports a number of languages, including English, Chinese, German, and Arabic.
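As a quick illustration (using spaCy rather than NLTK; the en_core_web_sm pipeline was loaded earlier), here is a minimal sketch that prints each word, its dependency relation, and its head:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I prefer the morning flight through Denver.")

# Print each token, its dependency relation, and the head word it attaches to.
for token in doc:
    print(f"{token.text:10} {token.dep_:10} head: {token.head.text}")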
Constituency Parsing
Constituency parsing is based on context-free grammars: context-free grammars are used to parse the text. Here the parse tree breaks a sentence down into sub-phrases, each of which belongs to a different grammar category. A terminal node is a linguistic unit or word that has a parent node and a part-of-speech tag.
For example, "a cat" and "a box beneath the bed" are noun phrases, while "write a letter" and "drive a car" are verb phrases.
Consider the following example sentence: "I shot an elephant in my pajamas." The constituency parse tree is ambiguous:
(Figure 2: the two constituency parse trees for this sentence.)
One parse tree represents the reading in which the elephant was wearing the pyjamas, while the other represents the reading in which the speaker was wearing the pyjamas when shooting the elephant.
The entire sentence is broken down into sub-phrases until only terminal words remain. VP stands for verb phrase and NP for noun phrase.
The Stanford parser can also be used to do constituency parsing. It begins by parsing a phrase with the constituency parser and then transforms the constituency parse tree into a dependency tree.
If your main objective is to break a sentence into sub-phrases, constituency parsing is the ideal choice. However, dependency parsing is the best method for discovering the dependencies between the words in a sentence.
A constituency parse tree denotes the subdivision of a text into sub-phrases. The tree’s non-terminals are
different sorts of phrases, the terminals are the sentence’s words, and the edges are unlabeled. A constituency
parse for the simple statement “John sees Bill” would be:
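In bracket notation, such a constituency parse can be written as:

(S (NP John)
   (VP (V sees)
       (NP Bill)))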
A dependency parse links words together based on their relationships. Each vertex in the tree corresponds to a word, child nodes correspond to words that depend on the parent, and edges correspond to relationships. The dependency parse for "John sees Bill" has "sees" as the root, with "John" (subject) and "Bill" (object) as its dependents.
You should choose the parser type that is most closely related to your objective. If you are looking for sub-phrases within a sentence, you are probably interested in the constituency parse. If you are interested in the relationships between words, you are probably interested in the dependency parse.
4.1.2 Gradient Checks, Overfitting
When implementing a neural network from scratch, backpropagation is arguably where it is most prone to mistakes. Therefore, a method to debug this step could potentially save a lot of time and headaches when building a network.
Here, the method of gradient checking is introduced. Briefly, this method consists of approximating the gradient using a numerical approach. If the approximation is close to the analytically calculated gradients, then backpropagation was implemented correctly!
A bit of calculus
Assuming that you have some knowledge of calculus, gradient checking will be very easy to understand.
We know that backpropagation calculates the derivatives (or gradient). From your calculus course, you might remember the definition of the derivative of a function f at a point theta:

f'(theta) = lim (epsilon -> 0) [ f(theta + epsilon) - f(theta - epsilon) ] / (2 * epsilon)

The definition above can be used as a numerical approximation of the derivative. Taking an epsilon small enough, the calculated approximation will have an error on the order of epsilon squared.
Therefore, we can use this to approximate the gradient, and in turn make sure that backpropagation is implemented correctly.
Vectorized implementation
Now, we need to define a vectorized form of gradient checking before implementing it in Python.
Let's take the weight and bias matrices and reshape them into one big vector theta. Similarly, all their respective derivatives are placed into a vector d_theta. The approximate gradient can then be expressed, element by element, as:

d_theta_approx[i] = [ J(theta_1, ..., theta_i + epsilon, ...) - J(theta_1, ..., theta_i - epsilon, ...) ] / (2 * epsilon)

Notice how the equation above is almost identical to the definition of the limit!
Gradient check
To compare the approximation with the gradient computed by backpropagation, we evaluate:

difference = || d_theta_approx - d_theta ||_2 / ( || d_theta_approx ||_2 + || d_theta ||_2 )

The equation above is basically the Euclidean distance normalized by the sum of the norms of the vectors; the normalization makes the check independent of the scale of the gradients.
As a value for epsilon, we usually opt for 1e-7. Therefore, if the gradient check returns a value less than 1e-7, it means that backpropagation was likely implemented correctly. Otherwise, there is potentially a mistake in your implementation. If the value exceeds 1e-3, you can be fairly sure the code is not correct.
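A minimal NumPy sketch of this procedure, assuming a cost function J(theta) and a function backprop_grad(theta) that returns the analytic gradient (both are hypothetical placeholders for your own implementation):

import numpy as np

def gradient_check(J, backprop_grad, theta, epsilon=1e-7):
    # Numerically approximate each partial derivative with the centered difference.
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)

    d_theta = backprop_grad(theta)

    # Relative difference: Euclidean distance normalized by the sum of the norms.
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

# Values around 1e-7 or smaller suggest backpropagation is correct;
# values above 1e-3 usually indicate a bug.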
What is overfitting?
It is a common pitfall in deep learning algorithms in which a model tries to fit the training data entirely and ends up
memorizing the data patterns and the noise and random fluctuations.
These models fail to generalize and perform well in the case of unseen data scenarios, defeating the model's
purpose.
The training time of the model or its architectural complexity may cause the model to overfit. If the model trains for
too long on the training data or is too complex, it learns the noise or irrelevant information within the dataset.
Key definitions
Here are some of the key definitions that will help you navigate through this topic.
Bias: Bias measures the difference between the model’s prediction and the target value. If the model is
oversimplified, then the predicted value would be far from the ground truth resulting in more bias.
Variance: Variance is the measure of the inconsistency of predictions over varied datasets. If the model's performance is tested on different datasets, the closer the predictions, the lower the variance. Higher variance is an indication of overfitting, in which the model loses the ability to generalize.
Bias-variance tradeoff: A simple linear model is expected to have a high bias and low variance due to less
complexity of the model and fewer trainable parameters. On the other hand, complex non-linear models
tend to observe an opposite behavior. In an ideal scenario, the model would have an optimal balance of bias
and variance.
Model generalization: Model generalization means how well the model is trained to extract useful data
patterns and classify unseen data samples.
Feature selection: It involves selecting a subset of features from all the extracted features that contribute
most towards the model performance. Including all the features unnecessarily increases the model
complexity and redundant features can significantly increase the training time.
Underfitting occurs when we have a high bias in our model, i.e., we are oversimplifying the problem, and as a result, the model does not work well even on the training data.
Overfitting occurs when the model has a high variance, i.e., the model performs well on the training data but does
not perform accurately in the evaluation set. The model memorizes the data patterns in the training dataset but fails
to generalize to unseen examples.
Overfitting vs. Underfitting vs. Good Model
Common causes of overfitting:
1. The data used for training is not clean and contains garbage values. The model captures the noise in the training data and fails to generalize its learning.
2. The training data size is not large enough, and the model trains on this limited data for many epochs.
3. The architecture of the model has several neural layers stacked together. Deep neural networks are complex, require a significant amount of time to train, and often end up overfitting the training set.
Common causes of underfitting:
1. Unclean training data containing noise or outliers can be a reason for the model not being able to derive patterns from the dataset.
2. The model has a high bias due to its inability to capture the relationship between the input examples and the target values.
3. The model is too simple; for example, a linear model is trained in a complex scenario.
The goal is to find a good fit such that the model picks up the patterns from the training data and does not end up
memorizing the finer details.
This, in turn, would ensure that the model generalizes and accurately predicts other data samples.
Have a look at this visual comparison to get a better understanding of the differences.
Underfitted vs. Fit vs. Overfitted model
One of the leading indicators of an overfit model is its inability to generalize to new datasets. The most obvious way to start detecting overfitting in machine learning models is to segment the dataset. This is done so that we can examine the model's performance on each set of data to spot overfitting when it occurs and see how the training process works.
K-fold cross-validation is one of the most popular techniques used to detect overfitting.
In K-fold cross-validation, we split the data points into k equally sized subsets, called "folds". One of the subsets acts as the testing set, and the remaining folds are used to train the model.
The model is trained on the limited sample to estimate how it is expected to perform in general when used to make predictions on data not used during training. One fold acts as the validation set in each turn.
After all the iterations, we average the scores to assess the performance of the overall model.
K-fold cross-validation
10 techniques to avoid overfitting
Here we will discuss possible options to prevent overfitting, which help improve model performance.
Adding more training data
With an increase in the training data, the crucial features to be extracted become prominent, and the model can better recognize the relationship between the input attributes and the output variable. The only assumption in this method is that the data fed into the model should be clean; otherwise, it would worsen the problem of overfitting.
Data augmentation
An alternative method to training with more data is data augmentation, which is less expensive and safer than the
previous method. Data augmentation makes a sample data look slightly different every time the model processes it.
Another option similar to data augmentation is adding noise to the input and output data. Adding noise to the input makes the model stable without affecting data quality and privacy, while adding noise to the output makes the data more diverse. Noise should be added within limits so that it does not make the data incorrect or too different.
Feature selection
Every model has several parameters or features depending upon the number of layers, number of neurons, etc. The model can pick up many redundant features, or features determinable from other features, leading to unnecessary complexity. We know very well that the more complex the model, the higher the chance that it will overfit.
Cross-validation
Cross-validation is a robust measure to prevent overfitting. The complete dataset is split into parts. In standard K-
fold cross-validation, we need to partition the data into k folds. Then, we iteratively train the algorithm on k-1 folds
while using the remaining holdout fold as the test set. This method allows us to tune the hyperparameters of the
neural network or machine learning model and test it using completely unseen data.
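A minimal scikit-learn sketch of K-fold cross-validation; the logistic regression model and the synthetic dataset below are illustrative choices, not requirements:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold, repeat.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", np.mean(scores))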
Simplify data
Till now, we have come across model complexity to be one of the top reasons for overfitting. The data
simplification method is used to reduce overfitting by decreasing the complexity of the model to make it simple
enough that it does not overfit. Some of the procedures include pruning a decision tree, reducing the number of
parameters in a neural network, and using dropout in a neural network.
Regularization
If overfitting occurs because a model is too complex, reducing the number of features makes sense. Regularization methods like Lasso (L1) can be beneficial if we do not know which features to remove from our model. Regularization applies a "penalty" to the input parameters with the larger coefficients, which subsequently limits the model's variance.
Ensembling
It is a machine learning technique that combines several base models to produce one optimal predictive model. In
ensemble learning, the predictions are aggregated to identify the most popular result. Well-known ensemble methods include bagging and boosting, which prevent overfitting because an ensemble model is made from the aggregation of multiple models.
Early stopping
This method aims to pause the model's training before memorizing noise and random fluctuations from the data.
There is a risk that the model stops training too soon, leading to underfitting. One has to find the optimal number of epochs or iterations for which the model should train.
Dropout
Large weights in a neural network signify a more complex network. Probabilistically dropping out nodes in the network is a simple and effective method to prevent overfitting. In dropout regularization, some number of layer outputs are randomly ignored or "dropped out" to reduce the complexity of the model.
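A minimal NumPy sketch of inverted dropout applied to one layer's activations during training; keep_prob (the fraction of units kept) and the toy data are illustrative assumptions, not a full training loop:

import numpy as np

def dropout_forward(activations, keep_prob=0.8, training=True):
    if not training:
        # At inference time, use all units unchanged.
        return activations
    # Randomly zero out units, then rescale so the expected activation stays the same.
    mask = (np.random.rand(*activations.shape) < keep_prob)
    return activations * mask / keep_prob

# Example: drop roughly 20% of the units of a hidden layer's output.
hidden = np.random.randn(4, 10)
print(dropout_forward(hidden, keep_prob=0.8))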
Our tip: If one has two models with almost equal performance, the only difference being that one model is more complex than the other, one should always go with the less complex model. In data science, it is a rule of thumb to always start with a less complex model and add complexity over time.
A model is trained by tuning hyperparameters on a training dataset and is then tested on a separate dataset called the testing set. If a model performs well on the training data, it should also work well on the testing set.
The scenario in which the model performs well in the training phase but gives a poor accuracy in the test
dataset is called overfitting.
The machine learning algorithm performs poorly on the training dataset if it cannot derive features from the
training set. This condition is called underfitting.
Common techniques to reduce overfitting include:
Feature selection: choosing the best features and removing the useless or unnecessary ones
Early stopping: halting the training of deep learning models when the number of epochs is set too high
Dropout: randomly selecting nodes and removing them from training
4.1.3 Regularization
What is Regularization?
Regularization is one of the most important concepts of machine learning. It is a technique to prevent the model from overfitting by adding extra information to it.
Sometimes the machine learning model performs well with the training data but does not perform well with the test data. This means the model is not able to predict the output when dealing with unseen data, because it has also fit the noise in the training data; such a model is called overfitted. This problem can be dealt with using a regularization technique.
This technique can be used in such a way that it allows us to keep all the variables or features in the model while reducing their magnitudes. Hence, it maintains accuracy as well as the generalization of the model.
It mainly regularizes or shrinks the coefficients of the features toward zero. In simple words, "in the regularization technique, we reduce the magnitude of the features while keeping the same number of features."
Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple
linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
Here, β1, β2, …, βn are the weights or magnitudes attached to the features x1, x2, …, xn respectively, β0 represents the bias of the model, and b represents the intercept.
Linear regression models try to optimize the coefficients and the intercept to minimize the cost function. The cost function for the linear model, called the RSS or residual sum of squares, is given below:

RSS = Σ (i = 1 to m) [ y_i − ( β0 + Σ (j = 1 to n) βj x_ij ) ]²

We optimize the parameters of this loss function so that the model can predict the accurate value of Y; regularization then adds a penalty term to it.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced
so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It
is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. It is calculated by multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression is:

  Cost = Σ (i = 1 to m) [ y_i − ŷ_i ]² + λ Σ (j = 1 to n) βj²

o In the above equation, the penalty term regularizes the coefficients of the model; hence, ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the equation becomes the
cost function of the linear regression model. Hence, for the minimum value of λ, the model will
resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the absolute weights
instead of a square of weights.
o Since it takes absolute values, it can shrink a coefficient all the way to 0, whereas ridge regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression is:

  Cost = Σ (i = 1 to m) [ y_i − ŷ_i ]² + λ Σ (j = 1 to n) |βj|
o Some of the features in this technique are completely neglected for model evaluation (their coefficients become exactly zero).
o Hence, Lasso regression can help us to reduce overfitting in the model as well as perform feature selection.
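A minimal scikit-learn sketch comparing Ridge (L2) and Lasso (L1) on synthetic data; here the parameter alpha plays the role of λ, and the dataset is only illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))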
4.1.4 Activation functions
Definition
In artificial neural networks, an activation function is one that outputs a small value for small inputs and a larger value if its inputs exceed a threshold. An activation function "fires" if the inputs are big enough; otherwise, nothing happens. An activation function, then, is a gate that checks whether an incoming value is higher than a threshold value.
Activation functions are useful because they introduce non-linearities into neural networks and enable them to learn powerful operations. If the activation functions were removed, a feedforward neural network would collapse into a simple linear function or matrix transformation of its input.
The activation function determines whether a neuron should be activated by computing a weighted sum of its inputs and adding a bias to it. Its purpose is to introduce non-linearity into a neuron's output.
Explanation: As we are aware, neurons in neural networks operate according to their weights, biases, and activation functions. Based on the error, the weights and biases inside a neural network are modified; this process is known as back-propagation. Activation functions make back-propagation possible, since they provide the gradients along with the error needed to update the weights and biases.
Without an activation function, a neural network is nothing more than an interconnected linear regression model. The activation function transforms the input non-linearly, allowing the network to learn and perform more challenging tasks.
It is simply the procedure used to obtain a node's output. It also goes by the name transfer function.
The composition of linear functions is itself a linear function, so no matter how many hidden layers we add to a neural network, without non-linear activations they will all behave in the same way. A neuron cannot learn anything useful if all it has is a linear model; with a non-linear activation function, it can learn based on the difference with respect to the error.
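A small NumPy sketch of three commonly used activation functions, to make the idea concrete:

import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes values into (-1, 1), centered at zero.
    return np.tanh(x)

def relu(x):
    # Passes positive values through, zeroes out negative ones.
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid:", sigmoid(x))
print("tanh:   ", tanh(x))
print("relu:   ", relu(x))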
4.1.5 Multitask and Semi-supervised Learning
Multi-Task Learning (MTL) is a type of machine learning technique where a model is trained to perform multiple tasks simultaneously. In deep learning, MTL refers to training a neural network to perform multiple tasks by sharing some of the network's layers and parameters across tasks.
In MTL, the goal is to improve the generalization performance of the model by leveraging the information
shared across tasks. By sharing some of the network’s parameters, the model can learn a more efficient
and compact representation of the data, which can be beneficial when the tasks are related or have some
commonalities.
There are different ways to implement MTL in deep learning, but the most common approach is to use a
shared feature extractor and multiple task-specific heads. The shared feature extractor is a part of the
network that is shared across tasks and is used to extract features from the input data. The task-specific
heads are used to make predictions for each task and are typically connected to the shared feature
extractor.
Another approach is to use a shared decision-making layer, where the decision-making layer is shared
across tasks, and the task-specific layers are connected to the shared decision-making layer.
MTL can be useful in many applications such as natural language processing, computer vision, and healthcare, where multiple tasks are related or have some commonalities. It is also useful when data is limited, since MTL can help to improve the generalization performance of the model by leveraging the information shared across tasks.
However, MTL also has its own limitations; for instance, when the tasks are very different or unrelated, sharing parameters can hurt performance (negative transfer).
Multi-Task Learning is a sub-field of deep learning. It is recommended that you familiarize yourself with the concepts of neural networks to understand what multi-task learning means.

What is Multi-Task Learning?
Multi-Task Learning is a sub-field of machine learning that aims to solve multiple different tasks at the same time by taking advantage of the similarities between them. This can improve learning efficiency and also act as a regularizer, which we will discuss in a while. Formally, if there are n tasks (conventional deep learning approaches aim to solve just one task using one particular model), where these n tasks or a subset of them are related to each other but not exactly identical, Multi-Task Learning (MTL) helps improve the learning of a particular model by using the knowledge contained in all n tasks.

Intuition behind Multi-Task Learning (MTL)
With deep learning models we usually aim to learn a good representation of the features or attributes of the input data in order to predict a specific value. Formally, we aim to optimize a particular function by training a model and fine-tuning its hyperparameters until the performance cannot be increased further. By using MTL, it might be possible to increase performance even further by forcing the model to learn a more generalized representation, as it learns (updates its weights) not just for one specific task but for a set of tasks. Biologically, humans learn in the same way: we learn better if we learn multiple related tasks instead of focusing on one specific task for a long time.

MTL as a regularizer
In machine learning terms, MTL can also be viewed as a way of inducing bias. It is a form of inductive transfer: using multiple tasks induces a bias that prefers hypotheses that can explain all n tasks. MTL acts as a regularizer by introducing this inductive bias, which significantly reduces the risk of overfitting and also reduces the model's ability to accommodate random noise during training. Now, let's discuss the major and prevalent techniques used to implement MTL.

Hard Parameter Sharing
A common hidden layer is used for all tasks, but several task-specific layers are kept towards the end of the model. This technique is very useful because, by learning a representation for various tasks with a common hidden layer, we reduce the risk of overfitting.
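A minimal PyTorch sketch of hard parameter sharing: a shared feature extractor (trunk) feeding two task-specific heads. The layer sizes and the task types are illustrative assumptions, not a prescribed architecture:

import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=100, hidden_dim=64, n_classes_task_a=5):
        super().__init__()
        # Shared feature extractor used by both tasks (hard parameter sharing).
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
        )
        # Task-specific heads.
        self.head_a = nn.Linear(hidden_dim, n_classes_task_a)  # e.g. classification
        self.head_b = nn.Linear(hidden_dim, 1)                  # e.g. regression

    def forward(self, x):
        features = self.shared(x)
        return self.head_a(features), self.head_b(features)

model = MultiTaskNet()
x = torch.randn(8, 100)
logits_a, pred_b = model(x)
# The total loss is typically a (possibly weighted) sum of the per-task losses.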
Assumptions and Considerations
Using MTL to share knowledge among tasks is useful only when the tasks are very similar; when this assumption is violated, performance can decline significantly.

Applications
MTL techniques have found various uses. Some of the major applications are:
Object detection and facial recognition
Self-driving cars: pedestrians, stop signs, and other obstacles can be detected together
Multi-domain collaborative filtering for web applications
Stock prediction
Language modelling and other NLP applications
Important points:
Here are some important points to consider when implementing Multi-Task Learning (MTL) for deep
learning:
1. Task relatedness: MTL is most effective when the tasks are related or have some
commonalities, such as natural language processing, computer vision, and healthcare.
2. Data limitation: MTL can be useful when the data is limited, as it allows the model to leverage
the information shared across tasks to improve the generalization performance.
3. Shared feature extractor: A common approach in MTL is to use a shared feature extractor,
which is a part of the network that is shared across tasks and is used to extract features from the
input data.
4. Task-specific heads: Task-specific heads are used to make predictions for each task and are
typically connected to the shared feature extractor.
5. Shared decision-making layer: another approach is to use a shared decision-making layer,
where the decision-making layer is shared across tasks, and the task-specific layers are
connected to the shared decision-making layer.
6. Careful architecture design: The architecture of MTL should be carefully designed to
accommodate the different tasks and to make sure that the shared features are useful for all
tasks.
7. Overfitting: MTL models can be prone to overfitting if the model is not regularized properly.
8. Avoiding negative transfer: when the tasks are very different or independent, MTL can lead to
suboptimal performance compared to training a single-task model. Therefore, it is important to
make sure that the shared features are useful for all tasks to avoid negative transfer.
Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that represents the intermediate ground between supervised and unsupervised learning algorithms. It uses a combination of labeled and unlabeled datasets during the training period.
Before understanding semi-supervised learning, you should know the main categories of machine learning algorithms. Machine learning consists of three main categories: supervised learning, unsupervised learning, and reinforcement learning. The basic difference between supervised and unsupervised learning is that supervised learning datasets include an output label associated with each training tuple, while unsupervised datasets do not. Semi-supervised learning is an important category that lies between supervised and unsupervised machine learning: it operates on data that contains a few labels but mostly consists of unlabeled data, because labels are costly and, in practice, only a few may be available.
The basic disadvantage of supervised learning is that it requires hand-labeling by ML specialists or data scientists, which is expensive to carry out. Unsupervised learning also has a limited spectrum of applications. To overcome these drawbacks of supervised and unsupervised learning algorithms, the concept of semi-supervised learning was introduced. In this setting, the training data is a combination of labeled and unlabeled data: the labeled data exists in a very small amount while the unlabeled data exists in a huge amount. Initially, similar data is clustered using an unsupervised learning algorithm, and this clustering helps to turn the unlabeled data into labeled data. This is done because labeled data is a comparatively more expensive acquisition than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and at college. Further, if the student analyzes the same concept on their own without any help from an instructor, this comes under unsupervised learning. Under semi-supervised learning, the student has to revise the concept on their own after having studied it under the guidance of an instructor at college.
To work with the unlabeled dataset, there must be a relationship between the objects. To understand this, semi-
supervised learning uses any of the following assumptions:
o Continuity Assumption: As per the continuity assumption, objects that are near each other tend to share the same group or label. This assumption is also used in supervised learning, where the datasets are separated by decision boundaries. In semi-supervised learning, the decision boundaries are additionally assumed to lie in low-density regions (the smoothness assumption).
o Cluster assumptions- In this assumption, data are divided into different discrete clusters. Further, the
points in the same cluster share the output label.
o Manifold assumption: The data lie approximately on a manifold of much lower dimension than the input space, which allows distances and densities defined on that manifold to be used.
o The high-dimensional data are created by a process that has fewer degrees of freedom and may be hard to model directly. (This assumption becomes practical when the dimensionality of the input is high.)
Semi-supervised learning uses pseudo labeling to train the model with less labeled training data than
supervised learning. The process can combine various neural network models and training ways. The whole
working of semi-supervised learning is explained in the below points:
o First, it trains the model with a small amount of labeled training data, similar to supervised learning models. Training continues until the model gives accurate results.
o In the next step, the algorithm applies the model to the unlabeled dataset to generate pseudo labels; at this stage the results may not be accurate.
o Now, the labels from the labeled training data and the pseudo-labeled data are linked together.
o The input data in the labeled training data and the unlabeled training data are also linked.
o In the end, the model is trained again on the new combined input, as was done in the first step. This reduces errors and improves the accuracy of the model (a minimal sketch follows below).
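A minimal scikit-learn sketch of the pseudo-labeling loop described above; logistic regression, the 0.9 confidence threshold, and the synthetic data are illustrative choices, not fixed requirements:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: pretend only the first 50 samples are labeled.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]
X_unlabeled = X[50:]

# Step 1: train on the small labeled set.
model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# Step 2: generate pseudo labels for confident unlabeled samples.
probs = model.predict_proba(X_unlabeled)
confident = probs.max(axis=1) > 0.9
pseudo_labels = probs.argmax(axis=1)[confident]

# Steps 3-5: combine labeled and pseudo-labeled data and retrain.
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)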
Reinforcement learning is different from semi-supervised learning, as it works with rewards and feedback. Reinforcement learning aims to maximize rewards through trial-and-error actions, whereas in semi-supervised learning we train the model with a small labeled dataset together with unlabeled data.
Semi-supervised learning models are becoming more popular in industry. Some of the main applications are as follows.
o Speech analysis: This is the classic example of a semi-supervised learning application. Since labeling audio data is a very labour-intensive task that requires many human resources, the problem can naturally be addressed by applying semi-supervised learning.
o Web content classification: It is practically impossible to label each page on the internet because that would need too much human intervention. Still, this problem can be reduced through semi-supervised learning algorithms. Further, Google also uses semi-supervised learning algorithms to rank a web page for a given query.
o Protein sequence classification: DNA strands are long, so labeling them requires active human intervention. The rise of semi-supervised models has therefore been prominent in this field.
o Text document classification: As we know, it would be very unfeasible to find a large amount of labeled text data, so semi-supervised learning is an ideal model to overcome this.
4.2 Text Embedding
1) Word2Vec:
In Word2Vec, every word is assigned a vector. We start with either a random vector or a one-hot vector.
One-hot vector: a representation where only one bit in the vector is 1. If there are 500 words in the corpus, then the vector length will be 500. After assigning vectors to each word, we take a window size and iterate through the entire corpus. While doing this, two neural embedding methods are used:
CBOW (Continuous Bag of Words): In this model, we try to fit the neighboring words in the window to the central word.
Skip-Gram: In this model, we try to make the central word closer to the neighboring words. It is the complete opposite of the CBOW model, and it has been shown to produce more meaningful embeddings.
After applying the above neural embedding methods we get trained vectors of each word after many
iterations through the corpus. These trained vectors preserve syntactical or semantic information and are
converted to lower dimensions. The vectors with similar meaning or semantic information are placed close
to each other in space.
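A minimal Gensim sketch of training Word2Vec on a toy corpus; the parameter names follow Gensim 4.x, and the corpus and settings are illustrative assumptions:

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["it", "is", "a", "nice", "evening"],
    ["good", "evening"],
    ["is", "it", "a", "nice", "evening"],
]

# sg=1 selects the skip-gram method; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["evening"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("evening"))   # words with the closest vectors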
2) GloVe:
This is another method for creating word embeddings. In this method, we take the corpus and iterate
through it and get the co-occurrence of each word with other words in the corpus. We get a co-occurrence
matrix through this. The words which occur next to each other get a value of 1, if they are one word apart
then 1/2, if two words apart then 1/3 and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus:
It is a nice evening.
Good Evening!
Is it a nice evening?
         it        is        a         nice      evening   good
it       0
is       1+1       0
a        1/2+1     1+1/2     0
nice     1/3+1/2   1/2+1/3   1+1       0
evening  1/4+1/3   1/3+1/4   1/2+1/2   1+1       0
good     0         0         0         0         1         0
The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as
well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather
information about the context in which the word is used.
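A small Python sketch that builds this kind of co-occurrence matrix using the 1/distance weighting described above; the window size and whitespace tokenization are simplified assumptions:

from collections import defaultdict

corpus = [
    "it is a nice evening",
    "good evening",
    "is it a nice evening",
]

window = 4  # consider words up to 4 positions apart
cooc = defaultdict(float)

for sentence in corpus:
    tokens = sentence.split()
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            w2 = tokens[j]
            distance = j - i
            # Adjacent words add 1, one word apart adds 1/2, and so on.
            cooc[(w1, w2)] += 1.0 / distance
            cooc[(w2, w1)] += 1.0 / distance

print(cooc[("a", "it")])   # 1/2 + 1 = 1.5
print(cooc[("is", "it")])  # 1 + 1 = 2.0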
Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how close they are to each other in space. If two words occur together more often (have a higher value in the co-occurrence matrix) but their vectors are far apart in space, the vectors are brought closer together. If the vectors are close to each other but the words are rarely used together, they are moved further apart in space.
After many iterations of the above process, we’ll get a vector space representation that approximates the
information from the co-occurrence matrix. The performance of GloVe is better than Word2Vec in terms
of both semantic and syntactic capturing.
Pre-trained Word Embedding Models:
People generally use pre-trained models for word embeddings. A few of them are:
SpaCy
fastText
Flair, etc.
Common Errors made:
You need to use the exact same pipeline when deploying your model as was used to create the training data for the word embedding. If you use a different tokenizer or a different method of handling white space, punctuation, etc., you might end up with incompatible inputs.
Words in your input that do not have a pre-trained vector are known as out-of-vocabulary (OOV) words. What you can do is replace those words with "UNK", which means unknown, and then handle them separately (see the sketch after this list).
Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same dimensions throughout.
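A tiny sketch of the OOV-handling idea mentioned above: replace any token that has no pre-trained vector with a special "UNK" token. The vocabulary below is a stand-in for a real embedding's vocabulary:

# Stand-in for the vocabulary of a pre-trained embedding model.
known_vocab = {"the", "cat", "sat", "on", "mat"}

def replace_oov(tokens, vocab, unk_token="UNK"):
    # Keep known tokens, map everything else to the unknown token.
    return [tok if tok in vocab else unk_token for tok in tokens]

print(replace_oov(["the", "zorblax", "sat"], known_vocab))
# ['the', 'UNK', 'sat']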
Benefits of using Word Embeddings:
It is much faster to train than hand-built models like WordNet (which uses graph embeddings)
Almost all modern NLP applications start with an embedding layer
It stores an approximation of meaning
Drawbacks of Word Embeddings:
It can be memory intensive
It is corpus dependent. Any underlying bias will have an effect on your model
It cannot distinguish between homophones. Eg: brake/break, cell/sell, weather/whether etc.
4.2.2 Advanced word vector representations
The expressive power of the notations used to represent the vocabulary of a language has long been of great interest. Natural language is full of ambiguity (for example, in an ambiguous sentence, did Sam wish for John's wife or his own?), and these ambiguities must be handled in order to represent information in its true form.
So, how do we develop a machine-level understanding of language for the language modelling task? Classical count-based or similarity-based methods have existed for quite a long time in computer science.
But, now with the advent of deep learning models like RNNs this expressing power of language modelling is
of great interest for creating efficient system to store information and finding relations between vocabulary
terms. This will act as fundamental block in encoder-decoder models like seq-to-seq model.
Machines are better at understanding numbers than the actual text passed on as tokens. This process of converting text to numbers is called vectorization. The resulting vectors form a vector space which is continuous in nature, an algebraic model where the rules of vector addition and similarity measures apply. Different approaches to vectorization exist; let's move from the most primitive ones to the most advanced ones.
1. One-hot representations
2. Distributed representations (co-occurrence based)
3. SVD-based representations
4. Continuous Bag of Words (CBOW) model
5. Skip-Gram model
6. GloVe representations
For defining representation power of a system we first will look into workings of different representation
systems. These systems represent each and every word of a vocabulary in the form a vector and create a finite
vector space.
Let's see an example of the one-hot representation of words. Each word is represented by a large vector of size |V|, where V is the vocabulary. The representation of the i-th word has a 1 in the i-th position and a 0 in the remaining |V| − 1 positions.
It is a very simple form of representation with a very easy implementation, but many of its faults become clear even from such a small example: huge memory is required for storing and processing these vectors, and the vectors are extremely sparse. For example, |V| is very large, around 3M for the Google 1T corpus, so this notation fails because of the computational overhead it creates.
Also, no notion of similarity is captured: the cosine similarity between any two distinct words is zero, and the Euclidean distance between them is always sqrt(2). In other words, no semantic information is expressed by this representation system.
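A quick NumPy illustration of these two properties of one-hot vectors:

import numpy as np

V = 5  # toy vocabulary size
dog = np.eye(V)[1]   # one-hot vector for word index 1
cat = np.eye(V)[3]   # one-hot vector for word index 3

cosine = dog @ cat / (np.linalg.norm(dog) * np.linalg.norm(cat))
distance = np.linalg.norm(dog - cat)

print(cosine)    # 0.0  -> no similarity is captured between any two words
print(distance)  # 1.414... = sqrt(2) for every pair of distinct words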
Now, clearly we should move to a representation which saves space and holds some semantic power.
"You shall know a word by the company it keeps."
— Firth, J.R.
The idea is to quantify the co-occurrence of terms in a corpus. This co-occurrence is measured within a window of size k around each term, which defines the context distributed over that window. With this method in mind, a terms × terms co-occurrence matrix is created, which captures the number of times a term appears in the context of another term. It is also good practice to remove stopwords, as these are high-frequency words that provide the least amount of meaningful insight; alternatively, we can cap the counts at an upper threshold t for very frequent words.
Example corpus (fragment): "Human machine interface for computer applications. User opinion of computer system response time. ..."
Co-occurrence matrix with a window of size k = 2: each row (or column) of the co-occurrence matrix gives a vectorial representation of the corresponding word's context.
Here, if stopwords are not handled properly, they create problems because of their relatively high frequency. So a new quantity known as Positive Pointwise Mutual Information (PPMI) is used, which takes the individual probabilities of the terms into account:

PPMI(w, c) = max( 0, log [ p(w, c) / ( p(w) * p(c) ) ] )

With this method we get some idea about context, but it is still not an ideal approach. Consider 'cat' and 'dog' occurring in the corpus but never within the same window: there is clearly a relation between cat and dog (both being pets, mammals, etc.), yet this method will not be able to give any insight about their relation.
In the above case, the co-occurrence between {system, machine} and {human, user} is not visible, even though these pairs are clearly related. Also, high dimensionality, the sparse nature of the matrix, and redundancy (it is a symmetric matrix) still persist, along with the storage problem.
The high-dimensionality problem is addressed by PCA or, more generally, by SVD. SVD is a generalization of the eigen-decomposition of a positive semi-definite normal matrix to any matrix via an extension of the polar decomposition, and its truncation gives the best rank-k approximation of the original data. Let the original data X be of dimension m x n. The singular value decomposition breaks it into

X = U Σ V^T = σ1 u1 v1^T + σ2 u2 v2^T + ...

The SVD theorem tells us that u1, v1, and σ1 store the most important information in X, and subsequent terms store less and less important information.
An analogy can be seen in the case of colors represented with 8 bits. These 8 bits provide high resolution, but now suppose we want to compress the representation into only 4 bits. Which bits should we keep?
When the number of bits is reduced, the most important information is that which identifies the color itself, not the fine shades of each color; a similar, coarser resolution will still exist for the different colors like red, blue, yellow, etc. If we instead kept only the information about shades, the colors themselves would lose their meaning, because all information identifying the color would be lost. Hence, the bits that identify the color are the most important ones. SVD plays the analogous role here of keeping only the most important components.
With SVD, the latent co-occurrence between {system, machine} and {human, user} becomes visible. We take the matrix product X·X^T, whose ij-th entry is the dot product between the representation of word i (X[i, :]) and word j (X[j, :]); this entry roughly captures the cosine similarity between word i and word j. From the product X·X^T, a low-rank matrix capturing the important features is constructed, and the latent co-occurrence between {system, machine} and {human, user} becomes visible (see the red and blue parts in the figure).
We would want the representations of words i and j to be of smaller dimension but still have the same similarity (dot product) as the corresponding rows of X̂. Hence, the search continues for more powerful, lower-dimensional representations.
Also, notice that the dot product between the rows of the matrix Wword = UΣ is the same as the dot product between the corresponding rows of X̂. Wword = UΣ ∈ R^(m×k) is taken as the representation of the m words in the vocabulary, and Wcontext = V is taken as the representation of the context words.
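A small NumPy sketch of obtaining rank-k word representations from a co-occurrence matrix via SVD; the random matrix below stands in for a real co-occurrence matrix X:

import numpy as np

# Stand-in for an m x n word-context co-occurrence matrix X.
m, n, k = 8, 8, 3
X = np.random.rand(m, n)

# Full SVD, then keep only the top-k singular directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W_word = U[:, :k] * S[:k]      # m x k word representations (U Sigma)
W_context = Vt[:k, :].T        # n x k context representations (V)

# The best rank-k approximation of X is W_word @ W_context.T
X_hat = W_word @ W_context.T
print(np.round(X_hat - X, 2))  # residual of the rank-k approximation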
The methods we have seen so far are count-based models (such as SVD), since they use co-occurrence counts and the classical statistics-based NLP principles. Now we move on to prediction-based models, which directly learn word representations. Consider the task of predicting the n-th word given the previous (n − 1) words. For training data, every n-word window in the training corpus can be used, and a corpus can be obtained by scraping any web page.
Now, how do we model this task of predicting the n-th word, and what is the connection between this task and learning word representations? To model this problem, we use a feed-forward neural network as shown below.
Our aim is to predict a probability distribution over these |V| classes, as a multi-class classification problem. But this looks very complex, and neural networks have a very large number of parameters; how is this a simpler approach?
For this, we need to look a little at the mathematics behind the vector multiplication happening behind the scenes in this neural network. The product Wcontext·x, given that x is a one-hot vector, is simply the i-th column of Wcontext. A one-to-one mapping exists between words and the columns of Wcontext, so we can treat the i-th column of Wcontext as the representation of context word i. Similarly, P(on | sat) is proportional to the dot product between the j-th column of Wcontext and the i-th column of Wword; P(word = i | sat) thus depends on the i-th column of Wword, so we treat the i-th column of Wword as the representation of word i. This clearly shows that the weight parameters of the neural network serve as the word vector representations.
Having understood the simplicity of the interpretation behind the parameters, our aim now is to learn these parameters. For this multi-class classification problem we use softmax as the output activation function and cross-entropy as the loss function.
Let's see what we can interpret from the update rule of this loss function. Substituting the value of ŷ into the loss function: when ŷ is 1, the correct word is already being predicted, so there is no update; when it is 0, vw gets updated by adding a fraction of uc.
This increases the cosine similarity between vw and uc. The training objective thus ensures that the cosine similarity between the word vector (vw) and the context word vector (uc) is maximized, so a similarity measure is captured by this representation as well. Moreover, the neural network helps in learning much simpler and more abstract vector representations of words.
In practice, more than one word is used as context; it is common to use a window of size d depending on the use case. That simply means we stack d copies of Wcontext in the bottom layer, one per context word.
However, the computational bottleneck of the softmax is still present, because its denominator involves a summation over the entire vocabulary. We must explore other models that mitigate this bottleneck.
Skip-Gram Model
This model predicts context words given an input word; the roles of context and word are reversed compared to CBOW. Now, with a given input word as a one-hot representation, our aim is to predict the context words related to it. This opposite relationship between the CBOW model and the Skip-Gram model becomes clear below.
Given a corpus, the model loops over the words of each sentence and either tries to use the current word to predict its neighbors (its context), in which case the model is called "Skip-Gram", or it uses each of these contexts to predict the current word, in which case the model is called "Continuous Bag Of Words" (CBOW).
With 'on' as the input word, the probabilities of context words related to it are predicted by this network.
We train a simple neural network with a single hidden layer to perform a certain task, but then we do not actually use that neural network for the task we trained it on. Instead, the goal is just to learn the weights of the hidden layer, which are the word vectors, as stated earlier.
In the simple case when there is only one context word, we arrive at the same update rule for uc as we did for vw earlier. If we have multiple context words, the loss function is simply summed over them.
Again, the same issue arises as with CBOW: the softmax is computationally expensive for this model as well.
Three strategies, namely negative sampling, contrastive estimation, and hierarchical softmax, can be used to mitigate this bottleneck.
Negative Sampling: We sample k negative (w, r) pairs with no context relationship for every positive (w, c) context pair. The size of D' is thus k times the size of D. In our neural network we define loss functions and train on both these sets. The corrupted pairs are drawn from a specially designed distribution which favours less frequent words being drawn more often. The summation over the entire vocabulary is reduced because two smaller sets are created.
Hierarchical Softmax: The words of the vocabulary are placed on the leaves of a binary tree, and p(w|vc) is computed as a product of probabilities along the path π(w) from the root to the leaf of w. p(w|vc) can now be computed using |π(w)| computations instead of the |V| required by softmax. Also, a random arrangement of the words on the leaf nodes does well in practice.
GloVe Representations
Count-based methods (such as SVD) rely on global co-occurrence counts from the corpus for computing word representations, while prediction-based methods learn word representations from local co-occurrence information. Why not combine the two mechanisms, counting and learning?
Let's formulate this idea mathematically and then develop an intuition for it. Let Xij encode the important global co-occurrence information: the number of times word j appears in the context of word i. Our aim is to learn word vectors that comply with the probabilities computed over the entire corpus; essentially, vectors vi and vj such that vi^T vj is faithful to the globally computed P(j|i), i.e. vi^T vj ≈ log P(j|i) = log Xij − log Xi.
Writing this equation for both Xi and Xj and adding the two, the terms log(Xi) and log(Xj) depend only on words i and j, so we can think of them as word-specific biases bi and bj that will also be learned. The problem can then be formulated as minimizing

Σ over (i, j): ( vi^T vj + bi + bj − log Xij )²

Now, the remaining problem is that the weights of all co-occurrences are equal. A weighting function should be defined in such a way that very frequent co-occurrences (such as those involving stopwords) are not over-weighted and rare ones are not ignored; GloVe therefore multiplies each squared term by a weight f(Xij) that grows with Xij up to a cap.
Words are represented by dense vectors, where a vector represents the projection of the word into a continuous vector space. This is an improvement over the more traditional bag-of-words encoding schemes, in which large sparse vectors were used to represent each word. Those representations were sparse because the vocabularies were vast, and a given word or document would be represented by a large vector composed mostly of zero values.
Considering recent popular research papers: Baroni et al. [2014] showed that predict models consistently outperform count models in all tasks. Levy et al. [2015] performed a much more thorough analysis and showed that good old SVD does better than prediction-based models on similarity tasks, but not on analogy tasks. Levy also showed that word2vec implicitly performs a matrix factorization: it turns out that word2vec factorizes a matrix M which is related to the PMI-based co-occurrence matrix, very similar to what SVD does.
Good old SVD will do just fine in many cases, but in practice, pre-trained GloVe embeddings trained on huge corpora give much better results and, once trained, can be reused. Libraries like Keras and Gensim provide embedding layers that can be used with ease for modelling tasks.
4.2.3 Sequence-to-sequence model
What is seq2seq Model in Machine Learning?
Seq2seq was first introduced for machine translation by Google. Before that, translation worked in a very naïve way: each word you typed was converted to its target-language equivalent, with no regard to grammar or sentence structure. Seq2seq revolutionized the process of translation by making use of deep learning. It not only takes the current word/input into account while translating, but also its neighborhood.
Seq2Seq (Sequence-to-Sequence) is a type of model in machine learning that is used for tasks such
as machine translation, text summarization, and image captioning. The model consists of two main
components:
Encoder
Decoder
Seq2Seq models are trained using a dataset of input-output pairs, where the input is a sequence of tokens
and the output is also a sequence of tokens. The model is trained to maximize the likelihood of the correct
output sequence given the input sequence.
Seq2Seq models have been widely used in NLP tasks such as machine translation, text summarization,
and image captioning, due to their ability to handle variable-length input and output sequences.
Additionally, the Attention mechanism is often used in Seq2Seq models to improve performance and it
allows the decoder to focus on specific parts of the input sequence when generating the output.
Nowadays, it is used for a variety of different applications such as image captioning, conversational
models, text summarization, etc.
Encoder-Decoder Stack
As the name suggests, seq2seq takes as input a sequence of words(sentence or sentences) and generates an
output sequence of words. It does so by use of the recurrent neural network (RNN). Although the vanilla
version of RNN is rarely used, its more advanced version i.e. LSTM or GRU is used. This is because RNN
suffers from the problem of vanishing gradient. LSTM is used in the version proposed by Google. It
develops the context of the word by taking 2 inputs at each point in time. One from the user and the other
from its previous output, hence the name recurrent (output goes as input).
The encoder and decoder are typically implemented as Recurrent Neural Networks (RNNs) or
Transformers.
Encoder Stack
It uses deep neural network layers and converts the input words to corresponding hidden vectors. Each
vector represents the current word and the context of the word. The encoder takes the input sequence, one
token at a time, and uses an RNN or transformer to update its hidden state, which summarizes the
information in the input sequence. The final hidden state of the encoder is then passed as the context
vector to the decoder.
Decoder Stack
It is similar to the encoder. It takes as input the hidden vector generated by the encoder, its own hidden
states, and the current word to produce the next hidden vector and finally predict the next word. The
decoder uses the context vector and an initial hidden state to generate the output sequence, one token at a
time. At each time step, the decoder uses the current hidden state, the context vector, and the previous
output token to generate a probability distribution over the possible next tokens. The token with the
highest probability is then chosen as the output, and the process continues until the end of the output
sequence is reached.
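A highly simplified PyTorch sketch of the encoder-decoder idea described above, using GRUs; the vocabulary sizes, dimensions, and the greedy decoding loop are illustrative assumptions, not a full training setup:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # The final hidden state summarizes the whole input sequence (context vector).
        _, hidden = self.rnn(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden):
        output, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(output), hidden  # logits over the next token

# Greedy decoding sketch: encode the source, then generate one token at a time.
encoder, decoder = Encoder(vocab_size=100), Decoder(vocab_size=100)
src = torch.randint(0, 100, (1, 7))          # a source sentence of 7 token ids
hidden = encoder(src)
token = torch.zeros(1, 1, dtype=torch.long)  # assume id 0 is the start-of-sequence token
for _ in range(5):
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)            # pick the most probable next token
    print(token.item())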