Word Embedding Learning Process
Please explain the process of learning word embeddings using the picture.
This image illustrates the process of learning word embeddings in a neural language model. Here’s a
step-by-step breakdown of what’s happening:
1. Word-to-Index Mapping
Each word in the input sentence is first mapped to its index in the vocabulary. Example:
"I" → 4343
"want" → 9665
"a" → 1
"glass" → 3852
"of" → 6163
"orange" → 6257
2. Embedding Lookup
Each index is used to look up a dense vector in the embedding matrix, so every word is represented by a learned embedding rather than a raw index. Example: the index 6257 for "orange" selects the corresponding row of the embedding matrix.
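A minimal sketch of this lookup, assuming the 10,000-word vocabulary mentioned below and a 300-dimensional embedding matrix E (the dimension and the random initialization are assumptions for illustration):

```python
import numpy as np

vocab_size = 10000       # vocabulary size (matches the 10,000 words mentioned below)
embedding_dim = 300      # assumed embedding dimension

# Embedding matrix E: one dense row vector per vocabulary word (randomly initialized here).
E = np.random.randn(vocab_size, embedding_dim) * 0.01

# Looking up the embedding for "orange" (index 6257) is just a row selection.
e_orange = E[6257]
print(e_orange.shape)    # (300,)
```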
3. Contextual Processing
The words "I" and "want" are connected with a red box, indicating they are part of the context
input.
Words like "a", "glass", "of", and "orange" are mapped to embeddings as well.
The embeddings of the context words are passed through a neural network.
The final layer uses a Softmax function to predict the most probable next word.
The output vocabulary contains around 10,000 words (as indicated in the yellow box).
The model learns embeddings by adjusting the weights W and biases b through backpropagation.
These weights and biases improve over time to generate meaningful word embeddings.
The handwritten notes show that the model learns relationships like:
"a glass of orange" → "juice"
Conclusion
The image demonstrates how a neural language model learns word embeddings by mapping words to
dense vectors, processing them through layers, and predicting the next word in context. Over time, the
learned embeddings capture meaningful relationships between words, improving language
understanding.
What does the Softmax do in the situation above? Please explain with an example.
In the situation above, the Softmax function is used to predict the next word based on the given
context. It converts raw scores (logits) from the neural network into probabilities that sum to 1, helping
the model make a prediction.
Example Walkthrough
The network generates raw scores (logits) for all possible words in the vocabulary. Suppose we have a
vocabulary of 10,000 words, and the model produces scores for a few relevant words:
| Word | Raw score (logit) |
|-------|-------------------|
| juice | 7.2 |
| apple | 4.5 |
| table | 1.8 |
| book  | 0.5 |
The Softmax converts these scores into probabilities:

$$P(y_i) = \frac{e^{o_i}}{\sum_{j} e^{o_j}}$$

where $o_i$ is the raw score (logit) for word $i$ and the sum in the denominator runs over the scores of all candidate words (here, just the four words above).

Applying Softmax:

$$P(\text{"juice"}) = \frac{e^{7.2}}{e^{7.2} + e^{4.5} + e^{1.8} + e^{0.5}}$$

Approximating with $e^{7.2} \approx 1339$, $e^{4.5} \approx 90$, $e^{1.8} \approx 6$, and $e^{0.5} \approx 1.6$:

$$P(\text{"juice"}) \approx \frac{1339}{1339 + 90 + 6 + 1.6} \approx 0.93$$

$$P(\text{"apple"}) = \frac{e^{4.5}}{e^{7.2} + e^{4.5} + e^{1.8} + e^{0.5}} \approx 0.06$$
Step 5: Prediction
The model selects the word with the highest probability as the next word; here "juice" is chosen with a probability of about 0.93.
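The arithmetic above can be checked with a few lines of NumPy; the four words and their scores are the ones from the example, and everything else is just the Softmax formula:

```python
import numpy as np

words  = ["juice", "apple", "table", "book"]
logits = np.array([7.2, 4.5, 1.8, 0.5])

# Softmax: exponentiate the scores and normalize so they sum to 1.
probs = np.exp(logits) / np.exp(logits).sum()
for w, p in zip(words, probs):
    print(f"{w}: {p:.3f}")   # juice: 0.932, apple: 0.063, table: 0.004, book: 0.001

print(words[int(np.argmax(probs))])  # "juice" is selected as the prediction
```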
Conclusion
The Softmax function converts raw scores into probabilities that sum to 1, allowing the model to select
the most probable next word. Over training, the model learns to assign higher probabilities to correct
words, improving its language understanding.
The Skip-Gram Model
The Skip-Gram model is a popular method for learning word embeddings using a neural network. It
is part of the Word2Vec family and is designed to predict surrounding words (context) given a target
word. This approach helps the model learn meaningful word relationships.
1. How Skip-Gram Works
The model is trained using large text corpora and updates embeddings based on co-occurrence
patterns.
We choose a window size (context size), which determines how many words around the target
word we consider.
Example: Consider the sentence:
"The cat sat on the mat."
If we use a context window of 2, then for the target word "sat", the context words are:
"The", "cat", "on", "the"
For each target word, we create training pairs consisting of the target word and each context word.
Example for context window = 2:
| Target Word | Context Word |
|-------------|--------------|
| sat | cat |
| sat | on  |
| sat | The |
| sat | the |
2. Model Architecture
Input Layer: A one-hot vector representing the target word.
Hidden Layer: A weight matrix where each row corresponds to a word embedding.
Output Layer: A Softmax function that predicts the probability of each word in the vocabulary being a context word.
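In terms of shapes, the two layers boil down to two weight matrices; this is a sketch that assumes the 5-word toy vocabulary used in the walkthrough below and an arbitrary 3-dimensional embedding:

```python
import numpy as np

vocab_size, embedding_dim = 5, 3            # toy sizes for illustration

# Input-to-hidden matrix: each row is the embedding of one vocabulary word.
W_in  = np.random.randn(vocab_size, embedding_dim) * 0.01
# Hidden-to-output matrix: scores every vocabulary word as a possible context word.
W_out = np.random.randn(embedding_dim, vocab_size) * 0.01

# A one-hot input simply selects the corresponding row of W_in.
one_hot_sat = np.array([0, 0, 1, 0, 0])     # "sat" in ["the", "cat", "sat", "on", "mat"]
print((one_hot_sat @ W_in).shape)           # (3,) - the embedding of "sat"
```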
3. Example Walkthrough
Let's assume we have a small vocabulary:
["the", "cat", "sat", "on", "mat"]
("sat", "cat")
("sat", "on")
("sat", "The")
("sat", "the")
Training Process
1. The target word "sat" is converted into a one-hot vector over the vocabulary:

```
[0, 0, 1, 0, 0]
```

2. The one-hot vector is multiplied by the hidden-layer weight matrix, which simply selects the embedding of "sat".
3. The output layer applies Softmax over the whole vocabulary to score each word as a possible context word.
4. The model is trained to maximize the probability of the correct context words appearing near the target word (see the sketch after this list).
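A minimal sketch of one such training step for the pair ("sat", "cat"), using toy matrices like the ones above; the cross-entropy loss is the standard objective for a Softmax output, and a real implementation would also apply a gradient update:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
vocab_size, embedding_dim = len(vocab), 3

W_in  = np.random.randn(vocab_size, embedding_dim) * 0.01   # rows = word embeddings
W_out = np.random.randn(embedding_dim, vocab_size) * 0.01   # output (context) weights

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# Forward pass for the training pair ("sat", "cat").
target_idx, context_idx = vocab.index("sat"), vocab.index("cat")
h = W_in[target_idx]                  # hidden layer = embedding of "sat"
probs = softmax(h @ W_out)            # probability of each word being a context word

# Training maximizes probs[context_idx], i.e. minimizes this cross-entropy loss.
loss = -np.log(probs[context_idx])
print(probs.round(3), float(loss))
```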
4. Advantages of Skip-Gram
✅ Works well with small datasets
✅ Learns meaningful word relationships
✅ Handles infrequent words better than CBOW
✅ Embeddings capture word analogies (e.g., "king" - "man" + "woman" ≈ "queen"; see the sketch after this list)
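The analogy in the last point can be checked with pretrained vectors; this sketch uses gensim's downloader API (assumes gensim is installed and downloads a small GloVe model on first use; the exact result depends on the pretrained vectors chosen):

```python
import gensim.downloader as api

# Small pretrained GloVe vectors (downloaded and cached on first call).
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns "queen" as the closest word.
```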
5. Comparison with CBOW (Continuous Bag of Words)
| Feature | Skip-Gram | CBOW |
|---------|-----------|------|
| Predicts | Context words from a target word | Target word from context words |
6. Conclusion
The Skip-Gram model is an effective technique for learning word embeddings by predicting
surrounding words based on a given target word. It captures rich semantic relationships between words
and is widely used in NLP applications like text classification, sentiment analysis, and machine
translation.
Would you like me to show how to implement Skip-Gram in Python using Word2Vec? 🚀