
BERT

INSTRUCTOR NAME: SHUKDEV DATTA


ML DEVELOPER AT INNOVATIVE SKILLS
What is BERT?
• BERT is an open source machine learning framework for natural language processing (NLP). It is designed to help computers understand the meaning of ambiguous language in text by using the surrounding text to establish context. The BERT framework was pretrained on text from Wikipedia and can be fine-tuned with question-and-answer data sets.
• BERT, which stands for Bidirectional Encoder Representations from Transformers, is based on transformers, a deep learning architecture in which every output element is connected to every input element and the weightings between them are calculated dynamically based on their relationship.
Background and history of BERT
• Google first introduced the transformer model in 2017. At that time, language models
primarily used recurrent neural networks (RNN) and convolutional neural networks (CNN) to
handle NLP tasks.
• CNNs and RNNs are competent models; however, they require sequences of data to be processed in a fixed order. Transformer models are considered a significant improvement because they don't require data sequences to be processed in any fixed order.
• Because transformers can process data in any order, they enable training on larger amounts
of data than was possible before their existence. This facilitated the creation of pretrained
models like BERT, which was trained on massive amounts of language data prior to its
release.
How BERT works
• BERT was pretrained using only a collection of unlabeled, plain text, namely the entirety of English Wikipedia and the BookCorpus. Because this pretraining objective is self-supervised, BERT learns from unlabeled text alone, and the pretrained model can keep being adapted to new text and queries in practical applications such as Google Search.
• BERT's pretraining serves as a base layer of knowledge from which it can build its responses.
From there, BERT can adapt to the ever-growing body of searchable content and queries,
and it can be fine-tuned to a user's specifications. This process is known as transfer learning.
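A minimal sketch of that transfer-learning step, assuming the Hugging Face transformers library and PyTorch are installed; the two-example "data set", the labels and the learning rate are purely illustrative, not taken from the slides.

```python
# Sketch: adapting pretrained BERT to a downstream task (transfer learning).
# Assumes Hugging Face `transformers` and PyTorch; the tiny "data set" is illustrative.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["The product works great.", "This was a waste of money."]
labels = torch.tensor([1, 0])  # hypothetical labels: 1 = positive, 0 = negative

# Tokenize the examples and run one gradient step on top of the pretrained weights.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # the classification head is trained from scratch
outputs.loss.backward()
optimizer.step()
print("fine-tuning loss:", outputs.loss.item())
```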
Components leading to BERT's creation
• Transformers

• Masked language modeling

• Self-attention mechanisms

• Next sentence prediction


Transformers
• Google's work on transformers made BERT possible. The
transformer is the part of the model that gives BERT its
increased capacity for understanding context and
ambiguity in language. The transformer processes any
given word in relation to all other words in a sentence,
rather than processing them one at a time. By looking at
all surrounding words, the transformer enables BERT to
understand the full context of the word and therefore
better understand searcher intent.
• This is contrasted against the traditional method of language processing, known as word embedding, an approach used in models such as GloVe and word2vec. Those models map every single word to a vector that represents only one dimension of that word's meaning.

Fig: Word Embeddings
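To make the contrast concrete, here is a purely illustrative sketch of a static word-embedding lookup; the words and vector values are invented and far smaller than the 100- to 300-dimensional vectors GloVe and word2vec actually learn.

```python
# Sketch: a static word-embedding table, as used by word2vec/GloVe-style models.
# The vectors are toy values invented for illustration.
static_embeddings = {
    "bank":  [0.21, -0.53, 0.88],   # one vector, whether "river bank" or "savings bank"
    "river": [0.35, -0.48, 0.91],
    "money": [-0.62, 0.14, 0.07],
}

# The lookup ignores context entirely: "bank" gets the same vector in every sentence,
# which is exactly the limitation BERT's contextual representations remove.
print(static_embeddings["bank"])
```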
Masked language modeling
• Word embedding models require large data sets of text to train. While they are adept at many general NLP tasks, they struggle with the context-heavy, predictive nature of question answering, because every word is fixed to a single vector, and therefore a single meaning.
• BERT uses a masked language modeling (MLM) method to keep the word in focus from seeing itself, that is, from having a fixed meaning independent of its context. BERT is forced to identify the masked word based on context alone. In BERT, words are defined by their surroundings, not by a predetermined identity. How?
Masked language modeling
• Imagine you're playing a guessing game where you have to figure out a missing word in a sentence.
BERT, which is a type of language model, plays a similar game. But instead of just guessing, it learns
to predict the missing word by looking at the words around it in a sentence.

• The trick here is that BERT doesn't know the exact word that's missing. It's like trying to solve a
puzzle without knowing all the pieces. So, BERT has to pay close attention to the context or the
other words in the sentence to make an educated guess about what the missing word could be.

• Because of this, BERT doesn't have a fixed idea of what each word means on its own. Instead, it
learns the meaning of words based on how they're used in different sentences. This way, each word
gets its meaning from the words around it, not from some pre-set definition. This helps BERT
understand language in a more flexible and context-dependent way.
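A minimal sketch of this guessing game, assuming the Hugging Face transformers library is installed; the fill-mask pipeline call and the example sentence are assumptions of this sketch, not something named in the slides.

```python
# Sketch: BERT's masked-word "guessing game" via the Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT never sees the hidden word; it must infer it from the surrounding context alone.
for prediction in fill_mask("She went to the [MASK] to deposit money."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```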
Self-attention mechanisms
• BERT also relies on a self-attention mechanism that captures and understands relationships among
words in a sentence. The bidirectional transformers at the center of BERT's design make this
possible. This is significant because often, a word may change meaning as a sentence develops.
Each word added augments the overall meaning of the word the NLP algorithm is focusing on. The
more words that are present in each sentence or phrase, the more ambiguous the word in focus
becomes. BERT accounts for the augmented meaning by reading bidirectionally, accounting for the
effect of all other words in a sentence on the focus word and eliminating the left-to-right
momentum that biases words towards a certain meaning as a sentence progresses.
Self-attention mechanisms
• Think of BERT like a detective trying to understand a story. It uses a special tool called self-attention
to figure out how all the words in a sentence relate to each other. This helps BERT understand how
the meaning of a word might change as the sentence goes on.

• The cool thing about BERT is that it doesn't just look at words one after another. It looks at all the
words in the sentence at the same time, kind of like how you might scan a whole page of a book.
This helps it understand the connections between words better.

• For example, if you have a sentence like "She went to the bank to deposit money," the word "bank"
could mean a riverbank or a place where you put money. BERT looks at all the words around "bank"
to figure out which meaning makes sense.

• By reading both forwards and backwards in the sentence, BERT can catch these changes in meaning
as the sentence unfolds. This helps it avoid getting stuck on just one meaning of a word and makes
it better at understanding the whole story.
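For readers who want to see the mechanism itself, below is a small sketch of scaled dot-product self-attention in PyTorch. The dimensions and random weights are toy values; real BERT uses many layers, multiple attention heads and learned 768-dimensional representations.

```python
# Sketch: scaled dot-product self-attention over a toy "sentence" of 5 token vectors.
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8
x = torch.randn(seq_len, d_model)            # stand-in for token embeddings

# Learned projections turn the same tokens into queries, keys and values ("self"-attention).
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token attends to every other token at once -- no left-to-right ordering.
scores = Q @ K.T / (d_model ** 0.5)          # similarity of every token pair
weights = F.softmax(scores, dim=-1)          # each row sums to 1: how much a token "looks at" the others
contextual = weights @ V                     # each output mixes information from the whole sentence

print(weights.shape, contextual.shape)       # (5, 5) attention map, (5, 8) contextual vectors
```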
Next sentence prediction
• NSP is a training technique that teaches BERT to predict whether a certain sentence follows a
previous sentence to test its knowledge of relationships between sentences.
• Specifically, BERT is given both sentence pairs that are correctly paired and pairs that are wrongly
paired so it gets better at understanding the difference.
• Over time, BERT gets better at predicting next sentences accurately.
Next sentence prediction
• NSP involves giving BERT two sentences, sentence 1 and sentence 2. Then, BERT is asked the question: "HEY BERT, DOES SENTENCE 2 COME AFTER SENTENCE 1?" --- and BERT replies with isNextSentence or NotNextSentence.

Consider the following three sentences below:


1. Tony drove home after playing football in front of his friend’s house for three hours.
2. In the Milky Way galaxy, there are eight planets, and Earth is neither the smallest nor the largest.
3. Once home, Tony ate the remaining food he had left in the fridge and fell asleep on the floor.

• Which of the sentences would you say followed the other logically? 2 after 1? Probably not. These
are the questions that BERT is supposed to answer.
• Sentence 3 follows sentence 1 because of the contextual follow-up between the two. An easy giveaway is that both sentences mention "Tony".
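A sketch of this exchange using the Hugging Face BertForNextSentencePrediction head, assuming transformers and PyTorch are installed; it follows the library's convention that class 0 means "is the next sentence".

```python
# Sketch: asking BERT the NSP question for the slide's sentence pairs.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_1 = "Tony drove home after playing football in front of his friend's house for three hours."
sentence_2 = "In the Milky Way galaxy, there are eight planets, and Earth is neither the smallest nor the largest."
sentence_3 = "Once home, Tony ate the remaining food he had left in the fridge and fell asleep on the floor."

for follow_up in (sentence_2, sentence_3):
    inputs = tokenizer(sentence_1, follow_up, return_tensors="pt")
    logits = model(**inputs).logits
    is_next = logits.argmax(dim=-1).item() == 0   # class 0 = isNextSentence in this convention
    print("isNextSentence" if is_next else "NotNextSentence", "-", follow_up[:40] + "...")
```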
What is BERT used for?
Sequence-to-sequence language generation tasks such as:

• Question answering (see the sketch below).
• Abstractive summarization.
• Sentence prediction.
• Conversational response generation.

NLU tasks such as:

• Polysemy and coreference resolution.


• Word sense disambiguation.
• Sentiment classification.
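As an illustration of the question-answering use case flagged above, here is a sketch using the Hugging Face question-answering pipeline with a published BERT checkpoint fine-tuned on SQuAD; the checkpoint name and the passage are assumptions of this example.

```python
# Sketch: extractive question answering with a BERT model fine-tuned on SQuAD.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

context = ("BERT stands for Bidirectional Encoder Representations from Transformers. "
           "Google released it in 2018 after pretraining it on large amounts of English text.")
result = qa(question="What does BERT stand for?", context=context)
print(result["answer"], f"(score={result['score']:.2f})")
```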
Polysemy and coreference resolution
Word sense disambiguation
Clearly the word bank in sentence S1 refers to a
sloping land near a water body and bank in S2
refers to a financial institution. This is an example
of lexical ambiguity that arises in linguistics due to
different interpretations of meanings of a word.
While this task of disambiguation of a polysemous
word seems pretty obvious for humans, it turns out
that it is not so for machines and algorithms. In
NLP, we formally call this a problem of Word Sense
Disambiguation (WSD) and BERT addresses this
issues well.
BERT vs. generative pre-trained transformers (GPT)
• While BERT and GPT models are among the best language models, they exist for different reasons. The initial GPT-3 model and OpenAI's subsequent, more advanced GPT models are also language models trained on massive data sets. While they share this in common with BERT, BERT differs in multiple ways.
BERT
• Google developed BERT to serve as a bidirectional transformer model that examines words within
text by considering both left-to-right and right-to-left contexts. It helps computer systems
understand text, as opposed to generating text, which is what GPT models are built to do. BERT excels at NLU tasks such as sentiment analysis, which makes it well suited to applications like Google Search and customer-feedback analysis.
GPT
• GPT models differ from BERT in both their objectives and their use cases. GPT models are forms of
generative AI that generate original text and other forms of content. They're also well-suited for
summarizing long pieces of text and text that's hard to interpret.
Thank You!!!
