BERT (Bidirectional Encoder Representations from Transformers) is an open-source machine learning framework for natural language processing (NLP). It was developed in 2018 by researchers at Google AI Language.
This article explores the architecture, working, and applications of BERT.
What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) leverages a transformer-based neural network to understand language in context. BERT employs an encoder-only architecture: the original Transformer architecture has both encoder and decoder modules, but BERT keeps only the encoder, reflecting a primary emphasis on understanding input sequences rather than generating output sequences.
Bidirectional Approach of BERT
Traditional language models process text sequentially, either from left to right or from right to left. This limits the model's awareness to the context on only one side of the target word. BERT instead uses a bidirectional approach, considering both the left and right context of each word: rather than analyzing the text sequentially, BERT looks at all the words in a sentence simultaneously.
Example: "The bank is situated on the _______ of the river."
In a unidirectional model, the understanding of the blank would heavily depend on the preceding words, and the model might struggle to discern whether "bank" refers to a financial institution or the side of the river.
BERT, being bidirectional, simultaneously considers both the left ("The bank is situated on the") and right context ("of the river"), enabling a more nuanced understanding. It comprehends that the missing word is likely related to the geographical location of the bank, demonstrating the contextual richness that the bidirectional approach brings.
Pre-training and Fine-tuning BERT Model
The BERT model undergoes a two-step process:
- Pre-training on Large amounts of unlabeled text to learn contextual embeddings.
- Fine-tuning on labeled data for specific NLP tasks.
Pre-Training on Large Data
- BERT is pre-trained on large amounts of unlabeled text data. The model learns contextual embeddings, which are representations of words that take into account their surrounding context in a sentence.
- BERT engages in unsupervised pre-training tasks: it learns to predict missing words in a sentence (the Masked Language Model, or MLM, task) and to predict whether one sentence follows another in the original text (Next Sentence Prediction).
Fine-Tuning on Labeled Data
- After the pre-training phase, the BERT model, armed with its contextual embeddings, is then fine-tuned for specific natural language processing (NLP) tasks. This step tailors the model to more targeted applications by adapting its general language understanding to the nuances of the particular task.
- BERT is fine-tuned using labeled data specific to the downstream tasks of interest. These tasks could include sentiment analysis, question-answering, named entity recognition, or any other NLP application. The model's parameters are adjusted to optimize its performance for the particular requirements of the task at hand.
BERT's unified architecture allows it to adapt to various downstream tasks with minimal modifications, making it a versatile and highly effective tool in natural language understanding and processing.
How does BERT work?
Because BERT is designed to build a language representation model rather than to generate text, only the encoder mechanism is used. A sequence of tokens is fed to the Transformer encoder. These tokens are first embedded into vectors and then processed through the stack of encoder layers. The output is a sequence of vectors, one per input token, each providing a contextualized representation of that token.
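As a rough illustration of this flow, the sketch below uses the Hugging Face transformers library to feed a sentence through the BERT encoder and print the shape of the contextualized output, one vector per input token. The bert-base-uncased checkpoint here is only an illustrative choice.
Python
import torch
from transformers import BertTokenizer, BertModel

# Load the tokenizer and the encoder-only BERT model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the stack of encoder layers.
inputs = tokenizer("The bank is situated on the side of the river.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector of hidden size 768 per token (including [CLS] and [SEP]).
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 13, 768])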
When training language models, defining a prediction goal is a challenge. Many models predict the next word in a sequence, which is a directional approach and may limit context learning.
BERT addresses this challenge with two innovative training strategies:
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
1. Masked Language Model (MLM)
In BERT's pre-training process, a portion of words in each input sequence is masked and the model is trained to predict the original values of these masked words based on the context provided by the surrounding words.
In simple terms,
- Masking words: Before BERT learns from sentences, it hides some words (about 15%) and replaces them with a special symbol, like [MASK].
- Guessing Hidden Words: BERT's job is to figure out what these hidden words are by looking at the words around them. It's like a game of guessing where some words are missing, and BERT tries to fill in the blanks.
- How BERT learns:
- BERT adds a special layer on top of its learning system to make these guesses. It then checks how close its guesses are to the actual hidden words.
- It does this by converting its guesses into probabilities, saying, "I think this word is X, and I'm this much sure about it."
- Special attention to hidden words:
- BERT's main focus during training is on getting these hidden words right. It cares less about predicting the words that are not hidden.
- This is because the real challenge is figuring out the missing parts, and this strategy helps BERT become really good at understanding the meaning and context of words.
In technical terms,
- BERT adds a classification layer on top of the output from the encoder. This layer is crucial for predicting the masked words.
- The output vectors from the classification layer are multiplied by the embedding matrix, transforming them into the vocabulary dimension. This step helps align the predicted representations with the vocabulary space.
- The probability of each word in the vocabulary is calculated using the SoftMax activation function. This step generates a probability distribution over the entire vocabulary for each masked position.
- The loss function used during training considers only the prediction of the masked values. The model is penalized for the deviation between its predictions and the actual values of the masked words.
- The model converges slower than directional models. This is because, during training, BERT is only concerned with predicting the masked values, ignoring the prediction of the non-masked words. The increased context awareness achieved through this strategy compensates for the slower convergence.
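To see masked-word prediction in action, here is a minimal sketch using the Hugging Face transformers library. BertForMaskedLM attaches the MLM prediction head described above; the bert-base-uncased checkpoint and the example sentence are illustrative choices.
Python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hide one word with [MASK] and let the MLM head rank candidate words for it.
text = "The bank is situated on the [MASK] of the river."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Find the position of the [MASK] token and take the highest-scoring word.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_position].argmax(dim=-1)
print(tokenizer.decode([predicted_id.item()]))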
2. Next Sentence Prediction (NSP)
- In the training process, BERT learns to understand the relationship between pairs of sentences, predicting if the second sentence follows the first in the original document.
- 50% of the input pairs have the second sentence as the subsequent sentence in the original document, and the other 50% have a randomly chosen sentence.
- To help the model distinguish between connected and disconnected sentence pairs, the input is processed before entering the model:
- A [CLS] token is inserted at the beginning of the first sentence, and a [SEP] token is added at the end of each sentence.
- A sentence embedding indicating Sentence A or Sentence B is added to each token.
- A positional embedding indicates the position of each token in the sequence.
- BERT predicts if the second sentence is connected to the first. This is done by transforming the output of the [CLS] token into a 2×1 shaped vector using a classification layer, and then calculating the probability of whether the second sentence follows the first using SoftMax.
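This setup can be sketched with the transformers library. BertForNextSentencePrediction adds the two-way classification head over the [CLS] output described above; the bert-base-uncased checkpoint and the sentence pair below are illustrative assumptions.
Python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The bank is situated on the side of the river."
sentence_b = "People often fish there in the summer."

# Passing two sentences adds the [CLS]/[SEP] tokens and the A/B segment embeddings.
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 2)

# Index 0 = "sentence B follows sentence A", index 1 = "sentence B is random".
print(torch.softmax(logits, dim=-1))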
During the training of BERT model, the Masked LM and Next Sentence Prediction are trained together. The model aims to minimize the combined loss function of the Masked LM and Next Sentence Prediction, leading to a robust language model with enhanced capabilities in understanding context within sentences and relationships between sentences.
Why train Masked LM and Next Sentence Prediction together?
Masked LM helps BERT to understand the context within a sentence and Next Sentence Prediction helps BERT grasp the connection or relationship between pairs of sentences. Hence, training both the strategies together ensures that BERT learns a broad and comprehensive understanding of language, capturing both details within sentences and the flow between sentences.
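For completeness, the transformers library also exposes both pre-training heads together through BertForPreTraining. The sketch below (an illustrative use, not the original training code) shows the two outputs that feed the combined loss.
Python
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# A sentence pair with one masked word exercises both heads at once.
inputs = tokenizer("The bank is situated on the [MASK] of the river.",
                   "People often fish there.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # MLM head: (1, seq_len, vocab_size)
print(outputs.seq_relationship_logits.shape)  # NSP head: (1, 2)
# If MLM labels and a next-sentence label are supplied, the returned loss is
# the sum of the two objectives, which is what BERT minimizes during pre-training.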
BERT Architecture
The architecture of BERT is a multilayer bidirectional Transformer encoder, which is quite similar to the original Transformer. The full Transformer is an encoder-decoder network that uses self-attention on the encoder side and both self-attention and encoder-decoder attention on the decoder side; BERT keeps only the encoder stack.
- BERTBASE has 12 layers in the encoder stack, while BERTLARGE has 24. Both are deeper than the Transformer described in the original paper, which has 6 encoder layers.
- The BERT architectures (BASE and LARGE) also use larger hidden sizes (768 and 1024 respectively) and more attention heads (12 and 16 respectively) than the original Transformer, which uses a hidden size of 512 and 8 attention heads.
- BERTBASE contains 110M parameters while BERTLARGE has 340M parameters.
BERT BASE and BERT LARGE architecture.
The model first takes the [CLS] token as input, followed by the sequence of word tokens. Here [CLS] is a special classification token. The input then passes through the encoder layers: each layer applies self-attention, passes the result through a feedforward network, and hands it off to the next encoder. The model outputs a vector of hidden size (768 for BERTBASE) for each token. To build a classifier from this model, we can take the output corresponding to the [CLS] token.
BERT output as Embeddings
Now, this trained vector can be used to perform a number of tasks such as classification, translation, etc. For example, the BERT paper achieves strong results on classification tasks just by adding a single-layer neural network on top of the BERT model.
How to use the BERT model in NLP?
BERT can be used for various natural language processing (NLP) tasks such as:
1. Classification Task
- BERT can be used for classification tasks like sentiment analysis, where the goal is to classify the text into categories (positive/negative/neutral). This is done by adding a classification layer on top of the Transformer output for the [CLS] token.
- The [CLS] token represents the aggregated information from the entire input sequence. This pooled representation can then be used as input for a classification layer to make predictions for the specific task, as in the sketch below.
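A minimal sketch of this setup with the transformers library follows. BertForSequenceClassification places a classification layer over the [CLS] representation; the three-class sentiment head here is a hypothetical example and is randomly initialized, so it would need fine-tuning on labeled data before its predictions mean anything.
Python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Hypothetical 3-class sentiment head (positive / negative / neutral) over [CLS];
# this layer is randomly initialized and must be fine-tuned on labeled data.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=3)

inputs = tokenizer("The movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 3)
print(torch.softmax(logits, dim=-1))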
2. Question Answering
- In question answering tasks, where the model is required to locate and mark the answer within a given text sequence, BERT can be trained for this purpose.
- BERT is trained for question answering by learning two additional vectors that mark the beginning and end of the answer. During training, the model is provided with questions and corresponding passages, and it learns to predict the start and end positions of the answer within the passage.
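To make this setup concrete, here is a sketch using BertForQuestionAnswering from the transformers library, which adds the start/end span-prediction vectors described above. With the plain bert-base-uncased checkpoint these heads are untrained, so in practice one would load a checkpoint fine-tuned on a QA dataset such as SQuAD; the example only illustrates the mechanics, and the question/passage pair is made up.
Python
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

# The span-prediction heads of the plain checkpoint are untrained; in practice
# a checkpoint fine-tuned on a QA dataset such as SQuAD would be loaded instead.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Where is the bank situated?"
passage = "The bank is situated on the side of the river."

inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The start and end logits mark the most likely answer span within the passage.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
answer_ids = inputs["input_ids"][0][start:end + 1]
print(tokenizer.decode(answer_ids))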
3. Named Entity Recognition (NER)
- BERT can be utilized for NER, where the goal is to identify and classify entities (e.g., Person, Organization, Date) in a text sequence.
- A BERT-based NER model is trained by taking the output vector of each token from the Transformer and feeding it into a classification layer. The layer predicts the named entity label for each token, indicating the type of entity it represents, as in the sketch below.
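As a rough sketch, BertForTokenClassification from the transformers library wires a per-token classification layer onto the encoder output. The label set, checkpoint, and sentence below are illustrative assumptions, and the token-classification head is untrained until fine-tuning.
Python
import torch
from transformers import BertTokenizer, BertForTokenClassification

# Illustrative BIO-style label set; a real NER model defines its own tag scheme
# and is fine-tuned on labeled entity data before its predictions are meaningful.
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=len(labels))

inputs = tokenizer("Sundar Pichai works at Google.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# One predicted label per token.
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, labels[label_id.item()])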
How to Tokenize and Encode Text using BERT?
To tokenize and encode text using BERT, we will use the 'transformers' library in Python.
Command to install transformers:
!pip install transformers
- We will load the pretrained BERT tokenizer with a cased vocabulary using BertTokenizer.from_pretrained("bert-base-cased").
- tokenizer.encode(text) tokenizes the input text and converts it into a sequence of token IDs.
- print("Token IDs:", encoding) prints the token IDs obtained after encoding.
- tokenizer.convert_ids_to_tokens(encoding) converts the token IDs back to their corresponding tokens.
- print("Tokens:", tokens) prints the tokens obtained after converting the token IDs
Python
from transformers import BertTokenizer
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# Input text
text = 'ChatGPT is a language model developed by OpenAI, based on the GPT (Generative Pre-trained Transformer) architecture. '
# Tokenize and encode the text
encoding = tokenizer.encode(text)
# Print the token IDs
print("Token IDs:", encoding)
# Convert token IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(encoding)
# Print the corresponding tokens
print("Tokens:", tokens)
Output:
Token IDs: [101, 24705, 1204, 17095, 1942, 1110, 170, 1846, 2235, 1872, 1118, 3353, 1592, 2240, 117, 1359, 1113, 1103, 15175, 1942, 113, 9066, 15306, 11689, 118, 3972, 13809, 23763, 114, 4220, 119, 102]
Tokens: ['[CLS]', 'Cha', '##t', '##GP', '##T', 'is', 'a', 'language', 'model', 'developed', 'by', 'Open', '##A', '##I', ',', 'based', 'on', 'the', 'GP', '##T', '(', 'Gene', '##rative', 'Pre', '-', 'trained', 'Trans', '##former', ')', 'architecture', '.', '[SEP]']
The tokenizer.encode method adds the special [CLS] - classification and [SEP] - separator tokens at the beginning and end of the encoded sequence. In the token IDs section, token id: 101 refers to the start of the sentence and token id: 102 represents the end of the sentence.
Applications of BERT
BERT is used for:
- Text Representation: BERT is used to generate contextual word embeddings, i.e., representations of words in a sentence.
- Named Entity Recognition (NER): BERT can be fine-tuned for named entity recognition tasks, where the goal is to identify entities such as names of people, organizations, locations, etc., in a given text.
- Text Classification: BERT is widely used for text classification tasks, including sentiment analysis, spam detection, and topic categorization. It has demonstrated excellent performance in understanding and classifying the context of textual data.
- Question-Answering Systems: BERT has been applied to question-answering systems, where the model is trained to understand the context of a question and provide relevant answers. This is particularly useful for tasks like reading comprehension.
- Machine Translation: BERT's contextual embeddings can be leveraged for improving machine translation systems. The model captures the nuances of language that are crucial for accurate translation.
- Text Summarization: BERT can be used for abstractive text summarization, where the model generates concise and meaningful summaries of longer texts by understanding the context and semantics.
- Conversational AI: BERT is employed in building conversational AI systems, such as chatbots, virtual assistants, and dialogue systems. Its ability to grasp context makes it effective for understanding and generating natural language responses.
- Semantic Similarity: BERT embeddings can be used to measure semantic similarity between sentences or documents. This is valuable in tasks like duplicate detection, paraphrase identification, and information retrieval.
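As a small illustration of the semantic-similarity use case above, the sketch below mean-pools BERT's token embeddings for two sentences and compares them with cosine similarity. This is a rough approach (dedicated sentence-embedding models typically work better), and the sentences and checkpoint are illustrative.
Python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    # Mean-pool the contextualized token vectors into one sentence vector.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1)

a = embed("A man is playing a guitar.")
b = embed("Someone is performing music on a guitar.")
print(torch.nn.functional.cosine_similarity(a, b).item())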
BERT vs GPT
The differences between BERT and GPT are as follows:
| Feature | BERT | GPT |
|---|---|---|
| Architecture | BERT is designed for bidirectional representation learning. It uses a masked language model objective, where it predicts missing words in a sentence based on both left and right context. | GPT is designed for generative language modeling. It predicts the next word in a sentence given the preceding context, using a unidirectional autoregressive approach. |
| Pre-training objectives | BERT is pre-trained using a masked language model objective and next sentence prediction. It focuses on capturing bidirectional context and understanding relationships between words in a sentence. | GPT is pre-trained to predict the next word in a sentence, which encourages the model to learn a coherent representation of language and generate contextually relevant sequences. |
| Context understanding | BERT is effective for tasks that require a deep understanding of context and relationships within a sentence, such as text classification, named entity recognition, and question answering. | GPT is strong at generating coherent and contextually relevant text. It is often used in creative tasks, dialogue systems, and tasks requiring the generation of natural language sequences. |
| Task types and use cases | Commonly used in tasks like text classification, named entity recognition, sentiment analysis, and question answering. | Applied to tasks such as text generation, dialogue systems, summarization, and creative writing. |
| Fine-tuning vs few-shot learning | BERT is typically fine-tuned on specific downstream tasks with labeled data to adapt its pre-trained representations to the task at hand. | GPT is designed to perform few-shot learning, where it can generalize to new tasks with minimal task-specific training data. |