
Understanding BERT: Bidirectional Encoder Representations from Transformers

What is BERT?

BERT is a pre-trained language representation model developed by Google AI in 2018. It set a new state of the art across a wide range of NLP tasks, such as question answering, sentiment analysis, and natural language inference.

The key innovation? Deep bidirectional understanding of language context.

Unlike previous models that read text left-to-right or right-to-left, BERT reads in both directions simultaneously, giving it a richer understanding of how words relate to one another.

Architecture of BERT

BERT is built on the Transformer architecture, specifically the encoder part of the original Transformer model.

Two Common BERT Variants:

 BERT-Base: 12 layers, 768 hidden units, 12 attention heads (~110M parameters)

 BERT-Large: 24 layers, 1024 hidden units, 16 attention heads (~340M parameters)

The architecture is a stack of transformer encoders, each containing:

 Multi-head self-attention mechanism

 Feed-forward neural network

 Layer normalization & residual connections

How is BERT Pre-trained?

BERT is pre-trained on large corpora (such as English Wikipedia and BookCorpus) using two tasks:

1. Masked Language Modeling (MLM)

 Randomly masks 15% of the input tokens.

 The model learns to predict the masked words based on the context.

Example:

Input:
The man went to the [MASK] to buy milk.
Output:
store

This forces BERT to learn bidirectional context since it needs both left and
right words to fill in the blank.
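
The masked-word example above can be reproduced in a few lines. This is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

# Ask BERT to fill in the [MASK] token (assumes: pip install transformers torch)
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The man went to the [MASK] to buy milk.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
# The top prediction is typically "store", matching the example above.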

2. Next Sentence Prediction (NSP)

 BERT is trained to understand sentence relationships.

 Given two sentences A and B, the model predicts whether B follows A in the original text.

Example:

 A: "She opened the door."

 B: "She saw a package on the ground."


Label: IsNext

This enables BERT to perform well on tasks like question answering and
natural language inference.
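
The IsNext / NotNext decision can be sketched with the pre-trained NSP head. This is a minimal example, assuming the Hugging Face transformers library, PyTorch, and the bert-base-uncased checkpoint:

# Predict whether sentence B follows sentence A (assumes: pip install transformers torch)
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

encoding = tokenizer("She opened the door.", "She saw a package on the ground.", return_tensors="pt")
logits = model(**encoding).logits
# In this head, index 0 means "IsNext" and index 1 means "NotNext".
print("IsNext" if logits.argmax(dim=-1).item() == 0 else "NotNext")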

Fine-Tuning BERT for Downstream Tasks

After pre-training, BERT can be fine-tuned for specific tasks by adding a small layer on top of the base model and training the entire system end-to-end.

Common Tasks:

 Sentiment Analysis: Add a classification layer

 Named Entity Recognition (NER): Add a token-level classifier

 Question Answering: Predict start and end positions of answer span

 Text Classification: Add a dense layer on the [CLS] token output


Fine-tuning typically requires less data and compute since the base model
already understands general language.
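
As an illustration of the fine-tuning recipe above, the sketch below adds a two-label classification head for sentiment analysis. It assumes the Hugging Face transformers library and PyTorch; the tiny two-example batch, its labels, and the learning rate are purely illustrative:

# One fine-tuning step for sentiment analysis (assumes: pip install transformers torch)
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Adds a randomly initialized classification layer on top of the [CLS] output.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["I loved this movie.", "Terrible service."], padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (illustrative labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss  # cross-entropy over the two labels
loss.backward()
optimizer.step()
# In practice this step runs over a real labelled dataset for a few epochs.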

Input Format in BERT

Input text is tokenized with the WordPiece tokenizer and then converted into a specific format:

[CLS] Sentence A [SEP] Sentence B [SEP]

Where:

 [CLS] = Classification token (used for final output)

 [SEP] = Separator token between two sentences

 Segment Embeddings = Indicate which tokens belong to which sentence

 Position Embeddings = Track the order of words
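
This layout is exactly what the tokenizer produces for a sentence pair. A minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

# Show the [CLS] / [SEP] layout and the segment IDs (assumes: pip install transformers)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("She opened the door.", "She saw a package on the ground.")

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # [CLS] ... [SEP] ... [SEP] layout
print(encoding["token_type_ids"])  # segment embeddings: 0 for sentence A tokens, 1 for sentence B tokens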

Why is BERT So Powerful?

 Bidirectional attention → Captures deep semantic context

 Pre-training + Fine-tuning paradigm → Highly transferable

 Scalable → Works across a wide range of NLP tasks with minimal changes

 Open-source & pre-trained models available → Accessible to developers and researchers

Limitations of BERT

While BERT is revolutionary, it's not perfect:

 Computationally expensive → Large models require significant resources

 Input size limited to 512 tokens (see the truncation sketch after this list)

 Not well-suited for real-time or edge applications without pruning or distillation (e.g., DistilBERT)
 Struggles with long-range dependencies in very long texts
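
The 512-token limit comes from BERT's learned position embeddings, so longer inputs must be truncated or split into chunks. A minimal truncation sketch, assuming the Hugging Face transformers library; the repeated sentence is only a stand-in for a long document:

# Truncate over-long input to BERT's 512-token maximum (assumes: pip install transformers)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
long_text = "BERT reads text in both directions. " * 200  # illustrative text, well over 512 tokens

encoding = tokenizer(long_text, truncation=True, max_length=512)
print(len(encoding["input_ids"]))  # 512: tokens beyond the limit are simply dropped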

Popular BERT Variants

 RoBERTa: Robustly optimized BERT (no NSP, more data)

 DistilBERT: Smaller, faster, distilled version of BERT

 ALBERT: A Lite BERT with parameter sharing

 BioBERT / SciBERT: Domain-specific BERT for biomedical and scientific texts

 BERTweet: A BERT-style model pre-trained on English tweets
