Unit 2
Transfer learning is a foundational technique in natural language processing (NLP) tasks. It involves leveraging pre-trained models that have been trained on large datasets to serve as
the starting point for a wide range of downstream tasks. The process of fine-tuning these pre-trained models on task-
specific data allows users to benefit from the knowledge encoded in the original model while adapting it to new, more
specialized objectives.
The Transformers library, developed by Hugging Face, provides numerous pre-trained models like BERT, GPT, and T5,
which are widely used for tasks such as text classification, translation, summarization, and question answering. These
models are typically trained on massive corpora, such as Wikipedia or the Common Crawl, allowing them to capture deep
contextual understanding of language. Transfer learning in this context typically works in two stages:
Pre-training: A model is trained on large-scale, general-purpose language data with objectives like masked language
modeling (in BERT) or next-word prediction (in GPT). This allows the model to learn universal language patterns,
relationships, and syntactic/semantic structures.
Fine-tuning: After pre-training, the model is further trained (fine-tuned) on a task-specific, often much smaller dataset. This
adapts the general language understanding to the particular task (e.g., sentiment analysis, entity recognition) without
requiring vast amounts of task-specific data.
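To make the fine-tuning stage concrete, here is a minimal sketch using the Hugging Face Trainer API. The checkpoint ("bert-base-uncased"), the dataset ("imdb"), and the hyperparameters are illustrative assumptions rather than a prescribed recipe.

# Minimal fine-tuning sketch with the Hugging Face Transformers Trainer API
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"                     # illustrative pre-trained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize a small slice of a labeled downstream dataset (sentiment analysis here)
raw = load_dataset("imdb", split="train[:2000]")
encoded = raw.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                       padding="max_length", max_length=128),
                  batched=True)

args = TrainingArguments(output_dir="finetuned-bert", num_train_epochs=1,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=encoded)
trainer.train()   # adapts the pre-trained weights to the task-specific data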
Benefits of Transfer Learning in Transformers
Efficiency in Data Usage: Fine-tuning a pre-trained model often requires only a small dataset. This is because the pre-
trained model already captures a deep understanding of the language, reducing the need for extensive labeled task-
specific data.
Reduced Computational Costs: Pre-training large language models is computationally expensive and time-
consuming. However, with transfer learning, users can leverage these pre-trained models and focus only on fine-tuning,
which is more computationally feasible.
Improved Performance: Models pre-trained on large corpora often yield better performance than models trained from
scratch, especially on tasks where the labeled dataset is small. They provide a strong baseline due to their extensive
language understanding.
Generalization Across Tasks: Pre-trained models can generalize across a wide range of tasks. Once fine-tuned for a
specific task, they can adapt quickly to others with minimal effort due to the learned language representations.
Access to State-of-the-Art Models: The Transformers library provides access to top-performing models (BERT, GPT,
RoBERTa, etc.), giving practitioners the ability to achieve cutting-edge results without building models from scratch.
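As a quick illustration of that last point, the pipeline API exposes such pre-trained models in a few lines. This is a usage sketch only; the default checkpoint that "sentiment-analysis" downloads from the Hugging Face Hub may vary between library versions.

from transformers import pipeline

# Load a ready-made sentiment-analysis pipeline backed by a pre-trained model
classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning makes NLP development much faster."))
# e.g. [{'label': 'POSITIVE', 'score': ...}]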
import spacy

# Load a pre-trained English pipeline and process a sample sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Tokenization: print each token's text, lemma, part-of-speech tag, and dependency label
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# Dependency Parsing: show each token's syntactic head and the relation between them
for token in doc:
    print(f'{token.text} --> {token.head.text} ({token.dep_})')
Healthcare:
Application: Extracting diseases, medications, symptoms, and medical procedures from unstructured text such as research papers or clinical notes.
Customization: Train a specialized Named Entity Recognition (NER) model to recognize medical entities like drug names, anatomical terms, or clinical abbreviations.
Legal Industry:
Application: Parsing legal documents, contracts, and regulations to extract parties, clauses, dates, and obligations.
Customization: Custom NER for legal entities such as court names, statute references, case citations, and legal terms.
Financial Services:
Application: Processing financial reports, news, and regulatory filings to extract information about companies, stock prices, transactions, and risk indicators.
Customization: Add custom components to identify financial terms, company names, stock tickers, and transaction types.
E-commerce:
Application: Analyzing customer reviews, product descriptions, and advertisements to extract product features, customer sentiment, and competitor information.
Customization: Fine-tune NER models to detect product names, features, and brands.
Scientific Research:
Application: Extracting chemical compounds, biological entities, or gene names from research articles or patent filings.
Customization: Create custom pipelines for recognizing scientific terms, formulas, or references to experiments and results (a minimal customization sketch follows below).
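As a concrete illustration of such customization, the following sketch adds spaCy's rule-based entity_ruler component on top of a pre-trained pipeline. The labels (DRUG, SYMPTOM) and the patterns are invented for this example; a production system would typically train a statistical NER component on annotated domain data instead.

import spacy

nlp = spacy.load("en_core_web_sm")

# Add a rule-based EntityRuler before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Hypothetical domain patterns (here: a medical use case)
patterns = [
    {"label": "DRUG", "pattern": "ibuprofen"},
    {"label": "DRUG", "pattern": [{"LOWER": "acetylsalicylic"}, {"LOWER": "acid"}]},
    {"label": "SYMPTOM", "pattern": "headache"},
]
ruler.add_patterns(patterns)

doc = nlp("The patient took ibuprofen after reporting a severe headache.")
for ent in doc.ents:
    print(ent.text, ent.label_)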
# Gradient accumulation: update weights only every `accumulation_steps` mini-batches
for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss / accumulation_steps   # scale so accumulated gradients average out
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
Use Early Stopping:
Implement early stopping to prevent overfitting, which can set in after only a few epochs of fine-tuning. Track the
validation loss and stop training if it doesn't improve for a few consecutive epochs (a minimal sketch follows below).
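The sketch below assumes a training setup with model, optimizer, the dataloaders, and num_epochs already defined; train_one_epoch and evaluate are hypothetical helper functions standing in for a project's own training and validation loops, and the patience value is an arbitrary choice.

import torch

best_val_loss = float("inf")
patience = 3                        # stop after this many epochs without improvement
epochs_without_improvement = 0

for epoch in range(num_epochs):
    train_one_epoch(model, train_dataloader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_dataloader)            # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")   # keep the best checkpoint
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}: no improvement for {patience} epochs")
            break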
For each token (or word) in the input sequence, the self-attention mechanism performs the following steps:
Input Representation:
Each input token is represented as an embedding vector (typically a learned representation), which captures the
token's semantic meaning. These embeddings are combined with positional encodings to retain positional
information.
For each token, the Transformer computes three vectors: Query (Q), Key (K), and Value (V). These vectors are
derived by multiplying the input embedding with learned weight matrices:
Q = XW_Q, K = XW_K, V = XW_V
where W_Q, W_K, and W_V are learned weight matrices for queries, keys, and values, respectively.
The Query represents the token we are focusing on, the Key represents the tokens we are comparing it against, and
the Value contains the token's actual information.
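The sketch below works through these projections on toy tensors and then completes the standard scaled dot-product attention step, softmax(QK^T / sqrt(d_k)) V. The dimensions and the randomly initialized weights are purely illustrative.

import torch
import torch.nn.functional as F

# Toy dimensions: a sequence of 4 tokens, embedding size 8
seq_len, d_model = 4, 8
X = torch.randn(seq_len, d_model)      # token embeddings (plus positional encodings)

# Learned projection matrices W_Q, W_K, W_V (randomly initialized here)
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # project embeddings into queries, keys, values

# Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
scores = Q @ K.T / (d_model ** 0.5)    # similarity of each query to every key
weights = F.softmax(scores, dim=-1)    # attention weights sum to 1 over the keys
output = weights @ V                   # weighted sum of value vectors
print(output.shape)                    # torch.Size([4, 8])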
During training, the student model is optimized using a distillation loss function, which
typically combines two terms:
Cross-entropy loss with the hard labels (the ground-truth targets).
Cross-entropy loss with soft labels (from the teacher model's soft predictions).
The two terms are combined as a weighted sum, L = α · L_hard + (1 - α) · L_soft, where α is a hyperparameter that controls the weight given to the hard and soft labels.
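A minimal sketch of such a combined loss in PyTorch is given below. The function name, the default α, and the temperature value are illustrative choices for a generic knowledge-distillation loss, not the exact objective used by any particular distilled model.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Weighted sum of hard-label cross-entropy and soft-label (teacher) loss."""
    # Hard-label term: standard cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-softened distributions
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

    return alpha * hard_loss + (1 - alpha) * soft_loss

# Example with random logits for a batch of 4 examples and 3 classes
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student, teacher, labels))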
1. Architecture Comparison
BERT:
Layers: BERT has two main versions: BERT-Base (12 layers) and BERT-Large (24 layers), where each layer
is a Transformer encoder.
Hidden Units: In BERT-Base, each layer has 768 hidden units, while BERT-Large has 1024 hidden units.
Attention Heads: BERT-Base has 12 attention heads, and BERT-Large has 16 attention heads.
Parameters:
BERT-Base has around 110 million parameters.
BERT-Large has around 340 million parameters.
DistilBERT:
Layers: DistilBERT reduces the number of layers to 6, exactly half the number of BERT-Base.
Hidden Units: DistilBERT retains the same number of hidden units as BERT-Base (768 units).
Attention Heads: DistilBERT also retains the 12 attention heads from BERT-Base.
Parameters:
DistilBERT has approximately 66 million parameters, about 60% of BERT-Base's size.
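As a quick way to verify these numbers, the short sketch below loads both checkpoints from the Hugging Face Hub and prints their parameter and layer counts. The checkpoint names ("bert-base-uncased", "distilbert-base-uncased") are the standard Hub identifiers, and the exact counts reported can differ slightly depending on which heads are included.

from transformers import AutoModel

for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    # DistilBERT's config exposes the layer count as n_layers; BERT uses num_hidden_layers
    n_layers = getattr(model.config, "num_hidden_layers", None) or model.config.n_layers
    print(f"{name}: {n_params / 1e6:.0f}M parameters, {n_layers} layers")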