Natural Language Processing (NLP) is evolving rapidly, with interviews focusing not just on basics but also on advanced architectures, contextual understanding and real-world applications. Let's prepare for interviews with a few practice questions.
Q1. What is tokenization and what are its types?
Tokenization is the process of splitting text into smaller units called tokens, which can be words, subwords or characters. It is a fundamental step in NLP because most downstream tasks—like embeddings, parsing and classification—require input in a structured, tokenized form.
Types of Tokenization:
- Word-level Tokenization: Splits text into individual words; used for most classic NLP tasks like text classification or POS tagging.
  Example: "NLP is fun" → ["NLP", "is", "fun"]
- Subword Tokenization: Splits words into meaningful subword units using methods like WordPiece or Byte-Pair Encoding (BPE); handles rare/unknown words in language models; used in BERT and GPT.
  Example: "unhappiness" → ["un", "happiness"]
- Character-level Tokenization: Splits text into individual characters; useful for languages with large vocabularies, misspellings or morphological analysis.
  Example: "NLP" → ["N", "L", "P"]
- Sentence-level Tokenization: Splits text into sentences; useful in tasks like summarization or translation.
  Example: "NLP is fun. It is evolving." → ["NLP is fun.", "It is evolving."]
Q2. What is the difference between stemming and lemmatization?
Let's see the difference between stemming and lemmatization,
| Feature | Stemming | Lemmatization |
|---|---|---|
| Definition | Reduces a word to its base or root form by removing suffixes/prefixes | Reduces a word to its dictionary or canonical form using linguistic rules and context |
| Output | Often produces non-words or truncated forms | Produces valid words found in a dictionary |
| Accuracy | Crude approximation; may remove too much | More precise; considers context and part-of-speech |
| Computation | Fast, computationally inexpensive | Slower due to use of dictionaries and POS tagging |
| Example | "studies" → "studi"; "running" → "run" | "better" → "good"; "running" → "run" |
| Use-case | Search engines, text indexing where speed matters | NLP tasks requiring semantic understanding, e.g., sentiment analysis, text summarization |
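A minimal NLTK sketch contrasting the two (assuming the wordnet corpus has been downloaded; note that the lemmatizer needs a part-of-speech hint to map "better" to "good"):
Python
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # 'studi' -- truncated, not a real word
print(stemmer.stem("running"))                   # 'run'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' -- valid dictionary form
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' -- uses linguistic knowledge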
Q3. What is the Out-of-Vocabulary (OOV) problem in NLP?
The OOV problem occurs when a model encounters a word not seen during training, leading to poor representation or prediction failure. This is a major challenge for traditional embedding methods like Word2Vec or GloVe, which assign vectors only to words present in the training corpus.
Example: Model trained on "I love NLP" may fail on "I enjoy NLU" because "NLU" is OOV.
Solutions:
- Subword embeddings: BPE, WordPiece break unknown words into known subwords.
- Character-level embeddings: Represent words via character sequences to handle rare/misspelled words.
- Contextual embeddings: Models like BERT or ELMo generate dynamic embeddings for any input, mitigating OOV issues.
Q4. What is the Bag of Words (BoW) model and what are its limitations?
Bag of Words (BoW) is a feature extraction technique that represents text as a vector of word counts or frequencies, ignoring grammar and word order. It’s simple and widely used in classic NLP pipelines.
Example:
- Sentence 1: "I love NLP" → [I:1, love:1, NLP:1]
- Sentence 2: "NLP love I" → [I:1, love:1, NLP:1]
- Both sentences have identical BoW representations because word order is ignored.
Limitations:
- Ignores word order: Cannot distinguish sentences with the same words but different meanings.
- No semantic understanding: Words like "good" and "excellent" are treated as unrelated.
- High dimensionality: For large vocabularies, the feature vectors can become sparse and memory-intensive.
- Does not handle OOV words: Words unseen during training are ignored.
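A small scikit-learn sketch (assuming scikit-learn is installed) showing that reordered sentences produce identical BoW vectors:
Python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love NLP", "NLP love I"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary (lowercased; single-letter tokens like "I" are dropped by the default token pattern)
print(X.toarray())                         # both rows are identical because word order is ignored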
Q5. What is TF-IDF and how is it used in NLP?
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in NLP to evaluate the importance of a word in a document relative to a collection of documents (corpus).
- Term Frequency (TF): Measures how often a word appears in a document, normalized by the total number of words in that document.
- Inverse Document Frequency (IDF): Measures how rare or informative a word is across all documents, reducing the weight of common words like "the" and increasing the weight of rare, meaningful words.
TF-IDF score is the product of TF and IDF, highlighting words that are important in a specific document but not common across the corpus.
Formula:
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
Applications:
- Feature extraction for classification tasks.
- Keyword extraction.
- Search engines and information retrieval.
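A minimal scikit-learn sketch of TF-IDF feature extraction (toy corpus chosen purely for illustration):
Python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Words shared across documents (e.g., "the", "sat") receive lower weights
# than words concentrated in a single document (e.g., "pets").
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))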
Q6. What are word embeddings and why are they important?
Word embeddings are dense vector representations of words in a continuous space, capturing semantic and syntactic relationships. Unlike one-hot vectors, embeddings encode similarity and contextual relationships between words.
Example: "king" and "queen" have vectors close together, reflecting their semantic similarity, whereas "king" and "apple" are distant.
Word Embedding Techniques:
- Word2Vec: Predicts words from context (CBOW) or context from word (Skip-gram).
- GloVe: Factorizes co-occurrence matrices to learn global statistical relationships.
- FastText: Embeds subwords to handle rare or misspelled words.
Importance:
- Captures semantic similarity beyond simple counts.
- Improves performance in NLP tasks like sentiment analysis, translation, text classification and semantic search.
- Reduces dimensionality while preserving meaning.
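A small gensim sketch of training Word2Vec on a toy corpus (assuming gensim is installed; with so little data the similarities only demonstrate the API, not real semantic quality):
Python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["i", "ate", "an", "apple"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(model.wv.similarity("king", "queen"))  # tends to be higher: shared contexts
print(model.wv.similarity("king", "apple"))  # tends to be lower: different contexts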
Q7. What is the difference between word embeddings and contextual embeddings?
Let's see the difference between word embeddings and contextual embeddings,
| Feature | Word Embeddings | Contextual Embeddings |
|---|---|---|
| Definition | Fixed vector representation for each word | Dynamic vectors that vary depending on context |
| Example | "bank" → same vector regardless of "river bank" or "financial bank" | "bank" → different vectors for "river bank" and "financial bank" |
| Techniques | Word2Vec, GloVe, FastText | BERT, ELMo, GPT |
| Context Awareness | None | Captures surrounding words and semantic meaning |
| Use-case | Basic NLP tasks, semantic similarity | Tasks requiring context understanding (QA, NER, disambiguation) |
Q8. What are the different types of embeddings in NLP?
Let's see the various types of embeddings,
1. Word-level embeddings:
- Represent each word as a fixed vector.
- Examples: Word2Vec, GloVe.
- Use-case: Classic NLP tasks like similarity or classification.
2. Subword embeddings:
- Split words into meaningful sub-units to handle rare or unknown words.
- Examples: FastText, BPE, WordPiece.
- Use-case: Mitigates OOV problems, improves morphological understanding.
3. Character-level embeddings:
- Represent text at character granularity.
- Example: "running" → sequence of character vectors.
- Use-case: Morphologically rich languages, misspellings or rare words.
4. Contextual embeddings:
- Generate dynamic vectors for words based on surrounding text.
- Examples: BERT, ELMo, GPT.
- Use-case: Tasks like QA, NER and semantic disambiguation.
5. Sentence/document embeddings:
- Represent sentences or documents as single vectors.
- Examples: Universal Sentence Encoder, Sentence-BERT.
- Use-case: Semantic similarity, clustering, retrieval.
Different embeddings capture meaning at different granularities—word, subword, character, sentence—depending on the task.
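As an example of sentence-level embeddings, a short sketch with the sentence-transformers library (assuming it is installed and the all-MiniLM-L6-v2 checkpoint can be downloaded):
Python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["I love NLP",
             "Natural language processing is great",
             "I had pizza for lunch"]
embeddings = model.encode(sentences)

# Semantically related sentences score higher than unrelated ones
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))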
Q9. Dense vs. Sparse Embeddings
Let's see the difference between sparse and dense embeddings,
| Feature | Dense Embeddings | Sparse Embeddings |
|---|---|---|
| Vector Characteristics | Low-dimensional, mostly non-zero | High-dimensional, mostly zero |
| Representation | Learned via neural networks capturing semantic meaning | Based on explicit features like TF-IDF or one-hot encoding |
| Dimensionality | Typically 100–1000 dimensions | Thousands to millions of dimensions |
| Interpretability | Less interpretable; dimensions do not correspond directly to features | Highly interpretable; each dimension maps to a specific feature or term |
| Use-case | Semantic search, recommendations, NLP tasks needing contextual understanding | Keyword matching, traditional information retrieval, sparse data scenarios |
| Storage & Efficiency | Compact but computationally intensive | Larger storage, efficiently indexed for exact-match retrieval |
| Strength | Captures subtle contextual and semantic relationships | Efficient and scalable for exact-match retrieval |
| Limitation | Requires large datasets and training overhead | Cannot capture semantic similarity or context well |
Q10. What is the difference between pretrained embeddings and fine-tuning?
Let's see the difference between pretrained embeddings and fine-tuning,
| Feature | Pretrained Embeddings | Fine-tuning |
|---|---|---|
| Definition | Embeddings trained on large general corpora and used as-is | Pretrained embeddings further trained on task-specific data to improve performance |
| Flexibility | General-purpose | Task-specific adaptation |
| Computation | Low (no additional training required) | Higher (requires gradient updates on embeddings) |
| Example | Word2Vec trained on Wikipedia | BERT embeddings fine-tuned for sentiment analysis |
| Use-case | Quickly incorporate semantic knowledge into models | Improve accuracy for specific downstream tasks |
Q11. What are Recurrent Neural Networks (RNNs)?
Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data, where the order of inputs is important. Unlike feedforward networks, RNNs maintain a hidden state that acts as memory, capturing information from previous time steps to influence current predictions.
Mathematical Representation:
h_t = f(W_{xh}x_t + W_{hh}h_{t-1} + b_h)
- x_t = input at time step t
- h_t = hidden state at time step t
- W_{xh}, W_{hh} = input-to-hidden and hidden-to-hidden weight matrices; b_h = bias term
- f = activation function (e.g., tanh, ReLU)
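A minimal NumPy sketch of this update rule with toy dimensions (random inputs stand in for token embeddings):
Python
import numpy as np

input_dim, hidden_dim, seq_len = 4, 3, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                          # initial hidden state h_0
for t in range(seq_len):
    x_t = rng.normal(size=input_dim)              # stand-in for the embedding of token t
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)      # h_t depends on x_t and h_{t-1}
print(h)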
Applications:
- Language modeling and text generation
- Machine translation
- Speech recognition
- Time-series forecasting
Q13. What is the vanishing gradient problem in RNNs?
The vanishing gradient problem occurs when gradients shrink exponentially as they are backpropagated through time steps in an RNN. This prevents the network from learning long-range dependencies effectively.
Example: If an RNN is trying to predict a word based on a sequence of 50 previous words, gradients may become extremely small by the time they reach the first word, leading to ineffective weight updates.
Solutions:
- LSTM (Long Short-Term Memory) networks with memory cells
- GRU (Gated Recurrent Units) with simplified gating
- Gradient clipping to prevent exploding gradients (the complementary issue that also destabilizes training)
The vanishing gradient problem is why RNNs struggle with long sequences, motivating LSTM and GRU architectures.
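A short PyTorch sketch of gradient clipping, one of the mitigations listed above (random tensors stand in for real data):
Python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(4, 50, 8)   # batch of 4 sequences, 50 time steps each
y = torch.randn(4, 1)

out, _ = rnn(x)                                        # out: (batch, time, hidden)
loss = nn.functional.mse_loss(head(out[:, -1, :]), y)
loss.backward()

torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # rescale gradients whose norm exceeds 1.0
optimizer.step()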
Q14. What is the difference between RNN, LSTM and GRU networks?
Let's see the difference between RNN, LSTM, GRU,
| Feature | RNN | LSTM | GRU |
|---|---|---|---|
| Memory Mechanism | Simple hidden state | Separate memory cell and hidden state | Single hidden state combining memory |
| Gates | None | Input, Forget, Output | Update, Reset |
| Ability to handle long sequences | Poor due to vanishing gradients | Excellent due to gating mechanisms | Good, slightly less complex than LSTM |
| Complexity | Low | High | Moderate |
| Computation Cost | Lower | Higher | Lower than LSTM |
| Use-case | Short sequences or simple tasks | Long sequences, language modeling, translation | Medium-length sequences, lightweight tasks |
| Advantage | Simple and fast | Captures long-range dependencies | Efficient and less computationally heavy |
| Limitation | Cannot handle long-term dependencies | Computationally intensive | Slightly less powerful for very long sequences |
Q15. Explain sequence-to-sequence (Seq2Seq) models and their components
Sequence-to-sequence (Seq2Seq) models are neural network architectures designed to transform an input sequence into an output sequence. They are widely used in NLP tasks where the lengths of input and output sequences can vary, such as machine translation, text summarization and speech recognition.
Components:
1. Encoder:
- Processes the input sequence and compresses it into a context vector, capturing the information of the entire sequence.
- An LSTM or GRU reads "I love NLP" and encodes it into a fixed-size vector representation.
2. Decoder:
- Generates the output sequence from the context vector.
- Predicts one token at a time, using the previously generated token as input.
3. Attention Mechanism:
- Allows the decoder to focus on specific parts of the input at each step, improving performance on longer sequences.
Example:
- Input: "I love NLP"
- Output (French): "J'aime le NLP"
Seq2Seq allows mapping variable-length input sequences to variable-length outputs, a crucial aspect in NLP tasks like translation.
Q16. What is an Encoder-Decoder model in NLP?
The Encoder-Decoder architecture is a foundational framework for sequence-to-sequence (Seq2Seq) tasks in NLP. It separates input comprehension and output generation, enabling flexible transformation of variable-length sequences.
Components:
Encoder:
- Processes the input sequence and compresses it into a context vector (dense representation).
- Often implemented using RNNs, LSTMs, GRUs or Transformer encoders.
Decoder:
- Generates the output sequence from the context vector, predicting one token at a time.
- Can use attention mechanisms to dynamically focus on relevant parts of the input.
Attention Layer:
- Enhances the decoder by weighting encoder outputs based on relevance to current decoding step.
Use-cases:
- Machine translation (English → French)
- Text summarization
- Chatbots and dialogue systems
Encoder-Decoder models provide a structured way to handle variable-length sequences, bridging input understanding with output generation.
Q17. Explain the Transformer architecture and its impact on NLP
Transformers process sequences in parallel using self-attention, instead of sequentially like RNNs. Self-attention weighs the importance of every word with respect to others, capturing long-range dependencies effectively.
Key Components:
- Self-Attention: Computes relationships between all words in a sequence simultaneously.
- Multi-Head Attention: Allows the model to focus on multiple aspects of context concurrently.
- Positional Encoding: Adds information about word order since attention itself is order-agnostic.
- Feed-Forward Layers: Non-linear transformations applied after attention for feature refinement.
- Layer Normalization & Residual Connections: Stabilizes training and improves gradient flow.
Impact on NLP:
- Enables parallelization, overcoming the sequential bottleneck of RNNs.
- Captures long-range dependencies efficiently.
- Forms the backbone of state-of-the-art models like BERT, GPT, T5, excelling in translation, summarization, text classification and QA.
Transformers outperform RNN-based architectures by modeling context more effectively and training faster on large corpora.
Q18. Give the difference between BERT and GPT architectures.
Let's see the difference between BERT and GPT architectures,
| Feature | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Training Objective | Masked Language Modeling (MLM) | Autoregressive Language Modeling |
| Context | Bidirectional (considers left and right context) | Unidirectional (left-to-right context) |
| Use-case | Understanding tasks: NER, QA, classification | Text generation, completion, dialogue systems |
| Fine-tuning | Requires task-specific fine-tuning | Can generate text with minimal adaptation |
| Strength | Captures full context for comprehension | Generates coherent, sequential text |
| Limitation | Not naturally suited for text generation | Limited bidirectional understanding |
Q19. Autoregressive vs Autoencoder models.
Let's see the differences between Autoregressive and Autoencoders,
| Feature | Autoregressive Models | Autoencoder Models |
|---|---|---|
| Purpose | Predict the next token based on previous tokens in the sequence | Reconstruct input from a compressed latent representation |
| Context Usage | Left (previous) context only (unidirectional) | Both left and right context (bidirectional) |
| Training Objective | Maximize likelihood of the next token | Minimize reconstruction loss between input and output |
| Typical Architecture | Decoder-only Transformer (e.g., GPT series) | Encoder-decoder or encoder-only (e.g., BERT) |
| Applications | Text generation, speech synthesis, time-series forecasting | Text classification, question answering, representation learning |
| Inference | Sequential token generation; slower | Parallel processing possible; faster |
| Strength | Excellent at coherent sequential generation | Strong at contextual understanding and embeddings |
| Limitation | Limited bidirectional context | Not naturally suited for free-form text generation |
Q20. What are the differences between Masked Language Modeling (MLM) and Causal Language Modeling (CLM)?
Let's see the difference between MLM and CLM,
| Feature | Masked Language Modeling (MLM) | Causal Language Modeling (CLM) |
|---|---|---|
| Objective | Predict masked tokens anywhere in the input | Predict the next token based on previous tokens only |
| Context | Bidirectional (uses both left and right context) | Unidirectional (left-to-right context) |
| Example | Input: "The [MASK] is bright" → predict "sun" | Input: "The sun is" → predict "bright" |
| Model Examples | BERT, RoBERTa | GPT series |
| Use-case | Language understanding tasks: NER, QA, classification | Language generation tasks: text completion, dialogue |
| Strength | Captures full context for better comprehension | Generates coherent sequential text |
| Limitation | Not naturally suited for free text generation | Limited to past context; cannot see future tokens |
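The contrast is easy to see with Hugging Face pipelines (assuming transformers is installed and the bert-base-uncased and gpt2 checkpoints can be downloaded):
Python
from transformers import pipeline

# MLM (BERT-style): predict a masked token using context on both sides
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The [MASK] is bright.")[0]["token_str"])

# CLM (GPT-style): continue the text left to right
generator = pipeline("text-generation", model="gpt2")
print(generator("The sun is", max_new_tokens=5)[0]["generated_text"])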
Q21. How does dependency parsing differ from constituency parsing?
| Feature | Dependency Parsing | Constituency Parsing |
|---|---|---|
| Focus | Grammatical relationships between words (head-dependent relations) | Hierarchical structure of phrases (sub-phrases like NP, VP) |
| Output | Dependency tree: edges represent direct relationships | Constituency tree: nested tree structure of constituents |
| Example | Sentence: "She enjoys reading books" → "enjoys" is root; "She" → subject; "reading books" → object | Same sentence → NP: "She"; VP: "enjoys reading books" |
| Advantages | Highlights syntactic dependencies useful for relation extraction | Captures hierarchical grammatical structure |
| Use-cases | Information extraction, syntax-based sentiment analysis, NER | Grammar analysis, parsing for machine translation, text generation |
| Representation | Graph-based (nodes = words, edges = dependencies) | Tree-based (nested phrase structure) |
Q22. What are positional encodings in Transformers and why are they needed?
Transformers process sequences in parallel and do not inherently capture the order of tokens. Positional encodings add information about the position of each token in the sequence, enabling the model to recognize the relative or absolute position of words.
Types of Positional Encodings:
- Sinusoidal (fixed) encoding: Uses sine and cosine functions of different frequencies.
- Learned encoding: Position embeddings are learned during training.
Why Needed:
- Without positional encodings, a Transformer cannot distinguish "I love NLP" from "NLP love I".
- Enables modeling of sequential patterns and relationships between words.
Positional encodings provide order information, essential for accurate context modeling in Transformers.
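A NumPy sketch of the sinusoidal (fixed) encoding described above:
Python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]             # (1, d_model)
    angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

print(positional_encoding(seq_len=4, d_model=8).round(2))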
Q23. Explain the concept of embeddings for subwords and character-level models
Subword and character-level embeddings are designed to address the Out-of-Vocabulary (OOV) problem and handle rare, morphologically complex or unseen words in NLP tasks. They allow models to generate meaningful representations even for words not seen during training.
1. Subword Embeddings:
Words are split into subword units such as prefixes, suffixes or frequent subword patterns. Common methods: Byte-Pair Encoding (BPE), WordPiece, FastText.
Benefits:
- Handles rare and unseen words by combining known subwords.
- Reduces overall vocabulary size.
- Captures morphological structure of words.
Example:
- Word: "unhappiness"
- Subwords: "un" + "happi" + "ness"
- Embedding: Combine embeddings of subwords to represent the word.
2. Character-Level Embeddings:
Each character in a word is represented as an embedding. A sequence of character embeddings is processed (e.g., via CNNs or RNNs) to form a word-level representation.
Benefits:
- Handles typos, misspellings and very rare words.
- Captures fine-grained morphological and orthographic patterns.
Example:
- Word: "running"
- Characters: r, u, n, n, i, n, g
- Processed via RNN/CNN → Combined embedding represents "running".
Q24. Explain Named Entity Recognition (NER) and its importance
Named Entity Recognition (NER) is a subtask of information extraction that identifies and classifies entities in text into predefined categories such as persons, organizations, locations, dates, monetary values, percentages and more.
- Entity Recognition: Detecting the presence of an entity in the text.
- Entity Classification: Assigning the detected entity to a predefined category.
Example:
Sentence: "Apple Inc. was founded by Steve Jobs in Cupertino."
NER Output:
- "Apple Inc." → Organization
- "Steve Jobs" → Person
- "Cupertino" → Location
Importance of NER:
- Information Retrieval: Improves search accuracy by identifying key entities.
- Question Answering Systems: Helps extract relevant facts from documents.
- Content Recommendation: Enhances understanding of text content for personalization.
- Knowledge Graph Construction: Extracts structured information from unstructured text.
NER is foundational for structured understanding of unstructured text, enabling downstream NLP tasks to operate more effectively.
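A short spaCy sketch (assuming spaCy and its small English model en_core_web_sm are installed):
Python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # expected labels along the lines of ORG, PERSON, GPE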
Q26. What is Word Sense Disambiguation (WSD)? Differentiate between WSD and NER.
Word Sense Disambiguation (WSD) is the process of determining the correct meaning of a word in context when the word has multiple possible senses. WSD is crucial for accurate understanding and downstream NLP tasks.
Techniques:
- Knowledge-based approaches: Use lexical databases like WordNet to match context with word senses.
- Supervised learning: Train classifiers on labeled datasets where words are annotated with their correct senses.
- Contextual embeddings: Modern models like BERT produce dynamic embeddings that inherently disambiguate word senses based on surrounding context.
| Feature | WSD (Word Sense Disambiguation) | NER (Named Entity Recognition) |
|---|---|---|
| Definition | Identifies the correct meaning (sense) of a word based on context | Identifies and classifies proper nouns or entities (names, locations, organizations) |
| Focus | Resolving lexical ambiguity for common words | Detecting specific entities in text |
| Example | "Bank" → financial institution vs river bank, depending on the sentence | "Apple" → organization; "Steve Jobs" → person |
| Context Use | Requires surrounding words or sentence-level context to choose the correct sense | Uses surrounding words and sometimes grammar to classify entity type |
| Applications | Machine translation, semantic search, word-level sense analysis | Information extraction, question answering, knowledge graph construction |
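A minimal sketch of knowledge-based WSD using the Lesk algorithm in NLTK (assuming the wordnet and punkt resources are downloaded; Lesk is a simple baseline, so the chosen sense may not always be intuitive):
Python
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence = "I went to the bank to deposit my money"
sense = lesk(word_tokenize(sentence), "bank")
print(sense)                                  # the WordNet synset chosen for "bank"
print(sense.definition() if sense else None)  # its dictionary definition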
Q27. What is topic modeling and which algorithms are commonly used?
Topic modeling is an unsupervised learning technique that identifies hidden topics in large collections of text documents by analyzing word patterns and co-occurrences.
Common Algorithms:
- Latent Dirichlet Allocation (LDA): Probabilistic model that represents documents as mixtures of topics and topics as distributions of words.
- Non-negative Matrix Factorization (NMF): Factorizes document-term matrices into topic-word and document-topic matrices.
- BERTopic: Transformer-based approach that leverages contextual embeddings for richer topic representations.
Applications:
- Trend analysis and market research
- Document clustering and organization
- Summarization and content recommendation
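A small scikit-learn sketch of LDA on a toy corpus (real topic models need far more documents):
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the stock market rallied as investors bought shares",
    "central banks raised interest rates to fight inflation",
    "the team won the football match after extra time",
    "the striker scored two goals in the final game",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-5:]]
    print(f"Topic {idx}: {top_words}")  # top words per discovered topic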
Q28. What is Information Extraction (IE) in NLP?
Information Extraction (IE) is the process of automatically converting unstructured text into structured data that can be easily analyzed and used in downstream applications. IE allows systems to extract meaningful facts, entities, relationships and events from raw text.
Key Components:
- Named Entity Recognition (NER): Identify and classify entities (persons, organizations, locations, etc.)
- Relation Extraction: Detect relationships between entities (e.g., “Steve Jobs → founder → Apple Inc.”)
- Event Extraction: Identify events and participants, along with temporal and spatial details
Applications:
- Knowledge Graph Construction: Populating structured graphs for reasoning and search.
- Question Answering Systems: Extracting precise answers from large text corpora.
- Content Summarization: Automatically summarizing key information from articles.
- Data Analytics: Structuring unstructured textual data for insights.
Q29. What are the challenges faced in sentiment analysis and how can they be addressed?
Sentiment analysis determines the emotional tone of text (positive, negative, neutral), but it faces several challenges:
1. Sarcasm and Irony:
- Sentences may convey the opposite sentiment of literal words.
- Example: "Great job!" could be sarcastic and actually negative.
- Solution: Use contextual embeddings like BERT or specific sarcasm detection models that can capture nuanced meaning.
2. Contextual Ambiguity:
- Words can have different sentiment depending on context.
- Example: "The movie was good, but the ending was disappointing."
- Solution: Fine-tune context-aware architectures like Transformers to understand subtle shifts in sentiment.
3. Domain-Specific Language:
- Words can carry different sentiment in specialized domains.
- Example: "Positive" in a medical report vs. a movie review.
- Solution: Use domain-specific datasets for training or fine-tuning.
4. Negation Handling:
- Negations can flip sentiment.
- Example: "I don’t like this movie."
- Solution: Incorporate syntactic parsing or negation-aware embeddings.
5. Imbalanced Data:
- Some sentiment classes may dominate the dataset, biasing predictions.
- Solution: Apply data augmentation, class weighting or resampling techniques.
Q30. What are common challenges in text classification and how can they be solved?
Text classification assigns predefined categories to text documents, but it faces multiple challenges:
High Dimensionality:
- Text represented with large vocabularies leads to sparse feature spaces.
- Solution: Use dimensionality reduction methods like TF-IDF with PCA or dense embeddings such as Word2Vec or BERT.
Class Imbalance:
- Some categories have significantly fewer examples, causing biased models.
- Solution: Use oversampling, undersampling or weighted loss functions to balance training.
Noise and Irrelevant Information:
- Text may contain typos, stopwords or unrelated content.
- Solution: Perform preprocessing steps like tokenization, stopword removal and normalization.
Ambiguity:
- Words with multiple meanings can confuse the classifier.
- Solution: Employ contextual embeddings (e.g., BERT, ELMo) to capture word meaning in context.
Domain Adaptation:
- Models trained on one domain may not generalize to another.
- Solution: Apply transfer learning and fine-tune models on target domain data.
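A compact baseline that addresses several of these points at once, using a TF-IDF representation with a class-weighted logistic regression (toy data for illustration):
Python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product, loved it", "terrible, broke in a day",
         "works exactly as expected", "awful customer service"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),                            # weighted features instead of raw counts
    ("model", LogisticRegression(class_weight="balanced")),  # weighted loss to mitigate class imbalance
])
clf.fit(texts, labels)
print(clf.predict(["really disappointed with this purchase"]))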
Q31. How do attention mechanisms work in NLP?
Attention mechanisms in NLP allow models to focus on relevant parts of the input sequence when processing or generating text. Each word is assigned a weight based on its importance to other words in the sequence, enabling the model to capture context and long-range dependencies effectively.
- Self-Attention: Computes relevance of each word with every other word in the sequence.
- Scaled Dot-Product Attention: Uses the dot product of queries and keys, scaled and applied to values to determine attention weights.
- Multi-Head Attention: Uses multiple attention heads to capture different aspects of relationships simultaneously.
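A NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, with random matrices standing in for learned query/key/value projections:
Python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                              # relevance of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))          # each row sums to 1: how strongly a token attends to the others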
Q32. What is Layer Normalization and why is it used in Transformers?
Layer Normalization is a technique that normalizes the inputs of each layer to have zero mean and unit variance, stabilizing and accelerating training in deep neural networks, particularly Transformers.
- Prevents vanishing/exploding gradients.
- Applied to self-attention and feed-forward layers.
- Variants: Pre-LayerNorm (before operations) and Post-LayerNorm (after operations).
Layer Normalization ensures stable and efficient training, improving convergence and model performance.
Q33. What is the role of context windows in NLP?
A context window is the set of words surrounding a target word that a model considers when interpreting its meaning. It defines the scope of context used to capture semantic and syntactic relationships.
Types:
- Narrow Window: Focuses on immediate neighbors; captures local syntactic relationships.
- Wide Window: Includes distant words; captures long-range semantic dependencies.
- Dynamic/Adaptive Window: Context is learned dynamically, as in Transformers, via attention.
Q34. What is zero-shot and few-shot learning in NLP?
Zero-shot learning in NLP refers to the ability of a model to perform a task without having seen any labeled examples of that task during training, relying solely on its pre-trained knowledge. For example: a sentiment analysis model trained on English being used to classify sentiment in Hindi without explicit Hindi training data.
Few-shot learning refers to the ability of a model to adapt to a task with only a small number of labeled examples, leveraging prior knowledge for generalization. For example: Fine-tuning a pre-trained model for intent classification with just a handful of labeled sentences.
Q35. Explain Cross-lingual Transfer Learning and its challenges.
Cross-lingual Transfer Learning is the process of using knowledge learned from a high-resource source language (e.g., English) to improve model performance in a low-resource target language (e.g., Swahili), enabling multilingual applications with limited labeled data.
- Utilizes multilingual embeddings and pre-trained models like mBERT or XLM-R.
- Enables tasks like machine translation, sentiment analysis and question answering across languages.
Challenges:
- Language diversity: Structural and syntactic differences hinder transfer.
- Data scarcity: Limited parallel corpora for many languages.
- Domain mismatch: Source and target may differ in usage contexts.
- Cultural nuances: Idioms and semantics vary across languages.
- High computation costs: Multilingual models are large and resource-intensive.
Q36. What is retrieval-augmented generation (RAG) in NLP?
Retrieval-Augmented Generation (RAG) is a hybrid NLP approach that combines retrieval-based methods with generative models to improve accuracy, factuality and knowledge coverage in text generation tasks.
- The retriever component fetches relevant documents or passages from an external knowledge source (e.g., Wikipedia, a vector database).
- The generator (usually a Transformer-based model like BART or GPT) uses both the retrieved context and its own learned knowledge to produce the final output.
- This helps reduce hallucinations and improves performance on tasks requiring factual grounding.
Applications:
- Open-domain question answering
- Chatbots with external knowledge integration
- Summarization with verified context
- Legal, financial and medical document assistance
RAG enhances generative models by anchoring responses in real-world data, making them more reliable and trustworthy.
Q37. How can knowledge graphs be integrated into NLP applications?
A knowledge graph (KG) is a structured representation of entities and their relationships. Integrating KGs into NLP allows models to use explicit symbolic knowledge alongside statistical learning for better reasoning and interpretability.
- Entity Linking: Mapping text mentions to KG entities (e.g., “Apple” → company vs fruit).
- Relation Extraction: Using KGs to validate or discover relationships between entities.
- KG-Enhanced Embeddings: Incorporating KG structure into word or sentence embeddings for semantic enrichment.
- Hybrid Models: Combining KGs with neural architectures (e.g., Graph Neural Networks + Transformers).
Applications:
- Question Answering: Providing factual, graph-based answers.
- Recommendation Systems: Leveraging entity relationships for personalized recommendations.
- Semantic Search: Improving retrieval accuracy with KG-based reasoning.
- Dialogue Systems: Maintaining consistency and factual grounding in conversations.
Q38. Describe how you would implement a chatbot using NLP techniques.
A chatbot is an AI system that simulates human conversation, often using Natural Language Processing (NLP) to understand user input and generate appropriate responses.
Implementation Steps:
1. Text Preprocessing
- Clean input (tokenization, stopword removal, stemming/lemmatization).
- Handle spelling correction and entity recognition for better interpretation.
2. Intent Recognition
- Use text classification models (e.g., logistic regression, SVM or deep learning models like BERT) to identify user intent (e.g., "book flight," "check weather").
3. Entity Extraction
- Apply Named Entity Recognition (NER) to capture required entities (e.g., dates, names, locations).
4. Dialogue Management
- Rule-based (dialogue flow charts, if-else rules).
- Machine learning-based (Reinforcement Learning or Transformer-based dialogue policies).
5. Response Generation
- Retrieval-based: Predefined responses based on matching.
- Generative-based: Neural models (e.g., Seq2Seq, Transformer, GPT) generate responses dynamically.
- Hybrid approach (retrieval + generative).
6. Knowledge Integration
- Incorporate knowledge graphs or retrieval-augmented generation (RAG) to improve factual accuracy.
Application:
- Banking chatbot (checking balances, transaction history).
- Customer support (handling FAQs).
- Healthcare chatbot (symptom checker with disclaimers).
Q39. What are machine translation approaches?
Machine Translation (MT) refers to the process of automatically converting text or speech from one natural language into another using computational methods. Over time, MT has evolved through three main paradigms: Rule-Based (RBMT), Statistical (SMT) and Neural Machine Translation (NMT). Each approach differs in how it models language, handles grammar and learns translation patterns.
1. Rule-Based Machine Translation (RBMT):
RBMT is the earliest approach to MT, which relies on explicit linguistic rules and bilingual dictionaries crafted by experts. It uses knowledge of grammar, syntax and semantics of both the source and target languages to perform translation.
- Pros: Grammatically precise for structured sentences.
- Cons: Requires extensive manual effort, struggles with ambiguity and idioms.
2. Statistical Machine Translation (SMT):
SMT relies on probability and statistics derived from large bilingual corpora to generate translations. Instead of rules, it learns how words and phrases in one language map to another based on frequency and alignment.
- Example: Phrase-Based SMT learns phrase mappings from aligned sentences.
- Pros: More flexible than RBMT, learns from data.
- Cons: Still limited in fluency and long-range dependencies.
3. Neural Machine Translation (NMT):
NMT uses deep learning models, particularly sequence-to-sequence architectures with attention (and later Transformers), to perform translation. It represents words and sentences in continuous vector spaces (embeddings), enabling context-aware and fluent translations.
- Pros: Produces fluent, context-aware translations.
- Cons: Requires large data and compute resources, may hallucinate translations.
Q40. How can NLP be applied in recommendation systems, search engines and QA systems?
1. NLP in Recommendation System
A recommendation system is an AI-based system that predicts and suggests items (such as products, movies or news articles) to users by analyzing their past interactions, preferences and available content. When NLP is applied, the system can also interpret textual content (e.g., item descriptions, user reviews) to make smarter and more personalized recommendations.
How NLP is applied:
- Analyzing product/movie descriptions using embeddings.
- Understanding user feedback via sentiment analysis.
- Extracting key themes through topic modeling.
- Capturing user intent from natural language queries (e.g., “affordable phones for photography”).
2. NLP in Search Engines
A search engine is a system that retrieves and ranks relevant documents, web pages or content based on a user’s query. NLP improves search engines by enabling them to understand the meaning behind queries instead of just matching keywords.
How NLP is applied:
- Query processing: Tokenization, stemming and lemmatization make queries more precise.
- Semantic search: Embedding-based retrieval (e.g., BERT, SBERT) allows searches by meaning, not exact words.
- Entity recognition: Identifying names, places or dates in queries.
- Re-ranking models: NLP-powered ranking ensures that the most relevant documents appear at the top.
- Conversational search: Handles follow-up and context-aware queries.
3. NLP in Question Answering (QA) Systems
A QA system is an NLP-powered application that provides direct answers to user queries expressed in natural language, instead of returning just a list of documents. Unlike search engines, QA systems aim to extract or generate exact responses from available knowledge.
How NLP is applied:
- Extractive QA: Models like BERT highlight the exact span of text containing the answer.
- Generative QA: GPT-like models generate natural language answers.
- Knowledge graph QA: Uses structured graphs to answer factual queries.
- Conversational QA: Context-aware systems that manage follow-up questions.
- Domain-specific QA: Trained on specialized datasets for medicine, law or finance.
Q41. What are the evaluation metrics in NLP?
Evaluation metrics in NLP vary by task and generally include:
Classification Tasks:
- Accuracy: Overall correctness of predictions
- Precision: Proportion of correctly predicted positive instances
- Recall: Proportion of actual positives correctly identified
- F1-Score: Harmonic mean of precision and recall, balancing both
Machine Translation:
- BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between generated and reference text
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Emphasizes recall of overlapping units, used also for summarization
Summarization:
- ROUGE: Commonly used metric for overlap with reference summaries
- BERTScore: Uses contextual embeddings to evaluate semantic similarity beyond exact word matches
Semantic Similarity:
- Cosine Similarity: Measures angle-based similarity between vector embeddings of sentences or words
- Word Mover’s Distance (WMD): Quantifies semantic distance based on optimal transport between word embeddings
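A quick sketch of computing some of these metrics with scikit-learn and NLTK (smoothing is used because BLEU on a single short sentence is otherwise unstable):
Python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Classification metrics
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))

# BLEU for machine translation / generation
reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
print(sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1))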
Q42. What is cosine similarity and Word Mover’s Distance (WMD) in semantic similarity?
Semantic similarity refers to measuring how closely two texts (words, sentences or documents) are related in meaning. Two popular approaches for this are cosine similarity and Word Mover’s Distance (WMD).
Cosine Similarity:
Cosine similarity is a vector-based metric that measures the cosine of the angle between two vectors in high-dimensional space. It captures how similar the direction of two vectors is, regardless of their magnitude.
- Represents words, sentences or documents as embeddings (numerical vectors).
- Computes similarity based on vector orientation.
Formula:
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \times \|B\|}
where A and B are embedding vectors.
Use cases:
- Semantic search (matching queries with documents).
- Document clustering.
- Recommender systems based on text similarity.
Word Mover’s Distance (WMD):
Word Mover’s Distance is a document-level distance metric that measures the minimum cumulative distance required to move words from one text to another, using their embeddings. It is based on the Earth Mover’s Distance (optimal transport theory).
- Represents each word in both documents as embeddings.
- Calculates how much "effort" is needed to transform one document into the other by optimally matching and moving words.
- Accounts for semantic similarity between words (e.g., “Obama” and “President” are close in embedding space).
Advantages over cosine similarity:
- Captures fine-grained word-to-word relationships instead of just comparing overall vectors.
- Better at handling paraphrases or semantically equivalent but differently worded sentences.
Example:
- Sentence 1: “Obama speaks to the press.”
- Sentence 2: “The President gives a speech.”
Cosine similarity may not capture the relation fully, but WMD shows high similarity because word embeddings align semantically.
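A gensim-based sketch of both measures (assuming gensim with its downloader, plus the optional optimal-transport dependency that wmdistance relies on, are installed; averaging word vectors is a deliberately crude sentence representation):
Python
import numpy as np
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity

wv = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe vectors (downloaded on first use)

s1 = "obama speaks to the press".split()
s2 = "the president gives a speech".split()

# Cosine similarity between averaged word vectors
v1 = np.mean([wv[w] for w in s1 if w in wv], axis=0)
v2 = np.mean([wv[w] for w in s2 if w in wv], axis=0)
print(cosine_similarity([v1], [v2]))

# Word Mover's Distance: lower distance = more semantically similar
print(wv.wmdistance(s1, s2))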
Q43. What is pragmatic ambiguity?
Pragmatic ambiguity occurs when the meaning of an utterance depends on the context, situation or speaker intent, rather than the literal interpretation of the words themselves. It arises from how language is used in communication, not from grammatical or lexical ambiguity.
- Unlike lexical ambiguity (where a word has multiple meanings, e.g., “bank”), pragmatic ambiguity involves interpretation based on context.
- It can lead to multiple valid readings of the same sentence depending on the speaker, listener or situation.
Examples:
1. “Can you pass the salt?”
- Literal interpretation: Asking if the listener is able to pass the salt.
- Pragmatic interpretation: Polite request for the salt.
2. “I’ll meet you at the bank.”
- Without context, it could refer to a financial institution or the side of a river.
- The context (river vs city) resolves the ambiguity.
Q44. What are Hugging Face Transformers and how are they used in NLP?
Hugging Face Transformers is an open-source library that provides pretrained Transformer-based models for a wide range of NLP tasks. It enables easy access to models like BERT, GPT, RoBERTa, T5, and DistilBERT, along with tools for training, fine-tuning, and deploying them efficiently.
Key Features:
- Pretrained Models: Hundreds of models trained on large corpora, ready for downstream tasks.
- Task Support: Includes text classification, token classification, question answering, summarization, translation, and text generation.
- Easy Integration: Works with PyTorch, TensorFlow, and JAX.
- Tokenizers: Supports subword tokenization like WordPiece, BPE, and SentencePiece.
- Fine-Tuning: Allows adapting pretrained models to domain-specific tasks with minimal code.
Applications in NLP:
- Text Classification: Sentiment analysis, spam detection, topic classification.
- Named Entity Recognition (NER): Identifying entities like names, dates, locations.
- Question Answering (QA): Extractive and generative QA systems.
- Text Generation: Chatbots, story generation, summarization.
- Translation and Summarization: Multilingual and abstractive text processing.
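A minimal usage sketch with the pipeline API (default checkpoints are downloaded on first use):
Python
from transformers import pipeline

# Sentiment analysis with a default pretrained model
sentiment = pipeline("sentiment-analysis")
print(sentiment("Hugging Face Transformers makes NLP much easier!"))

# Extractive question answering
qa = pipeline("question-answering")
print(qa(question="Who founded Apple?",
         context="Apple Inc. was founded by Steve Jobs in Cupertino."))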
Q45. Apply a full text preprocessing pipeline.
Text preprocessing is the process of cleaning and transforming raw text into a structured format suitable for NLP tasks. It helps remove noise, standardize the input and prepare features for downstream models. Using NLTK (Natural Language Toolkit), we can implement a complete preprocessing pipeline.
1. Import Necessary Libraries
We will be importing nltk, re (regular expressions), string and inflect.
Python
import nltk
import string
import re
import inflect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
2. Convert to Lowercase
We convert the text to lowercase to reduce the size of the vocabulary of our text data.
Python
def text_lowercase(text):
    return text.lower()
input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
print(text_lowercase(input_str))
Output:
hey, did you know that the summer break is coming? amazing right !! it's only 5 more days !!
3. Removing Numbers
We can either remove numbers or convert the numbers into their textual representations. To remove the numbers we can use regular expressions.
Python
def remove_numbers(text):
    return re.sub(r'\d+', '', text)
input_str = "There are 3 balls in this bag, and 12 in the other one."
print(remove_numbers(input_str))
Output:
There are  balls in this bag, and  in the other one.
4. Converting Numerical Values
We can also convert the numbers into words. This can be done by using the inflect library.
Python
p = inflect.engine()
def convert_number(text):
    temp_str = text.split()
    new_string = []
    for word in temp_str:
        if word.isdigit():
            new_string.append(p.number_to_words(word))
        else:
            new_string.append(word)
    return ' '.join(new_string)
input_str = "There are 3 balls in this bag, and 12 in the other one."
print(convert_number(input_str))
Output:
There are three balls in this bag, and twelve in the other one.
5. Removing Punctuation
We remove punctuation so that we don't end up with different forms of the same word. For example, if punctuation is not removed, "been.", "been," and "been!" are treated as different tokens.
Python
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
print(remove_punctuation(input_str))
Output:
Hey did you know that the summer break is coming Amazing right Its only 5 more days
6. Removing Whitespace
We can use the join and split functions to collapse repeated whitespace in a string into single spaces.
Python
def remove_whitespace(text):
    return " ".join(text.split())
input_str = "we don't need the given questions"
print(remove_whitespace(input_str))
Output:
we don't need the given questions
7. Removing Stopwords
Stopwords are words that do not contribute much to the meaning of a sentence hence they can be removed.
Python
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return filtered_text
example_text = "This is a sample sentence and we are going to remove the stopwords from this."
print(remove_stopwords(example_text))
Output:
['sample', 'sentence', 'going', 'remove', 'stopwords', '.']
8. Applying Stemming
Stemming is the process of getting the root form of a word. The stem or root is the part to which affixes like -ed, -ize, -s or de- are added. The stem of a word is created by removing its prefixes or suffixes.
Python
stemmer = PorterStemmer()
def stem_words(text):
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems
text = "data science uses scientific methods algorithms and many types of processes"
print(stem_words(text))
Output:
['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of', 'process']
9. Applying Lemmatization
Lemmatization is an NLP technique that reduces a word to its root (dictionary) form. This is helpful for tasks such as text analysis and search, as it allows us to compare words that are related but have different surface forms.
Python
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def lemma_words(text):
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word) for word in word_tokens]
    return lemmas
input_str = "data science uses scientific methods algorithms and many types of processes"
print(lemma_words(input_str))
Output:
['data', 'science', 'us', 'scientific', 'method', 'algorithm', 'and', 'many', 'type', 'of', 'process']
10. POS Tagging
POS tagging is the process of assigning each word in a sentence its grammatical category, such as noun, verb, adjective or adverb.
Python
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Point NLTK at a custom data directory if needed (environment-specific)
nltk_data_dir = '/usr/local/share/nltk_data'
if nltk_data_dir not in nltk.data.path:
    nltk.data.path.append(nltk_data_dir)

nltk.download('averaged_perceptron_tagger_eng')

def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)
input_str = "Data science combines statistics, programming, and machine learning."
print(pos_tagging(input_str))
Output:
[('Data', 'NNP'), ('science', 'NN'), ('combines', 'NNS'), ('statistics', 'NNS'), (',', ','), ('programming', 'NN'), (',', ','), ('and', 'CC'), ('machine', 'NN'), ('learning', 'NN'), ('.', '.')]
Where,
- NNP: Proper noun
- NN: Noun (singular)
- NNS: Noun (plural)
- VBZ: Verb (3rd person singular)
- CC: Coordinating conjunction