Natural Language Processing
Grammar Based LM
Grammar-based language models are a type of statistical language model
that uses formal grammars to represent the underlying structure of language.
Unlike n-gram models, which focus on the probability of sequences of words,
grammar-based models explicitly model the grammatical relationships
between words in a sentence.
Steps
-> Training
-> Parsing
-> Probability calculation (see the sketch after this list)
-> Best parse selection
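To make the probability-calculation step concrete, here is a minimal toy sketch in plain Python. The tiny grammar, its rules, and the probabilities are invented for illustration only; the probability of a parse is simply the product of the probabilities of the rules it uses.

# Toy probabilistic CFG: each entry maps (left-hand side, right-hand side) to a
# rule probability. All numbers are made up for illustration.
pcfg = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)):       0.4,
    ("VP", ("V", "NP")):  1.0,
    ("Det", ("the",)):    1.0,
    ("N",  ("dog",)):     0.5,
    ("N",  ("cat",)):     0.5,
    ("V",  ("chased",)):  1.0,
}

def parse_probability(rules_used):
    """Multiply the probabilities of every rule used in a parse tree."""
    prob = 1.0
    for lhs, rhs in rules_used:
        prob *= pcfg[(lhs, rhs)]
    return prob

# Rules used in one possible parse of "the dog chased the cat".
parse = [
    ("S",  ("NP", "VP")),
    ("NP", ("Det", "N")), ("Det", ("the",)), ("N", ("dog",)),
    ("VP", ("V", "NP")),  ("V", ("chased",)),
    ("NP", ("Det", "N")), ("Det", ("the",)), ("N", ("cat",)),
]
print(round(parse_probability(parse), 4))  # 0.6 * 0.5 * 0.6 * 0.5 = 0.09

The best-parse-selection step would compute this probability for every candidate parse of a sentence and keep the highest-scoring one.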
Advantages
-> Explicit modelling of structure
-> Robustness
Disadvantages
-> Complexity
-> Limited coverage
Applications
-> NLP, speech recognition, computational linguistics
Statistical based LM
Statistical language models (SLMs) are a cornerstone of natural language
processing (NLP), aiming to predict the likelihood of a sequence of words in a
given language. They do this by analyzing vast amounts of text data and
identifying statistical patterns in word usage.
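As a minimal illustration of the idea, the sketch below (plain Python; the tiny corpus is an invented example) estimates bigram probabilities from raw counts: P(w | prev) = Count(prev w) / Count(prev).

from collections import Counter

# A tiny invented corpus, just to show the counting step.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = Count(prev word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 0.25: "the" occurs 4 times, "the cat" once
print(bigram_prob("sat", "on"))   # 1.0: every "sat" in this corpus is followed by "on"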
Advantages
-> Simplicity, efficiency, widely used
Disadvantages
-> Data sparsity (as many word combinations may not appear frequently in the training data), Limited context, Lack of generalization.
Applications
-> Speech recognition, machine translation, text generation, information retrieval.
Regular Expression
A regular expression (regex) is a sequence of characters that defines a search
pattern. Here’s how to write regular expressions:
1. Write your pattern using special characters and literal characters.
2. Use the appropriate function or method to search for the pattern in a
string (see the example below).
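For instance, a short Python sketch using the standard re module (the pattern and sample text are illustrative, not from these notes):

import re

text = "Contact us at support@example.com or sales@example.org."

# A simple (not fully general) email pattern: name part, '@', domain, dot, suffix.
pattern = r"[\w.+-]+@[\w-]+\.\w+"

matches = re.findall(pattern, text)  # find every substring that matches the pattern
print(matches)  # ['support@example.com', 'sales@example.org']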
Finite Automata
● States of Automata: The conditions or configurations of the
machine.
English Morphology
In Natural Language Processing (NLP), morphology plays a crucial role in
understanding the structure and meaning of words. It involves analyzing
words into their constituent parts (morphemes) and understanding how
these parts contribute to the overall meaning.
Example: the word "unhappiness" can be broken into three morphemes: the prefix "un-", the root "happy", and the suffix "-ness".
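A very naive sketch of this idea in Python; the affix lists are tiny and invented, and real morphological analyzers are far more sophisticated (for instance, this sketch does not restore the spelling change from "happi" back to "happy"):

# Tiny, invented affix inventories for illustration only.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def split_morphemes(word):
    """Greedily strip one known prefix and one known suffix, if present."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p + "-")
            word = word[len(p):]
            break
    suffix = ""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "-" + s
            word = word[: -len(s)]
            break
    return parts + [word] + ([suffix] if suffix else [])

print(split_morphemes("unhappiness"))  # ['un-', 'happi', '-ness']
print(split_morphemes("cats"))         # ['cat', '-s']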
Tokenization
Tokenization is a fundamental process in Natural Language Processing (NLP)
that involves breaking down a stream of text into smaller units called tokens.
These tokens can range from individual characters to full words or phrases,
depending on the level of granularity required. By converting text into these
manageable chunks, machines can more effectively analyze and understand
human language.
Types
1. Word Tokenization
This is the most common method, where text is divided into individual words.
It works well for languages with clear word boundaries, like English. For
example, "Machine learning is fascinating" becomes:
["Machine", "learning", "is", "fascinating"]
2. Character Tokenization
The text is split into individual characters, which is useful for languages without clear word boundaries and for handling misspellings or rare words.
3. Subword Tokenization
Words are split into smaller units (subwords), as in Byte Pair Encoding (BPE) or WordPiece, balancing vocabulary size against the ability to handle unseen words (see the sketch below).
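A small sketch of the first two types using only the Python standard library; subword tokenization usually relies on a learned vocabulary (e.g. BPE), so only an illustrative comment is given for it:

text = "Machine learning is fascinating"

# 1. Word tokenization: split on whitespace (real tokenizers also handle punctuation).
word_tokens = text.split()
print(word_tokens)        # ['Machine', 'learning', 'is', 'fascinating']

# 2. Character tokenization: every character becomes a token.
char_tokens = list(text.replace(" ", ""))
print(char_tokens[:7])    # ['M', 'a', 'c', 'h', 'i', 'n', 'e']

# 3. Subword tokenization would split rare words into learned pieces,
#    e.g. "fascinating" -> ["fascin", "##ating"] with a BPE/WordPiece vocabulary.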
Example
Input: "Teh cat sat on teh mat."
________________________________________________________________
Smoothing in NLP
Why Smoothing is Necessary
● Zero probabilities: Word sequences that never occur in the training data
receive zero probability, which leads to:
○ Underestimation of probabilities
○ Inability to handle unseen data
● Overfitting: Unsmoothed models tend to overfit the training
data, meaning they perform poorly on unseen text.
Example
Suppose a bigram model trained on some corpus has the following counts for bigrams beginning with "the":
● Count("the cat") = 10
● Count("the dog") = 5
● Count("the bird") = 0
Laplace Smoothing:
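A minimal worked sketch of Laplace (add-one) smoothing applied to the counts above. The total count of "the" and the vocabulary size V are assumed values chosen only to make the arithmetic concrete:

# Laplace (add-one) smoothing for bigrams starting with "the":
#   P(w | "the") = (Count("the" w) + 1) / (Count("the") + V)
# Count("the") = 20 and vocabulary size V = 1000 are assumed values for illustration.
counts = {"cat": 10, "dog": 5, "bird": 0}
count_the = 20
V = 1000

for word, c in counts.items():
    unsmoothed = c / count_the
    smoothed = (c + 1) / (count_the + V)
    print(f"P({word} | the): unsmoothed = {unsmoothed:.3f}, smoothed = {smoothed:.4f}")

# Note how "the bird" moves from probability 0 to a small non-zero value,
# while the frequent bigrams give up some probability mass.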
Impact of Smoothing
Example of POS Tagging
Consider the sentence: “The quick brown fox jumps over the lazy dog.” A tagger assigns each word a part-of-speech tag, for example: The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN.
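As a sketch, NLTK's off-the-shelf tagger can produce such tags. This assumes the nltk package plus its 'punkt' and 'averaged_perceptron_tagger' resources are installed, and the exact Penn Treebank tags may differ slightly from the simplified labels above:

import nltk

# One-time downloads of the tokenizer and tagger models (uncomment if needed).
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Output is a list of (word, Penn Treebank tag) pairs, e.g. ('fox', 'NN'), ('jumps', 'VBZ').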
1. Rule-Based Tagging
Rule-based part-of-speech (POS) tagging involves assigning words their
respective parts of speech using predetermined rules, contrasting with
machine learning-based POS tagging that requires training on annotated text
corpora. In a rule-based system, POS tags are assigned based on specific
word characteristics and contextual cues.
Initial Tags:
Change the tag of “chased” from Verb (V) to Noun (N) because it follows the
determiner “the.”
Updated tags:
● “The” – Determiner (DET)
Challenges of POS Tagging
● Ambiguity: The inherent ambiguity of language makes POS tagging
difficult, since many words can act as different parts of speech depending on context.
● Unknown words: Out-of-vocabulary words can be problematic for POS tagging
systems, since they don’t always appear in the system’s lexicon or training data.
_____________________________________________________________________
Context-Free Grammar
A context-free grammar is a formal grammar that describes the syntax or structure of a formal
language. It is defined by a 4-tuple G = (V, T, P, S), where:
V - It is a set of variables (non-terminals).
T - It is a set of terminals.
P - It is a set of productions, each of the form A → α, where A is a single non-terminal
and α is a string of terminals and/or non-terminals.
S - It is the starting symbol.
The left-hand side of every production can only be a single non-terminal; this
restriction is what makes the grammar context-free.
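A small sketch using NLTK's grammar tools; the toy grammar below is an invented example, not one taken from these notes:

import nltk

# A toy context-free grammar: S is the start symbol, quoted items are terminals.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N  -> 'dog' | 'cat'
    V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
sentence = "the dog chased the cat".split()
for tree in parser.parse(sentence):
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))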
Grammar in NLP
Grammar in NLP is a set of rules for constructing well-formed sentences in a language.
Applying these rules includes identifying parts of speech such as nouns, verbs, and adjectives,
and describing how they combine into phrases and sentences.
Grammar is defined as the rules for forming well-structured sentences. For example, in the
C programming language, the precise grammar rules state how functions are defined, how
statements are written, and how expressions are combined.
Treebanks
In Natural Language Processing (NLP), treebanks are collections of text that have been
annotated with syntactic structure, usually represented as tree-like diagrams (parse trees).
Normal Forms
A normal form is a restricted form of a grammar that maintains the same language
generated by the original grammar. These restricted forms (such as Chomsky Normal Form)
simplify the analysis and processing of the grammar.
Dependency Grammar
Key Concepts
● Dependency: A directed link between two words, indicating that one word (the
head) governs the other word (the dependent).
● Dependency Tree: A tree representation of a sentence, where words are nodes
and the directed links represent dependencies.
How it Works
Example: Consider the sentence “The cat sat on the mat.”
A possible dependency tree for this sentence would look like this:
● sat is the root of the sentence, as it is the main verb.
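A sketch of how such a dependency analysis can be produced with spaCy. It assumes the spacy package and its small English model en_core_web_sm are installed; the exact relation labels depend on the model:

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a dependency parser
doc = nlp("The cat sat on the mat.")

for token in doc:
    # Each word points to its head together with a dependency relation label.
    print(f"{token.text:>5} --{token.dep_}--> {token.head.text}")
# e.g. "cat" --nsubj--> "sat", "sat" --ROOT--> "sat", "mat" --pobj--> "on"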
Advantages
● Flexibility: Suitable for analyzing languages with free word order, where the
position of words in a sentence can vary.
● Simplicity: Dependency trees can be more concise and easier to interpret than
phrase-structure (constituency) trees.
Applications
_____________________________________________________________________
Applications of NLP
Machine Translation
1. Text Analysis: The input text is broken down into smaller units like words,
phrases, or sentences.
2. Language Identification: The source language is identified.
3. Translation: The system uses various techniques like statistical,
rule-based, or neural machine translation to find the most appropriate
equivalent in the target language (see the sketch after this list).
4. Post-Editing: While modern MT systems are quite accurate, human
post-editing is often necessary to refine the translation, especially for
nuanced or complex texts.
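As an illustrative sketch of the translation step, a pretrained neural MT model can be called through the Hugging Face transformers library. This assumes transformers and a compatible backend such as PyTorch are installed; Helsinki-NLP/opus-mt-en-de, named here, is one publicly available English-to-German model:

from transformers import pipeline

# Load a pretrained English-to-German translation model (downloaded on first use).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("Machine translation makes content accessible to a global audience.")
print(result[0]["translation_text"])  # the German translation produced by the model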
Benefits of Machine Translation in Intelligent Work Processors:
● Efficiency: Translating large volumes of text becomes significantly faster,
saving time and resources.
● Accessibility: Content can be made accessible to a wider global
audience, breaking down language barriers.
● Cost-Effectiveness: Automating translation reduces the need for human
translators, lowering costs.
● 24/7 Availability: MT systems can translate text anytime, anywhere,
providing on-demand access to information.
Examples of Machine Translation Systems:
● Google Translate: A widely used online translation service that supports
numerous languages.
● Microsoft Translator: Integrated into various Microsoft products, offering
real-time translation for text and speech.
● DeepL: A commercial translation service known for its high-quality output,
particularly for European languages.
Man-Machine Interface (MMI)
● Graphical User Interface (GUI): Uses visual elements like icons and
windows for interaction.
● Touchscreen Interface: Allows users to interact directly with the screen
using touch gestures.
● Voice User Interface (VUI): Enables interaction through voice commands.
● Gesture-Based Interface: Relies on body movements and gestures for
control.
● Software: This encompasses the programs and applications that interpret
user input and control the machine's behavior.
● Visual Display: This presents information to the operator, often through
screens, gauges, or indicators.
Types of MMIs:
Applications of MMI:
● Industrial Automation: MMIs are widely used in factories and plants to
control machinery, monitor processes, and manage production lines.
● Transportation: Vehicle dashboards, flight control systems, and traffic
management systems rely on MMIs.
● Healthcare: Medical devices like MRI machines and patient monitoring
systems use MMIs for control and data visualization.
● Consumer Electronics: Remote controls, smartphone interfaces, and
gaming consoles are examples of MMIs in everyday life.
Natural Language Query (NLQ)
How it works:
1. User Input: The user poses a question in natural language, such as "What
were the sales figures for California in Q3 2023?"
2. Natural Language Understanding (NLU): The system analyzes the
user's question to:
○ Identify the intent (e.g., "find sales figures")
○ Extract key entities (e.g., "California," "Q3 2023")
○ Determine the relationships between entities (e.g., "sales figures for
California")
3. Query Translation: The system translates the natural language question
into a formal query language (like SQL) that the database can understand
(see the sketch after this list).
4. Data Retrieval: The database executes the generated query and retrieves
the relevant data.
5. Answer Generation: The system processes the retrieved data and
presents it to the user in a clear and concise format, such as a table, chart,
or summary.
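A deliberately simplified sketch of steps 2 and 3. The intent keywords, entity patterns, and the sales table and columns are all hypothetical; real NLQ systems use far richer language understanding:

import re

def question_to_sql(question: str) -> str:
    """Very naive NLU: pull out a region and a quarter, then build a SQL query."""
    # Hypothetical entity patterns for this toy example.
    region  = re.search(r"for ([A-Z][a-z]+)", question)
    quarter = re.search(r"(Q[1-4])\s+(\d{4})", question)

    sql = "SELECT SUM(amount) FROM sales"  # hypothetical table and column
    conditions = []
    if region:
        conditions.append(f"region = '{region.group(1)}'")
    if quarter:
        conditions.append(f"quarter = '{quarter.group(1)}' AND year = {quarter.group(2)}")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

print(question_to_sql("What were the sales figures for California in Q3 2023?"))
# SELECT SUM(amount) FROM sales WHERE region = 'California' AND quarter = 'Q3' AND year = 2023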
Benefits of NLQ:
Applications of NLQ:
Speech Recognition
How it works:
● Speech Input: The user speaks into a microphone or other input device.
● Acoustic Analysis: The speech signal is converted into a digital
representation and analyzed to extract features like pitch, intensity, and
frequency.
● Feature Extraction: Key features of the speech signal are extracted and
represented in a numerical format.
● Pattern Recognition: The extracted features are compared to a database
of known speech patterns to identify the most likely words or phrases.
● Language Modeling: The system uses language models to predict the
most probable sequence of words based on the context and grammatical
rules.
● Output: The recognized text is displayed to the user (see the sketch below).
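A short sketch using the third-party SpeechRecognition package, assumed to be installed. The recognize_google backend sends audio to a web service, and "speech.wav" is a placeholder file name:

import speech_recognition as sr

recognizer = sr.Recognizer()

# Read a recorded utterance from a WAV file (a microphone source could be used instead).
with sr.AudioFile("speech.wav") as source:
    audio = recognizer.record(source)

try:
    # The library handles feature extraction and decoding behind this single call.
    text = recognizer.recognize_google(audio)
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Speech could not be understood.")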
Applications:
Challenges:
● Real-time Processing: Ensuring fast and accurate recognition for
real-time applications.
3. Finance:
● Fraud Detection: Identifying and preventing fraudulent activities by
analyzing patterns in financial documents (e.g., transactions, reports) and
detecting anomalies.
● Risk Assessment: Assessing credit risk by analyzing loan applications,
financial statements, and news articles.
● Investment Analysis: Analyzing news articles, financial reports, and
social media sentiment to inform investment decisions.
4. Healthcare:
5. E-commerce:
6. Legal:
● E-discovery: Analyzing large volumes of legal documents (emails,
contracts, legal briefs) to identify relevant information for legal proceedings.
● Contract Analysis: Automating the process of reviewing and analyzing
contracts to identify potential risks and inconsistencies.