Multimedia Application L3

The document discusses various aspects of Natural Language Processing (NLP), including regular expressions, tokenization, text normalization, and context-free grammar. It outlines applications of NLP such as text analysis, machine translation, and sentiment analysis, while also explaining techniques like lemmatization, stemming, and minimum edit distance. Additionally, it covers the significance of regular expressions in text processing and the structure of context-free grammar in programming languages.

Multimedia Application
By

Minhaz Uddin Ahmed, PhD


Department of Computer Engineering
Inha University Tashkent.
Email: [email protected]
Content
 Regular Expressions
 Words
 Corpora
 Simple Unix Tools for Word Tokenization
 Word Tokenization
 Word Normalization, Lemmatization and Stemming
 Sentence Segmentation
 Minimum Edit Distance
 Context Free Grammar
Natural Language Processing

 NLP is a field of computer science and AI concerned with computational linguistics
and the interaction between humans and computers in natural language.

 Application
 Text processing and analysis
 Text to speech , speech to text, speech to speech
 Machine translation
 Search engine.
 Sentiment analysis and opinion mining.
 Advanced text editor and IDE
 Question answering
 Spam and fraud detection
Regular expression
 A regular expression (often shortened to regex) is a language for
specifying text search strings.

 Application
 Searching for patterns in text, string matching, the Linux terminal
 E.g. Ctrl+F

Pattern     Example          Description
[0-9]       18524            Matches a single numeric digit
[A-Z]       AZIZ             Only capital letters; if word = 'Aziz' then no full match
[A-Za-z]    Aziz Ahmed       Capital and small letters, but not digits such as in Aziz98
Hmm*        Hm, Hmm, Hmmm    * matches the preceding character zero or more times
Regular expression: Kleene closure
 The Kleene closure, also known as the Kleene star, is a unary operator
denoted by an asterisk (*) symbol. It is used to indicate that the
preceding element or subexpression can occur zero or more times.

 For example, if we have a regular expression "a*", it means that the
character 'a' can occur zero or more times. This would match strings
such as "", "a", "aa", "aaa", and so on.

 Similarly, in the regular expression "(ab)*", the Kleene closure applies
to the entire subexpression "(ab)", indicating that the sequence "ab"
can occur zero or more times. This would match strings such as "",
"ab", "abab", "ababab", and so on.
Regular expression: Kleene closure

 Recognizes zero or more occurrences of a character in a text.

 Application:
Linux terminal

Classification:
"*" = zero or more
"+" = one or more

Example 1: New(s)* = New, News, Newss, Newsss, …

Example 2: New(s)+ = News, Newss, Newsss, …
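The difference between * (zero or more) and + (one or more) can be sketched with Python's re module:

```python
import re

# New(s)* : zero or more trailing s, so the bare "New" also matches
assert re.fullmatch(r"New(s)*", "New")
assert re.fullmatch(r"New(s)*", "Newsss")

# New(s)+ : one or more trailing s, so the bare "New" does not match
assert re.fullmatch(r"New(s)+", "News")
assert re.fullmatch(r"New(s)+", "New") is None
```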
Regular expression: Kleene closure

 Sometimes it’s annoying to have to write the regular expression for
digits twice, so there is a shorter way to specify “at least one” of
some character: the Kleene plus (+).

 One very important special character is the period (/./), a wildcard
expression that matches any single character (except a carriage
return), as shown in Fig. 2.6.
Regular expression: Anchors

 Anchors are special characters that anchor regular expressions to
particular places in a string. The most common anchors are the caret ^
and the dollar sign $. The caret ^ matches the start of a line, and the
dollar sign $ matches the end of a line.
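A minimal sketch of how the anchors behave in Python (re.search scans the string for a match):

```python
import re

line = "New York is new"

# ^ anchors the match at the start of the line
assert re.search(r"^New", line)
# $ anchors the match at the end of the line
assert re.search(r"new$", line)
# "York" occurs in the line, but not at the start, so ^York fails
assert re.search(r"^York", line) is None
```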
Disjunction, Grouping, and
Precedence
 This idea that one operator may take precedence over another,
requiring us to sometimes use parentheses to specify what we mean,
is formalized by the operator precedence hierarchy for regular
expressions.
Regular expression: pipe symbol (|)

 Works similarly to a logical OR
 Chooses a particular substring from the input string
 Go(es|ing): Goes, Going
 S(a|u|i)ng: Sing, Sung, Sang
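The two disjunction patterns above can be sketched in Python:

```python
import re

# Go(es|ing) matches exactly "Goes" and "Going"
assert re.fullmatch(r"Go(es|ing)", "Goes")
assert re.fullmatch(r"Go(es|ing)", "Going")

# S(a|u|i)ng matches "Sang", "Sung", and "Sing"
for word in ["Sang", "Sung", "Sing"]:
    assert re.fullmatch(r"S(a|u|i)ng", word)

# "Song" is not covered by the disjunction (a|u|i)
assert re.fullmatch(r"S(a|u|i)ng", "Song") is None
```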
More Operators
Regular expression
Words

 A corpus (plural corpora) is a computer-readable collection of text or speech.

 For example, the Brown corpus is a million-word collection of samples
from 500 written English texts from different genres (newspaper,
fiction, non-fiction, academic, etc.), assembled at Brown University in
1963–64.
Words
Text normalization in NLP

Normalization is the process of converting text into a standard form.

Necessity of normalization:
 Word boundary detection
 Separating words from each other
 Example: Uzbekistan, New York, cats and dogs
 Example 2: #nlp

The world has 7,097 languages (as of 2018).

Tokenization

 There are roughly two classes of tokenization algorithms. In top-down
tokenization, we define a standard and implement rules to carry out
that kind of tokenization. In bottom-up tokenization, we use simple
statistics of letter sequences to break up words into subword tokens.

 While the Unix command sequence just removed all the numbers and
punctuation, for most NLP applications we’ll need to keep these in our
tokenization. We often want to break off punctuation as a separate
token; commas are a useful piece of information for parsers, periods
help indicate sentence boundaries.
Methods of Text Normalization:

 Tokenization
Input: Uzbekistan is beautiful
Output: [Uzbekistan, is, beautiful]

 Lemmatization: finding the same root.
Input: [Sings, sung, sang], [connection, connected, connecting]
Output: [Sing], [connect]

 A lemma is a set of lexical forms having the same
stem, the same major part-of-speech, and the
same word sense.
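The tokenization step above can be sketched with a simple regex-based tokenizer (a minimal sketch; real tokenizers handle contractions, hashtags, and multiword names like New York, which this one does not):

```python
import re

def tokenize(text):
    # Split into word tokens and standalone punctuation tokens
    return re.findall(r"\w+|[^\w\s]", text)

assert tokenize("Uzbekistan is beautiful") == ["Uzbekistan", "is", "beautiful"]
# punctuation is broken off as its own token
assert tokenize("Tashkent has a park.") == ["Tashkent", "has", "a", "park", "."]
```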
Methods of Text Normalization:

 Stemming: Porter stemmer
Rule 1: ational -> ate, ator -> ate
Example 1: Relational -> Relate, Operator -> Operate
Rule 2: ing -> null, s -> null
Output: Going -> Go, Walking -> Walk, Cats -> Cat

 Stemming challenge: some exceptional cases need special handling
Input: King, Nothing, News
Output: K, Noth, New
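The rules and their failure cases can be sketched as a naive suffix stripper (a minimal sketch: the real Porter stemmer adds conditions on stem length that prevent the over-stemming shown below):

```python
def naive_stem(word):
    """Suffix stripping in the spirit of the Porter stemmer, but without
    Porter's conditions on the remaining stem -- hence the over-stemming."""
    w = word.lower()
    for suffix, replacement in [("ational", "ate"), ("ator", "ate"),
                                ("ing", ""), ("s", "")]:
        if w.endswith(suffix):
            return w[: len(w) - len(suffix)] + replacement
    return w

assert naive_stem("Relational") == "relate"
assert naive_stem("Operator") == "operate"
assert naive_stem("Going") == "go"
assert naive_stem("Cats") == "cat"
# the exceptional cases from the slide: naive rules over-stem
assert naive_stem("King") == "k"
assert naive_stem("Nothing") == "noth"
assert naive_stem("News") == "new"
```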
Methods of Text Normalization:

 Stop words: remove common words that are not important
 Words: am, is, are, an, the
 Example:
Input: Tashkent has a beautiful park.
Output: Tashkent beautiful park

 Parts Of Speech Tagging: Identify the grammar of each word


Input: IUT has a beautiful campus
Output: IUT = Noun, has = verb, a = Article, beautiful = adjective ,
campus = Noun.
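Stop-word removal as in the example above can be sketched with a small hand-picked stop list (a minimal sketch; real systems use much larger lists, e.g. NLTK's):

```python
# a toy stop list covering the slide's examples
STOP_WORDS = {"am", "is", "are", "a", "an", "the", "has"}

def remove_stop_words(text):
    # Drop the final period, split on whitespace, filter stop words
    tokens = text.rstrip(".").split()
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

assert remove_stop_words("Tashkent has a beautiful park.") == "Tashkent beautiful park"
```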
Tokenization

 Tokenization includes word, sentence, whitespace, and punctuation tokenization.

Sentence Tokenization
Tokenization is a text preprocessing step in sentiment analysis that
involves breaking down the text into individual words or tokens.

Preprocessing Text
Word tokenization
Text Normalization: Case folding

 All letters to lower case

 Application
- Information extraction, sentiment analysis, machine translation

- Example
Input: [CAT, I have an Apple, General Motors, US]
Output: [cat, i have an apple, general motors, us]
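Case folding is a one-line operation in Python; a minimal sketch reproducing the example above:

```python
def case_fold(phrases):
    # Map every phrase to lower case
    return [p.lower() for p in phrases]

assert case_fold(["CAT", "I have an Apple", "General Motors", "US"]) == \
    ["cat", "i have an apple", "general motors", "us"]
```

Note that folding "US" to "us" loses the distinction between the country and the pronoun, which is why case folding is not always desirable.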
Minimum Edit distance

 Spell correction
Misspelled inputs: Cornel, Korenel, Corrnel, Cononel

 Applications
• Search
• Machine translation
• Information extraction
• Speech recognition

 Bioinformatics
sample 1: [A,T,G,G,C]
sample 2: [A,T,C,G,C]
Minimum Edit Distance

 The minimum number of editing operations needed to transform one
string into another.

Aligning INTENTION with EXECUTION:

I N T E * N T I O N
* E X E C U T I O N
d s s   i s

 Editing operations
o Insertion, I
o Deletion, D
o Substitution, S
Minimum Edit Distance

 Operation cost
 I = 1
 D = 1
 S = (D + I) = 2

 Cost = 1+2+2+1+2 = 8
Minimum Edit Distance

 Ambiguity of minimum edit distance calculation

L E D A        L E D A *
D E A L        D E * A L
s   s s        s   d   i

Cost = 2+2+2 = 6    Cost = 2+1+1 = 4


Dynamic programming for Minimum
edit distance
 Dynamic programming: A tabular computation of D(n,m)
 Solving problems by combining solutions to subproblems.
 Bottom-up
 We compute D(i,j) for small i,j
 And compute larger D(i,j) based on previously computed smaller values
 i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Defining Minimum edit distance
(Levenshtein)
 Initialization
D(i,0) = i
D(0,j) = j
 Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)
0; if X(i) = Y(j)
 Termination:
D(N,M) is distance
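The initialization, recurrence, and termination above can be sketched directly in Python (a minimal implementation of the Levenshtein costs used here: insertion/deletion 1, substitution 2):

```python
def min_edit_distance(source, target):
    """Minimum edit distance with insert/delete cost 1, substitute cost 2."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                           # initialization: D(i,0) = i
    for j in range(1, m + 1):
        D[0][j] = j                           # initialization: D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + sub)   # substitution or match
    return D[n][m]                            # termination: D(N,M)

assert min_edit_distance("intention", "execution") == 8
assert min_edit_distance("leda", "deal") == 4
```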
The Edit distance table

             Destination (j)
          #   L   E   D   A
       #  0   1   2   3   4     Initialization:
Source D  1                     D(i,0) = i
  (i)  E  2                     D(0,j) = j
       A  3
       L  4
The Edit distance table

             Destination (j)
          #   L   E   D   A
       #  0   1   2   3   4
Source D  1   2   3
  (i)  E  2
       A  3
       L  4

i = 2 (index value: D), j = 2 (index value: L):
D(i,j) = min{D(1,2)+1, D(2,1)+1, D(1,1)+2} = min{2,2,2} = 2
The Edit distance table

             Destination (j)
          #   L   E   D   A
       #  0   1   2   3   4
Source D  1   2   3
  (i)  E  2
       A  3
       L  4

i = 2 (index value: D), j = 2 (index value: L):
D(i,j) = min{D(1,2)+1, D(2,1)+1, D(1,1)+2} = min{2,2,2} = 2

i = 2 (index value: D), j = 3 (index value: E):
D(i,j) = min{D(1,3)+1, D(2,2)+1, D(1,2)+2} = min{3,3,3} = 3
The Edit distance table

             Destination (j)
          #   L   E   D   A
       #  0   1   2   3   4
Source D  1   2   3   2   3
  (i)  E  2   3   2   3   4
       A  3   4   3   4   3
       L  4   3   4   5   4

i = 5 (index value: L), j = 5 (index value: A):
D(i,j) = min{D(4,5)+1, D(5,4)+1, D(4,4)+2} = min{4,6,6} = 4
DP for the minimum Edit Distance Table

N  9
O  8
I  7
T  6
N  5
E  4
T  3
N  2
I  1
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
The Edit Distance Table

N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
Context Free Grammar (CFG)
 A context-free grammar is a formal grammar used to generate
all possible strings of a given formal language.
Advantages of CFG

 CFGs are useful in several ways:

• Context-free grammar can describe most programming
languages.
• If the grammar is properly designed, an efficient parser can be
constructed automatically.
• Using associativity and precedence information, suitable
grammars for expressions can be constructed.
• Context-free grammar is capable of describing nested structures such as
balanced parentheses, matching begin-end pairs, corresponding if-then-
else constructs, and so on.
Context Free Grammar (CFG)

 A CFG is defined by a quadruple (N, T, P, S) with a finite set of grammar rules:

 N is a set of nonterminal symbols
 T is a set of terminal symbols
 P is a set of production rules
 S is the start symbol

Example: S -> Aa, A -> Ab | c
CFG
Example:
 N = {S,A}
 T = {a, b, c }
 P = {S-> Aa, A-> Ab, A-> c}
 S=S
Context Free Grammar (CFG)

 Production rules:

S → aSa
S → bSb
S→c

Check that the string abbcbba can be derived from the given CFG:

S ⇒ aSa
S ⇒ abSba
S ⇒ abbSbba
S ⇒ abbcbba
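Because every rule of this particular grammar wraps the same symbol around both ends, membership can be checked by peeling matching outer symbols; a minimal sketch (the function name `derivable` is hypothetical, and this shortcut works only for this grammar, not for CFGs in general):

```python
def derivable(s):
    """Check whether s is derivable from S -> aSa | bSb | c."""
    if s == "c":
        return True  # base case: S -> c
    # S -> aSa or S -> bSb: outer symbols must match, then recurse inward
    if len(s) >= 3 and s[0] == s[-1] and s[0] in "ab":
        return derivable(s[1:-1])
    return False

assert derivable("abbcbba")   # the derivation shown on the slide
assert derivable("c")
assert not derivable("abcbb") # mismatched outer symbols
```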
Categories of CFG

 Non-recursive CFG
S -> Aa
A -> b | c
Output: {ba, ca}
Derivation: S -> Aa -> ba

 Recursive CFG
S -> Aa
A -> Ab | c
Output: {ca, cba, cbba, cbbba, ...}
Derivation: S -> Aa -> Aba -> cba
Context Free Grammar (CFG)

Example of a CFG
S -> Aa
A -> Ab | c

Input: {c, b, b, a}
Derivation: S -> Aa -> Aba -> Abba -> cbba
Parsing using CFG
Parse tree

 A parse tree (also called a parsing tree, derivation tree, or concrete syntax tree)
is an ordered, rooted tree that represents the syntactic structure of a
string according to some context-free grammar. The term parse tree
itself is used primarily in computational linguistics.
Parse tree

 The constituency-based parse trees of constituency grammars (phrase
structure grammars) distinguish between terminal and non-terminal
nodes. The interior nodes are labeled by non-terminal categories of the
grammar, while the leaf nodes are labeled by terminal categories.

 S for sentence, the top-level structure in this example
 NP for noun phrase. The first (leftmost) NP, a single noun "John", serves as the subject of the sentence.
The second one is the object of the sentence.
 VP for verb phrase, which serves as the predicate
 V for verb. In this case, it's a transitive verb "hit".
 D for determiner, in this instance the definite article "the"
 N for noun
Reference

Chapter 2, Chapter 5
Question
Thank you
