Lecture 4 Regular Expression
REGULAR EXPRESSION AND FINITE STATE AUTOMATA
REGULAR EXPRESSION
A regular expression is a sequence of characters that specifies a match
pattern in text. Comparing this pattern against a string yields either true
(the string matches) or false (it does not).
Example:
(a|bc)*
describes the language {ε, a, bc, aa, abc, bca, bcbc, …}
WHY REGULAR EXPRESSION?
Regular expressions allow matching and manipulation of textual data.
Regular expressions are a language for specifying text search strings. The
regular expression languages used for text searches in Word documents and
online Web searches are very similar.
Requirements:
Matching/Finding
Doing something with matched text
Validation of data
Case insensitive matching
Parsing data (e.g., HTML)
Converting data into a different form, etc.
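Several of these requirements can be illustrated in a few lines. The sketch below assumes Python's re module (the slides do not name a tool); the sample text, addresses, and date are made up for illustration.

```python
import re

# Made-up sample text, used only for illustration.
text = "Contact: Alice <alice@example.com>, BOB <bob@example.com>"

# Matching/finding: collect every substring that looks like an e-mail address.
emails = re.findall(r"[\w.]+@[\w.]+", text)

# Doing something with matched text: mask the user part of each address.
masked = re.sub(r"[\w.]+@", "***@", text)

# Validation of data: accept only strings made of exactly five digits.
def is_five_digits(s):
    return re.fullmatch(r"[0-9]{5}", s) is not None

# Case-insensitive matching: find "bob" regardless of letter case.
has_bob = re.search(r"bob", text, flags=re.IGNORECASE) is not None

# Converting data into a different form: DD/MM/YYYY -> YYYY-MM-DD.
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", "25/12/2024")

print(emails)
print(masked)
print(is_five_digits("12345"), is_five_digits("1234a"))
print(has_bob)
print(iso)
```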
STRINGS AND LANGUAGES
A string is a data type used in programming, like an integer or a
floating-point number, but it represents text rather than numbers.
It consists of a sequence of characters, which may include spaces and
digits. For example, the word "hamburger" is a string. Even "12345" is a
string if it is written as text (for example, in quotes) rather than as a
number.
Two important examples of programming language alphabets are the ASCII and
EBCDIC character sets.
STRINGS (CONT.)
A string is a finite sequence of symbols, such as 001. The length of a string
x, usually denoted |x|, is the total number of symbols in x.
For example, 110001 is a string of length 6. A special string is the empty
string, denoted by ε, which has length 0 (zero).
If x and y are strings, then the concatenation of x and y, written as x.y or
just xy, is the string formed by following the symbols of x with the symbols
of y.
For example, abd.ce = abdce; that is, if x = abd and y = ce, then xy = abdce.
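Length and concatenation map directly onto built-in string operations. A minimal sketch, assuming Python, with the empty string ε written as "":

```python
# Strings from the examples above, written as Python str values.
x = "abd"
y = "ce"
empty = ""  # the empty string ε

# Concatenation xy: the symbols of x followed by the symbols of y.
xy = x + y

print(len("110001"))  # the length |x| of the string 110001
print(len(empty))     # the length of the empty string
print(xy)             # the concatenation of abd and ce
print(x + empty == x) # concatenating ε leaves a string unchanged
```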
THINGS TO REMEMBER
The languages generated by regular expressions are exactly the regular
languages.
EXAMPLE
Regular expression
r = (a|b)*(a|bb)
EXAMPLE
Regular expression
r = (0|1)*00(0|1)*
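Reading the expression above as (0|1)*00(0|1)*, it should match exactly the binary strings that contain at least two consecutive zeros. The sketch below checks a few strings; it assumes Python's re module, and the helper name in_language is hypothetical.

```python
import re

# The pattern r = (0|1)*00(0|1)*, written in Python regex syntax.
PATTERN = re.compile(r"(0|1)*00(0|1)*")

def in_language(s):
    """Return True if the whole string s is matched by the pattern."""
    return PATTERN.fullmatch(s) is not None

for s in ["00", "1001", "0101", "111", ""]:
    print(repr(s), in_language(s))
```

Using fullmatch rather than search matters here: the expression describes whole strings of the language, not substrings found somewhere inside a longer text.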
REGULAR EXPRESSION & FINITE STATE AUTOMATA
Finite State Automata (FSA)
Describing patterns with graphs
Programs that keep track of state
Regular Expressions (RE)
Describing patterns with regular expressions
Converting regular expressions to programs
REGULAR EXPRESSIONS
Describe (generate) Regular Languages
A pattern:
ε – the empty string
a – a literal character, stands for itself
Operations
Concatenation, RS
Alternation, R|S
Closure (Kleene Star) – R*, the set of all strings that can be made by
concatenating zero or more strings in R
REGULAR EXPRESSIONS
There are three operators used to build regular expressions:
Union
R|S – L(R|S) = L(R) ∪ L(S)
Concatenation
RS – L(RS) = {rs : r ∈ L(R) and s ∈ L(S)}
Closure
R* – L(R*) = {ε} ∪ L(R) ∪ L(RR) ∪ L(RRR) ∪ …
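The three operators can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the function names are hypothetical, and because the true closure is infinite, closure() here stops after max_repeats concatenations.

```python
from typing import Set

def union(r: Set[str], s: Set[str]) -> Set[str]:
    """L(R|S): every string that is in L(R) or in L(S)."""
    return r | s

def concat(r: Set[str], s: Set[str]) -> Set[str]:
    """L(RS): every string xy with x in L(R) and y in L(S)."""
    return {x + y for x in r for y in s}

def closure(r: Set[str], max_repeats: int) -> Set[str]:
    """Finite approximation of L(R*): concatenations of 0..max_repeats strings."""
    result = {""}   # zero repetitions give the empty string ε
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, r)   # one more string from r appended
        result |= layer
    return result

print(sorted(union({"a"}, {"b"})))
print(sorted(concat({"a", "b"}, {"c"})))
print(sorted(closure({"ab"}, 2)))
```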
RE – EXAMPLES
a|b* denotes {ε, a, b, bb, bbb, …}
ab*(c|ε) denotes the set of strings starting with a, then zero or more b's,
and finally optionally a c.
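The description "starting with a, then zero or more b's and finally optionally a c" can be checked mechanically and compared with a|b*. The sketch assumes Python's re module and writes the described pattern as ab*c?, where c? is Python's shorthand for "c or nothing"; the helper name matches is hypothetical.

```python
import re

# a, then zero or more b's, then an optional c.
DESCRIBED = re.compile(r"ab*c?")
# Either a single a, or zero or more b's.
A_OR_BSTAR = re.compile(r"a|b*")

def matches(pattern, s):
    """Return True if pattern matches the whole string s."""
    return pattern.fullmatch(s) is not None

for s in ["a", "abbc", "ac", "b", "abcc"]:
    print(repr(s), matches(DESCRIBED, s), matches(A_OR_BSTAR, s))
```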
REGULAR EXPRESSIONS
a|(ab)
(a|(ab))|(c|(bc))
a*
a*b*
(ab)*
a|bc*d
letter = a|b|c|…|z|A|B|C|…|Z|_
digit = 0|1|2|3|4|5|6|7|8|9
letter(letter|digit)*
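The last three lines describe identifiers: a letter (or underscore) followed by any number of letters or digits. A minimal sketch, assuming Python's re module and using character classes as shorthand for the long alternations; is_identifier is a hypothetical helper name.

```python
import re

LETTER = r"[A-Za-z_]"   # a|b|c|...|z|A|B|C|...|Z|_
DIGIT = r"[0-9]"        # 0|1|2|...|9

# letter(letter|digit)*
IDENTIFIER = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def is_identifier(s):
    """Return True if the whole string s has the form letter(letter|digit)*."""
    return IDENTIFIER.fullmatch(s) is not None

for s in ["x1", "_tmp", "1x", "", "a-b"]:
    print(repr(s), is_identifier(s))
```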
FINITE STATE MACHINE
Accepts or rejects a string
A finite collection of states
Has a single start state
Transition from one state to another on a given input
Machine accepts if it is in an accepting state at the end of the input, that
is, after every input symbol has been read
The language accepted by a finite automaton is the set of input strings that
end up in accepting states
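A SIMPLE AUTOMATON
[Figure: state diagram of an automaton with states q1, q2, q3, a marked start state and a marked accept state; the edges are labeled 0, 1, and 0,1.]
[Figure: diagram relating DFAs, NFAs, and regular expressions.]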
NON-DETERMINISTIC FINITE AUTOMATA
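A nondeterministic finite automaton (NFA) M is defined by a 5-tuple
M = (Q, Σ, δ, q0, F), with
Q a finite set of states,
Σ a finite input alphabet,
δ the transition function, which maps a state and an input symbol to a set of
possible next states,
q0 ∈ Q the start state,
F ⊆ Q the set of accept states.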
A NONDETERMINISTIC AUTOMATON
[Figure: state diagram of a nondeterministic automaton with states q1, q2, q3, q4; edge labels include 0,1 and 1.]
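A nondeterministic automaton can be simulated by keeping track of the set of states it could be in after each symbol. The edges of the figure are not recoverable from the text, so the four-state machine below is a hypothetical example chosen for illustration: it accepts binary strings containing the substring 101. The names DELTA, START, ACCEPT, and nfa_accepts are likewise assumptions.

```python
from typing import Dict, Set, Tuple

# Hypothetical NFA (an assumption for illustration, not read off the figure):
# it accepts binary strings that contain the substring 101.
DELTA: Dict[Tuple[str, str], Set[str]] = {
    ("q1", "0"): {"q1"},
    ("q1", "1"): {"q1", "q2"},   # nondeterministic choice on input 1
    ("q2", "0"): {"q3"},
    ("q3", "1"): {"q4"},
    ("q4", "0"): {"q4"},
    ("q4", "1"): {"q4"},
}
START = "q1"
ACCEPT = {"q4"}

def nfa_accepts(s: str) -> bool:
    """Simulate the NFA by tracking the set of states it could be in."""
    current = {START}
    for symbol in s:
        # Follow every possible transition from every current state.
        current = {nxt
                   for state in current
                   for nxt in DELTA.get((state, symbol), set())}
    # Accept if at least one possible run ends in an accept state.
    return bool(current & ACCEPT)

for s in ["101", "0101", "1001", ""]:
    print(repr(s), nfa_accepts(s))
```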
DETERMINISTIC FINITE AUTOMATA
DFAs recognize strings.
If the input ends and the DFA is in an accept state, then the string is
recognized.
A language can be described as a set of strings.
A language is called a regular language if some finite automaton
recognizes it.
A deterministic finite automaton (DFA) M is defined by a 5-tuple
M = (Q, Σ, δ, q0, F)
DETERMINISTIC FINITE AUTOMATA
M = (Q, Σ, δ, q1, F)
states Q = {q1, q2, q3}
alphabet Σ = {0, 1}
start state q1
accept states F = {q2}
transition function δ:
[Figure: state diagram with states q1, q2, q3; the edges are labeled 0, 1, and 0,1.]
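A DFA like this one can be run with a simple table lookup. The states, alphabet, start state, and accept states below are taken from the definition above. The transition function is not shown in the text, so the table DELTA is an assumption, chosen only to be consistent with the edge labels that survive from the diagram (0, 1, and 0,1). The name dfa_accepts is hypothetical.

```python
STATES = {"q1", "q2", "q3"}
ALPHABET = {"0", "1"}
START = "q1"
ACCEPT = {"q2"}

# ASSUMED transition function δ (not given in the text).
DELTA = {
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q3", ("q2", "1"): "q2",
    ("q3", "0"): "q2", ("q3", "1"): "q2",
}

def dfa_accepts(s):
    """Run the DFA on s; accept if it ends in an accept state."""
    state = START
    for symbol in s:
        if symbol not in ALPHABET:
            return False   # reject strings over a different alphabet
        state = DELTA[(state, symbol)]   # exactly one next state
    return state in ACCEPT

for s in ["1", "10", "100", "0", ""]:
    print(repr(s), dfa_accepts(s))
```

Under this assumed table the machine accepts strings that contain at least one 1 and have an even number of 0s after the last 1.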
DETERMINISTIC FINITE AUTOMATA EXAMPLE
WHY STUDY REGULAR EXPRESSION AND DFA
REGULAR EXPRESSIONS IN NATURAL LANGUAGE PROCESSING
A regular expression (RegEx) is a sequence of characters mainly used to find or
replace patterns embedded in text.
Consider this example: suppose we have a list of friends: Sunil, Sumit, Ankit,
Surjeet, and Surabhi.
Suppose we want to select only the names on this list that match a certain
pattern, such as the following:
The pattern: the first two letters are S and U, followed by exactly three
positions, each of which can be taken by any letter.
Which names fit this criterion? Going one by one, the names Sunil and Sumit
fit, as they have S and U at the beginning and exactly three more letters
after that.
The other three names do not meet the criterion: Ankit starts with the letter
A, whereas Surjeet and Surabhi have more than three characters after S and U.
What we have done here is take a pattern (or criterion) and a list of names,
and find the names that match the given pattern.
That is exactly how regular expressions work.
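The same selection can be written as an actual regular expression. A minimal sketch, assuming Python's re module; the pattern Su[A-Za-z]{3} is one way to express "S and U followed by exactly three letters".

```python
import re

friends = ["Sunil", "Sumit", "Ankit", "Surjeet", "Surabhi"]

# "Su" followed by exactly three letters. fullmatch requires the whole
# name to fit the pattern, which rules out the longer names.
pattern = re.compile(r"Su[A-Za-z]{3}")

matches = [name for name in friends if pattern.fullmatch(name)]
print(matches)  # ['Sunil', 'Sumit']
```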
FINITE STATE MACHINES (DFA) IN NLP TOKEN RECOGNITION
SUMMARY
DFAs and NFAs accept exactly the same class of languages: the regular
languages.
NFAs are easier to design and may have exponentially fewer states than an
equivalent DFA.
But only a DFA can be implemented directly; an NFA is either converted to a
DFA or simulated by tracking a set of states.
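One standard way to turn an NFA into an implementable DFA is the subset construction, in which each DFA state is a set of NFA states. The sketch below assumes an NFA without ε-transitions; the function names and the three-state example machine (binary strings whose second-to-last symbol is 1) are hypothetical, chosen for illustration.

```python
from typing import Dict, FrozenSet, Set, Tuple

def nfa_to_dfa(delta: Dict[Tuple[str, str], Set[str]],
               start: str, accept: Set[str], alphabet: Set[str]):
    """Subset construction for an NFA without ε-transitions."""
    dfa_start = frozenset({start})
    dfa_delta: Dict[Tuple[FrozenSet[str], str], FrozenSet[str]] = {}
    seen = {dfa_start}
    todo = [dfa_start]
    while todo:
        current = todo.pop()
        for symbol in sorted(alphabet):
            # The DFA target is the set of all NFA states reachable
            # from any state in the current subset on this symbol.
            target = frozenset(
                nxt
                for state in current
                for nxt in delta.get((state, symbol), set())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is accepting if it contains at least one NFA accept state.
    dfa_accept = {subset for subset in seen if subset & accept}
    return seen, dfa_delta, dfa_start, dfa_accept

def dfa_run(dfa_delta, dfa_start, dfa_accept, s: str) -> bool:
    """Run the constructed DFA; s must use only symbols of the alphabet."""
    state = dfa_start
    for symbol in s:
        state = dfa_delta[(state, symbol)]
    return state in dfa_accept

# Hypothetical NFA: binary strings whose second-to-last symbol is 1.
EXAMPLE_DELTA = {
    ("p", "0"): {"p"},
    ("p", "1"): {"p", "q"},
    ("q", "0"): {"r"},
    ("q", "1"): {"r"},
}
states, d, s0, f = nfa_to_dfa(EXAMPLE_DELTA, "p", {"r"}, {"0", "1"})
print(len(states))
print(dfa_run(d, s0, f, "0110"), dfa_run(d, s0, f, "01"))
```

Only the subsets reachable from the start state are built, so the resulting DFA is often much smaller than the worst case of one state per subset.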