
Lecture 4 Regular Expression

Regular Expression

Uploaded by

ravleen3310

REGULAR EXPRESSION AND FINITE STATE AUTOMATA
REGULAR EXPRESSION
• A regular expression is a sequence of characters that specifies a match
pattern in text. Comparing this pattern against a string yields either a
match (true) or no match (false).
• Example: the regular expression

(a | bc)*

describes the language

{a, bc}* = {ε, a, bc, aa, abc, bca, ...}
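
The example can be tried out with Python's standard `re` module. This is only a sketch: the helper name `in_language` is made up for illustration, and the pattern uses `|` for alternation.

```python
import re

# Pattern for the example language {a, bc}*.
PATTERN = re.compile(r"(?:a|bc)*")


def in_language(s):
    """Return True if the whole string s belongs to {a, bc}*."""
    return PATTERN.fullmatch(s) is not None
```

With this helper, `in_language("abc")` is true (a followed by bc), while `in_language("ab")` is false because a lone b is not allowed.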

WHY REGULAR EXPRESSION?
• Regular expressions allow matching and manipulation of textual data.
• Regular expressions are a language for specifying text search strings. The
regular expression languages used for text searches in Word documents and
in online web searches are very similar.
Requirements:
  - Matching/finding
  - Doing something with the matched text
  - Validating data
  - Case-insensitive matching
  - Parsing data (e.g., HTML)
  - Converting data into a different form, etc.
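
A few of these tasks can be sketched with Python's `re` module. The sample text and the patterns below are assumptions chosen for illustration; the address pattern in particular is deliberately simplified and is not a real e-mail validator.

```python
import re

text = "Contact: Alice <alice@example.com>, BOB <bob@example.org>"

# Matching/finding: collect every address-like token (simplified pattern).
emails = re.findall(r"[\w.]+@[\w.]+", text)

# Doing something with matched text: mask the user part of each address.
masked = re.sub(r"[\w.]+@", "***@", text)


# Validation: accept only strings made of exactly five digits.
def is_five_digits(s):
    return re.fullmatch(r"\d{5}", s) is not None


# Case-insensitive matching: find "bob" regardless of letter case.
has_bob = re.search(r"bob", text, flags=re.IGNORECASE) is not None
```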

STRINGS AND LANGUAGES
• A string is a data type used in programming, like an integer or a
floating-point number, but it represents text rather than numbers.
• It consists of a sequence of characters, which may include spaces and
digits. For example, the word "hamburger" is a string. Even "12345" is a
string if it is written as text (for instance, in quotes) rather than as a number.
• Two important examples of alphabets used in programming languages are the
ASCII and EBCDIC character sets.

STRINGS (CONT.)
• A string is a finite sequence of symbols, such as 001. The length of a string
x, usually denoted |x|, is the total number of symbols in x.
• For example, 110001 is a string of length 6. A special string is the empty
string, denoted by ε, which has length 0 (zero).
• If x and y are strings, then the concatenation of x and y, written as x.y or
just xy, is the string formed by following the symbols of x with the symbols
of y.
• For example, abd.ce = abdce; that is, if x = abd and y = ce, then xy = abdce.
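
The same definitions can be checked in Python, where `len` gives |x| and `+` concatenates. This is a small sketch; the variable names are chosen here for illustration only.

```python
x = "abd"
y = "ce"

xy = x + y                # concatenation of x and y
length = len("110001")    # |110001|
empty = ""                # the empty string ε
```

Note that concatenating with the empty string changes nothing (`x + empty == x`), and that lengths add up under concatenation.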

THINGS TO REMEMBER
Languages generated by regular expressions = regular languages

EXAMPLE
• Regular expression

r = (a | b)* (a | bb)

L(r) = {a, bb, aa, abb, ba, bbb, ...}

EXAMPLE
• Regular expression

r = (0 | 1)* 00 (0 | 1)*

L(r) = { all strings containing the substring 00 }
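
The claim can be sanity-checked by brute force: for every binary string up to some length, the expression should match exactly when the string contains 00. This is a sketch using Python's `re` syntax, with the hypothetical helper name `agrees_up_to`; it tests the claim on short strings and does not prove it.

```python
import re
from itertools import product

# r = (0 | 1)* 00 (0 | 1)*, written in Python regex syntax.
R = re.compile(r"(?:0|1)*00(?:0|1)*")


def agrees_up_to(max_len):
    """Check that r matches s exactly when s contains '00', for all short s."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if (R.fullmatch(s) is not None) != ("00" in s):
                return False
    return True
```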

REGULAR EXPRESSION & FINITE STATE AUTOMATA
• Finite State Automata (FSA)
  - Describing patterns with graphs
  - Programs that keep track of state
• Regular Expressions (RE)
  - Describing patterns with regular expressions
  - Converting regular expressions to programs
• Note: the languages recognized by FSA and generated by RE are the same
(the regular languages).

REGULAR EXPRESSIONS
• Describe (generate) regular languages
• A pattern:
  - ε: the empty string
  - a: a literal character, stands for itself
• Operations
  - Concatenation, RS
  - Alternation, R|S
  - Closure (Kleene star), R*: the set of all strings that can be made by
    concatenating zero or more strings in R

REGULAR EXPRESSIONS
• There are three operators used to build regular expressions:
  - Union
    R|S: L(R|S) = L(R) ∪ L(S)
  - Concatenation
    RS: L(RS) = {rs : r ∈ L(R) and s ∈ L(S)}
  - Closure
    R*: L(R*) = {ε} ∪ L(R) ∪ L(RR) ∪ L(RRR) ∪ …
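
The three operators can be mirrored on languages represented as Python sets of strings. This is a minimal sketch under that assumption: the function names are hypothetical, and because the closure is infinite in general, `star` only returns the strings up to a chosen length.

```python
def union(A, B):
    """L(R|S): every string in either language."""
    return A | B


def concat(A, B):
    """L(RS): every string r + s with r in A and s in B."""
    return {r + s for r in A for s in B}


def star(A, max_len):
    """Strings of the closure A* whose length is at most max_len."""
    result = {""}          # zero concatenations give the empty string
    frontier = {""}
    while frontier:
        # Extend each newly found string by one more string from A.
        frontier = {
            w + s for w in frontier for s in A if len(w + s) <= max_len
        } - result
        result |= frontier
    return result
```

For instance, `star({"a", "bc"}, 3)` yields the strings of {a, bc}* of length at most 3.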

RE – EXAMPLES
• a|b* denotes {ε, a, b, bb, bbb, ...}
• (a|b)* denotes the set of all strings consisting of any number of a or b
symbols, including the empty string
• ab*(c|ε) denotes the set of strings starting with a, followed by zero or
more b's and, optionally, a final c
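
These three expressions translate almost directly into Python's `re` syntax. In the sketch below, an empty alternative stands in for ε, and the dictionary `EXAMPLES` and helper `matches` are hypothetical names.

```python
import re

# Each slide expression paired with an equivalent Python pattern.
EXAMPLES = {
    "a|b*": re.compile(r"a|b*"),
    "(a|b)*": re.compile(r"(?:a|b)*"),
    "ab*(c|ε)": re.compile(r"ab*(?:c|)"),  # empty alternative plays the role of ε
}


def matches(name, s):
    """Return True if the whole string s is in the language of EXAMPLES[name]."""
    return EXAMPLES[name].fullmatch(s) is not None
```

Note the difference precedence makes: `matches("a|b*", "ab")` is false, whereas `matches("(a|b)*", "ab")` is true.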

REGULAR EXPRESSIONS
• a|(ab)
• (a|(ab))|(c|(bc))
• a*
• a*b*
• (ab)*
• a|bc*d
• letter = a|b|c|…|z|A|B|C|…|Z|_
• digit = 0|1|2|3|4|5|6|7|8|9
• letter(letter|digit)*
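
The last three definitions describe identifiers as used in many programming languages. A sketch in Python, assuming character classes as shorthand for the long alternations (the constant and function names are hypothetical):

```python
import re

LETTER = r"[A-Za-z_]"   # a|b|...|z|A|B|...|Z|_
DIGIT = r"[0-9]"        # 0|1|...|9

# letter(letter|digit)*
IDENTIFIER = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*")


def is_identifier(s):
    """Return True if s is a letter followed by any letters or digits."""
    return IDENTIFIER.fullmatch(s) is not None
```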

FINITE STATE MACHINE
• Accepts or rejects a string
• Has a finite collection of states
• Has a single start state
• Moves from one state to another on a given input symbol
• Accepts if it is in an accepting state when the input ends
• The language accepted by a finite automaton is the set of input strings that
leave it in an accepting state

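A SIMPLE AUTOMATON
[Figure: state diagram with states q1 (start), q2 (accept) and q3.
Transitions: q1 loops on 0 and goes to q2 on 1; q2 loops on 1 and goes to
q3 on 0; q3 goes to q2 on 0,1.]

On input "0110", the machine goes:

q1 → q1 → q2 → q2 → q3 = "reject"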

A SIMPLE AUTOMATON
On input "101", the same machine goes:

q1 → q2 → q3 → q2 = "accept"
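
Both runs can be replayed with a table-driven simulation. The transition table below is an assumption: it is reconstructed from the two traces and the edge labels, since the diagram itself is not reproduced here. The names `DELTA`, `trace` and `accepts` are hypothetical.

```python
# Transition table reconstructed from the traces above (an assumption).
DELTA = {
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q3", ("q2", "1"): "q2",
    ("q3", "0"): "q2", ("q3", "1"): "q2",
}
START = "q1"
ACCEPT = {"q2"}


def trace(word):
    """Return the list of states visited while reading word."""
    states = [START]
    for symbol in word:
        states.append(DELTA[(states[-1], symbol)])
    return states


def accepts(word):
    """Accept if the machine ends in an accepting state."""
    return trace(word)[-1] in ACCEPT
```

Calling `trace("0110")` reproduces the rejecting run, and `trace("101")` reproduces the accepting one.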

STANDARD REPRESENTATIONS OF REGULAR LANGUAGES
Regular languages are represented by:
• DFAs
• NFAs
• Regular expressions

NON-DETERMINISTIC FINITE AUTOMATA
• A nondeterministic finite automaton (NFA) M is defined by a 5-tuple
M = (Q, Σ, δ, q0, F), with
  - Q: finite set of states
  - Σ: finite alphabet
  - δ: transition function δ: Q × (Σ ∪ {ε}) → P(Q), where P(Q) is the power set of Q
  - q0 ∈ Q: start state
  - F ⊆ Q: set of accepting states

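A NONDETERMINISTIC AUTOMATON
[Figure: state diagram with states q1 (start), q2, q3 and q4 (accept).
Transitions: q1 loops on 0,1 and goes to q2 on 1; q2 goes to q3 on 0 or ε;
q3 goes to q4 on 1; q4 loops on 0,1.]

This automaton accepts "0110" because there is a possible path that leads to
an accepting state, namely:
q1 → q1 → q2 → q3 → q4 → q4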
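
A nondeterministic machine can be simulated by tracking the set of all states it could be in. The sketch below assumes the transition structure implied by the edge labels and the accepting path above, and represents ε as the empty string; all names are hypothetical.

```python
EPS = ""  # stands for an ε-move

# Transitions assumed from the edge labels and the accepting path.
DELTA = {
    ("q1", "0"): {"q1"},
    ("q1", "1"): {"q1", "q2"},
    ("q2", "0"): {"q3"},
    ("q2", EPS): {"q3"},
    ("q3", "1"): {"q4"},
    ("q4", "0"): {"q4"},
    ("q4", "1"): {"q4"},
}
START = "q1"
ACCEPT = {"q4"}


def eps_closure(states):
    """Add every state reachable through ε-moves alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for nxt in DELTA.get((q, EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def nfa_accepts(word):
    """Accept if at least one possible path ends in an accepting state."""
    current = eps_closure({START})
    for symbol in word:
        moved = set()
        for q in current:
            moved |= DELTA.get((q, symbol), set())
        current = eps_closure(moved)
    return bool(current & ACCEPT)
```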

DETERMINISTIC FINITE AUTOMATA
• Theoreticians have developed a number of models to describe computation.
• The simplest of these models is the deterministic finite automaton (DFA).
• Deterministic: the machine is always in exactly one state, and on receiving
a given symbol it moves to exactly one known state.
• Finite: the machine has only a finite number of states.
• Automata (singular: automaton) refers to machines/robots.

DETERMINISTIC FINITE AUTOMATA
• DFAs recognize strings.
• If the input ends and the DFA is in an accept state, then the string is
recognized.
• A language can be described as a set of strings.
• A language is called a regular language if some finite automaton
recognizes it.
• A deterministic finite automaton (DFA) M is defined by a 5-tuple
M = (Q, Σ, δ, q0, F)
  - Q: finite set of states
  - Σ: finite alphabet
  - δ: transition function δ: Q × Σ → Q
  - q0 ∈ Q: start state
  - F ⊆ Q: set of accepting states

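DETERMINISTIC FINITE AUTOMATA
[Figure: state diagram of the automaton from the "A SIMPLE AUTOMATON" slides.]

M = (Q, Σ, δ, q1, F)
• states Q = {q1, q2, q3}
• alphabet Σ = {0, 1}
• start state q1
• accept states F = {q2}
• transition function δ:

        0     1
  q1    q1    q2
  q2    q3    q2
  q3    q2    q2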

DETERMINISTIC FINITE AUTOMATA EXAMPLE

• Accept all strings that end with a 1
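
Since the diagram for this example is not reproduced here, the following is one possible two-state machine, not necessarily the one drawn on the slide. The state names `s0` (last symbol was not 1) and `s1` (last symbol was 1) are assumptions.

```python
# One possible DFA: s1 means "the last symbol read was 1".
DELTA = {
    ("s0", "0"): "s0", ("s0", "1"): "s1",
    ("s1", "0"): "s0", ("s1", "1"): "s1",
}


def ends_with_one(word):
    """Run the DFA on a binary string and accept if it stops in s1."""
    state = "s0"
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state == "s1"
```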

DETERMINISTIC FINITE AUTOMATA EXAMPLE

• Accept all strings with an odd number of 1's
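
A standard way to build this machine is to remember only the parity of the 1's seen so far. The sketch below assumes that construction, with hypothetical state names `even` and `odd`; the slide's own diagram is not reproduced here.

```python
# Two states track whether the number of 1's read so far is even or odd.
DELTA = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd", ("odd", "1"): "even",
}


def has_odd_ones(word):
    """Run the DFA on a binary string and accept if it stops in 'odd'."""
    state = "even"
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state == "odd"
```

Reading a 0 never changes the state, while each 1 flips it, so the final state records the parity of the count.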

WHY STUDY REGULAR EXPRESSION AND DFA?

• Test if a string matches some pattern.
• Scan for virus signatures.
• Process natural language.
• Search for information using Google.
• Search for markers in the human genome.
• Access information in digital libraries.
• Search and replace in a word processor.
• Filter text.

REGULAR EXPRESSIONS IN NATURAL LANGUAGE PROCESSING
A regular expression (RegEx) is a sequence of characters mainly used to find or
replace patterns embedded in text.
Consider this example. Suppose we have a list of friends: Sunil, Sumit, Ankit,
Surjeet and Surabhi.

Now suppose we want to select only the names on this list that match a certain
pattern, such as the following:

REGULAR EXPRESSIONS IN NATURAL LANGUAGE PROCESSING
• The names whose first two letters are S and U, followed by exactly three
positions that can each be taken by any letter.
• Which names fit this criterion? Going one by one, the names Sunil and Sumit
fit, as they begin with S and U and have exactly three more letters after
that.
• The remaining three names do not fit: Ankit starts with the letter A,
whereas Surjeet and Surabhi have more than three characters after S and U.
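
The selection can be written as a short Python sketch. The pattern `Su[A-Za-z]{3}` is one assumed way to express "S, U, then exactly three letters"; the variable names are hypothetical.

```python
import re

names = ["Sunil", "Sumit", "Ankit", "Surjeet", "Surabhi"]

# S and u, followed by exactly three letters, and nothing else.
pattern = re.compile(r"Su[A-Za-z]{3}")

# fullmatch requires the whole name to fit, so longer names are rejected.
selected = [name for name in names if pattern.fullmatch(name)]
```

Using `fullmatch` rather than `match` matters here: `match` would also accept Surjeet and Surabhi, because their first five characters fit the pattern.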

REGULAR EXPRESSIONS IN NATURAL LANGUAGE PROCESSING
• What we have done here is take a pattern (or criterion) and a list of names,
and find the names that match the given pattern. That is exactly how regular
expressions work.
• RegEx provides different types of patterns to recognize different strings
of characters.

FINITE STATE MACHINES (DFA) IN NLP TOKEN RECOGNITION

• State machine that recognizes the strings "he", "hers", "his", "she".
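
The diagram for this machine is not reproduced here, so the sketch below assumes one simple construction: a trie-shaped DFA in which words with a shared prefix share states. The function names are hypothetical, and the machine accepts a token only if it is exactly one of the four words.

```python
WORDS = ["he", "hers", "his", "she"]


def build_trie_dfa(words):
    """Build (delta, accepting) for a trie-shaped DFA; state 0 is the start."""
    delta = {}            # (state, character) -> next state
    accepting = set()
    next_state = 1
    for word in words:
        state = 0
        for ch in word:
            if (state, ch) not in delta:
                delta[(state, ch)] = next_state
                next_state += 1
            state = delta[(state, ch)]
        accepting.add(state)  # the state reached after the full word
    return delta, accepting


def recognizes(token, delta, accepting):
    """Follow the transitions; a missing transition means rejection."""
    state = 0
    for ch in token:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accepting


DELTA, ACCEPTING = build_trie_dfa(WORDS)
```

Because "he", "hers" and "his" share the prefix h, they share the first transition out of the start state.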

SUMMARY
• DFAs and NFAs accept exactly the same set of languages: the regular
languages.
• NFAs are often easier to design and may have exponentially fewer states
than an equivalent DFA.
• But only a DFA can be run directly, one state at a time; an NFA must first
be converted to a DFA or simulated by tracking sets of states.
