Regular Expression

The document provides information about regular expressions (regex). It describes various elements of regex like repeaters, wildcard characters, character classes, grouping, alternation, anchors etc. It also gives examples of regex to validate a mobile number, email address, string with first uppercase character and lowercase letters with an optional digit.

Uploaded by

SIDDHI
Copyright: © All Rights Reserved
Module 1: NLP

Aarti Dharmani
Estimate bigram probabilities
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>

P(I|<s>) =
P(Sam|<s>) =
P(am|I) =
P(</s>|Sam) =
P(Sam|am) =
P(do|I) =
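
The estimates above can be checked with a short script. This is a minimal sketch, assuming the maximum-likelihood estimate P(w | prev) = count(prev w) / count(prev) and whitespace tokenization; the helper name bigram_prob is illustrative and not part of the slides.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)                   # count single tokens
    bigrams.update(zip(tokens, tokens[1:]))   # count adjacent token pairs


def bigram_prob(word, prev):
    """MLE estimate of P(word | prev) as an exact fraction."""
    return Fraction(bigrams[(prev, word)], unigrams[prev])


queries = [("I", "<s>"), ("Sam", "<s>"), ("am", "I"),
           ("</s>", "Sam"), ("Sam", "am"), ("do", "I")]
for word, prev in queries:
    print(f"P({word}|{prev}) = {bigram_prob(word, prev)}")
# P(I|<s>) = 2/3
# P(Sam|<s>) = 1/3
# P(am|I) = 2/3
# P(</s>|Sam) = 1/2
# P(Sam|am) = 1/2
# P(do|I) = 1/3
```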
Given the bigram counts and unigram counts of a dataset
Bigram counts (row = first word, column = second word):

          i     want  to    eat   chinese  food  lunch  spend
i         5     827   0     9     0        0     0      2
want      2     0     608   1     6        6     5      1
to        2     0     4     686   2        0     6      211
eat       0     0     2     0     16       2     42     0
chinese   1     0     0     0     0        82    1      0
food      15    0     15    0     1        4     0      0
lunch     2     0     0     0     0        1     0      0
spend     1     0     1     0     0        0     0      0

Unigram counts:

i     want  to    eat   chinese  food  lunch  spend
2533  927   2417  746   158      1093  341    278
Calculate the probability of a sentence
• P(I want chinese food to eat) = ?

• P(I) x P(want|I) x P(chinese|want) x P(food|chinese) x P(to|food) x P(eat|to) = ?
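
The conditional factors in this product can be read off the count tables. The sketch below assumes the rows of the bigram table are the first word and the columns the second word, and uses the estimate P(w | prev) = count(prev w) / count(prev). The leading term P(I) cannot be derived from the tables shown, so the code computes only the product of the conditional terms; the function names are illustrative.

```python
words = ["i", "want", "to", "eat", "chinese", "food", "lunch", "spend"]

# bigram_counts[r][c] = count of words[r] followed by words[c]
bigram_counts = [
    [5, 827, 0, 9, 0, 0, 0, 2],
    [2, 0, 608, 1, 6, 6, 5, 1],
    [2, 0, 4, 686, 2, 0, 6, 211],
    [0, 0, 2, 0, 16, 2, 42, 0],
    [1, 0, 0, 0, 0, 82, 1, 0],
    [15, 0, 15, 0, 1, 4, 0, 0],
    [2, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
unigram_counts = [2533, 927, 2417, 746, 158, 1093, 341, 278]
index = {w: k for k, w in enumerate(words)}


def cond_prob(word, prev):
    """Estimate P(word | prev) = count(prev word) / count(prev)."""
    return bigram_counts[index[prev]][index[word]] / unigram_counts[index[prev]]


def conditional_product(sentence):
    """Multiply P(w_k | w_{k-1}) over all adjacent word pairs (P of the first word excluded)."""
    tokens = sentence.lower().split()
    product = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        product *= cond_prob(word, prev)
    return product


print(cond_prob("want", "i"))                             # 827 / 2533
print(conditional_product("I want chinese food to eat"))
```

Multiplying the printed product by a value for P(I), however that term is defined in the course, gives the sentence probability asked for above.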
Regular Expressions
Regular expressions provide a powerful, flexible, and efficient method
for processing text.
The extensive pattern-matching notation of regular expressions
enables you to quickly parse large amounts of text to:
• Find specific character patterns.
• Validate text to ensure that it matches a predefined pattern (such as
an email address).

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Elements of Regular Expressions
1. Repeaters ( *, +, and { } )
These symbols act as repeaters: they tell the regex engine how many times the
preceding character (or group) may occur.

2. The asterisk symbol ( * )
It matches the preceding character (or group of characters) zero or more times,
with no upper limit.

3. The plus symbol ( + )
It matches the preceding character (or group of characters) one or more times,
with no upper limit.
4. The curly braces { … }
They repeat the preceding character (or group of characters) the number of times
given inside the braces: {n} means exactly n times, {m,n} between m and n times,
and {m,} at least m times.
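
A small illustration of the three repeaters, written with Python's re module as one possible regex flavor. The helper matches and the sample strings are made up for this sketch.

```python
import re


def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# '*' : zero or more of the preceding character
print(matches(r"ab*c", "ac"))       # True  (zero 'b's is allowed)
print(matches(r"ab*c", "abbbc"))    # True

# '+' : one or more of the preceding character
print(matches(r"ab+c", "abc"))      # True
print(matches(r"ab+c", "ac"))       # False (at least one 'b' is required)

# '{n}' : exactly n times; '{m,n}' : between m and n times
print(matches(r"a{3}", "aaa"))      # True
print(matches(r"a{3}", "aa"))       # False
print(matches(r"a{2,4}", "aaaa"))   # True
print(matches(r"a{2,4}", "aaaaa"))  # False
```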

5. Wildcard ( . )
The dot matches any single character, which is why it is called the wildcard
character. In most regex flavors it does not match a newline by default.
6. Optional character ( ? )
This symbol makes the preceding character optional: it may appear once or not at all in the string being matched.

7. The caret ( ^ ) symbol (setting the position for the match)
The caret requires the match to start at the beginning of the string or line.

8. The dollar ( $ ) symbol
It requires the match to occur at the end of the string, or just before a \n at the end of the line or string.
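
The wildcard, the optional marker and the two anchors can be tried out as follows. This is a sketch using Python's re module; other regex flavors may differ in details such as newline handling, and the sample strings are invented for illustration.

```python
import re

# '.' stands for any single character
dot_hits = re.findall(r"c.t", "cat cot c-t ct")
print(dot_hits)        # ['cat', 'cot', 'c-t']

# '?' makes the preceding character optional
optional_hits = re.findall(r"colou?r", "color colour")
print(optional_hits)   # ['color', 'colour']

# '^' anchors the match at the start, '$' at the end
starts_hello = re.search(r"^Hello", "Hello world") is not None
starts_world = re.search(r"^world", "Hello world") is not None
ends_world = re.search(r"world$", "Hello world") is not None
# In Python, '$' also matches just before a newline at the very end
ends_world_newline = re.search(r"world$", "Hello world\n") is not None
print(starts_hello, starts_world, ends_world, ends_world_newline)
# True False True True
```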
9. Character Classes
A character class matches any one of a set of characters. It is used to match the
most basic elements of text, such as a letter, a digit, a space, or a symbol.
10. [^set_of_characters] Negation:
Matches any single character that is not in set_of_characters. By default, the
match is case-sensitive.

11. [first-last] Character range:
Matches any single character in the range from first to last.

12. The Escape Symbol ( \ )
To match a literal ‘+’, ‘.’, or other special character, put a backslash ( \ ) before
it. The backslash tells the regex engine to treat the following character as an
ordinary character to match, not as an operator.
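
Character classes, negation, ranges and escaping can be seen side by side in the sketch below. It uses Python's re module, and the sample text is made up for the example.

```python
import re

text = "Room 42, price: 3.50+tax"

# Character class: any one of the listed characters
vowels = re.findall(r"[aeiou]", text)
print(vowels)         # ['o', 'o', 'i', 'e', 'a']

# Range: any single digit from 0 to 9
digits = re.findall(r"[0-9]", text)
print(digits)         # ['4', '2', '3', '5', '0']

# Negation: anything that is NOT a letter, digit or space
symbols = re.findall(r"[^a-zA-Z0-9 ]", text)
print(symbols)        # [',', ':', '.', '+']

# Classes are case-sensitive by default: 'B' is not in [a-z]
not_lower = re.findall(r"[^a-z]", "aBc")
print(not_lower)      # ['B']

# Escaping: '\.' and '\+' match the literal characters
literal_dot = re.findall(r"\.", text)
literal_plus = re.findall(r"\+", text)
print(literal_dot, literal_plus)   # ['.'] ['+']

# Without the backslash, '.' is the wildcard and matches every character here
unescaped_dot_count = len(re.findall(r".", text))
print(unescaped_dot_count == len(text))   # True
```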
13. Grouping Characters ( )
Several symbols of a regular expression can be grouped to act as a single unit.
To do this, wrap that part of the expression in parentheses ( ).

14. Vertical Bar ( | )
Matches any one of the alternatives separated by the vertical bar (|) character.
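
Grouping and alternation are often combined. The following sketch, again using Python's re module with invented sample strings, shows a repeater applied to a whole group and a choice between alternatives.

```python
import re

# A repeater after a group applies to the whole group
group_repeat = re.fullmatch(r"(ab)+", "ababab") is not None
# Without the group, '+' applies only to the 'b'
no_group_repeat = re.fullmatch(r"ab+", "ababab") is not None
print(group_repeat, no_group_repeat)   # True False

# '|' matches any one of the alternatives
animals = re.findall(r"cat|dog", "cat, dog, bird")
print(animals)                         # ['cat', 'dog']

# Alternation inside a group limits the choice to that position
m = re.fullmatch(r"gr(a|e)y", "gray")
chosen = m.group(1)                    # the alternative that matched
print(chosen)                          # a
grey_ok = re.fullmatch(r"gr(a|e)y", "grey") is not None
groy_ok = re.fullmatch(r"gr(a|e)y", "groy") is not None
print(grey_ok, groy_ok)                # True False
```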
Write Regular Expressions for the following cases
1. Mobile number: should start with 8 or 9 and have a total of 10 digits

• A regular expression for a mobile number that starts with 8 or 9 and has 10 digits in total:

• ^[89]\d{9}$

• Explanation:
• ^[89]: The caret (^) asserts the start of the string. [89] means the first digit should be 8 or 9.
• \d{9}: \d represents any digit, and {9} specifies that there should be exactly 9 digits following the first one.
• $: The dollar sign asserts the end of the string.
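
The mobile-number pattern can be exercised with a few test strings. This is a minimal sketch in Python; the function name is_valid_mobile and the sample numbers are illustrative. The re.ASCII flag is an added assumption so that \d accepts only the digits 0 to 9 (without it, Python's \d also matches other Unicode digits).

```python
import re

# Pattern from the slide; re.ASCII restricts \d to 0-9 (an assumption of this sketch)
MOBILE_RE = re.compile(r"^[89]\d{9}$", re.ASCII)


def is_valid_mobile(number):
    """Return True if `number` is exactly 10 digits and starts with 8 or 9."""
    return MOBILE_RE.fullmatch(number) is not None


print(is_valid_mobile("9876543210"))    # True
print(is_valid_mobile("8123456789"))    # True
print(is_valid_mobile("7876543210"))    # False (starts with 7)
print(is_valid_mobile("98765"))         # False (too short)
print(is_valid_mobile("98765432101"))   # False (11 digits)
```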
2. Email ID: should have the format username@domain.tld

• ^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$
• Explanation:
• ^[a-zA-Z0-9]+: Starts with one or more alphanumeric characters.
• @: Contains the "@" symbol.
• [a-zA-Z0-9]+: Followed by one or more alphanumeric characters for the
domain name.
• \.: Contains a dot before the top-level domain.
• [a-zA-Z]{2,}$: Ends with at least two alphabetic characters for the top-level
domain.
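
A quick check of this email pattern shows both what it accepts and where it is stricter than real-world addresses. The sketch uses Python's re module; the function name is_valid_email and the sample addresses are made up for illustration.

```python
import re

# Pattern from the slide: alphanumerics, '@', alphanumerics, '.', 2+ letters
EMAIL_RE = re.compile(r"^[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,}$")


def is_valid_email(address):
    """Return True if `address` fits the simple name@domain.tld pattern."""
    return EMAIL_RE.fullmatch(address) is not None


print(is_valid_email("user1@example.com"))       # True
print(is_valid_email("userexample.com"))         # False (no '@')
print(is_valid_email("user@example.c"))          # False (top-level domain too short)
# Limitations of this simple pattern:
print(is_valid_email("first.last@example.com"))  # False (dot in the name part)
print(is_valid_email("user@mail.example.com"))   # False (subdomain not allowed)
```

The last two cases are ordinary addresses in practice, so this pattern is best read as a classroom exercise and not as a complete email validator.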
3. First character uppercase, followed by lowercase letters, with at most one digit allowed in between
• ^[A-Z][a-z]*\d?[a-z]*$
• Explanation:
• ^[A-Z]: The caret (^) asserts the start of the string. [A-Z] means the first
character should be an uppercase letter.
• [a-z]*: Matches zero or more lowercase letters.
• \d?: Optionally matches one digit.
• [a-z]*$: Matches zero or more lowercase letters until the end of the string.
• This regular expression ensures that the first character is uppercase, and
the string can contain lowercase letters with at most one digit in between
them.
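
The behaviour of this pattern on a few sample strings is sketched below in Python. The function name is_valid_name and the sample strings are illustrative. Note that the pattern also accepts a digit at the very end or directly after the first letter, since the lowercase runs on both sides of the digit may be empty.

```python
import re

# Pattern from the slide
NAME_RE = re.compile(r"^[A-Z][a-z]*\d?[a-z]*$")


def is_valid_name(text):
    """Return True if `text` is one uppercase letter, then lowercase letters with at most one digit."""
    return NAME_RE.fullmatch(text) is not None


print(is_valid_name("Hello"))    # True  (no digit)
print(is_valid_name("Hel5lo"))   # True  (one digit in between)
print(is_valid_name("Hello5"))   # True  (digit at the end is also accepted)
print(is_valid_name("H"))        # True  (a single uppercase letter)
print(is_valid_name("hello"))    # False (first character not uppercase)
print(is_valid_name("He55lo"))   # False (two digits)
print(is_valid_name("HeLlo"))    # False (uppercase after the first character)
```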
