Natural Language Processing 5


Module 1: NLP

Aarti Dharmani
Estimate bigram probabilities
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>

P(I|<s>) =
P(Sam|<s>) =
P(am|I) =
P(</s>|Sam) =
P(Sam|am) =
P(do|I) =
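
The six estimates above can be checked with a short script. This is a minimal sketch, assuming the usual maximum-likelihood estimate P(w | prev) = C(prev w) / C(prev); the names `corpus`, `unigram_counts`, `bigram_counts` and `bigram_prob` are illustrative and do not come from the slides.

```python
from collections import Counter
from fractions import Fraction

# The three training sentences from the slide, with boundary markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Pair every token with its successor to count bigrams.
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev) as an exact fraction."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])


queries = [("I", "<s>"), ("Sam", "<s>"), ("am", "I"),
           ("</s>", "Sam"), ("Sam", "am"), ("do", "I")]
for word, prev in queries:
    print(f"P({word}|{prev}) = {bigram_prob(word, prev)}")
```

Under that assumption the script prints 2/3, 1/3, 2/3, 1/2, 1/2 and 1/3, in the order the probabilities are listed on the slide.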
Given the bigram counts and unigram counts of a dataset

Bigram counts (row = first word, column = second word):

             i   want     to    eat  chinese   food  lunch  spend
i            5    827      0      9        0      0      0      2
want         2      0    608      1        6      6      5      1
to           2      0      4    686        2      0      6    211
eat          0      0      2      0       16      2     42      0
chinese      1      0      0      0        0     82      1      0
food        15      0     15      0        1      4      0      0
lunch        2      0      0      0        0      1      0      0
spend        1      0      1      0        0      0      0      0

Unigram counts:

             i   want     to    eat  chinese   food  lunch  spend
count     2533    927   2417    746      158   1093    341    278
Calculate the probability of a sentence
• P(I want chinese food to eat) = ?

• P(I) x P(want|I) x P(chinese|want) x P(food|chinese) x P(to|food) x P(eat|to) = ?
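
The conditional factors in this product can be read off the count table. The sketch below assumes that rows of the bigram table are the first word and columns the second, and that each factor is estimated as C(prev w) / C(prev). The first factor P(I) cannot be derived from the table alone (it needs the total token count or a sentence-start count), so the code computes only the product of the five conditional factors; the function names are illustrative.

```python
words = ["i", "want", "to", "eat", "chinese", "food", "lunch", "spend"]

# Bigram counts from the table: rows are the first word, columns the second.
bigram_table = [
    [5, 827, 0, 9, 0, 0, 0, 2],
    [2, 0, 608, 1, 6, 6, 5, 1],
    [2, 0, 4, 686, 2, 0, 6, 211],
    [0, 0, 2, 0, 16, 2, 42, 0],
    [1, 0, 0, 0, 0, 82, 1, 0],
    [15, 0, 15, 0, 1, 4, 0, 0],
    [2, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
unigram = dict(zip(words, [2533, 927, 2417, 746, 158, 1093, 341, 278]))
index = {w: i for i, w in enumerate(words)}


def cond_prob(word, prev):
    """Estimate P(word | prev) as C(prev word) / C(prev)."""
    return bigram_table[index[prev]][index[word]] / unigram[prev]


def conditional_product(sentence):
    """Multiply P(w_i | w_{i-1}) over all adjacent word pairs in the sentence."""
    tokens = sentence.lower().split()
    product = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        product *= cond_prob(word, prev)
    return product


p = conditional_product("I want chinese food to eat")
print(p)  # multiply by P(I) to obtain the full sentence probability
```

For example, `cond_prob("want", "i")` evaluates 827 / 2533 and `cond_prob("food", "chinese")` evaluates 82 / 158.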
Regular Expressions
Regular expressions provide a powerful, flexible, and efficient method
for processing text.
The extensive pattern-matching notation of regular expressions enables
you to quickly parse large amounts of text to:
• Find specific character patterns.
• Validate text to ensure that it matches a predefined pattern (such as
an email address).

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
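
As an illustration, the pattern above can be tried with Python's `re` module. This is a minimal sketch: the sample addresses and the helper name `is_valid_email` are hypothetical, and the pattern is a simple classroom check, not a complete validator for every legal email address.

```python
import re

# The email pattern from the slide, compiled once for reuse.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)


def is_valid_email(text):
    """Return True if the text matches the slide's email pattern."""
    return EMAIL_PATTERN.match(text) is not None


match = EMAIL_PATTERN.match("user.name@example.com")
print(match.groups())                      # ('user.name', 'example', 'com')
print(is_valid_email("not-an-email"))      # False: no @ sign
print(is_valid_email("user@example.c"))    # False: last part needs 2 to 5 letters
```

The three parenthesised groups capture the part before the @, the domain name, and the final 2 to 5 letter suffix.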
Elements of Regular Expressions
1. Repeaters ( *, +, and { } )
These symbols act as repeaters: they tell the computer how many times the preceding character (or group) may occur.

2. The asterisk symbol ( * )
It tells the computer to match the preceding character (or set of characters) zero or more times, with no upper limit.

3. The plus symbol ( + )
It tells the computer to match the preceding character (or set of characters) one or more times, with no upper limit.
4. The curly braces { … }
It tells the computer to repeat the preceding character (or set of characters) as many times as the value inside the braces: {n} means exactly n times, {n,} at least n times, and {n,m} between n and m times.

5. Wildcard ( . )
The dot symbol matches any single character (in most engines, any character except a newline by default), which is why it is called the wildcard character.
6. Optional character ( ? )
This symbol tells the computer that the preceding character is optional: it may occur once or not at all in the string to be matched.

7. The caret ( ^ ) symbol (setting the position for the match)
The caret symbol tells the computer that the match must start at the beginning of the string or line.

8. The dollar ( $ ) symbol
It tells the computer that the match must occur at the end of the string, or before the \n at the end of the line or string.
9. Character Classes
A character class matches any one of a set of characters. It is used to match the
most basic element of a language like a letter, a digit, a space, a symbol, etc.
10. [^set_of_characters] Negation:
Matches any single character that is not in set_of_characters. By default, the match
is case-sensitive.

11. [first-last] Character range:
Matches any single character in the range from first to last.

12. The escape symbol ( \ )
To match a literal ‘+’, ‘.’ or another special character, add a backslash ( \ ) before that character. This tells the computer to treat the following character as a literal character to search for, not as a regular expression operator.
13. Grouping characters ( )
A set of symbols in a regular expression can be grouped together to act as a single unit. To do this, wrap that part of the regular expression in parentheses ( ).

14. Vertical bar ( | )
Matches any one element separated by the vertical bar (|) character.
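
The elements listed above can be tried out one at a time. The following is a small sketch using Python's `re.search`; the sample patterns and strings are made up for illustration, and the last column of each entry is the outcome one would expect from the descriptions above.

```python
import re

# (element, pattern, text, expected result of re.search)
examples = [
    ("asterisk *",     r"^ab*c$",      "ac",      True),   # zero b's is allowed
    ("asterisk *",     r"^ab*c$",      "abbbc",   True),
    ("plus +",         r"^ab+c$",      "ac",      False),  # needs at least one b
    ("braces {n}",     r"^a{3}$",      "aaa",     True),
    ("braces {n}",     r"^a{3}$",      "aaaa",    False),  # exactly three a's
    ("wildcard .",     r"^c.t$",       "c9t",     True),
    ("optional ?",     r"^colou?r$",   "color",   True),   # the u may be absent
    ("caret ^",        r"^end",        "The end", False),  # 'end' is not at the start
    ("dollar $",       r"end$",        "The end", True),
    ("class [ ]",      r"[aeiou]",     "rhythm",  False),  # no listed vowel present
    ("negation [^ ]",  r"[^0-9]",      "12345",   False),  # every character is a digit
    ("range [a-c]",    r"[a-c]",       "xyz",     False),
    ("escape \\",      r"1\+1",        "1+1",     True),   # \+ is a literal plus
    ("no escape",      r"1+1",         "1+1",     False),  # here + is a repeater
    ("grouping ( )",   r"^(ab)+$",     "aba",     False),  # the group 'ab' repeats as a unit
    ("vertical bar |", r"^(cat|dog)$", "dog",     True),
]


def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


results = [(label, pattern, text, found(pattern, text))
           for label, pattern, text, _ in examples]
for label, pattern, text, hit in results:
    print(f"{label:16} {pattern:14} {text!r:10} {'match' if hit else 'no match'}")
```

The two escape rows show why the backslash matters: without it, `1+1` means "one or more 1s followed by a 1", which does not occur in the string 1+1.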
Write Regular Expressions for the following cases
1. Mobile number: should start with 8 or 9 and have 10 digits in total
2. First character uppercase, followed by lowercase letters, with only one digit allowed in between
3. Email id, say [email protected]
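
One possible set of answers is sketched below; other patterns are equally acceptable. Case 2 is read here as "an uppercase letter, then lowercase letters with at most one digit somewhere between them", which is an assumption about the intended meaning. Case 3 reuses the email pattern shown earlier in these slides. The constant names, the `check` helper and the sample strings are hypothetical.

```python
import re

# Case 1: first digit 8 or 9, then nine more digits (10 in total).
MOBILE = r"^[89][0-9]{9}$"

# Case 2 (assumed reading): uppercase first letter, lowercase letters,
# and at most one digit placed between lowercase letters.
NAME_ONE_DIGIT = r"^[A-Z][a-z]+[0-9]?[a-z]+$"

# Case 3: the email pattern from the earlier slide.
EMAIL = r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"


def check(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(check(MOBILE, "9876543210"))           # True
print(check(MOBILE, "7876543210"))           # False: starts with 7
print(check(NAME_ONE_DIGIT, "Pass1word"))    # True
print(check(NAME_ONE_DIGIT, "Pass12word"))   # False: two digits
print(check(EMAIL, "student@example.com"))   # True
```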
