This document provides an introduction to unigram language models for natural language processing. It explains that unigram models assign a probability to each word independently of context by calculating the frequency of that word in a training corpus. The document outlines how to estimate these probabilities using maximum likelihood estimation and adjust for unknown words. It also describes how to evaluate language models on test data by calculating the likelihood, perplexity, and coverage of the model for the test sentences. Exercises are provided to implement a program for training a unigram model from data and testing it on a corpus.


NLP Programming Tutorial 1: Unigram Language Models

Graham Neubig, Nara Institute of Science and Technology (NAIST)

Language Model "asics

NLP Programming Tutorial 1 Unigram Language Model

Why Language Models?

We have an English speech recognition system. Which answer is better?

Speech →
  W1 = speech recognition system
  W2 = speech cognition system
  W3 = speck podcast histamine
  W4 = 音声認識システム

Language models tell us the answer!



Probabilistic Language Models

Language models assign a probability to each sentence:

  W1 = speech recognition system    P(W1) = 4.021 × 10^-3
  W2 = speech cognition system      P(W2) = 8.932 × 10^-4
  W3 = speck podcast histamine      P(W3) = 2.432 × 10^-7
  W4 = 音声認識システム             P(W4) = 9.124 × 10^-23

We want P(W1) > P(W2) > P(W3) > P(W4)
(or P(W4) > P(W1), P(W2), P(W3) for Japanese?)

Calculating Sentence Probabilities

We want the probability of

  W = speech recognition system

Represent this mathematically as:

  P(|W| = 3, w1="speech", w2="recognition", w3="system")

Using the chain rule, this can be written as:

  P(|W| = 3, w1="speech", w2="recognition", w3="system") =
      P(w1="speech" | w0="<s>")
    × P(w2="recognition" | w0="<s>", w1="speech")
    × P(w3="system" | w0="<s>", w1="speech", w2="recognition")
    × P(w4="</s>" | w0="<s>", w1="speech", w2="recognition", w3="system")

NOTE: sentence start <s> and end </s> symbols
NOTE: P(w0="<s>") = 1

Incremental Computation

The previous equation can be written as a product:

  P(W) = ∏_{i=1}^{|W|+1} P(w_i | w_0 … w_{i-1})

How do we decide the probability P(w_i | w_0 … w_{i-1})?

Maximum Likelihood Estimation

Count word strings in the corpus and take the fraction:

  P(w_i | w_1 … w_{i-1}) = c(w_1 … w_i) / c(w_1 … w_{i-1})

Training corpus:
  i live in osaka . </s>
  i am a graduate student . </s>
  my school is in nara . </s>

  P(live | <s> i) = c(<s> i live) / c(<s> i) = 1 / 2 = 0.5
  P(am | <s> i)   = c(<s> i am) / c(<s> i)   = 1 / 2 = 0.5
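To make the counting concrete, here is a minimal Python sketch (an illustration, not part of the original slides; the toy corpus and contexts are hard-coded) that reproduces the two estimates above:

    # Maximum-likelihood estimates over the toy corpus.
    # counts[(w_1, ..., w_i)] stores c(w_1 ... w_i) for every sentence prefix.
    from collections import defaultdict

    corpus = [
        "<s> i live in osaka . </s>",
        "<s> i am a graduate student . </s>",
        "<s> my school is in nara . </s>",
    ]

    counts = defaultdict(int)
    for line in corpus:
        words = line.split()
        for i in range(1, len(words) + 1):
            counts[tuple(words[:i])] += 1

    def p_ml(history, word):
        # P(word | history) = c(history, word) / c(history)
        return counts[tuple(history) + (word,)] / counts[tuple(history)]

    print(p_ml(["<s>", "i"], "live"))  # 1 / 2 = 0.5
    print(p_ml(["<s>", "i"], "am"))    # 1 / 2 = 0.5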

Problem with Full Estimation

Weak when counts are low:

Training:
  i live in osaka . </s>
  i am a graduate student . </s>
  my school is in nara . </s>

Test:
  <s> i live in nara . </s>

  P(nara | <s> i live in) = 0 / 1 = 0
  P(W = <s> i live in nara . </s>) = 0

Unigram Model

Do not use history:

  P(w_i | w_1 … w_{i-1}) ≈ P(w_i) = c(w_i) / Σ_w̃ c(w̃)

  i live in osaka . </s>
  i am a graduate student . </s>
  my school is in nara . </s>

  P(nara) = 1/20 = 0.05
  P(i)    = 2/20 = 0.10
  P(</s>) = 3/20 = 0.15

  P(W = i live in nara . </s>)
    = 0.10 × 0.05 × 0.10 × 0.05 × 0.15 × 0.15 = 5.625 × 10^-7
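The same idea for the unigram case, as a short sketch (again illustrative, not the tutorial's code) that reproduces the numbers above:

    # Unigram MLE on the toy corpus, then P(W) for a test sentence.
    from collections import Counter
    from functools import reduce

    corpus = [
        "i live in osaka . </s>",
        "i am a graduate student . </s>",
        "my school is in nara . </s>",
    ]
    counts = Counter(w for line in corpus for w in line.split())
    total = sum(counts.values())  # 20

    def p(w):
        return counts[w] / total

    print(p("nara"))  # 0.05
    print(p("i"))     # 0.1
    sentence = "i live in nara . </s>".split()
    print(reduce(lambda acc, w: acc * p(w), sentence, 1.0))  # ≈ 5.625e-07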

Be Careful of Integers!

Divide two integers and you get an integer (rounded down):

  $ ./my-program.py
  0

Convert one integer to a float, and you will be OK:

  $ ./my-program.py
  0.5
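The slide shows only the program's output. A plausible reconstruction of my-program.py (the file name comes from the slide; the body is an assumption). Note that truncating integer division is Python 2's / behavior; in Python 3 the truncating operator is //:

    # Reconstruction (assumed) of my-program.py.
    print(1 // 2)        # 0: integer division, rounded down (plain / in Python 2)
    print(float(1) / 2)  # 0.5: converting one operand to float fixes it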

What about Unknown Words?!

Simple ML estimation doesn't work:

  i live in osaka . </s>
  i am a graduate student . </s>
  my school is in nara . </s>

  P(nara)  = 1/20 = 0.05
  P(i)     = 2/20 = 0.10
  P(kyoto) = 0/20 = 0

Often, unknown words are simply ignored (e.g. in ASR). A better way to solve this:
  Save some probability for unknown words (λ_unk = 1 - λ_1)
  Guess the total vocabulary size (N), including unknowns

  P(w_i) = λ_1 P_ML(w_i) + (1 - λ_1) × 1/N

Unknown Word Example

Total vocabulary size: N = 10^6
Unknown word probability: λ_unk = 0.05 (λ_1 = 0.95)

  P(w_i) = λ_1 P_ML(w_i) + (1 - λ_1) × 1/N

  P(nara)  = 0.95 × 0.05 + 0.05 × (1/10^6) = 0.04750005
  P(i)     = 0.95 × 0.10 + 0.05 × (1/10^6) = 0.09500005
  P(kyoto) = 0.95 × 0.00 + 0.05 × (1/10^6) = 0.00000005
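A one-function sketch (not from the slides) that reproduces the interpolated values:

    # Linearly interpolated unigram probability with unknown-word mass.
    LAMBDA_1 = 0.95  # weight on the ML estimate
    N = 10 ** 6      # guessed total vocabulary size, including unknowns

    def p(p_ml):
        return LAMBDA_1 * p_ml + (1 - LAMBDA_1) / N

    print(p(0.05))  # nara  -> 0.04750005
    print(p(0.10))  # i     -> 0.09500005
    print(p(0.00))  # kyoto -> 5e-08 (= 0.00000005)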

Evaluating Language Models

Experimental Setup

Use separate training and testing data:

Training Data:
  i live in osaka
  i am a graduate student
  my school is in nara
  ...
    → Train Model → Model

Testing Data:
  i live in nara
  i am a student
  i have lots of homework
  ...
    → Test Model → Model Accuracy (Likelihood, Log Likelihood, Entropy, Perplexity)

Likelihood

Likelihood is the probability of some observed data (the test set W_test), given the model M:

  P(W_test | M) = ∏_{w ∈ W_test} P(w | M)

  i live in nara          P(w="i live in nara" | M)      = 2.52 × 10^-21
  i am a student          P(w="i am a student" | M)      = 3.48 × 10^-19
  my classes are hard     P(w="my classes are hard" | M) = 2.15 × 10^-34

  P(W_test | M) = 2.52 × 10^-21 × 3.48 × 10^-19 × 2.15 × 10^-34 ≈ 1.89 × 10^-73
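As a sketch (assumed, not in the slides), the product can be computed directly from the per-sentence probabilities:

    # Corpus likelihood as a product of sentence probabilities.
    from functools import reduce

    sentence_probs = [2.52e-21, 3.48e-19, 2.15e-34]  # values from the slide
    print(reduce(lambda a, b: a * b, sentence_probs, 1.0))  # ≈ 1.89e-73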

Log Likelihood

The likelihood consists of very small numbers, which causes underflow. Taking the log resolves this problem:

  log P(W_test | M) = Σ_{w ∈ W_test} log P(w | M)

  i live in nara          log P(w="i live in nara" | M)      = -20.58
  i am a student          log P(w="i am a student" | M)      = -18.45
  my classes are hard     log P(w="my classes are hard" | M) = -33.67

  log P(W_test | M) = -20.58 - 18.45 - 33.67 = -72.70
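The same computation in log space, as a sketch; base-10 logs are assumed here, matching the magnitudes above:

    # Summing log probabilities avoids underflow.
    import math

    sentence_probs = [2.52e-21, 3.48e-19, 2.15e-34]
    print(sum(math.log10(p) for p in sentence_probs))  # ≈ -72.72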

Calculating Logs

Python's math package has a function for logs:

  $ ./my-program.py
  4.60517018599
  2.0
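Again only the output is shown. A reconstruction consistent with those numbers (assumed): math.log(x) is the natural logarithm, and math.log(x, b) takes base b:

    import math

    print(math.log(100))   # natural log: ≈ 4.60517018599
    print(math.log(4, 2))  # base-2 log: 2.0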

Entropy

Entropy H is the average negative log2 likelihood per word:

  H(W_test | M) = -(1/|W_test|) Σ_{w ∈ W_test} log2 P(w | M)

  i live in nara          -log2 P(w="i live in nara" | M)      = 68.43
  i am a student          -log2 P(w="i am a student" | M)      = 61.32
  my classes are hard     -log2 P(w="my classes are hard" | M) = 111.84

  # of words = 12

  H = (68.43 + 61.32 + 111.84) / 12 = 20.13

* note: we can also count </s> in the # of words (in which case it is 15)
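A small sketch reproducing this arithmetic (per-sentence values taken from the slide):

    # Per-word entropy from per-sentence negative log2 probabilities.
    neg_log2_probs = [68.43, 61.32, 111.84]
    num_words = 12  # not counting </s>; 15 if we count it
    print(sum(neg_log2_probs) / num_words)  # ≈ 20.13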

Entropy and Compression

Entropy H is also the average number of bits needed to encode information (Shannon's information theory):

  H = -(1/|W_test|) Σ_{w ∈ W_test} log2 P(w | M)

Test data: a bird a cat a dog a </s>

  a    → 0      P(w="a") = 0.5        -log2 0.5 = 1 bit
  bird → 100    P(w="bird") = 0.125   -log2 0.125 = 3 bits
  cat  → 101    P(w="cat") = 0.125    -log2 0.125 = 3 bits
  dog  → 110    P(w="dog") = 0.125    -log2 0.125 = 3 bits
  </s> → 111    P(w="</s>") = 0.125   -log2 0.125 = 3 bits

Encoding: 0 100 0 101 0 110 0 111 (16 bits for 8 words, i.e. H = 2 bits per word)
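A sketch of this correspondence, using the code table reconstructed above (the exact bit patterns are an assumption; any prefix code with these lengths works):

    import math

    # Optimal code length per word is -log2 P(w).
    probs = {"a": 0.5, "bird": 0.125, "cat": 0.125, "dog": 0.125, "</s>": 0.125}
    code = {"a": "0", "bird": "100", "cat": "101", "dog": "110", "</s>": "111"}

    test = "a bird a cat a dog a </s>".split()
    bits = "".join(code[w] for w in test)
    H = sum(-math.log2(probs[w]) for w in test) / len(test)
    print(bits, len(bits))  # 0100010101100111 16
    print(H)                # 2.0 bits per word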

Perplexity

Equal to two to the power of the per-word entropy:

  PPL = 2^H

(Mainly because it makes more impressive numbers.)

For uniform distributions, it is equal to the vocabulary size:

  V = 5
  H = -log2(1/5) = log2 5
  PPL = 2^H = 2^(log2 5) = 5
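A two-line check of this identity (illustration only):

    import math

    V = 5
    H = math.log2(V)  # per-word entropy of the uniform distribution over V words
    print(2 ** H)     # 5.0 (up to floating-point rounding)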

Coverage

The percentage of known words in the test corpus:

  a bird a cat a dog a </s>

If "dog" is an unknown word, the coverage is 7/8.

* we often omit the sentence-final symbol, in which case it is 6/7

Exercise

Exercise

Write two programs:
  train-unigram: creates a unigram model
  test-unigram: reads a unigram model and calculates entropy and coverage for the test set

Test them on:
  test/01-train-input.txt
  test/01-test-input.txt

Train the model on data/wiki-en-train.word.
Calculate entropy and coverage on data/wiki-en-test.word.
Report your scores next week.

train-unigram Pseudo-Code

create a map counts
create a variable total_count = 0
for each line in the training_file
    split line into an array of words
    append "</s>" to the end of words
    for each word in words
        add 1 to counts[word]
        add 1 to total_count
open the model_file for writing
for each word, count in counts
    probability = counts[word] / total_count
    print word, probability to model_file
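One direct Python rendering of this pseudo-code, as a sketch (the command-line arguments and the tab-separated model format are assumptions, not mandated by the tutorial):

    # train-unigram sketch. Usage: python train_unigram.py train.txt model.txt
    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    total_count = 0
    with open(sys.argv[1]) as training_file:
        for line in training_file:
            words = line.split()
            words.append("</s>")
            for word in words:
                counts[word] += 1
                total_count += 1

    with open(sys.argv[2], "w") as model_file:
        for word, count in sorted(counts.items()):
            probability = count / total_count
            model_file.write("%s\t%f\n" % (word, probability))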

test-unigram Pseudo-Code

set λ_1 = 0.95, λ_unk = 1 - λ_1, V = 1000000, W = 0, H = 0, unk = 0

Load Model:
create a map probabilities
for each line in model_file
    split line into w and P
    set probabilities[w] = P

Test and Print:
for each line in test_file
    split line into an array of words
    append "</s>" to the end of words
    for each w in words
        add 1 to W
        set P = λ_unk / V
        if probabilities[w] exists
            set P += λ_1 × probabilities[w]
        else
            add 1 to unk
        add -log2(P) to H
print "entropy = " + H/W
print "coverage = " + (W - unk)/W
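A matching sketch for the test program (same assumptions about file arguments and model format as the training sketch):

    # test-unigram sketch. Usage: python test_unigram.py model.txt test.txt
    import math
    import sys

    lambda_1 = 0.95
    lambda_unk = 1 - lambda_1
    V = 1000000  # guessed vocabulary size, including unknowns
    W, H, unk = 0, 0.0, 0

    probabilities = {}
    with open(sys.argv[1]) as model_file:
        for line in model_file:
            w, p = line.split()
            probabilities[w] = float(p)

    with open(sys.argv[2]) as test_file:
        for line in test_file:
            words = line.split()
            words.append("</s>")
            for w in words:
                W += 1
                P = lambda_unk / V
                if w in probabilities:
                    P += lambda_1 * probabilities[w]
                else:
                    unk += 1
                H += -math.log2(P)  # accumulate negative log2 likelihood

    print("entropy = %f" % (H / W))
    print("coverage = %f" % ((W - unk) / W))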

Thank You!
