02 Basic Text Processing
Regular Expressions
Dan Jurafsky

Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

Regular Expressions: Disjunctions
• Letters inside square brackets []
  Pattern        Matches
  [wW]oodchuck   Woodchuck, woodchuck
  [1234567890]   Any digit
• Ranges [A-Z]
  Pattern   Matches                 Example
  [A-Z]     An upper case letter    Drenched Blossoms
  [a-z]     A lower case letter     my beans were impatient
  [0-9]     A single digit          Chapter 1: Down the Rabbit Hole
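The bracket and range patterns above can be tried out with a short sketch. It assumes Python's `re` module as the regular expression engine (the slides do not name one), and the variable names are illustrative only.

```python
import re

# Bracket disjunction: either case of the first letter.
woodchuck = re.compile(r"[wW]oodchuck")
found = woodchuck.findall("Woodchuck or woodchuck?")

# Ranges: the first upper case letter, lower case letter, and digit
# in the example strings from the table.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(found, first_upper, first_lower, first_digit)
```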

Regular Expressions: Negation in Disjunction
• Negations [^Ss]
  • Caret means negation only when first in []
  Pattern   Matches                     Example
  [^A-Z]    Not an upper case letter    Oyfn pripetchik
  [^Ss]     Neither 'S' nor 's'         I have no exquisite reason
  [^e^]     Neither e nor ^             Look here
  a^b       The pattern a caret b       Look up a^b now
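A minimal sketch of negated brackets, again assuming Python's `re` module. It covers only the bracketed rows of the table, since engines differ in how they treat a caret outside brackets.

```python
import re

# [^A-Z]: first character that is not an upper case letter.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# [^Ss]: first character that is neither 'S' nor 's'.
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()

# [^e^]: only the first caret negates; the second one is a literal '^'.
not_e_or_caret = re.search(r"[^e^]", "Look here").group()

# A caret that is not first inside the brackets is an ordinary character.
literal_caret = re.findall(r"[e^]", "a^b here")

print(not_upper, not_s, not_e_or_caret, literal_caret)
```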

Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
  Pattern                      Matches
  groundhog|woodchuck
  yours|mine                   yours, mine
  a|b|c                        = [abc]
  [gG]roundhog|[Ww]oodchuck
Photo: D. Fletcher
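The pipe can be combined with brackets, as in the last row of the table. The sketch below assumes Python's `re` module; the sample sentence is made up for illustration.

```python
import re

# Disjunction of two bracketed alternatives.
pattern = re.compile(r"[gG]roundhog|[Ww]oodchuck")
hits = pattern.findall("A Groundhog is a woodchuck; a groundhog is a Woodchuck.")

# For single characters, a|b|c finds the same matches as [abc].
same = re.findall(r"a|b|c", "abacus cab") == re.findall(r"[abc]", "abacus cab")

print(hits, same)
```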

Regular Expressions: ? * + .
  Pattern   Matches                       Example
  colou?r   Optional previous char        color colour
  oo*h!     0 or more of previous char    oh! ooh! oooh! ooooh!
  o+h!      1 or more of previous char    oh! ooh! oooh! ooooh!
  baa+                                    baa baaa baaaa baaaaa
  beg.n                                   begin begun begun beg3n
Stephen C Kleene: Kleene *, Kleene +
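To check which strings each quantifier pattern accepts, a small helper can filter a candidate list. This is a sketch assuming Python's `re` module; `matches` is a hypothetical helper name, and the rejected candidates (such as "h!" and "begn") are added here for contrast.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional = matches(r"colou?r", ["color", "colour", "colouur"])
star = matches(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
plus = matches(r"o+h!", ["h!", "oh!", "ooh!", "oooh!"])
baa = matches(r"baa+", ["ba", "baa", "baaa"])
dot = matches(r"beg.n", ["begin", "begun", "beg3n", "begn"])

print(optional, star, plus, baa, dot)
```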

Regular Expressions: Anchors ^ $
  Pattern       Matches
  ^[A-Z]        Palo Alto
  ^[^A-Za-z]    1    "Hello"
  \.$           The end.
  .$            The end?    The end!
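A short sketch of the anchors, assuming Python's `re` module. The non-matching inputs ("Hello" without quotes, "The end?" for the escaped period) are added here for contrast.

```python
import re

# ^ anchors the match at the start of the string.
starts_upper = bool(re.search(r"^[A-Z]", "Palo Alto"))
starts_nonletter = [bool(re.search(r"^[^A-Za-z]", s))
                    for s in ["1", '"Hello"', "Hello"]]

# $ anchors the match at the end; \. is a literal period, . is any character.
ends_period = [bool(re.search(r"\.$", s)) for s in ["The end.", "The end?"]]
ends_any = [re.search(r".$", s).group() for s in ["The end?", "The end!"]]

print(starts_upper, starts_nonletter, ends_period, ends_any)
```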

Example
• Find me all instances of the word "the" in a text.
  the
      Misses capitalized examples
  [tT]he
      Incorrectly returns other or theology
  [^a-zA-Z][tT]he[^a-zA-Z]
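The three patterns can be compared on a made-up sentence. This sketch assumes Python's `re` module. Note that the last pattern requires a non-letter on both sides, so in this sketch it does not match a "The" at the very start of the string.

```python
import re

text = "The other one saw the theology book."

# Case-sensitive: misses "The", but fires inside "other" and "theology".
naive = re.findall(r"the", text)

# Accepts either case, but still fires inside longer words.
either_case = re.findall(r"[tT]he", text)

# Requires a non-letter before and after the word.
delimited = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)

print(len(naive), len(either_case), delimited)
```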

Errors
• The process we just went through was based on fixing two kinds of errors
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)

Errors cont.
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives)
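Precision and recall follow directly from counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the counts are hypothetical and not taken from the slides.

```python
def precision(true_pos, false_pos):
    """Fraction of returned matches that were correct."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Fraction of the correct items that were actually returned."""
    return true_pos / (true_pos + false_neg)

# Hypothetical counts for a pattern meant to find the word "the":
# 8 correct matches, 2 spurious matches, 8 missed occurrences.
p = precision(true_pos=8, false_pos=2)
r = recall(true_pos=8, false_neg=8)
print(p, r)
```

Tightening a pattern usually removes false positives at the cost of new false negatives, which is why the two efforts pull against each other.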

Summary
• Regular expressions play a surprisingly large role
  • Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
  • But regular expressions are used as features in the classifiers
  • Can be very useful in capturing generalizations

Basic Text Processing
Word tokenization

Text Normalization
• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text

How many words?
• I do uh main- mainly business data processing
  • Fragments, filled pauses
• Seuss's cat in the hat is different from other cats!
  • Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  • Wordform: the full inflected surface form
    • cat and cats = different wordforms

How many words?
they lay back on the San Francisco grass and looked at the stars and their
• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
  • 15 tokens (or 14)
  • 13 types (or 12) (or 11?)
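The counts can be reproduced with a few lines of Python. This sketch assumes whitespace tokenization; joining "San Francisco" with an underscore is just one illustrative way to treat it as a single token.

```python
sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")

# Every whitespace-separated word is a token; distinct tokens are types.
tokens = sentence.split()
types = set(tokens)

# Alternative count: treat "San Francisco" as one token.
merged_tokens = sentence.replace("San Francisco", "San_Francisco").split()
merged_types = set(merged_tokens)

print(len(tokens), len(types), len(merged_tokens), len(merged_types))
```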

How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
Church and Gale (1990): |V| > O(N^½)

                                   Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
Google N-grams                     1 trillion     13 million

Simple Tokenization in UNIX
• (Inspired by Ken Church's UNIX for Poets.)
• Given a text file, output the word tokens and their frequencies
tr -sc 'A-Za-z' '\n' < shakes.txt |   # Change all non-alpha to newlines
  sort |                              # Sort in alphabetical order
  uniq -c                             # Merge and count each type

1945 A           25 Aaron
  72 AARON        6 Abate
  19 ABBESS       1 Abates
   5 ABBOT        5 Abbess
 ... ...          6 Abbey
                  3 Abbot
                ... ...
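For readers without a UNIX shell, roughly the same tokenize, sort, and count steps can be sketched in Python. This is an approximation rather than an exact port: it treats each maximal run of ASCII letters as a token, and `sorted` orders by code point, which may differ from the locale-dependent order of `sort`. The function name and sample string are illustrative.

```python
import re
from collections import Counter

def count_tokens(text):
    """Count each maximal run of ASCII letters, like tr -sc 'A-Za-z' '\\n'."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(tokens)

counts = count_tokens("A rose is a rose; A Rose!")

# Print "count word" pairs in sorted order, as uniq -c would after sort.
for word in sorted(counts):
    print(counts[word], word)
```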

The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...

The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...

More counting
• Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
• Sorting the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
 8954 d        What happened here?
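A likely explanation for the frequent token "d" is that the apostrophe counts as a non-alphabetic character, so contractions such as "I'd" are split in two. The sketch below reproduces the effect in Python on a made-up sentence; `top_words` is a hypothetical helper, and the lower-casing is only equivalent to `tr 'A-Z' 'a-z'` for ASCII text.

```python
import re
from collections import Counter

def top_words(text, n):
    """Lower-case, split on non-letters, and return the n most frequent tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(n)

sample = "I'd say he'd go. The king said he'd stay."
top = top_words(sample, 2)
print(top)
```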

Issues in Tokenization
• Finland's capital → Finland Finlands Finland's ?
• what're, I'm, isn't → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??

Tokenization: language issues
• French
  • L'ensemble → one token or two?
    • L ? L' ? Le ?
    • Want l'ensemble to match with un ensemble
• German noun compounds are not segmented
  • Lebensversicherungsgesellschaftsangestellter
  • 'life insurance company employee'
  • German information retrieval needs compound splitter

Tokenization: language issues
• Chinese and Japanese have no spaces between words:
  • Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets intermingled
  • Dates/amounts in multiple formats

Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
  • Characters are generally 1 syllable and 1 morpheme.
  • Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
  • Maximum Matching (also called Greedy)

Maximum Matching Word Segmentation Algorithm
• Given a wordlist of Chinese, and a string.
  1) Start a pointer at the beginning of the string
  2) Find the longest word in dictionary that matches the string starting at pointer
  3) Move the pointer over the word in string
  4) Go to 2
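The four steps above can be sketched as a short Python function. Two assumptions are added that the slide does not state: when no dictionary word matches at the pointer, a single character is emitted so the pointer always advances, and the loop stops at the end of the string. The wordlists are hypothetical English examples, used only to keep the illustration readable.

```python
def max_match(text, wordlist):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = []
    pointer = 0                                   # step 1
    while pointer < len(text):
        # Step 2: try the longest candidate first.
        for end in range(len(text), pointer, -1):
            if text[pointer:end] in wordlist:
                break
        else:
            end = pointer + 1                     # assumed fallback: one character
        words.append(text[pointer:end])
        pointer = end                             # step 3, then back to step 2
    return words

good = max_match("thecatinthehat", {"the", "cat", "in", "hat"})
bad = max_match("thetabledownthere",
                {"the", "theta", "table", "bled", "down", "own", "there"})
print(good)
print(bad)
```

The second call shows how greedy matching can go wrong when a longer word ("theta") shadows the intended shorter one ("the").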

Max-match segmentation illustration
• Doesn't generally work in English!
• But works astonishingly well in Chinese
• Modern probabilistic segmentation algorithms even better

Basic Text Processing
Word Normalization and Stemming

Normalization
• Need to "normalize" terms
  • Information Retrieval: indexed text & query terms must have same form.
    • We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
  • e.g., deleting periods in a term
• Alternative: asymmetric expansion:
  • Enter: window     Search: window, windows
  • Enter: windows    Search: Windows, windows, window
  • Enter: Windows    Search: Windows
• Potentially more powerful, but less efficient
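Both approaches can be sketched in a few lines of Python. The function names and the fallback for unknown queries are assumptions made for illustration; the expansion table simply copies the three examples from the slide.

```python
def normalize(term):
    """Equivalence classing: delete periods so U.S.A. and USA share one form."""
    return term.replace(".", "")

# Asymmetric expansion: each entered query maps to its own search list.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand(query):
    """Return the terms to search for; unknown queries expand to themselves (assumed)."""
    return EXPANSIONS.get(query, [query])

print(normalize("U.S.A."), expand("windows"), expand("Windows"))
```

Equivalence classing needs only one lookup per term, whereas expansion searches several terms per query, which is one way to read "more powerful, but less efficient".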

Case folding
• Applications like IR: reduce all letters to lower case
  • Since users tend to use lower case
  • Possible exception: upper case in mid-sentence?
    • e.g., General Motors
    • Fed vs. fed
    • SAIL vs. sail
• For sentiment analysis, MT, Information extraction
  • Case is helpful (US versus us is important)

Lemmatization
• Reduce inflections or variant forms to base form
  • am, are, is → be
  • car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization: have to find correct dictionary headword form
• Machine translation
  • Spanish quiero ('I want'), quieres ('you want') same lemma as querer 'want'

Morphology
• Morphemes:
  • The small meaningful units that make up words
  • Stems: The core meaning-bearing units
  • Affixes: Bits and pieces that adhere to stems
    • Often with grammatical functions

Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
  • language dependent
  • e.g., automate(s), automatic, automation all reduced to automat.

for example compressed and compression are both accepted as equivalent to compress.
→
for exampl compress and compress ar both accept as equival to compress

Porter's algorithm
The most common English stemmer

Step 1a
  sses → ss       caresses → caress
  ies  → i        ponies → poni
  ss   → ss       caress → caress
  s    → ø        cats → cat
Step 1b
  (*v*)ing → ø    walking → walk
                  sing → sing
  (*v*)ed  → ø    plastered → plaster
  …
Step 2 (for long stems)
  ational → ate   relational → relate
  izer → ize      digitizer → digitize
  ator → ate      operator → operate
  …
Step 3 (for longer stems)
  al → ø          revival → reviv
  able → ø        adjustable → adjust
  ate → ø         activate → activ
  …
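A partial sketch of two of these rules in Python is given below. It is not the full Porter stemmer: it covers only Step 1a and the `(*v*)ing` rule, leaves out the stem-length conditions of the later steps, and simplifies "vowel" to the letters a, e, i, o, u. The function names are illustrative.

```python
import re

def step_1a(word):
    """Apply the first matching Step 1a rule; longer suffixes are tried first."""
    if word.endswith("sses"):
        return word[:-2]          # sses -> ss
    if word.endswith("ies"):
        return word[:-2]          # ies -> i
    if word.endswith("ss"):
        return word               # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]          # s -> (nothing)
    return word

def strip_ing(word):
    """(*v*)ing -> (nothing): strip -ing only if the remaining stem has a vowel."""
    stem = word[:-3]
    if word.endswith("ing") and re.search(r"[aeiou]", stem):
        return stem
    return word

print([step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]])
print([strip_ing(w) for w in ["walking", "sing"]])
```

Ordering matters in Step 1a: testing for "ss" before "s" is what keeps "caress" intact.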

Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
  (*v*)ing → ø    walking → walk
                  sing → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr
1312 King
 548 being
 541 nothing
 388 king
 375 bring
 358 thing
 307 ring
 152 something
 145 coming
 130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr
 548 being
 541 nothing
 152 something
 145 coming
 130 morning
 122 having
 120 living
 117 loving
 116 Being
 102 going

Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
  • Turkish
  • Uygarlastiramadiklarimizdanmissinizcasina
  • '(behaving) as if you are among those whom we could not civilize'
  • Uygar 'civilized' + las 'become'
    + tir 'cause' + ama 'not able'
    + dik 'past' + lar 'plural'
    + imiz 'p1pl' + dan 'abl'
    + mis 'past' + siniz '2pl' + casina 'as if'

Basic Text Processing
Sentence Segmentation and Decision Trees

Sentence Segmentation
• !, ? are relatively unambiguous
• Period "." is quite ambiguous
  • Sentence boundary
  • Abbreviations like Inc. or Dr.
  • Numbers like .02% or 4.3
• Build a binary classifier
  • Looks at a "."
  • Decides EndOfSentence/NotEndOfSentence
  • Classifiers: hand-written rules, regular expressions, or machine-learning
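A hand-written version of such a classifier might look like the sketch below. It is only an illustration: the abbreviation list is a hypothetical two-item set taken from the slide's examples, the input is assumed to be whitespace-separated tokens, and an abbreviation that happens to end a sentence will be misclassified.

```python
import re

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"Inc.", "Dr."}

def is_end_of_sentence(token):
    """Return True for EndOfSentence, False for NotEndOfSentence."""
    if token in ABBREVIATIONS:
        return False
    if re.fullmatch(r"[0-9]*\.[0-9]+%?", token):
        return False              # numbers like .02% or 4.3
    return token.endswith(".")

text = "Dr. Smith paid 4.3 million to Acme Inc. last year."
labels = [(t, is_end_of_sentence(t)) for t in text.split() if "." in t]
print(labels)
```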

Determining if a word is end-of-sentence: a Decision Tree

More sophisticated decision tree features
• Case of word with ".": Upper, Lower, Cap, Number
• Case of word after ".": Upper, Lower, Cap, Number
• Numeric features
  • Length of word with "."
  • Probability(word with "." occurs at end-of-s)
  • Probability(word after "." occurs at beginning-of-s)
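The case and length features could be extracted along the lines of the sketch below. The function names, the dictionary keys, and the extra "Other" label are assumptions made for illustration; the two probability features are left out because they require counts from a training corpus.

```python
def case_of(word):
    """Classify a word as Number, Upper, Lower, or Cap (Other if none apply)."""
    core = word.replace(".", "")      # ignore periods when judging case
    if core.isdigit():
        return "Number"
    if core.isupper():
        return "Upper"
    if core.islower():
        return "Lower"
    if core[:1].isupper():
        return "Cap"
    return "Other"                    # assumed catch-all, not on the slide

def features(word_with_period, next_word):
    """Build a feature dictionary for one candidate period."""
    return {
        "case": case_of(word_with_period),
        "next_case": case_of(next_word),
        "length": len(word_with_period),
    }

print(features("Dr.", "Smith"))
```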

Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
  • Hand-building only possible for very simple features, domains
    • For numeric features, it's too hard to pick each threshold
  • Instead, structure usually learned by machine learning from a training corpus
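To make the if-then-else view concrete, here is a toy hand-built tree in Python. The order of the questions and the length threshold are invented for illustration; as noted above, in practice the structure and thresholds would usually be learned from a training corpus.

```python
def end_of_sentence(case, next_case, length):
    """Toy decision tree over three features of a word ending in a period."""
    if case == "Number":
        return False                  # e.g. a period inside 4.3
    if next_case == "Lower":
        return False                  # next word does not look like a sentence start
    if case == "Cap" and length <= 4:
        return False                  # short capitalized word, e.g. Dr. (invented threshold)
    return True

print(end_of_sentence("Cap", "Cap", 3))      # a word like "Dr." before "Smith"
print(end_of_sentence("Lower", "Cap", 5))    # a word like "year." before "The"
```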

Decision Trees and other classifiers
• We can think of the questions in a decision tree
  • As features that could be exploited by any kind of classifier
    • Logistic regression
    • SVM
    • Neural Nets
    • etc.