
02 Basic Text Processing

Regular expressions are a formal language for specifying text strings that can be used for text processing tasks like searching and matching strings. They allow specifying patterns using special characters like brackets, pipes, question marks and stars. Regular expressions help reduce errors in text processing by allowing generalization of patterns to match variations while avoiding false positives and negatives. They form the basis for more advanced natural language processing models by capturing linguistic patterns and being used as features in machine learning classifiers.


Basic Text Processing

Regular Expressions

Dan Jurafsky

Regular expressions
•  A formal language for specifying text strings
•  How can we search for any of these?
    •  woodchuck
    •  woodchucks
    •  Woodchuck
    •  Woodchucks

Regular Expressions: Disjunctions
•  Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

•  Ranges [A-Z]

Pattern    Matches                 Example text
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
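
As a rough illustration, the bracket patterns above can be tried with Python's standard re module. The sample strings come from the slide; the variable names are only illustrative.

```python
import re

# re.findall returns every non-overlapping match of the pattern.
woodchucks = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# re.search returns the first (leftmost) match in the string.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(woodchucks, first_upper, first_lower, first_digit)
```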

Regular Expressions: Negation in Disjunction
•  Negations [^Ss]
    •  Caret means negation only when first in []

Pattern    Matches                     Example text
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason”
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
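
A small sketch of the negated classes in Python's re module, using the slide's example strings. One difference is worth flagging: in Python a caret outside brackets is an anchor, so the literal pattern "a caret b" has to be written with an escaped caret.

```python
import re

first_non_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
first_non_s = re.search(r"[^Ss]", "I have no exquisite reason").group()
# Inside brackets, a caret that is not first is just a literal character.
first_non_e_or_caret = re.search(r"[^e^]", "Look here").group()

# Outside brackets Python treats ^ as start-of-string, so it must be escaped here.
literal_caret = re.search(r"a\^b", "Look up a^b now").group()
unescaped = re.search(r"a^b", "Look up a^b now")  # never matches

print(first_non_upper, first_non_s, first_non_e_or_caret, literal_caret, unescaped)
```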

Regular Expressions: More Disjunction
•  Woodchuck is another name for groundhog!
•  The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck

(Photo: D. Fletcher)
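
The pipe can be combined with bracket classes, as in the last row above. A minimal Python sketch; the sample sentence and the word "cabbage" are made up for illustration.

```python
import re

pattern = re.compile(r"[gG]roundhog|[Ww]oodchuck")
animals = pattern.findall("A Groundhog and a woodchuck are the same animal.")

# For single characters, a|b|c and [abc] find the same matches.
with_pipe = re.findall(r"a|b|c", "cabbage")
with_class = re.findall(r"[abc]", "cabbage")

print(animals, with_pipe == with_class)
```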

Regular Expressions: ? * + .

Pattern    Meaning                       Matches
colou?r    Optional previous char        color colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                     baa baaa baaaa baaaaa
beg.n                                    begin began begun beg3n

Stephen C Kleene: Kleene *, Kleene +
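
To check which whole words each pattern accepts, one option is re.fullmatch. This is only a sketch: the helper name full_matches and the extra non-matching words (colouur, h!, ba, begn) are illustrative additions.

```python
import re

def full_matches(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional = full_matches(r"colou?r", ["color", "colour", "colouur"])
star = full_matches(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
plus = full_matches(r"o+h!", ["h!", "oh!", "ooh!", "oooh!"])
sheep = full_matches(r"baa+", ["ba", "baa", "baaa"])
any_char = full_matches(r"beg.n", ["begin", "began", "begun", "beg3n", "begn"])

print(optional, star, plus, sheep, any_char)
```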

Regular Expressions: Anchors ^ $

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1    “Hello”
\.$           The end.
.$            The end?    The end!
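
A short Python sketch of the anchors, using the slide's strings plus a lower-cased variant added for contrast.

```python
import re

starts_upper = bool(re.search(r"^[A-Z]", "Palo Alto"))
starts_upper_lowercased = bool(re.search(r"^[A-Z]", "palo Alto"))

starts_non_letter = re.search(r"^[^A-Za-z]", "1 “Hello”").group()

# \. is a literal period; an unescaped . matches any character.
ends_with_period = bool(re.search(r"\.$", "The end."))
question_is_not_period = re.search(r"\.$", "The end?")
last_char = re.search(r".$", "The end?").group()

print(starts_upper, starts_upper_lowercased, starts_non_letter,
      ends_with_period, question_is_not_period, last_char)
```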

Example
•  Find me all instances of the word “the” in a text.
    the
        Misses capitalized examples
    [tT]he
        Incorrectly returns other or theology
    [^a-zA-Z][tT]he[^a-zA-Z]
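
The three attempts can be compared on a made-up sentence. The final pattern in this sketch uses \b (word boundary), which is not on the slide; it is included as an assumption about how one might also catch a sentence-initial "The", which the delimited pattern misses because it requires a character on both sides.

```python
import re

text = "The other day the theology student saw the cat."

plain = re.findall(r"the", text)             # misses "The"; hits "other", "theology"
either_case = re.findall(r"[tT]he", text)    # finds "The" but keeps the false hits
delimited = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)

# Assumption beyond the slide: \b matches at a word edge without consuming a character.
word_boundary = re.findall(r"\b[tT]he\b", text)

print(len(plain), len(either_case), delimited, word_boundary)
```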

Errors
•  The process we just went through was based on fixing two kinds of errors
    •  Matching strings that we should not have matched (there, then, other)
        •  False positives (Type I)
    •  Not matching things that we should have matched (The)
        •  False negatives (Type II)

Errors cont.
•  In NLP we are always dealing with these kinds of errors.
•  Reducing the error rate for an application often involves two antagonistic efforts:
    •  Increasing accuracy or precision (minimizing false positives)
    •  Increasing coverage or recall (minimizing false negatives).
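
Precision and recall can be computed from the sets of predicted and correct matches. The sketch below is illustrative: the function name, the sample sentence, and the hand-labelled offsets in gold are all assumptions, not data from the lecture.

```python
import re

def precision_recall(predicted, gold):
    """Precision = TP / predicted; recall = TP / gold (0.0 when the denominator is empty)."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

text = "The other day the theology student saw the cat."
gold = {0, 14, 39}  # hand-labelled start offsets of the word "the" (any case)

def starts(pattern):
    return {m.start() for m in re.finditer(pattern, text)}

p_plain, r_plain = precision_recall(starts(r"the"), gold)     # false negative: "The"
p_case, r_case = precision_recall(starts(r"[tT]he"), gold)    # false positives remain
print(p_plain, r_plain, p_case, r_case)
```

In this toy case, moving from the to [tT]he removes the false negative, so recall rises, while the false positives from other and theology keep precision below 1.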

Summary
•  Regular expressions play a surprisingly large role
    •  Sophisticated sequences of regular expressions are often the first model for any text processing task
•  For many hard tasks, we use machine learning classifiers
    •  But regular expressions are used as features in the classifiers
    •  Can be very useful in capturing generalizations

Basic Text Processing

Word tokenization

Text Normalization
•  Every NLP task needs to do text normalization:
    1.  Segmenting/tokenizing words in running text
    2.  Normalizing word formats
    3.  Segmenting sentences in running text

How many words?
•  I do uh main- mainly business data processing
    •  Fragments, filled pauses
•  Seuss’s cat in the hat is different from other cats!
    •  Lemma: same stem, part of speech, rough word sense
        •  cat and cats = same lemma
    •  Wordform: the full inflected surface form
        •  cat and cats = different wordforms

How many words?
they lay back on the San Francisco grass and looked at the stars and their

•  Type: an element of the vocabulary.
•  Token: an instance of that type in running text.
•  How many?
    •  15 tokens (or 14)
    •  13 types (or 12) (or 11?)
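
The first pair of counts can be reproduced with whitespace tokenization. This sketch counts San and Francisco as separate tokens; the alternative counts on the slide depend on treating San Francisco as one token or merging they/their, which the code does not attempt.

```python
sentence = ("they lay back on the San Francisco grass and looked at "
            "the stars and their")

tokens = sentence.split()   # every running word is a token
types = set(tokens)         # distinct wordforms make up the vocabulary

print(len(tokens), len(types))
```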

How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
Church and Gale (1990): |V| > O(N^½)

                                   Tokens = N     Types = |V|
Switchboard phone conversations    2.4 million    20 thousand
Shakespeare                        884,000        31 thousand
Google N-grams                     1 trillion     13 million

Simple Tokenization in UNIX
•  (Inspired by Ken Church’s UNIX for Poets.)
•  Given a text file, output the word tokens and their frequencies

tr -sc 'A-Za-z' '\n' < shakes.txt |   # Change all non-alpha to newlines
  sort |                              # Sort in alphabetical order
  uniq -c                             # Merge and count each type

1945 A             25 Aaron
  72 AARON          6 Abate
  19 ABBESS         1 Abates
   5 ABBOT          5 Abbess
 ...                6 Abbey
                    3 Abbot
                  ...
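
For readers without a UNIX shell, roughly the same counts can be sketched in Python. The function name word_counts and the sample string are illustrative; the regular expression mirrors the 'A-Za-z' character set used by tr.

```python
import re
from collections import Counter

def word_counts(text):
    """Count maximal runs of ASCII letters, like tr -sc 'A-Za-z' '\\n' | sort | uniq -c."""
    return Counter(re.findall(r"[A-Za-z]+", text))

sample = "THE SONNETS by William Shakespeare. From fairest creatures we desire increase, the"
counts = word_counts(sample)

# Python's default string sort orders upper case before lower case.
for word in sorted(counts):
    print(counts[word], word)
```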

The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...

The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...

More counting
•  Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
•  Sorting the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
 8954 d          What happened here?
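
A Python sketch of the case-merged, frequency-sorted count. The function name and the test strings are illustrative. The second call hints at where a stray token such as d can come from: any non-letter, including an apostrophe, splits a word.

```python
import re
from collections import Counter

def sorted_lowercase_counts(text):
    """Lower-case the text, split on non-letters, and sort by descending count."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()

merged = sorted_lowercase_counts("The the THE cat")
split_on_apostrophe = sorted_lowercase_counts("lov'd")

print(merged, split_on_apostrophe)
```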

Issues in Tokenization
•  Finland’s capital → Finland Finlands Finland’s ?
•  what’re, I’m, isn’t → What are, I am, is not
•  Hewlett-Packard → Hewlett Packard ?
•  state-of-the-art → state of the art ?
•  Lowercase → lower-case lowercase lower case ?
•  San Francisco → one token or two?
•  m.p.h., PhD. → ??

Tokenization: language issues
•  French
    •  L'ensemble → one token or two?
        •  L ? L’ ? Le ?
        •  Want l’ensemble to match with un ensemble

•  German noun compounds are not segmented
    •  Lebensversicherungsgesellschaftsangestellter
    •  ‘life insurance company employee’
    •  German information retrieval needs compound splitter

Tokenization: language issues
•  Chinese and Japanese no spaces between words:
    •  [Chinese example sentence, unsegmented and segmented; characters lost in extraction]
    •  Sharapova now lives in US southeastern Florida
•  Further complicated in Japanese, with multiple alphabets intermingled
    •  Dates/amounts in multiple formats

[Japanese example sentence; only the fragments 500, $500K and 6,000 survive extraction]
Scripts labelled in the example: Katakana, Hiragana, Kanji, Romaji

End-user can express query entirely in hiragana!

Word Tokenization in Chinese
•  Also called Word Segmentation
•  Chinese words are composed of characters
    •  Characters are generally 1 syllable and 1 morpheme.
    •  Average word is 2.4 characters long.
•  Standard baseline segmentation algorithm:
    •  Maximum Matching (also called Greedy)

Maximum Matching Word Segmentation Algorithm
•  Given a wordlist of Chinese, and a string.
1)  Start a pointer at the beginning of the string
2)  Find the longest word in dictionary that matches the string starting at pointer
3)  Move the pointer over the word in string
4)  Go to 2
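
The four steps can be sketched in a few lines of Python. Two points are assumptions rather than part of the slide: the loop stops when the pointer reaches the end of the string, and when no dictionary word matches, a single character is emitted so the pointer always advances. The English wordlist is a toy example.

```python
def max_match(text, wordlist):
    """Greedy longest-match segmentation of text against a set of known words."""
    words = []
    pointer = 0                                    # step 1
    longest = max(map(len, wordlist), default=1)
    while pointer < len(text):
        # Step 2: try the longest candidate first, shrinking until a word matches.
        for end in range(min(len(text), pointer + longest), pointer, -1):
            if text[pointer:end] in wordlist:
                break
        else:
            end = pointer + 1                      # assumed fallback: one character
        words.append(text[pointer:end])
        pointer = end                              # step 3, then repeat (step 4)
    return words

toy_wordlist = {"the", "cat", "in", "hat", "table", "down", "there",
                "theta", "bled", "own"}
print(max_match("thecatinthehat", toy_wordlist))
print(max_match("thetabledownthere", toy_wordlist))
```

With this toy wordlist the second call segments greedily into theta, bled, own, there, because theta is longer than the.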

Max-match segmentation illustration

•  Thecatinthehat          the cat in the hat

•  Thetabledownthere       the table down there
                           theta bled own there

•  Doesn’t generally work in English!

•  But works astonishingly well in Chinese
    •  [Chinese example, unsegmented and segmented; characters lost in extraction]
•  Modern probabilistic segmentation algorithms even better

Basic Text Processing

Word Normalization and Stemming

Normalization

•  Need to “normalize” terms
    •  Information Retrieval: indexed text & query terms must have same form.
        •  We want to match U.S.A. and USA
•  We implicitly define equivalence classes of terms
    •  e.g., deleting periods in a term
•  Alternative: asymmetric expansion:
    •  Enter: window     Search: window, windows
    •  Enter: windows    Search: Windows, windows, window
    •  Enter: Windows    Search: Windows

•  Potentially more powerful, but less efficient
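
The two approaches can be contrasted in a short Python sketch. The function names are illustrative, and the expansion table simply copies the three rows of the slide; a real system would derive it from its own resources.

```python
def normalize(term):
    """Equivalence classing: delete periods so U.S.A. and USA share one index key."""
    return term.replace(".", "")

# Asymmetric expansion: the query term decides which indexed forms are searched.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query(term):
    """Return the forms to search for; unknown terms are searched as entered."""
    return EXPANSIONS.get(term, [term])

print(normalize("U.S.A."), expand_query("windows"), expand_query("Windows"))
```

The expansion table needs one lookup per query term and several index probes, which is one way to read "less efficient" on the slide.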

Case folding
•  Applications like IR: reduce all letters to lower case
    •  Since users tend to use lower case
    •  Possible exception: upper case in mid-sentence?
        •  e.g., General Motors
        •  Fed vs. fed
        •  SAIL vs. sail
•  For sentiment analysis, MT, Information extraction
    •  Case is helpful (US versus us is important)

Lemmatization
•  Reduce inflections or variant forms to base form
    •  am, are, is → be
    •  car, cars, car's, cars' → car
•  the boy's cars are different colors → the boy car be different color
•  Lemmatization: have to find correct dictionary headword form
•  Machine translation
    •  Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer ‘want’

Morphology
•  Morphemes:
    •  The small meaningful units that make up words
    •  Stems: The core meaning-bearing units
    •  Affixes: Bits and pieces that adhere to stems
        •  Often with grammatical functions

Stemming
•  Reduce terms to their stems in information retrieval
•  Stemming is crude chopping of affixes
    •  language dependent
    •  e.g., automate(s), automatic, automation all reduced to automat.

Original text:  for example compressed and compression are both accepted as equivalent to compress.
Stemmed text:   for exampl compress and compress ar both accept as equival to compress

Porter’s algorithm
The most common English stemmer

Step 1a
    sses → ss        caresses → caress
    ies  → i         ponies → poni
    ss   → ss        caress → caress
    s    → ø         cats → cat
Step 1b
    (*v*)ing → ø     walking → walk
                     sing → sing
    (*v*)ed  → ø     plastered → plaster
    …
Step 2 (for long stems)
    ational → ate    relational → relate
    izer → ize       digitizer → digitize
    ator → ate       operator → operate
    …
Step 3 (for longer stems)
    al   → ø         revival → reviv
    able → ø         adjustable → adjust
    ate  → ø         activate → activ
    …
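
The rules listed under Steps 1a and 1b can be sketched as ordered suffix rewrites. This is a simplified illustration of only the rules shown on the slide, not a full Porter stemmer: the elided rules, the clean-up that follows Step 1b in the published algorithm, and the stem-length conditions of Steps 2 and 3 are omitted, and treating only a, e, i, o, u as vowels is an assumption.

```python
import re

VOWEL = re.compile(r"[aeiou]")  # simplification: y is never treated as a vowel

def step_1a(word):
    """Apply the first matching plural rule, longest suffix first."""
    for suffix, replacement in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def step_1b(word):
    """Strip -ing or -ed only when the remaining stem contains a vowel: (*v*)."""
    for suffix in ("ing", "ed"):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and VOWEL.search(stem):
            return stem
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
for w in ("walking", "sing", "plastered"):
    print(w, "->", step_1b(w))
```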

Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
    (*v*)ing → ø     walking → walk
                     sing → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr

    1312 King
     548 being
     541 nothing
     388 king
     375 bring
     358 thing
     307 ring
     152 something
     145 coming
     130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr

     548 being
     541 nothing
     152 something
     145 coming
     130 morning
     122 having
     120 living
     117 loving
     116 Being
     102 going
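
The effect of the vowel condition can also be seen on a small word list in Python. The function name and the sample words are illustrative; the two patterns are the ones passed to grep above.

```python
import re
from collections import Counter

def ing_counts(words, require_vowel=False):
    """Count words ending in -ing, optionally requiring a vowel before the suffix."""
    pattern = re.compile(r"[aeiou].*ing$" if require_vowel else r"ing$")
    return Counter(w for w in words if pattern.search(w)).most_common()

sample = ["King", "being", "king", "bring", "nothing", "sing", "King"]
print(ing_counts(sample))
print(ing_counts(sample, require_vowel=True))
```

Words such as King, bring and sing drop out under the second pattern because nothing before their final ing contains a vowel.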

Dealing with complex morphology is sometimes necessary
•  Some languages require complex morpheme segmentation
    •  Turkish
    •  Uygarlastiramadiklarimizdanmissinizcasina
    •  ‘(behaving) as if you are among those whom we could not civilize’
    •  Uygar ‘civilized’ + las ‘become’
       + tir ‘cause’ + ama ‘not able’
       + dik ‘past’ + lar ‘plural’
       + imiz ‘p1pl’ + dan ‘abl’
       + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’

Basic Text Processing

Sentence Segmentation and Decision Trees

Sentence Segmentation
•  !, ? are relatively unambiguous
•  Period “.” is quite ambiguous
    •  Sentence boundary
    •  Abbreviations like Inc. or Dr.
    •  Numbers like .02% or 4.3
•  Build a binary classifier
    •  Looks at a “.”
    •  Decides EndOfSentence/NotEndOfSentence
    •  Classifiers: hand-written rules, regular expressions, or machine-learning
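
A hand-written-rule classifier of this kind might look like the following Python sketch. Everything here is an assumption for illustration: the token-based interface, the two-entry abbreviation list taken from the slide, and the rule that a following capitalized word signals a boundary.

```python
ABBREVIATIONS = {"Inc.", "Dr."}  # from the slide; a usable list would be far longer

def is_end_of_sentence(token, next_token):
    """Return True for EndOfSentence, False for NotEndOfSentence."""
    if token.endswith(("!", "?")):
        return True                    # relatively unambiguous
    if not token.endswith("."):
        return False                   # periods inside 4.3 or .02% do not end the token
    if token in ABBREVIATIONS:
        return False
    # Assumed heuristic: a boundary is followed by a capitalized word or by nothing.
    return next_token is None or next_token[:1].isupper()

print(is_end_of_sentence("end.", "The"), is_end_of_sentence("Dr.", "Smith"))
```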

Determining if a word is end-of-sentence: a Decision Tree

More sophisticated decision tree features
•  Case of word with “.”: Upper, Lower, Cap, Number
•  Case of word after “.”: Upper, Lower, Cap, Number

•  Numeric features
    •  Length of word with “.”
    •  Probability(word with “.” occurs at end-of-s)
    •  Probability(word after “.” occurs at beginning-of-s)

Implementing Decision Trees
•  A decision tree is just an if-then-else statement
•  The interesting research is choosing the features
•  Setting up the structure is often too hard to do by hand
    •  Hand-building only possible for very simple features, domains
        •  For numeric features, it’s too hard to pick each threshold
    •  Instead, structure usually learned by machine learning from a training corpus
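
To make the if-then-else point concrete, here is a hypothetical hand-built tree in Python. The structure and every threshold (3, 0.5, 0.9) are invented for illustration; as the slide notes, in practice they would be learned from a training corpus. The probability arguments stand for the numeric features described earlier.

```python
def end_of_sentence_tree(word, next_word, p_end, p_begin):
    """Hypothetical decision tree over a word ending in '.' and the word after it.

    p_end:   assumed estimate that `word` occurs at end-of-sentence
    p_begin: assumed estimate that `next_word` occurs at beginning-of-sentence
    """
    if not next_word:
        return True                     # nothing follows the period
    if next_word[0].isupper():
        if len(word) <= 3 and p_end < 0.5:
            return False                # short and rarely final: likely abbreviation
        return True
    return p_begin > 0.9                # lower-case continuation is rarely a boundary

print(end_of_sentence_tree("Dr.", "Smith", 0.01, 0.2))
print(end_of_sentence_tree("home.", "The", 0.3, 0.8))
```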

Decision Trees and other classifiers
•  We can think of the questions in a decision tree
•  As features that could be exploited by any kind of classifier
    •  Logistic regression
    •  SVM
    •  Neural Nets
    •  etc.