
NLP Module -1

The document provides an overview of Natural Language Processing (NLP), defining it as the intersection of linguistics, computer science, and artificial intelligence focused on enabling computers to understand and generate human language. It discusses the origins of NLP, its challenges, various levels of language processing, and applications such as machine translation, speech recognition, and information retrieval. Additionally, it covers language modeling techniques, particularly the statistical n-gram model, which estimates the probability of word sequences.

Uploaded by

Vaishnavi Y. U
Introduction

Contents: 1.1 What is Natural Language Processing?, 1.2 Origins of NLP, 1.3 Language and Knowledge, 1.4 The Challenges of NLP, 1.5 Language and Grammar, 1.6 Processing Indian Languages, 1.7 NLP Applications, 1.8 Language Modelling: Statistical Language Model - n-gram model (unigram, bigram), 1.9 Paninian Framework, 1.10 Karaka Theory.

1.1 What is Natural Language Processing?

> Humans communicate through some form of language, either text or speech. For computers to interact with humans, they need to understand the natural languages that humans use. Natural language processing is about making computers learn, understand, analyze, manipulate, and interpret natural (human) languages.
> NLP stands for Natural Language Processing, a field at the intersection of computer science, linguistics, and artificial intelligence.
> Processing of natural language is required whenever you want an intelligent system, such as a robot, to act on your instructions, or want to hear a decision from a dialogue-based clinical expert system, etc.
> The ability of machines to interpret human language is now at the core of many applications that we use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice assistants, and social language translators.
> The input and output of an NLP system can be speech or written text.
> Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken and written -- referred to as natural language. It is a component of artificial intelligence (AI).

VI Semester, CSE (DS)                                Dr. Murali G, Professor

> Natural Language Processing (NLP) is concerned with the development of computational models of aspects of human language processing. There are two main reasons for such development:
  1. To develop automated tools for language processing.
  2. To gain a better understanding of human communication.
> Building computational models of human language processing requires knowledge of how humans acquire, store, and process language.
> Natural Language Processing (NLP) is a fascinating field that sits at the crossroads of linguistics, computer science, and artificial intelligence (AI). At its core, NLP is concerned with enabling computers to understand, interpret, and generate human language in a way that is both smart and useful.

[Figure: NLP at the intersection of linguistics, computer science, AI, and machine learning]

1.2 Origins of NLP

> Natural language processing, sometimes termed natural language understanding, originated from machine translation research. While natural language understanding involves only the interpretation of language, natural language processing includes both understanding (interpretation) and generation (production). NLP also includes speech processing; this text is concerned with text processing only, covering work in the area of computational linguistics and the tasks in which NLP has found useful application.
> Computational linguistics deals with the application of linguistic theories and computational techniques to NLP. In computational linguistics, representing a language is a major problem; most knowledge representations tackle only a small part of knowledge.
> Computational models may be broadly classified into knowledge-driven and data-driven categories. Knowledge-driven systems rely on explicitly coded linguistic knowledge, often expressed as a set of handcrafted grammar rules. Data-driven approaches presume the existence of a large amount of data and usually employ some machine learning technique to learn syntactic patterns.
> NLP is no longer confined to classroom teaching and a few traditional applications. With the unprecedented amount of information now available on the web, NLP has become one of the leading techniques for processing and retrieving information. The term information retrieval is used here in a broad sense to include a number of information processing applications such as information extraction, text summarization, question answering, and so forth.

1.3 Language and Knowledge

> Language is the medium of expression in which knowledge is deciphered. Language, being a medium of expression, is the outer form of the content it expresses; the same content can be expressed in different languages.
> Language (text) processing has different levels, each involving different types of knowledge:
  1. Lexical analysis
  2. Syntactic analysis
  3. Semantic analysis
  4. Discourse analysis
  5. Pragmatic analysis

[Figure: levels of language processing - lexical, syntactic, semantic, discourse, and pragmatic analysis]

Lexical analysis
> The simplest level of analysis is lexical analysis, which involves the analysis of words. Words are the most fundamental unit of any natural language text.
> Word-level processing requires morphological knowledge, i.e., knowledge about the structure and formation of words from basic units (morphemes). The rules for forming words from morphemes are language specific.

Syntactic analysis
> The next level of analysis is syntactic analysis, which considers a sequence of words as a unit, usually a sentence, and finds its structure.
> Syntactic analysis decomposes a sentence into its constituents (or words) and identifies how they relate to each other. It captures the grammaticality or non-grammaticality of sentences by looking at constraints such as word order, number, and case agreement.
> For example, "She is going to the market" is valid, but "She are going to the market" is not.
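The number-agreement constraint in the example above can be illustrated with a small sketch. The mini-lexicon here is invented purely for illustration; a real syntactic analyzer would derive number features from morphological analysis rather than a hand-written table:

```python
# Hypothetical sketch of one constraint checked during syntactic analysis:
# subject-verb number agreement. The tiny lexicon below is an assumption
# made for this example, not part of the original text.
SUBJECT_NUMBER = {
    "she": "singular", "he": "singular", "it": "singular",
    "they": "plural", "we": "plural",
}
VERB_NUMBER = {"is": "singular", "are": "plural"}

def agrees(subject: str, verb: str) -> bool:
    """Return True only when subject and verb are known and match in number."""
    subj = SUBJECT_NUMBER.get(subject.lower())
    vb = VERB_NUMBER.get(verb.lower())
    return subj is not None and subj == vb

print(agrees("She", "is"))   # True:  "She is going to the market"
print(agrees("She", "are"))  # False: "She are going to the market"
```

A full parser would enforce such constraints while building the constituent structure, rather than checking isolated word pairs.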
Semantic analysis
> Semantic analysis is associated with the meaning of the language. It is concerned with creating a meaningful representation of linguistic inputs: the general idea of semantic interpretation is to take natural language sentences or utterances and map them onto some representation of meaning.
> For example, "Colourless green ideas sleep furiously" is well-formed and syntactically correct, but semantically anomalous. However, this does not mean that syntax has no role to play in meaning: semantics can be seen as a projection of syntax, i.e., semantic structure is interpreted from syntactic structure.

Discourse analysis
> A higher level of analysis is discourse analysis. Discourse-level processing attempts to interpret the structure and meaning of even larger units, e.g., at the paragraph and document level, in terms of words, phrases, clusters, and sentences. It requires the resolution of anaphoric references and the identification of discourse structure.
> For example, in the following sentences, resolving the anaphoric reference "they" requires pragmatic knowledge:
  - The district administration refused to give the trade union permission for the meeting because they feared violence.
  - The district administration refused to give the trade union permission for the meeting because they oppose the government.

Pragmatic analysis
> The highest level of processing is pragmatic analysis, which deals with the purposeful use of sentences in situations. It requires knowledge of the world, i.e., knowledge that extends beyond the contents of the text.

1.4 The Challenges of NLP

> There are a number of factors that make NLP difficult. These relate to the problems of representation and interpretation.
> Language computing requires precise representation of content. Given that natural languages are highly ambiguous and vague, achieving such a representation can be difficult.
> The inability to capture all the required knowledge is another source of difficulty. It is almost impossible to embody all the sources of knowledge that humans use to process language.
> Identifying the semantics of natural language is among the greatest sources of difficulty. Compositional semantics considers the meaning of a sentence to be a composition of the meanings of the words appearing in it.
> Quantifier scoping is another problem. The scope of quantifiers is often not clear and poses difficulties in automatic processing.
> The ambiguity of natural languages is a further difficulty. The first level of ambiguity arises at the word level: without much effort, we can identify words that have multiple meanings associated with them, e.g., bank, can, bat, and still.

1.5 Language and Grammar

> For a machine to process language, the language must be described precisely. Grammar provides such a description: it consists of a set of rules that allow us to parse and generate sentences in a language.
> Several grammar formalisms have been proposed, including transformational grammar (proposed by Chomsky), lexical functional grammar, Paninian grammar, tree-adjoining grammar, generalized phrase structure grammar, dependency grammar, etc.
> Generative grammars are often referred to as a general framework: they consist of a set of rules that specify or generate the grammatical sentences of a language.

Example:
    Pooja plays Veena  /  Veena is played by Pooja    (surface structures)
    Pooja plays Veena                                 (deep structure)
Figure 1.1: Surface and deep structure of a sentence

> Transformational grammar was introduced by Chomsky in 1957. Chomsky argued that an utterance is the surface representation of a "deeper structure". The deep structure can be transformed in a number of ways to yield many different surface-level representations; sentences with different surface-level representations can have the same meaning.
> Chomsky's theory explains why sentences like "Pooja plays Veena" and "Veena is played by Pooja" have the same meaning despite having different surface structures.
> Transformational grammar has three components:
  1. Phrase structure grammar
  2. Transformational rules
  3. Morphophonemic rules - these rules match each sentence representation to a string of phonemes.
> Phrase structure grammar consists of rules that generate natural language sentences and assign a structural description to them. For example, consider the following set of rules (the non-terminal expansions correspond to the parse in Figure 1.2):
    S    -> NP VP
    NP   -> Det Noun
    VP   -> V NP
    V    -> Aux Verb
    Aux  -> will, is, can
> In these rules, S stands for sentence, NP for noun phrase, VP for verb phrase, and Det for determiner. Sentences that can be generated using these rules are termed grammatical.
> The second component of transformational grammar is a set of transformational rules, which transform one phrase-marker (underlying) into another phrase-marker (derived). These rules are applied to the terminal string generated by the phrase structure rules, and are used to transform one surface representation into another, e.g., an active sentence into a passive one.
> Morphophonemic rules match each sentence representation to a string of phonemes. Consider the active sentence:

    The police will catch the snatcher.    ...(1)

> The application of the phrase structure rules assigns the structure shown in Figure 1.2 to the sentence.

[Figure 1.2: Parse structure of the sentence "The police will catch the snatcher"]

> The passive transformation rules will convert the sentence into:

    The + snatcher + will + be + en + catch + by + the + police

[Figure 1.3: Structure of sentence (1) after applying the passive transformation]

> Finally, the morphophonemic rules map this representation to a string of phonemes, e.g., "en + catch" becomes "caught", yielding "The snatcher will be caught by the police."

1.6 Processing Indian Languages

There are a number of differences between Indian languages and English, which introduces differences in their processing.
Some of these differences are listed here:
> Unlike English, Indic scripts have a non-linear structure.
> Unlike English, Indian languages have SOV (Subject-Object-Verb) as the default sentence structure.
> Indian languages have a free word order, i.e., words can be moved freely within a sentence without changing its meaning.
> Spelling standardization is more subtle in Hindi than in English.
> Indian languages have a relatively rich set of morphological variants.
> Indian languages use post-position case markers instead of prepositions.
> Indian languages use verb complexes consisting of sequences of verbs.

1.7 NLP Applications

Machine Translation - refers to the automatic translation of text from one human language to another. It requires an understanding of words and phrases, the grammars of the two languages involved, the semantics of the languages, and world knowledge.

Speech Recognition - the process of mapping acoustic speech signals to a sequence of words.

Speech Synthesis - refers to the automatic production of speech. Such systems can read out your mail over the telephone, or even read out a storybook for you; NLP remains an important component of any speech synthesis system.

Natural Language Interfaces to Databases - allow querying a structured database using natural language sentences.

Information Retrieval - concerned with identifying documents relevant to a user's query. NLP techniques have found useful applications in information retrieval.

Information Extraction - an information extraction system captures and outputs factual information contained within a document. Like an information retrieval system, it responds to a user's information need.

Question Answering - a question answering system attempts to find the precise answer, or at least the precise portion of text in which the answer appears.
A question answering system is different from an information extraction system in that the content to be extracted is not known in advance.

Text Summarization - deals with the creation of summaries of documents and involves syntactic, semantic, and discourse-level processing of text.

1.8 Language Modelling

> A model is a description of some complex entity or process; a language model is thus a description of language. Natural language is a complex entity, and in order to process it we need to build a representation, or model, of it. This is known as language modelling.
> Language modelling can be viewed either as a problem of grammar inference or as a problem of probability estimation. A grammar-based language model attempts to distinguish a grammatical sentence from a non-grammatical one, whereas a probabilistic model computes maximum likelihood estimates.
> There are thus two approaches to language modelling: one is to define a grammar that can handle the language; the other is to capture the patterns of the language statistically.

1.8.1 Statistical Language Model

> A statistical language model is a probability distribution P(s) over all possible word sequences (or over any other linguistic unit, such as words, sentences, paragraphs, documents, or spoken utterances). A number of statistical language models have been proposed in the literature; the dominant approach is the n-gram model.
> Statistical language models are fundamental to many NLP applications, such as speech recognition, spelling correction, machine translation, question answering, information retrieval, and text summarization.

n-gram Model

The goal of a statistical language model is to estimate the probability of a sentence. This is achieved by decomposing the sentence probability into a product of conditional probabilities using the chain rule:

    P(s) = P(w1, w2, ..., wn)
         = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1 w2 ... wn-1)
         = Π (i = 1..n) P(wi | hi)
where hi is the history of word wi, defined as w1, w2, ..., wi-1.

> So, in order to calculate a sentence probability, we need the probability of each word given the sequence of words preceding it. An n-gram model simplifies the task by approximating the probability of a word given all the previous words by the probability given only the previous n-1 words:

    P(wi | hi) ≈ P(wi | wi-n+1 ... wi-1)

> Thus, an n-gram model calculates P(wi | hi) by modelling language as a Markov model of order n-1, i.e., by looking at the previous n-1 words only. Using bigram and trigram estimates, the probability of a sentence can be calculated as:

    P(s) ≈ Π P(wi | wi-1)          (bigram)
    P(s) ≈ Π P(wi | wi-2 wi-1)     (trigram)

> As an example, the bigram approximation reduces P(cat | saw the) to P(cat | the), whereas the trigram approximation retains P(cat | saw the).
> A special word (pseudo-word) <s> is introduced to mark the beginning of a sentence in the bigram estimation, so that the probability of the first word in a sentence is conditioned on <s>. Similarly, in the trigram estimation, two pseudo-words <s1> and <s2> are introduced.
> To estimate these probabilities, the model is trained on a corpus, and the parameters are obtained by maximum likelihood estimation (MLE) from the relative frequencies in the training corpus. Since the sum of the counts of all n-grams sharing the same first n-1 words is the count of that (n-1)-gram, the MLE estimate is:

    P(wi | wi-n+1 ... wi-1) = C(wi-n+1 ... wi) / C(wi-n+1 ... wi-1)

Example. Training set:
  1. The Arabian knights
  2. These are the fairy tales of the east
  3. The stories of the Arabian knights are translated in many languages

Bigram model estimates (with <s> as the sentence-start context; a sentence-final word does not serve as a context, since no end marker is used):

    P(the | <s>)         = 2/3 ≈ 0.67      P(Arabian | the)   = 2/5 = 0.4
    P(knights | Arabian) = 2/2 = 1.0       P(are | knights)   = 1/1 = 1.0
    P(the | are)         = 1/2 = 0.5       P(fairy | the)     = 1/5 = 0.2
    P(tales | fairy)     = 1/1 = 1.0       P(of | tales)      = 1/1 = 1.0
    P(the | of)          = 2/2 = 1.0       P(east | the)      = 1/5 = 0.2

Test sentence (s): The Arabian knights are the fairy tales of the east

    P(s) = P(the|<s>) × P(Arabian|the) × P(knights|Arabian) × P(are|knights)
           × P(the|are) × P(fairy|the) × P(tales|fairy) × P(of|tales)
           × P(the|of) × P(east|the)
         = 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2
         ≈ 0.0053

> Since each probability is necessarily less than one, multiplying the probabilities may cause numerical underflow, particularly for long sentences. To avoid this, calculations are carried out in log space, where the product of probabilities corresponds to a sum of log probabilities.
> The n-gram model suffers from the data sparseness problem: an n-gram that does not occur in the training data is assigned zero probability, so even with a large training corpus many entries remain zero.
> A number of smoothing techniques exist to handle the data sparseness problem; the simplest among them is add-one smoothing. Smoothing in general refers to the task of re-evaluating zero-probability or low-probability n-grams and assigning them non-zero values.

Add-one smoothing
> This is the simplest smoothing technique. It adds a count of one to each n-gram's frequency before normalizing it into a probability, so no n-gram receives a zero conditional probability. For bigrams:

    P(wi | wi-1) = (C(wi-1 wi) + 1) / (C(wi-1) + V)

  where V is the vocabulary size. Equivalently, an n-gram with raw count c receives the adjusted count

    c* = (c + 1) × N / (N + V)

  where N is the total number of n-grams in the training corpus.

1.9 Paninian Framework (PG)

> Paninian grammar (PG) was written by Panini around 500 BC for Sanskrit. The framework can be used for other Indian languages, and possibly for some other Asian languages as well.
> Unlike English, Indian languages are SOV (Subject-Object-Verb) ordered and inflectionally rich. These inflections provide important syntactic and semantic cues for language understanding.
Some important features of Indian languages:
> Indian languages have traditionally been used for oral communication and the propagation of knowledge: texts in these languages were composed to be passed from the speaker's mind to the listener's mind.
> Indian languages have a relatively free word order: the word groups representing subject, object, verb, etc. can occur in various orders, since case endings and post-position markers, rather than position, identify the subject and object. For example, the Hindi sentence

    maan bachche ko roti khilaati hai
    (mother  child-to  bread  feeds)
    "Mother feeds bread to the child."

  retains its meaning under several reorderings of these word groups.
> Auxiliary verbs follow the main verb. In Hindi they are written as separate words, e.g., "khaa rahaa hai" (is eating), "khaa rahii thii" (was eating), while in Dravidian languages they combine with the main verb into a single word.
> In Indian languages the noun is followed by post-position case markers, instead of being preceded by prepositions.

Representation in PG
> The Paninian grammar framework is syntactico-semantic: analysis proceeds from the surface form towards semantics through intermediate layers. A sentence can be represented at four levels:
  1. Surface level - the actual spelling and word forms.
  2. Vibhakti level - vibhakti literally means inflection, but here it refers to word groups (noun, verb, or other) formed on the basis of case endings, post-positions, or main and auxiliary verbs. All Indian languages can be represented at the vibhakti level.
  3. Karaka level - the level of karaka relations, described in section 1.10.
  4. Semantic level - concerned with the meaning of units of language such as words, phrases, and sentences.
> The framework thus consists of multiple layers, with rules mapping the information at one layer to the next higher layer.

1.10 Karaka Theory

> Karaka (pronounced kaaraka) literally means "case". Paninian grammar has its own way of defining karaka relations: they are based on the ways in which word groups participate in the activity denoted by the verb. Karaka theory is the central theme of PG; karaka relations are syntactico-semantic relations between the verb and the other constituents of a sentence.
> These roles are reflected in case markers (language specific) and post-position markers (parsarg).
> The various karakas are: karta (subject/agent), karma (object), karana (instrument), sampradana (beneficiary), apadana (separation/source), and adhikarana (locus). These are given as examples and do not constitute a complete discussion of PG or karaka theory.
> To explain the various karaka relations, consider an example:

    maan bachche ko aangan mein haath se roti khilaati hai
    (mother  child-to  courtyard-in  hand-with  bread  feeds)
    "The mother feeds bread to the child by hand in the courtyard."

> The most important participant, the karta (agent), is the mother ("maan"). The karma is the object, the locus of the result of the activity: "roti" (bread). The karana is the instrument most essential for the activity: "haath" (hand). The sampradana is the beneficiary of the activity: "bachchaa" (the child). The adhikarana is the locus of the activity: "aangan" (the courtyard). Apadana denotes separation, and is not illustrated in this sentence.

Issues in Paninian Grammar
> Two issues are of particular interest:
  1. Computational implementation of PG.
  2. Adaptation of PG to Indian and other similar languages.

SJB Institute of Technology, BGS Health & Education City, Kengeri, Bengaluru-560 060
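The bigram estimation and add-one smoothing described in section 1.8.1 can be sketched in Python. The training sentences are the Arabian Knights example from the text; everything else (function names, the convention of using only a <s> start marker and excluding sentence-final words as contexts) is an assumption made for this illustration:

```python
from collections import Counter

# Training corpus from section 1.8.1; <s> marks the start of each sentence.
corpus = [
    "the arabian knights",
    "these are the fairy tales of the east",
    "the stories of the arabian knights are translated in many languages",
]

unigrams = Counter()  # counts of words *as bigram contexts*
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    # Only non-final tokens serve as contexts (no end-of-sentence marker).
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_add_one(word, prev, vocab_size):
    """Add-one smoothed estimate (C(prev word) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence, estimator=p_mle, **kw):
    """Bigram probability of a sentence under the chosen estimator."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= estimator(word, prev, **kw)
    return prob

test = "the arabian knights are the fairy tales of the east"
print(round(sentence_prob(test), 4))  # → 0.0053, as in the worked example

vocab = len({w for s in corpus for w in s.split()})  # V = 14 (<s> excluded)
print(sentence_prob("the arabian cats"))                        # 0.0 under MLE
print(sentence_prob("the arabian cats", p_add_one, vocab_size=vocab))  # small but nonzero
```

The last two lines show the data-sparseness problem and its add-one remedy: the unseen bigram (arabian, cats) zeroes out the MLE probability, while the smoothed estimate stays positive. In practice one would also accumulate log-probabilities instead of multiplying, as the text notes, to avoid underflow on long sentences.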
