
Module_1_part1_NLP

Natural Language Processing (NLP) focuses on developing computational models for understanding and generating human language, with applications in areas like machine translation and speech recognition. The field faces challenges such as ambiguity and the complexity of language structure, requiring knowledge of various linguistic components like syntax and semantics. Historical approaches include rationalist and empiricist methods, and NLP is crucial for information retrieval, which involves organizing and accessing relevant data.

Uploaded by

shreekd2004

Unit-1

Introduction

Reference: Natural Language Processing and Information Retrieval, by Tanveer Siddiqui and U. S. Tiwary
What is Natural Language Processing (NLP)?
NLP is concerned with the development of computational
models of aspects of human language processing.

Reasons for Developing NLP

• To develop automated tools for language processing


• To gain a better understanding of human
communication
NLP field
• Primarily concerned with getting computers to
perform useful and interesting tasks with
human languages.
• Secondarily concerned with helping us come
to a better understanding of human language.

Major Historical Approaches to NLP

• Rationalist Approach
• Empiricist Approach
Origins of NLP
• NLP, earlier termed NLU, originated from machine
translation, but NLP involves both NLU and
NLG (Natural Language Understanding and
Natural Language Generation).

• Language Constructs
– Theoretical linguistics
– Computational linguistics
Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL Understanding is much harder than NL Generation, but
both of them are hard.
Why is NL Understanding hard?
• Natural language is extremely rich in form and structure, and very
ambiguous.
– How to represent meaning,
– Which structures map to which meaning structures.

• One input can mean many different things. Ambiguity can be at different
levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect the
meaning of that sentence.

• Many inputs can mean the same thing.

• Interaction among components of the input is not clear.
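
Syntactic ambiguity can be made concrete by counting parse trees. The sketch below assumes a toy grammar and lexicon (both invented for illustration, not taken from the reference text) and uses a CKY-style chart to count how many ways a sentence can be parsed.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption made for this sketch).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "NP", "PP"),
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}

def count_parses(words):
    """Count the distinct parse trees rooted in S for a list of words."""
    n = len(words)
    # chart[i][j][A] = number of ways category A derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[i][i + 1][tag] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[i][j][parent] += chart[i][k][left] * chart[k][j][right]
    return chart[0][n]["S"]

print(count_parses("I saw the man with the telescope".split()))
```

Under this toy grammar the sentence receives two parses: the prepositional phrase "with the telescope" attaches either to the verb phrase or to the noun phrase "the man".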

• Computational models are classified into:
– Data-driven
– Knowledge-driven

• As part of information retrieval, the "information" to be extracted can be speech, images,
or text.

• Language is the medium of expression in which knowledge is deciphered. The medium of
expression is the outer form of the content it expresses.
Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech

• To process written text, we need:
– lexical, syntactic, and semantic knowledge about the language
– discourse information and real-world knowledge

• To process spoken language, we need everything
required to process written text, plus the challenges of
speech recognition and speech synthesis.
Levels of Language Analysis
• Lexical analysis
• Syntax analysis
• Semantic analysis
• Discourse analysis
• Pragmatic analysis
Knowledge of Language
• Phonology – concerns how words are related to the sounds
that realize them.
• Morphology – concerns how words are constructed from
more basic meaning units called morphemes. A morpheme is
the primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form correct
sentences and determines what structural role each word
plays in the sentence and what phrases are subparts of other
phrases.
• Semantics – concerns what words mean and how these
meanings combine in sentences to form sentence meaning.
The study of context-independent meaning.
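
As a small illustration of morphology, the sketch below splits a word into morphemes by lookup. The stem list and affix lists are hypothetical and tiny; a real morphological analyser would need a full lexicon and spelling rules.

```python
# Hypothetical, deliberately tiny inventories (assumptions for this sketch).
STEMS = {"kind", "play", "walk", "read"}
PREFIXES = ["un", "re"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def segment(word):
    """Split a word into at most one prefix, a known stem, and at most one suffix."""
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if word.startswith(prefix) and word.endswith(suffix):
                stem = word[len(prefix):len(word) - len(suffix)]
                if stem in STEMS:
                    parts = [prefix and prefix + "-", stem, suffix and "-" + suffix]
                    return [p for p in parts if p]
    return [word]  # no analysis found: treat the word as a single morpheme

print(segment("unkindness"))
print(segment("replayed"))
```

Requiring the remaining stem to be in the lexicon keeps the sketch from splitting "reading" into "re-" plus "ading".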

Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the
sentence.

• Discourse – concerns how the immediately preceding
sentences affect the interpretation of the next sentence.
For example, interpreting pronouns and interpreting the
temporal aspects of the information.

• World Knowledge – includes general knowledge about the
world, and what each language user must know about the other’s
beliefs and goals.

Challenges of NLP
Ambiguity
• Language (lexical, syntactic)
• Semantics (new words, new corpora, e.g., news)
• Quantifier scoping
• Word-level and sentence-level ambiguities
Languages and Grammar
• Language needs to be understood by a device (a computer), not
only by people who already know it.
• A grammar defines a language: it consists of a set of rules that
allows sentences in the language to be parsed and generated.
• Several grammars have been proposed, including
transformational grammar (proposed by Chomsky), lexical
functional grammar, generalized phrase structure grammar,
dependency grammar, Paninian grammar, and tree adjoining
grammar.
• Generative grammar often refers to a general
framework: a set of rules that specifies or generates the
grammatical sentences of a language.
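
The idea that a set of rules generates the sentences of a language can be sketched in a few lines. The grammar below is an invented toy example, and the generator assumes the rules are non-recursive so that the language is finite.

```python
import itertools

# Toy phrase structure rules (assumed for illustration only).
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"]],
    "N": [["girl"], ["ball"]],
    "V": [["sees"]],
}

def generate(symbol):
    """Yield every word sequence derivable from symbol (non-recursive grammars only)."""
    if symbol not in RULES:  # a terminal word
        yield [symbol]
        return
    for right_hand_side in RULES[symbol]:
        expansions = [list(generate(s)) for s in right_hand_side]
        for combination in itertools.product(*expansions):
            yield [word for part in combination for word in part]

sentences = [" ".join(words) for words in generate("S")]
for sentence in sentences:
    print(sentence)
```

Note that the rules only guarantee grammatical form: the output includes "the ball sees the girl", which is well formed but semantically odd.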
Syntactic Structure

Each sentence in a language has two levels of representation, namely:

• Deep Structure
• Surface Structure

“Mapping from deep structure to surface structure is carried out by transformations”.
Example
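
As a loose illustration of the mapping from deep structure to surface structure (not Chomsky's actual formalism), the sketch below stores one underlying representation and derives an active and a passive surface sentence from it. The sentence, the dictionary layout, and the verb-form tables are all assumptions made for this example.

```python
# Hypothetical deep structure: who does what to what.
deep = {"agent": "Pooja", "verb": "play", "theme": "the veena"}

# Illustrative verb-form lookups; a real system would use morphological rules.
THIRD_SINGULAR = {"play": "plays"}
PAST_PARTICIPLE = {"play": "played"}

def active(d):
    """Surface form with the agent as subject."""
    return f"{d['agent']} {THIRD_SINGULAR[d['verb']]} {d['theme']}"

def passive(d):
    """Surface form produced by a passive transformation: the theme becomes subject."""
    theme = d["theme"]
    return f"{theme[0].upper()}{theme[1:]} is {PAST_PARTICIPLE[d['verb']]} by {d['agent']}"

print(active(deep))
print(passive(deep))
```

Both surface sentences share the same deep structure; only the transformation applied differs.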
Transformational Grammar
• Introduced by Chomsky in 1957

It has three components:
1. Phrase structure grammar
2. Transformational rules (obligatory or optional)
3. Morphophonemic rules
Morphophonemic rules
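
Morphophonemic rules turn a sequence of morphemes into actual word forms. A minimal sketch, assuming the affix label "en" stands for the past participle and using a small invented table of irregular forms:

```python
# Irregular realizations (illustrative entries only).
IRREGULAR = {
    ("catch", "en"): "caught",
    ("be", "en"): "been",
    ("hit", "en"): "hit",
}
# Default spelling of each abstract affix for regular verbs (an assumption).
REGULAR_SUFFIX = {"en": "ed", "ing": "ing", "s": "s"}

def realize(stem, affix):
    """Map a stem + affix pair to its surface word form."""
    if (stem, affix) in IRREGULAR:
        return IRREGULAR[(stem, affix)]
    return stem + REGULAR_SUFFIX[affix]

print(realize("catch", "en"))
print(realize("play", "en"))
```

Irregular pairs are checked first, so "catch + en" yields "caught" while the regular "play + en" falls through to "played".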
Processing Indian Languages
• Unlike English, Indic scripts have a non-linear structure.
• Indian languages:
– have SOV (subject-object-verb) as the default sentence structure
– have free word order
– spelling standardization is more subtle in Hindi
– make extensive and productive use of complex predicates
– use verb complexes consisting of sequences of verbs

• Paninian grammar provides a framework for Indian
language models that can be used for the computational
processing of Indian languages. The grammar focuses on
extracting Karaka relations from a sentence.
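
The sketch below is a heavily simplified illustration of why Karaka relations suit free word order: roles are read from postposition markers rather than from position. The romanized Hindi sentence and the marker-to-role table are assumptions; in a real Paninian analysis the mapping also depends on the verb and its tense and aspect.

```python
# Simplified, assumed mapping from postposition marker to karaka role.
MARKER_TO_KARAKA = {"ne": "karta", "ko": "karma"}

def karaka_relations(tokens):
    """Return ({role: noun}, verb) for a romanized sentence with a clause-final verb."""
    relations = {}
    for noun, marker in zip(tokens, tokens[1:]):
        if marker in MARKER_TO_KARAKA:
            relations[MARKER_TO_KARAKA[marker]] = noun
    return relations, tokens[-1]

# Two word orders of the same sentence ("Ram ate the fruit").
first = karaka_relations("ram ne phal ko khaya".split())
second = karaka_relations("phal ko ram ne khaya".split())
print(first)
print(first == second)
```

Both word orders yield the same relations, which is the property a position-based analysis of English-style SVO order would not give.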
NLP APPLICATIONS
• Machine Translation
• Speech Recognition
• Speech Synthesis
• Information Retrieval
• Information Extraction
• Question Answering
• Text Summarization
• Natural Language Interfaces to Data Bases
Some Successful Early NLP Systems

• ELIZA
• SysTran
• TAUM METEO
• SHRDLU
• LUNAR
Information Retrieval
• The term 'information' is hard to define precisely; information
theory characterizes it in terms of entropy.
• IR helps to retrieve relevant information. Information is
always associated with some data: text, numbers, images, and so on.
• As a cognitive activity, the word ‘retrieval’ refers to the
operation of accessing information from memory or
from some computer-based representation.
• Retrieval needs the information to be stored and
processed. IR deals with all these facets: it is concerned with the
organization, storage, retrieval, and evaluation of
information relevant to the query.
• IR deals with unstructured data; retrieval is
performed on the content of the document rather
than its structure.
• IR components have traditionally been incorporated
into different types of information systems, including
DBMSs, bibliographic text retrieval systems, question answering (QA)
systems, and search engines.
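
The entropy mentioned in the first point above can be computed directly. As a minimal sketch, assuming symbol probabilities are estimated from relative frequencies, Shannon entropy is H = -Σ p(x) log2 p(x):

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy, in bits per symbol, of a sequence of symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy("aabb"))
print(entropy("abcd"))
```

A sequence with two equally likely symbols, such as "aabb", has an entropy of 1 bit per symbol; four equally likely symbols give 2 bits.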

Current Approaches:
• Topic hierarchy (e.g., Yahoo)
• Rank the retrieved documents
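
Ranking can be sketched with a simple TF-IDF score. This is one common weighting scheme chosen for illustration, not necessarily the one used in the reference text; the whitespace tokenizer and the function name are assumptions.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return document indices sorted by a simple TF-IDF score, best first."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = query.lower().split()

    def score(tokens):
        tf = Counter(tokens)
        return sum(tf[t] * math.log(n / df[t]) for t in query_terms if t in df)

    return sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)

docs = [
    "machine translation of natural language",
    "speech recognition and speech synthesis",
    "information retrieval ranks documents",
]
print(rank("speech recognition", docs))
```

Terms that appear in every document get a weight of log(1) = 0, so very common words contribute nothing to the ranking.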
Major Issues in IR
• Representation of a document (most
documents are keyword based)
• Problems with polysemy, homonymy, and
synonymy
• Keyword-based retrieval
• Inappropriate characterization of queries
• Document type and document size are also
major issues
• Understanding relevance
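
The synonymy and polysemy problems of keyword-based retrieval can be seen in a small sketch. The synonym table is a hypothetical stand-in for a thesaurus, and query expansion is shown only as one possible remedy.

```python
# Hypothetical thesaurus used for query expansion.
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

def keyword_match(query, doc):
    """True if the query and the document share at least one word."""
    return bool(set(query.lower().split()) & set(doc.lower().split()))

def expanded_match(query, doc):
    """Keyword match after adding known synonyms to the query terms."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return bool(terms & set(doc.lower().split()))

# Synonymy: a relevant document is missed by plain keyword matching.
print(keyword_match("car", "automobile repair manual"))
print(expanded_match("car", "automobile repair manual"))

# Polysemy: both senses of "bank" match, although only one may be relevant.
print(keyword_match("bank", "river bank erosion"))
print(keyword_match("bank", "bank interest rates"))
```

Synonymy lowers recall because relevant documents use different words; polysemy lowers precision because the same word retrieves documents about unrelated senses.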
