Module_1_part1_NLP
Introduction
• Language Constructs
Theoretical linguistics
Computational linguistics
Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL Understanding is much harder than NL Generation, but
both of them are hard.
Why is NL Understanding hard?
• Natural language is extremely rich in form and structure, and very
ambiguous.
– How to represent meaning,
– Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can be at different
levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect the
meaning of that sentence.
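Syntactic ambiguity can be made concrete by counting how many parse trees a grammar assigns to one sentence. The sketch below is illustrative only: the toy grammar, the lexicon, and the function name count_parses are assumptions introduced here, not part of the course material. It uses a CYK-style chart over a grammar in Chomsky normal form.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (assumed for illustration).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # the PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}

def count_parses(words, start="S"):
    """Count the distinct parse trees for `words` under the toy grammar."""
    n = len(words)
    # chart[(i, j)][A] = number of ways category A derives words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    if left_count and right_count:
                        chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)].get(start, 0)

ambiguous = count_parses("I saw the man with the telescope".split())
unambiguous = count_parses("I saw the man".split())
print(ambiguous, unambiguous)
```

Under this grammar the first sentence receives two parses (the telescope belongs either to the seeing or to the man), while the shorter sentence receives one.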
• Computational models are classified into:
– Data driven
– Knowledge driven
Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the
sentence.
Challenges of NLP
Ambiguity
• Language (lexical, syntactic)
• Semantics (new words, new corpora, e.g., news)
• Quantifier scoping
• Word-level and sentence-level ambiguities
Languages and Grammar
• Language needs to be understood by Device instead of
Knowledge
• Grammar defines a language: it consists of a set of rules that
allow sentences in the language to be parsed and generated.
• Several grammar formalisms have been proposed:
transformational grammar (proposed by Chomsky), lexical
functional grammar, generalized phrase structure grammar,
dependency grammar, Paninian grammar, tree-adjoining
grammar, etc.
• Generative grammar often refers to a general framework:
it consists of a set of rules that specify or generate
the grammatical sentences of a language.
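The idea that a set of rules generates the sentences of a language can be sketched with a very small context-free grammar. Everything in the block (the rules, the vocabulary, and the function name generate) is a made-up example, chosen to be finite so that all sentences can be enumerated.

```python
from itertools import product

# Toy rewrite rules (assumed for illustration): symbol -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}

def generate(symbol="S"):
    """Return every word sequence derivable from `symbol`.

    Terminates only because the toy grammar has no recursive rules.
    """
    if symbol not in GRAMMAR:          # terminal symbol: a word
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        expansions = [generate(s) for s in rhs]
        for combo in product(*expansions):
            results.append([word for part in combo for word in part])
    return results

sentences = [" ".join(words) for words in generate()]
for sentence in sentences:
    print(sentence)
```

The grammar also produces strings such as "the dog sleeps the cat", because it does not record which verbs take an object. Rule sets for real languages need extra machinery to exclude such overgeneration.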
Syntactic Structure
• Deep Structure
• Surface Structure
Transformational grammar has three components:
1. Phrase structure grammar
2. Transformational rules (obligatory or optional)
3. Morphophonemic rules
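How the three components cooperate can be sketched on a single sentence. The block below is a heavily simplified illustration: the function names, the flat dictionary standing in for a deep structure, and the small participle table are all assumptions introduced here. It applies the optional passive transformation (NP1 - Aux - V - NP2 becomes NP2 - Aux - be - en - V - by - NP1) and then a morphophonemic rule that spells out "en + verb" as a participle.

```python
# Hypothetical table of irregular participles for the morphophonemic step.
PARTICIPLES = {"catch": "caught", "see": "seen", "take": "taken"}

def phrase_structure(subject, aux, verb, obj):
    """Stand-in for the phrase structure component: a flat
    deep-structure analysis NP1 - Aux - V - NP2."""
    return {"NP1": subject, "Aux": aux, "V": verb, "NP2": obj}

def active_string(deep):
    """No optional transformation applied: keep the order NP1 - Aux - V - NP2."""
    return [deep["NP1"], deep["Aux"], deep["V"], deep["NP2"]]

def passive_transformation(deep):
    """Optional transformation producing NP2 - Aux - be - en - V - by - NP1."""
    return [deep["NP2"], deep["Aux"], "be", "en", deep["V"], "by", deep["NP1"]]

def morphophonemic(morphemes):
    """Spell out the abstract sequence 'en' + verb as a participle form."""
    out = []
    i = 0
    while i < len(morphemes):
        if morphemes[i] == "en" and i + 1 < len(morphemes):
            verb = morphemes[i + 1]
            out.append(PARTICIPLES.get(verb, verb + "ed"))
            i += 2
        else:
            out.append(morphemes[i])
            i += 1
    return " ".join(out)

deep = phrase_structure("the police", "will", "catch", "the snatcher")
active = morphophonemic(active_string(deep))
passive = morphophonemic(passive_transformation(deep))
print(active)
print(passive)
```

Both surface sentences come from the same deep structure; only the choice to apply the optional transformation differs.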
Morphophonemic rules
Processing Indian Languages
• Unlike English, Indic scripts have a non-linear structure.
• Indian languages:
– have SOV (subject-object-verb) as the default sentence structure
– have free word order
– spelling standardization is more subtle in Hindi
– make extensive and productive use of complex predicates
– use verb complexes consisting of sequences of verbs
Early NLP Systems
• ELIZA
• SysTran
• TAUM METEO
• SHRDLU
• LUNAR
Information Retrieval
• "Information" here is to be distinguished from its use in
information theory, where it is measured in terms of entropy.
• IR helps to retrieve relevant information. Information is
always associated with some form of data: text, numbers, images, and so on.
• As a cognitive activity, the word ‘retrieval’ refers to the
operation of accessing information from memory or
from some computer-based representation.
• Retrieval requires the information to be stored and
processed. IR deals with all these facets: it is concerned with the
organization, storage, retrieval, and evaluation of
information relevant to the query.
• IR deals with unstructured data; retrieval is
performed on the content of the document rather
than its structure.
• IR components have traditionally been incorporated
into different types of information systems, including
DBMS, bibliographic text retrieval, QA, and search
engines.
Current Approaches:
• Topic Hierarchy (eg: Yahoo)
• Rank the retrieved documents
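Ranking retrieved documents can be illustrated with a simple TF-IDF score. This is a minimal sketch under stated assumptions: whitespace tokenization, a smoothed inverse document frequency, and the function names tokenize and rank are choices made here for illustration, and TF-IDF is only one of many possible ranking schemes.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()

def rank(query, docs):
    """Return (doc_index, score) pairs for matching documents, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scored = []
    for idx, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in term_freq:
                # Smoothed IDF (assumed variant): rarer terms weigh more.
                idf = math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1
                score += (term_freq[term] / len(tokens)) * idf
        if score > 0:
            scored.append((idx, score))
    # Highest score first; ties broken by document order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

docs = [
    "information retrieval ranks documents",
    "grammar rules generate sentences",
    "retrieval of relevant information from documents",
]
ranking = rank("information retrieval", docs)
print([idx for idx, _ in ranking])
```

The first document ranks above the third because the query terms make up a larger share of its text. The second document shares no query term and is not returned.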
Major Issues in IR
• Representation of a document (most of the
documents are keyword based)
• Problems with polysemy, homonymy, and
synonymy
• Keyword-based retrieval
• Inappropriate characterization of queries
• Document type and document size are also
major issues
• Understanding relevance
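The synonymy and polysemy/homonymy problems of keyword-based retrieval can be shown with plain keyword matching. The documents, queries, and the function name keyword_match below are invented for illustration.

```python
def keyword_match(query, docs):
    """Return indices of documents sharing at least one word with the query."""
    query_terms = set(query.lower().split())
    return [i for i, doc in enumerate(docs)
            if query_terms & set(doc.lower().split())]

docs = [
    "the car was parked near the river bank",   # "bank" as a riverside
    "the bank approved the loan",               # "bank" as a financial institution
    "an automobile needs regular service",      # "automobile" is a synonym of "car"
]

car_hits = keyword_match("car", docs)    # synonymy: document 2 is missed
bank_hits = keyword_match("bank", docs)  # homonymy: both senses are returned
print(car_hits, bank_hits)
```

The query "car" fails to retrieve the document about an automobile, and the query "bank" retrieves documents about two unrelated senses. Both are cases where matching keywords does not coincide with relevance.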