Week 6: Introduction to Natural Language Processing

This document provides an overview of natural language processing (NLP). It begins with describing NLP and some common NLP tasks like sentiment analysis and machine translation. It then discusses why NLP is important and challenging. The document outlines several preprocessing steps for text data, including tokenization, stop word removal, stemming/lemmatization. It also describes common word features used in NLP like bag-of-words and tf-idf vectors. Finally, it briefly introduces N-gram language models.


Week 6

Introduction to Natural Language Processing


Session Agenda
● Basics of NLP
● Pre-processing steps
● Language Models
● Case Study
Natural Language Processing
1. Natural Language Processing is a subfield of artificial
intelligence concerned with methods of communication
between computers and natural languages such as
English, Hindi, etc.
2. The objective of Natural Language Processing is to perform
useful tasks involving human languages, such as
○ Sentiment Analysis
○ Machine Translation
○ Part-of-Speech Tagging
○ Human-machine communication (chatbots)
Why study NLP?
1. Language is involved in most activities that require interaction between humans, e.g. reading, writing,
speaking, listening.
2. Voice can be used as an interface for interaction between humans and machines, e.g. Cortana, Google Assistant,
Siri, Amazon Alexa.
3. There is a massive amount of data available in text format (blogs, research articles, consumer reviews,
literature, discussion forums) from which insights can be derived using NLP.
Different Tasks in NLP
● Text Classification
○ Sentiment Analysis: determining the overall
polarity of a review, whether it is positive,
negative, or neutral.
○ Consumer Complaints Classification: routing
complaints on consumer forums to the respective
departments.
● Machine Translation
○ Improving human-human interaction by translating
sentences from one language to another.
Different Tasks in NLP
● Part of Speech Tagging
○ In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called
grammatical tagging or word-category disambiguation, is the process of marking up a word in a text
(corpus) as corresponding to a particular part of speech, based on both its definition and its context.
○ A simplified form of this is the identification of words as nouns, verbs, adjectives, adverbs, etc.
○ Tagset: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Different Tasks in NLP
● Word Segmentation
○ In some languages, there is no space between words, or a word may contain smaller syllables. In such
languages, word segmentation is the first step of NLP systems.
● Semantic Analysis
○ Semantic analysis of a corpus (a large and structured set of texts) is the task of building structures that
approximate concepts from a large set of documents.
○ Application of Semantic Analysis:
■ Text Similarity
■ Context Recognition
■ Sentence Parsing
■ Topic Modelling
Why is NLP hard?
1. Languages are changing every day: new words, new rules, etc.
2. The number of tokens is not fixed. A natural language can have hundreds of thousands of different
words, and new words are created on the fly.
3. Words can have different meanings depending on context, and they can acquire new meanings over
time (apple (a fruit) vs. Apple (the company)); they can even change their part of speech (Google --> to
google).
4. Every language has its own uniqueness. In English we have words, sentences,
paragraphs and so on to delimit our language, but in Thai there is no concept of sentences.
Pre-processing Steps
Tokenization
● Tokenization is the task of taking a text or set of texts and breaking it up into individual tokens.
● Tokens are usually individual words (at least in languages like English).
● Tokenization can be achieved using different methods. The most common methods are the whitespace tokenizer and
the regexp tokenizer. We will use them in our case study.
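The two methods above can be sketched with Python's standard `re` module; the sample sentence and the regexp pattern are illustrative choices, not the exact ones from the case study.

```python
import re

text = "NLP is fun, isn't it?"

# Whitespace tokenization: split on runs of whitespace only.
# Punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Regexp tokenization: keep alphanumeric runs (plus apostrophes),
# dropping punctuation entirely.
regexp_tokens = re.findall(r"[A-Za-z0-9']+", text)

print(whitespace_tokens)  # ['NLP', 'is', 'fun,', "isn't", 'it?']
print(regexp_tokens)      # ['NLP', 'is', 'fun', "isn't", 'it']
```

Note how the choice of tokenizer already changes the vocabulary: the whitespace tokenizer treats "fun," and "fun" as different tokens, while the regexp tokenizer does not.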
Stop Words Removal
● Stopwords are common words that carry less
important meaning than keywords.
● When using bag-of-words based methods,
e.g. CountVectorizer or tf-idf, which work on the counts
and frequencies of words, removing stopwords
helps because it lowers the dimensionality of the feature space.
● Not always a good idea?
○ When working on problems where
contextual information is important, like
machine translation, removing stop words
is not recommended.
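A minimal sketch of stop-word filtering; the stop list here is a tiny illustrative subset, not the full list shipped by libraries such as NLTK or scikit-learn.

```python
# Tiny illustrative stop list (real lists have 100+ entries).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def remove_stop_words(tokens):
    # Compare case-insensitively so "The" is filtered like "the".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "movie", "is", "a", "masterpiece"]
print(remove_stop_words(tokens))  # ['movie', 'masterpiece']
```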
Stemming and Lemmatization
● The idea of reducing different forms of a word to a core root.
● Words that are derived from one another can be mapped to a central word or symbol, especially
if they have the same core meaning.
● In stemming, words are reduced to their word stems. A word stem is a form equal to or smaller
than the word itself.
● “cook,” “cooking,” and “cooked” are all reduced to the same stem, “cook.”
● Lemmatization involves resolving words to their dictionary form. A lemma of a word is its
dictionary or canonical form!
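The "cook" example can be reproduced with a toy suffix-stripping stemmer; this is purely for illustration, as real stemmers (e.g. the Porter stemmer available in NLTK) apply far more careful rules.

```python
# Toy stemmer: strip a few common suffixes, but only if a stem of at
# least three characters remains. Not a substitute for a real stemmer.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["cook", "cooking", "cooked"]:
    print(w, "->", crude_stem(w))  # all three reduce to 'cook'
```

Lemmatization, by contrast, needs dictionary knowledge (e.g. mapping "better" to "good"), which suffix stripping alone cannot provide.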
Word Features
Bag of Words
● In this model, a text (such as a sentence or a document) is represented
as the bag of its words, disregarding grammar and even word order
but keeping multiplicity.
● We use the tokenized words for each observation and find out the
frequency of each token.
● We define the vocabulary of the corpus as all the unique words in the
corpus whose frequency falls within chosen thresholds.
● Each sentence or document is then represented by a vector of the same
dimension as the vocabulary, containing the frequency of each word of the
vocabulary in the sentence.
● The bag-of-words model is commonly used in methods of document
classification where the (frequency of) occurrence of each word is
used as a feature for training a classifier.
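The steps above can be sketched in a few lines of plain Python (no frequency thresholding here, for brevity); the two-document corpus is an illustrative assumption.

```python
# Bag of words: vocabulary = unique tokens in the corpus,
# each document = a vector of per-term counts over that vocabulary.
corpus = [
    ["i", "love", "this", "movie"],
    ["i", "hate", "this", "movie"],
]

# Sort for a stable, reproducible ordering of vector dimensions.
vocabulary = sorted({token for doc in corpus for token in doc})

def bow_vector(doc):
    return [doc.count(term) for term in vocabulary]

print(vocabulary)             # ['hate', 'i', 'love', 'movie', 'this']
print(bow_vector(corpus[0]))  # [0, 1, 1, 1, 1]
```

Word order is lost, as the definition says: "love this movie" and "this movie love" produce the same vector.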
Tf-idf Vectors
● Tf-idf (term frequency times inverse document frequency) is a scheme for weighting individual tokens.
● One advantage of tf-idf is that it reduces the impact of tokens that occur very frequently and hence
offer little to no information.
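A sketch of the textbook weighting tf × log(N / df); note that libraries such as scikit-learn use smoothed variants of this formula, so their exact numbers differ.

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "the", "cat"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                          # term frequency in this doc
    df = sum(1 for d in docs if term in d)        # document frequency
    return tf * math.log(N / df)

# 'the' appears in every document, so its idf (and weight) is zero.
print(tf_idf("the", docs[0]))            # 0.0
print(round(tf_idf("cat", docs[0]), 3))  # positive: 'cat' is rarer
```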
N-gram and Language Model
● Language models are models that assign probabilities to sequences of words.
● The N-gram is the simplest language model. An N-gram is a sequence of N words.
● The bi-gram is the special case of the N-gram where we consider only sequences of two words (Markov
assumption).
● In N-gram models we calculate the probability of the Nth word given the preceding N-1 words. We do this by
calculating the relative frequency with which the sequence occurs in the text corpus.
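The relative-frequency estimate above can be sketched for the bi-gram case, i.e. P(wₙ | wₙ₋₁) = count(wₙ₋₁, wₙ) / count(wₙ₋₁); the tiny corpus is an illustrative assumption.

```python
from collections import Counter

corpus = "i like nlp i like ml i love nlp".split()

# Count adjacent word pairs and the words in the history position.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

# 'i' occurs 3 times as a history word, followed by 'like' twice.
print(bigram_prob("i", "like"))  # 0.666...
```

A real model would also add sentence-boundary markers and smoothing for unseen bigrams, which this sketch omits.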
Thank you! :)
Questions are always welcome
