NLP

The document provides an introduction to Natural Language Processing (NLP) and its applications, including text summarization, sentiment analysis, and chatbots. It discusses the Natural Language Toolkit (nltk) in Python, which offers tools and datasets for processing human language data. Additionally, it covers concepts like tokenization, bag-of-words models, and sentiment analysis using a Naive Bayes Classifier.

Uploaded by

vagifsamadov2003
Copyright: © All Rights Reserved

11/23/2023

Natural Language Processing: nltk

Intro to Natural Language Processing (NLP)

▪ Algorithms to analyze, understand and derive meaning from human language
▪ Hard computational problem because human language is ambiguous, needs context and ability to link concepts
▪ Applications: summarize text, generate keywords, identify sentiment of text


Real Life examples of NLP

▪ Speech recognition engines like Siri, Google Now or Alexa
▪ Automatic translation like Google Translate or Facebook's automatic translation of statuses
▪ Chat bots that can answer questions via Facebook Messenger, for example those provided by TechCrunch, Disney or Whole Foods

nltk

Natural Language Toolkit in Python

▪ Work with human language data
▪ Includes over 50 datasets
▪ Complete library of easy-to-use algorithms for processing text
▪ Available for free under an open source license

http://nltk.org


Natural Language Processing: nltk corpora

nltk corpora

A corpus (plural: corpora) is a collection of text in digital form, assembled for text processing.

nltk provides a download interface to pre-processed text datasets.


List of nltk corpora

nltk movie reviews corpus

nltk.download("movie_reviews")

ls nltk_data/corpora/movie_reviews

README neg pos

2000 files:

▪ 1000 positive reviews in the pos/ folder
▪ 1000 negative reviews in the neg/ folder


▪ average 800 words per review

Natural Language Processing: tokenize


Tokenization

The first step in analyzing text is to split it into words: tokenization.

Corner cases:

▪ punctuation
▪ contractions
▪ hyphenated words

Example: "New York-based"

First Attempt without nltk

Naively just split on whitespace

See Tokenize text in words
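
As a minimal sketch of this first attempt, the snippet below splits a sentence on whitespace with Python's built-in str.split. The sample sentence is made up for illustration; it shows the corner cases listed above.

```python
# Naive tokenization: split on whitespace only.
# The sample sentence is illustrative, not taken from the corpus.
text = "The New York-based critic didn't like it, sadly."
tokens = text.split()
print(tokens)
# Punctuation stays attached to its neighbour ("it,", "sadly.") and the
# contraction "didn't" is kept as one token.
```

The glued punctuation is the main weakness of this approach: "it," and "it" would be counted as two different words.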


Tokenize with nltk

nltk.word_tokenize

A more sophisticated tokenizer for English; it requires the punkt resource to be downloaded first.

It also correctly separates punctuation from words.
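
nltk.word_tokenize needs downloaded data before it runs, and the exact resource name may differ between nltk releases. The sketch below therefore swaps in nltk's TreebankWordTokenizer, which needs no download, only to illustrate the kind of splitting a dedicated tokenizer performs. It is an assumption-laden stand-in and is not claimed to match word_tokenize in every case; the sample sentence is made up.

```python
from nltk.tokenize import TreebankWordTokenizer

# Stand-in for nltk.word_tokenize that works without downloading any data.
text = "The New York-based critic didn't like it, sadly."
tokens = TreebankWordTokenizer().tokenize(text)
print(tokens)
# The comma and the final period become separate tokens, the contraction is
# split into "did" and "n't", and the hyphenated "York-based" stays whole.
```

Compared with splitting on whitespace, punctuation no longer pollutes the word tokens.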

Natural Language Processing: build a bag-of-words model


Bag-of-words Model

Bag-of-words = text as an unordered collection of words

▪ simple model
▪ discards sentence structure
▪ useful to identify topic or sentiment

Building Features with Words

          outstanding  movie  family  worse  uninvolving  interesting
Review 1  True         True   False   False  False        False
Review 2  False        True   False   True   True         False
Review 3  True         True   True    False  False        False
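
One way to build such a row of features is sketched below. The function name bag_of_words_features and the fixed VOCABULARY list are hypothetical, chosen to mirror the table's columns; the sample review text is made up.

```python
# Hypothetical vocabulary, mirroring the columns of the table above.
VOCABULARY = ["outstanding", "movie", "family", "worse", "uninvolving", "interesting"]

def bag_of_words_features(words, vocabulary=VOCABULARY):
    """Map each vocabulary word to True/False depending on its presence."""
    present = {word.lower() for word in words}
    return {term: term in present for term in vocabulary}

# Illustrative review; word order and repetition are discarded.
review_1 = "An outstanding movie , truly outstanding".split()
features = bag_of_words_features(review_1)
print(features)
```

With nltk classifiers it is also common to store only the words that are present, for example {word: True for word in words}, and leave absent words out of the dictionary.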


Filter out Stopwords and Punctuation

• The movie_reviews tokenized words also include punctuation and stopwords.

• Stopwords are very common words that carry little meaning on their own, like "the", "is", "which".
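
A minimal sketch of such a filter follows. The tiny STOPWORDS set and the token list are made up for illustration; nltk ships a much longer English list in nltk.corpus.stopwords, which has to be downloaded before use.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "is", "which", "a", "of"}
PUNCTUATION = set(string.punctuation)

def filter_words(tokens):
    """Drop stopwords (case-insensitive) and single punctuation characters."""
    return [t for t in tokens if t.lower() not in STOPWORDS and t not in PUNCTUATION]

tokens = ["The", "movie", "is", "outstanding", ",", "which", "surprised", "me", "."]
filtered_words = filter_words(tokens)
print(filtered_words)
```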

Natural Language Processing: Plotting Frequency of Words


Number of Words in Movie Reviews Corpus

▪ ~1.6 million words
▪ just 710 thousand after filtering punctuation and stopwords

Using Counter

▪ Part of the collections module in the Python Standard Library
▪ Counts how many times an item is repeated

from collections import Counter

counter = Counter(filtered_words)
counter["movie"]
5771
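
For a self-contained illustration of the same calls, the sketch below counts a short made-up word list instead of the filtered corpus and also asks for the most frequent words with most_common.

```python
from collections import Counter

# Made-up stand-in for the filtered corpus words.
sample_words = ["movie", "plot", "movie", "actor", "plot", "movie"]

sample_counter = Counter(sample_words)
movie_count = sample_counter["movie"]      # occurrences of one word
top_two = sample_counter.most_common(2)    # most frequent words first
print(movie_count, top_two)
```

A word that never occurs simply has a count of 0, so no key check is needed before the lookup.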


Plotting Word Frequency

Histogram of Word Counts

▪ Use hist from matplotlib to create a histogram
▪ Choose bin number and optionally log axes
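
A possible way to draw such a histogram is sketched below, under the assumption that the per-word counts come from a Counter. The word list, the bin number and the axis labels are made up for illustration.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Made-up stand-in for the filtered corpus words.
sample_words = ["movie", "plot", "movie", "actor", "plot", "movie", "score"]
word_counts = list(Counter(sample_words).values())  # one count per distinct word

fig, ax = plt.subplots()
# bins sets the number of bins; log=True puts the vertical axis on a log scale.
bin_heights, bin_edges, _ = ax.hist(word_counts, bins=3, log=True)
ax.set_xlabel("Occurrences of a word")
ax.set_ylabel("Number of distinct words (log scale)")
```

On a real corpus most words occur only a few times, which is why a log axis usually makes the plot easier to read.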


Natural Language Processing: Sentiment Analysis

What is Sentiment Analysis

▪ Identify the attitude or emotion expressed in a text
▪ Can be implemented as a Machine Learning classifier
▪ Example: a prediction based on which words appear in a review


Build features/label pairs

• The function implemented previously creates a set of features from the words of a review.

• Pair each review's features with its positive/negative label.
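
The pairing can be sketched as follows. The function name build_features and the two toy reviews are hypothetical stand-ins for the feature function and the corpus used in the slides.

```python
def build_features(words):
    """Hypothetical feature function: mark every word in the review as present."""
    return {word: True for word in words}

# Toy reviews (already tokenized and filtered) with their known labels.
toy_reviews = [
    (["outstanding", "movie"], "pos"),
    (["uninvolving", "plot"], "neg"),
]

# One (features, label) pair per review, the input shape nltk classifiers expect.
labeled_features = [(build_features(words), label) for words, label in toy_reviews]
print(labeled_features)
```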

Naive Bayes Classifier

Naive Bayes Classifier is a simple classifier based on conditional probabilities.

In the training phase, it estimates the probability that each feature (word) appears in a category (positive or negative).

Once trained, it collects the "votes" of all the words in a new review and picks the most probable label.
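
A minimal sketch of training and using nltk's NaiveBayesClassifier is shown below. The four training examples are made up and far too few for a real model; in practice the (features, label) pairs would come from the movie reviews corpus.

```python
from nltk.classify import NaiveBayesClassifier

# Made-up training data: (features, label) pairs with present words set to True.
train_set = [
    ({"outstanding": True, "movie": True}, "pos"),
    ({"interesting": True, "family": True}, "pos"),
    ({"worse": True, "uninvolving": True, "movie": True}, "neg"),
    ({"worse": True, "boring": True}, "neg"),
]

classifier = NaiveBayesClassifier.train(train_set)

# Classify new, unseen combinations of words.
positive_guess = classifier.classify({"outstanding": True, "interesting": True})
negative_guess = classifier.classify({"worse": True, "uninvolving": True})
print(positive_guess, negative_guess)
```

Each word in the new review pulls the result toward the label it was seen with during training, which is the "voting" described above.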
