NLP
11/23/2023
nltk
http://nltk.org
nltk corpora
nltk.download("movie_reviews")
ls nltk_data/corpora/movie_reviews
2000 files:
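A minimal sketch for checking the file count without leaving Python. It assumes the corpus was already downloaded and that the reviews sit in per-category subdirectories (for movie_reviews, `neg` and `pos`). The helper name `count_files_by_category` is hypothetical and not part of nltk.

```python
from pathlib import Path


def count_files_by_category(corpus_dir):
    """Return {subdirectory name: number of files} for a corpus directory.

    Hypothetical helper (not an nltk function). Files that sit directly
    in corpus_dir, such as a README, are ignored.
    """
    counts = {}
    for sub in sorted(Path(corpus_dir).iterdir()):
        if sub.is_dir():
            counts[sub.name] = sum(1 for f in sub.iterdir() if f.is_file())
    return counts
```

Usage, assuming the default download location: `count_files_by_category(Path.home() / "nltk_data" / "corpora" / "movie_reviews")`. The values should sum to the 2000 files mentioned above.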
Tokenization
Corner cases:
▪ punctuation
▪ contractions
▪ hyphenated words
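The three corner cases can be illustrated with a small regex tokenizer. This is a sketch using only the standard library, not how nltk tokenizes: `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names, and the choice to keep contractions and hyphenated words as single tokens is an assumption made for illustration.

```python
import re

# Alternatives are tried left to right, so the most specific pattern comes first.
TOKEN_PATTERN = re.compile(
    r"""
    \w+(?:-\w+)+      # hyphenated words, e.g. well-known
  | \w+(?:'\w+)?      # plain words, optionally with a contraction suffix, e.g. isn't
  | [^\w\s]           # any single punctuation character
    """,
    re.VERBOSE,
)


def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("It's a well-known film, isn't it?")
# tokens == ["It's", 'a', 'well-known', 'film', ',', "isn't", 'it', '?']
```

Other tokenizers make different choices for the same input, for example splitting a contraction into two tokens, so results should be checked against whichever tokenizer is actually used.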
nltk.word_tokenize
Bag-of-words Model
▪ simple model
▪ discards sentence structure
▪ useful to identify topic or sentiment
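A minimal sketch of what "discards sentence structure" means in practice: two sentences with opposite meanings produce the same bag of words. The helper name `bag_of_words` is hypothetical, and lowercasing the tokens is an assumption.

```python
from collections import Counter


def bag_of_words(tokens):
    """Map each lowercased token to its count; word order is not kept."""
    return Counter(token.lower() for token in tokens)


a = bag_of_words("The dog bit the man".split())
b = bag_of_words("The man bit the dog".split())

# Same words, same counts, so the two bags are equal
# even though the sentences mean different things.
same_bag = (a == b)
```

The counts are still enough to suggest a topic (dogs, biting), which is why the model remains useful for topic or sentiment.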
• Stopwords are very common words, such as "the", "is", and "which", that carry little meaning on their own.
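A minimal sketch of stopword filtering. The `STOPWORDS` set here is a tiny hand-written subset used only for illustration, and `remove_stopwords` is a hypothetical helper. In practice a fuller list would be used, for example the one nltk provides through `nltk.corpus.stopwords.words("english")` after downloading the `stopwords` resource.

```python
# Illustrative subset only; a real stopword list is much longer.
STOPWORDS = {"the", "is", "which", "a", "an", "of", "and"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]


filtered_words = remove_stopwords(["The", "movie", "is", "great"])
# filtered_words == ['movie', 'great']
```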
Using Counter
from collections import Counter
counter = Counter(filtered_words)
counter["movie"]
5771
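A short sketch of two other `Counter` features that are handy for word counts: `most_common` for ranking, and a count of zero for unseen words. The word list is made up for illustration and is not taken from the corpus.

```python
from collections import Counter

# Toy data, not corpus output.
sample_words = ["movie", "film", "movie", "plot", "film", "movie"]
sample_counter = Counter(sample_words)

top_two = sample_counter.most_common(2)   # [('movie', 3), ('film', 2)]
unseen = sample_counter["popcorn"]        # 0, missing keys do not raise KeyError
```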
Plotting Word Frequency
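A minimal sketch of a frequency plot, drawn as text bars so that it runs with the standard library alone. This is a stand-in for a graphical plot, not the plotting method the slides used. The helper name `text_bar_chart` and its parameters are hypothetical.

```python
from collections import Counter


def text_bar_chart(counter, n=5, width=40):
    """Return one text line per word for the n most common words.

    Bar lengths are scaled so that the most frequent word gets `width`
    characters; every listed word gets at least one character.
    """
    top = counter.most_common(n)
    if not top:
        return []
    max_count = top[0][1]
    lines = []
    for word, count in top:
        bar = "#" * max(1, round(width * count / max_count))
        lines.append(f"{word:>10} {bar} {count}")
    return lines


# Toy counts, not corpus output.
for line in text_bar_chart(Counter({"movie": 4, "film": 2, "plot": 1}), width=8):
    print(line)
```

The same (word, count) pairs from `most_common` can be passed to a plotting library to get a bar or line chart instead.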