Week 6: Introduction to Natural Language Processing

This document provides an overview of natural language processing (NLP). It begins with describing NLP and some common NLP tasks like sentiment analysis and machine translation. It then discusses why NLP is important and challenging. The document outlines several preprocessing steps for text data, including tokenization, stop word removal, stemming/lemmatization. It also describes common word features used in NLP like bag-of-words and tf-idf vectors. Finally, it briefly introduces N-gram language models.


Week 6

Introduction to Natural Language Processing


Session Agenda
● Basics of NLP
● Pre-processing steps
● Language Models
● Case Study
Natural Language Processing
1. Natural Language Processing is a subfield of artificial
intelligence concerned with methods of communication
between computers and natural languages such as
English, Hindi, etc.
2. The objective of Natural Language Processing is to perform
useful tasks involving human languages, such as
○ Sentiment Analysis
○ Machine Translation
○ Part-of-Speech Tagging
○ Human-machine communication (chatbots)
Why study NLP?
1. Language is involved in most activities that require interaction between humans, e.g. reading, writing,
speaking, listening.
2. Voice can be used as an interface for interaction between humans and machines, e.g. Cortana, Google Assistant,
Siri, Amazon Alexa.
3. There is a massive amount of data available in text format (blogs, research articles, consumer reviews,
literature, discussion forums) from which insights can be derived using NLP.
Different Tasks in NLP
● Text Classification
○ Sentiment Analysis: determining the overall
polarity of a review, whether it is positive,
negative, or neutral.
○ Consumer Complaints Classification: routing
complaints on consumer forums to the respective
departments.
● Machine Translation
○ Improving human-human interaction by translating
sentences from one language to another.
Different Tasks in NLP
● Part of Speech Tagging
○ In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called
grammatical tagging or word-category disambiguation, is the process of marking up a word in a text
(corpus) as corresponding to a particular part of speech, based on both its definition and its context.
○ A simplified form of this is the identification of words as nouns, verbs, adjectives, adverbs, etc.
○ Tagset: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Different Tasks in NLP
● Word Segmentation
○ In some languages, there is no space between words, or a word may contain smaller syllables. In such
languages, word segmentation is the first step of NLP systems.
● Semantic Analysis
○ Semantic analysis of a corpus (a large and structured set of texts) is the task of building structures that
approximate concepts from a large set of documents.
○ Application of Semantic Analysis:
■ Text Similarity
■ Context Recognition
■ Sentence Parsing
■ Topic Modelling
Why is NLP hard?
1. Languages are changing every day: new words, new rules, etc.
2. The number of tokens is not fixed. A natural language can have hundreds of thousands of different
words, and new words are created on the fly.
3. Words can have different meanings depending on context, and they can acquire new meanings over
time (apple (a fruit) vs. Apple (the company)); they can even change their part of speech (Google --> to
google).
4. Every language has its own uniqueness. In English we have words, sentences,
paragraphs and so on to delimit our language, but in Thai there is no concept of sentences.
Pre-processing Steps
Tokenization
● Tokenization is the task of taking a text or set of texts and breaking it up into individual tokens.
● Tokens are usually individual words (at least in languages like English).
● Tokenization can be achieved using different methods. The most common methods are the whitespace tokenizer and
the regexp tokenizer. We will use them in our case study.
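The two methods above can be sketched with Python's standard `re` module; the sample sentence and the regexp pattern are illustrative choices, not the exact ones from the case study.

```python
import re

text = "NLP is fun, isn't it?"

# Whitespace tokenization: split on runs of whitespace only.
# Punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# Regexp tokenization: keep alphanumeric runs (plus apostrophes),
# dropping punctuation entirely.
regexp_tokens = re.findall(r"[A-Za-z0-9']+", text)

print(whitespace_tokens)  # ['NLP', 'is', 'fun,', "isn't", 'it?']
print(regexp_tokens)      # ['NLP', 'is', 'fun', "isn't", 'it']
```

Note how the choice of tokenizer already changes the vocabulary: the whitespace tokenizer treats "fun," and "fun" as different tokens, while the regexp tokenizer does not.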
Stop Words Removal
● Stopwords are common words that carry less
important meaning than keywords.
● When using bag-of-words based methods,
e.g. CountVectorizer or tf-idf, which work on the counts
and frequencies of words, removing stopwords
helps because it lowers the dimensionality of the feature space.
● Not always a good idea?
○ When working on problems where
contextual information is important, like
machine translation, removing stop words
is not recommended.
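A minimal sketch of stop-word filtering; the stop list here is a tiny illustrative subset, not the full list shipped by libraries such as NLTK or scikit-learn.

```python
# Tiny illustrative stop list (real lists have 100+ entries).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def remove_stop_words(tokens):
    # Compare case-insensitively so "The" is filtered like "the".
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "movie", "is", "a", "masterpiece"]
print(remove_stop_words(tokens))  # ['movie', 'masterpiece']
```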
Stemming and Lemmatization
● The idea of reducing different forms of a word to a core root.
● Words that are derived from one another can be mapped to a central word or symbol, especially
if they have the same core meaning.
● In stemming, words are reduced to their word stems. A word stem is a form equal to or smaller
than the word itself.
● “cook,” “cooking,” and “cooked” are all reduced to the same stem, “cook.”
● Lemmatization involves resolving words to their dictionary form. A lemma of a word is its
dictionary or canonical form!
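The "cook" example can be reproduced with a toy suffix-stripping stemmer; this is purely for illustration, as real stemmers (e.g. the Porter stemmer available in NLTK) apply far more careful rules.

```python
# Toy stemmer: strip a few common suffixes, but only if a stem of at
# least three characters remains. Not a substitute for a real stemmer.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["cook", "cooking", "cooked"]:
    print(w, "->", crude_stem(w))  # all three reduce to 'cook'
```

Lemmatization, by contrast, needs dictionary knowledge (e.g. mapping "better" to "good"), which suffix stripping alone cannot provide.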
Word Features
Bag of Words
● In this model, a text (such as a sentence or a document) is represented
as the bag of its words, disregarding grammar and even word order
but keeping multiplicity.
● We use the tokenized words for each observation and find out the
frequency of each token.
● We define the vocabulary of the corpus as all the unique words in the
corpus whose frequency falls within chosen thresholds.
● Each sentence or document is then represented by a vector of the same
dimension as the vocabulary, containing the frequency of each word of the
vocabulary in the sentence.
● The bag-of-words model is commonly used in methods of document
classification where the (frequency of) occurrence of each word is
used as a feature for training a classifier.
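The steps above can be sketched in a few lines of plain Python (no frequency thresholding here, for brevity); the two-document corpus is an illustrative assumption.

```python
# Bag of words: vocabulary = unique tokens in the corpus,
# each document = a vector of per-term counts over that vocabulary.
corpus = [
    ["i", "love", "this", "movie"],
    ["i", "hate", "this", "movie"],
]

# Sort for a stable, reproducible ordering of vector dimensions.
vocabulary = sorted({token for doc in corpus for token in doc})

def bow_vector(doc):
    return [doc.count(term) for term in vocabulary]

print(vocabulary)             # ['hate', 'i', 'love', 'movie', 'this']
print(bow_vector(corpus[0]))  # [0, 1, 1, 1, 1]
```

Word order is lost, as the definition says: "love this movie" and "this movie love" produce the same vector.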
Tf-idf Vectors
● Tf-idf (term frequency times inverse document frequency) is a scheme for weighting individual tokens.
● One advantage of tf-idf is that it reduces the impact of tokens that occur very frequently and hence
offer little to no information.
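A sketch of the textbook weighting tf × log(N / df); note that libraries such as scikit-learn use smoothed variants of this formula, so their exact numbers differ.

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "the", "cat"],
]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                          # term frequency in this doc
    df = sum(1 for d in docs if term in d)        # document frequency
    return tf * math.log(N / df)

# 'the' appears in every document, so its idf (and weight) is zero.
print(tf_idf("the", docs[0]))            # 0.0
print(round(tf_idf("cat", docs[0]), 3))  # positive: 'cat' is rarer
```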
N-gram and Language Model
● Language models are models that assign probabilities to sequences of words.
● The N-gram is the simplest language model. An N-gram is a sequence of N words.
● The bi-gram is the special case of the N-gram where we consider only sequences of two words (Markov
assumption).
● In N-gram models we calculate the probability of the Nth word given the preceding N-1 words. We do this by
calculating the relative frequency with which the sequence occurs in the text corpus.
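The relative-frequency estimate above can be sketched for the bi-gram case, i.e. P(wₙ | wₙ₋₁) = count(wₙ₋₁, wₙ) / count(wₙ₋₁); the tiny corpus is an illustrative assumption.

```python
from collections import Counter

corpus = "i like nlp i like ml i love nlp".split()

# Count adjacent word pairs and the words in the history position.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    return bigrams[(prev, word)] / unigrams[prev]

# 'i' occurs 3 times as a history word, followed by 'like' twice.
print(bigram_prob("i", "like"))  # 0.666...
```

A real model would also add sentence-boundary markers and smoothing for unseen bigrams, which this sketch omits.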
Thank you! :)
Questions are always welcome
