Text Classification

By Ivan Wong
Introduction
• In machine learning, classification is the problem of categorizing a
data instance into one or more known classes.
• A data point can originally be in different formats, such as text, speech,
image, or numeric data.
• Text classification is a special instance of the classification
problem, where the input data points are text and the goal is to
categorize each piece of text into one or more buckets (classes)
from a set of pre-defined buckets (classes).
• The “text” can be of arbitrary length: a character, a word, a
sentence, a paragraph, or a full document.
Introduction
• Any supervised classification approach, including text
classification, can be further distinguished into three types based
on the number of categories involved:
• Binary [spam or not-spam],
• Multiclass [news categories], and
• Multilabel [a tweet may express multiple emotions].
Applications
• Text classification has been of interest in a number of application
scenarios, ranging from identifying the author of an unknown text
in the 1800s to the efforts of USPS in the 1960s to perform optical
character recognition on addresses and zip codes.
• In the 1990s, researchers began to successfully apply ML
algorithms for text classification for large datasets.
• Email filtering, popularly known as “spam classification,” is one of
the earliest examples of automatic text classification, which
impacts our lives to this day.
Applications (Cont.)
• Content classification and organization: This refers to the task
of classifying/tagging large amounts of textual data. This, in turn,
is used to power use cases like content organization, search
engines, and recommendation systems, to name a few.
• Customer support: Customers often use social media to express
their opinions about and experiences of products or services. Text
classification is often used to identify the tweets that brands must
respond to and those that don’t require a response (i.e., noise).
Applications (Cont.)
• E-commerce: Customers leave reviews for a range of products on
e-commerce websites like Amazon, eBay, etc. An example use of
text classification in this kind of scenario is to understand and
analyze customers’ perception of a product or service based on
their comments.
• Other applications:
• language identification
• Authorship attribution
• Triaging posts in an online support forum
A Pipeline for Building Text Classification Systems
Naive Bayes Classifier
• Naive Bayes is a probabilistic classifier that uses Bayes’ theorem
to classify texts based on the evidence seen in training data.
• For each class, it estimates the conditional probability of each feature
of a given text based on how often that feature occurs in that class,
then multiplies the probabilities of all the features of the text
(together with the class prior) to compute a final score for that
class.
• Finally, it chooses the class with maximum probability.
• Details: https://web.stanford.edu/~jurafsky/slp3/
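A minimal sketch of this idea with scikit-learn; the tiny spam/ham dataset below is an illustrative placeholder, not from the slides. Word counts serve as features, and MultinomialNB estimates the per-class feature probabilities described above.

# Naive Bayes text classification sketch (scikit-learn).
# The training texts and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["free money now", "meeting at noon", "win a free prize", "lunch tomorrow?"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()                    # bag-of-words feature extraction
X_train = vectorizer.fit_transform(train_texts)   # sparse term-count matrix

clf = MultinomialNB()                             # multiplies per-class feature probabilities
clf.fit(X_train, train_labels)

X_new = vectorizer.transform(["free prize meeting"])
print(clf.predict(X_new))                         # class with the maximum posterior probability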
Potential reasons for poor classifier performance
• Reason 1: Since we extracted all possible features, we ended up with a
large, sparse feature vector, where most features are too rare and end
up being noise. A sparse feature set also makes training hard.
• Reason 2: There are very few examples of relevant articles (~20%)
compared to non-relevant articles (~80%) in the dataset. This class
imbalance skews the learning process toward the non-relevant category.
• Reason 3: Perhaps we need a better learning algorithm.
• Reason 4: Perhaps we need a better pre-processing and feature
extraction mechanism.
• Reason 5: Perhaps we should look at tuning the classifier’s
parameters and hyperparameters.
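Reasons 1 and 2 can be addressed directly at feature-extraction and training time. A hedged sketch follows; the specific parameter values are illustrative assumptions, not the slides' settings.

# Sketch: limiting the feature space (Reason 1) and correcting class
# imbalance (Reason 2). The values below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Keep only the 5,000 most frequent terms and drop very rare ones,
# shrinking the large, sparse, noisy feature vector.
vectorizer = CountVectorizer(max_features=5000, min_df=2)

# With imbalanced data, the class priors can be made uniform so the model
# is not skewed toward the majority (non-relevant) class.
clf = MultinomialNB(fit_prior=False)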
Logistic Regression
• Unlike Naive Bayes, which estimates probabilities based on
feature occurrence in classes, logistic regression “learns” the
weights for individual features based on how important they are to
make a classification decision.
• The goal of logistic regression is to learn a linear separator
between classes in the training data with the aim of maximizing
the likelihood of the data.
Our logistic regression classifier instantiation has an argument class_weight, which is
given the value “balanced”. This tells the classifier to boost the weights for classes in
inverse proportion to the number of samples in each class.
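A minimal sketch of such an instantiation with scikit-learn; the tiny dataset and TF-IDF features are placeholders standing in for the notebook's actual data.

# Logistic regression with balanced class weights (scikit-learn sketch).
# The toy dataset below is an illustrative placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product", "terrible service", "loved it", "awful, broken", "works fine"]
labels = ["pos", "neg", "pos", "neg", "pos"]      # imbalanced in a real corpus

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# class_weight="balanced" boosts each class's weight in inverse proportion
# to its number of samples, countering class imbalance during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["service was terrible"])))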
Logistic Regression (Cont.)
Support Vector Machine
• A support vector machine (SVM), first invented in the early 1960s,
is a discriminative classifier like logistic regression.
• However, unlike logistic regression, it looks for an optimal
hyperplane in a higher-dimensional space that separates the
classes in the data by the maximum possible margin.
• Further, SVMs are capable of learning even non-linear separations
between classes, unlike logistic regression.
• However, they may also take longer to train.
Support Vector Machine (Cont.)

Ref.: https://scikit-learn.org/stable/modules/svm.html
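A hedged sketch of both variants with scikit-learn (the toy data is an illustrative assumption; see the scikit-learn reference above for details): LinearSVC learns a maximum-margin linear separator, while SVC with an RBF kernel can learn non-linear boundaries, typically at a higher training cost.

# Linear vs. kernel (non-linear) SVM sketch with scikit-learn.
# Texts and labels are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC, SVC

texts = ["free money now", "meeting at noon", "win a free prize", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

X = TfidfVectorizer().fit_transform(texts)

linear_svm = LinearSVC(C=1.0)            # maximum-margin linear separator
linear_svm.fit(X, labels)

kernel_svm = SVC(kernel="rbf", C=1.0)    # non-linear separation via the RBF kernel
kernel_svm.fit(X, labels)                # typically slower to train than the linear model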
Using Neural Embeddings in Text
Classification
• In feature engineering, techniques using neural networks, such as
word embeddings, character embeddings, and document
embeddings, have the advantage that they create a dense, low-
dimensional feature representation instead of the sparse, high-
dimensional structure of BoW/TF-IDF and other such features.
• There are different ways of designing and using features based on
neural embeddings.
Word Embeddings
• Doc2Vec
• In the Doc2vec embedding scheme, we learn a direct representation for the
entire document (sentence/paragraph) rather than each word.
• Word2Vec
• We’ll use a pre-trained embedding model.
• Subword Embeddings and fastText
• If a word in our dataset was not present in the pre-trained model’s vocabulary,
how will we get a representation for it? This is the out-of-vocabulary (OOV) problem.
• fastText embeddings address it by enriching word embeddings with
subword-level information.
• The embedding for each word is represented as the sum of the
embeddings of its individual character n-grams.
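As a hedged sketch of using pre-trained word embeddings as classification features (the gensim model name and toy data are assumptions for illustration): each document is represented by the average of its word vectors, and OOV words are simply skipped, which is exactly the gap fastText's subword embeddings are meant to fill.

# Sketch: averaged pre-trained word embeddings as document features.
# Model name and toy data are assumptions for illustration only.
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

wv = api.load("glove-wiki-gigaword-50")   # pre-trained word vectors

def embed(text):
    # Average the vectors of in-vocabulary words; OOV words are skipped.
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

texts = ["great phone, love it", "awful battery, broken screen"]
labels = ["pos", "neg"]

X = np.vstack([embed(t) for t in texts])   # dense, low-dimensional features
clf = LogisticRegression().fit(X, labels)
print(clf.predict([embed("love the screen")]))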
Deep Learning for Text Classification
• Two of the most commonly used neural network architectures for
text classification are convolutional neural networks (CNNs) and
recurrent neural networks (RNNs).
• Long short-term memory (LSTM) networks are a popular form of
RNNs.
• Recent approaches also involve starting with large, pre-trained
language models and fine-tuning them for the task at hand.
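A hedged sketch of the fine-tuning route using the Hugging Face transformers library; the model name, data, and hyperparameters are illustrative assumptions, not the slides' setup.

# Sketch: fine-tuning a pre-trained language model for text classification.
# Model name, data, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

texts = ["great phone, love it", "awful battery, broken screen"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # forward pass returns loss and logits
outputs.loss.backward()                   # one fine-tuning step (normally looped over epochs)
optimizer.step()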
Deep Learning for Text Classification (Cont.)
1. Tokenize the texts and convert them into word index vectors.
2. Pad the text sequences so that all text vectors are of the same
length.
3. Map every word index to an embedding vector. We do that by looking
up each index in an embedding matrix (equivalently, multiplying
one-hot index vectors with the embedding matrix). The embedding
matrix can either be populated using pre-trained embeddings or be
trained from scratch on this corpus.
4. Use the output from Step 3 as the input to a neural network
architecture.
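A minimal sketch of these four steps with Keras; the vocabulary size, sequence length, toy data, and the LSTM-based architecture are illustrative assumptions.

# Sketch of steps 1-4 with Keras; sizes and architecture are assumptions.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

texts = ["great phone, love it", "awful battery, broken screen"]
labels = np.array([1, 0])

# Step 1: tokenize the texts and convert them into word index vectors.
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Step 2: pad the sequences so all text vectors have the same length.
X = pad_sequences(sequences, maxlen=100)

# Steps 3 and 4: the Embedding layer maps word indices to dense vectors
# (optionally initialized from pre-trained embeddings), feeding an
# LSTM-based classifier.
model = Sequential([
    Embedding(input_dim=20000, output_dim=100),
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=3)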
