Text Classification

Text classification is a process that assigns predefined categories to text. It works by first inputting text data and assigning numbers to each unique word. It then calculates term frequency to see how often each word appears and inverse document frequency to downscale common words. The text is then encoded as a list of word counts. A machine learning model is trained on this encoded text using an algorithm to learn how to classify new text. The trained model can then be tested on new text data.

Uploaded by

Shravya M

Text Classification

Have you seen your parents ordering pizza for you, or buying clothes for you? When you
look at the apps, you can see many categories such as veg pizza, non-veg pizza, sides, desserts,
and more. Another example is your school, where students are grouped into specific classes
according to their grades.

Similarly, online content is a huge amount of data that needs to be classified before it can be
used. So, how do you classify it? A machine-learning model can help you with that. First, you
need to train your model to classify the data. There are various algorithms that will help you
train your model. We will look into the algorithms later; first, let us dive deep into text
classification.

What is text classification?

Text classification is a process of assigning a set of pre-defined categories to available text.


Text classification can be used to organize, structure, and categorize free text.

For example, news articles are classified by topic, such as sports, entertainment, and
politics.

Did you know? Sentiment analysis, which you learned in the previous class, is also a type of
text classification. There, you classified text into three categories: positive, negative,
and neutral.

How does it work?

Step 1: Input the data. This is the first step in text classification.

Step 2: Word count / Count Vectorization

This step assigns a number to each word in a given input. Let us take an example to
understand how it works. Here is our sentence:

“The quick brown fox jumped over the lazy dog”

Once we input the data, the tokenization process is carried out: each unique word is assigned
an integer index (a token). The sentence has nine words, of which “the” appears twice. Repeated
words are counted only once when building the vocabulary, so there are 8 unique words,
numbered from 0 to 7 (in alphabetical order in this example), as shown below.

the: 7, lazy: 4, jumped: 3, brown: 0, over: 5, quick: 6, dog: 1, fox: 2

Now, let us arrange the words in order of their token numbers.

brown, dog, fox, jumped, lazy, over, quick, the

Once the tokens are arranged, we need to encode the sentence (the input data). That means
counting the occurrences of each word, for example, the number of times the word “brown”
appears in the sentence.

[brown, dog, fox, jumped, lazy, over, quick, the]

[1, 1, 1, 1, 1, 1, 1, 2]

Because “the” appears twice, it is encoded as 2. This process is repeated for all the input
data.
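
The word-count step above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not necessarily the tool the lesson uses; the helper names tokenize, build_vocabulary, and count_vector are made up for this example, and lowercasing and alphabetical indexing are assumptions chosen to match the numbers shown above.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Map each unique word to an integer index, in alphabetical order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def count_vector(text, vocabulary):
    """Encode text as a list of word counts, one slot per vocabulary word."""
    counts = Counter(tokenize(text))
    # Walk the vocabulary in index order so slot i belongs to word i.
    return [counts[word] for word in sorted(vocabulary, key=vocabulary.get)]

sentence = "The quick brown fox jumped over the lazy dog"
vocabulary = build_vocabulary([sentence])
vector = count_vector(sentence, vocabulary)

print(vocabulary)  # brown is 0, dog is 1, ..., the is 7
print(vector)      # [1, 1, 1, 1, 1, 1, 1, 2]
```

The repeated word “the” gets a single index (7) in the vocabulary, while its count of 2 is kept in the encoded vector.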

Step 3: TF and IDF

Term Frequency: This summarizes how often a given word appears within a document.

Inverse Document Frequency: This downscales words that appear in many of the documents.

To understand this, let us add two more sentences to our example, giving three sentences in
total:

“The quick brown fox jumped over the lazy dog”
“The dog”
“The fox”

Using the IDF formula, a weight is calculated for each word. Here, you can see that the word
“the” appears 4 times in total and occurs in all three sentences.

[1.69, 1.28, 1.28, 1.69, 1.69, 1.69, 1.69, 1]

The last word, “the”, has the lowest weight because it appears in every sentence, so IDF
treats it as the least important word.
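
The text does not say which IDF formula is used. The sketch below assumes a smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of sentences and df is the number of sentences that contain the word. Under that assumption the weights come out close to the list above: about 1.69 for words found in one sentence, about 1.29 for words found in two (1.2877, which reads 1.28 when truncated), and exactly 1 for “the”. The function name smoothed_idf is made up for this example.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into words (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())

def smoothed_idf(documents):
    """Return {word: idf} with idf = ln((1 + n) / (1 + df)) + 1 (assumed formula)."""
    n = len(documents)
    doc_tokens = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*doc_tokens))
    idf = {}
    for word in vocabulary:
        # df: number of documents that contain the word at least once.
        df = sum(1 for tokens in doc_tokens if word in tokens)
        idf[word] = math.log((1 + n) / (1 + df)) + 1
    return idf

documents = [
    "The quick brown fox jumped over the lazy dog",
    "The dog",
    "The fox",
]
idf = smoothed_idf(documents)
for word, weight in idf.items():
    print(f"{word}: {weight:.4f}")
```

Words that occur in the same number of sentences always receive the same weight, which is why “brown”, “jumped”, “lazy”, “over”, and “quick” share one value.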

Step 4: After all these steps, a machine-learning algorithm is used to train the model on the
encoded text.

Step 5: Test the trained model.
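
Steps 4 and 5 can be sketched as follows. The lesson does not name an algorithm at this point, so this example assumes a simple multinomial Naive Bayes classifier working on raw word counts; the training and test sentences, the two categories, and the function names train and predict are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into words (assumed preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Step 4: count words per category (a multinomial Naive Bayes model)."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for text, category in examples:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary

def predict(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids log(0) for words unseen in this category.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical toy data, invented for this sketch.
training_data = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the minister announced a new policy", "politics"),
    ("parliament voted on the election bill", "politics"),
]
test_data = [
    ("the team scored a goal", "sports"),
    ("the minister lost the election", "politics"),
]

model = train(training_data)

# Step 5: test the trained model on sentences it has not seen before.
predictions = [predict(text, model) for text, _ in test_data]
correct = sum(p == label for p, (_, label) in zip(predictions, test_data))
accuracy = correct / len(test_data)
print(predictions, accuracy)
```

Testing on sentences that were not part of the training data shows whether the model has learned the categories rather than memorized the examples.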

This is how text classification works.


Text Classification steps:
