
(Cover image: KEA diagram)

Automatic Content Tagging using NLP and Machine Learning

Josip Lazarevski
Senior Director, Data Science and Knowledge Engineering

August 11, 2020

In this article, we will explore how content tagging can be automated with the help of NLP.

I will also go into the details of what resources you will need to implement such a system and which approach is more favorable for your case.

A few years back, I developed an automated tagging system that took over 8,000 digital assets and tagged them with over 85% correctness. The tagger was deployed and tagged new digital assets in real time every day, fully automated.

Using Named Entity Recognition (NER) and Named Entity Linking (NEL)


NER is the task of extracting named entities from the article text; NEL, on the other hand, aims to link these named entities to a taxonomy like Wikipedia.

One possible way to generate candidates for tags is to extract all the named entities, or aspects, in the text, as represented by, for example, the Wikipedia entries of the named entities in the article, using a tool like Wikifier.
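
As a rough illustration of the NER step (a minimal sketch, not the exact pipeline behind the tagger described above), candidate entities could be pulled out with spaCy, assuming the en_core_web_sm model is installed; linking them to Wikipedia would then be handled by a separate tool such as Wikifier:

```python
# Minimal NER sketch with spaCy (assumes: pip install spacy && python -m spacy download en_core_web_sm).
# Entity linking to Wikipedia is left to a separate tool such as Wikifier.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Lionel Messi joined Paris Saint-Germain after leaving FC Barcelona in 2021.")

candidates = [(ent.text, ent.label_) for ent in doc.ents]
print(candidates)  # e.g. [('Lionel Messi', 'PERSON'), ('Paris Saint-Germain', 'ORG'), ...]
```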

While this method can generate adequate candidates for other approaches like key-phrase extraction, it faces two issues:

Coverage: not all the tags in your articles have to be named entities; they might just as well be any phrase.

Redundancy: not all the named entities mentioned in a text document are necessarily important to the article.

Key Phrase Extraction

In keyphrase extraction, the goal is to extract significant tokens from the text. There are several methods by which this can be done, and they generally fall into two main categories:

1. Unsupervised Methods

These are simple methods that basically rank the words in the article based on several metrics and retrieve the highest-ranking words. These methods can be further classified into statistical and graph-based:

Graph-Based Methods

In these methods, the system represents the document in graph form and then ranks the phrases based on their centrality score, which is commonly calculated using PageRank or a variant of it. The main difference between these methods lies in how they construct the graph and how the vertex weights are calculated. The algorithms in this category include TextRank, SingleRank, TopicRank, TopicalPageRank, PositionRank, and MultipartiteRank.
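
As a toy sketch of the centrality idea (not a faithful re-implementation of any of the algorithms above), one can build a word co-occurrence graph and score words with PageRank; the tokenisation and window size below are deliberate simplifications:

```python
# Toy TextRank-style sketch: word co-occurrence graph ranked with PageRank.
# Real implementations add POS filtering, phrase merging, and proper tokenisation.
import networkx as nx

text = "graph based keyphrase extraction ranks words by their centrality in a word graph"
tokens = text.lower().split()
window = 3  # co-occurrence window (a simplification)

graph = nx.Graph()
for i, word in enumerate(tokens):
    for other in tokens[i + 1 : i + window]:
        if word != other:
            graph.add_edge(word, other)

scores = nx.pagerank(graph)
print(sorted(scores, key=scores.get, reverse=True)[:5])  # highest-centrality words
```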

Statistical Methods


In this type, the candidates are ranked using their occurrence statistics, mostly TF-IDF. Some of the methods in this category are:

TF-IDF: this is the simplest possible method. Basically, we calculate the TF-IDF score of every n-gram in the text and then select those with the highest TF-IDF score (see the minimal sketch after this list).

KPMiner: the main drawback of using TF-IDF is that it inherently has a bias toward shorter n-grams, since they tend to have larger scores. In KPMiner, the system modifies the candidate selection process to reduce erroneous candidates and then adds a boosting factor to adjust the TF-IDF weights.

YAKE: introduces a method that relies on local statistical features of every term and then generates the scores by combining consecutive N-words into key phrases.

EmbedRank: this simple method uses the following steps:

Candidates are phrases that consist of zero or more adjectives, followed by one or multiple nouns.

These candidates and the whole document are then represented using Doc2Vec or Sent2Vec.

Afterward, each candidate is ranked based on its cosine similarity to the document vector.
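
To make the statistical family concrete, here is a minimal TF-IDF scoring sketch with scikit-learn; the tiny corpus, the n-gram range, and the number of phrases returned are arbitrary choices for the example:

```python
# Minimal TF-IDF keyphrase sketch: score every n-gram of the article and keep the top ones.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Automatic content tagging assigns descriptive labels to digital assets.",
    "Named entity recognition finds people, places and organisations in text.",
    "Keyphrase extraction ranks candidate phrases by statistical importance.",
]
article = "Keyphrase extraction is a simple way to build an automatic content tagger."

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
vectorizer.fit(corpus + [article])  # document frequencies come from the small corpus

scores = vectorizer.transform([article]).toarray()[0]
terms = vectorizer.get_feature_names_out()
top = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:5]
print(top)  # highest-scoring n-grams become candidate tags
```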

2. Supervised Methods

KEA is a very famous algorithm for keyphrase extraction. Basically, it extracts candidates from documents using TF-IDF, and a trained model is then used to restrict the candidate set.

(Figure: KEA in practice)

Deep methods have also been suggested to tackle this task. Basically, the task is converted into a sequence tagging problem, where the input is the article text and the output is the BIO annotation.
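
For illustration, here is a made-up training example in that sequence-tagging format, where B marks the first token of a keyphrase, I a token inside one, and O everything else:

```python
# Illustrative (invented) BIO-labelled example for sequence-tagging keyphrase extraction.
tokens = ["Automatic", "content", "tagging", "uses", "key", "phrase", "extraction", "."]
labels = ["B",         "I",       "I",       "O",    "B",   "I",      "I",          "O"]
```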

Other things to consider while building an automated content tagger using machine learning and NLP


A significant distinction between keyphrase extraction methods is whether they use a closed or an open vocabulary. In the closed case, the extractor only selects candidates from a pre-specified set of key phrases. This often improves the quality of the generated words but requires building that set as well. It can also reduce the number of keywords extracted and restrict them to the size of the closed set.

Most of the algorithms mentioned above are already implemented in packages like pke.
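
For instance, here is a minimal sketch with pke's TextRank implementation (assuming pip install pke and its spaCy model dependency; the exact API may differ between pke versions):

```python
# Minimal keyphrase extraction sketch with pke (API may vary across versions).
import pke

extractor = pke.unsupervised.TextRank()
extractor.load_document(
    input="Automatic content tagging can be built on top of keyphrase extraction.",
    language="en",
)
extractor.candidate_selection()
extractor.candidate_weighting()
print(extractor.get_n_best(n=5))  # list of (phrase, score) pairs
```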

Some articles suggest several post-processing steps to improve the quality of the extracted phrases:

In [Bennani-Smires, Kamil, et al. “Simple Unsupervised Keyphrase Extraction using Sentence Embeddings.” arXiv preprint arXiv:1801.04470 (2018).] the authors suggest using maximal marginal relevance (MMR) to improve the semantic diversity of the selected key-phrases. They ran a manual experiment with 200 human participants and found that although reducing the phrases’ semantic overlap leads to no F-score gains, humans prefer the increased-diversity selection.
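
A minimal sketch of MMR re-ranking follows, assuming candidate phrases and the document are already embedded as unit-normalised numpy vectors (the embedding step itself, e.g. Sent2Vec, is out of scope here):

```python
# MMR re-ranking sketch: trade relevance to the document against redundancy
# with already-selected phrases. Assumes unit-normalised embedding vectors.
import numpy as np

def mmr_rerank(doc_vec, cand_vecs, candidates, top_n=10, lam=0.7):
    relevance = cand_vecs @ doc_vec  # cosine similarity for normalised vectors
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < top_n:
        if not selected:
            best = max(remaining, key=lambda i: relevance[i])
        else:
            sel_vecs = cand_vecs[selected]
            best = max(
                remaining,
                key=lambda i: lam * relevance[i] - (1 - lam) * np.max(sel_vecs @ cand_vecs[i]),
            )
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```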

Several cloud services, including AWS Comprehend and Azure Cognitive Services, support keyphrase extraction for a fee. However, their performance in non-English languages is not always excellent.

Data

As mentioned above, most of these methods are unsupervised and thus require no training data. However, if you wish to use supervised methods, then you will need training data for your models.

I have included data from blogs, web pages, data sheets, product specifications, and videos (using speech-to-text models).

Pros And Cons

These methods are generally straightforward and have very high performance.

Most of these algorithms, like YAKE, for example, are multi-lingual and usually only require a list of stop words to operate.

The unsupervised methods can generalize easily to any domain and require no training data; even most of the supervised methods require a very small amount of training data.

Being extractive, these algorithms can only generate phrases from within the original text. This means that the generated keyphrases can’t abstract the content, and they might not be suitable for grouping documents.

The quality of the key phrases depends on the domain and algorithm used.

Key Phrase Generation

A significant drawback of using extractive methods is that, in most datasets, a large portion of the keyphrases is not explicitly included within the text.

Key phrase generation instead treats the problem as a machine translation task, where the source language is the article’s main text while the target is usually the list of key phrases. Neural architectures specifically designed for machine translation, like seq2seq models, are the prominent method for tackling this task. Furthermore, the same tricks used to improve translation, including transformers, copy decoders, and encoding text using byte pair encoding, are commonly used.
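
As a hedged sketch of that seq2seq framing with the Hugging Face transformers library: t5-small below is only a stand-in base model and would first need fine-tuning on (article, keyphrase-list) pairs before its output means anything:

```python
# Sketch of keyphrase generation as text-to-text. t5-small is a placeholder base model
# that would need fine-tuning on pairs like (article text -> "kp1; kp2; kp3") first.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

article = "generate keyphrases: Automatic content tagging assigns labels to digital assets ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```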

While these supervised methods usually yield better key phrases than their extractive counterparts, there are some problems with this approach:

These methods are usually language- and domain-specific: a model trained on news articles would generalize miserably on Wikipedia entries. This increases the cost of incorporating other languages.

The deep models often require more computation for both the training and inference phases.

These methods require large quantities of training data to generalize. However, as mentioned above, for some domains such as news articles it is simple to scrape such data.


Text Tagging using Machine Learning and NLP

Another approach to tackle this issue is to treat it as a fine-grained classification task, where the input of the system is the article, and the system needs to select one or more tags from a pre-defined set of classes that best represent the article.

There are two main challenges in this approach:

choosing a model that can predict an often very large set of classes

obtaining enough data to train it.

The first task is not simple. Several benchmark challenges have tackled it, especially the LSHTC challenge series. The models often used for such tasks include boosting a large number of generative models or using large neural models like those developed for object detection tasks in computer vision.
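
On a much smaller scale than the LSHTC-sized models mentioned above, a minimal multi-label baseline with scikit-learn illustrates the framing; the toy articles and tag sets are invented for the example:

```python
# Toy multi-label tagging baseline: TF-IDF features + one-vs-rest logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "New GPU accelerates deep learning training",
    "Football club signs a new striker",
    "Machine learning model predicts match outcomes",
]
tags = [{"technology", "ai"}, {"sports"}, {"ai", "sports"}]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(texts, y)
print(mlb.inverse_transform(clf.predict(["A neural network predicts football results"])))
```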

The second task is rather simpler, and it is possible to reuse the data of the key-phrase generation task for this approach. Another large source of categorized articles is public taxonomies like Wikipedia and DMOZ.

One interesting case of this task is when the tags have a hierarchical structure; one example of this is the tags commonly used in news outlets or the categories of Wikipedia pages. In this case, the model should consider the hierarchical structure of the tags in order to generalize better. Several deep models have been suggested for this task, including HDLTex and Capsule Networks.

The drawbacks of this approach are similar to those of key-phrase generation, namely the inability to generalize across other domains or languages and the increased computational costs.

Ad-hoc solutions

In [Syed, Zareen, Tim Finin, and Anupam Joshi. “Wikipedia as an ontology for describing documents.” UMBC Student Collection (2008).] a very interesting method was suggested. The authors basically indexed the English Wikipedia using the Lucene search engine. Then, to generate the tags for every new article, they used the following steps:

Use the new article (or a subset of its sentences, like the summary or titles) as a query to the search engine.

Sort the results based on their cosine similarity to the article and select the top N Wikipedia articles that are most similar to the input.

Extract the tags from the categories of the resulting Wikipedia articles and score them based on their co-occurrence.

Filter out the unneeded tags, especially administrative tags like (born in 1990, died in 1990, …), then return the top N tags.

This is a fairly simple approach. However, it might even be unnecessary to index the Wikipedia articles, since Wikimedia already has an open, free API that supports both querying Wikipedia entries and extracting their categories. However, this service is somewhat limited in terms of the supported endpoints and their results.
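
Here is a hedged sketch of that API-based variant, using the standard MediaWiki action=query endpoints (error handling, paging, and rate limiting omitted):

```python
# Sketch: find Wikipedia articles similar to the input text, then collect their
# non-hidden categories as tag candidates. No local Lucene index required.
import requests

API = "https://en.wikipedia.org/w/api.php"

def tag_candidates(text, top_n=5):
    search = requests.get(API, params={
        "action": "query", "list": "search",
        "srsearch": text, "srlimit": top_n, "format": "json",
    }).json()
    titles = [hit["title"] for hit in search["query"]["search"]]

    cats = requests.get(API, params={
        "action": "query", "prop": "categories", "titles": "|".join(titles),
        "clshow": "!hidden", "cllimit": "max", "format": "json",
    }).json()
    tags = []
    for page in cats["query"]["pages"].values():
        for cat in page.get("categories", []):
            tags.append(cat["title"].removeprefix("Category:"))
    return tags

print(tag_candidates("automatic content tagging with machine learning"))
```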

Customizable Text Classification by Tagging

Several commercial APIs, like TextRazor, provide one very useful service: customizable text classification. Basically, users can define their own classes, in a similar manner to defining your interests on sites like Quora. The model can then classify new articles into these pre-defined classes.

Regardless of the method you choose to build your tagger, one very cool application of the tagging system arises when the categories come from a specific hierarchy. This can happen either with hierarchical taggers or even with key-phrase generation and extraction, by restricting the extracted key-phrases to a specific lexicon, for example DMOZ or Wikipedia categories.

The customizable classification system can be implemented by letting the user define their own classes as sets of tags. For example, from Wikipedia we can define the class football players as the following set: {Messi, Ronaldo, …}.


At test time, the tagging system is used to generate the tags, and the generated tags are then grouped using the class sets.
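
A minimal sketch of that grouping step, with invented tag and class sets purely for illustration:

```python
# Group generated tags into user-defined classes by simple set intersection.
class_sets = {
    "football players": {"Messi", "Ronaldo", "Mbappe"},
    "clubs": {"FC Barcelona", "Real Madrid"},
}
generated_tags = {"Messi", "FC Barcelona", "transfer window"}

assigned = {cls: members & generated_tags
            for cls, members in class_sets.items() if members & generated_tags}
print(assigned)  # {'football players': {'Messi'}, 'clubs': {'FC Barcelona'}}
```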

If the original categories come from a pre-defined taxonomy, as in the case of Wikipedia or DMOZ, it is much easier to define special classes or to use the pre-defined taxonomies.

Conclusion

There are several approaches to implement an automatic tagging system; they can be broadly categorized into key-phrase-based, classification-based, and ad-hoc methods.

For simple use cases, the unsupervised key-phrase extraction methods provide a simple multi-lingual solution to the tagging task, but their results might not be satisfactory for all cases, and they can’t generate abstract concepts that summarize the whole meaning of the article.

More advanced supervised approaches like key-phrase generation and supervised tagging provide better and more abstractive results, at the expense of reduced generalization and increased computation. They also require a longer time to implement due to the time spent on data collection and training the models. However, it is fairly simple to build large-enough datasets for this task automatically.

The approach presented in [Syed, Zareen, Tim Finin, and Anupam Joshi. “Wikipedia as an ontology for describing documents.” UMBC Student Collection (2008).] is fairly general and simple, and it is possible to leverage public APIs to implement it with little effort.

The simplest way to build a tagging system, in my opinion, is to combine shallow key-phrase extraction with tags from Wikimedia to generate adequate tags. If the quality of the generated tags is not satisfactory for your application, or if you want to support a limited set of tags, then you may want to consider stronger options like key-phrase generation or supervised tagging.


One fascinating application of an auto-tagger is the ability to build a user-customizable text classification system. Such a system can be more useful if the tags come from an already established taxonomy.
#NLP #Automatedtagging
