1-What Is Text Mining - IBM
Text is one of the most common data types within databases. Depending on the
database, this data can be organized as:
Structured data: This data is standardized into a tabular format with numerous rows
and columns, making it easier to store and process for analysis and machine learning
algorithms. Structured data can include inputs such as names, addresses, and phone
numbers.
Unstructured data: This data does not have a predefined data format. It can include
text from sources like social media or product reviews, or rich media formats like
video and audio files.
Information retrieval
Information retrieval (IR) returns relevant information or documents based on a
predefined set of queries or phrases. IR systems use algorithms to track user behaviors
and identify relevant data. Information retrieval is commonly used in library catalogue
systems and popular search engines, like Google. Some common IR sub-tasks include:
Tokenization: This is the process of breaking long-form text into sentences and
words called “tokens”. These tokens are then used in models, such as bag-of-words, for
text clustering and document matching tasks.
Stemming: This refers to the process of separating the prefixes and suffixes from
words to derive the root word form and meaning. This technique improves
information retrieval by reducing the size of index files.
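The two sub-tasks above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer or stemmer. The function names, the regular expressions, and the `SUFFIXES` list are assumptions made for this example, and the stemmer strips only a few English suffixes (no prefixes), so it is far cruder than established algorithms such as the Porter stemmer.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Hypothetical, deliberately tiny suffix list used only for this sketch.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bag_of_words(text):
    """Count stemmed tokens, ignoring word order (a bag-of-words vector)."""
    return Counter(naive_stem(t) for t in tokenize(text))


doc = "Customers reviewed the product. Reviewing products helps other customers!"
print(split_sentences(doc))
print(bag_of_words(doc))
```

Because "reviewed" and "Reviewing" both reduce to the same stem, they share a single entry in the counts, which is how stemming shrinks an index.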
Natural language processing (NLP)
Natural language processing, which evolved from computational linguistics, uses methods
from various disciplines, such as computer science, artificial intelligence, linguistics, and
data science, to enable computers to understand human language in both written and
verbal forms. By analyzing sentence structure and grammar, NLP sub-tasks allow
computers to “read”. Common sub-tasks include:
Sentiment analysis: This task detects positive or negative sentiment from internal or
external data sources, allowing you to track changes in customer attitudes over time.
It is commonly used to provide information about perceptions of brands, products,
and services. These insights can help businesses connect with customers and
improve processes and user experiences.
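As a rough illustration of the sentiment analysis described above, the sketch below scores text against a small word list and averages the scores per period to show a trend. The lexicon, the function names, and the sample reviews are all hypothetical and invented for this example. Real systems typically rely on trained models rather than fixed word lists.

```python
import re

# Hypothetical toy lexicon; real lexicons are far larger and often weighted.
POSITIVE = {"good", "great", "love", "excellent", "fast"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken"}


def sentiment_score(text):
    """Return (# positive words) - (# negative words) found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def average_by_period(reviews_by_period):
    """Average the sentiment score of the reviews in each period."""
    return {
        period: sum(sentiment_score(r) for r in reviews) / len(reviews)
        for period, reviews in reviews_by_period.items()
    }


# Made-up sample data, used only to exercise the functions.
reviews_by_month = {
    "2024-01": ["Great product, fast delivery", "Love it"],
    "2024-02": ["Slow shipping and poor support", "Arrived broken"],
}
print(average_by_period(reviews_by_month))
```

Comparing the per-period averages is one simple way to track how customer attitudes shift over time.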
Information extraction
Information extraction (IE) surfaces the relevant pieces of data when searching various
documents. It also focuses on extracting structured information from free text and storing
these entities, attributes, and relationship information in a database. Common information
extraction sub-tasks include:
Maintenance: Text mining provides a rich and complete picture of the operation and
functionality of products and machinery. Over time, text mining automates decision
making by revealing patterns that correlate with problems and with preventive and
reactive maintenance procedures. Text analytics helps maintenance professionals
unearth the root cause of challenges and failures faster. Learn how Korean Airlines is
using text analytics for maintenance.
Allow your data scientists to excel by equipping them with a powerful data mining toolkit.
IBM’s Watson Natural Language Understanding can help your teams learn how to analyze
text to reveal structure and meaning. Your teams can extract metadata from content such
as concepts, entities, keywords, categories, sentiment, emotion, relations and semantic
roles using natural language understanding.