Unit I – Text Mining

Introduction
• Definition:
• Text Mining, also known as Text Analytics, involves deriving meaningful information from unstructured textual data.
• Text mining is formally defined as the process of extracting relevant information or patterns from different sources that are unstructured.
• Purpose:
• To transform unstructured text into structured data for analysis.
• Applications:
• Information extraction, text classification, clustering, sentiment analysis, etc.
Why Do We Need Text Mining
• Most of the recorded information in the world is in the form of written text.
• For example: research on earthquakes.
• No single person could read and interpret all of the written material on earthquakes by themselves.
• Deriving meaning and separating the important from the unimportant is still something a human does better than any machine.
Text Mining Examples
• Text mining works on texts from practically any kind of sources from
any business domain, in any formats, including Word documents, PDF
files, XML files, and so on.
• Here are some representative examples:
• In the legal profession
• In academic research
• In the world of finance
• In medicine
• In marketing
• In the world of technology and search
Text Mining Process
• Text mining is a semi-automated process.
• Text data needs to be gathered, structured, and then mined, in a three-step process:
• The text and document are first gathered into a corpus and organized.
• The corpus is then analyzed for structure. The result is a matrix mapping
important terms to source documents.
• The structured data is then analyzed for word structures, sequences, and
frequency.
Text Mining Process
1. Establish the Corpus of Text: gather documents, clean them, and prepare them for analysis.
2. Structure using a Term-Document Matrix (TDM): select a bag of words and compute frequencies of occurrence.
3. Mine the TDM for Patterns: apply data mining tools such as classification and cluster analysis.
Core text mining operations

• Core mining operations in text mining systems center on the algorithms that underlie the creation of queries for discovering patterns in document collections.
• Core text mining operations consist of various mechanisms for discovering patterns of concepts within a document collection.
• Three types of patterns:
• Distributions (= proportions)
• Frequent and near frequent sets
• Associations
Pre-processing techniques
• Lower Case
• Converting all text to lower case so that words such as "Text" and "text" are treated identically.
• Removing Punctuation
• Removing punctuation is the process of eliminating symbols like periods (.), commas (,), exclamation points (!), question marks (?), and other non-alphanumeric characters from textual data.
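As an illustration, a minimal Python sketch of these two steps using only the standard library (string.punctuation is just one common choice of which symbols to strip):

import string

text = "Text Mining, or Text Analytics, is fascinating!"

# Lower case: normalize the text so "Text" and "text" are treated alike
lowered = text.lower()

# Removing punctuation: delete every character listed in string.punctuation
cleaned = lowered.translate(str.maketrans("", "", string.punctuation))

print(cleaned)  # text mining or text analytics is fascinating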
Tokenization
• Breaking a text into smaller units, such as words, phrases, or
sentences.
• Simplifies text processing by dividing it into manageable components.
• Example:
• Input: "Text mining is fascinating."
• Output: ["Text", "mining", "is", "fascinating"]
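For example, a small sketch using NLTK's word tokenizer (assuming nltk is installed and its "punkt" tokenizer models are downloaded; note that word_tokenize also keeps the final period as a token):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

text = "Text mining is fascinating."
print(word_tokenize(text))  # ['Text', 'mining', 'is', 'fascinating', '.']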
Tokenization
• A text may be tokenized into words or, alternatively, into sentences.
Removing Stopwords
• Stopwords are commonly used words in a language that do not carry significant meaning and are often removed from text during preprocessing.
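A minimal sketch of stopword removal with NLTK's built-in English stopword list (assuming the "stopwords" corpus has been downloaded):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

tokens = ["Text", "mining", "is", "fascinating"]
stop_words = set(stopwords.words("english"))

# Keep only tokens that are not in the stopword list
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['Text', 'mining', 'fascinating']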
Stemming
• Stemming is a method in text processing that eliminates prefixes and
suffixes from words, transforming them into their fundamental or
root form
• Reduces words to a common root form, ensuring that different
variations of a word are treated as the same.
• Example: run, running, runs → "run"
• It follows predefined rules without considering meaning.
• "Happiness" → "happi"
• The output may not be a real word.
• "Studies" → "studi"
• males → male/s, playing → play/ing
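The same behaviour can be reproduced with NLTK's Porter stemmer, a rule-based suffix stripper (so its output is not always a dictionary word):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "happiness", "studies", "males", "playing"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, happiness -> happi,
# studies -> studi, males -> male, playing -> play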
Lemmatization
• Lemmatization reduces words to their base or dictionary form
(lemma) while ensuring that the root word is meaningful.
• Ensures output is a meaningful word
• Example:
• "Running" → "run“
• "Better" → "good"
• "Studies" → "study"
# Lemmatization with NLTK's WordNetLemmatizer (requires the "wordnet" corpus)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# With POS as adjective ("a")
print(lemmatizer.lemmatize("better", pos="a"))
# Output: good
Part-of-Speech Tagging
• Assigning a part-of-speech to each word in a text.
• Words often have more than one POS.
• Example:
• Shubham ate fruit.
• There were 70 children.

• Shubham/NOUN ate/VERB fruit/NOUN
• There/PRO were/VERB 70/NUM children/NOUN ./PUNC
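A short sketch of POS tagging with NLTK (its tagger uses the Penn Treebank tagset, so the labels differ slightly from the simplified NOUN/VERB tags above):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Shubham ate fruit.")
print(nltk.pos_tag(tokens))
# e.g. [('Shubham', 'NNP'), ('ate', 'VBD'), ('fruit', 'NN'), ('.', '.')]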
Categorization
• Categorization is the process of classifying data into predefined groups.
• In text analysis, this is known as Text Categorization (TC)
• Example:
• Assigning news articles to topics like Politics, Sports, or Technology
Categorization
• Task: classify a text object into one or more of the predefined categories.
• Diagram: Text Objects → Categorization System → Categorization Result (Sports, Business, Education, ..., Science).
• The Categorization System is built from Training Data with known categories (e.g. Sports, Business, Education).
Approaches to Text Categorization
1. Knowledge Engineering Approach
• Experts manually define rules & conditions
• Works well for structured & predictable data
• Hard to scale for large datasets
2. Machine Learning (ML) Approach
• Uses pre-classified data to train a model
• Efficient for large-scale text classification
• Adapts to new data automatically
APPLICATIONS OF TEXT CATEGORIZATION
• Text Indexing with Controlled Vocabulary
• In information retrieval (IR) systems, each document in a large collection is assigned one or more key terms describing its content.
• The IR system can then retrieve documents according to user queries, which are based on the key terms.
• The key terms all belong to a finite set called the controlled vocabulary.
• The task of assigning keywords from a controlled vocabulary to text documents is called text indexing.
• Assigning keywords enables efficient document retrieval.
APPLICATIONS OF TEXT CATEGORIZATION
• Document Sorting
• Sorting documents into predefined categories or "bins."
• Examples:
• Classified ads in a newspaper: "Personal," "Car Sale," "Real Estate."
• Emails in an organization: "Complaints," "Deals," "Job Applications."
• Organizing documents into predefined categories.
APPLICATIONS OF TEXT CATEGORIZATION
• Text Filtering
• Sorting documents into two bins: "relevant" and "irrelevant."
• Text filtering is separating relevant from irrelevant documents.
• Examples:
• A sports-related online magazine should filter out all non-sport stories.
• Blocking spam emails.
Definition of the Problem
• The general text categorization task can be formally defined as the task of approximating an unknown category assignment function:
F : D × C → {0, 1}
• D: set of all possible documents.
• C: set of predefined categories.
• F(d, c):
• 1 if document d belongs to category c.
• 0 if document d does not belong to category c.
Definition of the Problem
• The classifier's predicted assignment function:
M : D × C → {0, 1}
• D: set of all documents.
• C: set of all predefined categories.
• M(d, c):
• 1 if the classifier predicts that document d belongs to category c.
• 0 if the classifier predicts that document d does not belong to category c.
• Build M such that its predictions are as close as possible to F.
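As a rough sketch of the ML approach, the approximation M can be learned from pre-classified documents; the snippet below uses scikit-learn with TF-IDF features and a Naive Bayes classifier, and the tiny training set is purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data with known categories
train_docs = [
    "The team won the championship match",        # Sports
    "Stock markets rallied after the earnings",   # Business
    "The university announced new courses",       # Education
]
train_labels = ["Sports", "Business", "Education"]

# M: documents -> categories, learned from the training data
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["Shares of the company fell sharply"]))  # e.g. ['Business']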
Clustering
• Clustering is an unsupervised process through which objects are
classified into groups called clusters.
• In the case of clustering, the problem is to group the given unlabeled
collection into meaningful clusters without any prior information. Any
labels associated with objects are obtained solely from the data.
• Clustering is useful in a wide range of data analysis fields, including
data mining, document retrieval, image segmentation, and pattern
classification.
Clustering
Examples of Clustering
• Document Clustering for News Articles
• Clustering of websites
• Customer Feedback Analysis
• Fake News & Spam Detection
Information Extraction (IE)
• Information Extraction (IE) is about locating specific items in natural-language documents.
• IE concerns locating specific pieces of data in natural-language documents, thereby extracting structured information from free text.
• "Text mining" is used to describe the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured text.
• DISCOTEX (Discovery from Text EXtraction) uses a learned information extraction system to transform text into more structured data, which is then mined for interesting relationships.
Information Extraction (IE)
• Information extraction (IE) surfaces the relevant pieces of data when searching
various documents. It also focuses on extracting structured information from
free text and storing these entities, attributes, and relationship information in a
database.
• Common information extraction sub-tasks include:
• Feature selection, or attribute selection, is the process of selecting the important features (dimensions) that contribute the most to the output of a predictive analytics model.
• Feature extraction is the process of transforming raw text into a set of informative features; this is particularly important for dimensionality reduction.
• Named-entity recognition (NER), also known as entity identification or entity extraction, aims to find and categorize specific entities in text, such as names or locations. For example, NER identifies "California" as a location and "Mary" as a person's name.
Named Entity Recognition (NER)
• NER consists of two main steps:
• Segmentation (Identifying entity boundaries)
• Classification (Assigning entity types)
• Segmentation
• Segmentation is about determining which words belong to a specific named entity
• Example:
• Shubham was born in Pune, Maharashtra in 2000
• Output:
• Shubham
• Pune
• Maharashtra
• 2000
Named Entity Recognition (NER)
• Classification
• After segmentation, each identified entity is classified into predefined
categories such as:
• Person (PER): "Shubham"
• Location (LOC): "Pune," "Maharashtra"
• Date (DATE): "2000"
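A short NER sketch using spaCy (assuming spaCy and its small English model en_core_web_sm are installed; the label names, such as GPE for locations, are model-specific):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Shubham was born in Pune, Maharashtra in 2000")

# Each entity span carries its text and a predicted label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. Shubham -> PERSON, Pune -> GPE, Maharashtra -> GPE, 2000 -> DATE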
IE-based text mining framework
Probabilistic Model
• The probability ranking principle, as originally stated:
• "If a search engine's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."
• The probability ranking principle does not tell us how to calculate or estimate the probability of relevance.
Probabilistic Model: Information
Retrieval as Classification
• P(R|D) is a conditional probability
representing the probability of
relevance given the representation of
that document
• P(NR|D) is the conditional probability for non-relevance.
Probabilistic Model: Information
Retrieval as Classification
• Let's focus on P(R|D); via Bayes' rule it can be computed from P(D|R).
• If we had information about how often specific words occurred in the
relevant set, then given a document, it would be relatively
straightforward to calculate how likely it would be to see the
combination of words in the document occurring in the relevant set.
• Example:
• Assume the probability of the word "president" appearing is 0.02.
• The probability of the word "lincoln" appearing is 0.03.
• These probabilities are based on how often these words appear in a dataset.
Probabilistic Model: Information
Retrieval as Classification
• If a new document contains both "president" and "Lincoln", we
assume independence.
• The probability of both words appearing together is calculated as:
0.02×0.03=0.0006
• This means the chance of both words occurring in a document is
0.06%.
Probabilistic Model: Information
Retrieval as Classification
• So how does calculating P(D|R) get us to the probability of relevance? It turns out there is a relationship between P(R|D) and P(D|R) that is expressed by Bayes' rule:
P(R|D) = P(D|R) P(R) / P(D)
• where P(R) is the prior probability of relevance (in other words, how likely any document is to be relevant), and P(D) acts as a normalizing constant.
• Classify a document as relevant if P(D|R) P(R) > P(D|NR) P(NR). This is the same as classifying a document as relevant if:
P(D|R) / P(D|NR) > P(NR) / P(R)
• If we use the likelihood ratio as a score, the highly ranked documents will be those that have a high likelihood of belonging to the relevant set.
Probabilistic Model: Information
Retrieval as Classification
• Bayes' rule: P(R|D) = P(D|R) P(R) / P(D)
• Probability of individual words appearing: P("president") = 0.02, P("lincoln") = 0.03
• Assuming independence, we calculate P(D|R):
P(D|R) = P("president") × P("lincoln") = 0.02 × 0.03 = 0.0006
• Similarly, for the non-relevant set:
P(D|NR) = 0.01 × 0.01 = 0.0001
• Prior probabilities: P(R) = 0.5, P(NR) = 0.5
Probabilistic Model: Information
Retrieval as Classification
• Total probability of the document occurring, P(D):
P(D) = P(D|R) P(R) + P(D|NR) P(NR)
P(D) = (0.0006 × 0.5) + (0.0001 × 0.5)
P(D) = 0.0003 + 0.00005 = 0.00035
• Compute P(R|D):
P(R|D) = P(D|R) P(R) / P(D) = 0.0003 / 0.00035 ≈ 0.857
Probabilistic Model: Information
Retrieval as Classification
• Compute P(NR|D):
P(NR|D) = P(D|NR) P(NR) / P(D) = 0.00005 / 0.00035 ≈ 0.143
• Is the document relevant?
Since P(R|D) ≈ 0.857 is much greater than P(NR|D) ≈ 0.143, the document is classified as relevant.
Probabilistic Model: Information
Retrieval as Classification
• Given:
P(D|R) = 0.0006 (probability of the document appearing in the relevant set)
P(D|NR) = 0.0002 (probability of the document appearing in the non-relevant set)
P(R) = 0.9 (prior probability of relevance)
P(NR) = 1 − P(R) = 1 − 0.9 = 0.1 (prior probability of non-relevance)
Probabilistic Model: Information
Retrieval as Classification
• Compute the likelihood ratio:
P(D|R) / P(D|NR) = 0.0006 / 0.0002 = 3
• The prior odds of relevance are calculated as:
P(NR) / P(R) = 0.1 / 0.9 ≈ 0.111
• To classify the document as relevant, we check whether:
P(D|R) / P(D|NR) > P(NR) / P(R)
• Since 3 is much greater than 0.111, the document is classified as relevant.
• Because the likelihood ratio is significantly higher than the prior odds, the document containing "Lincoln" and "President" is classified as relevant.
Hidden Markov Model
A Hidden Markov Model (HMM) is a probabilistic model used to represent systems that transition between
different hidden states over time. It is particularly useful for modeling sequential data where the
underlying state is not directly observable but can be inferred through observed signals.
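A minimal HMM sketch in plain Python: two hidden states, made-up transition and emission probabilities, and the forward algorithm to score an observation sequence (all numbers are illustrative, not from the slides):

# Hidden states and illustrative model parameters
states = ["Rainy", "Sunny"]
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def forward(observations):
    """Return P(observations) under the HMM via the forward algorithm."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(forward(["walk", "shop", "clean"]))  # likelihood of observing this sequence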
Applications of Text Mining
• Risk Management
• Risk management involves systematically identifying, analyzing, treating, and
monitoring risks associated with business processes. Insufficient risk analysis
often leads to organizational failures, especially in financial institutions.
• Text mining enhances risk management by utilizing Risk Administration
Software.
• It enables organizations to process vast amounts of text files efficiently.
• The ability to link and access relevant data in real time enhances decision-
making and risk mitigation strategies.
Applications of Text Mining
• Customer Care Service
• Text mining, particularly NLP-based techniques, is increasingly used in
customer service to enhance user experience and satisfaction.
• Companies leverage text analytics software to process textual data from
surveys, feedback, and customer interactions.
• It helps reduce response time and address user grievances more effectively.
• Analyzing customer sentiment allows businesses to improve their service
quality and engagement strategies.
Applications of Text Mining
• Social Media Analysis
• Text mining tools are widely used for analyzing data generated on social
media platforms.
• These tools track and interpret text from news articles, blogs, emails, and
social media posts.
• Businesses can analyze interactions such as posts, likes, and follower activity
to understand audience engagement.
• It helps brands measure customer perception and improve their marketing
strategies.
Applications of Text Mining
• Business Intelligence
• Text mining plays a crucial role in business intelligence by providing actionable
insights.
• Organizations analyze customer behavior and market trends using text mining
techniques.
• Competitive analysis enables companies to assess their strengths and
weaknesses against industry rivals.
• Extracting valuable data from multiple sources allows businesses to enhance
strategic decision-making and gain a competitive edge.
