Unit I – Text Mining
Introduction
• Definition:
• Text Mining, also known as Text Analytics, involves deriving meaningful
information from unstructured textual data.
• Text mining is formally defined as the process of extracting relevant information
or patterns from different sources that are unstructured.
• Purpose:
• To transform unstructured text into structured data for analysis.
• Applications:
• Information extraction, text classification, clustering, sentiment analysis, etc.
Why Do We Need Text Mining
• Most of the information humans have recorded is in the form of
written text.
• For example: research on earthquakes.
• No single person can read and interpret all the written material on
earthquakes alone.
• Deriving meaning and filtering the unimportant from the important is
still something a human does better than any machine.
Text Mining Examples
• Text mining works on texts from practically any kind of sources from
any business domain, in any formats, including Word documents, PDF
files, XML files, and so on.
• Here are some representative examples:
• In the legal profession
• In academic research
• In the world of finance
• In medicine
• In marketing
• In the world of technology and search
Text Mining Process
• Text mining is a semi-automated process.
• Text data needs to be gathered, structured, and then mined, in a
three-step process (see the sketch after this list):
• The text documents are first gathered into a corpus and organized.
• The corpus is then analyzed for structure. The result is a matrix mapping
important terms to source documents.
• The structured data is then analyzed for word structures, sequences, and
frequency.
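A minimal sketch of these three steps in Python, assuming scikit-learn is available (the two-document corpus is a made-up illustration):

# Step 1: gather documents into a corpus (hypothetical sample documents)
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Text mining extracts patterns from text.",
    "Mining text data reveals useful patterns.",
]

# Step 2: analyze the corpus for structure -> term-document matrix
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(corpus)

# Step 3: analyze the structured data, e.g., term frequencies
print(vectorizer.get_feature_names_out())  # important terms
print(matrix.toarray())                    # term counts per document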
Text Mining Process
[Two alternative process diagrams omitted]
Stopword Removal
• Stopwords are commonly used words in a language that do not carry significant meaning and are often
removed from text during preprocessing.
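A minimal stopword-removal sketch, assuming NLTK and its 'stopwords' corpus are installed:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download

stop_words = set(stopwords.words("english"))
tokens = ["this", "is", "a", "sample", "sentence"]

# Keep only tokens that are not stopwords
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['sample', 'sentence']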
Stemming
• Stemming is a method in text processing that eliminates prefixes and
suffixes from words, transforming them into their fundamental or
root form
• Reduces words to a common root form, ensuring that different
variations of a word are treated as the same.
• Example: run, running, runs → "run"
• It follows predefined rules without considering meaning.
• "Happiness" → "happi"
• The output may not be a real word.
• "Studies" → "studi"
• males → male/s, playing → play/ing
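A minimal sketch of rule-based stemming with NLTK's PorterStemmer (assuming NLTK is installed), reproducing the examples above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "happiness", "studies", "males", "playing"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, happiness -> happi,
# studies -> studi, males -> male, playing -> play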
Lemmatization
• Lemmatization reduces words to their base or dictionary form
(lemma) while ensuring that the root word is meaningful.
• Ensures output is a meaningful word
• Example:
• "Running" → "run“
• "Better" → "good"
• "Studies" → "study"
# With POS tagged as adjective ("a"); assumes NLTK and its WordNet data are available
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))
# Output: good
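Note that without an explicit POS tag the WordNet lemmatizer treats every word as a noun by default, so the same word can lemmatize differently. A small sketch, again assuming NLTK's WordNet data:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))           # 'running' (noun by default)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (as a verb)
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'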
Part-of-Speech Tagging
• Assigning a part-of-speech to each word in a text.
• Words often have more than one POS.
• Example:
• Shubham ate fruit.
• There were 70 children.
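A minimal tagging sketch using NLTK's default tagger, assuming NLTK and its tokenizer/tagger data are installed (resource names vary slightly across NLTK versions):

from nltk import word_tokenize, pos_tag

sentence = "Shubham ate fruit."
tokens = word_tokenize(sentence)
print(pos_tag(tokens))
# Typical output: [('Shubham', 'NNP'), ('ate', 'VBD'), ('fruit', 'NN'), ('.', '.')]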
Probabilistic Model: Information Retrieval as Classification
[Diagram omitted: a classifier assigning incoming documents to categories such as Education, Science, Sports, and Business]
• The Classifier's Predicted Assignment Function follows from Bayes' rule:
P(R∣D) = P(D∣R)P(R) / P(D)
• where P(R) is the prior probability of relevance (in other words, how likely any
document is to be relevant), and P(D) acts as a normalizing constant.
• Classify a document as relevant if P(D∣R)P(R) > P(D∣NR)P(NR). This is the same as
classifying a document as relevant if:
P(D∣R) / P(D∣NR) > P(NR) / P(R)
• If we use the likelihood ratio P(D∣R)/P(D∣NR) as a score, the highest-ranked documents will be those that
have a high likelihood of belonging to the relevant set.
Probabilistic Model: Information Retrieval as Classification
• Worked example using Bayes' rule. Likelihoods of the document under each class:
P(D∣R) = 0.0006
P(D∣NR) = 0.01 × 0.01 = 0.0001
• Prior probabilities:
P(R) = 0.5, P(NR) = 0.5
Probabilistic Model: Information Retrieval as Classification
• Total probability of the document occurring, P(D):
P(D) = P(D∣R)P(R) + P(D∣NR)P(NR)
P(D) = (0.0006 × 0.5) + (0.0001 × 0.5)
P(D) = 0.0003 + 0.00005 = 0.00035
• Compute P(R∣D):
P(R∣D) = P(D∣R)P(R) / P(D) = 0.0003 / 0.00035 ≈ 0.857
Probabilistic Model: Information Retrieval as Classification
• Compute P(NR∣D):
P(NR∣D) = P(D∣NR)P(NR) / P(D) = 0.00005 / 0.00035 ≈ 0.143
• Equivalently, P(NR∣D) = 1 − P(R∣D) = 1 − 0.857 ≈ 0.143 (posterior probability of non-relevance)
Probabilistic Model: Information Retrieval as Classification
• Compute the likelihood ratio:
P(D∣R) / P(D∣NR) = 0.0006 / 0.0001 = 6
• Since 6 > P(NR)/P(R) = 0.5/0.5 = 1, the document is classified as relevant.
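The same worked example as a short Python check (a sketch; the numbers are exactly those given above):

# Worked example: decide relevance via the likelihood ratio
p_d_given_r = 0.0006   # P(D|R)
p_d_given_nr = 0.0001  # P(D|NR)
p_r, p_nr = 0.5, 0.5   # prior probabilities

# Posterior probabilities via Bayes' rule
p_d = p_d_given_r * p_r + p_d_given_nr * p_nr  # 0.00035
p_r_given_d = p_d_given_r * p_r / p_d          # ~0.857
p_nr_given_d = p_d_given_nr * p_nr / p_d       # ~0.143

# Likelihood ratio test: relevant if P(D|R)/P(D|NR) > P(NR)/P(R)
ratio = p_d_given_r / p_d_given_nr             # 6.0
print(p_r_given_d, p_nr_given_d, ratio)
print("relevant" if ratio > p_nr / p_r else "not relevant")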