Text Mining and Its Applications
Abstract:
As computer networks become the backbones of science and
the economy, enormous quantities of machine-readable documents
become available. Computerization and automated data gathering
have resulted in extremely large data repositories (e.g. Walmart:
2,000 stores, 20 million transactions per day). Unfortunately, the usual
logic-based programming paradigm has great difficulties in
capturing the fuzzy and often ambiguous relations in text
documents. Text mining refers generally to the process of
extracting interesting information and knowledge from
unstructured text.
In this paper, text mining is described in relation to information
retrieval, machine learning, statistical analysis and especially data
mining. These methods are first introduced, and text mining is then
defined in relation to them. Later sections present different
approaches to the main analysis tasks: preprocessing, classification,
clustering, information extraction and visualization. The last
section describes a number of successful applications of text mining.
Keywords: data mining, machine learning, text mining, text
categorization, clustering, text visualization
1. INTRODUCTION
Text mining is a new area of computer science which
fosters strong connections with natural language processing,
data mining, machine learning, information retrieval and
knowledge management. Text mining seeks to extract useful
information from unstructured textual data through the
identification and exploration of interesting patterns [2].
The purpose of Text Mining is to process unstructured
(textual) information, extract meaningful numeric indices
from the text, and, thus, make the information contained in
the text accessible to the various data mining (statistical and
machine learning) algorithms. Information can be extracted
to derive summaries for the words contained in the
documents or to compute summaries for the documents based
on the words contained in them. Hence, you can analyze
words, clusters of words used in documents, etc., or you
could analyze documents and determine similarities between
them or how they are related to other variables of interest in
the data mining project. In the most general terms, text
mining will "turn text into numbers" (meaningful indices),
which can then be incorporated in other analyses such as
predictive data mining projects, the application of
unsupervised learning methods (clustering), and so on.
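To make this concrete, the following sketch (plain Python, written for this overview and not taken from any particular text mining system) shows one common way to "turn text into numbers": each document is mapped to a TF-IDF vector over a shared vocabulary, and the resulting numeric vectors can then be compared with cosine similarity or passed on to clustering and classification algorithms. The example documents are invented for illustration.

# Minimal sketch: documents -> TF-IDF vectors -> cosine similarity.
import math
from collections import Counter

docs = [
    "data mining extracts patterns from large data sets",
    "text mining extracts knowledge from unstructured text",
    "machine learning algorithms learn patterns from data",
]

tokenized = [d.lower().split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def idf(term):
    # Inverse document frequency: rare terms get higher weight.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(doc):
    # Represent one document as a numeric vector over the shared vocabulary.
    counts = Counter(doc)
    return [counts[t] / len(doc) * idf(t) for t in vocab]

vectors = [tfidf_vector(doc) for doc in tokenized]

def cosine(u, v):
    # Cosine similarity between two document vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(round(cosine(vectors[0], vectors[1]), 3))  # doc 0 vs doc 1
print(round(cosine(vectors[0], vectors[2]), 3))  # doc 0 vs doc 2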
2. KNOWLEDGE DISCOVERY IN DATABASES
Fayyad has defined Knowledge Discovery in Databases
(KDD) as follows [1]:
"Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data."
Knowledge discovery in databases is a process that is
defined by several processing steps that have to be applied to
a data set of interest in order to extract useful patterns. These
steps have to be performed iteratively and several steps
usually require interactive feedback from a user.
2.1 Data Mining, Machine Learning and Statistical
Learning
Data mining is sometimes used as a synonym for KDD, meaning that
data mining covers all aspects of the knowledge discovery
process. More commonly, data mining is considered as the part of the KDD
process that constitutes the modeling phase, i.e. the
application of algorithms and methods for the computation of
the searched patterns or models.
Databases are necessary in order to analyze large
quantities of data efficiently. Since the analysis of the data
with data mining algorithms can be supported by databases,
the use of database technology in the data mining
process might be useful.
Machine Learning (ML) is an area of artificial intelligence
concerned with the development of techniques which allow
computers to learn by the analysis of data sets. ML is also
concerned with the algorithmic complexity of computational
implementations.
Statistics deals with the science and practice of
analyzing empirical data. It is based on statistical theory
which is a branch of applied mathematics. Within statistical
theory, randomness and uncertainty are modeled by
probability theory. Today many methods of statistics are used
in the field of KDD [2].
2.2 Text Mining
Text mining or knowledge discovery from text (KDT)
deals with the machine supported analysis of text. It uses
techniques from information retrieval, information extraction
as well as natural language processing (NLP) and connects
them with the algorithms and methods of KDD, data mining,
machine learning and statistics.
There are different definitions of text mining, according to the specific
perspectives of different research areas:
Text Mining = Information Extraction. The first
approach assumes that text mining essentially corresponds to
information extraction, i.e. the extraction of facts from texts.
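As a hedged illustration of this view, the small Python sketch below extracts (company, location) facts from free text with a single hand-crafted pattern. The pattern and sentences are invented for illustration only; real information extraction systems rely on much richer linguistic patterns or learned models.

# Toy fact extraction: one hand-written pattern over free text.
import re

pattern = re.compile(
    r"(?P<company>[A-Z][\w&]*(?: [A-Z][\w&]*)*) is headquartered in (?P<city>[A-Z][a-z]+)"
)

text = ("Acme Corp is headquartered in Boston. "
        "Globex is headquartered in Springfield.")

for m in pattern.finditer(text):
    # Each match yields one extracted fact as a (company, city) pair.
    print((m.group("company"), m.group("city")))
# ('Acme Corp', 'Boston')
# ('Globex', 'Springfield')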
4.8 Bioinformatics
Bio-entity recognition aims to identify and classify
technical terms in the domain of molecular biology that
correspond to instances of concepts that are of interest to
biologists. Examples of such entities include the names of
proteins, genes and their locations of activity such as cells or
organism names. Entity recognition is becoming increasingly
important with the massive increase in reported results due to
high-throughput experimental methods. It can be used in
several higher level information access tasks such as relation
extraction, summarization and question answering. For
practical applications the current accuracy levels are not yet
satisfactory and research currently aims at including a
sophisticated mix of external resources such as keyword lists
and ontologies which provide terminological resources [2].
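As an illustration of how such keyword lists can be used, the following minimal Python sketch tags entities by matching terms from a tiny invented gene/protein lexicon. Practical bio-entity recognizers combine such terminological resources with statistical sequence models, so this is an illustration only.

# Dictionary-based entity tagging with an invented lexicon.
gene_lexicon = {"BRCA1": "GENE", "p53": "PROTEIN", "TNF-alpha": "PROTEIN"}

def tag_entities(sentence, lexicon):
    # Return (term, label, position) for the first occurrence of each lexicon term.
    hits = []
    for term, label in lexicon.items():
        pos = sentence.find(term)
        if pos != -1:
            hits.append((term, label, pos))
    return sorted(hits, key=lambda h: h[2])

print(tag_entities("Mutations in BRCA1 alter p53 regulation.", gene_lexicon))
# [('BRCA1', 'GENE', 13), ('p53', 'PROTEIN', 25)]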
4.9 Anti-Spam Filtering of Emails
The explosive growth of unsolicited e-mail, more
commonly known as spam, has over the last years constantly
undermined the usability of e-mail. One solution
is offered by anti-spam filters. Most commercially available
filters use black-lists and hand-crafted rules. On the other
hand, the success of machine learning methods in text
classification offers the possibility of building anti-spam
filters that can be adapted quickly to new types of spam.
There is a growing number of learning spam filters, mostly
using naive Bayes classifiers; a prominent example is
Mozilla's e-mail client. Michelakis et al. [MAP+04] compare
different classifier methods and investigate the different costs of
classifying a legitimate mail as spam. They find that for their
benchmark corpora the SVM nearly always yields the best
results. To explore how well a learning-based filter performs
in real life, they used an SVM-based procedure for seven
months without retraining. They achieved a precision of
96.5% and a recall of 89.3%. They conclude that these good
results may be improved by careful preprocessing and the
extension of filtering to different languages [2].
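To illustrate the kind of learning filter discussed here (not the actual system evaluated by Michelakis et al.), the following Python sketch implements a minimal naive Bayes spam classifier with add-one smoothing on an invented four-message training set. Real filters train on thousands of labelled messages and use more careful tokenization and cost-sensitive decision thresholds.

# Minimal naive Bayes spam filter on an invented training set.
import math
from collections import Counter, defaultdict

train = [
    ("win a free prize now", "spam"),
    ("cheap meds free shipping", "spam"),
    ("meeting agenda attached", "ham"),
    ("project report due friday", "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)           # per-class word frequencies
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    # Pick the class with the highest log posterior (Laplace-smoothed likelihoods).
    scores = {}
    for label in class_counts:
        log_prob = math.log(class_counts[label] / len(train))   # class prior
        total = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            log_prob += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

print(classify("free prize inside"))        # spam
print(classify("agenda for the meeting"))   # ham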
4.10 Mining Bibliographic Data
Vojvodina, the northern province of Serbia, is home to
many educational and research institutions. In 2004, the
Provincial Secretariat for Science and Technological
Development of Vojvodina started collecting data from
researchers employed at institutions within its jurisdiction.
Every researcher was asked to fill in a form, provided as an
MS Word document, with bibliographic references of all
authored publications, among other data. Notable properties
of the collection are its incompleteness and the diversity of
approaches to giving references, permitted by the information
being entered in free-text format [5][6].
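As a hedged illustration of the parsing problem this diversity creates (not the approach actually used for the Vojvodina collection), the following Python sketch extracts authors, year and title from one invented "Authors (Year). Title." reference style with a regular expression; references written in other styles would simply fail to match.

# Toy parser for a single free-text reference style.
import re

reference_pattern = re.compile(
    r"(?P<authors>[^(]+)\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\."
)

ref = "Smith, J., Jones, A. (2004). Mining bibliographic records."
m = reference_pattern.match(ref)
if m:
    print(m.group("authors").strip(" ,"))  # Smith, J., Jones, A.
    print(m.group("year"))                 # 2004
    print(m.group("title"))                # Mining bibliographic records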
5. Conclusion
In this paper, a brief introduction is given to the broad field of
text mining. A more formal definition of the term is presented,
together with a brief overview of currently available
text mining methods, their properties and their applications to