
Data Mining:

Concepts and Techniques


Chapter I: Introduction to Data Mining

We are in an age often referred to as the information age. In this information age, because we believe that
information leads to power and success, and thanks to sophisticated technologies such as computers,
satellites, etc., we have been collecting tremendous amounts of information. Initially, with the advent of
computers and means for mass digital storage, we started collecting and storing all sorts of data, counting
on the power of computers to help sort through this amalgam of information. Unfortunately, these massive
collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos has
led to the creation of structured databases and database management systems (DBMS). Efficient
database management systems have been very important assets for the management of large corpora of data,
and especially for the effective and efficient retrieval of particular information from a large collection
whenever needed. The proliferation of database management systems has also contributed to the recent
massive gathering of all sorts of information. Today, we have far more information than we can handle:
from business transactions and scientific data, to satellite pictures, text reports and military intelligence.
Information retrieval is simply not enough anymore for decision-making. Confronted with huge collections
of data, we now have new needs to help us make better managerial choices: the automatic summarization of
data, the extraction of the "essence" of the information stored, and the discovery of patterns in raw data.

Data mining is a powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. It has been defined as:

The automated analysis of large or complex data sets in order to discover significant patterns or trends that
would otherwise go unrecognised.

The key elements that make data mining tools a distinct form of software are:

Automated analysis

Data mining automates the process of sifting through historical data in order to discover new
information. This is one of the main differences between data mining and statistics, where a model is
usually devised by a statistician to deal with a specific analysis problem. It also distinguishes data
mining from expert systems, where the model is built by a knowledge engineer from rules extracted
from the experience of an expert.

The emphasis on automated discovery also separates data mining from OLAP and simpler query and
reporting tools, which are used to verify hypotheses formulated by the user. Data mining does not rely
on a user to define a specific query, but merely to formulate a goal, such as the identification of
fraudulent claims.
Large or complex data sets

One of the attractions of data mining is that it makes it possible to analyse very large data sets in a
reasonable time scale. Data mining is also suitable for complex problems involving relatively small
amounts of data but where there are many fields or variables to analyse. However, for small, relatively
simple data analysis problems there may be simpler, cheaper and more effective solutions.

Discovering significant patterns or trends that would otherwise go unrecognised

The goal of data mining is to unearth relationships in data that may provide useful insights.

Data mining tools can sweep through databases and identify previously hidden patterns in one step. An
example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products
that are often purchased together. Other pattern discovery problems include detecting fraudulent credit
card transactions, finding performance bottlenecks in a network system, and identifying anomalous data
that could represent data-entry keying errors. The ultimate significance of these patterns will be assessed by a
domain expert - a marketing manager or network supervisor - so the results must be presented in a way
that human experts can understand.

Data mining tools can also automate the process of finding predictive information in large databases.
Questions that traditionally required extensive hands-on analysis can now be answered directly from the
data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data
on past promotional mailings to identify the targets most likely to maximize return on investment in
future mailings. Other predictive problems include forecasting bankruptcy and other forms of default,
and identifying segments of a population likely to respond similarly to given events.
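
As a concrete illustration, the sketch below trains a simple response model on past mailing data with scikit-learn. It is a minimal sketch only: the file name, the column names and the choice of logistic regression are assumptions for illustration, not a prescribed method.

    # Minimal sketch: scoring customers for a promotional mailing.
    # The CSV file name and column names are hypothetical.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Historical data: one row per customer mailed in a past campaign,
    # with a 0/1 "responded" outcome recorded afterwards.
    data = pd.read_csv("past_mailings.csv")
    X = data[["age", "income", "prior_purchases"]]
    y = data["responded"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))

    # Rank prospects by predicted response probability and mail the
    # top of the list to maximize return on investment.
    scores = model.predict_proba(X_test)[:, 1]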

Data mining techniques can yield the benefits of automation on existing software and hardware platforms to
enhance the value of existing information resources, and can be implemented on new products and systems
as they are brought on-line. When implemented on high performance client/server or parallel processing
systems, they can analyse massive databases to deliver answers to questions such as:

"Which clients are most likely to respond to my next promotional mailing, and why?"

Data mining is ready for application because it is supported by three technologies that are now sufficiently
mature:

- Massive data collection

- Powerful multiprocessor computers

- Data mining algorithms

Commercial databases are growing at unprecedented rates, especially in the retail sector. The
accompanying need for improved computational engines can now be met in a cost-effective manner with
parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed
for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that
consistently outperform older statistical methods.

The core components of data mining technology have been under development for decades, in research
areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these
techniques, coupled with high-performance relational database engines and broad data integration efforts,
makes these technologies practical for current data warehouse environments.

The key to understanding the different facets of data mining is to distinguish between data mining
applications, operations, techniques and algorithms.

Applications: database marketing, customer segmentation, customer retention, fraud detection, credit checking, web site analysis

Operations: classification and prediction, clustering, association analysis, forecasting

Techniques: neural networks, decision trees, k-nearest neighbour algorithms, naive Bayesian, cluster analysis

What kind of information are we collecting?
We have been collecting a myriad of data, from simple numerical measurements and text documents, to
more complex information such as spatial data, multimedia channels, and hypertext documents. Here is a
non-exhaustive list of the variety of information collected in digital form in databases and in flat files.

Business transactions: Every transaction in the business world is (often) "memorized" for
perpetuity. Such transactions are usually time-related and can be inter-business deals such as
purchases, exchanges, banking, stock, etc., or intra-business operations such as the management of in-
house wares and assets. Large department stores, for example, thanks to the widespread use of bar
codes, store millions of transactions daily, often representing terabytes of data. Storage space is not
the major problem, as the price of hard disks is continuously dropping; the effective use of the
data in a reasonable time frame for competitive decision-making is the most important
problem to solve for businesses that struggle to survive in a highly competitive world.
Scientific data: Whether in a Swiss nuclear accelerator laboratory counting particles, in the
Canadian forest studying readings from a grizzly bear radio collar, on a South Pole iceberg
gathering data about oceanic activity, or in an American university investigating human psychology,
our society is amassing colossal amounts of scientific data that need to be analyzed. Unfortunately,
we can capture and store more new data faster than we can analyze the old data already
accumulated.
Medical and personal data: From government census to personnel and customer files, very large
collections of information are continuously gathered about individuals and groups. Governments,
companies and organizations such as hospitals are stockpiling very large quantities of personal
data to help them manage human resources, better understand a market, or simply assist clientele.
Regardless of the privacy issues this type of data often raises, the information is collected, used
and even shared. When correlated with other data, this information can shed light on customer
behaviour and the like.
Surveillance video and pictures: With the amazing collapse of video camera prices, video cameras
are becoming ubiquitous. Video tapes from surveillance cameras are usually recycled and thus the
content is lost. However, there is a tendency today to store the tapes and even digitize them for
future use and analysis.
Satellite sensing: There are countless satellites around the globe: some are geo-stationary
above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data
to the surface. NASA, which controls a large number of satellites, receives more data every second
than all NASA researchers and engineers can cope with. Many satellite pictures and data are
made public as soon as they are received, in the hope that other researchers can analyze them.
Games: Our society is collecting a tremendous amount of data and statistics about games, players
and athletes. From hockey scores, basketball passes and car-racing laps, to swimming times,
boxers' punches and chess positions, all the data are stored. Commentators and journalists use
this information for reporting, but trainers and athletes want to exploit this data to improve
performance and better understand opponents.
Digital media: The proliferation of cheap scanners, desktop video cameras and digital cameras is
one of the causes of the explosion in digital media repositories. In addition, many radio stations,
television channels and film studios are digitizing their audio and video collections to improve the
management of their multimedia assets. Associations such as the NHL and the NBA have already
started converting their huge game collection into digital forms.
CAD and software engineering data: There is a multitude of Computer-Aided Design (CAD)
systems for architects to design buildings or for engineers to conceive system components or circuits.
These systems are generating a tremendous amount of data. Moreover, software engineering is a
source of considerable similar data with code, function libraries, objects, etc., which need powerful
tools for management and maintenance.
Virtual Worlds: There are many applications making use of three-dimensional virtual spaces.
These spaces and the objects they contain are described with special languages such as VRML.
Ideally, these virtual spaces are described in such a way that they can share objects and places.
A remarkable number of virtual-reality object and space repositories are available. Management
of these repositories, as well as content-based search and retrieval from them, are still
research issues, while the size of the collections continues to grow.
Text reports and memos (e-mail messages): Most communications within and between
companies, research organizations, and even private individuals are based on reports and memos in
textual form, often exchanged by e-mail. These messages are regularly stored in digital form for
future use and reference, creating formidable digital libraries.
The World Wide Web repositories: Since the inception of the World Wide Web in the early 1990s,
documents of all sorts of formats, content and description have been collected and inter-connected
with hyperlinks, making it the largest repository of data ever built. Despite its dynamic and
unstructured nature, its heterogeneity, and its frequent redundancy and
inconsistency, the World Wide Web is the most important data collection regularly used for
reference because of the broad variety of topics covered and the endless contributions of resources
and publishers. Many believe that the World Wide Web will become the compilation of human
knowledge.

What are Data Mining and Knowledge Discovery?


With the enormous amount of data stored in files, databases, and other repositories, it is increasingly
important, if not necessary, to develop powerful means for analysis and perhaps interpretation of such data
and for the extraction of interesting knowledge that could help in decision-making.

Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information from data in databases. While
data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data
mining is actually part of the knowledge discovery process. The following figure (Figure 1.1) shows data
mining as a step in an iterative knowledge discovery process.

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections
to some form of new knowledge. The iterative process consists of the following steps:

Data cleaning: also known as data cleansing, this is a phase in which noisy and irrelevant data
are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a
common source.
Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the
data collection.
Data transformation: also known as data consolidation, it is a phase in which the selected data is
transformed into forms appropriate for the mining procedure.
Data mining: the crucial step in which clever techniques are applied to extract potentially
useful patterns.
Pattern evaluation: in this step, interesting patterns representing knowledge are identified
based on given measures.
Knowledge representation: the final phase, in which the discovered knowledge is visually
represented to the user. This essential step uses visualization techniques to help users understand
and interpret the data mining results.
It is common to combine some of these steps together. For instance, data cleaning and data integration can
be performed together as a pre-processing phase to generate a data warehouse. Data selection and data
transformation can also be combined where the consolidation of the data is the result of the selection, or, as
for the case of data warehouses, the selection is done on transformed data.
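
To make the flow of these steps concrete, here is a purely illustrative Python skeleton of one pass through the process; every function body is a placeholder standing in for project-specific logic, not a real implementation.

    # Illustrative skeleton of the iterative KDD process.
    # Each stage is a placeholder for real, project-specific logic.

    def clean(records):
        """Data cleaning: remove noisy or irrelevant records."""
        return [r for r in records if r is not None]

    def integrate(*sources):
        """Data integration: combine multiple (possibly heterogeneous) sources."""
        combined = []
        for source in sources:
            combined.extend(source)
        return combined

    def select(records, relevant_fields):
        """Data selection: keep only the fields relevant to the analysis."""
        return [{k: r[k] for k in relevant_fields if k in r} for r in records]

    def transform(records):
        """Data transformation: reshape for the mining procedure."""
        return records  # e.g. discretize, normalize, aggregate

    def mine(records):
        """Data mining: extract candidate patterns."""
        return [("example-pattern", len(records))]

    def evaluate(patterns, threshold=0):
        """Pattern evaluation: keep only the interesting patterns."""
        return [p for p in patterns if p[1] > threshold]

    # One pass of the loop; in practice the results feed back into
    # earlier steps, which is what makes the process iterative.
    source_a, source_b = [{"x": 1}, None], [{"x": 2}]
    records = transform(select(clean(integrate(source_a, source_b)), ["x"]))
    print(evaluate(mine(records)))  # knowledge representation: show results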

The KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation
measures can be enhanced, the mining can be further refined, new data can be selected or further
transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Data mining derives its name from the similarities between searching for valuable information in a large
database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of
material or ingeniously probing the material to exactly pinpoint where the values reside. It is, however, a
misnomer, since mining for gold in rocks is usually called "gold mining" and not "rock mining", thus by
analogy, data mining should have been called "knowledge mining" instead. Nevertheless, data mining
became the accepted customary term, and very rapidly became a trend that even overshadowed more general
terms such as knowledge discovery in databases (KDD), which describe a more complete process. Other similar
terms referring to data mining are: data dredging, knowledge extraction and pattern discovery.

What kind of Data can be mined?


In principle, data mining is not specific to one type of media or data. Data mining should be applicable to
any kind of information repository. However, algorithms and approaches may differ when applied to
different types of data. Indeed, the challenges presented by different types of data vary significantly. Data
mining is being put into use and studied for databases, including relational databases, object-relational
databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-
structured repositories such as the World Wide Web, advanced databases such as spatial databases,
multimedia databases, time-series databases and textual databases, and even flat files. Here are some
examples in more detail:

Flat files: Flat files are actually the most common data source for data mining algorithms,
especially at the research level. Flat files are simple data files in text or binary format with a
structure known by the data mining algorithm to be applied. The data in these files can be
transactions, time-series data, scientific measurements, etc.
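As a small illustration, a mining algorithm might read such a file as follows; the comma-and-semicolon layout shown is an assumption, since a flat-file format is simply whatever the algorithm's author defines.

    # Sketch: parsing a flat transaction file of the (assumed) form
    #   transaction_id,date,item1;item2;item3
    import csv

    transactions = []
    with open("transactions.txt", newline="") as f:
        for tid, date, items in csv.reader(f):
            transactions.append((tid, date, items.split(";")))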
Relational Databases: Briefly, a relational database consists of a set of tables containing either
values of entity attributes, or values of attributes from entity relationships. Tables have columns and
rows, where columns represent attributes and rows represent tuples. A tuple in a relational table
corresponds to either an object or a relationship between objects and is identified by a set of
attribute values representing a unique key. In Figure 1.2 we present the relations Customer, Items,
and Borrow, representing business activity in a fictitious video store, OurVideoStore. These relations
are just a subset of what could be a database for the video store and are given as an example.

The most commonly used query language for relational databases is SQL, which allows retrieval and
manipulation of the data stored in the tables, as well as the calculation of aggregate functions such
as average, sum, min, max and count. For instance, an SQL query to count the videos grouped by
category would be:

SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category;

Data mining algorithms using relational databases can be more versatile than data mining
algorithms specifically written for flat files, since they can take advantage of the structure inherent
to relational databases. While data mining can benefit from SQL for data selection, transformation
and consolidation, it goes beyond what SQL could provide, such as predicting, comparing, detecting
deviations, etc.

Data Warehouses: A data warehouse, as a storehouse, is a repository of data collected from
multiple data sources (often heterogeneous) and is intended to be used as a whole under the same
unified schema. A data warehouse gives the option to analyze data from different sources under the
same roof. Let us suppose that OurVideoStore becomes a franchise in North America. Many video
stores belonging to OurVideoStore company may have different databases and different structures.
If the executive of the company wants to access the data from all stores for strategic decision-
making, future direction, marketing, etc., it would be more appropriate to store all the data in one
site with a homogeneous structure that allows interactive analysis. In other words, data from the
different stores would be loaded, cleaned, transformed and integrated together. To facilitate
decision-making and multi-dimensional views, data warehouses are usually modeled by a multi-
dimensional data structure. Figure 1.3 shows an example of a three dimensional subset of a data
cube structure used for OurVideoStore data warehouse.

The figure shows summarized rentals grouped by film categories, then a cross table of summarized
rentals by film categories and time (in quarters). The data cube gives the summarized rentals along
three dimensions: category, time, and city. A cube contains cells that store values of some aggregate
measures (in this case rental counts), and special cells that store summations along dimensions.
Each dimension of the data cube contains a hierarchy of values for one attribute.

Because of their structure, the pre-computed summarized data they contain and the hierarchical
attribute values of their dimensions, data cubes are well suited for fast interactive querying and
analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP).
OLAP operations allow the navigation of data at different levels of abstraction, such as drill-down,
roll-up, slice, dice, etc. Figure 1.4 illustrates the drill-down (on the time dimension) and roll-up (on
the location dimension) operations.
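
The aggregation behind such a cube can be imitated on a tiny scale with pandas; the rental figures below are invented for illustration, and a real OLAP engine would pre-compute these summaries rather than derive them on the fly.

    # Sketch: imitating data-cube aggregation and roll-up with pandas.
    # All rental figures are invented for illustration.
    import pandas as pd

    rentals = pd.DataFrame({
        "category": ["Comedy", "Comedy", "Drama", "Drama"],
        "quarter":  ["Q1", "Q2", "Q1", "Q2"],
        "city":     ["Edmonton", "Calgary", "Edmonton", "Calgary"],
        "count":    [120, 95, 80, 60],
    })

    # Base cuboid: rental counts by (category, quarter, city).
    cube = rentals.groupby(["category", "quarter", "city"])["count"].sum()

    # Roll-up on the location dimension: aggregate the cities away.
    by_category_quarter = rentals.groupby(["category", "quarter"])["count"].sum()
    print(by_category_quarter)

    # Drill-down moves the other way (e.g. quarters to months) and
    # requires the finer-grained data to be available.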

Transaction Databases: A transaction database is a set of records representing transactions, each
with a time stamp, an identifier and a set of items. Associated with the transaction files could also
be descriptive data for the items. For example, in the case of the video store, the rentals table such
as shown in Figure 1.5, represents the transaction database. Each record is a rental contract with a
customer identifier, a date, and the list of items rented (i.e. video tapes, games, VCR, etc.). Since
relational databases do not allow nested tables (i.e. a set as attribute value), transactions are usually
stored in flat files or stored in two normalized transaction tables, one for the transactions and one for
the transaction items. One typical data mining analysis on such data is the so-called market basket
analysis or association rules in which associations between items occurring together or in sequence
are studied.
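A minimal pair-counting sketch of the market basket idea follows; real systems use dedicated algorithms such as Apriori or FP-growth, and the transactions here are invented.

    # Sketch: counting how often pairs of items occur together, the
    # core idea behind market basket analysis. Transactions invented.
    from collections import Counter
    from itertools import combinations

    transactions = [
        {"video:Alien", "game:Tetris"},
        {"video:Alien", "video:Matrix", "game:Tetris"},
        {"video:Matrix"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # Support of a pair = fraction of transactions containing both items.
    for pair, n in pair_counts.most_common(3):
        print(pair, "support:", n / len(transactions))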
Multimedia Databases: Multimedia databases include video, images, audio and text media. They
can be stored on extended object-relational or object-oriented databases, or simply on a file system.
Multimedia is characterized by its high dimensionality, which makes data mining even more
challenging. Data mining from multimedia repositories may require computer vision, computer
graphics, image interpretation, and natural language processing methodologies.

Spatial Databases: Spatial databases are databases that, in addition to usual data, store geographical
information like maps, and global or regional positioning. Such spatial databases present new
challenges to data mining algorithms.

Time-Series Databases: Time-series databases contain time-related data such as stock market data or
logged activities. These databases usually have a continuous flow of new data coming in, which
sometimes calls for challenging real-time analysis. Data mining in such databases
commonly includes the study of trends and correlations between evolutions of different variables, as
well as the prediction of trends and movements of the variables in time. Figure 1.7 shows some
examples of time-series data.
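Two of the analyses mentioned above can be sketched in a few lines of pandas; the price series are invented, and a moving average is only one of many ways to expose a trend.

    # Sketch: a moving average to expose a trend, and the correlation
    # between the evolutions of two variables. Data invented.
    import pandas as pd

    days = pd.date_range("2024-01-01", periods=8, freq="D")
    stock_a = pd.Series([10, 11, 10, 12, 13, 12, 14, 15], index=days)
    stock_b = pd.Series([20, 21, 21, 23, 24, 24, 26, 27], index=days)

    print(stock_a.rolling(window=3).mean())   # smooths short-term noise
    print("correlation:", stock_a.corr(stock_b))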

World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository
available. A very large number of authors and publishers are continuously contributing to its growth
and metamorphosis, and a massive number of users are accessing its resources daily. Data in the
World Wide Web is organized in inter-connected documents. These documents can be text, audio,
video, raw data, and even applications. Conceptually, the World Wide Web comprises three
major components: the content of the Web, which encompasses the documents available; the structure
of the Web, which covers the hyperlinks and the relationships between documents; and the usage of
the Web, describing how and when the resources are accessed. A fourth dimension can be added,
relating to the dynamic nature or evolution of the documents. Data mining in the World Wide Web, or
web mining, tries to address all these issues and is often divided into web content mining, web
structure mining and web usage mining.
