01 Introduction To Data Mining

This document provides an introduction to data mining and its evolution from database management systems. It discusses how data mining emerged from the need to analyze vast amounts of data generated in today's digital world. The document outlines the history of database systems, from early file processing to modern relational databases and data warehousing. It explains how data mining represents the natural next step in using advanced analysis techniques to discover useful knowledge from large data collections. The goal of data mining is to provide tools that can automatically uncover valuable patterns and insights from tremendous amounts of data.

Uploaded by

Raj Endran

1 Introduction

This book is an introduction to the young and fast-growing field of data mining (also known
as knowledge discovery from data, or KDD for short). The book focuses on fundamental
data mining concepts and techniques for discovering interesting patterns from data in
various applications. In particular, we emphasize prominent techniques for developing
effective, efficient, and scalable data mining tools.
This chapter is organized as follows. In Section 1.1, you will learn why data mining is
in high demand and how it is part of the natural evolution of information technology.
Section 1.2 defines data mining with respect to the knowledge discovery process. Next,
you will learn about data mining from many aspects, such as the kinds of data that can
be mined (Section 1.3), the kinds of knowledge to be mined (Section 1.4), the kinds of
technologies to be used (Section 1.5), and targeted applications (Section 1.6). In this
way, you will gain a multidimensional view of data mining. Finally, Section 1.7 outlines
major data mining research and development issues.

1.1 Why Data Mining?

"Necessity, who is the mother of invention." - Plato
We live in a world where vast amounts of data are collected daily. Analyzing such data
is an important need. Section 1.1.1 looks at how data mining can meet this need by
providing tools to discover knowledge from data. In Section 1.1.2, we observe how data
mining can be viewed as a result of the natural evolution of information technology.

1.1.1 Moving toward the Information Age

"We are living in the information age" is a popular saying; however, we are actually living
in the data age. Terabytes or petabytes¹ of data pour into our computer networks, the
World Wide Web (WWW), and various data storage devices every day from business,
society, science and engineering, medicine, and almost every other aspect of daily life.

¹ A petabyte is a unit of information or computer storage equal to 1 quadrillion bytes, or a thousand terabytes, or 1 million gigabytes.

Data Mining: Concepts and Techniques
© 2012 Elsevier Inc. All rights reserved.

This explosive growth of available data volume is a result of the computerization of
our society and the fast development of powerful data collection and storage tools.
Businesses worldwide generate gigantic data sets, including sales transactions, stock
trading records, product descriptions, sales promotions, company profiles and performance, and customer feedback. For example, large stores, such as Wal-Mart, handle
hundreds of millions of transactions per week at thousands of branches around the
world. Scientific and engineering practices generate high orders of petabytes of data in
a continuous manner, from remote sensing, process measuring, scientific experiments,
system performance, engineering observations, and environment surveillance.
Global backbone telecommunication networks carry tens of petabytes of data traffic
every day. The medical and health industry generates tremendous amounts of data from
medical records, patient monitoring, and medical imaging. Billions of Web searches
supported by search engines process tens of petabytes of data daily. Communities and
social media have become increasingly important data sources, producing digital pictures and videos, blogs, Web communities, and various kinds of social networks. The
list of sources that generate huge amounts of data is endless.
This explosively growing, widely available, and gigantic body of data makes our
time truly the data age. Powerful and versatile tools are badly needed to automatically
uncover valuable information from the tremendous amounts of data and to transform
such data into organized knowledge. This necessity has led to the birth of data mining.
The field is young, dynamic, and promising. Data mining has made and will continue to make
great strides in our journey from the data age toward the coming information age.
Example 1.1 Data mining turns a large collection of data into knowledge. A search engine (e.g.,
Google) receives hundreds of millions of queries every day. Each query can be viewed
as a transaction where the user describes her or his information need. What novel and
useful knowledge can a search engine learn from such a huge collection of queries collected from users over time? Interestingly, some patterns found in user search queries
can disclose invaluable knowledge that cannot be obtained by reading individual data
items alone. For example, Google's Flu Trends uses specific search terms as indicators of
flu activity. It found a close relationship between the number of people who search for
flu-related information and the number of people who actually have flu symptoms. A
pattern emerges when all of the search queries related to flu are aggregated. Using aggregated Google search data, Flu Trends can estimate flu activity up to two weeks faster
than traditional systems can.² This example shows how data mining can turn a large
collection of data into knowledge that can help meet a current global challenge.
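
The aggregation idea in Example 1.1 can be sketched in a few lines of Python. This is only an illustration, not the method used by Flu Trends: the weekly counts are made-up toy numbers, and the names weekly_flu_queries, weekly_flu_cases, and pearson are hypothetical. The sketch shows how a relationship that no single query reveals can appear once queries are aggregated per week.

```python
from math import sqrt

# Toy, made-up weekly aggregates (NOT real data): number of flu-related
# searches and number of reported flu cases for the same six weeks.
weekly_flu_queries = [120, 150, 310, 480, 390, 200]
weekly_flu_cases = [10, 14, 29, 47, 40, 18]


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(weekly_flu_queries, weekly_flu_cases)
print(f"correlation between query volume and case counts: {r:.3f}")
```

A coefficient close to 1 on such aggregates would suggest that query volume tracks case counts; a real study would also need far more data and a check against independent surveillance figures.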

1.1.2 Data Mining as the Evolution of Information Technology

Data mining can be viewed as a result of the natural evolution of information technology. The database and data management industry evolved in the development of
several critical functionalities (Figure 1.1): data collection and database creation, data
management (including data storage and retrieval and database transaction processing),
and advanced data analysis (involving data warehousing and data mining). The early
development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval,
as well as query and transaction processing. Nowadays numerous database systems
offer query and transaction processing as common practice. Advanced data analysis has
naturally become the next step.

² This is reported in [GMP+09].

Data Collection and Database Creation (1960s and earlier)
  Primitive file processing

Database Management Systems (1970s to early 1980s)
  Hierarchical and network database systems
  Relational database systems
  Data modeling: entity-relationship models, etc.
  Indexing and accessing methods
  Query languages: SQL, etc.
  User interfaces, forms, and reports
  Query processing and optimization
  Transactions, concurrency control, and recovery
  Online transaction processing (OLTP)

Advanced Database Systems (mid-1980s to present)
  Advanced data models: extended-relational, object-relational, deductive, etc.
  Managing complex data: spatial, temporal, multimedia, sequence and structured, scientific, engineering, moving objects, etc.
  Data streams and cyber-physical data systems
  Web-based databases (XML, semantic web)
  Managing uncertain data and data cleaning
  Integration of heterogeneous sources
  Text database systems and integration with information retrieval
  Extremely large data management
  Database system tuning and adaptive systems
  Advanced queries: ranking, skyline, etc.
  Cloud computing and parallel data processing
  Issues of data privacy and security

Advanced Data Analysis (late-1980s to present)
  Data warehouse and OLAP
  Data mining and knowledge discovery: classification, clustering, outlier analysis, association and correlation, comparative summary, discrimination analysis, pattern discovery, trend and deviation analysis, etc.
  Mining complex types of data: streams, sequence, text, spatial, temporal, multimedia, Web, networks, etc.
  Data mining applications: business, society, retail, banking, telecommunications, science and engineering, blogs, daily life, etc.
  Data mining and society: invisible data mining, privacy-preserving data mining, mining social and information networks, recommender systems, etc.

Future Generation of Information Systems (Present to future)

Figure 1.1 The evolution of database system technology.

Since the 1960s, database and information technology has evolved systematically
from primitive file processing systems to sophisticated and powerful database systems.
The research and development in database systems since the 1970s progressed from
early hierarchical and network database systems to relational database systems (where
data are stored in relational table structures; see Section 1.3.1), data modeling tools,
and indexing and accessing methods. In addition, users gained convenient and flexible
data access through query languages, user interfaces, query optimization, and transaction management. Efficient methods for online transaction processing (OLTP), where a
query is viewed as a read-only transaction, contributed substantially to the evolution and
wide acceptance of relational technology as a major tool for efficient storage, retrieval,
and management of large amounts of data.
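
The ideas in this paragraph (relational tables, an index, an SQL query, and transactions) can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table sales, its columns, and the sample rows are hypothetical and chosen only for illustration; a production OLTP system would of course be far more elaborate.

```python
import sqlite3

# In-memory relational database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, branch TEXT, item TEXT, amount REAL)"
)
# An index is one example of an indexing and accessing method.
conn.execute("CREATE INDEX idx_sales_branch ON sales (branch)")

# A write transaction: either both rows are stored or neither is.
with conn:
    conn.execute("INSERT INTO sales (branch, item, amount) VALUES (?, ?, ?)",
                 ("north", "laptop", 900.0))
    conn.execute("INSERT INTO sales (branch, item, amount) VALUES (?, ?, ?)",
                 ("south", "phone", 400.0))

# A failed transaction is rolled back, leaving the table unchanged.
try:
    with conn:
        conn.execute("INSERT INTO sales (branch, item, amount) VALUES (?, ?, ?)",
                     ("north", "tablet", 300.0))
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# A query expressed in SQL; it only reads data (a read-only transaction).
rows = conn.execute(
    "SELECT branch, item, amount FROM sales WHERE branch = ? ORDER BY id",
    ("north",),
).fetchall()
print(rows)
conn.close()
```

Because the second transaction is rolled back, the query sees only the rows committed by the first one.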
After the establishment of database management systems, database technology
moved toward the development of advanced database systems, data warehousing, and
data mining for advanced data analysis and web-based databases. Advanced database
systems, for example, resulted from an upsurge of research from the mid-1980s onward.
These systems incorporate new and powerful data models such as extended-relational,
object-oriented, object-relational, and deductive models. Application-oriented database
systems have flourished, including spatial, temporal, multimedia, active, stream and
sensor, scientific and engineering databases, knowledge bases, and office information
bases. Issues related to the distribution, diversification, and sharing of data have been
studied extensively.
Advanced data analysis sprang up from the late 1980s onward. The steady and
dazzling progress of computer hardware technology in the past three decades led to
large supplies of powerful and affordable computers, data collection equipment, and
storage media. This technology provides a great boost to the database and information
industry, and it enables a huge number of databases and information repositories to be
available for transaction management, information retrieval, and data analysis. Data
can now be stored in many different kinds of databases and information repositories.
One emerging data repository architecture is the data warehouse (Section 1.3.2).
This is a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making. Data warehouse
technology includes data cleaning, data integration, and online analytical processing
(OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different
angles. Although OLAP tools support multidimensional analysis and decision making,
additional data analysis tools are required for in-depth analysis, for example, data mining tools that provide data classification, clustering, outlier/anomaly detection, and the
characterization of changes in data over time.
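
To make the OLAP notions of summarization, aggregation, and viewing information from different angles concrete, here is a small sketch in plain Python. The records, the dimension names, and the function roll_up are assumptions made for illustration; real OLAP tools operate on data cubes in a warehouse rather than on an in-memory list.

```python
from collections import defaultdict

# Toy fact records (made up for illustration): (region, quarter, item, sales amount).
facts = [
    ("east", "Q1", "phone", 100),
    ("east", "Q2", "phone", 150),
    ("west", "Q1", "phone", 80),
    ("west", "Q1", "laptop", 200),
    ("east", "Q2", "laptop", 300),
]
DIMENSIONS = ("region", "quarter", "item")


def roll_up(records, by):
    """Sum the sales amount grouped by the chosen dimensions."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[i] for i in idx)
        totals[key] += rec[3]
    return dict(totals)


# The same data viewed from different angles.
by_region = roll_up(facts, by=("region",))
by_quarter_item = roll_up(facts, by=("quarter", "item"))
grand_total = roll_up(facts, by=())
print(by_region)
print(by_quarter_item)
print(grand_total)
```

Each call aggregates the same facts along a different combination of dimensions, which is the essence of a multidimensional view; it does not, by itself, discover patterns such as clusters or outliers.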
Huge volumes of data have been accumulated beyond databases and data warehouses. During the 1990s, the World Wide Web and web-based databases (e.g., XML
databases) began to appear. Internet-based global information bases, such as the WWW
and various kinds of interconnected, heterogeneous databases, have emerged and play
a vital role in the information industry. The effective and efficient analysis of data in
such different forms, by integrating information retrieval, data mining, and
information network analysis technologies, is a challenging task.

Figure 1.2 The world is data rich but information poor. (Text in the figure: "How can I analyze these data?")

In summary, the abundance of data, coupled with the need for powerful data analysis
tools, has been described as a "data rich but information poor" situation (Figure 1.2). The
fast-growing, tremendous amount of data, collected and stored in large and numerous
data repositories, has far exceeded our human ability for comprehension without powerful tools. As a result, data collected in large data repositories become "data tombs," that is, data
archives that are seldom visited. Consequently, important decisions are often made
based not on the information-rich data stored in data repositories but rather on a decision maker's intuition, simply because the decision maker does not have the tools to
extract the valuable knowledge embedded in the vast amounts of data. Efforts have
been made to develop expert system and knowledge-based technologies, which typically
rely on users or domain experts to manually input knowledge into knowledge bases.
Unfortunately, however, the manual knowledge input procedure is prone to biases and
errors and is extremely costly and time consuming. The widening gap between data and
information calls for the systematic development of data mining tools that can turn data
tombs into golden nuggets of knowledge.

1.2 What Is Data Mining?

It is no surprise that data mining, as a truly interdisciplinary subject, can be defined
in many different ways. Even the term "data mining" does not really present all the major
components in the picture. To refer to the mining of gold from rocks or sand, we say "gold
mining" instead of rock or sand mining. Analogously, data mining should have been more
appropriately named "knowledge mining from data," which is unfortunately somewhat
long. However, the shorter term, "knowledge mining," may not reflect the emphasis on
mining from large amounts of data. Nevertheless, mining is a vivid term characterizing
the process that finds a small set of precious nuggets from a great deal of raw material
(Figure 1.3). Thus, such a misnomer carrying both "data" and "mining" became a popular choice. In addition, many other terms have a similar meaning to data mining, for
example, knowledge mining from data, knowledge extraction, data/pattern analysis, data
archaeology, and data dredging.

Figure 1.3 Data mining: searching for knowledge (interesting patterns) in data.

Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely an
essential step in the process of knowledge discovery. The knowledge discovery process is
shown in Figure 1.4 as an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)³

³ A popular trend in the information industry is to perform data cleaning and data integration as a
preprocessing step, where the resulting data are stored in a data warehouse.
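
The two steps listed above, data cleaning and data integration, can be sketched as a small preprocessing pipeline. Everything here is an assumption made for illustration: the two sources crm_rows and billing_rows, their fields, the age range used as a validity check, and the policies of dropping invalid rows and defaulting missing totals to 0.0. Real cleaning rules depend on the data and the application.

```python
# Two hypothetical sources describing the same customers (toy data, not from the text).
crm_rows = [
    {"customer_id": 1, "name": " Alice ", "age": 34},
    {"customer_id": 2, "name": "Bob", "age": 41},
    {"customer_id": 2, "name": "Bob", "age": 41},      # exact duplicate
    {"customer_id": 3, "name": "Carol", "age": None},  # missing value
    {"customer_id": 4, "name": "Dave", "age": -5},     # inconsistent value
]
billing_rows = [
    {"customer_id": 1, "total_spent": 120.5},
    {"customer_id": 3, "total_spent": 80.0},
]


def clean(rows):
    """Data cleaning: trim names, drop duplicates and rows with invalid ages."""
    seen, cleaned = set(), []
    for row in rows:
        age = row["age"]
        if age is None or not 0 <= age <= 120:
            continue  # assumed policy: discard noisy or inconsistent records
        key = (row["customer_id"], row["name"].strip(), age)
        if key in seen:
            continue  # skip records already kept
        seen.add(key)
        cleaned.append({"customer_id": key[0], "name": key[1], "age": age})
    return cleaned


def integrate(customers, billing):
    """Data integration: join the two sources on customer_id."""
    spent = {b["customer_id"]: b["total_spent"] for b in billing}
    # Assumed policy: customers without billing records get a total of 0.0.
    return [{**c, "total_spent": spent.get(c["customer_id"], 0.0)} for c in customers]


warehouse_rows = integrate(clean(crm_rows), billing_rows)
print(warehouse_rows)
```

The resulting rows stand in for what would be stored in a data warehouse after preprocessing.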
