Unit - Introduction - : Data Mining: Concepts and Techniques
UNIT 1
Introduction
Chapter 1. Introduction
Evolution of Sciences
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
The Internet and computing Grid that makes all these archives universally
accessible
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online
Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
1960s:
1970s:
1980s:
1990s:
2000s:
Alternative names
Data mining: core of the knowledge discovery process
[Figure: knowledge discovery process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
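The stages named in the knowledge discovery figure can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, and the frequency-count "mining" step are assumptions made for the example, not part of the original slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "milk"}, {"customer": "c4", "item": "bread"}]


def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: merge several sources into one collection."""
    return [r for source in sources for r in source]


def select(records, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]


def mine(values):
    """Data mining (toy stand-in): count how often each value occurs."""
    return Counter(values)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a frequency threshold."""
    return {value: n for value, n in patterns.items() if n >= min_count}


warehouse = integrate(clean(store_a), clean(store_b))
task_data = select(warehouse, "item")
patterns = evaluate(mine(task_data), min_count=2)
print(patterns)
```

Each function corresponds to one box in the figure, so a real mining algorithm could replace `mine` without touching the other stages.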
[Figure: layered view from decision making down to data sources, with the typical role at each level]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
[Figure: data mining at the center of several contributing disciplines: Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, Other Disciplines]
High-dimensionality of data
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Classification
Techniques utilized
Applications adapted
Predictive Tasks
Descriptive Tasks
(human-interpretable)
[Figure: data mining tasks and techniques. Labels recovered: Predictive; Clustering; Classification; Association; Sequential Analysis; Decision Tree; Rule Induction; Neural Networks; Nearest Neighbor Classification; Regression]
Supervised learning
Pattern recognition
Prediction
Unsupervised learning
Segmentation
Partitioning
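As a small illustration of supervised learning, nearest neighbor classification (one of the techniques listed above) can be sketched in a few lines. The training points, labels, and function name below are made up for the example; the sketch uses Euclidean distance and a single neighbor, which is only one of several possible choices.

```python
import math


def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    _, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label


# Hypothetical labeled examples: (feature vector, class label).
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

print(nearest_neighbor(train, (2.0, 1.0)))
print(nearest_neighbor(train, (7.0, 9.0)))
```

The labels in `train` are what make this supervised: a clustering method would receive only the points and would have to discover the two groups itself.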
Characterization
Generalization
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
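Association rules from affinity analysis are usually judged by two measures, support and confidence. The sketch below computes both for a rule over a handful of invented market-basket transactions; the data and function names are assumptions for illustration only.

```python
# Hypothetical market-basket transactions, one set of items per purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): support(A and C) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"diapers -> beer: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here two of the four transactions contain both items, and two of the three transactions with diapers also contain beer, so the rule has support 0.50 and confidence of about 0.67.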
Object-relational databases
Multimedia database
Text databases
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
User interaction
Other Applications
[Figure: data mining system architecture. Labels recovered: Knowledge-Base; Database or Data Warehouse Server; data cleaning, integration, and selection; Database; Data Warehouse; World-Wide Web; Other Info Repositories]
Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990. He defined
it as follows: "A warehouse is a subject-oriented, integrated,
time-variant and non-volatile collection of data in support of
management's decision making process". He defined the terms in
the sentence as follows:
Subject Oriented:
Data that gives information about a particular subject instead of
about a company's ongoing operations.
Integrated:
Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
Time-variant:
All data in the data warehouse is identified with a particular time
period.
Non-volatile:
Data is stable in a data warehouse. More data is added but data is
never removed. This enables management to gain a consistent
picture of the business.
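The four properties in Inmon's definition can be made concrete with a toy in-memory store. This is a minimal sketch under stated assumptions: the class name, the field names, the source names, and the dates are all hypothetical, and a real warehouse would sit on a database rather than a Python list.

```python
from datetime import date


class TinyWarehouse:
    """Toy store of sales facts illustrating the four properties."""

    def __init__(self):
        self._rows = []  # non-volatile: rows are only ever appended

    def load(self, source_name, records, period):
        """Integrated: records from any source are mapped to one schema.
        Time-variant: every row is tagged with the period it describes."""
        for r in records:
            self._rows.append({
                "period": period,
                "source": source_name,
                "customer": r["customer"],
                "amount": r["amount"],
            })

    def total_for(self, period):
        """Subject-oriented query: total sales for one period."""
        return sum(r["amount"] for r in self._rows if r["period"] == period)

    def __len__(self):
        return len(self._rows)


wh = TinyWarehouse()
wh.load("pos", [{"customer": "c1", "amount": 20.0}], period=date(2024, 1, 1))
wh.load("web", [{"customer": "c2", "amount": 5.0}], period=date(2024, 1, 1))
wh.load("pos", [{"customer": "c1", "amount": 7.5}], period=date(2024, 2, 1))
print(wh.total_for(date(2024, 1, 1)), len(wh))
```

The class deliberately offers no update or delete method, so loading the February batch leaves the January total unchanged.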
A process of transforming data into information and making it available to
users in a timely enough manner to make a difference
[Forrester Research, April 1996]
Data Warehouse Architecture
[Figure: data warehouse architecture. Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data. Processing: Extraction, Cleansing, Optimized Loader. Storage: Data Warehouse Engine, Metadata Repository. Front end: Analyze, Query]
Data Warehousing provides the Enterprise with a memory
Data is integrated
Data is subject-oriented
Mainly read-only with periodic batch updates
Current, Old, Lightly Summarized, Highly Summarized
Environment is characterized by read-only transactions to very large data sets
System that traces data sources, transformations, and storage
Source, transformation, integration, storage, relationships, history, etc
Webhouse
Data Webhouse
Required Capabilities
Attach multimedia to DW
DW security
Architecture: Web to Warehouse
Timeliness: real-time
Data volume: no upper limit
Response time: less than 10 seconds
Traditional
Power users
Analysts
Report viewers
Web
Our customers
Our business partners
Our employees
Clickstreams
Clickstream as defined by Internet
Advertising Bureau (IAB) :
The electronic path a user takes while
navigating from site to site, and from page to
page within a site. It is a comprehensive body
of data describing the sequence of activity
between a user's browser and any other
Internet resource, such as a Web site or third
party ad server
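To make the definition concrete, the sketch below rebuilds each user's clickstream (the ordered path of pages visited, within and across sites) from a flat request log. The log format, user identifiers, and URLs are hypothetical; real web server logs carry more fields and usually need session splitting as well.

```python
from collections import defaultdict

# Hypothetical log entries: (timestamp_seconds, user_id, url).
log = [
    (15, "u1", "site-a.example/products"),
    (10, "u1", "site-a.example/home"),
    (12, "u2", "site-a.example/home"),
    (40, "u1", "site-b.example/landing"),
    (41, "u2", "site-a.example/help"),
]


def clickstreams(entries):
    """Group log entries by user and order each user's URLs by time."""
    paths = defaultdict(list)
    for _, user, url in sorted(entries):  # tuples sort by timestamp first
        paths[user].append(url)
    return dict(paths)


for user, path in clickstreams(log).items():
    print(user, " -> ".join(path))
```

The path for `u1` crosses from one site to another, matching the "from site to site, and from page to page within a site" wording of the definition.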
Clickstreams
Primary function of DW: to publish
Need distributed DW
Web provides universal connectivity
24 x 7 expected
International characters, dates, addresses
Expanded multimedia
Mass customization
Fully distributed
Partition files
Increase RAM
Website design
Help choices
Report library
Business metadata interface (aids understanding)
Streamline Process
Build Trust
Two-factor security
Provide Communication Hooks