Data Warehouse & Data Mining
Evolution of Database Technology
• Late 1980s-present:
– Advanced data analysis
• Data warehouse and OLAP
• Data mining and knowledge discovery
• Advanced data mining applications
• Data mining and society
• 1990s-present:
– XML-based database systems
– Integration with information retrieval
– Data and information integration
• Present-future:
– New generation of integrated data and
information systems.
What Is Data Mining?
– Data mining: the core of the
knowledge discovery process.
[Figure: KDD process. Recoverable labels: Databases, Data Cleaning, Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]
Steps of a KDD Process
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
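The seven steps above can be sketched as a small pipeline. This is a minimal illustration only: the record layout, the function names, and the choice of pair counting as the mining step are all assumptions made for the example, not part of the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources (illustrative data only).
source_a = [{"id": 1, "items": "milk, bread"}, {"id": 2, "items": "milk, eggs"}]
source_b = [{"id": 3, "items": "bread, milk"}, {"id": 4, "items": None},
            {"id": 5, "items": "milk, bread, eggs"}]

def clean(records):
    # 1. Data cleaning: drop records with a missing item list.
    return [r for r in records if r["items"]]

def integrate(*sources):
    # 2. Data integration: combine several sources into one collection.
    return [r for src in sources for r in src]

def select(records):
    # 3. Data selection: keep only the task-relevant attribute.
    return [r["items"] for r in records]

def transform(item_strings):
    # 4. Data transformation: turn each string into a set of items.
    return [frozenset(s.strip() for s in text.split(",")) for text in item_strings]

def mine(baskets):
    # 5. Data mining: count how often each pair of items occurs together.
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(counts, n_baskets, min_support=0.5):
    # 6. Pattern evaluation: keep pairs whose support meets the threshold.
    return {pair: c / n_baskets for pair, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    # 7. Knowledge presentation: render the patterns as readable lines.
    return [f"{a} & {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(source_a), clean(source_b))))
patterns = evaluate(mine(baskets), len(baskets))
report = present(patterns)
print("\n".join(report))
```

In practice each step is far more involved; the point here is only the order in which the steps feed one another.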
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of
application
• Creating a target data set: data selection
• Data cleaning and preprocessing
• Data reduction and transformation:
– Find useful features, dimensionality/variable
reduction, invariant representation.
Steps of a KDD Process
• Choosing functions of data mining
– summarization, classification, regression, association,
clustering.
• Choosing the mining algorithms
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant
patterns, etc.
• Use of discovered knowledge
Architecture of a Typical Data Mining System
[Figure: system architecture. Recoverable labels: Graphical user interface, Pattern evaluation, Databases, Data Warehouse]
Data Mining and Business Intelligence
[Figure: layers with increasing potential to support business decisions. Recoverable labels: Statistical Analysis, Querying and Reporting; Data Exploration; Making Decisions (End User)]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
Data Mining Functionalities
• Cluster analysis
– Unlike classification, which analyzes class-labeled data objects,
clustering analyzes data objects without consulting a known
class label.
– Clustering is based on the principle of maximizing the
intra-class similarity and minimizing the inter-class
similarity
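The clustering principle above can be illustrated with k-means, one common clustering algorithm (the slides do not name a specific algorithm, so this choice is an assumption). The sketch below uses a naive initialization and made-up 2-D points.

```python
import math

def kmeans(points, k, iterations=10):
    # Naive initialization (an assumption for this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid, which keeps
        # similar points together (high intra-class similarity).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Illustrative data: no class labels are given to the algorithm.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No labels are supplied; the two groups emerge only from distances between the points.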
Data Mining Functionalities
• Outlier analysis
– Outlier: a data object that does not comply with the general
behavior or model of the data
– It can be treated as noise or an exception, but is quite useful in fraud
detection and rare-event analysis
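One simple way to flag such objects is a z-score test: treat the mean and standard deviation as the "general behavior" and report values that lie far from it. This is only a sketch under that assumption; the threshold and the transaction amounts are made up for illustration.

```python
import statistics

def find_outliers(values, threshold=2.0):
    # Model of the data (assumed here): mean and population standard deviation.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    # An outlier is a value whose distance from the mean exceeds the threshold.
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Illustrative transaction amounts; one is far from the rest.
amounts = [20.0, 22.5, 19.0, 21.0, 20.5, 23.0, 18.5, 250.0]
suspicious = find_outliers(amounts)
print(suspicious)
```

In a fraud-detection setting the flagged value would be the interesting result, not noise to discard.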
[Figure: data mining as a meeting point of several disciplines. Recoverable labels: Information Science, Machine Learning, Visualization, Other Disciplines]
Data Mining Systems: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Classification by various criteria:
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adopted
Data Mining: Classification Schemes
• Databases to be mined
– Relational, transactional, object-oriented,
object-relational, active, spatial, time-series, text,
multi-media, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
– Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
– Multiple/integrated functions and mining at multiple
levels
Data Mining: Classification Schemes
• Techniques utilized
– Database-oriented, data warehouse (OLAP),
machine learning, statistics, visualization,
neural network, etc.
• Applications adopted
– Retail, telecommunication, banking, fraud
analysis, DNA mining, stock market
Major Issues in Data Mining
• Data warehousing:
– The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product,
sales.
• Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
• Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process.
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous
data sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are
applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
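The hotel price example can be sketched as a conversion step applied when data is moved to the warehouse. The field names, exchange rates, and tax rates below are hypothetical values chosen for illustration.

```python
# Hypothetical conversion rates (illustrative only, not real market data).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}

# Two sources that encode "hotel price" differently.
source_records = [
    {"hotel": "A", "price": 100.0, "currency": "USD", "tax_included": False,
     "tax_rate": 0.1, "breakfast_included": True},
    {"hotel": "B", "price": 110.0, "currency": "EUR", "tax_included": True,
     "tax_rate": 0.1, "breakfast_included": False},
]

def to_warehouse_format(record):
    # Convert to one convention: price in USD, tax included.
    price = record["price"] * RATES_TO_USD[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + record["tax_rate"]
    return {
        "hotel": record["hotel"],
        "price_usd_incl_tax": round(price, 2),
        "breakfast_included": record["breakfast_included"],
    }

warehouse_rows = [to_warehouse_format(r) for r in source_records]
print(warehouse_rows)
```

After conversion, the two prices are directly comparable because they share one currency and one tax convention.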
Data Warehouse—Time Variant
• The time horizon for the data warehouse is
significantly longer than that of operational systems.
– Operational database: current value data.
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”.
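A minimal sketch of the difference in key structure, assuming a hypothetical price feed: the operational store is keyed by product only, while the warehouse key carries an explicit date.

```python
from datetime import date

# Operational store: one current value per key.
operational = {}
# Warehouse store: the key includes a time element, so history is kept.
warehouse = {}

def record_price(product, price, on):
    operational[product] = price       # overwrites the previous value
    warehouse[(product, on)] = price   # adds a new dated entry

record_price("widget", 9.99, date(2020, 1, 1))
record_price("widget", 12.49, date(2024, 1, 1))

# Historical perspective: every recorded price for the product, in date order.
history = sorted((day, price) for (product, day), price in warehouse.items()
                 if product == "widget")
print(operational)
print(history)
```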
Data Warehouse—Non-Volatile
• A physically separate store of data transformed from
the operational environment.
• Operational update of data does not occur in the
data warehouse environment.
– Does not require transaction processing, recovery, and
concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data.
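The two operations can be sketched as an interface that offers loading and read access but no in-place update. The class name and methods are hypothetical, chosen only for this illustration.

```python
class WarehouseStore:
    """Hypothetical store exposing only loading and read access."""

    def __init__(self):
        self._rows = ()

    def load(self, rows):
        # Loading: append a batch; stored rows are never updated in place.
        self._rows = self._rows + tuple(dict(r) for r in rows)

    def query(self, predicate):
        # Access: read-only; copies are returned so callers cannot
        # alter what is stored.
        return [dict(r) for r in self._rows if predicate(r)]

store = WarehouseStore()
store.load([{"year": 2022, "amount": 10.0}, {"year": 2023, "amount": 20.0}])

result = store.query(lambda r: r["year"] == 2023)
result[0]["amount"] = 0.0  # changes only the caller's copy
recheck = store.query(lambda r: r["year"] == 2023)
print(recheck)
```

Because nothing is updated in place, the store needs no locking or rollback logic, which mirrors the point about transaction processing above.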
Data Warehouse vs. Operational DBMS
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
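The difference in access patterns can be shown with Python's built-in sqlite3 module. The table and its contents are illustrative assumptions; a real warehouse would use a separate store and a star schema rather than a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("north", 2022, 100.0), ("north", 2023, 150.0),
     ("south", 2022, 80.0), ("south", 2023, 120.0)],
)

# OLTP-style access: a short update touching one current record.
conn.execute("UPDATE sales SET amount = 155.0 WHERE id = 2")

# OLAP-style access: a read-only query that consolidates many records.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```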
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
Data Marts
• Data extraction:
– get data from multiple, heterogeneous, and external
sources
• Data cleaning:
– detect errors in the data and rectify them when possible
• Data transformation:
– convert data from legacy or host format to warehouse
format
• Load:
– sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
• Refresh
– propagate the updates from the data sources to the
warehouse
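The five utilities above can be sketched as small functions chained together. The field names (`prod`, `amount`) and the sample rows are hypothetical; a real load step would also build indices and partitions, which this sketch omits.

```python
def extract(sources):
    # Data extraction: gather rows from multiple sources.
    return [row for source in sources for row in source]

def clean(rows):
    # Data cleaning: rectify amounts stored as padded strings when possible,
    # and skip rows whose amount cannot be parsed.
    cleaned = []
    for row in rows:
        try:
            amount = float(str(row["amount"]).strip())
        except ValueError:
            continue
        cleaned.append({**row, "amount": amount})
    return cleaned

def transform(rows):
    # Data transformation: convert the source layout to the warehouse layout.
    return [{"product": r["prod"].lower(), "amount": r["amount"]} for r in rows]

def load(warehouse, rows):
    # Load: summarize rows into per-product totals.
    for r in rows:
        warehouse[r["product"]] = warehouse.get(r["product"], 0.0) + r["amount"]
    return warehouse

def refresh(warehouse, new_sources):
    # Refresh: propagate new source rows into the warehouse.
    return load(warehouse, transform(clean(extract(new_sources))))

legacy = [{"prod": "Pen", "amount": " 2.50 "}, {"prod": "Ink", "amount": "n/a"}]
flat_file = [{"prod": "pen", "amount": 1.5}]

warehouse = refresh({}, [legacy, flat_file])
warehouse = refresh(warehouse, [[{"prod": "INK", "amount": "4"}]])
print(warehouse)
```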
Three Data Warehouse Models
• Enterprise warehouse
– collects all of the information about subjects spanning the entire
organization
• Data mart
– a subset of corporate-wide data that is of value to a specific group
of users. Its scope is confined to specific, selected groups, such as
a marketing data mart
• Independent vs. dependent (sourced directly from the warehouse) data
mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be materialized
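A virtual warehouse can be sketched with sqlite3 as a view over an operational table. SQLite has no materialized views, so materialization is imitated here by copying the view's result into a regular table; table and view names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ann", 10.0), ("bob", 5.0), ("ann", 7.5)],
)

# Virtual warehouse: a summary view defined over the operational table.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
)

# Materializing one summary view (imitated): store its current result.
conn.execute("CREATE TABLE customer_totals_mat AS SELECT * FROM customer_totals")

# A later operational insert changes the view but not the stored copy.
conn.execute("INSERT INTO orders (customer, amount) VALUES ('bob', 20.0)")

live = dict(conn.execute("SELECT customer, total FROM customer_totals").fetchall())
stored = dict(conn.execute("SELECT customer, total FROM customer_totals_mat").fetchall())
print(live, stored)
```

The plain view is recomputed against the operational data on every query, while the stored copy answers faster but goes stale until it is refreshed.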
Data Warehouse Development: A Recommended Approach
[Figure: recommended development approach. Recoverable labels: Multi-Tier Data Warehouse, Distributed Data Marts, Data Mart, Data Mart, Enterprise Data Warehouse]