01 Intro
Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
Major Issues in Data Mining
http://www.google.org/flutrends/
Influenza-like Illness (ILI) percentages estimated by model (black) and provided by the CDC (red) in the mid-Atlantic region, showing data available at four points in the 2007-2008 influenza season.
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
1990s:
2000s
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Not everything is data mining, e.g., simple search and query processing, or (deductive) expert systems
This is a view from typical database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: knowledge discovery process: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
[Figure: layered view, top to bottom; roles shown alongside: End User, DBA]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
[Figure: processing flow beginning with Input Data → Data Preprocessing]
Health care and medical data mining has often adopted such a view in statistics and machine learning:
Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation
Data to be mined: Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
Knowledge to be mined (or: Data mining functions): Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized: Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted: Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Relational database, data warehouse, transactional database
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data cleaning, transformation, integration, and multidimensional data model
Scalable methods for computing (i.e., materializing) multidimensional aggregates
OLAP (online analytical processing)
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
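To make "materializing multidimensional aggregates" concrete, here is a minimal Python sketch that sums one measure over every combination of rolled-up dimensions. The fact table, its columns, and the name `materialize_cube` are assumptions made for illustration; this brute-force enumeration is not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, season, rainfall). All values are invented.
facts = [
    ("north", "summer", 120),
    ("north", "winter", 40),
    ("south", "summer", 10),
    ("south", "winter", 30),
]

def materialize_cube(rows):
    """Sum the measure for every subset of dimensions (every cuboid)."""
    cube = defaultdict(int)
    for *dims, measure in rows:
        positions = range(len(dims))
        for k in range(len(dims) + 1):
            for kept in combinations(positions, k):
                # "*" marks a dimension that has been rolled up (aggregated away).
                key = tuple(dims[i] if i in kept else "*" for i in positions)
                cube[key] += measure
    return dict(cube)

cube = materialize_cube(facts)
print(cube[("north", "*")])  # total for region "north" across all seasons
print(cube[("*", "*")])      # grand total
```

Each key with a `"*"` corresponds to a roll-up an OLAP query could answer without rescanning the base data.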
What items are frequently purchased together in your Walmart?
A typical association rule has the form X → Y [support, confidence]
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other applications?
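The two standard measures attached to an association rule, support and confidence, can be sketched in a few lines of Python. The transactions and item names below are invented for illustration, and the function names are hypothetical; this is a direct counting sketch, not an efficient mining algorithm.

```python
# Hypothetical market-basket data; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base

print(support({"diapers", "beer"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

Counting every candidate itemset this way does not scale, which is exactly why the slide asks how to mine such rules efficiently in large datasets.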
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict some unknown class labels
Typical methods: Decision trees, naive Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, etc.
Typical applications: Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, etc.
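As a small illustration of constructing a model from training examples, the following Python sketch fits a one-level decision tree (a "decision stump") that classifies cars by gas mileage. The mileage figures, the labels, and the name `fit_stump` are assumptions for illustration only; real decision-tree learners choose splits with measures such as information gain rather than raw training error.

```python
def fit_stump(values, labels):
    """Find the threshold on one numeric feature with the fewest training errors."""
    distinct = sorted(set(values))
    classes = set(labels)
    if len(distinct) < 2 or len(classes) < 2:
        raise ValueError("need at least two distinct values and two classes")
    best = None
    # Candidate thresholds are midpoints between neighbouring feature values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        for below in classes:
            for above in classes - {below}:
                errors = sum(
                    (below if v <= threshold else above) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above

def predict(model, value):
    threshold, below, above = model
    return below if value <= threshold else above

# Hypothetical training examples: miles per gallon and a mileage class.
mpg = [15, 18, 22, 35, 40, 48]
label = ["low", "low", "low", "high", "high", "high"]

model = fit_stump(mpg, label)
print(predict(model, 30))  # class predicted for an unseen car
```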
Unsupervised learning (i.e., class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: Maximizing intra-class similarity & minimizing inter-class similarity
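One common way to group unlabeled points is k-means, sketched below in plain Python. The house coordinates and starting centroids are invented, and k-means is only one of many clustering methods; the slide does not single it out.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    # clusters reflects the assignment made in the final iteration.
    return centroids, clusters

# Hypothetical house locations (x, y) and hand-picked starting centroids.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(houses, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Each assignment step puts a point with its most similar centroid, which is how the sketch pursues high intra-class similarity and low inter-class similarity.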
Outlier analysis
Outlier: A data object that does not comply with the general behavior of the data
Noise or exception? One person's garbage could be another person's treasure
Methods: by-product of clustering or regression analysis, etc.
Useful in fraud detection, rare events analysis
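Outlier detection as a by-product of regression can be sketched as follows: fit a least-squares line, then flag points whose residual is unusually large. The data, the cutoff of two standard deviations, and the name `residual_outliers` are assumptions for illustration, not a recommended setting.

```python
from statistics import mean, pstdev

def residual_outliers(xs, ys, k=2.0):
    """Return indices whose residual from a least-squares line exceeds k std devs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if spread > 0 and abs(r) > k * spread]

# Hypothetical data lying near y = 2x + 1, with one deviating value at x = 4.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 7, 30, 11, 13]

flagged = residual_outliers(xs, ys)
print(flagged)
```

Whether a flagged point is noise or a valuable exception still has to be decided by the application, as the slide notes.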
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression and value prediction
Sequential pattern mining: e.g., first buy digital camera, then buy large SD memory cards
Periodicity analysis
Motifs and biological sequence analysis: approximate and consecutive motifs
Similarity-based analysis
Mining data streams: ordered, time-varying, potentially infinite data streams
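For sequential patterns such as "first buy a digital camera, then buy SD memory cards", a minimal Python sketch can check whether a pattern occurs in order within each customer's purchase history and report its support. The purchase histories and function names are invented for illustration; practical sequential pattern miners avoid testing candidates one by one.

```python
def occurs_in_order(sequence, pattern):
    """True if the pattern's items appear in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # "item in remaining" advances the iterator, so later items must come later.
    return all(item in remaining for item in pattern)

def sequence_support(pattern, sequences):
    """Fraction of sequences that contain the pattern in order."""
    return sum(occurs_in_order(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list per customer, oldest purchase first.
histories = [
    ["camera", "sd_card"],
    ["sd_card", "camera"],
    ["camera", "tripod", "sd_card"],
    ["laptop"],
]

print(sequence_support(["camera", "sd_card"], histories))
```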
Evaluation of Knowledge
One can mine a tremendous number of patterns and pieces of knowledge
Some may fit only a certain dimension space (time, location, etc.)
Some may not be representative, may be transient, etc.
[Figure: Data Mining at the center, surrounded by contributing areas: Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Algorithms must be highly scalable to handle potentially terabytes of data
High-dimensionality of data: Micro-array may have tens of thousands of dimensions
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge Presentation and visualization of data mining results