Data Mining From Scratch
Evolution of Database Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
What Is Data Mining?
Why Data Mining?—Potential Applications
• Other Applications
• Text mining (newsgroups, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis
Market Analysis and Management
• Cross-market analysis
• Associations and correlations between product sales, and prediction based on such associations
• Customer profiling
• What types of customers buy what products
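As a rough illustration of the cross-market analysis idea above, the following sketch computes support, confidence, and lift for one candidate association between two products. The baskets, the product names, and the `rule_metrics` helper are invented for illustration and do not come from the slides.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    if n == 0:
        raise ValueError("need at least one transaction")
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = n_both / n                         # share of baskets with both items
    confidence = n_both / n_a if n_a else 0.0    # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0  # confidence relative to base rate
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical sales baskets (assumed data)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
metrics = rule_metrics(baskets, "bread", "milk")
```

In this toy data the lift is below 1, so buying bread does not make milk more likely than its base rate; a rule like this would probably not be worth acting on.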
Fraud Detection & Mining Unusual Patterns
Data Mining: A KDD Process
[Figure: KDD process diagram; surviving labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
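The stages named in the KDD process figure (databases, data integration, data cleaning, task-relevant data) can be sketched as a small pipeline. This is only a minimal sketch: the record layout, the two source lists, the function names, and the order in which integration and cleaning are applied are all assumptions made for the example.

```python
def integrate(*sources):
    """Data integration: merge records from several sources, keyed by 'id'."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["id"], {}).update(record)
    return list(merged.values())


def clean(records, required):
    """Data cleaning: drop records with a missing or None required field."""
    return [r for r in records if all(r.get(f) is not None for f in required)]


def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


# Hypothetical source databases (assumed data)
crm = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 51}]
sales = [{"id": 1, "spend": 120.0}, {"id": 2, "spend": 80.0}, {"id": 3, "spend": 45.5}]

task_relevant = select(
    clean(integrate(crm, sales), ["age", "spend"]),
    ["age", "spend"],
)
```

The record with a missing age is dropped during cleaning, so only two records reach the mining step.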
Steps of a KDD Process
• Relational database
• Data warehouse
• Transactional database
• Advanced database and information repository
• Spatial and temporal data
• Time-series data
• Stream data
• Multimedia database
• Text databases & WWW
Data Mining Functionalities
Data mining functionalities are generally divided into two major categories: descriptive and predictive.
• Data mining may generate thousands of patterns: not all of them are interesting
• Interestingness measures
• A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
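One of the criteria above, validity on new or test data with some degree of certainty, can be checked mechanically. The sketch below measures the confidence of a rule on the data it was mined from and again on held-out data. The rule, both data sets, the helper names, and the 0.6 threshold are assumptions chosen for the example.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    matching = [t for t in transactions if antecedent <= t]
    if not matching:
        return 0.0
    return sum(1 for t in matching if consequent <= t) / len(matching)


def holds_on_new_data(train, test, antecedent, consequent, min_confidence):
    """True when the rule meets the threshold on both the mined and the held-out data."""
    return (
        confidence(train, antecedent, consequent) >= min_confidence
        and confidence(test, antecedent, consequent) >= min_confidence
    )


# Hypothetical mined data and held-out data (assumed)
train = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
test = [{"a", "b"}, {"a"}, {"a"}, {"b", "c"}]

train_conf = confidence(train, {"a"}, {"b"})
test_conf = confidence(test, {"a"}, {"b"})
valid = holds_on_new_data(train, test, {"a"}, {"b"}, min_confidence=0.6)
```

Here the rule passes the threshold on the mined data but fails it on the held-out data, so under this criterion it would not count as interesting.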
Origins of Data Mining
• Draws ideas from machine learning/AI, statistics, and database systems
[Figure: diagram relating Data Mining to Statistics, Machine Learning, and Database systems]
Data Mining: Confluence of Multiple Disciplines
[Figure: diagram linking Data Mining to Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Major Issues in Data Mining
• Mining methodology
• Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
• Integration of discovered knowledge with existing knowledge: knowledge fusion
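To make the incremental mining item above concrete, here is a minimal sketch of a counter that keeps item frequencies up to date as new batches of transactions arrive, without rescanning earlier data. The class name, its methods, and the sample batches are hypothetical, and real incremental miners track much more than single-item counts.

```python
from collections import Counter


class IncrementalItemCounter:
    """Maintains per-item transaction counts across batches."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, batch):
        """Fold a new batch of transactions into the running counts."""
        for transaction in batch:
            # set() so an item repeated in one transaction is counted once
            self.counts.update(set(transaction))
            self.n_transactions += 1

    def frequent(self, min_support):
        """Items whose support over all batches seen so far meets min_support."""
        if self.n_transactions == 0:
            return {}
        return {
            item: count / self.n_transactions
            for item, count in self.counts.items()
            if count / self.n_transactions >= min_support
        }


counter = IncrementalItemCounter()
counter.update([["x", "y"], ["x"]])          # first batch (assumed data)
counter.update([["y", "z"], ["x", "y"]])     # later batch, no rescan of the first
result = counter.frequent(0.5)
```

Only the running counts are stored, so the cost of each update depends on the size of the new batch rather than on the full history.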