Data Mining

Data mining involves the analysis of large datasets to discover patterns and relationships within the data. It aims to extract useful information and summarize the data in novel ways that are understandable and useful. Key aspects of data mining include analyzing large amounts of data from various sources like government, corporations, and science to discover patterns and apply the findings. Common data mining tasks are classification, regression, clustering, dependency analysis, and summarization. Popular data mining methods are decision trees, association rules, sequential patterns, and clustering.
What is Data Mining?

• Data Mining is:

(1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets

(2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

Overview of terms

• Data: a set of facts (items) D, usually stored in a database
• Pattern: an expression E in a language L that describes a subset of facts
• Attribute: a field in an item i in D
• Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M

Overview of terms

• The Data Mining Task:

For a given dataset D, pattern language L, interestingness function I_{D,L} and threshold c, efficiently find the expressions E in L such that I_{D,L}(E) > c.

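The task definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset, the choice of L as conjunctions of attribute=value tests, and the use of the matching fraction as I_{D,L} are all hypothetical choices, not part of the original slides.

```python
from itertools import combinations

# Hypothetical toy dataset D: each item maps attribute -> value.
D = [
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "young", "income": "low", "buys": "no"},
    {"age": "old", "income": "high", "buys": "yes"},
    {"age": "old", "income": "high", "buys": "yes"},
]

def patterns(dataset, max_len=2):
    """Enumerate an assumed language L: conjunctions of attribute=value tests."""
    tests = sorted({(a, v) for item in dataset for a, v in item.items()})
    for k in range(1, max_len + 1):
        for combo in combinations(tests, k):
            # Skip conjunctions that test the same attribute twice.
            if len({a for a, _ in combo}) == k:
                yield combo

def interestingness(dataset, pattern):
    """Assumed I_{D,L}: fraction of items in D that the pattern describes."""
    matches = sum(all(item.get(a) == v for a, v in pattern) for item in dataset)
    return matches / len(dataset)

def mine(dataset, c, max_len=2):
    """Return every pattern E (with its score) such that I(E) > c."""
    scored = ((E, interestingness(dataset, E)) for E in patterns(dataset, max_len))
    return [(E, score) for E, score in scored if score > c]

results = mine(D, c=0.7)
for E, score in results:
    print(E, score)
```

The exhaustive enumeration here ignores the "efficiently" requirement; practical algorithms prune the pattern space rather than scoring every expression.
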
Knowledge Discovery
Examples of Large Datasets

• Government: IRS, NGA, …
• Large corporations
  - WALMART: 20M transactions per day
  - MOBIL: 100 TB geological databases
  - AT&T: 300M calls per day
  - Credit card companies
• Scientific
  - NASA, EOS project: 50 GB per hour
  - Environmental datasets

Examples of Data Mining Applications

1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data Warehousing: Walmart
4. Astronomy
5. Molecular biology

How Data Mining is used

1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results

The Data Mining Process

1. Understand the domain
2. Create a dataset:
  - Select the interesting attributes
  - Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to step 2

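Step 2 of the process can be sketched as follows. This is a minimal illustration, assuming hypothetical records and field names and a deliberately simple cleaning policy (drop any record with a missing or unparseable value); real projects often impute or repair values instead.

```python
# Hypothetical raw records; the field names are illustrative only.
raw = [
    {"id": 1, "age": "34", "income": "52000", "notes": "call back"},
    {"id": 2, "age": "", "income": "61000", "notes": ""},
    {"id": 3, "age": "45", "income": "n/a", "notes": "vip"},
    {"id": 4, "age": "29", "income": "48000", "notes": ""},
]

def select_attributes(records, attributes):
    """Keep only the attributes judged interesting for the task."""
    return [{a: r.get(a) for a in attributes} for r in records]

def to_number(text):
    """Parse a numeric field; return None when it is missing or malformed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def clean(records):
    """Convert fields to numbers and drop records with any unusable value."""
    cleaned = []
    for r in records:
        row = {a: to_number(v) for a, v in r.items()}
        if all(v is not None for v in row.values()):
            cleaned.append(row)
    return cleaned

dataset = clean(select_attributes(raw, ["age", "income"]))
print(dataset)
```
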
Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Must address:
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data

[Figure: diagram labeled AI / Statistics, Machine Learning, Database systems, and Data Mining]

Data Mining Tasks

1. Classification: learning a function that maps an item into one of a set of predefined classes
2. Regression: learning a function that maps an item to a real value
3. Clustering: identifying a set of groups of similar items

Data Mining Tasks

4. Dependencies and associations: identifying significant dependencies between data attributes
5. Summarization: finding a compact description of the dataset or a subset of the dataset

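Three of the tasks listed above can be illustrated with very small learners. The sketch below is an assumption-laden example, not a method taken from the slides: it uses a nearest-class-mean rule for classification, a least-squares line for regression, and basic statistics for summarization, all on hypothetical one-dimensional data.

```python
from statistics import mean

# Hypothetical training data: (feature, class label) and (x, y) pairs.
labeled = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.0, "high")]
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def learn_classifier(examples):
    """Classification: learn item -> predefined class (nearest class mean)."""
    centroids = {
        label: mean(x for x, lab in examples if lab == label)
        for label in {lab for _, lab in examples}
    }
    return lambda x: min(centroids, key=lambda lab: abs(x - centroids[lab]))

def learn_regressor(pairs):
    """Regression: learn item -> real value (least-squares straight line)."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def summarize(values):
    """Summarization: a compact description of a set of values."""
    return {"count": len(values), "min": min(values),
            "max": max(values), "mean": mean(values)}

classify = learn_classifier(labeled)
predict = learn_regressor(points)
summary = summarize([x for x, _ in labeled])
print(classify(2.0), predict(4.0), summary)
```

Both learners return a function, which mirrors the wording of the task definitions: what is learned is a mapping from an item to a class or to a real value.
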
Data Mining Methods

1. Decision Tree Classifiers:
Used for modeling, classification
2. Association Rules:
Used to find associations between sets of attributes
3. Sequential patterns:
Used to find temporal associations in time series
4. Hierarchical clustering:
Used to group customers, web users, etc.
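
As one illustration of the methods above, an association rule X -> Y is commonly scored by its support and confidence. The sketch below computes both for a single rule; the transactions and item names are hypothetical, and it assumes the antecedent occurs in at least one transaction (otherwise the confidence is undefined).

```python
# Hypothetical market-basket transactions; item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, baskets):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among transactions containing the antecedent, the share that also
    contain the consequent. Assumes the antecedent has nonzero support."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Score the rule {bread} -> {milk}.
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here three of the five transactions contain both bread and milk, and three of the four transactions with bread also contain milk, so the rule has support 3/5 and confidence of about 3/4.
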