Data mining involves the analysis of large datasets to discover patterns and relationships within the data. It aims to extract useful information and summarize the data in novel ways that are understandable and useful. Key aspects of data mining include analyzing large amounts of data from various sources like government, corporations, and science to discover patterns and apply the findings. Common data mining tasks are classification, regression, clustering, dependency analysis, and summarization. Popular data mining methods are decision trees, association rules, sequential patterns, and clustering.
Data Mining
What is Data Mining?
Data Mining is:
(1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
(2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner

Overview of terms
Data: a set of facts (items) D, usually stored in a database
Pattern: an expression E in a language L that describes a subset of facts
Attribute: a field in an item i in D
Interestingness: a function I_{D,L} that maps an expression E in L into a measure space M
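As an illustration of the terms above, here is a minimal Python sketch. Everything in it is an assumption made for the example, not part of the original slides: the toy records, the choice to model a pattern E as a predicate over items, and the choice of "fraction of items covered" as the interestingness function (so the measure space M is the interval [0, 1]).

```python
from typing import Callable, Dict, List

Item = Dict[str, object]          # one fact (item) i; its keys are the attributes
Pattern = Callable[[Item], bool]  # an expression E, modeled here as a predicate over items

# D: a small, made-up dataset of facts (hypothetical values)
D: List[Item] = [
    {"age": 25, "income": "low", "buys": False},
    {"age": 42, "income": "high", "buys": True},
    {"age": 37, "income": "high", "buys": True},
    {"age": 51, "income": "low", "buys": False},
]

def covered(E: Pattern, data: List[Item]) -> List[Item]:
    """Return the subset of facts that the pattern E describes."""
    return [i for i in data if E(i)]

def interestingness(E: Pattern, data: List[Item]) -> float:
    """One possible I_{D,L}: the fraction of items in the dataset covered by E."""
    return len(covered(E, data)) / len(data)

# E: "income is high and the customer buys"
E: Pattern = lambda i: i["income"] == "high" and i["buys"]

print(covered(E, D))          # the two high-income buyers
print(interestingness(E, D))  # 0.5
```

Other interestingness functions (novelty, statistical significance, utility to the data owner) would fit the same interface; coverage is used here only because it is the simplest to compute.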
The Data Mining Task:
For a given dataset D, language of facts L, interestingness function I_{D,L} and threshold c, find the expression E such that I_{D,L}(E) > c efficiently.

Knowledge Discovery

Examples of Large Datasets
Government: IRS, NGA, …
Large corporations:
  WALMART: 20M transactions per day
  MOBIL: 100 TB geological databases
  AT&T: 300M calls per day
  Credit card companies
Scientific:
  NASA, EOS project: 50 GB per hour
  Environmental datasets

Examples of Data Mining Applications
1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data Warehousing: Walmart
4. Astronomy
5. Molecular biology

How Data Mining is used
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results

The Data Mining Process
1. Understand the domain
2. Create a dataset:
   Select the interesting attributes
   Clean and preprocess the data
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to step 2

Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Must address:
  Enormity of data
  High dimensionality of data
  Heterogeneous, distributed nature of data

[Figure labels: Machine Learning/AI, Statistics, Database systems, Data Mining]

Data Mining Tasks
1. Classification: learning a function that maps an item into one of a set of predefined classes
2. Regression: learning a function that maps an item to a real value
3. Clustering: identifying a set of groups of similar items
4. Dependencies and associations: identifying significant dependencies between data attributes
5. Summarization: finding a compact description of the dataset or a subset of the dataset

Data Mining Methods
1. Decision Tree Classifiers: used for modeling, classification
2. Association Rules: used to find associations between sets of attributes
3. Sequential patterns: used to find temporal associations in time series
4. Hierarchical clustering: used to group customers, web users, etc.
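To make the association-rule method concrete, the sketch below treats it as one instance of the data mining task defined earlier: the expressions E are itemsets, the interestingness function is support (the fraction of transactions containing the itemset), and c is a support threshold. The transactions, function names, and thresholds are all hypothetical, and the search is a plain brute-force enumeration rather than an optimized algorithm such as Apriori, so it is only practical for tiny inputs.

```python
from itertools import combinations

# Hypothetical market-basket data: each transaction is a set of items
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, data):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in data if itemset <= t)

def support(itemset, data):
    """Interestingness used here: fraction of transactions containing itemset."""
    return count(itemset, data) / len(data)

def frequent_itemsets(data, c, max_size=2):
    """Brute force: keep every itemset E (up to max_size) with support(E) > c."""
    items = sorted(set().union(*data))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset, data)
            if s > c:
                result[itemset] = s
    return result

def pair_rules(freq, data, min_confidence):
    """Derive rules a -> b from frequent 2-itemsets.

    confidence(a -> b) = count({a, b}) / count({a})
    """
    out = []
    for itemset in freq:
        if len(itemset) != 2:
            continue
        for a in itemset:
            (b,) = itemset - {a}
            conf = count(itemset, data) / count(frozenset([a]), data)
            if conf >= min_confidence:
                out.append((a, b, conf))
    return sorted(out)

freq = frequent_itemsets(transactions, c=0.5)
for a, b, conf in pair_rules(freq, transactions, min_confidence=0.7):
    print(f"{a} -> {b} (confidence {conf:.2f})")
```

With these toy transactions the only frequent pair is {bread, milk}, which yields a rule in each direction. On realistic data, the number of candidate itemsets grows combinatorially, which is why the "efficiently" in the task definition matters and why practical miners prune candidates instead of enumerating them all.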