Week 1 - Part 2


Learning objectives

By the end of this module, you will be able to:


• Explain the introduction to data mining

• Illustrate the main data sources of data mining: relational databases, data warehouses, and transactional data

• Summarize various functionalities of data mining

• Explain various Technologies and applications of Data Mining

• Explain the evaluation methods, major issues, and history of data mining

This file is meant for personal use by [email protected] only.


Proprietary content. ©University of Arizona. All Rights Reserved. Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
What Kinds of Patterns Can Be Mined

Agenda

In this session, we will learn about:


• Types of data mining
• Various functionalities of data mining

What Kinds of Patterns Can Be Mined?
Data mining functionalities specify the kinds of patterns to be discovered in data mining tasks.
These tasks fall into two types: descriptive and predictive.
● Descriptive mining tasks characterize the general properties of the data in the database.
● Predictive mining tasks perform inference on the current data in order to make predictions.
Patterns can be Mined: Types

Data Mining
● Association
○ Affinity Analysis
○ Sequence Analysis
○ Link Analysis
● Prediction
○ Classification
○ Regression
○ Time Series
● Clustering
○ Segmentation
○ Outlier Analysis

Patterns can be Mined: Class Description
● Example classes
○ Customers: budget spenders, impulsive spenders, big spenders
○ Handheld devices whose sales increased by 10% or more in the last quarter
● Approaches
○ Characterization
■ Find general characteristics or features of a target class
○ Discrimination
■ Compare/contrast characteristics of a target class against those of other given classes
● Methods: statistics and OLAP techniques (Ch 2 and Ch4/5)
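
As a rough illustration of characterization and discrimination, the sketch below summarizes a target class by its per-attribute averages and then contrasts it with another class. The customer records, the attribute names, and the helpers `characterize` and `discriminate` are made up for this example; they are not taken from the course material.

```python
from statistics import mean

# Hypothetical customer records: (class label, monthly spend, visits per month).
customers = [
    ("big spender", 900.0, 8),
    ("big spender", 1100.0, 10),
    ("budget spender", 150.0, 3),
    ("budget spender", 250.0, 5),
]

def characterize(records, target):
    """Characterization: summarize the general features of one target class."""
    rows = [r for r in records if r[0] == target]
    return {
        "avg_spend": mean(r[1] for r in rows),
        "avg_visits": mean(r[2] for r in rows),
    }

def discriminate(records, target, contrast):
    """Discrimination: compare the target class against a contrasting class."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {key: t[key] - c[key] for key in t}

big = characterize(customers, "big spender")
gap = discriminate(customers, "big spender", "budget spender")
print(big)
print(gap)
```

In practice the same summaries would usually come from OLAP roll-ups or SQL aggregates rather than hand-written loops.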

Patterns can be Mined: Associations and Correlations
● Frequent patterns
○ Sets of entities that occur frequently in data
■ Frequent itemset patterns
■ Frequent sequential patterns
■ Frequent structure patterns
● Examples:
○ buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
○ age(X, “20..29”) ∧ income(X, “40..49K”) => buys(X, “laptop”) [support = 2%, confidence = 60%]
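
The support and confidence figures attached to rules like these can be computed directly from transaction data. The following is a minimal sketch, assuming each transaction is a set of item names; the sample baskets and the function names `support` and `confidence` are illustrative, and the resulting percentages differ from the ones quoted on the slide.

```python
# Hypothetical market-basket data: each transaction is the set of items bought.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"software"},
]

def support(itemset, baskets):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here 2 of the 5 baskets contain both items, and 2 of the 3 baskets containing a computer also contain software.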

Patterns can be Mined: Associations and Correlations (Contd)

Association, correlation vs. causation

● A typical association rule
○ Diaper => Beer [0.5%, 75%] (support, confidence)
● Are strongly associated items also strongly correlated?
○ What if beer is bought in 80% of all transactions? Then P(beer | diaper) = 75% is lower than P(beer) = 80%, so buying diapers goes with a slightly lower chance of buying beer, despite the high confidence.
■ Association mining is not correlation mining
● Strong correlation doesn’t automatically indicate a causal relationship.
○ https://www.stat.auckland.ac.nz/~wild/d2i/articles/4.5.Association%20and%20Correlation_ARTICLE.pdf
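
One common way to check the Diaper => Beer question is lift, the ratio of the rule's confidence to the overall frequency of the consequent. Lift is not named on the slide, so treat it here as an added illustration; the function name `lift` is likewise just a label for this sketch.

```python
def lift(confidence, consequent_support):
    """Lift = P(Y | X) / P(Y); 1 means independent, below 1 means negatively correlated."""
    return confidence / consequent_support

# Figures from the slide: confidence(Diaper => Beer) = 75%,
# plus the "what if" assumption that beer appears in 80% of transactions.
diaper_beer_lift = lift(0.75, 0.80)
print(f"lift = {diaper_beer_lift:.4f}")
```

A lift below 1 means that diaper buyers are slightly less likely than the average shopper to buy beer, even though the rule's confidence looks high.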

Patterns can be Mined: Classification and Regression for
Predictive Analysis
● Classification and label prediction
○ Construct models (functions) based on some training examples
○ Describe and distinguish classes or concepts for future prediction
■ E.g., classify terrorists based on their behavioral and social features, or classify cars based on gas mileage, accident rates, etc.
○ Predict class labels for new instances
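
The train-then-predict workflow can be sketched with a nearest-neighbour rule, used here only because it fits in a few lines; it is a stand-in, not a method prescribed by the slides. The car data, the labels, and the function name `predict` are hypothetical.

```python
import math

# Hypothetical training examples: (gas mileage in mpg, accidents per 1,000 cars) -> label.
training = [
    ((45.0, 2.0), "economy"),
    ((40.0, 3.0), "economy"),
    ((18.0, 9.0), "sports"),
    ((15.0, 11.0), "sports"),
]

def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    nearest = min(examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

label_a = predict((42.0, 2.5), training)
label_b = predict((17.0, 10.0), training)
print(label_a, label_b)
```

The "model" here is simply the stored training set; decision trees, naïve Bayes, and similar methods replace it with a learned function.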

Patterns can be Mined: Classification and Regression for
Predictive Analysis(Contd.)

● Frequently used methods
○ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
● Typical applications:
○ Credit card fraud detection, direct marketing, classifying web pages and other entities, …

Patterns can be Mined: Classification and Regression for
Predictive Analysis (Contd)
● Classification predicts class labels (categorical or nominal values)
● Regression generally predicts continuous values (e.g., precipitation in a year)
● Many predictive problems, if not all, can be formulated as classification or regression problems
● Some machine learning methods are best suited to either classification or regression; others can perform both tasks (e.g., artificial neural networks (ANN), support vector machines (SVM), decision trees).
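
To contrast regression with classification, here is a minimal sketch of a least-squares line fit that predicts a continuous value. The precipitation figures are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
# Hypothetical data: (year index, annual precipitation in mm).
points = [(1, 510.0), (2, 530.0), (3, 550.0), (4, 570.0)]

def fit_line(data):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(points)
forecast = slope * 5 + intercept  # a continuous value, not a class label
print(f"predicted precipitation for year 5: {forecast:.1f} mm")
```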

Patterns can be Mined: Cluster Analysis
● Is unsupervised learning (i.e., no target class labels)
● Group data to form categories (i.e., clusters), e.g., cluster houses to find distribution patterns
● Principle:
○ Maximize intra-group similarity and minimize inter-group similarity
● Avoid using the term “classes” when referring to “clusters”
● Many methods and applications
● Examples:
○ Identify groups of customers based on their shopping habits
○ Identify groups of scholars based on their academic output
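
The grouping principle can be sketched with k-means, one of the many clustering methods available. This is a simplified one-dimensional version with fixed starting centroids, chosen so the result is reproducible; the spend values and the function name `k_means` are assumptions made for the example.

```python
def k_means(values, centroids, rounds=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical monthly spend per customer; no class labels are provided.
spend = [20, 25, 30, 200, 210, 220]
clusters, centers = k_means(spend, centroids=[0.0, 100.0])
print(clusters, centers)
```

The algorithm receives no labels; the two groups emerge only from how close the values are to each other.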

Patterns can be Mined: Outlier Analysis
Outlier analysis or anomaly mining
○ Outlier: a data object that does not comply with the general behavior of the data
○ Noise or exception? One person’s garbage could be another person’s treasure
○ Methods:
■ Statistical tests
■ By-product of clustering or regression analysis
■ Classification
○ Useful in fraud detection and rare-event analysis
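
As an example of the statistical-test approach, the sketch below flags values that lie more than a chosen number of standard deviations from the mean (a z-score rule). The threshold of 2.0, the sample amounts, and the function name `z_score_outliers` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [10, 12, 11, 13, 12, 11, 95]
flagged = z_score_outliers(amounts)
print(flagged)
```

Whether the flagged value is noise to discard or a fraud case to investigate depends on the application.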

Patterns can be Mined: Evaluation
● Are all mined patterns interesting?
○ One can mine a lot of “patterns” and “knowledge”
■ E.g., association rule mining.
○ Some may hold only in a certain dimension space (time, location, …)
○ Some may not be representative, may be transient, …
● Evaluation of mined knowledge
○ Objective measures (accuracy, precision, recall, etc.)
○ Subjective measures
■ Goal-oriented
■ Actionable
■ Unexpected
Patterns can be Mined: Evaluation (Contd)
● Interesting patterns need to be
○ Understandable by humans
○ Valid on new data with some degree of certainty
○ Potentially useful
○ Novel
● Objective measures for association rules
○ Support(X => Y) = P(X ∪ Y), the probability that a transaction contains every item in both X and Y
○ Confidence(X => Y) = P(Y | X)
○ X and Y are itemsets
○ Support and confidence (S-C) thresholds are applied to select rules
● Other objective measures
○ If-then rules: accuracy and coverage
○ Classification: precision and recall
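
Precision and recall for a classifier can be computed from paired actual and predicted labels. This is a minimal sketch with made-up fraud labels; `precision_recall` is a hypothetical helper, and it returns 0.0 when a denominator would be zero.

```python
def precision_recall(actual, predicted, positive):
    """Compute precision and recall for the `positive` class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels from a fraud classifier.
actual = ["fraud", "ok", "fraud", "ok", "fraud", "ok"]
predicted = ["fraud", "fraud", "ok", "ok", "fraud", "ok"]
precision, recall = precision_recall(actual, predicted, positive="fraud")
print(precision, recall)
```

Precision answers "of the cases flagged as fraud, how many really were?", while recall answers "of the real fraud cases, how many were flagged?".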
Summary

Here is a quick recap:


• We discussed the two types of data mining tasks: descriptive and predictive.
• We discussed the kinds of patterns that can be mined, such as association, prediction, and clustering.
• We discussed classification and regression in predictive analysis, which predict class labels and continuous values, respectively, for new instances.
• We learned about applications of classification and regression, such as credit card fraud detection and direct marketing.
• We discussed cluster analysis, an unsupervised learning approach that groups data into categories (i.e., clusters).
• We discussed outlier analysis, which identifies anomalous observations in a dataset.

Data Mining Technologies and Applications

Agenda

In this session, we will learn about:


• Technologies of Data Mining
• Applications of Data Mining

Data Mining: Technologies

Data mining draws on:
● Statistics
● Machine learning
● Pattern recognition
● Database systems
● Visualization
● Data warehouse
● Algorithms
● High-performance computing
● Information retrieval
● Applications

Why Confluence of Multiple Disciplines?
Traditional data analysis:
Traditional data analytics typically relies on dashboards composed of visualizations. These dashboards
are built around common business questions and are defined well in advance. Answering a new question
takes time and technical skill: usually multiple days (or weeks) and help from a data analyst or
data scientist.
● Tremendous amount of data
○ Algorithms must be highly scalable to handle data volumes such as terabytes
● High dimensionality of data
○ A microarray may have tens of thousands of dimensions

Why Confluence of Multiple Disciplines? (Contd.)

● High complexity of data


○ Data streams and sensor data
○ Time-series data, temporal data, sequence data
○ Structure data, graphs, social networks and multi-linked data
○ Heterogeneous databases and legacy databases
○ Spatial, spatiotemporal, multimedia, text and Web data
○ Software programs, scientific simulations
● New and sophisticated applications

Data Mining: Applications
● Web page analysis: from web page classification and clustering to the PageRank and HITS algorithms
● Collaborative analysis & recommender systems
● Basket data analysis to targeted marketing
● Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
● Data mining and software engineering
● From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining (= data mining built into regular functional components, running all the time, often unbeknownst to the user)

Summary

Here is a quick recap:


• We discussed the technologies that data mining draws on, such as statistics, machine learning, and data warehousing.
• We saw why data mining requires a confluence of multiple disciplines, including:
- The tremendous amount of data
- High-complexity data such as time-series data and structure data
• We discussed applications of data mining, such as web page analysis, collaborative analysis, and recommender systems.

Issues and History of Data Mining

Agenda

In this session, we will learn about:


• Major issues in data mining
• History of data mining

Data Mining: Major issues
● Mining Methodology
○ Mining various and new kinds of knowledge
○ Mining knowledge in multi-dimensional space
○ Data mining: An interdisciplinary effort
○ Boosting the power of discovery in a networked environment
○ Handling noise, uncertainty, and incompleteness of data


○ Pattern evaluation and pattern- or constraint-guided mining
● User Interaction
○ Interactive mining
○ Incorporation of background knowledge
○ Presentation and visualization of data mining results
Data Mining: Major issues (Contd)
● Efficiency and Scalability
○ Efficiency and scalability of data mining algorithms
○ Parallel, distributed, stream, and incremental mining methods
● Diversity of data types
○ Handling complex types of data
○ Mining dynamic, networked, and global data repositories
● Data mining and society
○ Social impacts of data mining
○ Privacy-preserving data mining
○ Invisible data mining

Data Mining: History
● 1989 IJCAI Workshop on Knowledge Discovery in Databases
○ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
● 1991-1994 Workshops on Knowledge Discovery in Databases
○ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
and R. Uthurusamy, 1996)
● 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
○ Journal of Data Mining and Knowledge Discovery (1997)
● ACM SIGKDD conferences since 1998 and SIGKDD Explorations
● More conferences on data mining
○ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
● ACM Transactions on KDD starting in 2007
Data Mining: Conferences and Journals
KDD Conferences
● ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
● SIAM Data Mining Conf. (SDM)
● (IEEE) Int. Conf. on Data Mining (ICDM)
● European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
● Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
● Int. Conf. on Web Search and Data Mining (WSDM)

Other related conferences
● DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
● Web and IR conferences: WWW, SIGIR, WSDM
● ML conferences: ICML, NIPS
● PR conferences: CVPR

Journals
● Data Mining and Knowledge Discovery (DAMI or DMKD)
● IEEE Trans. on Knowledge and Data Eng. (TKDE)
● KDD Explorations
● ACM Trans. on KDD
Data Mining: References
● Data mining and KDD (SIGKDD: CDROM)
○ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
○ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
● Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
○ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
○ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
● AI & Machine Learning
○ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
○ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
● Web and IR
○ Conferences: SIGIR, WWW, CIKM, etc.
○ Journals: WWW: Internet and Web Information Systems,

Data Mining: References(Contd.)
● Statistics
○ Conferences: Joint Stat. Meeting, etc.
○ Journals: Annals of statistics, etc.
● Visualization
○ Conference proceedings: CHI, ACM-SIGGraph, etc.
○ Journals: IEEE Trans. visualization and computer graphics, etc.
Summary

Here is a quick recap:


• We reviewed the major issues in data mining, such as efficiency and scalability and the diversity of data types.
• We discussed a brief history of data mining.

Summary-Data Mining

Summary
● Data mining: discovering interesting patterns and knowledge from massive amounts of data
● A natural evolution of database technology, in great demand, with wide applications
● A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
● Mining can be performed on a variety of data
● Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
● Data mining technologies and applications
● Major issues in data mining

Learning outcomes

By the end of this module, you are expected to:


• Explain the introduction to data mining
• Summarize the main data sources of data mining: relational databases, data warehouses, and transactional data
• Explain various functionalities of data mining
• Illustrate various technologies and applications of data mining
• Examine the evaluation methods, major issues, and history of data mining

Thank you
