CIS 467 - Topic 1 - Introduction - 2020
2022
E-mail: [email protected]
Text Book Page
Notes
Motivating Students (1):
Google and Read about these Topics
Data Science
Cloud Computing
Internet of Things
Big Data
Data Analytics
Disruptive Technologies
Sentiment Analysis
Motivating Students (2):
The Course
This course is an introduction to the
young and fast-growing field of data mining
(also known as knowledge discovery from
data, or KDD for short).
Introduction to Data Mining
Chapter 1. Introduction
Why Data Mining?
Necessity is the mother of invention.
We live in a world where vast amounts of data are
collected daily. Analyzing such data is an important
need.
Data Mining
Data mining, also popularly referred to as
knowledge discovery from data (KDD), is the
automated or convenient extraction of patterns
representing knowledge implicitly stored or
captured in large databases, data warehouses, the
Web, other massive information repositories, or
data streams.
Data Mining as the Evolution of
Information Technology
Evolution of Sciences
(The early appearance of Data Science)
Before 1600: Empirical Science
1600-1950s: Theoretical Science
Each discipline has grown a theoretical component. Theoretical models
often motivate experiments and generalize our understanding.
1950s-1990s: Computational Science
Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics,
or linguistics.)
Computational Science traditionally meant simulation. It grew out of our
inability to find closed-form solutions for complex mathematical models.
1990-now: Data Science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally
accessible
Scientific info. management, acquisition, organization, query, and
visualization tasks scale almost linearly with data volumes. Data mining
is a major new challenge!
Evolution of Database Technology
See Figure 1.1
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Data Science
Figure: Data Science, with two branches: Exploration and Modeling.
Source: http://www.saedsayad.com/
Data Exploration
Modeling
What is Data Mining
It is no surprise that data
mining, as a truly
interdisciplinary subject,
can be defined in many
different ways.
Traditional Data Analysis Methods
Example: Data mining turns a large
collection of data into knowledge.
DM and KDD
Note:
1.2 What Is Data Mining?
Basic Terms
Nice Illustrative Figure of Data Mining
Knowledge Discovery (KDD)
Process
From Data to Knowledge
Knowledge Discovery (KDD) Process
This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
Figure (stages shown, from source data up to results): Databases, Data Cleaning and Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation
Knowledge Discovery (KDD) Process
Example 1)
75% of customers who bought a TV 35” or larger
are 85% likely to buy a home-theater system
within the next five weeks.
(In class discussion)
Example 2)
If income <= 35000 and credit_rating < 3 and
age < 35 and credit_amount > 50000 then
minimum loan term is 5 years.
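The rule in Example 2 can be read as a predicate over an applicant record. The following is a minimal sketch under that reading; the `Applicant` class, the function name, and the two sample applicants are hypothetical and only serve to illustrate how such a discovered rule might be applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Applicant:
    # Field names mirror the attributes used in Example 2.
    income: float
    credit_rating: int
    age: int
    credit_amount: float


def minimum_loan_term_years(a: Applicant) -> Optional[int]:
    """Return 5 when the rule from Example 2 fires, else None (the rule is silent)."""
    if (a.income <= 35000 and a.credit_rating < 3
            and a.age < 35 and a.credit_amount > 50000):
        return 5
    return None


# Hypothetical applicants, made up for illustration.
young_borrower = Applicant(income=30000, credit_rating=2, age=28, credit_amount=60000)
older_borrower = Applicant(income=30000, credit_rating=2, age=40, credit_amount=60000)

print(minimum_loan_term_years(young_borrower))  # all four conditions hold
print(minimum_loan_term_years(older_borrower))  # age condition fails
```

Returning `None` when the conditions do not hold keeps "the rule does not apply" separate from any actual loan term.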
1.3 What Kinds of Data Can Be Mined?
Example of Relational Database
Example of Transactional Database
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Example of a Flat Dataset
The Buys Computer Dataset: This follows an example from Quinlan’s ID3
An Example of a Data Cube
Ex: Multidimensional Data
Figure: data cube with dimensions Office, Day, and Month
An Example of Data Cube Operations
Example:
Results of Data Mining May Include:
Forecasting what may happen in the future
Classifying people or things into groups by recognizing patterns
Clustering people or things into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events
Data Mining Models and Tasks
Data Mining
Predictive: Classification, Regression
Descriptive: Clustering, Association Rules
1.4.1 Concept/Class Description:
Characterization and Discrimination
Breast Cancer Dataset
1.4.2 Mining Frequent Patterns, Associations,
and Correlations
Frequent patterns are patterns that occur frequently in data. A
frequent itemset typically refers to a set of items that frequently
appear together in a transactional data set, such as milk and
bread.
Issues:
1) How to mine such patterns and rules efficiently in large datasets?
2) How to use such patterns for classification, clustering, and other applications?
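To make "frequent itemset" concrete, the sketch below counts how often each single item and each pair of items appears in the five transactions from the transactional database example. It is a brute-force count, not an efficient mining algorithm such as Apriori, and the minimum support of 3 transactions is an assumed threshold chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the transactional database example (TID 1-5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen in at least
    min_support transactions. Exhaustive, so only suitable for tiny data."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives each itemset one canonical tuple form.
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


result = frequent_itemsets(transactions, min_support=3)  # assumed threshold
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, n)
```

The exhaustive enumeration grows exponentially with the number of items per transaction, which is exactly why issue 1) above, efficient mining in large datasets, matters.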
1.4.3 Classification and Prediction
Classification is the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of training data (i.e.,
data objects whose class label is known). Describe and distinguish classes or
concepts for future prediction.
Examples:
Classify countries based on climate
Classify cars based on gas mileage
The derived model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, mathematical formulae, or
neural networks
Typical methods
Decision trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification, logistic regression,
…
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
Prediction: models continuous-valued functions. That is, it is used to predict numeric values rather than class labels.
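As a small illustration of predicting a continuous value, the sketch below fits a straight line by ordinary least squares and uses it to predict a new point. Least squares is just one possible technique for this task, and the data points are hypothetical, chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a numeric value (not a class label) for a new x."""
    return slope * x + intercept


# Hypothetical training data: y is exactly 2 * x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(6, slope, intercept))
```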
Example: A Decision Tree
The output of the classification task using the decision tree method for the “buys_computer” class
Example: If Age = 30..40 Then Buys_Computer = yes.
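A decision tree can be stored as a nested structure and applied to a record by walking from the root to a leaf. In the sketch below, only the branch Age = 30..40 leading to yes comes from the slide; the other age ranges, the `student` and `credit_rating` tests, and their leaf labels are hypothetical placeholders added so the example can run.

```python
# Decision tree as nested dicts: an internal node names the attribute to test
# and maps each attribute value to a subtree or to a class label (a leaf).
# Only the "30..40" -> "yes" branch is taken from the slide; every other
# branch is an assumption made for illustration.
tree = {
    "attribute": "age",
    "branches": {
        "30..40": "yes",
        "<30": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}


def classify(node, record):
    """Follow the record's attribute values from the root down to a leaf label."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


print(classify(tree, {"age": "30..40", "student": "no", "credit_rating": "fair"}))
print(classify(tree, {"age": "<30", "student": "yes", "credit_rating": "fair"}))
```

Each root-to-leaf path corresponds to one IF-THEN rule, which is why the same model can be presented either as a tree or as a rule set.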
Example (2) : A Decision Tree
1.4.4 Cluster Analysis
1.4.5 Outlier Analysis
Outlier: A data object that does not comply with the general behavior
of the data
Noise or exception? ― One person’s garbage could be another
person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
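For intuition, the sketch below flags outliers with a simple z-score test. This is a basic statistical approach, not the clustering-based or regression-based methods listed above, and both the transaction amounts and the threshold of 2.0 are hypothetical.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical credit card transaction amounts; one is far from the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]

print(find_outliers(amounts))
```

With so few values, a single extreme amount inflates the standard deviation itself, so the threshold is kept low here; whether a flagged value is noise or a case of fraud still needs a human judgment.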
1.4.6 Are All the “Discovered” Patterns
Interesting?
Break
1.5 Which Technologies Are Used?
Data Mining Is a Confluence of Disciplines
Data mining involves an integration of techniques from multiple disciplines such as database
and data warehouse technology, statistics, machine learning, high-performance computing,
pattern recognition, neural networks, data visualization, information retrieval, image and
signal processing, and spatial or temporal data analysis.
Why Confluence of Multiple Disciplines?
1.6 Which Kinds of Applications Are
Targeted?
Where there are data, there are data mining
applications.
As a young research field, data mining has made
broad and significant progress since its early
beginnings in the 1980s.
Business Intelligence
Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations. Examples include reporting,
online analytical processing, business performance management,
competitive intelligence, benchmarking, and predictive analytics.
Without data mining, many businesses may not be able to perform
effective market analysis, compare customer feedback on similar
products, discover the strengths and weaknesses of their competitors,
retain highly valuable customers, and make smart business decisions.
Clearly, data mining is the core of business intelligence. Online
analytical processing tools in business intelligence rely on data
warehousing and multidimensional data mining. Classification and
prediction techniques are the core of predictive analytics in business
intelligence, for which there are many applications in analyzing
markets, supplies, and sales. Moreover, clustering plays a central role
in customer relationship management, which groups customers based
on their similarities. Using characterization mining techniques, we can
better understand features of each customer group and develop
customized customer reward programs.
Data Mining in Business Intelligence
Figure (pyramid, top to bottom; potential to support business decisions increases toward the top):
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Mining Software Tools
Commercial Products
SAS Enterprise Miner
SPSS Clementine
IBM Intelligent Miner
SGI MineSet
i2 Analyst’s Notebook
Microsoft OLE DB for Data Mining
Oracle Data Mining
Free and Open Source Products
Weka
RapidMiner
Orange
KNIME
Try It
https://rapidminer.com/
Example 3: Python for Data Science
You can find some books related to using Python for Data Science.
https://www.python.org/
CIS 467L (Data Mining Lab)
Summary
Data mining: Discovering interesting patterns and knowledge
from massive amounts of data
A natural evolution of database technology, in great demand,
with wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation,
and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend
analysis, etc.
Data mining technologies and applications
Some Pioneers in the Data Mining Field
U. M. Fayyad,
G. Piatetsky-Shapiro,
J. Han
I. H. Witten
E. Frank
Say something about the course lab
CIS 467L
Exercises
1.1 What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation of technology developed from databases, statistics, and
machine learning?
(c) Explain how the evolution of database technology led to data mining.
(d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery.
1.11 Outliers are often discarded as noise. However, one person’s garbage could
be another’s treasure. For example, exceptions in credit card transactions
can help us detect the fraudulent use of credit cards. Taking fraudulence
detection as an example, propose two methods that can be used to detect
outliers and discuss which one is more reliable.