
CIS 467 - Topic 1 - Introduction - 2020

Uploaded by

Dragon Pavilion

CIS 467: Data Mining

2022

Department of Information Systems


Faculty of Information Technology and Computer Sciences
Yarmouk University – Jordan
Instructor EMAIL

Dr. “Mohammad Ashraf” Ottom

E-mail: [email protected]

2
Text Book Page

3
Notes

4
5
Motivating Students (1):
Google and Read about these Topics

Data Science

Cloud Computing

Internet of Things

Big Data

Data Analytics

Disruptive Technologies

Sentiment Analysis

6
Motivating Students (2):

 Think of a way that allows us to predict students' marks before the end of the semester.
 Think about how we can form two teams from the students in the section to participate in a programming competition.
 Think of a way that helps the registration department find the courses that are most frequently registered together during a semester.
 Think ….
 Think …
 Think …

7
The Course
 This course is an introduction to the young and fast-growing field of data mining (also known as knowledge discovery from data, or KDD for short).

 The course focuses on fundamental data mining concepts and techniques for discovering interesting patterns from data in various applications.
Topic 1

Introduction to Data
Mining

9
Chapter 1. Introduction

 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kinds of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 Which Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Data Mining Tools
 Summary

10
Why Data Mining?
 Necessity is the mother of invention.
 We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need.

 Moving toward the Information Age

 “We are living in the information age” ???

 We are actually living in the data age. Terabytes or petabytes of data pour into our computer networks, the World Wide Web (WWW), and various data storage devices every day from business …

 We are drowning in data, but starving for knowledge!
The Explosive Growth of Data
 This explosive growth of available data volume is a result of the computerization of our society and the fast development of powerful data collection and storage tools.

 Global backbone telecommunication networks carry tens of petabytes of data traffic every day. The medical and health industry generates tremendous amounts of data from medical records, patient monitoring, and medical imaging.

 Billions of Web searches supported by search engines process tens of petabytes of data daily.
The Explosive Growth of Data
 We live in a world where vast amounts of data are collected daily. Analyzing such data is an important need.

 First we will look at how data mining can meet this need by providing tools to discover knowledge from data.

 Then we will observe how data mining can be viewed as a result of the natural evolution of information technology.

13
Data Mining
 Data mining, also popularly referred to as knowledge discovery from data (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories, or data streams.

 Data mining is a multidisciplinary field, drawing work from areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization.

14
Data Mining as the Evolution of
Information Technology

 Data mining can be viewed as a result of the natural evolution of information technology.

 The database and data management industry evolved in the development of several critical functionalities.

 Data mining emerged during the late 1980s, made great strides during the 1990s, and continues to flourish into the new millennium.

15
16
Evolution of Sciences
(The early appearance of Data Science)
 Before 1600: Empirical Science
 1600-1950s: Theoretical Science
 Each discipline has grown a theoretical component. Theoretical models
often motivate experiments and generalize our understanding.
 1950s-1990s: Computational Science
 Over the last 50 years, most disciplines have grown a third, computational
branch (e.g. empirical, theoretical, and computational ecology, or physics,
or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our
inability to find closed-form solutions for complex mathematical models.

 1990-now: Data Science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that makes all these archives universally
accessible
 Scientific info. management, acquisition, organization, query, and
visualization tasks scale almost linearly with data volumes. Data mining
is a major new challenge!
17
Evolution of Database Technology
See Figure 1.1
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems

18
Data Science

Data Science

Explaining the Past Predicting the Future

Exploration Modeling

http://www.saedsayad.com/

19
Data Exploration

20
Modeling

21
What is Data Mining
 It is no surprise that data mining, as a truly interdisciplinary subject, can be defined in many different ways.

 Even the term data mining does not really present all the major components in the picture. To refer to the mining of gold from rocks or sand, we say gold mining instead of rock or sand mining.

22
Traditional Data Analysis Methods
Example : Data mining turns a large
collection of data into knowledge.

24
DM and KDD
 Note:

 We agree that data mining (DM) is a step in the knowledge discovery from data (KDD) process.

 However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data.
 Nowadays, the term data analysis is more popular still.

25
1.2 What Is Data Mining?

 Formal Definition of DM:

 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

 Data mining: a misnomer?

 The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining.

 Thus, data mining should have been more appropriately named “knowledge mining from data,” which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not reflect the emphasis on mining from large amounts of data.

26
Basic Terms

27
Nice Illustrative Figure of Data Mining

Data Analyst (Miner)


Data
Data Mining Tools and Functions

Extracted Knowledge (Gold)

28
Knowledge Discovery (KDD)
Process

From Data to Knowledge

29
From Data to Knowledge

30
Knowledge Discovery (KDD) Process
 This is a view from typical database systems and data warehousing communities
 Data mining plays an essential role in the knowledge discovery process

[Figure: the KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
31
Knowledge Discovery (KDD) Process

Knowledge discovery as a process is depicted in Figure 1.4 and consists of an iterative sequence of the following steps:
1. Data cleaning: to remove noise and inconsistent data
2. Data integration: where multiple data sources may be combined
3. Data selection: where data relevant to the analysis task are retrieved from the database
4. Data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance
5. Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
• Choosing functions of data mining: summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures (Section 1.5)
7. Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user
Note: Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining.
32
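The seven steps can be sketched as a chain of small functions. This is only an illustrative sketch in plain Python: the two record lists, the field names, and every function name below are hypothetical and are not taken from the course material, and the "mining" step is a simple frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources (all values are made up).
store_a = [
    {"customer": "c1", "item": "milk", "amount": 2.5},
    {"customer": "c2", "item": "bread", "amount": None},  # noisy: missing amount
    {"customer": "c1", "item": "bread", "amount": 1.5},
]
store_b = [
    {"customer": "c3", "item": "milk", "amount": 3.0},
    {"customer": "c3", "item": "pen", "amount": 0.5},
]


def clean(records):
    """Step 1: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [r for source in sources for r in source]


def select(records, items):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]


def transform(records):
    """Step 4: aggregate the records into one set of items per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets


def mine(baskets):
    """Step 5: count how many customers bought each item."""
    return Counter(item for items in baskets.values() for item in items)


def evaluate(counts, min_count):
    """Step 6: keep only patterns that reach an interestingness threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}


data = integrate(clean(store_a), clean(store_b))
data = select(data, items={"milk", "bread"})
baskets = transform(data)
patterns = evaluate(mine(baskets), min_count=2)
print(patterns)  # Step 7: present the mined knowledge to the user
```

Steps 1 to 4 here only prepare the data; the counting in `mine` is the one step that extracts a pattern, which mirrors the note that the first four steps are preprocessing.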
KDD Process: A Typical View from ML and
Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing: Data cleaning, Data integration, Normalization, Feature selection, Dimension reduction, …
Data Mining: Pattern discovery, Association & correlation, Classification, Clustering, Outlier analysis, …
Post-Processing: Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization

 This is a view from typical machine learning and statistics communities
33
Examples of Data Mining Findings

Some examples of data mining findings:

Example 1)
75% of customers who bought a TV 35” or larger
are 85% likely to buy a home-theater system
within the next five weeks.
(In class discussion)

Example 2)
If income <= 35000 and credit_rating < 3 and
age < 35 and credit_amount > 50000 then
minimum loan term is 5 years.
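A rule such as Example 2 maps directly onto a conditional test. The sketch below is one possible encoding; the function name and the choice to return `None` when the rule does not fire are assumptions made for illustration, since the slide only states what happens when all four conditions hold.

```python
def minimum_loan_term_years(income, credit_rating, age, credit_amount):
    """Return 5 when the rule from Example 2 fires, otherwise None."""
    if income <= 35000 and credit_rating < 3 and age < 35 and credit_amount > 50000:
        return 5
    # The rule says nothing about applicants who fail any of the conditions.
    return None


print(minimum_loan_term_years(30000, 2, 28, 60000))
print(minimum_loan_term_years(80000, 2, 28, 60000))
```

The first call satisfies every condition, so the rule fires; the second fails the income test, so the rule gives no answer.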

Exercise: Students should think of more practical examples from their environments.
Aspects of Data Mining
(A multi-dimensional View of Data Mining)
 1.3 What kinds of data can be mined
 Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
 1.4 What kinds of patterns can be mined (data mining functions/tasks)
 Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
 1.5 Which technologies are used
 Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
 1.6 Which kinds of applications are targeted
 Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

35
1.3 What Kinds of Data Can Be Mined?

 Database-oriented data sets and applications


 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications

 A relational database is a collection of tables, each of which is assigned a unique name.


 A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and that usually resides at a single site.
 A data cube provides a multidimensional view of data and allows the pre-computation and fast
accessing of summarized data.
 Object-relational databases are constructed based on an object-relational data model
 A transactional database consists of a file where each record represents a transaction
 A temporal database typically stores relational data that include time-related attributes
 A sequence database stores sequences of ordered events
 A time-series database stores sequences of values or events obtained over repeated
measurements of time
 Spatial databases contain spatial-related information
 Text databases are databases that contain word descriptions for objects
 Multimedia databases store image, audio, and video data
 A legacy database is a group of heterogeneous databases that combines different kinds of data
systems
Note: For more details and definitions of these kinds of data, refer to pages 10 to 21 of the textbook.

36
Example of Relational Database

37
Example of Transactional
Database

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
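As a small illustration of what can be read off such a table, the sketch below counts how often each item and each pair of items occurs in the five transactions above. The variable names and the threshold `min_count = 3` are assumptions chosen for the example, not values given in the slides.

```python
from collections import Counter
from itertools import combinations

# The five transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions that contain each single item.
item_counts = Counter(item for items in transactions.values() for item in items)

# Number of transactions that contain each pair of items.
# Sorting makes ("Coke", "Milk") and ("Milk", "Coke") the same key.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)

min_count = 3  # assumed threshold for calling a pair "frequent"
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}

print(item_counts["Milk"])
print(frequent_pairs)
```

With this threshold only two pairs survive, Coke with Milk and Diaper with Milk, each appearing in three of the five transactions.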

38
Example of a Flat Dataset

The Buys Computer Dataset: This follows an example from Quinlan’s ID3
Example of a Flat Dataset
Example of a Flat Dataset

Extracted Knowledge represented as a Decision Tree
Example of a Dataset

Extracted Knowledge represented as a Decision Tree and Rules
Example of a Dataset with some
Boolean Attributes
An Example of a Data
Warehouse

44
An Example of a Data Cube

45
Ex: Multidimensional Data

 Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product: Industry → Category → Product
Location: Region → Country → City → Office
Time: Year → Quarter → Month or Week → Day

[Figure: data cube with axes Product, Region, and Month]
46
An Example of Data Cube
operations

47
An Example of Data Cube
operations

In New York City, how many security items were sold in 2019?

How many computers were sold in Vancouver in the second half of 2019?

How many computers were sold in Toronto in quarter 1?
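Questions like these amount to fixing values on some dimensions of the cube and summing the measure over the rest. The sketch below answers them over a tiny fact table; every row and number in `sales` is hypothetical (the figures from the slide's cube are not available in the text), and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical fact table: (city, item, quarter, year, units_sold).
sales = [
    ("New York", "security", "Q1", 2019, 10),
    ("New York", "security", "Q3", 2019, 15),
    ("Vancouver", "computer", "Q2", 2019, 7),
    ("Vancouver", "computer", "Q3", 2019, 8),
    ("Vancouver", "computer", "Q4", 2019, 5),
    ("Toronto", "computer", "Q1", 2019, 12),
    ("Toronto", "phone", "Q1", 2019, 4),
]


def total_units(rows, city=None, item=None, quarters=None, year=None):
    """Sum units over rows matching every dimension value that is given."""
    total = 0
    for r_city, r_item, r_quarter, r_year, units in rows:
        if city is not None and r_city != city:
            continue
        if item is not None and r_item != item:
            continue
        if quarters is not None and r_quarter not in quarters:
            continue
        if year is not None and r_year != year:
            continue
        total += units
    return total


def roll_up_by_city(rows):
    """Aggregate away item and time, keeping only the city dimension."""
    totals = defaultdict(int)
    for city, _item, _quarter, _year, units in rows:
        totals[city] += units
    return dict(totals)


ny_security = total_units(sales, city="New York", item="security", year=2019)
van_second_half = total_units(
    sales, city="Vancouver", item="computer", quarters={"Q3", "Q4"}, year=2019
)
toronto_q1 = total_units(sales, city="Toronto", item="computer", quarters={"Q1"})
print(ny_security, van_second_half, toronto_q1)
print(roll_up_by_city(sales))
```

A real data warehouse would precompute such aggregates rather than scan the rows on every query; the scan is used here only to keep the example short.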
48
1.4 What Kinds of Patterns Can Be Mined?
(Data Mining Functionalities/tasks)

 We have observed various types of databases and information repositories on which data mining can be performed. Let us now examine the kinds of data patterns that can be mined. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.

 In general, data mining tasks can be classified into two categories:

Descriptive mining tasks: characterize the general properties of the
data in the database.

Predictive mining tasks: perform inference on the current data in order
to make predictions.

 Example:

Results of Data Mining May Include:
 Forecasting what may happen in the future
 Classifying people or things into groups by recognizing patterns
 Clustering people or things into groups based on their attributes
 Associating what events are likely to occur together
 Sequencing what events are likely to lead to later events

49
Data Mining Models and Tasks

Data Mining
Predictive: Classification, Regression, Time Series Analysis, Prediction
Descriptive: Clustering, Summarisation, Association Rules, Sequence Discovery

50
1.4.1 Concept/Class Description:
Characterization and Discrimination

 Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders.
 Summarized descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via:

(1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or

(2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or

(3) both data characterization and discrimination.

 Data characterization is a summarization of the general characteristics or features of a target class of data.
 Example: study the characteristics of software products whose sales increased by 10% in the last year.
 The output of data characterization can be presented in various forms. Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables.
 The resulting descriptions can also be presented as generalized relations or in rule form (called characteristic rules).

51
Breast Cancer Dataset

Data Mining: Concepts and Techniques, November 22, 2024
52
1.4.1 Concept/Class Description:
Characterization and Discrimination
 Data discrimination is a comparison of the general features of
target class data objects with the general features of objects from
one or a set of contrasting classes.

 Example: the user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.

 Discrimination descriptions expressed in rule form are referred to as discriminant rules.

53
1.4.2 Mining Frequent Patterns, Associations,
and Correlations
 Frequent patterns are patterns that occur frequently in data. A
frequent itemset typically refers to a set of items that frequently
appear together in a transactional data set, such as milk and
bread.

 What is Association Mining?

 Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

 Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
 Example of typical association rules:

 X ⇒ Y [0.5%, 75%] (support, confidence)

 Issues:
 1) How to mine such patterns and rules efficiently in large datasets?
 2) How to use such patterns for classification, clustering, and other applications?
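The two numbers attached to a rule X ⇒ Y can be computed by counting. In the sketch below, support is taken as the fraction of transactions that contain both X and Y, and confidence as the fraction of the transactions containing X that also contain Y; these are the usual definitions, stated here as an assumption because the slide gives only the notation. The transactions are the five from the transactional database example, and the function names are illustrative.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain both X and Y."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Rule: Milk => Diaper
print(support({"Milk"}, {"Diaper"}, transactions))
print(confidence({"Milk"}, {"Diaper"}, transactions))
```

Milk and Diaper occur together in three of the five transactions, and Milk occurs in four, so the rule Milk ⇒ Diaper would be written with support 60% and confidence 75%. Scanning every transaction for every candidate rule does not scale, which is the efficiency issue raised in point 1 above.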

54
1.4.3 Classification and Prediction


Classification is the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose class label is unknown.

The derived model is based on the analysis of a set of training data (i.e.,
data objects whose class label is known). Describe and distinguish classes or
concepts for future prediction.
 Examples:
 Classify countries based on (climate),
 Classify cars based on (gas mileage)
 The derived model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, mathematical formulae, or
neural networks
 Typical methods
 Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …

 Typical applications:
 Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …

Prediction: models continuous-valued functions. That is, it is used to predict …
55
Example of a Flat Dataset

The Buys Computer Dataset: This follows an example from Quinlan’s ID3
Example: A Decision Tree

The output of the classification task using the decision tree method for the “buys_computer” class

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.

Decision trees can easily be converted to classification rules.

Example: If Age = 30..40 Then Buys_Computer = yes.

[Figure: decision tree for buys_computer]
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
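Converting a tree to rules can be seen by writing the tree as nested conditions, one path per rule. The sketch below assumes the tree reads as follows, which is how the slide's figure appears to be laid out: age <= 30 leads to a student test (no gives no, yes gives yes), age 30..40 gives yes, and age > 40 leads to a credit rating test (excellent gives no, fair gives yes). The function name and argument types are illustrative choices.

```python
def buys_computer(age, student, credit_rating):
    """Classify one customer by walking the assumed buys_computer tree.

    age is a number, student is a bool, credit_rating is "fair" or "excellent".
    """
    if age <= 30:
        # Rule: IF age <= 30 AND student = yes THEN buys_computer = yes
        return "yes" if student else "no"
    if age <= 40:
        # Rule: IF age = 30..40 THEN buys_computer = yes
        return "yes"
    # Rule: IF age > 40 AND credit_rating = fair THEN buys_computer = yes
    return "yes" if credit_rating == "fair" else "no"


print(buys_computer(25, student=True, credit_rating="fair"))
print(buys_computer(35, student=False, credit_rating="excellent"))
print(buys_computer(50, student=False, credit_rating="excellent"))
```

Each return statement corresponds to one leaf, so the function contains exactly as many rules as the tree has leaves.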
Example (2) : A Decision Tree
1.4.4 Cluster Analysis

 Unsupervised learning (i.e., Class label is unknown)


 Group data to form new categories (i.e., clusters),
 Example: cluster houses to find distribution patterns
 Principle: Maximizing intra-class similarity & minimizing interclass
similarity
 Many methods and applications
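One widely used clustering method is k-means, named here as an example since the slide does not single out a method. The sketch below runs it on one-dimensional data; the house prices, the two starting centers, and the function name are all hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers by repeatedly assigning each to its nearest center
    and then moving every center to the mean of its assigned values."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical house prices in thousands; no class labels are given.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = k_means_1d(prices, centers=[100, 320])
print(centers)
print(clusters)
```

The prices fall into a low group and a high group without any label being supplied, which is what makes the task unsupervised: values inside a cluster are close to each other and far from the other cluster.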

Cluster

59
1.4.5 Outlier Analysis

 Outlier: A data object that does not comply with the general behavior
of the data
 Noise or exception? ― One person’s garbage could be another
person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
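A very simple way to flag such objects is a z-score test: mark any value that lies more than a chosen number of standard deviations from the mean. This is a basic statistical technique used here for illustration, not one of the clustering or regression by-products named above, and the transaction amounts and the threshold of 2.0 are assumptions.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold sample standard deviations."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical credit card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
print(z_score_outliers(amounts))
```

Whether the flagged amount is noise to discard or a fraud case worth investigating depends on the application, which is the point of the "garbage or treasure" remark above.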

Outlier

60
1.4.6 Are All the “Discovered” Patterns
Interesting?

 Data mining may generate thousands of patterns: Not all of them are interesting
 Suggested approach: Human-centered, query-based, focused
mining
 Interestingness measures
 A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty,
potentially useful, novel, or validates some hypothesis that a
user seeks to confirm
 Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
 Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
61
1.4.6 Evaluation of Knowledge
 Evaluation of mined knowledge → directly mine only interesting
knowledge?
 Descriptive vs. predictive
 Coverage
 Typicality vs. novelty
 Accuracy
 Timeliness
 …

62
Break

63
1.5 Which Technologies Are Used?

 Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science (Figure 1.12).

 Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing.

 Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.
See next figure

64
Data Mining Is a Confluence of Disciplines

Data mining involves an integration of techniques from multiple disciplines such as database
and data warehouse technology, statistics, machine learning, high-performance computing,
pattern recognition, neural networks, data visualization, information retrieval, image and
signal processing, and spatial or temporal data analysis.
65
Why Confluence of Multiple Disciplines?

 Tremendous amount of data


 Algorithms must be highly scalable to handle data volumes such as terabytes
 High-dimensionality of data
 Some Data may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications

66
1.6 Which Kinds of Applications Are
Targeted?
 Where there are data, there are data mining
applications.
 As a young research field, data mining has made
broad and significant progress since its early
beginnings in the 1980s.

 Today, data mining is used in a vast array of areas, and numerous commercial data mining systems are available.

 As a highly application-driven discipline, data mining has seen great successes in many applications. It is impossible to enumerate all applications where data mining plays a critical role.

67
1.6 Which Kinds of Applications Are
Targeted?

 1.6.1 Business Intelligence


 1.6.2 Web Search Engines
 …..
 ….

68
Business Intelligence
 Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations. Examples include reporting,
online analytical processing, business performance management,
competitive intelligence, benchmarking, and predictive analytics.
 Without data mining, many businesses may not be able to perform
effective market analysis, compare customer feedback on similar
products, discover the strengths and weaknesses of their competitors,
retain highly valuable customers, and make smart business decisions.
 Clearly, data mining is the core of business intelligence. Online
analytical processing tools in business intelligence rely on data
warehousing and multidimensional data mining. Classification and
prediction techniques are the core of predictive analytics in business
intelligence, for which there are many applications in analyzing
markets, supplies, and sales. Moreover, clustering plays a central role
in customer relationship management, which groups customers based
on their similarities. Using characterization mining techniques, we can
better understand features of each customer group and develop
customized customer reward programs.
Data Mining in Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers from top to bottom, with the typical role in parentheses:]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
70
More Applications and Cases
will be discussed at the last topic of this course

71
Data Mining Software Tools
 Commercial Products

SAS Enterprise Miner

SPSS Clementine

IBM Intelligent Miner

SGI MineSet

i2 Analyst’s Notebook

Microsoft OLE DB for Data Mining

Oracle Data Mining
 Free and open Source Products.

Weka

RapidMiner

Orange

KNIME

 And now Python


Example 1: WEKA - Data Mining
Software

 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
 Weka is open source software written in Java and issued under the GNU General Public License.
 http://www.cs.waikato.ac.nz/ml/weka/
Example 2: RapidMiner - Data Mining Software

 RapidMiner is a software platform for analytics teams that unites data prep, machine learning, and predictive model deployment.

 Try It

 https://rapidminer.com/
Example 3: PYTHON for Data Science

 Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

 You can find some books related to using Python for Data Science.

 https://www.python.org/
CIS 467L (Data Mining Lab)
Summary
 Data mining: Discovering interesting patterns and knowledge from massive amounts of data
 A natural evolution of database technology, in great demand, with wide applications
 A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
 Mining can be performed on a variety of data
 Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
 Data mining technologies and applications

76
Some Pioneers in the Data Mining Field

 U. M. Fayyad,
 G. Piatetsky-Shapiro,
 J. Han
 I. H. Witten
 E. Frank

 Motivate students to Find More >>>>>>

77
Say something about the course lab

CIS 467L

78
Exercises
1.1 What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation of technology developed from databases, statistics, and
machine learning?
(c) Explain how the evolution of database technology led to data mining.
(d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery.

1.6 Define each of the following data mining functionalities: characterization,
discrimination, association and correlation analysis, classification, prediction,
clustering, and evolution analysis. Give examples of each data mining
functionality, using a real-life database with which you are familiar.

1.11 Outliers are often discarded as noise. However, one person’s garbage could
be another’s treasure. For example, exceptions in credit card transactions
can help us detect the fraudulent use of credit cards. Taking fraudulence
detection as an example, propose two methods that can be used to detect
outliers and discuss which one is more reliable.

79
