Data Mining System Architecture Overview

The document discusses key concepts in data mining including data mining primitives, interestingness measures, and presentation of results. It describes the typical architecture of a data mining system including components like the data mining engine, knowledge base, and user interface. It also explains important data mining primitives that define a data mining task like specifying the relevant data, type of knowledge to mine, background knowledge, and how results should be measured and presented.

Uploaded by

Surya Prakash

Data Mining Primitives, Languages and System Architecture
What is Data Mining?
• Data mining refers to extracting or “mining” knowledge from large amounts of data. It is also referred to as Knowledge Discovery in Databases (KDD).
• It is a process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
Architecture of a typical data mining system
[Figure: layered architecture, components listed from top to bottom]
    Graphical user interface
    Pattern evaluation
    Knowledge base
    Data mining engine
    Database or data warehouse server
    Data cleansing, data integration, and filtering
    Database; data warehouse
• Misconception: data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
• Without user intervention, the system would uncover a large set of patterns that may even surpass the size of the database. Hence, user interaction is required.
• This communication between the user and the system is provided by a set of data mining primitives.
Data Mining Primitives
Data mining primitives define a data mining task, which can be specified in the form of a data mining query.
• Task-relevant data
• Kinds of knowledge to be mined
• Background knowledge
• Interestingness measures
• Presentation and visualization of discovered patterns

Task-relevant data
• The portion of the data to be investigated.
• Attributes of interest (relevant attributes) can be specified.
• Initial data relation
• Minable view
Example
• If a data mining task is to study associations between items frequently purchased at AllElectronics by customers in Canada, the task-relevant data can be specified by providing the following information:
    • Name of the database or data warehouse to be used (e.g., AllElectronics_db)
    • Names of the tables or data cubes containing the relevant data (e.g., item, customer, purchases, and items_sold)
    • Conditions for selecting the relevant data (e.g., retrieve data pertaining to purchases made in Canada for the current year)
    • The relevant attributes or dimensions (e.g., name and price from the item table, and income and age from the customer table)
Kind of knowledge to be mined
• It is important to specify the kind of knowledge to be mined, as this determines the data mining function to be performed.
• Kinds of knowledge include concept description, association, classification, prediction, and clustering.
• The user can also provide pattern templates, also called metapatterns, metarules, or metaqueries.
Example
A user studying the buying habits of AllElectronics customers may choose to mine association rules that match the metarule:
    P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Rules such as the following match this metarule (the bracketed values are support and confidence):
    age(X, “30..39”) ^ income(X, “40K..49K”) => buys(X, “VCR”) [2.2%, 60%]
    occupation(X, “student”) ^ age(X, “20..29”) => buys(X, “computer”) [1.4%, 70%]
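
As a rough illustration of how a metarule constrains the rules that are accepted, the sketch below checks whether a concrete rule has the shape P(X, W) ^ Q(X, Y) => buys(X, Z). The `Atom` and `Rule` types and the `matches_metarule` function are hypothetical names introduced here for illustration; they do not come from any particular data mining system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Atom:
    predicate: str  # e.g. "age", "income", "buys"
    subject: str    # the customer variable, e.g. "X"
    value: str      # the bound value, e.g. "30..39"


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[Atom, ...]
    consequent: Atom


def matches_metarule(rule: Rule) -> bool:
    """True if the rule has the shape P(X, W) ^ Q(X, Y) => buys(X, Z)."""
    if len(rule.antecedent) != 2:
        return False
    p, q = rule.antecedent
    # All three atoms must talk about the same customer variable.
    same_subject = p.subject == q.subject == rule.consequent.subject
    return same_subject and rule.consequent.predicate == "buys"


vcr_rule = Rule(
    antecedent=(Atom("age", "X", "30..39"), Atom("income", "X", "40K..49K")),
    consequent=Atom("buys", "X", "VCR"),
)
too_short = Rule(
    antecedent=(Atom("age", "X", "30..39"),),
    consequent=Atom("buys", "X", "VCR"),
)
print(matches_metarule(vcr_rule), matches_metarule(too_short))
```

The first rule fits the template because it has two antecedent predicates over the same customer and a `buys` consequent; the second is rejected because it has only one antecedent predicate.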
Background knowledge
• Background knowledge is information about the domain to be mined.
• A concept hierarchy is a powerful form of background knowledge.
• Four major types of concept hierarchies:
    schema hierarchies
    set-grouping hierarchies
    operation-derived hierarchies
    rule-based hierarchies
Concept hierarchies (1)
• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level (more general) concepts.
• It allows data to be mined at multiple levels of abstraction.
• Concept hierarchies allow users to view data from different perspectives, giving further insight into the relationships.
• Example (location)
Example
    all (Level 0)
        Canada (Level 1)
            British Columbia (Level 2)
                Vancouver, Victoria (Level 3)
            Ontario (Level 2)
                Toronto, Ottawa (Level 3)
        USA (Level 1)
            New York (Level 2)
                New York, Buffalo (Level 3)
            Illinois (Level 2)
                Chicago (Level 3)

Concept hierarchies (2)
• Rolling Up - Generalization of data
    Allows data to be viewed at more meaningful and explicit abstractions
    Makes the data easier to understand
    Compresses the data
    Requires fewer input/output operations
• Drilling Down - Specialization of data
    Concept values are replaced by lower-level concepts
• There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints.
• Example: a regional sales manager may prefer the previous concept hierarchy, but a marketing manager might prefer to see location organized along linguistic lines in order to facilitate the distribution of commercial ads.
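
Rolling up and drilling down can be sketched with two lookup tables built from the location example. This is a minimal sketch, not a prescribed implementation: the names `CITY_TO_REGION`, `REGION_TO_COUNTRY`, `roll_up`, and `drill_down` are introduced here, "region" stands for the province/state level, and the sales counts are made up for illustration.

```python
from collections import Counter

# Level 3 -> Level 2 and Level 2 -> Level 1 mappings from the location example.
# Separate tables per level keep the city "New York" apart from the state "New York".
CITY_TO_REGION = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Toronto": "Ontario", "Ottawa": "Ontario",
    "New York": "New York", "Buffalo": "New York",
    "Chicago": "Illinois",
}
REGION_TO_COUNTRY = {
    "British Columbia": "Canada", "Ontario": "Canada",
    "New York": "USA", "Illinois": "USA",
}


def roll_up(city_counts, level):
    """Generalize per-city counts to the 'region' or 'country' level."""
    if level not in ("region", "country"):
        raise ValueError("level must be 'region' or 'country'")
    rolled = Counter()
    for city, count in city_counts.items():
        region = CITY_TO_REGION[city]
        key = region if level == "region" else REGION_TO_COUNTRY[region]
        rolled[key] += count
    return dict(rolled)


def drill_down(region):
    """Specialize a region into the cities directly below it."""
    return sorted(city for city, r in CITY_TO_REGION.items() if r == region)


# Hypothetical sales counts per city, for illustration only.
sales = {"Vancouver": 5, "Victoria": 2, "Toronto": 7, "Chicago": 4, "Buffalo": 1}
by_region = roll_up(sales, "region")
by_country = roll_up(sales, "country")
print(by_region, by_country, drill_down("Ontario"))
```

Rolling up shrinks five city-level rows to four region-level rows and then to two country-level rows, which is the compression effect described above.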
Schema hierarchies
• A schema hierarchy is a total or partial order among attributes in the database schema.
• It may formally express existing semantic relationships between attributes.
• It provides metadata information.
• Example: location hierarchy
    street < city < province/state < country
Set-grouping hierarchies
• Organizes values for a given attribute into groups, sets, or ranges of values.
• A total or partial order can be defined among groups.
• Used to refine or enrich schema-defined hierarchies.
• Typically used for small sets of object relationships.
• Example: set-grouping hierarchy for age
    {young, middle_aged, senior} ⊂ all(age)
    {20..39} ⊂ young
    {40..59} ⊂ middle_aged
    {60..89} ⊂ senior
Operation-derived hierarchies
• Operation-derived hierarchies are based on specified operations, which may include:
    decoding of information-encoded strings
    information extraction from complex data objects
    data clustering
• Example: URL or email address
    [email protected] gives login name < dept. < univ. < country
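
Decoding an information-encoded string can be sketched as follows. This assumes an address of the form login@dept.univ.country; the function name `decode_email` and the sample address are hypothetical, and real addresses often have a different number of domain parts.

```python
def decode_email(address):
    """Split login@dept.univ.country into hierarchy levels, most specific first."""
    login, _, domain = address.partition("@")
    parts = domain.split(".")
    if not login or len(parts) != 3:
        raise ValueError(f"expected login@dept.univ.country, got {address!r}")
    dept, univ, country = parts
    # Ordered as login name < dept. < univ. < country.
    return [
        ("login", login),
        ("department", dept),
        ("university", univ),
        ("country", country),
    ]


levels = decode_email("jdoe@cs.example.ca")  # hypothetical address
print(levels)
```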
Rule-based hierarchies
• A rule-based hierarchy occurs when either the whole or a portion of a concept hierarchy is defined as a set of rules and is evaluated dynamically based on the current database data and the rule definitions.
• Example: the following rules categorize items as low_profit_margin, medium_profit_margin, and high_profit_margin.
    low_profit_margin(X) <= price(X, P1) ^ cost(X, P2) ^ ((P1 - P2) < 50)
    medium_profit_margin(X) <= price(X, P1) ^ cost(X, P2) ^ ((P1 - P2) ≥ 50) ^ ((P1 - P2) ≤ 250)
    high_profit_margin(X) <= price(X, P1) ^ cost(X, P2) ^ ((P1 - P2) > 250)
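
The three rules translate directly into a function that is evaluated on each item's current price and cost. A minimal sketch, with the function name `profit_margin_category` chosen here for illustration:

```python
def profit_margin_category(price, cost):
    """Classify an item by its margin (price - cost) using the three rules above."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:  # 50 <= margin <= 250
        return "medium_profit_margin"
    return "high_profit_margin"  # margin > 250


print(profit_margin_category(100, 60))   # margin 40
print(profit_margin_category(350, 100))  # margin 250
print(profit_margin_category(400, 100))  # margin 300
```

Because the category is computed from the data rather than stored, an item moves between categories whenever its price or cost changes.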
Interestingness measure (1)
• Used to limit the number of uninteresting patterns returned by the process.
• Based on the structure of patterns and the statistics underlying them.
• Each measure is associated with a threshold that can be controlled by the user.
• Patterns not meeting the threshold are not presented to the user.
• Objective measures of pattern interestingness:
    simplicity
    certainty (confidence)
    utility (support)
    novelty
Interestingness measure (2)
• Simplicity
    A pattern's interestingness is based on its overall simplicity for human comprehension.
    Example: rule length is a simplicity measure.
• Certainty (confidence)
    Assesses the validity or trustworthiness of a pattern.
    Confidence is a certainty measure:
    confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
    A confidence of 85% for the rule buys(X, “computer”) => buys(X, “software”) means that 85% of all customers who purchased a computer also bought software.
Interestingness measure (3)
• Utility (support)
    Assesses the usefulness of a pattern.
    support(A => B) = (# tuples containing both A and B) / (total # of tuples)
    A support of 30% for the previous rule means that 30% of all customers in the computer department purchased both a computer and software.
• Association rules that satisfy both the minimum confidence and the minimum support thresholds are referred to as strong association rules.
• Novelty
    Patterns that contribute new information to the given pattern set are called novel patterns (example: a data exception).
    Removing redundant patterns is a strategy for detecting novelty.
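
The support and confidence definitions, and the test for a strong association rule, can be sketched on a small transaction list. The function names and the five transactions below are made up for illustration; each transaction is modeled as a set of items, and `a` and `b` are item sets.

```python
def support(transactions, a, b):
    """Fraction of all transactions that contain both A and B."""
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / len(transactions)


def confidence(transactions, a, b):
    """Fraction of the transactions containing A that also contain B."""
    with_a = [t for t in transactions if a <= t]
    if not with_a:
        return 0.0
    return sum(1 for t in with_a if b <= t) / len(with_a)


def is_strong(transactions, a, b, min_support, min_confidence):
    """A rule is strong if it meets both user-controlled thresholds."""
    return (support(transactions, a, b) >= min_support
            and confidence(transactions, a, b) >= min_confidence)


# Hypothetical purchase data, for illustration only.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]
a, b = {"computer"}, {"software"}
print(support(transactions, a, b), confidence(transactions, a, b))
```

Here two of five transactions contain both items (support 2/5) and two of the three computer purchases include software (confidence 2/3), so the rule is strong at thresholds of 30% and 60% but not if the confidence threshold is raised to 70%.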
Presentation and visualization
• For data mining to be effective, data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations.
• The user must be able to specify the forms of presentation to be used for displaying the discovered patterns.
Data mining query languages
• A data mining query language must be designed to facilitate flexible and effective knowledge discovery.
• Having a query language for data mining may help standardize the development of platforms for data mining systems.
• But designing such a language is challenging because data mining covers a wide spectrum of tasks, and each task has different requirements.
• Hence, the design of a language requires a deep understanding of the limitations and underlying mechanisms of the various kinds of tasks.
Data mining query languages (2)
• So, how would you design an efficient query language?
• Base it on the primitives discussed earlier.
• DMQL (Data Mining Query Language) allows mining of different kinds of knowledge from relational databases and data warehouses at multiple levels of abstraction.
DMQL
• Adopts an SQL-like syntax
• Hence, it can be easily integrated with relational query languages
• Defined in a BNF grammar
    [ ] represents zero or one occurrence
    { } represents zero or more occurrences
    Words in sans serif represent keywords
DMQL - Syntax for task-relevant data specification
• The names of the relevant database or data warehouse, the conditions, and the relevant attributes or dimensions must be specified:
    • use database ‹database_name› or use data warehouse ‹data_warehouse_name›
    • from ‹relation(s)/cube(s)› [where ‹condition›]
    • in relevance to ‹attribute_or_dimension_list›
    • order by ‹order_list›
    • group by ‹grouping_list›
    • having ‹condition›
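
To show how the clauses fit together, the sketch below assembles a task-relevant data specification as plain text, using the AllElectronics details from the earlier example. It only builds a string in the clause order listed above; no DMQL engine is assumed. The function name `build_task_relevant_spec` and the `where` condition are hypothetical, and treating `order by`, `group by`, and `having` as optional is an assumption of this sketch.

```python
def build_task_relevant_spec(database, relations, attributes,
                             where=None, order_by=None, group_by=None, having=None):
    """Assemble the clauses in the order they are listed above."""
    from_clause = "from " + ", ".join(relations)
    if where:
        from_clause += f" where {where}"
    lines = [
        f"use database {database}",
        from_clause,
        "in relevance to " + ", ".join(attributes),
    ]
    # Assumption: the remaining clauses are emitted only when supplied.
    for keyword, value in (("order by", order_by),
                           ("group by", group_by),
                           ("having", having)):
        if value:
            lines.append(f"{keyword} {value}")
    return "\n".join(lines)


query = build_task_relevant_spec(
    database="AllElectronics_db",
    relations=["item", "customer", "purchases", "items_sold"],
    attributes=["name", "price", "income", "age"],
    where='country = "Canada"',  # hypothetical column name and condition
)
print(query)
```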
Example
