Data Mining Primitives, Languages and System Architecture
What is Data Mining?
Data mining refers to extracting or “mining” knowledge from large
amounts of data. It is also referred to as Knowledge Discovery in Databases (KDD).
It is a process of discovering interesting knowledge from large amounts of
data stored either in databases, data warehouses, or other information
repositories.
Architecture of a typical data mining system
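[Figure: layered architecture, listed from top to bottom]
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge base (consulted by the pattern evaluation module and the data mining engine)
Database or data warehouse server
Data cleansing, data integration, and filtering
Database and data warehouse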
Misconception: Data mining systems can autonomously dig out all of the
valuable knowledge from a given large database, without human
intervention.
Without user intervention, the system would uncover a large set
of patterns, possibly even larger than the database itself, and most of them would be of no interest to the user. Hence, user
interaction is required.
Users communicate with the system through a set of data
mining primitives.
Data Mining Primitives
Data mining primitives define a data mining task, which can be specified in
the form of a data mining query.
Task Relevant Data
Kinds of knowledge to be mined
Background knowledge
Interestingness measure
Presentation and visualization of discovered patterns
Task relevant data
Data portion to be investigated.
Attributes of interest (relevant attributes) can be specified.
Initial data relation
Minable view
Example
If a data mining task is to study associations between items frequently
purchased at AllElectronics by customers in Canada, the task relevant data
can be specified by providing the following information:
Name of the database or data warehouse to be used (e.g., AllElectronics_db)
Names of the tables or data cubes containing relevant data (e.g., item, customer,
purchases and items_sold)
Conditions for selecting the relevant data (e.g., retrieve data pertaining to
purchases made in Canada for the current year)
The relevant attributes or dimensions (e.g., name and price from the item table
and income and age from the customer table)
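The four pieces of information above map naturally onto a relational query. The following is a minimal sketch using Python's built-in sqlite3 module, not the way any particular data mining system does it. The column names (item_id, trans_id, cust_id, country, year), the sample rows, and the year 2024 are all assumptions made for illustration; only the table names and the attributes name, price, income, and age come from the example.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory stand-in for AllElectronics_db (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT, price REAL);
        CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, income INTEGER, age INTEGER);
        CREATE TABLE purchases (trans_id INTEGER PRIMARY KEY, cust_id INTEGER,
                                country TEXT, year INTEGER);
        CREATE TABLE items_sold (trans_id INTEGER, item_id INTEGER);
        INSERT INTO item VALUES (1, 'computer', 900.0), (2, 'VCR', 150.0);
        INSERT INTO customer VALUES (10, 45000, 35), (11, 30000, 24);
        INSERT INTO purchases VALUES (100, 10, 'Canada', 2024), (101, 11, 'USA', 2024);
        INSERT INTO items_sold VALUES (100, 1), (100, 2), (101, 1);
        """
    )
    return conn


def select_task_relevant_data(conn, country, year):
    """Return (name, price, income, age) rows for purchases in one country and year."""
    query = """
        SELECT i.name, i.price, c.income, c.age
        FROM item AS i
        JOIN items_sold AS s ON s.item_id = i.item_id
        JOIN purchases AS p ON p.trans_id = s.trans_id
        JOIN customer AS c ON c.cust_id = p.cust_id
        WHERE p.country = ? AND p.year = ?
        ORDER BY i.name
    """
    return conn.execute(query, (country, year)).fetchall()


if __name__ == "__main__":
    for row in select_task_relevant_data(make_demo_db(), "Canada", 2024):
        print(row)
```

The result of such a query plays the role of the initial data relation: it contains only the selected rows and the relevant attributes, and mining then runs on it rather than on the whole database.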
Kind of knowledge to be mined
It is important to specify the knowledge to be mined, as this determines the
data mining function to be performed.
Kinds of knowledge include concept description, association, classification,
prediction and clustering.
Users can also provide pattern templates, also called metapatterns,
metarules, or metaqueries, that constrain the form of the patterns to be found.
Example
A user studying the buying habits of AllElectronics customers may
choose to mine association rules of the form:
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Here P and Q are predicate variables that can be instantiated to attributes of the task-relevant data.
Rules that match this metarule, such as the following, can then be found (each is annotated with [support, confidence]):
age(X, “30...39”) ^ income(X, “40K...49K”) => buys(X, “VCR”)
[2.2%, 60%]
occupation(X, “student”) ^ age(X, “20...29”) => buys(X, “computer”)
[1.4%, 70%]
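One way to picture how a template of the form P(X, W) ^ Q(X, Y) => buys(X, Z) filters candidate rules is the sketch below. The Predicate and Rule classes and the matches_metarule function are hypothetical names introduced here for illustration. The check is deliberately simple: it only asks whether a rule has exactly two antecedent predicates and a buys consequent.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Predicate:
    name: str   # e.g. "age", "income", "buys"
    value: str  # e.g. "30...39", "VCR"


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[Predicate, ...]  # predicates joined by ^
    consequent: Predicate
    support: float      # fraction, e.g. 0.022 for 2.2%
    confidence: float   # fraction, e.g. 0.60 for 60%


def matches_metarule(rule: Rule) -> bool:
    """True if the rule has the shape P(X, W) ^ Q(X, Y) => buys(X, Z)."""
    return len(rule.antecedent) == 2 and rule.consequent.name == "buys"


# The two rules from the example, plus one rule that does not fit the template.
RULES = [
    Rule((Predicate("age", "30...39"), Predicate("income", "40K...49K")),
         Predicate("buys", "VCR"), 0.022, 0.60),
    Rule((Predicate("occupation", "student"), Predicate("age", "20...29")),
         Predicate("buys", "computer"), 0.014, 0.70),
    Rule((Predicate("age", "20...29"),),
         Predicate("buys", "computer"), 0.050, 0.40),
]

if __name__ == "__main__":
    for rule in RULES:
        print(rule.consequent.value, matches_metarule(rule))
```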
Background knowledge
It is the information about the domain to be mined
Concept hierarchy: is a powerful form of background knowledge.
Four major types of concept hierarchies:
schema hierarchies
set-grouping hierarchies
operation-derived hierarchies
rule-based hierarchies
Concept hierarchies (1)
Defines a sequence of mappings from a set of low-level concepts to higher-
level (more general) concepts.
Allows data to be mined at multiple levels of abstraction.
These allow users to view data from different perspectives, allowing further
insight into the relationships.
Example (location)
Level 0: all
Level 1: Canada, USA
Level 2: British Columbia, Ontario (in Canada); New York, Illinois (in USA)
Level 3: Vancouver, Victoria (in British Columbia); Toronto, Ottawa (in Ontario); New York, Buffalo (in New York); Chicago (in Illinois)
Concept hierarchies (2)
Rolling Up - Generalization of data
Allows data to be viewed at more meaningful and explicit levels of abstraction.
Makes it easier to understand
Compresses the data
Would require fewer input/output operations
Drilling Down - Specialization of data
Concept values replaced by lower level concepts
There may be more than one concept hierarchy for a given attribute or
dimension, based on different user viewpoints.
Example:
A regional sales manager may prefer the previous concept hierarchy, but a
marketing manager might prefer to see location organized along linguistic
lines in order to facilitate the distribution of commercial ads.
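Rolling up along the location hierarchy can be sketched as replacing each city by its ancestor and summing. The snippet below is only an illustration under stated assumptions: the per-city sales counts are invented, and the function and dictionary names are hypothetical. Two separate maps are used because "New York" appears both as a city and as a state.

```python
from collections import Counter

# Level 3 -> Level 2 (city -> province/state), taken from the location example.
CITY_TO_REGION = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
    "New York": "New York",
    "Buffalo": "New York",
    "Chicago": "Illinois",
}

# Level 2 -> Level 1 (province/state -> country).
REGION_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}


def roll_up(sales_by_city, level):
    """Aggregate per-city counts to 'province_or_state' or 'country' level."""
    if level not in ("province_or_state", "country"):
        raise ValueError(f"unknown level: {level}")
    totals = Counter()
    for city, count in sales_by_city.items():
        region = CITY_TO_REGION[city]
        key = region if level == "province_or_state" else REGION_TO_COUNTRY[region]
        totals[key] += count
    return dict(totals)


if __name__ == "__main__":
    sales = {"Vancouver": 5, "Victoria": 3, "Toronto": 4, "Chicago": 2}  # invented data
    print(roll_up(sales, "province_or_state"))
    print(roll_up(sales, "country"))
```

Drilling down is the reverse direction: it needs the detailed per-city data to be kept, since the aggregated totals alone cannot be split back into cities.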
Schema hierarchies
A schema hierarchy is a total or partial order among attributes in the
database schema.
May formally express existing semantic relationships between attributes.
Provides metadata information.
Example: location hierarchy
street < city < province/state < country
Set-grouping hierarchies
Organizes the values of a given attribute into groups, sets, or ranges of values.
Total or partial order can be defined among groups.
Used to refine or enrich schema-defined hierarchies.
Typically used for small sets of object relationships.
Example: Set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20...29} ⊂ young
{40...59} ⊂ middle_aged
{60...89} ⊂ senior
Operation-derived hierarchies
Operation-derived:
based on operations specified
operations may include
decoding of information-encoded strings
information extraction from complex data objects
data clustering
Example: URL or email address
[email protected] gives login name < dept. < univ. < country
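Decoding an information-encoded string can be sketched as follows. This is an illustration only: the address jdoe@cs.example.ca is a made-up placeholder, the function name is hypothetical, and the code assumes the domain has exactly three labels (dept.univ.country), which real addresses often do not.

```python
def email_hierarchy(address):
    """Split login@dept.univ.country into its four hierarchy levels."""
    login, sep, domain = address.partition("@")
    labels = domain.split(".")
    if not sep or not login or len(labels) != 3 or not all(labels):
        raise ValueError(f"expected login@dept.univ.country, got {address!r}")
    dept, univ, country = labels
    # Ordered from the most specific level to the most general one.
    return {"login": login, "dept": dept, "univ": univ, "country": country}


if __name__ == "__main__":
    print(email_hierarchy("jdoe@cs.example.ca"))  # hypothetical address
```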
Rule-based hierarchies
Rule-based:
Occurs when either the whole or a portion of a concept hierarchy is defined as a
set of rules and is evaluated dynamically based on the current database data and
the rule definitions.
Example: The following rules categorize items as low_profit_margin,
medium_profit_margin, or high_profit_margin.
low_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)<50)
medium_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)≥50)^((P1-P2)≤250)
high_profit_margin(X) <= price(X,P1)^cost(X,P2)^((P1-P2)>250)
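The three rules can be read as a single function of price and cost. A minimal sketch, with a hypothetical function name and the thresholds 50 and 250 taken directly from the rules:

```python
def profit_margin_category(price, cost):
    """Classify an item by its margin P1 - P2, following the three rules above."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:  # 50 <= margin <= 250
        return "medium_profit_margin"
    return "high_profit_margin"  # margin > 250


if __name__ == "__main__":
    for price, cost in [(100, 60), (100, 50), (300, 50), (301, 50)]:
        print(price, cost, profit_margin_category(price, cost))
```

Because the category is computed from the current price and cost, an item moves between categories as soon as those values change in the database, which is what "evaluated dynamically" means here.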
Interestingness measure (1)
Used to limit the number of uninteresting patterns returned by the
process.
Based on the structure of patterns and the statistics underlying them.
Each measure is associated with a threshold that the user can control.
Patterns not meeting the threshold are not presented to the user.
Objective measures of pattern interestingness:
simplicity
certainty (confidence)
utility (support)
novelty
Interestingness measure (2)
Simplicity
A pattern's interestingness is based on its overall simplicity for human
comprehension.
Example: Rule length is a simplicity measure
Certainty (confidence)
Assesses the validity or trustworthiness of a pattern.
Confidence is a certainty measure:
confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
A confidence of 85% for the rule buys(X, “computer”)=>buys(X,“software”)
means that 85% of all customers who purchased a computer also bought
software
Interestingness measure (3)
Utility (support)
usefulness of a pattern
support(A => B) = (# tuples containing both A and B) / (total # of tuples)
A support of 30% for the previous rule means that 30% of all customers in
the computer department purchased both a computer and software.
Association rules that satisfy both the minimum confidence and support
threshold are referred to as strong association rules.
Novelty
Patterns contributing new information to the given pattern set are called
novel patterns (example: Data exception).
Removing redundant patterns is a strategy for detecting novelty.
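The support and confidence formulas, and the test for a strong association rule, can be checked on a toy data set. The sketch below treats each tuple as a set of items. The five transactions and the threshold values are invented for illustration, and the function names are hypothetical.

```python
def support(transactions, a, b):
    """Fraction of all transactions that contain both itemsets a and b."""
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / len(transactions)


def confidence(transactions, a, b):
    """Fraction of the transactions containing a that also contain b."""
    with_a = sum(1 for t in transactions if a <= t)
    if with_a == 0:
        return 0.0  # rule never applies; treat as zero confidence
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / with_a


def is_strong(transactions, a, b, min_support, min_confidence):
    """A rule is strong if it meets both user-specified thresholds."""
    return (support(transactions, a, b) >= min_support
            and confidence(transactions, a, b) >= min_confidence)


# Invented toy data: each set is one customer's purchase.
TRANSACTIONS = [
    {"computer", "software"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
]

if __name__ == "__main__":
    a, b = {"computer"}, {"software"}
    print(support(TRANSACTIONS, a, b), confidence(TRANSACTIONS, a, b))
    print(is_strong(TRANSACTIONS, a, b, min_support=0.3, min_confidence=0.6))
```

Here the rule computer => software appears in 2 of 5 transactions (support 40%) and in 2 of the 3 transactions that contain a computer (confidence about 67%), so it is strong under thresholds of 30% and 60% but not if the confidence threshold is raised to 70%.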
Presentation and visualization
For data mining to be effective, data mining systems should be able to
display the discovered patterns in multiple forms, such as rules, tables,
crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or
other visual representations.
Users must be able to specify the forms of presentation to be used for
displaying the discovered patterns.
Data mining query languages
A data mining language must be designed to facilitate flexible and effective
knowledge discovery.
Having a query language for data mining may help standardize the
development of platforms for data mining systems.
But designing such a language is challenging because data mining covers a wide
spectrum of tasks, and each task has different requirements.
Hence, the design of a language requires a deep understanding of the
limitations and underlying mechanisms of the various kinds of tasks.
Data mining query languages (2)
So, how would you design an efficient query language?
Base it on the primitives discussed earlier.
DMQL (Data Mining Query Language) allows mining of different kinds of knowledge from relational
databases and data warehouses at multiple levels of abstraction.
DMQL
Adopts SQL-like syntax
Hence, can be easily integrated with relational query languages
Defined in BNF grammar
[ ] represents 0 or one occurrence
{ } represents 0 or more occurrences
Words in sans serif represent keywords
DMQL-Syntax for task-relevant data specification
Names of the relevant database or data warehouse, conditions and relevant
attributes or dimensions must be specified
use database ‹database_name› or use data warehouse ‹data_warehouse_name›
from ‹relation(s)/cube(s)› [where ‹condition›]
in relevance to ‹attribute_or_dimension_list›
[order by ‹order_list›]
[group by ‹grouping_list›]
[having ‹condition›]
Example