Algorithmics Research On Knowledge Discovery and Data Mining
Vladimir Estivill-Castro
School of Computing and Information Technology
Outline
• Example of algorithm: Attribute Oriented Generalization
Motivation
Our technology to analyze data and understand massive datasets lags far behind our technology to gather and store data.
Emerged in Databases
• Massaging the data so statistics would reflect my preconceived hypothesis
• Data => Hypothesis vs. Hypothesis validated by data
Knowledge Discovery
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data sets.
• Data: The geo-referenced layers.
• Information: The average population per administrative region.
• Knowledge: The patterns of growth of population densities and valid explanations for them.
• Nontrivial: it goes beyond computing closed-form quantities or evaluating models.
• Potentially useful
• Understandable
• Valid: the discovered patterns are true, with some degree of certainty, on unseen data.
Different answers
• Grouping.
• Classification.
• Clustering.

• Characterization: Generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions.
• Association: Rules like inside(x, city) => near(x, highway).
• Classification: Classify data based on the values in a classifying attribute, e.g., classify countries based on climate.
• Clustering: Cluster data to form new classes, e.g., cluster houses to find distribution patterns.
• Trend and deviation analysis: Find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., housing market analysis.
• Pattern-directed analysis: Find and characterize user-specified patterns in large databases, e.g., volcanoes on Mars.
Example: Retail
• Bar-code technology makes it possible to collect and store massive amounts of sales data (the basket data).
• Information-driven marketing processes demand mining association rules over basket data.
• "98% of customers that purchase tires and auto accessories also get service done."
Application of Rules
• catalog design, add-on sales
• store layout
• customer segmentation based on buying patterns
Illustration
[Table: purchases marked with X against the columns Product B, Product C, Product D]
• "90% of purchases that have bread and butter also include milk."
• It is a rule of the form A => B (90%).
• A is the antecedent.
• B is the consequent.
• There is a confidence value associated with the rule.
• Find all rules that have sausage in the antecedent and mustard in the consequent
  • what items should be sold with sausages to make it highly likely that mustard will also be sold.
• Find all rules relating items located in shelves A and B
  • understand if the distance affects the sales of items from both shelves.
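Once rules have been mined, queries such as "sausage in the antecedent and mustard in the consequent" amount to filtering the rule set. The sketch below assumes a hypothetical `Rule` record and made-up rules and confidence values; it is not tied to any particular mining system.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    confidence: float


def select_rules(rules: List[Rule],
                 in_antecedent: Iterable[str] = (),
                 in_consequent: Iterable[str] = ()) -> List[Rule]:
    """Keep rules whose antecedent and consequent contain the required items."""
    need_a = frozenset(in_antecedent)
    need_c = frozenset(in_consequent)
    return [r for r in rules
            if need_a <= r.antecedent and need_c <= r.consequent]


# Hypothetical mined rules (values invented for illustration).
rules = [
    Rule(frozenset({"sausage", "bread"}), frozenset({"mustard"}), 0.80),
    Rule(frozenset({"sausage"}), frozenset({"mustard"}), 0.55),
    Rule(frozenset({"bread", "butter"}), frozenset({"milk"}), 0.90),
]

matches = select_rules(rules, {"sausage"}, {"mustard"})
# The highest-confidence match suggests what to sell together with sausages.
best = max(matches, key=lambda r: r.confidence)
```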
[Figure: Data, Target Data, Patterns, Knowledge]
Multi-disciplinary
• Finding useful patterns in data is known by different names among different communities:
  • data mining (statistics, databases)
  • knowledge discovery
  • information discovery, information harvesting
  • data archeology
  • pattern processing
A multi-disciplinary field
[Figure: Databases, Statistics, Knowledge Discovery]
• Databases
• Machine Learning
• Pattern Recognition
• Artificial Intelligence
• Knowledge Acquisition
• Scientific Discovery
• High-Performance Computing
• Algorithms (Analysis and Design)
• Statistics
• probabilistic models
• descriptive, nonparametric, exploratory
• mathematically sound (advanced)
• informative and predictive

• concern for computability and scalability
• interpret data
• understandable
• data on electronic media (and structured)
• KDD - ACM
• but now IEEE conference on KDD
• SIAM conference on KDD
• PAKDD (2001 - 5th Pacific Rim Conf. on KDD)
• PKDD (2000 - 4th European Conf. on Principles and Practice of KDD)
• Database conferences
  • SIGMOD
Successful Example
• Recent
• 250 fields per customer
• back to 1914!
• over nine million records
• 39% of customers had business and personal accounts with the bank
• this cluster accounted for 27% of the 11% of customers that had been classified by a decision tree as likely respondents to a home equity loan offer
Text Mining
• Results:
  • There exists a text (or a set of texts) such as speak-much-of Miss Lewinsky & speak-little-of Mrs. Clinton
  • There exists another text such as speak-a-lot-of Mrs. Clinton & do-not-speak-of Miss Lewinsky
Data Mining
• system generates hypothesis
• search (and visualization) in abstract space
• inductive generalizations exceeding content of database

GIS
• user generates hypothesis
• visualization in geographical space
• shows what's inside the data
Spatial Associations
FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course"
FROM Washington_Golf_courses, Washington
WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")
  AND Washington.CFCC <> "D81"
IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC
SET SUPPORT THRESHOLD 0.5
• Spatial association: Association relationship containing spatial predicates, e.g., close_to, intersect, contains, etc.
• Topological relations:
  • intersects, overlaps, disjoint, etc.
• Spatial orientations:
  • left_of, west_of, under, etc.
• Distance information:
• g_close_to: near_by, touch, intersect, contain, etc.
• First search for rough relationship and then refine it.
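The "rough relationship first, then refine" idea can be illustrated as a filter-and-refine search. In this sketch, which is an assumption and not the method of any specific system, the rough step compares minimum bounding rectangles and the refinement step measures the distance between object vertices (a simplification of exact geometry). All class and function names are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple


@dataclass(frozen=True)
class SpatialObject:
    name: str
    points: Tuple[Tuple[float, float], ...]  # vertices of the object

    def mbr(self) -> Tuple[float, float, float, float]:
        """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
        xs = [p[0] for p in self.points]
        ys = [p[1] for p in self.points]
        return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a: SpatialObject, b: SpatialObject) -> float:
    """Distance between bounding rectangles: a lower bound on vertex distance."""
    ax0, ay0, ax1, ay1 = a.mbr()
    bx0, by0, bx1, by1 = b.mbr()
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)
    dy = max(by0 - ay1, ay0 - by1, 0.0)
    return hypot(dx, dy)


def vertex_distance(a: SpatialObject, b: SpatialObject) -> float:
    """Smallest distance between any vertex of a and any vertex of b."""
    return min(hypot(px - qx, py - qy)
               for px, py in a.points for qx, qy in b.points)


def rough_filter(target, candidates, max_dist) -> List[SpatialObject]:
    """Rough step: cheap rectangle test that never discards a true answer."""
    return [c for c in candidates if mbr_distance(target, c) <= max_dist]


def close_to(target, candidates, max_dist) -> List[SpatialObject]:
    """Refinement step: the costlier test runs only on the rough survivors."""
    return [c for c in rough_filter(target, candidates, max_dist)
            if vertex_distance(target, c) <= max_dist]


# Hypothetical objects in arbitrary map units.
golf = SpatialObject("golf course", ((0.0, 0.0), (1.0, 0.0)))
road = SpatialObject("road", ((2.0, 1.0), (3.0, 1.0)))
river = SpatialObject("river", ((2.0, 10.0), (10.0, 0.0)))
town = SpatialObject("town", ((20.0, 20.0), (21.0, 21.0)))

rough = rough_filter(golf, [road, river, town], 3.0)
refined = close_to(golf, [road, river, town], 3.0)
```

The river passes the rough step because its bounding rectangle is near the golf course, but it is rejected on refinement; the town is discarded without any detailed computation.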
B.C.?
• Rules mined:
  • is_a(x, large_town) ^ intersect(x, highway) => adjacent_to(x, water). [7%, 85%]
  • is_a(x, large_town) ^ adjacent_to(x, georgia_strait) => close_to(x, u.s.a.). [1%, 78%]
• Mining method:
Spatial Classification
u Generalization-
• Discover centers: local maxima of some non-spatial attribute.
• Determine the (theoretical) trend of some non-spatial attribute when moving away from the centers.
• Discover deviations (from the theoretical trend).
• Explain the deviations.
• Example: Trend of unemployment rate change according to the distance to Munich.
• Similar modeling can be used to study the trend of temperature with altitude, the degree of pollution in relation to regions of population density, etc.
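The first three trend-detection steps can be sketched as follows. This is a simplified illustration under stated assumptions: a single center is taken as the region with the global maximum of the attribute, the theoretical trend is a least-squares line of attribute value against distance from the center, and a deviation is any region whose value differs from the line by more than a chosen tolerance. The region data and all names are hypothetical; explaining the deviations is left to the analyst.

```python
from math import hypot
from typing import List, Tuple

Region = Tuple[str, float, float, float]  # (name, x, y, attribute value)


def find_center(regions: List[Region]) -> Region:
    """Step 1 (simplified): the region with the largest attribute value."""
    return max(regions, key=lambda r: r[3])


def fit_trend(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Step 2: least-squares line value = slope * distance + intercept.

    Assumes the regions are not all at the same distance from the center.
    """
    n = len(pairs)
    mean_d = sum(d for d, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pairs)
    sxy = sum((d - mean_d) * (v - mean_v) for d, v in pairs)
    slope = sxy / sxx
    return slope, mean_v - slope * mean_d


def find_deviations(regions: List[Region], tolerance: float) -> List[str]:
    """Step 3: names of regions that stray from the fitted trend."""
    _, cx, cy, _ = find_center(regions)
    pairs = [(hypot(x - cx, y - cy), value) for _, x, y, value in regions]
    slope, intercept = fit_trend(pairs)
    return [name
            for (name, _, _, _), (d, v) in zip(regions, pairs)
            if abs(v - (slope * d + intercept)) > tolerance]


# Hypothetical regions on a line; the attribute drops by about 2 per unit of
# distance, except at region r4, which is far lower than the trend predicts.
values = [20.0, 18.0, 16.0, 14.0, 4.0, 10.0, 8.0, 6.0]
regions = [(f"r{i}", float(i), 0.0, v) for i, v in enumerate(values)]

center = find_center(regions)
outliers = find_deviations(regions, tolerance=3.0)
```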
Spatial Clustering
• How can we cluster points?
• What are the distinct features of the clusters?
There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!
• Identify volcanoes on the surface of Venus
  • over 30,000 high-resolution images
  • Resolution accuracy: 80%
  • 3 steps: data focusing, feature extraction, and classification learning
• POSS-II
  • 2×10^9 stellar objects (galaxies, stars, etc.) classified
  • Resolution: one magnitude better than in previous studies
  • Classification accuracy: no normalization 75%, with normalization 94%, and compared with neural networks.
• QuakeFinder
  space.
• Finding
  • Get phone data for businesses with a fax number
  • Get usage records of lines to find who dials a business fax number from home for longer than 20 seconds
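The two steps above amount to a join between business fax numbers and line usage records, followed by a filter on call origin and duration. A minimal sketch, assuming hypothetical record fields (`fax`, `caller`, `caller_type`, `callee`, `seconds`) and invented phone numbers:

```python
from typing import Dict, List


def home_fax_callers(businesses: List[Dict],
                     usage_records: List[Dict],
                     min_seconds: int = 20) -> List[str]:
    """Home lines that dialed a business fax number for longer than min_seconds."""
    # Step 1: collect the fax numbers of businesses that have one.
    fax_numbers = {b["fax"] for b in businesses if b.get("fax")}
    # Step 2: scan usage records for long calls from home lines to those numbers.
    callers = {
        r["caller"]
        for r in usage_records
        if r["caller_type"] == "home"
        and r["callee"] in fax_numbers
        and r["seconds"] > min_seconds
    }
    return sorted(callers)


# Hypothetical data.
businesses = [
    {"name": "Print Shop", "fax": "555-0100"},
    {"name": "Bakery", "fax": None},
]
usage_records = [
    {"caller": "555-0201", "caller_type": "home", "callee": "555-0100", "seconds": 95},
    {"caller": "555-0202", "caller_type": "home", "callee": "555-0100", "seconds": 5},
    {"caller": "555-0203", "caller_type": "business", "callee": "555-0100", "seconds": 60},
    {"caller": "555-0204", "caller_type": "home", "callee": "555-0999", "seconds": 120},
]

candidates = home_fax_callers(businesses, usage_records)
```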
Crime detection
• crime investigation (e.g., the Oklahoma City bombing)
• fraud detection [Italy KDD-99 San Diego] but also Australian Taxation Office and HIC [PAKDD-99]
• (Ouyang) Clustering towards improving the reusability in the design phase.
• (Anquetil) Re-modularization of software
KDDM
Association rules
• c% of the programs that use file X also use files Y and Z.
• Results in a group of programs that uses a similar set of files.
p4 => p3 (100%)
Using metrics, OO, KDDM and clustering to split with low coupling.
Approach:
• Create list of related entity pairs (classes, methods and objects)
• Use OO metrics to create sets of metrics CBOSet, RFCSet, and DACSet (DAC: Data Abstraction Coupling) CBO_d, CBO_d, and DAC_t
• Generate matrices of classes that interact (interaction matrices)
• Apply KDDM algorithms (association rules)
• Use hierarchical clustering on the coefficients of the association rules to produce a hierarchical decomposition.
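The last two steps of the approach can be illustrated with a small sketch. It rests on several assumptions that are not taken from the slides: each "transaction" is a set of entities observed interacting together, the coefficient between two entities is the larger of the two rule confidences between them, and clusters are merged by single linkage. The class names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple


def confidence(transactions: List[Set[str]], a: str, b: str) -> float:
    """Confidence of the rule a => b over the transactions."""
    with_a = [t for t in transactions if a in t]
    if not with_a:
        return 0.0
    return sum(1 for t in with_a if b in t) / len(with_a)


def similarity(transactions: List[Set[str]], a: str, b: str) -> float:
    """Assumed coefficient: the stronger of the two rules a => b and b => a."""
    return max(confidence(transactions, a, b), confidence(transactions, b, a))


def linkage(transactions, c1: FrozenSet[str], c2: FrozenSet[str]) -> float:
    """Single linkage: the best similarity between members of two clusters."""
    return max(similarity(transactions, a, b) for a in c1 for b in c2)


def hierarchical_decomposition(
        transactions: List[Set[str]]
) -> List[Tuple[FrozenSet[str], FrozenSet[str], float]]:
    """Merge the two most similar clusters until one remains; return the merges."""
    clusters = [frozenset([e]) for e in sorted(set().union(*transactions))]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: linkage(transactions, *pair))
        merges.append((c1, c2, linkage(transactions, c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Hypothetical interaction data between four classes.
transactions = [
    {"Lexer", "Parser"},
    {"Lexer", "Parser"},
    {"Lexer", "Parser", "Editor"},
    {"Editor", "Toolbar"},
    {"Editor", "Toolbar"},
]

merges = hierarchical_decomposition(transactions)
```

Read from last to first, the merge list gives a top-down decomposition: the final merge joins the two weakly coupled groups, which is where a split with low coupling would be made.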
HTML Editor
MOZILLA
HTML Composer/Editor - Mozilla

SUMMARY Symbol Table Statistics of 30 projects
==============================================
Files:      111
Includes:   697
Macros:     147
Functions:  60
Types:      0
Variables:  79
Enums:      4
Userdef:    0
Classes:    90
Inst Vars:  415
Methods:    1320
Friends:    25
Localdefs:  46

SUMMARY File Type Statistics of 30 projects
==============================================
File Type:            Number of Files:
HTML                  4
Header                65
IDL Interface         4
Image                 57
Implementation        42
Project Description   38

SUMMARY Symbol Table Statistics of 1223 projects
========================================
Files:               6713
Includes:            36492
Macros:              27024
Functions:           15898
Types:               3176
Variables:           11151
Enums:               715
Userdef:             0
Classes:             5933
Instance Variables:  23757
Methods:             41015
Friends:             273
Localdefs:           290

SUMMARY File Type Statistics of 1223 projects
========================================
File Type             Number of files
HTML                  763
Header                3331
Implementation        3382
Make                  117
Project Description   1309
Mozilla Statistics
• Finding
• Chemical
• Predicting stock market
• Monitoring condition of equipment, weather, pilot behavior during long flights.
An example of generalization
STUDENT
NAME  STATUS    MAJOR       BIRTH PLACE  GPA
Wang  M.S.      Statistics  Nanjing      3.2
Wise  freshman  Literature  Toronto      3.9
[Figure: concept tree for STATUS, with the higher-level concept graduate and the leaf values freshman, sophomore, junior, senior, M.A., M.S., PhD]
In relation STUDENT, learn characteristic rule for STATUS=graduate in relevance to NAME, BIRTH PLACE, GPA
• (threshold value of 3)
The induction
• 1)
Wang  M.S.  Statistics  Nanjing  3.2
• 2)
• 3)
• 4) If there is a large set of distinct values for an attribute but there is no higher level concept provided for the attribute, the attribute should be removed.
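The attribute-removal rule can be sketched as a check over the columns of the relation. This is an illustration under assumptions: "a large set of distinct values" is read as "more than the threshold", concept hierarchies are given as a dictionary keyed by attribute name, and the rows beyond Wang are hypothetical.

```python
from typing import Dict, List


def attributes_to_remove(rows: List[Dict[str, str]],
                         concept_trees: Dict[str, Dict[str, str]],
                         threshold: int) -> List[str]:
    """Attributes with too many distinct values and no concept tree to climb."""
    if not rows:
        return []
    removable = []
    for attribute in rows[0]:
        distinct = {row[attribute] for row in rows}
        if len(distinct) > threshold and attribute not in concept_trees:
            removable.append(attribute)
    return removable


# Hypothetical graduate-student rows (only Wang appears in the example above).
rows = [
    {"NAME": "Wang", "MAJOR": "Statistics", "BIRTH PLACE": "Nanjing"},
    {"NAME": "Lee", "MAJOR": "Physics", "BIRTH PLACE": "Ottawa"},
    {"NAME": "Chan", "MAJOR": "Math", "BIRTH PLACE": "Bombay"},
    {"NAME": "Gupta", "MAJOR": "History", "BIRTH PLACE": "Vancouver"},
]

# Concept trees exist for MAJOR and BIRTH PLACE but not for NAME
# (the tree contents are irrelevant to this check, so they are left empty).
concept_trees = {"MAJOR": {}, "BIRTH PLACE": {}}

removed = attributes_to_remove(rows, concept_trees, threshold=3)
```

With a threshold of 3, NAME has four distinct values and no hierarchy, so it is the only attribute dropped.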
A generalization
MAJOR       BIRTH PLACE  GPA  VOTE
History     Vancouver    3.5  2
Physics     Ottawa       3.9  3
Math        Bombay       3.3  1
Biology     Shanghai     3.4  1
Computing   Victoria     3.8  2
Statistics  Nanjing      3.2
The value of the vote of a tuple should be carried to its generalized tuple and the votes should be accumulated when merging identical tuples.
4) If there is a higher level concept in the concept tree for an attribute, the substitution of the higher level concept generalizes the tuple.
[Figure: concept tree for MAJOR, with the higher-level concepts Science and Arts above the leaf values Statistics, Computing, Biology, Math, Physics, History]
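Substituting a higher-level concept and accumulating votes when tuples become identical can be sketched together. The assignment of majors to Science and Arts below is an assumption (the figure's edges are not recoverable from the text), and the small relation and its votes are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed concept tree for MAJOR: child value -> higher-level concept.
MAJOR_TREE = {
    "Statistics": "Science",
    "Computing": "Science",
    "Biology": "Science",
    "Math": "Science",
    "Physics": "Science",
    "History": "Arts",
}


def generalize(relation: List[Tuple[Tuple[str, ...], int]],
               attribute_index: int,
               concept_tree: Dict[str, str]) -> Dict[Tuple[str, ...], int]:
    """Climb one level on one attribute and merge identical tuples.

    Each input item is (tuple of attribute values, vote). The vote is carried
    to the generalized tuple and summed when generalized tuples coincide.
    """
    merged: Counter = Counter()
    for values, vote in relation:
        old = values[attribute_index]
        new = concept_tree.get(old, old)  # values without a parent stay as-is
        generalized = (values[:attribute_index] + (new,)
                       + values[attribute_index + 1:])
        merged[generalized] += vote
    return dict(merged)


# Hypothetical relation over (MAJOR, BIRTH PLACE) with votes.
relation = [
    (("Physics", "Canada"), 3),
    (("Math", "Canada"), 1),
    (("History", "Canada"), 2),
    (("Biology", "foreign"), 1),
]

generalized = generalize(relation, attribute_index=0, concept_tree=MAJOR_TREE)
```

Physics and Math from Canada collapse into one (Science, Canada) tuple whose vote is 3 + 1; the total vote is unchanged by the merge.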
Threshold Control
• If the number of distinct values of an attribute is larger than the generalization threshold value, further generalization on this attribute should be performed.
  • Not the case for Major
  • It is the case for Birth Place
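Threshold control can be sketched as a loop that keeps climbing the concept tree of an attribute while it has more distinct values than the threshold. The birth-place hierarchy (city directly to Canada or foreign) and the votes are assumptions for illustration; a real hierarchy may have more levels.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Relation = Dict[Tuple[str, ...], int]  # generalized tuple -> vote


def distinct_values(relation: Relation, i: int) -> Set[str]:
    return {values[i] for values in relation}


def climb(relation: Relation, i: int, parent: Dict[str, str]) -> Relation:
    """Replace attribute i by its parent concept and accumulate votes."""
    merged: Counter = Counter()
    for values, vote in relation.items():
        new = parent.get(values[i], values[i])
        merged[values[:i] + (new,) + values[i + 1:]] += vote
    return dict(merged)


def generalize_attribute(relation: Relation, i: int,
                         parent: Dict[str, str], threshold: int) -> Relation:
    """Generalize attribute i until it has at most `threshold` distinct values."""
    while len(distinct_values(relation, i)) > threshold:
        climbed = climb(relation, i, parent)
        if climbed == relation:  # top of the concept tree: nothing left to climb
            break
        relation = climbed
    return relation


# Assumed one-level hierarchy for BIRTH PLACE.
BIRTH_PLACE_TREE = {
    "Vancouver": "Canada", "Ottawa": "Canada", "Victoria": "Canada",
    "Bombay": "foreign", "Shanghai": "foreign", "Nanjing": "foreign",
}

# Hypothetical relation over (MAJOR, BIRTH PLACE).
relation = {
    ("Science", "Vancouver"): 2,
    ("Science", "Ottawa"): 3,
    ("Science", "Bombay"): 1,
    ("Science", "Shanghai"): 1,
    ("Science", "Victoria"): 2,
    ("Arts", "Nanjing"): 1,
}

# MAJOR (index 0) has 2 distinct values, so a threshold of 3 leaves it alone.
after_major = generalize_attribute(relation, 0, {}, threshold=3)
# BIRTH PLACE (index 1) has 6 distinct values, so it is generalized further.
after_birth_place = generalize_attribute(relation, 1, BIRTH_PLACE_TREE, threshold=3)
```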
Further generalization
[Table: further generalized relation, with VOTE values 35, 40, 25]
Transformation to rules
MAJOR           BIRTH PLACE  GPA        VOTE
{Art, Science}  Canada       excellent  75
Science         foreign      good       25
A graduate student is either (with 75% probability) a Canadian with excellent GPA or (with 25% probability) a foreign student, majoring in science with a good GPA.
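The probabilities in the rule are the votes of the final generalized tuples divided by the total vote. A minimal sketch, using the 75/25 split stated in the rule; the dictionary layout and function names are hypothetical, and the MAJOR value is omitted from the first tuple because it covers every major.

```python
from typing import Dict, List, Tuple

GeneralizedTuple = Tuple[Dict[str, str], int]  # (attribute values, vote)


def rule_probabilities(tuples: List[GeneralizedTuple]) -> List[float]:
    """Share of the total vote carried by each generalized tuple."""
    total = sum(vote for _, vote in tuples)
    return [vote / total for _, vote in tuples]


def describe(tuples: List[GeneralizedTuple]) -> str:
    """Render the disjunction of generalized tuples as a readable rule."""
    parts = []
    for (values, _), p in zip(tuples, rule_probabilities(tuples)):
        conditions = " and ".join(f"{k} = {v}" for k, v in values.items())
        parts.append(f"({conditions}) [{p:.0%}]")
    return " or ".join(parts)


final_relation = [
    ({"BIRTH PLACE": "Canada", "GPA": "excellent"}, 75),
    ({"MAJOR": "Science", "BIRTH PLACE": "foreign", "GPA": "good"}, 25),
]

probabilities = rule_probabilities(final_relation)
rule_text = describe(final_relation)
```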
Questions?