4 Data Mining & Preprocessing (Lectures 11-16)
Pre-processing
Lectures 11-16
Dr. Sumit Dhariwal
School of Computing and Information Technology
Manipal University Jaipur, India
Outline
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
• Find clusters of “model” customers who share the same characteristics: interests,
income level, spending habits, etc.
• E.g. most customers with an income of $60k-$80k and monthly food expenses of $600-$800 live in a particular area
• Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k usually buy this type of CD player
• Fraud detection
• Find outliers of unusual transactions
• Financial planning
• Summarize and compare the resources and spending
[Figure: business-intelligence pyramid. Layers range from data exploration (statistical summary, querying, and reporting) at the base up to decision making by the end user at the top; the potential to support business decisions increases toward the top.]
[Figure: data mining as a confluence of multiple disciplines: database technology, statistics, machine learning, information science, visualization, and other disciplines.]
• Data are organized around major subjects, e.g. customer, item, supplier and
activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10
years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• Users can perform drill-down or roll-up operations to view the data at
different degrees of summarization, as in the sketch below
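A minimal roll-up/drill-down sketch using pandas (the DataFrame, column names, and values are all hypothetical):

import pandas as pd

# Hypothetical transaction records; column names are illustrative only.
sales = pd.DataFrame({
    "store":     ["S1", "S1", "S2", "S2", "S2"],
    "item_type": ["CD", "DVD", "CD", "CD", "DVD"],
    "amount":    [12.0, 20.0, 15.0, 9.0, 22.0],
})

# Summarize transactions per item type for each store.
per_store_item = sales.groupby(["store", "item_type"])["amount"].sum()

# Roll-up to a higher level of summarization: totals per store.
per_store = sales.groupby("store")["amount"].sum()

# Drill-down is the reverse: moving from per_store back to per_store_item.
print(per_store_item, per_store, sep="\n\n")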
• Cluster Analysis
• Class label is unknown: group data to form new classes
• Clusters of objects are formed based on the principle of maximizing intraclass
similarity and minimizing interclass similarity
• E.g. identify homogeneous subpopulations of customers; these clusters may
represent individual target groups for marketing (see the sketch below)
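A minimal clustering sketch with scikit-learn's KMeans (the customer features and segment values are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [income in k$, monthly food spending in $].
customers = np.array([
    [62, 610], [75, 790], [68, 700],   # a mid-income segment
    [25, 300], [28, 350], [22, 280],   # a lower-income segment
])

# Form k = 2 clusters; points within a cluster are similar to each other
# (high intraclass similarity) and dissimilar to the other cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment per customer
print(kmeans.cluster_centers_)  # profile of each "model" customer group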
• Outlier Analysis
• Data that do not comply with the general behavior or model of the data.
• Outliers are usually discarded as noise or exceptions.
• Useful for fraud detection.
• E.g. detect purchases of extremely large amounts (see the sketch below)
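One simple way to flag such purchases is a robust median/MAD rule; this is a minimal sketch, not the method the slides describe, and the amounts and the cutoff factor of 5 are hypothetical:

import numpy as np

# Hypothetical purchase amounts; one extremely large transaction.
amounts = np.array([120, 95, 130, 110, 105, 9800], dtype=float)

# Flag points far from the median relative to the median absolute
# deviation (MAD); unlike the mean/std, a single outlier cannot inflate it.
median = np.median(amounts)
mad = np.median(np.abs(amounts - median))
outliers = amounts[np.abs(amounts - median) > 5 * mad]
print(outliers)  # [9800.]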
• Evolution Analysis
• Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
• Subjective measures
• Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop
frequently.
1.6 Classification of data mining systems
• Database
• Relational, data warehouse, transactional, stream, object-oriented/relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge
• Characterization, discrimination, association, classification, clustering, trend/deviation,
outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
1.7 Data Mining Task Primitives
(1) The set of task-relevant data: which portion of the database is to be used
(2) The kind of knowledge to be mined: e.g. characterization, association, classification, clustering
(3) The background knowledge to be used: e.g. concept hierarchies
(4) The interestingness measures and thresholds for pattern evaluation
(5) Visualization methods: what form to display the result, e.g. rules, tables, charts
Coupling Data Mining with DB/DW Systems
• Semi-tight
– Efficient implementations of a few essential data mining primitives in a DB/
DW system are provided, e.g., sorting, indexing, aggregation,
histogram analysis, multiway join, and precomputation of some statistical
functions
– Enhanced DM performance
• Tight
– DM is smoothly integrated into a DB/DW system, mining query is
optimized based on mining query analysis, data structures, indexing, query
processing methods of a DB/DW system
– A uniform information processing environment, highly desirable
1.9 Major Issues in Data Mining
• Mining methodology and User interaction
• Mining different kinds of knowledge
• DM should cover a wide spectrum of data analysis and knowledge discovery tasks
• These tasks may use the same database in different ways
• This requires the development of numerous data mining techniques
• Interactive mining of knowledge at multiple levels of abstraction
• Difficult to know exactly what will be discovered
• Allow users to focus the search, refine data mining requests
• Incorporation of background knowledge
• Guide the discovery process
• Allow discovered patterns to be expressed in concise terms and different levels of abstraction
• Data mining query languages and ad hoc data mining
• High-level query languages need to be developed
• Should be integrated with a DB/DW query language
• 1. Data Cleaning:
Real-world data often contain irrelevant and missing parts. Data cleaning
handles these problems; it involves the handling of missing data, noisy
data, etc.
• (a). Missing Data:
This situation arises when some values are missing from the dataset. It can
be handled in various ways.
Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
• Regression:
Here data can be smoothed by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
• Clustering:
This approach groups similar data into clusters; values that fall outside
the clusters can be treated as outliers. (A sketch of basic missing-data
handling follows below.)
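A minimal sketch of the two simplest missing-data strategies with pandas (the DataFrame and its values are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (NaN).
df = pd.DataFrame({
    "age":    [23, 31, np.nan, 45, 29],
    "income": [28000, np.nan, 52000, 61000, 33000],
})

# (a) Ignore the tuple: drop rows with any missing value
#     (sensible only when the dataset is large).
dropped = df.dropna()

# (b) Fill with a measure of central tendency, e.g. the attribute mean.
filled = df.fillna(df.mean(numeric_only=True))

print(dropped, filled, sep="\n\n")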
2. Data Transformation
This step transforms the data into forms appropriate for the mining process. Its strategies include:
1. Normalization:
This is done to scale the data values into a specified range, such as [-1.0, 1.0] or [0.0, 1.0].
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval
labels or conceptual levels. (A sketch of normalization and discretization
follows below.)
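A minimal sketch of min-max normalization and discretization with pandas (the income values and bin edges are hypothetical):

import pandas as pd

incomes = pd.Series([21000, 27000, 45000, 62000, 78000])

# Normalization: min-max scaling into [0.0, 1.0].
normalized = (incomes - incomes.min()) / (incomes.max() - incomes.min())

# Discretization: replace raw numeric values by interval labels.
levels = pd.cut(incomes, bins=[0, 30000, 60000, 90000],
                labels=["low", "medium", "high"])

print(normalized, levels, sep="\n\n")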
• Data mining is used to handle huge amounts of data, and analysis becomes
harder as the volume grows. Data reduction techniques address this: they aim
to increase storage efficiency and to reduce data storage and analysis costs.
• The steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data to construct a data cube.
2. Attribute Subset Selection:
Only the attributes relevant to the mining task are retained; irrelevant or
redundant attributes are discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or
lossless: if the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called
lossy. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis), sketched below.
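A minimal PCA sketch with scikit-learn (the synthetic data, its dimensions, and the choice of three components are hypothetical):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 records with 10 correlated attributes built
# from 3 underlying factors.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 10))

# Lossy dimensionality reduction: keep the components that explain
# most of the variance.
pca = PCA(n_components=3).fit(data)
reduced = pca.transform(data)  # shape (100, 3)
print(reduced.shape, pca.explained_variance_ratio_.sum())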
WHAT IS FREQUENT PATTERN MINING?
• A frequent pattern is a pattern (a set of items, a subsequence, etc.) that occurs frequently in a data set
• Pseudo-code of the Apriori algorithm for mining frequent itemsets:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are
        contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
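A runnable Python translation of this pseudo-code, as a minimal sketch (the function and variable names are my own; min_support is an absolute count here):

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (frozenset) with its support count."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent, k = dict(Lk), 1
    while Lk:
        # Candidate generation: join Lk with itself, then prune any
        # candidate that has an infrequent k-subset (Apriori property).
        Ck = set()
        for s1 in Lk:
            for s2 in Lk:
                union = s1 | s2
                if len(union) == k + 1 and all(
                        frozenset(sub) in Lk
                        for sub in combinations(union, k)):
                    Ck.add(union)
        # Support counting: one scan of the database per level.
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:  # candidate contained in transaction
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent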
Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• How to count supports of candidates?
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4={abcd} (verified in the sketch below)
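This candidate-generation example can be checked mechanically. A small sketch (it uses a union-based join, which also produces abce before pruning, whereas the ordered self-join above does not):

from itertools import combinations

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}

# Join: unions of two itemsets in L3 that form a 4-itemset.
joined = {a | b for a in L3 for b in L3 if len(a | b) == 4}

# Prune: keep only candidates whose every 3-subset is in L3
# (acde is dropped because ade is not in L3).
C4 = {c for c in joined
      if all(frozenset(s) in L3 for s in combinations(c, 3))}
print(sorted("".join(sorted(c)) for c in C4))  # ['abcd']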
How to Count Supports of Candidates?
• Subset function: finds all the candidates contained in a given transaction
• Candidate itemsets are stored in a hash tree: leaf nodes hold lists of itemsets with counts, and interior nodes hold hash tables
[Figure: a hash tree of candidate 3-itemsets traversed with the transaction {1, 2, 3, 5, 6}; at each level, items are hashed into the branches 1,4,7 / 2,5,8 / 3,6,9.]
Challenges of Frequent Pattern Mining
• Challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates
Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below
the threshold cannot be frequent
• Candidates: a, b, c, d, e
• Hash entries: {ab, ad, ae} {bd, be, de} …
• Frequent 1-itemset: a, b, d, e
• ab is not a candidate 2-itemset if the sum of the counts of {ab,
ad, ae} is below the support threshold (see the sketch below)
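A minimal sketch of this hash-based pruning idea (as in the DHP/PCY approach; the transactions, bucket count, and threshold are hypothetical):

from itertools import combinations

transactions = [{"a", "b", "d"}, {"b", "d", "e"}, {"a", "d", "e"}, {"b", "e"}]
min_support, n_buckets = 2, 7  # real implementations use far more buckets

# First scan: count 1-itemsets AND hash every pair into a bucket.
item_counts, buckets = {}, [0] * n_buckets
for t in transactions:
    for item in t:
        item_counts[item] = item_counts.get(item, 0) + 1
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % n_buckets] += 1

frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# A pair is a candidate 2-itemset only if both items are frequent AND
# its bucket count reaches min_support: an infrequent bucket cannot
# contain a frequent pair (bucket counts only over-count, via collisions).
C2 = {pair for pair in combinations(sorted(frequent_items), 2)
      if buckets[hash(pair) % n_buckets] >= min_support}
print(C2)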
Sampling for Frequent Patterns
[Figure residue from a multiple-level association-rule slide: with uniform support, every level uses the same threshold, e.g. min_sup = 5% at Level 1, where Milk has support = 10%.]
• Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
• Inter-dimension assoc. rules (no repeated predicates)
age(X, ”19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
• Hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical attributes: finite number of possible values, no
ordering among values; data cube approach
• Quantitative attributes: numeric, implicit ordering among values;
discretization, clustering, and gradient approaches
Mining Quantitative Associations
• Succinctness:
• Given A1, the set of items satisfying a succinctness constraint
C, then any set S satisfying C is based on A1 , i.e., S contains
a subset belonging to A1
• Idea: Without looking at the transaction database, whether
an itemset S satisfies constraint C can be determined based
on the selection of items
• min(S.Price) ≤ v is succinct
• sum(S.Price) ≥ v is not succinct
• Optimization: If C is succinct, C is pre-counting pushable
The Apriori Algorithm — Example
Database D (min_sup = 2):
  TID 100: 1 3 4
  TID 200: 2 3 5
  TID 300: 1 2 3 5
  TID 400: 2 5

Scan D → C1 (itemset: sup): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (from L1 joined with L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3: {2 3 5}
Scan D → L3: {2 3 5}: 2
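This worked example can be reproduced with the Python apriori sketch given earlier (assuming that function is in scope):

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_support=2)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# Prints L1, L2, and L3 above; the largest is {2, 3, 5} with support 2.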
Naïve Algorithm: Apriori + Constraint
TDB (min_sup = 2):
  TID 10: a, b, c, d, f
  TID 20: b, c, d, f, g, h
  TID 30: a, c, d, e, f
  TID 40: c, e, f, g

• Convert tough constraints into anti-monotone or monotone constraints by
properly ordering items
• Examine C: avg(S.profit) ≥ 25

Constraint (Antimonotone | Monotone | Succinct):
  sum(S) ≤ v (∀a ∈ S, a ≥ 0): yes | no | no
  sum(S) ≥ v (∀a ∈ S, a ≥ 0): no | yes | no
  range(S) ≤ v: yes | no | no
  range(S) ≥ v: no | yes | no
  support(S) ≤ ξ: no | yes | no
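A minimal sketch of how an anti-monotone constraint is pushed into mining; the constraint sum(S.price) ≤ 30 and the prices are made up for illustration. Once an itemset violates such a constraint, every superset also violates it, so the candidate is pruned and never extended:

# Hypothetical item prices and threshold.
price = {"a": 10, "b": 15, "c": 20, "d": 5}
v = 30

def satisfies(itemset):
    # Anti-monotone constraint: sum of prices must not exceed v.
    return sum(price[i] for i in itemset) <= v

candidates = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "d"}]
pruned = [c for c in candidates if satisfies(c)]
print(pruned)  # {'b', 'c'} (sum 35) is dropped and never extended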
A Classification of Constraints
[Figure: a classification of constraints. The two broad classes are antimonotone (including convertible anti-monotone) and monotone (including convertible monotone); succinct and strongly convertible constraints cut across them, and constraints fitting no class are inconvertible.]
Frequent-Pattern Mining: Summary