Unit-2: Data Mining
1. Set of task-relevant data to be mined:
It refers to the specific data that is relevant and necessary for a particular task or analysis
being conducted using data mining techniques. This specifies the data to be mined.
This data may include specific attributes, variables, or characteristics that are relevant
to the task at hand, such as customer data, sales data, or website usage statistics.
The data selected for mining is typically a subset of the overall data available, as not all
data may be necessary or relevant for the task.
For example: specifying the database name, the relevant tables, and the required
attributes from the provided input database.
2. Kind of knowledge to be mined:
It refers to the type of knowledge to be discovered through the use of data mining
techniques. This describes the data mining tasks or functions to be performed.
It includes various tasks such as classification, clustering, discrimination,
characterization, association, and evolution analysis.
For example, it determines the task to be performed on the relevant data in order to
mine useful information such as classification, clustering, prediction, discrimination,
outlier detection, and correlation analysis.
3. Background knowledge to be used in the discovery process:
It refers to any prior information or understanding that is used to guide the data mining
process.
This can include domain-specific knowledge, such as industry-specific terminology,
trends as well as knowledge about the data itself.
The use of background knowledge can help to improve the accuracy and relevance of
the insights obtained from the data mining process.
For example, Concept hierarchies are a popular form of background knowledge, which
allows data to be mined at multiple levels of abstraction. A concept hierarchy defines a
sequence of mappings from low-level concepts to higher-level ones, such as:
Rolling Up - Generalization of data: Allows data to be viewed at more meaningful and
explicit levels of abstraction, making it easier to understand. It also compresses the data,
so fewer input/output operations are required.
Drilling Down - Specialization of data: Concept values are replaced by lower-level
concepts. Based on different user viewpoints, there may be more than one concept
hierarchy for a given attribute or dimension.
4. Interestingness measures and thresholds for pattern evaluation:
It refers to the methods and criteria used to evaluate the quality and relevance of the
patterns or insights discovered through data mining.
Interestingness measures are used to quantify the degree to which a pattern is
considered to be interesting or relevant based on certain criteria, such as its support or
confidence.
These measures are used to identify patterns that are meaningful or relevant to the task.
Thresholds for pattern evaluation set a minimum level of interestingness that a
pattern must meet in order to be considered for further analysis or action.
These are used to evaluate the quality and relevance of the discovered patterns. They
help to identify patterns that are truly meaningful and useful. Examples include:
Support: How often a pattern occurs in the data.
Confidence: How likely it is that a pattern is true.
For example: Evaluating the interestingness measures such as utility, certainty, and
novelty for the data and setting an appropriate threshold value for the pattern evaluation.
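The support and confidence measures described above can be sketched in a few lines of Python. The transactions and the threshold values below are invented purely for illustration:

```python
# Market-basket transactions, each a set of purchased items (invented data).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Evaluate the rule {milk, bread} -> {butter}.
s = support({"milk", "bread", "butter"}, transactions)
c = confidence({"milk", "bread"}, {"butter"}, transactions)

# Keep the rule only if it clears both (assumed) minimum thresholds.
MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.6
interesting = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
print(f"support={s:.2f}, confidence={c:.2f}, kept={interesting}")
```

On this toy data the rule is kept: it occurs in 2 of 5 transactions (support 0.40) and holds in 2 of the 3 transactions containing its antecedent (confidence about 0.67), clearing both thresholds.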
5. Representation for visualizing the discovered patterns:
It refers to the methods used to represent the patterns discovered through data mining in
a way that is easy to understand and interpret.
Visualization techniques such as charts, graphs, and maps are commonly used to
represent the data and can help to highlight important trends, patterns, or relationships
within the data.
Visualizing the discovered pattern helps to make the patterns obtained from the data
mining process more accessible and understandable to a wider audience, including non-
technical stakeholders.
This specifies how the discovered patterns should be presented to the user. Common
visualization techniques include:
Rules: Expressing patterns in the form of "if-then" statements.
Tables: Presenting patterns in a structured format.
Charts and graphs: Visualizing patterns using different types of diagrams.
Decision trees: Representing classification models in a tree-like structure.
For example: presentation and visualization of the discovered patterns using various
visualization techniques such as bar graphs, charts, tables, etc.
DATA MINING TASKS with examples:
Data Characterization: Summarizing the general characteristics of data in a target class.
Example: Finding the average salary and job roles of employees in an organization.
Data Discrimination: Comparing two or more datasets to find differences.
Example: Comparing the characteristics of high-performing and low-performing students.
Association Rule Mining: Identifying relationships between different attributes in a dataset.
Example: "Customers who buy bread often buy butter."
Classification: Assigning data objects to predefined categories or classes.
Example: Email classification into "Spam" and "Not Spam."
Clustering: Grouping data into clusters based on similarities.
Example: Segmenting customers based on purchasing behaviour.
Outlier Detection: Identifying data points that significantly differ from others.
Example: Detecting fraudulent transactions in banking.
Regression Analysis: Predicting continuous numerical values based on input data.
Example: Predicting house prices based on features like location, size, and number of rooms.
Sequential Pattern Mining: Identifying patterns that occur in sequences over time.
Example: Predicting stock price trends based on historical data.
Deviation Analysis: Finding anomalies or changes in patterns over time.
Example: Detecting unusual drops in sales for a product.
DATA MINING QUERY LANGUAGE (DMQL):
Data Mining Query Language (DMQL) is a specialized query language used to define
data mining tasks in databases. DMQL provides a structured way to extract patterns,
similar to how SQL is used for querying databases.
It helps users specify:
What data to mine
Which patterns to discover
How results should be presented
Designing a comprehensive data mining language is challenging because data mining
covers a wide spectrum of tasks, from data characterization to evolution analysis. Each
task has different requirements.
The design of an effective data mining query language requires a deep understanding of
the power, limitations, and underlying mechanisms of the various kinds of data mining tasks.
Such a language facilitates a data mining system's communication with other information
systems and its integration with the overall information processing environment.
Basic Structure of DMQL: A DMQL query consists of the following main components:
1. Data to be Mined: Specifies the dataset.
2. Type of Knowledge to be Discovered: Defines the type of pattern (e.g., association, classification, prediction, clustering).
3. Background Knowledge: Includes additional rules, constraints, or hierarchies.
4. Interestingness Measures: Defines criteria for meaningful patterns.
5. Presentation of Results: Specifies how results should be visualized.
Basic syntax in DMQL: DMQL adopts a syntax similar to the relational query language SQL.
Syntax -- To retrieve relevant dataset:
1. use database (database_name)
2. { use hierarchy (hierarchy_name) for (attribute) }
3. (rule_specified)
4. related to (attribute_or_aggregate_list)
5. from (relation(s)) [ where(condition) ]
6. [ order by(order_list) ]
7. { with [ (type_of) ] threshold = (threshold_value) [ for(attribute(s)) ] }
In the above syntax of a data mining query:
The first line selects the required database (database_name).
The second line applies the chosen hierarchy (hierarchy_name) to the given attribute.
In the third line, (rule_specified) denotes the type of rules to be mined.
In the fourth line, the related set of attributes or aggregates is specified, which supports generalization.
In the fifth line, the FROM and WHERE clauses ensure that the given condition is satisfied.
In the sixth line, the results are ordered using "order by".
In the seventh line, a threshold value can be set, optionally with respect to particular attributes.
For the rule_specified in DMQL, The syntax is given below:
RULES SYNTAX
Generalization: generalize data [into (relation_name)]
Association: find association rules [as (rule_name)]
Classification: find classification rules [as (rule_name) ] according to [(attribute)]
Characterization: find characteristic rules [as (rule_name)]
Example-2: Association Rule Mining in DMQL
Problem: Suppose we have a retail dataset containing customer transactions, and we want to
find frequent itemsets that customers often purchase together.
DMQL Query:
use database RetailDB;
mine association rules
from TransactionData
extracting frequent patterns
with support threshold = 30%
and confidence threshold = 70%;
Output: The system will return frequent itemsets such as {Milk, Bread} → {Butter}, which
means customers who buy Milk and Bread often buy Butter too.
Explanation of the Query:
use database RetailDB;
  Specifies the database that contains the transaction data.
mine association rules
  Instructs the system to perform association rule mining.
from TransactionData
  Specifies the table (or dataset) that contains the transaction data.
extracting frequent patterns
  Indicates that we want to find itemsets that appear frequently.
with support threshold = 30%
  Means that an itemset must appear in at least 30% of the transactions to be considered frequent.
and confidence threshold = 70%;
  Specifies that the confidence level for a rule must be at least 70% (e.g., "if a customer buys bread, there is a 70% chance they also buy butter").
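To make the semantics concrete, here is a rough Python sketch of what a system might do when executing the query above: a brute-force frequent-itemset search with the same 30% support and 70% confidence thresholds. The transactions are invented, and a real system would use Apriori or FP-growth rather than enumerating every candidate itemset:

```python
# Brute-force frequent-itemset and rule mining (illustrative sketch only).
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.30, 0.70

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Enumerate all candidate itemsets and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if support(frozenset(c)) >= MIN_SUPPORT
]

# Derive rules A -> B from each frequent itemset of size >= 2.
for fs in frequent:
    if len(fs) < 2:
        continue
    for size in range(1, len(fs)):
        for ante in map(frozenset, combinations(sorted(fs), size)):
            conf = support(fs) / support(ante)
            if conf >= MIN_CONFIDENCE:
                print(f"{set(ante)} -> {set(fs - ante)} "
                      f"(support={support(fs):.0%}, confidence={conf:.0%})")
```

On this toy data the rule {Milk, Bread} → {Butter} is among those printed: the full itemset appears in 3 of 5 transactions (support 60%) and the antecedent in 4 of 5, giving confidence 75%, which clears the 70% threshold.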
ADVANTAGES OF DMQL:
1. Standardized Query Language: DMQL provides a structured and systematic approach to
data mining, making it easier to define and execute mining tasks.
2. Integration with Databases: It can be integrated with traditional databases and data
warehouses, allowing seamless data retrieval and mining.
3. Supports Multiple Data Mining Functions: DMQL supports classification, clustering,
association rule mining, and other tasks.
4. User-Friendly & Simplifies Complex Queries: It provides an easier way for users to specify
complex mining operations compared to traditional programming or SQL-based queries.
DISADVANTAGES OF DMQL:
1. Complex Syntax for Beginners: While it simplifies some tasks, DMQL can still be
challenging for users without experience in query languages.
2. Performance Issues with Large Datasets: Executing DMQL queries on massive datasets
can be slow if not optimized properly.
3. Requires Strong Understanding of Data Mining Concepts: Users must be familiar with
data mining techniques to write effective DMQL queries.
Components of Data Mining Architecture:
1. Data Sources:
1. Databases, the World Wide Web (WWW), and data warehouses are parts of data sources.
2. The data in these sources may be in the form of plain text, spreadsheets, or other
forms of media like photos or videos.
3. WWW is one of the biggest sources of data.
2. Database Server:
1. The database server contains the actual data ready to be processed.
2. It performs the task of handling data retrieval as per the request of the user.
3. Data Mining Engine:
1. It is one of the core components of the data mining architecture that performs all
kinds of data mining techniques like association, classification, clustering etc.
4. Pattern Evaluation Modules:
1. They are responsible for finding interesting patterns in the data and sometimes they
also interact with the database servers for producing the result of the user requests.
5. Graphical User Interface:
1. Since the user cannot fully grasp the complexity of the data mining process, a
graphical user interface helps the user communicate effectively with the data
mining system.
6. Knowledge Base:
1. Knowledge Base is an important part of the data mining engine that is quite
beneficial in guiding the search for the result patterns.
2. Data mining engines may also sometimes get inputs from the knowledge base.
3. This knowledge base may contain data from user experiences.
4. The objective of the knowledge base is to make the result more accurate and reliable.
CHARACTERIZATION:
Characterization is a fundamental concept in data mining used to describe and summarize
the general features of a specific dataset. It helps businesses, researchers, and decision-
makers understand key properties of data and make data-driven decisions.
By using techniques like attribute-oriented induction (AOI), statistical
summarization, rule-based descriptions, and visualization, analysts can identify trends,
behaviours, and patterns efficiently.
For example, in a retail business, characterization might help describe frequent customers
based on their purchasing patterns and shopping behaviour.
Steps in Characterization: (Data Characterization Process)
1. Data Selection: Choose relevant data for characterization.
2. Data Pre-processing: Clean, normalize, and prepare data for analysis.
3. Attribute-Oriented Induction (AOI): Generalize raw data into higher-level concepts.
4. Data Summarization: Find statistics like mean, median, and frequency.
5. Pattern Extraction and Rule Generation: Identify trends, associations, and
classification rules.
6. Visualization & Presentation: Represent data using charts, graphs, and reports.
Advantages of Characterization:
1. It helps in understanding key trends and patterns in the data.
2. Summarized data helps businesses and researchers to simplify decision making.
3. It helps in personalized marketing and targeted campaigns.
Applications of Characterization:
1. Retail and Marketing: Identifying the characteristics of high-value customers.
Example findings: Most high-value customers are aged 25-40, prefer online shopping over
in-store purchases, and frequently purchase electronics and fashion items.
2. Healthcare: Characterizing diabetic patients based on lifestyle and medical history.
Example findings: Most diabetic patients are over 40 years old, many have a family
history of diabetes, and a majority have sedentary lifestyles with high carbohydrate intake.
3. Banking and Finance: Characterizing individuals likely to default on loans.
Example findings: Defaulters often have low credit scores, many have multiple
high-interest loans, and a significant percentage have unstable employment histories.
Techniques of Characterization:
1. Attribute-Oriented Induction (AOI):
A technique that uses data generalization to extract characteristic rules and summarize
data. It generalizes data by replacing specific values with higher-level concepts.
Example: If a dataset contains "Age = 21, 22, 23," it can be generalized to "young adult."
Data generalization involves summarizing data at a higher level of abstraction. For
example, instead of describing individual customers, we might describe customer
segments based on their purchasing behaviour.
2. Statistical Summarization:
It provides an overview of data using statistical measures such as mean, median, and
mode. Techniques like calculating summary statistics, creating histograms, and
generating box plots provide a visual and quantitative overview of the data's
distribution and key features. Example: "The average salary of employees in the IT
department is $80,000, with a standard deviation of $10,000."
o Mean (Average): The central value of a numerical attribute.
o Median: The middle value when data is sorted.
o Mode: The most frequently occurring value.
o Standard Deviation: Measures data spread or variability.
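The measures listed above can be computed directly with Python's standard statistics module; the salary figures below are made up for illustration:

```python
# Summary statistics for a small, invented list of salaries.
import statistics

salaries = [70_000, 75_000, 80_000, 80_000, 95_000]

print("mean  :", statistics.mean(salaries))    # central value
print("median:", statistics.median(salaries))  # middle value when sorted
print("mode  :", statistics.mode(salaries))    # most frequent value
print("stdev :", round(statistics.stdev(salaries), 2))  # spread
```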
3. Rule-Based Description: It uses association rules or classification rules. For example: If
Age > 30 and Income > $50,000 then Likely to Buy a Luxury Car. This rule helps
businesses target high-income individuals for luxury car promotions.
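Such a rule is straightforward to express as a predicate; the attribute names and cutoffs below simply mirror the illustrative rule above:

```python
# Rule-based description as a predicate: If Age > 30 and Income > $50,000
# then Likely to Buy a Luxury Car (illustrative rule, not real data).
def likely_to_buy_luxury_car(age, income):
    return age > 30 and income > 50_000

print(likely_to_buy_luxury_car(35, 60_000))  # True
print(likely_to_buy_luxury_car(25, 60_000))  # False
```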
4. Visualization Techniques: Charts, graphs, and other visual representations can effectively
communicate the characteristics of a class, making it easier to understand the dataset. They
are used to display summarized data. Example: A bar chart showing the percentage of
customers from different age groups in a store.
COMPARISON / DISCRIMINATION:
Comparison is also called Discrimination. It is the process of analysing and contrasting
two or more groups of data to identify differences, distinguishing features, and patterns.
It helps businesses, researchers, and analysts to find key differences in customer behaviour,
product performance, or market trends. It supports decision-making by analysing different
data segments.
Using statistical techniques, rule-based methods, and machine learning models,
organizations can make data-driven decisions, optimize marketing strategies, detect fraud,
and improve healthcare outcomes.
For example: In marketing, comparison can help differentiate high-spending and low-
spending customers based on their purchasing behaviour.
Applications of Comparison:
1. Customer Segmentation in Marketing: Compare high-value and low-value customers.
Findings: High-value customers shop frequently and prefer premium brands. Low-value
customers shop occasionally and buy discount products.
2. Fraud Detection in Banking: Compare fraud and valid transactions.
Findings: Fraud transactions are often international, high-value, and made at unusual times.
Valid transactions match customer spending history and location.
3. Medical Diagnosis: Compare patients with and without a disease.
Findings: Diabetic patients have higher glucose levels and sedentary lifestyles. Non-diabetic
patients have normal glucose levels and exercise regularly.
4. Stock Market Analysis: Compare performing and non-performing stocks.
Findings: Performing stocks have high trading volume and positive earnings reports. Non-
performing stocks show low demand and declining revenue.
Techniques for Comparison:
1. Attribute-Oriented Induction (AOI): Similar to AOI used in characterization, but instead
of summarizing, it finds key differences. Attributes are generalized and compared between
two datasets. Example:
Group-1: High-income customers
Group-2: Low-income customers
Finding: High-income customers frequently buy luxury items, whereas low-income
customers prefer budget-friendly products.
2. Statistical Comparison: Uses statistical measures to compare datasets. Such as, mean,
median, mode to find the central tendency of different groups. Example:
Group-1 (Online Shoppers): Average purchase = $150
Group-2 (In-store Shoppers): Average purchase = $100
Finding: Online shoppers spend more per transaction than in-store shoppers.
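This statistical comparison can be sketched in a few lines; the purchase amounts below are invented to reproduce the averages in the example:

```python
# Compare central tendency of two hypothetical shopper groups.
import statistics

online = [120, 150, 180, 150]     # invented purchase amounts ($)
in_store = [90, 100, 110, 100]

mean_online = statistics.mean(online)
mean_in_store = statistics.mean(in_store)

print(f"online mean = ${mean_online}, in-store mean = ${mean_in_store}")
print("online shoppers spend more per transaction"
      if mean_online > mean_in_store
      else "in-store shoppers spend more per transaction")
```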
3. Rule-Based Comparison: Uses association rules and decision trees to extract rules
differentiating datasets. Example:
Rule 1: If the customer is under 30 and has an active social media presence, they are
likely to prefer digital banking.
Rule 2: If the customer is over 50, they are more likely to visit physical bank branches.
Advantages of Comparison:
1. It helps in business strategy and decision-making.
2. It distinguishes different consumer groups for personalized marketing.
3. It finds anomalies that indicate fraud activity.
DATA GENERALIZATION:
Data Generalization is the process of summarizing data by replacing relatively low-level
values with higher-level concepts. It is a form of descriptive data mining.
It simplifies large datasets by summarizing data and identifies trends that may not be visible
in raw data. It reduces data storage needs by eliminating unnecessary details. It enhances
privacy by masking specific details in sensitive data.
For example: Instead of listing every single value, data is grouped into ranges or categories.
Original Data Generalized Data
Age: 22 Age: 20-30
Age: 27 Age: 20-30
City: New York City: USA
Salary: $55,000 Salary: $50K - $60K
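The range-based generalization shown in the table can be sketched as simple binning functions; the bin widths chosen here (10-year age bands, $10K salary bands) are illustrative assumptions:

```python
# Generalize low-level values into higher-level range concepts.
def generalize_age(age):
    low = (age // 10) * 10           # e.g. 27 -> 20
    return f"{low}-{low + 10}"

def generalize_salary(salary):
    low = (salary // 10_000) * 10_000
    return f"${low // 1000}K - ${(low + 10_000) // 1000}K"

print(generalize_age(22))         # 20-30
print(generalize_age(27))         # 20-30
print(generalize_salary(55_000))  # $50K - $60K
```

Note how the two distinct ages 22 and 27 collapse into the same generalized value, which is exactly what makes summarization (and privacy masking) possible.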
There are two basic approaches of data generalization: OLAP approach & AOI approach
1. Data cube approach (OLAP - Online Analytical Processing approach):
It is an efficient approach that is useful, for example, for charting past sales.
In this approach, computation and results are stored in the Data cube.
It uses Roll-up and Drill-down operations on a data cube.
These operations typically involve aggregate functions, such as count(), sum(),
average(), and max().
These materialized views can then be used for decision support, knowledge discovery,
and many other applications.
2. Attribute Oriented Induction (AOI):
It is an online, query-oriented, generalization-based data analysis approach.
In this approach, generalization is performed on the basis of the distinct values of each
attribute within the relevant data set. Identical tuples are then merged and their
respective counts accumulated in order to perform aggregation.
The attribute-oriented induction approach uses two methods:
a) Attribute removal.
b) Attribute generalization.
a) Attribute Removal:
Attribute removal is a technique used in data generalization where specific attributes
in a dataset are eliminated to simplify the data. This helps in reducing complexity,
improving privacy, and highlighting only the most relevant features.
It involves removing less important or sensitive attributes from a dataset to make
analysis more efficient.
Why Use Attribute Removal?
Fewer attributes make data easier to analyze.
Sensitive data is removed, protecting user identities.
Less data to process means faster computations and improved performance.
When to Use Attribute Removal?
When certain attributes are irrelevant (e.g., customer names in sales trend analysis).
When privacy concerns exist (e.g., removing personal information like A/c no.).
When data redundancy is high (e.g., removing one column if another provides similar information).
Example: In customer data, removing Name and Phone Number. These attributes are removed because:
Privacy → Phone numbers & names are sensitive.
Relevance → Name doesn’t affect salary analysis.
Efficiency → Fewer columns mean faster processing.
Before Attribute Removal:
Name    Age   City       Salary   Phone Number
John    28    New York   $50K     123-456-7890
Sarah   35    Chicago    $65K     987-654-3210
After Attribute Removal (removing Name & Phone Number):
Age   City       Salary
28    New York   $50K
35    Chicago    $65K
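The removal step above can be sketched with plain Python dictionaries (no external libraries); the records mirror the example table:

```python
# Attribute removal: drop sensitive/irrelevant columns from each record.
customers = [
    {"Name": "John", "Age": 28, "City": "New York",
     "Salary": "$50K", "Phone Number": "123-456-7890"},
    {"Name": "Sarah", "Age": 35, "City": "Chicago",
     "Salary": "$65K", "Phone Number": "987-654-3210"},
]

SENSITIVE = {"Name", "Phone Number"}  # attributes to remove

anonymized = [
    {k: v for k, v in row.items() if k not in SENSITIVE}
    for row in customers
]
print(anonymized[0])  # {'Age': 28, 'City': 'New York', 'Salary': '$50K'}
```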
b) Attribute Generalization:
Attribute Generalization is a technique in data generalization where specific attribute
values are replaced with higher-level, more abstract concepts using concept hierarchies.
This helps in summarizing data and identifying patterns efficiently.
It involves replacing detailed values with generalized categories based on pre-defined
hierarchies. It simplifies large datasets by reducing complexity. It uses less space by
summarizing attributes.
Example: Instead of dealing with exact ages, we group them into broader categories.
Original Age Generalized Age (Level 1) Generalized Age (Level 2)
23 20-30 Young
37 30-40 Middle-aged
65 60-70 Senior
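The two hierarchy levels in the table can be encoded as a pair of mapping functions; the category boundaries (under 30 = Young, 30-59 = Middle-aged, 60+ = Senior) are illustrative assumptions:

```python
# Two-level concept hierarchy for Age, matching the table above.
def level1(age):
    """Level 1: numeric range, e.g. 23 -> '20-30'."""
    low = (age // 10) * 10
    return f"{low}-{low + 10}"

def level2(age):
    """Level 2: named category (assumed boundaries)."""
    if age < 30:
        return "Young"
    if age < 60:
        return "Middle-aged"
    return "Senior"

for age in (23, 37, 65):
    print(age, "->", level1(age), "->", level2(age))
```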
DATA SUMMARIZATION:
Data summarization is a key technique in data mining that helps in extracting useful
information from large datasets by providing a compact representation of the data.
It enables analysts to understand large datasets quickly, reduces storage requirements, and
helps in decision-making. It involves computing statistical measures, aggregating data, and creating
visual or textual summaries that highlight key patterns, trends, and distributions.
Techniques of Data Summarization:
1. Statistical Measures: Summarization begins with computing basic statistical measures:
Mean (Average): The central value of a dataset.
Median: The middle value when data is sorted.
Mode: The most frequently occurring value.
Variance & Standard Deviation: Measures of data dispersion.
2. Aggregation: Combining multiple data points into a single value to provide a higher-level
summary. For example:
Summing up monthly sales data to get yearly totals.
Averaging customer ratings for a product.
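Both aggregation examples are essentially one-liners in Python; the sales and rating figures below are invented:

```python
# Aggregation: combine many data points into a single summary value.
monthly_sales = {"Jan": 1200, "Feb": 900, "Mar": 1500}  # invented figures
yearly_total = sum(monthly_sales.values())
print("total sales:", yearly_total)

ratings = [4, 5, 3, 5, 4]                               # invented ratings
average_rating = sum(ratings) / len(ratings)
print("average rating:", average_rating)
```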
3. Clustering and Segmentation:
Clustering: Grouping similar data points together to form patterns.
Segmentation: Dividing data into meaningful groups based on predefined attributes.
4. Visualization Techniques:
Histograms & Bar Charts: Show frequency distributions.
Pie Charts: Represent proportions of categories.
Box Plots: Indicate data spread and outliers.
Heat-maps: Display correlations between variables.
5. Data Cube (OLAP Summarization):
Online Analytical Processing (OLAP) cubes store pre-aggregated data for fast retrieval.
Used for multidimensional data analysis (e.g., sales by region, time, and product).
Applications of Data Summarization:
a) Business: Understanding customer behaviour, sales trends, and market performance.
b) Healthcare: Summarizing patient records for better diagnosis.
c) Finance: Summarizing stock market data for investment decisions.
d) Social Media: Aggregating trends from large-scale text data.
MINING CLASS COMPARISONS:
Mining class comparisons is a technique used to analyse and compare different classes
or groups of data based on their attributes. The goal is to identify discriminative features
that distinguish one class from another.
Class discrimination or class comparison mines descriptions that distinguish a target class
from its contrasting classes. The target and contrasting classes must be comparable, in the
sense that they share similar dimensions and attributes.
For example, the three classes - person, address, and item are not comparable. But the
sales in the last three years are comparable classes. Also, we can compare computer
science candidates versus physics candidates.
The general procedure for class comparison is as follows:
1. Data Collection: The set of relevant data in the database and data warehouse is
collected by query processing and partitioned into a target class and one or a set of
contrasting classes.
2. Dimension relevance analysis: If there are many dimensions and analytical
comparisons are desired, then dimension relevance analysis should be performed on
these classes and only highly relevant dimensions are included in the further analysis.
3. Synchronous Generalization: The process of generalization is performed upon the
target class to the level controlled by the user, which results in a prime target class
relation. The concepts in the contrasting class are generalized to the same level as those
in the prime target class relation, forming the prime contrasting class relation or cuboid.
4. Drilling Down, Rolling Up and other OLAP adjustments: Synchronous and
asynchronous drill-down, roll-up, and other OLAP operations such as slicing, dicing, and
pivoting can be performed on the target and contrasting classes based on user instructions.
5. Presentation of the derived comparison: The resulting class comparison description
can be visualized in the form of tables, charts, and rules. This presentation usually
includes a "contrasting" measure (such as count %) that reflects the comparison between
the target and contrasting classes.
The OLAP operations are:
1. Roll-up: Aggregates data to a higher level of hierarchy.
2. Drill-down: Breaks down data to a more detailed level.
3. Slice: Filters data along a single dimension.
4. Dice: Filters data based on multiple dimensions.
5. Pivot (Rotate): Re-orients the data view for better analysis.
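A toy sketch of roll-up and slice on a tiny in-memory "cube", with invented sales figures (drill-down is simply the reverse of roll-up, returning to the detailed records):

```python
# A tiny "data cube" as a dict: (region, quarter, product) -> sales.
from collections import defaultdict

cube = {
    ("East", "Q1", "Printer"): 100,
    ("East", "Q1", "Scanner"): 50,
    ("East", "Q2", "Printer"): 120,
    ("West", "Q1", "Printer"): 80,
}

# Roll-up: aggregate away the product dimension (higher level of hierarchy).
rollup = defaultdict(int)
for (region, quarter, _product), sales in cube.items():
    rollup[(region, quarter)] += sales
print(dict(rollup))

# Slice: filter along a single dimension (region == "East").
east = {k: v for k, v in cube.items() if k[0] == "East"}
print(east)
```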
Example of Mining Class Comparison:
Suppose we would like to compare the general properties of the old customers and new
customers of Royal Electronics, which deals in computer and electronic products, given the
attributes name, gender, product, birthplace, birthdate, residence and phone.
Target class: old customer
Name Gender Product Birthplace Birthdate Residence Phone
Rakesh M Printer Jodhpur 14/07/1993 3/A C.H.B. 9925852890
Sumit M Scanner Jaipur 22/06/2002 A.G. colony 9875894102
Minal F Keyboard Jodhpur 11/08/1992 Bank colony 9928509928
…… …… …… …… …… …… ……
Initial working relations: the target class vs. the contrasting class.
1. Data Collection: In this step, we select two sets of task-relevant data: one for the initial target
class working relation and the other for the initial contrasting class working relation.
2. Dimension relevance analysis: Now, this analysis is performed on the two classes of
data. After this analysis, irrelevant or weakly relevant dimensions such as name, gender,
product, residence and phone are removed from the resulting class. Only highly relevant
attributes are included in the subsequent analysis.
3. Synchronous Generalization: Here, the generalization is performed on the target class
to the levels controlled by user, forming the prime target class relation. The contrasting
class is also generalized to the same levels as those in the prime target class, forming
the prime contrasting class relation.
Prime generalized relation for the target class: old customer
Birthplace   Age-Range
Jodhpur      25-30
Jaipur       20-25
Others       Over 30
Prime generalized relation for the contrasting class: new customer
Birthplace   Age-Range
Jodhpur      18-25
Jodhpur      18-25
Others       Over 30
4. Drilling Down, Rolling Up and other OLAP adjustments: The OLAP operations are
performed on the target and contrasting class, based on the user’s instruction to adjust
the level of abstraction.
5. Presentation of the derived comparison: Finally, the resulting class comparison is
presented in the form of tables, graphs, or rules.