
UNIT-2 DATA MINING PRIMITIVES, LANGUAGE, ARCHITECTURE

DATA MINING TASK PRIMITIVES:


 Data mining task primitives are the basic building blocks that define a data mining task.
They allow users to specify what data to use, what kind of knowledge to discover, and
how to evaluate the results.
 A data mining task can be specified in the form of a data mining query, which is input to
the data mining system. A data mining query is defined in terms of data mining task
primitives.
 A data mining query language can be designed to incorporate these primitives, allowing
users to interact with data mining systems flexibly.
 The key primitives are:
1. Set of task-relevant data to be mined.
2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

1. Set of task-relevant data to be mined:
 It refers to the specific data that is relevant and necessary for a particular task or analysis
being conducted using data mining techniques. This specifies the data to be mined.
 This data may include specific attributes, variables, or characteristics that are relevant
to the task at hand, such as customer data, sales data, or website usage statistics.
 The data selected for mining is typically a subset of the overall data available, as not all
data may be necessary or relevant for the task.
 For example: extracting the database name, the relevant tables, and the required attributes from the provided input database.
2. Kind of knowledge to be mined:
 It refers to the type of knowledge to be discovered through the use of data mining techniques. This describes the data mining tasks or functions to be performed.
 It includes various tasks such as classification, clustering, discrimination,
characterization, association, and evolution analysis.
 For example, it determines the task to be performed on the relevant data in order to
mine useful information such as classification, clustering, prediction, discrimination,
outlier detection, and correlation analysis.
3. Background knowledge to be used in the discovery process:
 It refers to any prior information or understanding that is used to guide the data mining
process.
 This can include domain-specific knowledge, such as industry-specific terminology,
trends as well as knowledge about the data itself.
 The use of background knowledge can help to improve the accuracy and relevance of
the insights obtained from the data mining process.
 For example, concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level concepts, and supports two operations (see the sketch after this list):
 Rolling Up - Generalization of data: Replaces values with more general, higher-level concepts, giving a more meaningful and explicit view of the data that is easier to understand. It also compresses the data, so fewer input/output operations are required.
 Drilling Down - Specialization of data: Replaces concept values with lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.
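
A minimal sketch of rolling up with a concept hierarchy, assuming a hypothetical City → State → Country hierarchy stored as plain Python dictionaries (the place names and mappings are illustrative only):

# Illustrative concept hierarchy: City -> State -> Country (hypothetical data).
city_to_state = {"New York": "New York (State)", "Los Angeles": "California"}
state_to_country = {"New York (State)": "USA", "California": "USA"}

def roll_up(city: str) -> str:
    """Generalize a city to its country by following the hierarchy upward."""
    return state_to_country[city_to_state[city]]

records = ["New York", "Los Angeles", "New York"]
# Rolling up replaces low-level values with higher-level concepts, so three
# distinct city values compress to a single country value.
print({roll_up(c) for c in records})   # {'USA'}

Drilling down would simply walk the same mappings in the opposite direction.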

4. Interestingness measures and thresholds for pattern evaluation:
 It refers to the methods and criteria used to evaluate the quality and relevance of the
patterns or insights discovered through data mining.
 Interestingness measures are used to quantify the degree to which a pattern is
considered to be interesting or relevant based on certain criteria, such as its support or
confidence.
 These measures are used to identify patterns that are meaningful or relevant to the task.
 Thresholds for pattern evaluation set a minimum level of interestingness that a pattern must meet in order to be considered for further analysis or action.
 These are used to evaluate the quality and relevance of the discovered patterns. They
help to identify patterns that are truly meaningful and useful. Examples include:
 Support: How often a pattern occurs in the data.
 Confidence: How often the pattern holds when its condition is met (e.g., how likely a customer is to buy butter given that they bought bread).
 For example: Evaluating the interestingness measures such as utility, certainty, and
novelty for the data and setting an appropriate threshold value for the pattern evaluation.
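
As a concrete illustration, support and confidence for a candidate rule such as {Milk, Bread} → {Butter} can be computed directly from transactions; a minimal Python sketch, with a made-up transaction list:

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Milk", "Butter"},
    {"Milk", "Bread", "Butter"},
]
antecedent, consequent = {"Milk", "Bread"}, {"Butter"}

# Support: fraction of all transactions containing antecedent and consequent.
both = sum(1 for t in transactions if (antecedent | consequent) <= t)
support = both / len(transactions)

# Confidence: among transactions containing the antecedent, the fraction
# that also contain the consequent.
ante = sum(1 for t in transactions if antecedent <= t)
confidence = both / ante

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# A pattern is kept only if it clears the user-set thresholds,
# e.g. support >= 0.30 and confidence >= 0.60.
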
5. Representation for visualizing the discovered patterns:
 It refers to the methods used to represent the patterns discovered through data mining in
a way that is easy to understand and interpret.
 Visualization techniques such as charts, graphs, and maps are commonly used to
represent the data and can help to highlight important trends, patterns, or relationships
within the data.
 Visualizing the discovered pattern helps to make the patterns obtained from the data
mining process more accessible and understandable to a wider audience, including non-
technical stakeholders.
 This specifies how the discovered patterns should be presented to the user. Common
visualization techniques include:
 Rules: Expressing patterns in the form of "if-then" statements.
 Tables: Presenting patterns in a structured format.
 Charts and graphs: Visualizing patterns using different types of diagrams.
 Decision trees: Representing classification models in a tree-like structure.
 For example: presenting and visualizing the discovered patterns using techniques such as bar charts, line graphs, and tables.

DATA MINING TASKS with examples:
Data Characterization: Summarizing the general characteristics of data in a target class.
   Example: Finding the average salary and job roles of employees in an organization.
Data Discrimination: Comparing two or more datasets to find differences.
   Example: Comparing the characteristics of high-performing and low-performing students.
Association Rule Mining: Identifying relationships between different attributes in a dataset.
   Example: "Customers who buy bread often buy butter."
Classification: Assigning data objects to predefined categories or classes.
   Example: Classifying email as "Spam" or "Not Spam."
Clustering: Grouping data into clusters based on similarities.
   Example: Segmenting customers based on purchasing behaviour.
Outlier Detection: Identifying data points that significantly differ from others.
   Example: Detecting fraudulent transactions in banking.
Regression Analysis: Predicting continuous numerical values based on input data.
   Example: Predicting house prices based on features like location, size, and number of rooms.
Sequential Pattern Mining: Identifying patterns that occur in sequences over time.
   Example: Predicting stock price trends based on historical data.
Deviation Analysis: Finding anomalies or changes in patterns over time.
   Example: Detecting unusual drops in sales for a product.

DATA MINING QUERY LANGUAGE (DMQL):
 Data Mining Query Language (DMQL) is a specialized query language used to define
data mining tasks in databases. DMQL provides a structured way to extract patterns,
similar to how SQL is used for querying databases.
 It helps users specify:
 What data to mine
 Which patterns to discover
 How results should be presented
 Designing a comprehensive data mining language is challenging because data mining
covers a wide spectrum of tasks, from data characterization to evolution analysis. Each
task has different requirements.
 The design of an effective data mining query language requires a deep understanding of the power, limitations, and underlying mechanisms of the various kinds of data mining tasks.
 A well-designed query language also facilitates the data mining system's communication with other information systems and its integration with the overall information-processing environment.
Basic Structure of DMQL: A DMQL query consists of the following main components:
1. Data to be Mined: Specifies the dataset.
2. Type of Knowledge to be Discovered: Defines the type of pattern (e.g., association, classification, prediction, clustering).
3. Background Knowledge: Includes additional rules, constraints, or hierarchies.
4. Interestingness Measures: Defines criteria for meaningful patterns.
5. Presentation of Results: Specifies how results should be visualized.

Basic syntax in DMQL: DMQL adopts a syntax similar to that of the relational query language, SQL.
Syntax -- To retrieve relevant dataset:
1. use database (database_name)
2. { use hierarchy (hierarchy_name) for (attribute) }
3. (rule_specified)
4. related to (attribute_or_aggregate_list)
5. from (relation(s)) [ where(condition) ]
6. [ order by(order_list) ]
7. { with [ (type_of) ] threshold = (threshold_value) [ for(attribute(s)) ] }
In the above syntax of a data mining query,
 The first line selects the required database (database_name).
 The second line applies the chosen hierarchy (hierarchy_name) to the given attribute.
 In the third line, (rule_specified) denotes the type of rules to be mined.
 The fourth line gives the related attributes or aggregates on which the rules are based, which helps in generalization.
 In the fifth line, the FROM and WHERE clauses ensure that the given condition is satisfied.
 In the sixth and seventh lines, the results are ordered with "order by", and a threshold value is set with respect to the designated attributes.
For the rule_specified in DMQL, the syntax is given below:
Rule              Syntax
Generalization:   generalize data [into (relation_name)]
Association:      find association rules [as (rule_name)]
Classification:   find classification rules [as (rule_name)] according to [(attribute)]
Characterization: find characteristic rules [as (rule_name)]

Example-1 Clustering Analysis:


Clustering groups data into meaningful clusters based on similarities.
DMQL Query:
mine clusters
from CustomerData
using k-means
with k = 3;
Explanation:
 mine clusters: Specifies that clustering should be performed.
 from CustomerData: Uses data from the CustomerData table.
 using k-means: Uses the k-means clustering algorithm.
 with k = 3: Creates 3 clusters.
Output:
The system clusters customers into three groups:
1. Low-income customers
2. Middle-income customers
3. High-income customers
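
A rough Python equivalent of this query, using scikit-learn's KMeans on a hypothetical stand-in for the CustomerData table (the annual_income column and its values are assumptions for illustration):

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the CustomerData table (income in $1000s).
customers = pd.DataFrame({
    "annual_income": [18, 22, 25, 48, 52, 55, 95, 100, 110],
})

# "using k-means with k = 3": fit k-means with three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
customers["cluster"] = kmeans.fit_predict(customers[["annual_income"]])

# The three clusters roughly correspond to low-, middle-, and
# high-income customer groups.
print(customers.groupby("cluster")["annual_income"].mean())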

Example-2: Association Rule Mining in DMQL
Problem: Suppose we have a retail dataset containing customer transactions, and we want to
find frequent itemsets that customers often purchase together.
DMQL Query:
use database RetailDB;
mine association rules
from TransactionData
extracting frequent patterns
with support threshold = 30%
and confidence threshold = 70%;
Output:
The system will return frequent itemsets such as {Milk, Bread} → {Butter}, which means customers who buy Milk and Bread often buy Butter too.
Explanation of the Query:
use database RetailDB;            Specifies the database that contains the transaction data.
mine association rules            Instructs the system to perform association rule mining.
from TransactionData              Specifies the table (or dataset) that contains the transactions.
extracting frequent patterns      Indicates that we want to find itemsets that appear frequently.
with support threshold = 30%      An itemset must appear in at least 30% of the transactions to be considered frequent.
and confidence threshold = 70%;   The confidence level for a rule must be at least 70% (e.g., "if a customer buys bread, there is a 70% chance they also buy butter").
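
A comparable analysis can be sketched in Python with the third-party mlxtend library; the transactions below are invented, and the thresholds mirror the query above:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical stand-in for the TransactionData table.
transactions = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Butter"],
    ["Milk"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# "with support threshold = 30%": keep itemsets in >= 30% of transactions.
frequent = apriori(onehot, min_support=0.3, use_colnames=True)

# "and confidence threshold = 70%": keep rules with confidence >= 0.7.
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
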

Example-3: Concept Hierarchy Definition


A concept hierarchy allows data to be viewed at different levels of abstraction. For example,
we can define a hierarchy for location data (Country → State → City).
DMQL Query:
define hierarchy location_hierarchy
on Customer(Location)
as (Country, State, City);
Explanation:
 Defines a new hierarchy named location_hierarchy.
 Specifies that the hierarchy applies to the Location attribute in the Customer table.
 Defines the levels of the hierarchy from the highest (Country) to the lowest (City).

ADVANTAGES OF DMQL:
1. Standardized Query Language: DMQL provides a structured and systematic approach to
data mining, making it easier to define and execute mining tasks.
2. Integration with Databases: It can be integrated with traditional databases and data
warehouses, allowing seamless data retrieval and mining.
3. Supports Multiple Data Mining Functions: DMQL supports classification, clustering,
association rule mining, and other tasks.
4. User-Friendly & Simplifies Complex Queries: It provides an easier way for users to specify complex mining operations than traditional programming or SQL-based queries.

DISADVANTAGES OF DMQL:
1. Complex Syntax for Beginners: While it simplifies some tasks, DMQL can still be
challenging for users without experience in query languages.
2. Performance Issues with Large Datasets: Executing DMQL queries on massive datasets
can be slow if not optimized properly.
3. Requires Strong Understanding of Data Mining Concepts: Users must be familiar with
data mining techniques to write effective DMQL queries.

DATA MINING ARCHITECTURE:

Components of Data Mining Architecture:
1. Data Sources:
   a) Databases, the World Wide Web (WWW), and data warehouses are the main data sources.
   b) The data in these sources may be in the form of plain text, spreadsheets, or other forms of media like photos or videos.
   c) The WWW is one of the biggest sources of data.
2. Database Server:
   a) The database server contains the actual data, ready to be processed.
   b) It handles data retrieval as per the user's request.
3. Data Mining Engine:
   a) It is one of the core components of the architecture; it performs all kinds of data mining techniques such as association, classification, and clustering.
4. Pattern Evaluation Module:
   a) It is responsible for finding interesting patterns in the data, and it sometimes interacts with the database server to produce the results of user requests.
5. Graphical User Interface:
   a) Since the user cannot be expected to grasp the full complexity of the data mining process, the graphical user interface helps the user communicate effectively with the data mining system.
6. Knowledge Base:
   a) The knowledge base is an important part of the data mining engine that helps guide the search for result patterns.
   b) The data mining engine may also receive inputs from the knowledge base.
   c) The knowledge base may contain data derived from user experiences.
   d) Its objective is to make the results more accurate and reliable.

VARIOUS ARCHITECTURES OF DATA MINING SYSTEMS:


 Data mining systems can be categorized into different architectures based on how they interact with databases, how they process data, and their level of integration.
 In each scheme, the main focus is on the data mining design and on developing efficient and effective algorithms for mining the available data sets.
The list of integration schemes is as follows:
1. No-Coupling Architecture:
 A standalone data mining system that does not interact with a database or data warehouse. Data is stored in separate local files, mining is performed without direct access to databases, and users must manually extract data before running mining algorithms.
Pros:
 Simple to implement.
 Useful for small datasets or experimental research.
Cons:
 Inefficient for large datasets.
 No real-time or dynamic querying support.

2. Loose Coupling Architecture:
 The data mining system accesses databases indirectly: data is extracted from databases or data warehouses, the mining system processes the extracted data separately, and the results are stored back in the system or displayed to the user.
Pros:
 More flexible than the no-coupling architecture.
 Can work with multiple data sources.
Cons:
 Not optimized for real-time mining.
 Additional steps are required for data extraction and storage.

3. Semi-Tight Coupling Architecture:
 Partial integration of the data mining system with databases or data warehouses. Data mining components use database functionalities (such as indexing and query optimization), and some pre-processing and mining tasks are performed within the database, but the mining system still works as a separate module.
Pros:
 Faster processing than loose coupling.
 Takes advantage of database capabilities.
Cons:
 Still not fully integrated, which leads to some inefficiencies.

4. Tight Coupling Architecture:
 The data mining system is fully integrated with the database system or data warehouse. Data mining is performed directly inside the database system, using SQL-based queries or built-in mining functions, so no separate data extraction is needed. The loose-vs-tight contrast is sketched below.
Pros:
 High efficiency and performance.
 Supports real-time mining.
 Better scalability for big data applications.
Cons:
 Complex implementation.
 Requires modifications to existing database systems.
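
A minimal sketch of the loose vs. tight coupling distinction, using Python's built-in sqlite3 module; the sales table and its contents are made up for illustration, and a GROUP BY aggregate stands in for an in-database mining function:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 150.0), ("West", 80.0)])

# Loose coupling: extract the raw data, then process it outside the database.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount
print("loose:", totals)

# Tight coupling: push the computation into the database itself, so only
# the result crosses the system boundary.
print("tight:", conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())
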
CONCEPT DESCRIPTION: CHARACTERIZATION AND COMPARISON
 Concept description refers to the process of summarizing and characterizing a set of data to extract meaningful patterns, trends, and relationships. It involves generating summarized information about a dataset.
 It is a descriptive task that aims to provide concise summaries of data. It involves characterizing and comparing different concepts or classes within a dataset.
Types of Concept Description:
1. Characterization:
 It summarizes the general characteristics of a target class or concept. This involves
identifying common features, patterns, and trends within the data. It uses statistical
measures such as mean, median, mode, and standard deviation.
 It provides a general summary of a specific set of data (e.g., customers who frequently
purchase a product).
 For example: "The majority of high-spending customers are aged 25-40 and prefer
online shopping."
2. Discrimination / Comparison:
 It compares two or more classes/datasets to identify their differences and similarities.
It helps in distinguishing one category from another based on key attributes.
 For example: "Customers who purchase luxury items tend to have higher annual
incomes compared to those who buy budget items."

Techniques of Concept Description:


1. Attribute-Oriented Induction (AOI)
2. Data summarization/ Statistical Summarization
3. Rule-Based Description
4. Visualization
Applications of Concept Description:
1. Understanding customer behaviour: For example, customer segmentation in marketing.
2. Analysing market trends: Comparing product sales across different regions to identify
trends and opportunities.
3. Identifying risk factors: Characterizing individuals with a high risk of developing a
certain disease based on their medical history and lifestyle factors.

CHARACTERIZATION:
 Characterization is a fundamental concept in data mining used to describe and summarize
the general features of a specific dataset. It helps businesses, researchers, and decision-
makers understand key properties of data and make data-driven decisions.
 By using techniques like attribute-oriented induction (AOI), statistical summarization, rule-based descriptions, and visualization, analysts can identify trends, behaviours, and patterns efficiently.
 For example, in a retail business, characterization might help describe frequent customers
based on their purchasing patterns and shopping behaviour.
Steps in Characterization: (Data Characterization Process)
1. Data Selection: Choose relevant data for characterization.
2. Data Pre-processing: Clean, normalize, and prepare data for analysis.
3. Attribute-Oriented Induction (AOI): Generalize raw data into higher-level concepts.
4. Data Summarization: Find statistics like mean, median, and frequency.
5. Pattern Extraction and Rule Generation: Identify trends, associations, and
classification rules.
6. Visualization & Presentation: Represent data using charts, graphs, and reports.

Advantages of Characterization:
1. It helps in understanding key trends and patterns in the data.
2. Summarized data helps businesses and researchers to simplify decision making.
3. It helps in personalized marketing and targeted campaigns.

Applications of Characterization:
1. Retail and Marketing: Identifying the characteristics of high-value customers. Example findings: most high-value customers are aged 25-40, prefer online shopping over in-store purchases, and frequently purchase electronics and fashion items.
2. Healthcare: Characterizing diabetic patients based on lifestyle and medical history. Example findings: most diabetic patients are over 40 years old, many have a family history of diabetes, and a majority have sedentary lifestyles with high carbohydrate intake.
3. Banking and Finance: Characterizing individuals likely to default on loans. Example findings: defaulters often have low credit scores, many have multiple high-interest loans, and a significant percentage have unstable employment histories.
Techniques of Characterization:
1. Attribute-Oriented Induction (AOI):
 A technique that uses data generalization to extract characteristic rules and summarize data. It generalizes data by replacing specific values with higher-level concepts. Example: if a dataset contains "Age = 21, 22, 23," these values can be generalized to "young adult."
 Data generalization involves summarizing data at a higher level of abstraction. For example, instead of describing individual customers, we might describe customer segments based on their purchasing behaviour (see the sketch after this list).
2. Statistical Summarization:
 It provides an overview of data using statistical measures such as mean, median, and mode. Techniques like calculating summary statistics, creating histograms, and generating box plots provide a visual and quantitative overview of the data's distribution and key features. Example: "The average salary of employees in the IT department is $80,000, with a standard deviation of $10,000."
o Mean (Average): The central value of a numerical attribute.
o Median: The middle value when data is sorted.
o Mode: The most frequently occurring value.
o Standard Deviation: Measures data spread or variability.
3. Rule-Based Description: It uses association rules or classification rules. For example: If
Age > 30 and Income > $50,000 then Likely to Buy a Luxury Car. This rule helps
businesses target high-income individuals for luxury car promotions.
4. Visualization Techniques: Charts, graphs, and other visual representations can effectively
communicate the characteristics of a class, making it easier to understand the dataset. They
are used to display summarized data. Example: A bar chart showing the percentage of
customers from different age groups in a store.
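
A small pandas sketch combining AOI-style generalization with statistical summarization, on a hypothetical employee table (all names and figures are illustrative):

import pandas as pd

# Hypothetical employee data.
df = pd.DataFrame({
    "age":    [23, 28, 35, 41, 52, 38],
    "salary": [55000, 68000, 80000, 90000, 95000, 82000],
})

# AOI-style generalization: replace exact ages with higher-level concepts.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 120],
                         labels=["young adult", "middle-aged", "senior"])

# Statistical summarization per generalized group.
summary = df.groupby("age_group", observed=True)["salary"].agg(
    ["mean", "median", "std", "count"])
print(summary)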

COMPARISON / DISCRIMINATION:
 Comparison is also called Discrimination. It is the process of analysing and contrasting
two or more groups of data to identify differences, distinguishing features, and patterns.
 It helps businesses, researchers, and analysts to find key differences in customer behaviour,
product performance, or market trends. It supports decision-making by analysing different
data segments.
 Using statistical techniques, rule-based methods, and machine learning models,
organizations can make data-driven decisions, optimize marketing strategies, detect fraud,
and improve healthcare outcomes.
 For example: In marketing, comparison can help differentiate high-spending and low-
spending customers based on their purchasing behaviour.

Steps in the Comparison Process:


1. Select Data Groups: Define the two or more datasets to compare.
2. Pre-process Data: Clean, filter, and transform data to ensure accuracy.
3. Choose Comparison Method: Statistical, machine learning, rule-based, or
visualization techniques.
4. Analyse Differences: Identify key distinguishing features and trends.
5. Present Findings: Use tables, charts, graphs, or rules to explain the differences.

Applications of Comparison:
1. Customer Segmentation in Marketing: Compare high-value and low-value customers.
Findings: High-value customers shop frequently and prefer premium brands. Low-value
customers shop occasionally and buy discount products.
2. Fraud Detection in Banking: Compare fraudulent and valid transactions.
Findings: Fraudulent transactions are often international, high-value, and made at unusual times. Valid transactions match the customer's spending history and location.
3. Medical Diagnosis: Compare patients with and without a disease.
Findings: Diabetic patients have higher glucose levels and sedentary lifestyles. Non-diabetic patients have normal glucose levels and exercise regularly.
4. Stock Market Analysis: Compare performing and non-performing stocks.
Findings: Performing stocks have high trading volume and positive earnings reports. Non-
performing stocks show low demand and declining revenue.

Techniques for Comparison:
1. Attribute-Oriented Induction (AOI): Similar to the AOI used in characterization, but instead of summarizing, it finds key differences. Attributes are generalized and then compared between the two datasets. Example:
Group-1: High-income customers. Group-2: Low-income customers.
Finding: High-income customers frequently buy luxury items, whereas low-income customers prefer budget-friendly products.

2. Statistical Comparison: Uses statistical measures such as mean, median, and mode to compare the central tendency of different groups. Example:
Group-1 (Online Shoppers): Average purchase = $150. Group-2 (In-store Shoppers): Average purchase = $100.
Finding: Online shoppers spend more per transaction than in-store shoppers.

3. Discriminant Analysis: A classification technique that finds the attributes that best differentiate between two groups. Example:
Group-1 (Loan Defaulters): low credit scores and unstable employment. Group-2 (Non-Defaulters): high credit scores and stable jobs.

4. Rule-Based Comparison: Uses association rules and decision trees to extract rules that differentiate datasets. Example:
Rule 1: If the customer is under 30 and has an active social media presence, they are likely to prefer digital banking.
Rule 2: If the customer is over 50, they are more likely to visit physical bank branches.

5. Machine Learning-Based Comparison: Uses models like Decision Trees, Random Forests, and Neural Networks to find the key distinguishing factors (see the sketch after this list). Example:
Group-1 (Spam Emails): often contain words like "FREE," "WINNER," and "CASH PRIZE." Group-2 (Non-Spam Emails): have more personalized greetings and business language.

Advantages of Comparison:
1. It helps in business strategy and decision-making.
2. It distinguishes different consumer groups for personalized marketing.
3. It finds anomalies that indicate fraudulent activity.

Difference between Data Characterization and Data Comparison:


Feature: Definition
   Data Characterization: Summarizes and describes the general characteristics of a dataset.
   Data Comparison: Compares two or more datasets to identify differences.
Feature: Purpose
   Data Characterization: Provides an overview of the key features of a dataset.
   Data Comparison: Highlights differences between groups or datasets.
Feature: Techniques Used
   Data Characterization: Attribute-Oriented Induction (AOI), statistical summarization, visualization.
   Data Comparison: Statistical analysis, discriminant analysis, rule-based comparison.
Feature: Example
   Data Characterization: "The average income of customers in this dataset is $70,000, and most purchases come from people aged 25-40."
   Data Comparison: "Customers with an income above $70,000 buy luxury items, while those with lower income prefer budget-friendly products."
Feature: Output Type
   Data Characterization: Generalized patterns, summaries, and trends.
   Data Comparison: Differences, distinguishing features, and contrasting patterns.
Feature: Use Cases
   Data Characterization: Understanding customer behaviour; summarizing medical records; identifying key attributes in a dataset.
   Data Comparison: Customer segmentation (e.g., high vs. low spenders); fraud detection (e.g., fraud vs. valid transactions); disease classification (e.g., diabetic vs. non-diabetic patients).

DATA GENERALIZATION:
 Data generalization is the process of summarizing data by replacing relatively low-level values with higher-level concepts. It is a form of descriptive data mining.
 It simplifies large datasets by summarizing data and identifies trends that may not be visible
in raw data. It reduces data storage needs by eliminating unnecessary details. It enhances
privacy by masking specific details in sensitive data.
For example: Instead of listing every single value, data is grouped into ranges or categories.
Original Data Generalized Data
Age: 22 Age: 20-30
Age: 27 Age: 20-30
City: New York City: USA
Salary: $55,000 Salary: $50K - $60K

There are two basic approaches to data generalization: the OLAP approach and the AOI approach.
1. Data cube approach (OLAP - Online Analytical Processing approach):
 It is an efficient approach; it is helpful, for example, for charting past sales.
 In this approach, computations are performed and the results are stored in a data cube.
 It uses roll-up and drill-down operations on the data cube.
 These operations typically involve aggregate functions, such as count(), sum(), average(), and max().
 These materialized views can then be used for decision support, knowledge discovery, and many other applications.

2. Attribute-Oriented Induction (AOI):
 It is a query-oriented, generalization-based approach to online data analysis.
 In this approach, generalization is performed on the basis of the distinct values of each attribute within the relevant data set. Identical generalized tuples are then merged and their respective counts accumulated in order to perform aggregation.
 The attribute-oriented induction approach uses two methods:
a) Attribute removal.
b) Attribute generalization.
a) Attribute Removal:
 Attribute removal is a technique used in data generalization where specific attributes
in a dataset are eliminated to simplify the data. This helps in reducing complexity,
improving privacy, and highlighting only the most relevant features.
 It involves removing less important or sensitive attributes from a dataset to make
analysis more efficient.
Why Use Attribute Removal?
 Fewer attributes make data easier to analyze.
 Sensitive data is removed, protecting user identities.
 Less data to process means faster computations and improved performance.
When to Use Attribute Removal?
 When certain attributes are irrelevant (e.g., customer names in sales trend analysis).
 When privacy concerns exist (e.g., removing personal information like account numbers).
 When data redundancy is high (e.g., removing one column if another provides similar information).

Example: In customer data, we remove the Name and Phone Number attributes, for three reasons:
 Privacy → phone numbers and names are sensitive.
 Relevance → the name does not affect salary analysis.
 Efficiency → fewer columns mean faster processing.
Before attribute removal:
Name    Age   City       Salary   Phone Number
John    28    New York   $50K     123-456-7890
Sarah   35    Chicago    $65K     987-654-3210
After attribute removal (Name & Phone Number removed):
Age   City       Salary
28    New York   $50K
35    Chicago    $65K
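
In pandas, attribute removal amounts to a single drop call; a minimal sketch on the same (hypothetical) customer table:

import pandas as pd

customers = pd.DataFrame({
    "Name": ["John", "Sarah"],
    "Age": [28, 35],
    "City": ["New York", "Chicago"],
    "Salary": [50000, 65000],
    "Phone Number": ["123-456-7890", "987-654-3210"],
})

# Attribute removal: drop sensitive or irrelevant columns before analysis.
generalized = customers.drop(columns=["Name", "Phone Number"])
print(generalized)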

b) Attribute Generalization:
 Attribute Generalization is a technique in data generalization where specific attribute
values are replaced with higher-level, more abstract concepts using concept hierarchies.
This helps in summarizing data and identifying patterns efficiently.
 It involves replacing detailed values with generalized categories based on pre-defined
hierarchies. It simplifies large datasets by reducing complexity. It uses less space by
summarizing attributes.
Example: Instead of dealing with exact ages, we group them into broader categories.
Original Age Generalized Age (Level 1) Generalized Age (Level 2)
23 20-30 Young
37 30-40 Middle-aged
65 60-70 Senior

Techniques of Attribute Generalization:


1. Concept Hierarchy-Based Generalization: Data is grouped into hierarchical levels
for abstraction. The higher levels provide more generalization.
Example: Location Generalization
City Generalized Level 1 Generalized Level 2
New York New York (State) USA
Los Angeles California USA
Paris Île-de-France France

2. Range-Based Generalization: Numerical data is grouped into ranges instead of individual values. This is useful when analysing salary distributions rather than focusing on individual salaries (see the sketch below).
Example: Income Generalization
Exact Salary    Generalized Salary
$52,000         $50K - $60K
$98,000         $90K - $100K
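
A short pandas sketch of range-based generalization with pd.cut; the salary values are invented:

import pandas as pd

salaries = pd.Series([52000, 98000, 57000, 91000])

# Replace exact salaries with $10K-wide ranges (range-based generalization).
edges = list(range(50000, 110000, 10000))      # 50K, 60K, ..., 100K
labels = [f"${lo // 1000}K - ${(lo + 10000) // 1000}K" for lo in edges[:-1]]
generalized = pd.cut(salaries, bins=edges, labels=labels)
print(generalized)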

DATA SUMMARIZATION:
 Data summarization is a key technique in data mining that helps in extracting useful
information from large datasets by providing a compact representation of the data.
 It enables analysts to understand large datasets quickly, reduces storage requirements, and helps in decision-making. It involves computing statistical measures, aggregating data, and creating visual or textual summaries that highlight key patterns, trends, and distributions.
Techniques of Data Summarization:
1. Statistical Measures: Summarization begins with computing basic statistical measures:
 Mean (Average): The central value of a dataset.
 Median: The middle value when data is sorted.
 Mode: The most frequently occurring value.
 Variance & Standard Deviation: Measures of data dispersion.
2. Aggregation: Combining multiple data points into a single value to provide a higher-level summary (see the sketch after this list). For example:
 Summing up monthly sales data to get yearly totals.
 Averaging customer ratings for a product.
3. Clustering and Segmentation:
 Clustering: Grouping similar data points together to form patterns.
 Segmentation: Dividing data into meaningful groups based on predefined attributes.
4. Visualization Techniques:
 Histograms & Bar Charts: Show frequency distributions.
 Pie Charts: Represent proportions of categories.
 Box Plots: Indicate data spread and outliers.
 Heat-maps: Display correlations between variables.
5. Data Cube (OLAP Summarization):
 Online Analytical Processing (OLAP) cubes store pre-aggregated data for fast retrieval.
 Used for multidimensional data analysis (e.g., sales by region, time, and product).
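
A compact pandas sketch of aggregation and OLAP-style cube summarization (sales by region and product); all data is invented for illustration:

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East"],
    "product": ["TV", "Phone", "TV", "Phone", "TV"],
    "amount":  [100, 200, 150, 120, 90],
})

# Aggregation: combine many data points into higher-level totals per region.
print(sales.groupby("region")["amount"].sum())

# OLAP-style cube: pre-aggregated sales by region x product; the margins
# act like roll-up totals over each dimension.
cube = pd.pivot_table(sales, values="amount", index="region",
                      columns="product", aggfunc="sum",
                      margins=True, margins_name="All")
print(cube)
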
Applications of Data Summarization:
a) Business: Understanding customer behaviour, sales trends, and market performance.
b) Healthcare: Summarizing patient records for better diagnosis.
c) Finance: Summarizing stock market data for investment decisions.
d) Social Media: Aggregating trends from large-scale text data.
MINING CLASS COMPARISONS:
 Mining class comparisons is a technique used to analyse and compare different classes or groups of data based on their attributes. The goal is to identify discriminative features that distinguish one class from another.
 Class discrimination, or class comparison, mines descriptions that distinguish a target class from its contrasting classes. The target and contrasting classes must be comparable, i.e., they must share similar dimensions and attributes.
 For example, the three classes person, address, and item are not comparable, but sales in each of the last three years are comparable classes; likewise, we can compare computer science candidates with physics candidates.
The general procedure for class comparison is as follows:
1. Data Collection: The set of relevant data in the database and data warehouse is collected by query processing and partitioned into a target class and one or a set of contrasting classes.
2. Dimension relevance analysis: If there are many dimensions and analytical
comparisons are desired, then dimension relevance analysis should be performed on
these classes and only highly relevant dimensions are included in the further analysis.
3. Synchronous Generalization: The process of generalization is performed upon the
target class to the level controlled by the user, which results in a prime target class
relation. The concepts in the contrasting class are generalized to the same level as those
in the prime target class relation, forming the prime contrasting class relation or cuboid.
4. Drilling Down, Rolling Up and other OLAP adjustments: Synchronous and
asynchronous drill-down, roll-up & other OLAP operations such as, slicing, dicing and
pivoting can be performed on target and contrasting classes based on user instructions.
5. Presentation of the derived comparison: The resulting class comparison description
can be visualized in the form of tables, charts, and rules. This presentation usually
includes a "contrasting" measure (such as count %) that reflects the comparison between
the target and contrasting classes.
The OLAP operations are:
1. Roll-up: Aggregates data to a higher level of hierarchy.
2. Drill-down: Breaks down data to a more detailed level.
3. Slice: Filters data along a single dimension.
4. Dice: Filters data based on multiple dimensions.
5. Pivot (Rotate): Re-orients the data view for better analysis.

Example of Mining Class Comparison:
Suppose we would like to compare the general properties of the old customers and new customers of Royal Electronics, which deals in computer and electronic products, given the attributes name, gender, product, birthplace, birthdate, residence, and phone.
Target class: old customer
Name Gender Product Birthplace Birthdate Residence Phone
Rakesh M Printer Jodhpur 14/07/1993 3/A C.H.B. 9925852890
Sumit M Scanner Jaipur 22/06/2002 A.G. colony 9875894102
Minal F Keyboard Jodhpur 11/08/1992 Bank colony 9928509928
…… …… …… …… …… …… ……

Contrasting class: new customer


Name Gender Product Birthplace Birthdate Residence Phone
Harish M Mouse Jodhpur 14/07/1993 11/B-road 9941852791
Priya F Monitor Jaipur 22/06/2002 Jain colony 9460768333
…… …… …… …… …… …… ……

Initial working relations: the target class vs. the contrasting class.
1. Data Collection: We select two sets of task-relevant data: one for the initial target-class working relation and the other for the initial contrasting-class working relation.
2. Dimension relevance analysis: This analysis is performed on the two classes of data. Irrelevant or weakly relevant dimensions, such as name, gender, product, residence, and phone, are removed from the resulting classes. Only highly relevant attributes are included in the subsequent analysis.
3. Synchronous Generalization: Generalization is performed on the target class to the levels controlled by the user, forming the prime target class relation. The contrasting class is generalized to the same levels, forming the prime contrasting class relation.
Prime generalized relation for the target class (old customer):
Birthplace   Age-Range
Jodhpur      25-30
Jaipur       20-25
Others       Over 30

Prime generalized relation for the contrasting class (new customer):
Birthplace   Age-Range
Jodhpur      18-25
Jodhpur      18-25
Others       Over 30

4. Drilling Down, Rolling Up and other OLAP adjustments: The OLAP operations are
performed on the target and contrasting class, based on the user’s instruction to adjust
the level of abstraction.
5. Presentation of the derived comparison: Finally, the resulting class comparison is presented in the form of tables, graphs, or rules; a count% cross-tabulation is sketched below.
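
A small pandas sketch of this presentation step, computing a count% contrasting measure between the generalized target and contrasting classes (the tuples are invented to mirror the tables above):

import pandas as pd

# Generalized tuples for both classes (hypothetical).
df = pd.DataFrame({
    "class":     ["old", "old", "old", "new", "new", "new"],
    "age_range": ["25-30", "20-25", "Over 30", "18-25", "18-25", "Over 30"],
})

# Count% per class: the "contrasting" measure comparing the two classes.
pct = pd.crosstab(df["age_range"], df["class"], normalize="columns") * 100
print(pct.round(1))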

