DM Unit 3

Data preprocessing is crucial for ensuring data quality, which includes accuracy, completeness, consistency, timeliness, believability, and interpretability. Key tasks involve data cleaning, integration, reduction, and transformation to address common issues like missing values and noise, ultimately leading to more reliable analysis and insights. Techniques such as dimensionality reduction, regression, and data cube aggregation are employed to enhance data usability and efficiency in mining algorithms.


Data Preprocessing Overview:

1. Data Quality and Why Preprocess?
   o Data quality means the data must meet the needs of its intended use.
   o Important aspects of data quality:
      ▪ Accuracy: Data should be correct.
      ▪ Completeness: All needed data must be present.
      ▪ Consistency: No conflicts or discrepancies in the data.
      ▪ Timeliness: Data must be up-to-date.
      ▪ Believability: Users must trust the data.
      ▪ Interpretability: Data must be easy to understand.

2. Common Data Quality Issues:
   o Inaccurate Data: Mistakes during data entry or transmission, incorrect values.
   o Incomplete Data: Missing values or attributes that weren't recorded.
   o Inconsistent Data: Conflicting or mismatched values, such as different names for the same item (e.g., "Bill" vs "William").

3. Real-World Scenario:
   o A sales manager may find the data unreliable because of missing or incorrect values, while a marketing analyst may still find the data usable despite some inaccuracies.

4. Timeliness of Data:
   o Delays in data collection or updates can affect its quality, like missing sales records after the month ends.

5. Believability & Interpretability:
   o Even if data is accurate, if users don't trust it or can't understand it, they'll consider it low quality.

Data Preprocessing Tasks:

1. Data Cleaning:
   o Fixes issues in data like missing values, noise (errors), outliers (extreme values), and inconsistencies.
   o Cleaning ensures the data is reliable before running mining algorithms.

2. Data Integration:
   o Combines data from multiple sources, ensuring consistent names and values.
   o Addresses issues like different names for the same attribute (e.g., customer id vs cust id).

3. Data Reduction:
   o Reduces the size of the dataset while maintaining its integrity, making it easier and faster to analyze.
   o Types of Data Reduction:
      ▪ Dimensionality Reduction: Reduces the number of attributes (features) by removing irrelevant ones.
      ▪ Numerosity Reduction: Uses techniques like sampling or aggregation to replace data with smaller representations.

4. Data Transformation:
   o Normalization: Scales data values to a standard range (e.g., [0.0, 1.0]) to prevent larger values from dominating.
   o Discretization: Replaces continuous values with ranges or categories (e.g., age → youth, adult, senior).
   o Concept Hierarchy: Groups data into higher-level concepts for easier analysis.

5. Purpose of Preprocessing:
   o Improves Data Quality: Clean, complete, and consistent data ensures more accurate analysis.
   o Better Mining Results: Preprocessed data is more suitable for mining algorithms, improving accuracy and efficiency.

6. Summary:
   o Data in the real world is often "dirty" (incomplete, inaccurate, inconsistent).
   o Data preprocessing improves quality, leading to more reliable results in decision-making and data mining.

Key Takeaway:

• Data preprocessing is essential for preparing data for analysis and ensuring that the insights drawn from it are accurate and trustworthy.
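To make the four task families concrete before the detailed sections, here is a minimal pandas sketch; the table, column names, and thresholds are invented purely for illustration, and the code assumes pandas is available.

```python
# A toy illustration of the four preprocessing task families.
# The DataFrame, column names, and values are hypothetical examples.
import pandas as pd

sales = pd.DataFrame({
    "cust_id": [1, 2, 2, 3, 4],
    "age":     [23, None, None, 45, 61],
    "income":  [28_000, 54_000, 54_000, 73_600, 98_000],
})

# Cleaning: fill a missing value with the attribute median, drop duplicate tuples.
sales["age"] = sales["age"].fillna(sales["age"].median())
sales = sales.drop_duplicates()

# Integration: combine with another source that names the key differently.
regions = pd.DataFrame({"customer_id": [1, 2, 3, 4], "region": ["N", "S", "E", "W"]})
sales = sales.merge(regions, left_on="cust_id", right_on="customer_id")

# Reduction: keep a random sample of the tuples (numerosity reduction).
sample = sales.sample(frac=0.5, random_state=0)

# Transformation: min-max normalize income into [0.0, 1.0].
inc = sales["income"]
sales["income_scaled"] = (inc - inc.min()) / (inc.max() - inc.min())

print(sales)
```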
Data Cleaning

Data cleaning is the process of fixing incomplete, noisy, or inconsistent data so that it can be used for analysis. The section discusses three main aspects of data cleaning: handling missing values, dealing with noisy data, and managing the data cleaning process.

3.2.1 Missing Values

When working with data, you might encounter missing values. Here are ways to handle them:

1. Ignore the tuple (row): Skip the data point if the missing value is not critical (e.g., if the missing value is for the class label in a classification task). This method is not ideal if too many values are missing.
2. Fill in manually: Replace missing values manually, which can be very time-consuming and impractical for large datasets.
3. Use a constant value: Replace the missing value with a global constant (like "Unknown"). This can be misleading, as the mining program may treat these values as meaningful.
4. Use the mean or median of the data: Replace missing values with the mean (average) or median (middle value) of the data. The median is better if the data is skewed.
5. Use the mean or median for the same class: If you're working with different categories (e.g., customer credit risk), replace the missing value with the mean or median of that class.
6. Predict the value using other data: Use more advanced methods like regression or decision trees to predict and fill in the missing value based on other data points.

Note: Sometimes, missing values are not errors; for example, a person who doesn't have a driver's license might leave that field blank.

3.2.2 Noisy Data

Noisy data refers to random errors or inconsistencies in data. Here are ways to reduce noise:

1. Binning: Sort data into bins (groups), then smooth values by using the mean, median, or boundaries of the bin.
2. Regression: Fit a model to the data to make predictions and smooth out fluctuations.
3. Outlier analysis: Identify outliers (data points that deviate significantly from others) and treat them as noise. Clustering methods can help find outliers.

Note: Smoothing techniques can also be used for data transformation and reduction.
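A small pandas/numpy sketch of two of the techniques above — class-wise median imputation (method 5 for missing values) and equal-frequency binning smoothed by bin means. The customer table and the price list are invented toy values, and the code assumes pandas and numpy are available.

```python
# Illustrative sketch only: class-wise median imputation and binning-based smoothing.
# The customer table and the price list are invented toy data.
import numpy as np
import pandas as pd

customers = pd.DataFrame({
    "credit_risk": ["low", "low", "low", "high", "high", "high"],
    "income":      [30_000, 34_000, None, 82_000, None, 90_000],
})

# Missing values: replace a missing income with the median income of the same class.
customers["income"] = customers.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.median())
)
print(customers)

# Noisy data: sort prices into equal-frequency bins, then smooth by bin means.
prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))
bins = np.array_split(prices, 3)                  # three bins of (roughly) equal size
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print("smoothed by bin means:", smoothed)
```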
3.2.3 Data Cleaning as a Process

Data cleaning is a step-by-step process involving the following:

1. Discrepancy detection: Identify inconsistencies or errors in the data, which can arise from mistakes in data entry, outdated information, or using the data in ways not originally intended.
   o Use knowledge about the data (metadata) to find issues (e.g., valid ranges for values, expected relationships between attributes).
   o Look for unusual data points that don't fit trends or patterns (outliers).
2. Inconsistent data representations: Check for differences in how the data is represented, such as date formats or codes used.
3. Manual correction: Some errors can be fixed manually using external references, like verifying an address.
4. Data transformation: Once discrepancies are found, apply changes to fix them. This process may involve writing scripts or using tools to make bulk corrections.
   o Commercial tools can help automate data cleaning, like checking for errors in postal addresses or using statistical analysis to identify issues.
5. Iterative process: Data cleaning is a repetitive task. After making changes, check if new issues arise, and repeat the process until the data is clean.
   o Some tools help by allowing users to make changes and instantly see the results, improving interactivity and reducing errors.

Data Integration and Its Importance:

• Data integration is about combining data from different sources into one consistent dataset.
• It helps reduce redundancies (repetition of data) and inconsistencies, which improves the quality and speed of data analysis.

Challenges in Data Integration:

1. Entity Identification Problem:
   o This problem occurs when we need to match equivalent data from different databases.
   o For example, if "customer id" in one database and "cust number" in another refer to the same thing, we need to ensure they match up correctly.
   o Metadata (like data type and range) helps avoid mistakes when combining data.
2. Redundancy and Correlation Analysis:
   o Redundancy happens when data is repeated unnecessarily, like having the same information in different places.
   o Correlation analysis checks if two pieces of data are related or if one can be derived from another (like annual revenue being derived from monthly sales).

How to Detect Redundancy:

1. For Nominal (categorical) Data:
   o Use the χ² (chi-square) test to find out if two attributes (like gender and preferred reading material) are related, as in the sketch just below.
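A hedged sketch of such a test using scipy; the 2×2 contingency table of gender vs. preferred reading below is made up for illustration and is not data from these notes.

```python
# Chi-square test of independence for two nominal attributes.
# The contingency table (gender vs. preferred reading) is invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [250, 200],   # male:   fiction, non-fiction
    [50,  1000],  # female: fiction, non-fiction
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", round(chi2, 2))
print("p-value:", p_value)
print("expected counts if independent:\n", expected.round(1))
# A very small p-value suggests the two attributes are not independent,
# i.e., one of them may be redundant.
```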
   o The chi-square statistic checks if the actual data significantly differs from what we expect, indicating a correlation.
2. For Numeric Data:
   o The correlation coefficient measures how strongly two numeric attributes (like sales and profits) are related.
   o The coefficient can range from -1 (negative correlation) to +1 (positive correlation), where 0 means no correlation.
   o Correlation doesn't imply causality (i.e., one doesn't necessarily cause the other).
3. Covariance:
   o Covariance is similar to correlation and shows how two numeric attributes change together.
   o Positive covariance means both attributes increase together, and negative covariance means one increases while the other decreases.

Additional Issues in Data Integration:

1. Tuple Duplication:
   o Sometimes, duplicate records appear in datasets (e.g., the same person being listed multiple times), which leads to redundancy and errors.
2. Data Value Conflict:
   o Different data sources might store the same attribute in different ways (e.g., one system uses metric units while another uses imperial units for weight).
   o Conflicts can also arise due to different representations of data or different abstraction levels (e.g., sales data at a store level versus regional level).

Data Reduction Overview:

1. Data Reduction: Helps make large datasets smaller while maintaining most of the original data's key information. This makes analysis faster and more efficient.
2. Data Reduction Techniques: Include:
   o Dimensionality Reduction: Reducing the number of attributes (columns) used in analysis.
   o Numerosity Reduction: Replacing data with smaller representations (models or compressed data).
   o Data Compression: Reducing data size while keeping information, either with or without losing data quality (lossless vs. lossy compression).

3.4.1 Data Reduction Strategies:

1. Dimensionality Reduction:
   o Reduces the number of attributes or variables in the data.
   o Techniques include:
      ▪ Wavelet Transforms: A technique to transform data and keep only the most important parts.
      ▪ Principal Components Analysis (PCA): Combines attributes into fewer, more significant components.
2. Numerosity Reduction:
   o Uses models to represent data with fewer numbers.
   o Includes methods like regression (predicting data), histograms, clustering, sampling, and data cube aggregation.
3. Data Compression:
   o Lossless Compression: Reduces data size without losing any information.
   o Lossy Compression: Reduces data size but sacrifices some accuracy for smaller data size.

3.4.2 Wavelet Transforms:

1. Wavelet Transform: A method to compress data by transforming it into wavelet coefficients. The most important coefficients are kept, and the rest are discarded to reduce data size.
2. Wavelet Properties:
   o Allows data cleaning by removing noise while keeping important data.
   o Works faster in specific scenarios and provides better compression than traditional methods like JPEG.
3. Procedure: Involves transforming data, then truncating less important parts of the data, resulting in a more compact and faster-to-process dataset.

3.4.3 Principal Components Analysis (PCA):

1. PCA: A method that combines several attributes into fewer, "principal" components, representing most of the data's variation.
2. Procedure:
   o Normalizes data to ensure no attribute dominates.
   o Creates orthogonal (independent) components that represent the data.
   o Reduces data by keeping only the most significant components.
3. Benefits of PCA: Helps reduce the complexity of the data while preserving important patterns. It is better for handling sparse data.

3.4.4 Attribute Subset Selection:

1. Attribute Subset Selection: This technique removes irrelevant or redundant attributes to improve data mining efficiency and quality.
2. Why It's Needed: Too many attributes can slow down analysis and cause poor quality results.
3. Methods for Selection:
   o Stepwise Forward Selection: Starts with no attributes and adds the most useful ones step by step.
   o Stepwise Backward Elimination: Starts with all attributes and removes the least useful ones.
   o Combination of Both: Adds and removes attributes to find the best set.
   o Decision Tree Induction: Uses decision trees to find which attributes are most useful for classification.
4. Additional Methods:
   o Attribute Construction: Combines attributes to create new, more useful attributes (e.g., combining height and width to create area).

Key Points:

• Data Reduction helps make large datasets manageable by reducing size while preserving useful information.
• Dimensionality reduction reduces the number of features, numerosity reduction reduces data volume, and data compression reduces data size for storage or processing efficiency.
• Different techniques, such as Wavelet Transforms and PCA, are used depending on the type of data and the problem at hand.

Regression and Log-Linear Models:

• Linear Regression: A method to model a relationship between two variables (e.g., predicting one based on the other) using a straight line. This method minimizes the error between actual and predicted data points.
• Multiple Linear Regression: Extends simple linear regression to model relationships involving more than two variables.
• Log-Linear Models: Used to estimate probabilities for data in higher-dimensional spaces (more than two variables) and are helpful for reducing dimensions of the data.
• Data Reduction: Both regression and log-linear models help reduce data by summarizing it more simply.

Histograms:

• Binning: Histograms group data into different buckets (or bins) to summarize distributions.
• Single Attribute Histograms: Show the distribution of one attribute (e.g., prices) by grouping similar values together.
• Multidimensional Histograms: These capture the relationship between multiple attributes, but they are effective only up to a certain number of dimensions (about five).
• Data Smoothing: Histograms can smooth out variations and represent data more clearly.

Clustering:

• Clustering: Groups data into clusters where each group has similar characteristics.
• Cluster Quality: The quality of a cluster can be measured by how close the points in the cluster are to each other (using methods like centroid distance or diameter); a small sketch follows below.
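As a small illustration of the clustering ideas above (and of the reduction bullet that follows), here is a sketch that assumes scikit-learn's KMeans is available; the one-dimensional price values are invented toy data.

```python
# Clustering-based numerosity reduction sketch (assumes scikit-learn is installed).
# The 1-D price values below are invented toy data.
import numpy as np
from sklearn.cluster import KMeans

prices = np.array([[4.0], [8.0], [15.0], [21.0], [24.0], [25.0], [28.0], [34.0]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(prices)

# Cluster quality: how far each point lies from its cluster centroid.
centroids = kmeans.cluster_centers_
distances = np.linalg.norm(prices - centroids[kmeans.labels_], axis=1)
print("average distance to centroid:", distances.mean().round(2))

# Reduction: the whole column can be summarized by the centroids alone.
print("cluster centers:", centroids.ravel().round(2))
```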
• Data Reduction in Clustering: Once clusters are identified, the data can be represented by just the cluster centers, which reduces the size of the data.

Sampling:

• Sampling: Reduces large datasets by selecting a smaller random subset of data.
• Types of Sampling:
   o Simple Random Sampling: Selects a random sample of data, either without replacement or with replacement.
   o Cluster Sampling: Groups data into clusters (e.g., geographical locations), then samples entire clusters.
   o Stratified Sampling: Divides data into strata (groups) and samples from each stratum to ensure representation of different segments.
• Advantage: Sampling reduces the data size significantly, often providing good estimates with less computation.

Data Cube Aggregation:

• Data Cubes: Store summarized data from different perspectives, allowing fast access to precomputed aggregate values.
• Aggregation: For example, sales data might be aggregated by year rather than by quarter to reduce data size while preserving important information.
• Cuboids: A data cube can be seen as a set of cuboids at different abstraction levels, from the most detailed to the most summarized (the apex cuboid).
• Data Reduction: Aggregating data in cubes reduces its size while keeping the essential patterns, making it easier to analyze.

Data Transformation Strategies:

1. Smoothing:
   o Removes noise (irrelevant or random variations) from data.
   o Techniques: Binning, Regression, Clustering.
2. Attribute Construction (Feature Construction):
   o Creates new attributes by combining or transforming existing ones to improve the mining process.
3. Aggregation:
   o Summarizes data by combining it into higher-level groups (e.g., daily sales data can be aggregated into monthly or annual totals).
   o Used in creating data cubes, which are used for analysis at multiple levels.
4. Normalization:
   o Scales data so that it falls within a smaller range, like 0 to 1 or -1 to 1.
   o This helps avoid bias caused by attributes with larger ranges, making the data comparable.
5. Discretization:
   o Converts continuous data (like age) into categorical intervals (like 0-10, 11-20, etc.) or labels (youth, adult, senior).
   o Helps in simplifying data and making patterns easier to understand.
6. Concept Hierarchy Generation for Nominal Data:
   o Generalizes attributes into higher-level concepts. For example, instead of "street," you could use "city" or "country."
   o This helps group similar data together, which is useful in mining for patterns at different levels of abstraction.

Discretization Types:

• Supervised Discretization: Uses class information to split data into intervals.
• Unsupervised Discretization: Uses no class information; data is split based on the attribute values alone.
• Top-Down Discretization: Starts with a few split points and divides data into intervals recursively.
• Bottom-Up Discretization: Starts by merging small intervals and then recursively refines them.

Data Transformation Techniques:

• Normalization is essential to remove dependency on units (like meters vs. inches or kilograms vs. pounds).
• It ensures all attributes have similar importance when analyzing data (e.g., in classification or clustering).

Types of Normalization:

1. Min-Max Normalization:
   o Scales values to a specific range, such as [0, 1].
   o Formula: v' = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
2. Z-Score Normalization:
   o Normalizes values based on the mean and standard deviation.
   o Formula: v' = (vi − meanA) / stdA
   o It is useful when there are outliers or when the minimum and maximum values are unknown.
3. Decimal Scaling:
   o Normalizes by moving the decimal point of values (v' = vi / 10^j).
   o The number of places moved (j) depends on the largest absolute value in the dataset.
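A short numpy sketch of these three normalizations, using the income figures from the worked example that follows (minimum $12,000, maximum $98,000, mean $54,000, standard deviation $16,000); the code itself is only an illustration.

```python
# Normalization sketch (illustrative): min-max, z-score, and decimal scaling
# applied to the income value used in the worked example below.
import numpy as np

v = 73_600.0
min_a, max_a = 12_000.0, 98_000.0        # observed min/max of the attribute
mean_a, std_a = 54_000.0, 16_000.0       # mean and standard deviation
new_min, new_max = 0.0, 1.0              # target range for min-max scaling

min_max = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
z_score = (v - mean_a) / std_a
j = int(np.ceil(np.log10(max_a)))        # digits needed so the scaled value is < 1
decimal_scaled = v / 10 ** j

print(round(min_max, 3))        # 0.716
print(round(z_score, 3))        # 1.225
print(decimal_scaled)           # 0.736
```

The first two printed values match the min-max and z-score examples below.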
Example for Normalization:

• Min-Max Example: If the minimum income is $12,000 and the maximum is $98,000, a value of $73,600 maps to (73,600 − 12,000) / (98,000 − 12,000) = 0.716 in the [0, 1] range.
• Z-Score Example: If the average income is $54,000 and the standard deviation is $16,000, a value of $73,600 maps to (73,600 − 54,000) / 16,000 = 1.225.

Frequent Patterns in Sales:

• As a sales manager, you want to know which products customers are likely to buy together. This helps you make recommendations.
• A frequent pattern is a combination of items that frequently appear together in transactions, such as milk and bread being bought together.

Types of Frequent Patterns:

• Itemsets: Combinations of items, like {milk, bread}.
• Subsequences: Items bought in a sequence, like buying a PC, then a digital camera, then a memory card.
• Substructures: More complex patterns, like graphs or trees.

Frequent Pattern Mining:

• This is the process of finding common patterns, associations, or relationships in large data sets. It is useful for tasks like data analysis and product recommendations.

Market Basket Analysis:

• One common method is Market Basket Analysis, where you find items that are often bought together. For example, if someone buys a computer, they might also buy antivirus software.
• This helps retailers make decisions, like arranging items together in the store to increase sales.

Association Rules:

• These are rules that describe how items are related, for example: "If a customer buys a computer, they are likely to also buy antivirus software."
• Support: The percentage of transactions that include both items (e.g., 2% of all transactions include both a computer and antivirus software).
• Confidence: The likelihood that a customer buying one item will also buy the other (e.g., 60% of customers who buy a computer also buy antivirus software).

Frequent Itemsets:

• A frequent itemset is a set of items that appear together often.
• The support count is the number of transactions in which that set of items appears.
• Frequent k-itemset: A set of k items (e.g., {computer, antivirus software} is a 2-itemset).
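A tiny sketch of the support and confidence calculations above; the ten transactions are invented, so the printed percentages are illustrative rather than the 2% and 60% example figures mentioned in the text.

```python
# Support and confidence on a toy transaction list (invented data).
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"computer", "mouse"},
    {"printer", "paper"},
    {"mouse"},
    {"computer", "antivirus"},
    {"paper"},
    {"computer"},
    {"printer"},
]

both = sum(1 for t in transactions if {"computer", "antivirus"} <= t)
computer = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)   # fraction of all transactions containing both items
confidence = both / computer         # fraction of computer buyers who also bought antivirus

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# With this toy data: support = 30%, confidence = 50%.
```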
Closed and Maximal Itemsets:

• A closed frequent itemset is a frequent itemset that has no proper superset with the same support count.
• A maximal frequent itemset is a frequent itemset for which none of its proper supersets are frequent.
• Closed itemsets provide complete support information, while maximal itemsets give only partial information but are simpler to handle.

Challenges in Mining Frequent Patterns:

• Mining all possible frequent itemsets can result in a huge number of combinations, making it computationally expensive.
• To solve this, the concepts of closed frequent itemsets and maximal frequent itemsets are used to reduce the number of combinations.

Overview of Apriori Algorithm for Mining Frequent Itemsets:

1. Apriori Algorithm Basics:
   o Apriori is a popular algorithm used to find frequent itemsets (combinations of items that appear together frequently in transactions).
   o It uses a level-wise search where it first finds frequent 1-itemsets, then uses those to find frequent 2-itemsets, and so on.
   o The algorithm continues this process until no more frequent itemsets are found.
2. Apriori Property:
   o The key property is that if a set of items is frequent, all of its subsets must also be frequent.
   o Equivalently, if an itemset is not frequent, then any itemset that contains it cannot be frequent either.
   o This helps reduce the search space by eliminating unnecessary candidate itemsets.
3. Steps in the Algorithm:
   o Step 1: Find all frequent 1-itemsets by counting how often each item appears in the transactions.
   o Step 2: For each subsequent step, generate candidate itemsets by combining frequent itemsets from the previous step (the "join" step).
   o Step 3: Once candidate itemsets are generated, check whether they meet the minimum support threshold. Those that do are frequent and are used in the next iteration.
4. Join and Prune Steps:
   o Join Step: Candidate itemsets for the next iteration are generated by joining pairs of frequent itemsets from the previous iteration.
   o Prune Step: After generating the candidate itemsets, check whether any of their subsets are not frequent (using the Apriori property). If any subset is not frequent, the candidate itemset is pruned (removed) to avoid unnecessary work.
5. Pruning Example:
   o Suppose we are generating candidate 3-itemsets (C3) from the frequent 2-itemsets (L2). If a candidate 3-itemset has a subset (a 2-itemset) that is not frequent, the candidate 3-itemset can be removed.
6. Algorithm Efficiency:
   o The use of the Apriori property makes the algorithm more efficient by reducing the number of candidate itemsets that need to be considered.
7. Pseudocode:
   o The algorithm works through these steps iteratively, using the frequent itemsets from one iteration to generate candidates for the next.
   o Important Procedures:
      ▪ apriori_gen: Generates candidate itemsets.
      ▪ has_infrequent_subset: Checks if a candidate itemset has any subset that isn't frequent.

Summary:

The Apriori algorithm is a method for finding frequent itemsets in transaction databases. It uses the Apriori property to reduce the search space, and it works by joining frequent itemsets, pruning infeasible candidates, and scanning the database to count the support of candidate itemsets.

1. Generating Association Rules

• Association rules are generated from the frequent itemsets found in a database. These rules identify relationships between items in transactions.
• Confidence is used to determine how strong an association rule is, and it is calculated as:
  confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A)
• The steps to generate association rules (a sketch follows this list):
  1. For each frequent itemset, find all of its nonempty subsets.
  2. For every subset, check whether the confidence of the corresponding rule meets the minimum threshold. If it does, generate the rule.
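To make the whole pipeline concrete, here is a minimal, self-contained Python sketch of a level-wise Apriori search followed by rule generation with a confidence check. It folds the apriori_gen join and the has_infrequent_subset prune into plain set operations rather than reproducing the textbook pseudocode, and the five transactions and both thresholds are invented toy values, not data from these notes.

```python
# Minimal Apriori + association-rule generation (illustrative only).
# Transactions and thresholds below are made-up toy values.
from itertools import combinations

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "cola"},
]
min_support_count = 2   # absolute support threshold
min_confidence = 0.7    # 70%

def support_count(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_count):
    """Level-wise search: frequent 1-itemsets, then 2-itemsets, and so on."""
    items = {item for t in transactions for item in t}
    frequent = {}                      # frozenset -> support count
    level = [frozenset([i]) for i in items]
    while level:
        # Count support and keep only itemsets meeting the threshold.
        counted = {c: support_count(c, transactions) for c in level}
        current = {c: n for c, n in counted.items() if n >= min_count}
        frequent.update(current)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        keys = list(current)
        k = len(keys[0]) + 1 if keys else 0
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, min_conf):
    """Generate A => B rules from each frequent itemset, keeping confident ones."""
    out = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[antecedent]
                if conf >= min_conf:
                    out.append((set(antecedent), set(itemset - antecedent), conf))
    return out

freq = apriori(transactions, min_support_count)
for a, b, conf in rules(freq, min_confidence):
    print(f"{a} => {b}  (confidence {conf:.0%})")
```

On this toy data the sketch prints rules such as {butter} => {bread} with 100% confidence; swapping in real transactions only requires changing the list at the top.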
• Frequent itemsets automatically satisfy the minimum support condition.
• The result is a list of association rules that meet the confidence threshold.

2. Example of Generating Association Rules

• Example with the frequent itemset X = {I1, I2, I5}:
   o The nonempty proper subsets of X are: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
   o The generated association rules with their confidence:
      ▪ {I1, I2} ⇒ I5 (50% confidence)
      ▪ {I1, I5} ⇒ I2 (100% confidence)
      ▪ {I2, I5} ⇒ I1 (100% confidence)
      ▪ I1 ⇒ {I2, I5} (33% confidence)
      ▪ I2 ⇒ {I1, I5} (29% confidence)
      ▪ I5 ⇒ {I1, I2} (100% confidence)
• If the minimum confidence threshold is 70%, only the rules with confidence ≥ 70% are kept.

3. Improving Apriori Algorithm Efficiency

• There are various methods to make the Apriori algorithm faster:
   1. Hash-based Technique:
      ▪ Use a hash table to map itemsets and reduce the size of candidate itemsets.
      ▪ If an itemset's bucket count is below the support threshold, it cannot be frequent and is discarded.
   2. Transaction Reduction:
      ▪ If a transaction does not contain any frequent itemsets, it is removed from future scans.
   3. Partitioning:
      ▪ Divide the database into partitions and find frequent itemsets within each partition.
      ▪ After finding the local frequent itemsets, check them globally to find the overall frequent itemsets.
   4. Sampling:
      ▪ Use a random sample of the data to find frequent itemsets quickly.
      ▪ If the sample contains all the frequent itemsets, only one scan of the data is needed.
      ▪ If not, a second scan is performed to find any missing itemsets.
   5. Dynamic Itemset Counting:
      ▪ Candidate itemsets are added dynamically as the database is being scanned, improving efficiency.

• Problem with Apriori:
   1. Apriori is a well-known algorithm for mining frequent itemsets (sets of items that frequently occur together in transactions).
   2. However, it has two main problems:
      ▪ It can generate a huge number of candidate itemsets, making it inefficient.
      ▪ It requires repeatedly scanning the entire database to count the support of each candidate itemset, which is time-consuming.
• FP-growth (Frequent Pattern Growth):
   1. FP-growth is an alternative method that aims to avoid the costly candidate generation process.
   2. It uses a divide-and-conquer approach, which involves two key steps:
      ▪ Step 1: Compress the database into a Frequent Pattern Tree (FP-tree) that retains the itemset relationships.
      ▪ Step 2: Divide the FP-tree into smaller conditional databases, each associated with a specific frequent item (called a "pattern fragment").
      ▪ By doing this, only smaller subsets of data are needed to find frequent itemsets, reducing the search space and improving efficiency.
• How FP-growth works:
   1. First database scan: Find the frequent individual items and their counts, then sort them in descending order of frequency.
   2. Create the FP-tree: After sorting the items in each transaction, construct the FP-tree. Transactions that share items often share common prefixes in the tree.
   3. Item header table: Links all occurrences of an item in the FP-tree, making it easy to find the paths in which that item occurs.
   4. Mining the FP-tree:
      ▪ Start from the least frequent items (to improve efficiency).
      ▪ Create a conditional pattern base for each item, which includes the paths in the tree that contain the item.
      ▪ Build a new FP-tree for each item and continue mining recursively.
• Benefits of FP-growth:
   1. It is much faster and more scalable than the Apriori algorithm, especially when the database is large.
   2. It avoids scanning the database repeatedly, reducing the overall computational cost.
   3. It mines both short and long frequent itemsets efficiently.
• Alternative: Vertical Data Format (Eclat Algorithm):
   1. Instead of using the horizontal transaction format, Eclat uses a vertical data format.
   2. In this format, each item has a list of transaction IDs (TIDs) in which it appears.
   3. To find frequent itemsets, the algorithm intersects the TID lists of items (a minimal sketch appears at the end of these notes).
   4. This method avoids rescanning the database and directly counts the support of itemsets from the TID lists.
• Mining Closed Itemsets:
   1. Closed frequent itemsets are frequent itemsets that have no superset with the same support. Mining these can significantly reduce the number of itemsets generated.
   2. There are strategies to efficiently mine closed itemsets, including:
      ▪ Item merging: If a frequent itemset can be merged with another to form a closed itemset, there is no need to search for itemsets that contain the first one but not the second.
      ▪ Sub-itemset pruning: If a frequent itemset is a subset of an already found closed itemset with the same support, it can be discarded.
      ▪ Item skipping: Items that appear in many itemsets can be skipped to avoid unnecessary calculations.
• Efficient Closure Checking:
   1. When a new frequent itemset is found, it is important to check whether it is a closed itemset.
   2. Superset checking: If a new itemset is a superset of a closed itemset with the same support, it can be discarded.
   3. Subset checking: If a new itemset is a subset of a closed itemset with the same support, it can be discarded.
   4. A pattern-tree structure helps track all found closed itemsets, making these checks faster.

By using these techniques, the mining process becomes more efficient, and the number of itemsets generated is reduced.
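Referring back to the vertical data format (Eclat) described above, here is a minimal sketch in which each item's TID list is a Python set and the support of a 2-itemset is the size of the intersection; the transactions are invented toy data.

```python
# Vertical (TID-list) representation sketch for the Eclat idea described above.
# Transactions are invented toy data; each item maps to the set of TIDs containing it.
from itertools import combinations

transactions = {
    100: {"milk", "bread"},
    200: {"milk", "bread", "butter"},
    300: {"bread", "butter"},
    400: {"milk", "cola"},
}
min_support = 2

# Build the vertical format: item -> TID set.
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# Support of a 2-itemset = size of the intersection of its members' TID lists,
# so no rescan of the original transactions is needed.
frequent_items = [i for i, tids in tidlists.items() if len(tids) >= min_support]
for a, b in combinations(sorted(frequent_items), 2):
    shared = tidlists[a] & tidlists[b]
    if len(shared) >= min_support:
        print(f"{{{a}, {b}}} appears in TIDs {sorted(shared)} (support {len(shared)})")
```

Deeper levels would intersect these TID lists again in the same way, so the support of any k-itemset can be read off without another database scan.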
