DM UNIT II
What Is An Itemset?
A set of items together is called an itemset. An itemset that contains k items is called a k-itemset. An itemset consists of one or more items. An itemset that occurs frequently is called a frequent itemset. Thus, frequent itemset mining is a data mining technique to identify the items that often occur together.
For Example, Bread and butter, Laptop and Antivirus software, etc.
A set of items is called frequent if it satisfies minimum threshold values for support and confidence. Support measures how often the items are purchased together in a single transaction, while confidence measures how often one item is purchased given that the other has already been purchased.
For the frequent itemset mining method, we consider only those itemsets which meet the minimum support and confidence threshold requirements. Insights from these mining algorithms offer many benefits, including cost-cutting and improved competitive advantage.
There is a tradeoff between the time taken to mine the data and the volume of data to be mined. A good frequent mining algorithm mines the hidden patterns of itemsets in a short time and with low memory consumption.
PROBLEM
When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a distinctive
list, depending on one’s needs and preferences. A housewife might buy healthy ingredients for a family dinner,
while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in
several ways. If there is a pair of items, X and Y, that are frequently bought together:
Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.
Promotional discounts could be applied to just one out of the two items.
Advertisements on X could be targeted at buyers who purchase Y.
X and Y could be combined into a new product, such as having Y in flavors of X.
While we may know that certain items are frequently bought together, the question is, how do we
uncover these associations?
Besides increasing sales profits, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to co-occur (co-morbidity) can help to improve patient care and medicine prescription.
Definition
Association rules analysis is a technique to uncover how items are associated with each other. Association rule
mining, at a basic level, involves the use of machine learning models to analyze data for patterns, or co-
occurrences, in a database. It identifies frequent if-then associations, which themselves are the association
rules.
When applied to supermarket transaction data, this means examining customer behaviour in terms of the purchased products: association rules describe how often the items are purchased together.
We can understand it by taking the example of a supermarket, where all products that are frequently purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter, eggs, or milk, so these products are stored on the same shelf or mostly nearby.
Methods for searching data for frequent if-then patterns / Measures of the effectiveness of association rules
Association rules are created by searching the data for frequent if-then patterns and then using the criteria support and confidence to measure the strength of a given rule and identify the most important relationships.
Measure 1: Support.
Support is an indication of how frequently the items appear in the data. It is defined as the fraction of the transactions T that contain the itemset X:
Support(X) = (Number of transactions containing X) / (Total number of transactions)
This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears.
In Table, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance,
the support of {apple, beer, rice} is 2 out of 8, or 25%.
Measure 2: Confidence.
It refers to the number of times a given rule turns out to be true in practice, or how often the items X and Y occur together in the dataset given that X has occurred. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X -> Y) = Support(X and Y) / Support(X)
This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured
by the proportion of transactions with item X, in which item Y also appears. In Table 1, the confidence of
{apple -> beer} is 3 out of 4, or 75%.
One drawback of the confidence measure is that it might misrepresent the importance of an association. This
is because it only accounts for how popular apples are, but not beers. If beers are also very popular in general,
there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the
confidence measure. To account for the base popularity of both constituent items, we use a third measure
called lift.
Measure 3: Lift.
A third metric, called lift, can be used to compare the confidence of a rule with its expected confidence, that is, how often the if-then statement would be expected to hold if X and Y were independent. It is the ratio of the observed support of X and Y together to the support expected if X and Y were independent of each other:
Lift(X -> Y) = Confidence(X -> Y) / Support(Y) = Support(X and Y) / (Support(X) × Support(Y))
This says how likely item Y is purchased when item X is purchased, while controlling for how popular item
Y is. In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items. A lift value greater
than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item
Y is unlikely to be bought if item X is bought.
o If Lift = 1: the occurrences of the antecedent and the consequent are independent of each other.
o If Lift > 1: the two itemsets are positively dependent on each other; the lift value gives the degree of dependence.
o If Lift < 1: one item is a substitute for the other, meaning that one item has a negative effect on the other.
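To make these three measures concrete, here is a small Python sketch that computes support, confidence, and lift for a rule X -> Y over a toy list of baskets. The basket data below is made up for illustration and is not the Table 1 referred to above.

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # Of the transactions containing X, the share that also contain Y
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    # Confidence of X -> Y, controlled for how popular Y is on its own
    return confidence(x, y, transactions) / support(y, transactions)

# Hypothetical basket data
baskets = [{"apple", "beer"}, {"apple", "beer", "rice"}, {"apple", "beer"},
           {"apple"}, {"beer"}, {"rice"}, {"milk"}, {"beer", "rice"}]

print(support({"apple"}, baskets))               # 4/8 = 0.5
print(confidence({"apple"}, {"beer"}, baskets))  # 3/4 = 0.75
print(lift({"apple"}, {"beer"}, baskets))        # 0.75 / (5/8) = 1.2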
Support focuses on identifying frequent itemsets, while confidence measures the strength of association
between antecedent and consequent itemsets in association rules.
Association rules are calculated from itemsets, which are made up of two or more items. If rules are built
from analyzing all the possible itemsets, there could be so many rules that the rules hold little meaning. With
that, association rules are typically created from rules well-represented in data.
A rule may show a strong correlation in a data set because it appears very often but may occur far less
when applied. This would be a case of high support, but low confidence.
Conversely, a rule might not particularly stand out in a data set, but continued analysis shows that it occurs
very frequently. This would be a case of high confidence and low support. Using these measures helps
analysts separate causation from correlation and allows them to properly value a given rule.
The frequent pattern mining algorithm is one of the most important techniques of data mining to discover relationships between different items in a dataset. These relationships are represented in the form of association rules. It helps to find the regularities in data.
FPM has many applications in the fields of data analysis, software bug detection, cross-marketing, sales campaign analysis, market basket analysis, etc.
Frequent itemset or pattern mining is broadly used because of its wide applications in mining association rules, correlations, graph patterns constrained by frequent patterns, sequential patterns, and many other data mining tasks.
Frequent itemsets discovered through Apriori have many applications in data mining tasks. Finding interesting patterns in the database, discovering sequences, and mining association rules are the most important of them.
In frequent itemset mining, the process of discovering sets of items that frequently occur together in a dataset,
there are two main approaches: one with candidate generation and one without candidate generation. These
approaches are typically associated with two popular algorithms:
With Candidate Generation (e.g., Apriori):
Apriori Algorithm: This algorithm is one of the earliest and most well-known frequent itemset mining algorithms. It uses candidate generation as a key step.
Candidate Itemsets: In Apriori, candidate itemsets are generated at each iteration. The algorithm starts with
single items and gradually generates larger itemsets by joining frequent itemsets from the previous level.
These candidate itemsets are then pruned based on the Apriori property, which states that if an itemset is
infrequent, none of its supersets can be frequent.
Support Counting: The support (frequency) of these candidate itemsets is then counted in the dataset to
identify frequent itemsets (those that meet a predefined support threshold).
Pros: Apriori is easy to understand and implement. It works well for a wide range of datasets and support
thresholds.
Cons: The generation of candidate itemsets and multiple passes over the dataset can make it inefficient for
large datasets.
Without Candidate Generation (e.g., FP-growth):
FP-growth Algorithm: FP-growth (Frequent Pattern growth) is an alternative approach to frequent itemset
mining that does not rely on candidate generation.
FP-tree Structure: FP-growth builds an FP-tree (Frequent Pattern tree) from the dataset, which encodes
itemset frequencies compactly.
Conditional FP-trees: After building the FP-tree, it extracts frequent itemsets directly from the tree structure
using a recursive process, without generating candidate itemsets explicitly.
Apriori Algorithm – Frequent Pattern Algorithms
The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It was later improved by R. Agrawal and R. Srikant and came to be known as Apriori. This algorithm uses two steps
“join” and “prune” to reduce the search space. It is an iterative approach to discover the most frequent
itemsets.
Apriori says:
If the support of an itemset I, written P(I), is less than the minimum support threshold, then I is not frequent.
If an item A is added to the itemset I, the resulting itemset I + A cannot occur more frequently than I; so if P(I + A) is less than the minimum support threshold, then I + A is not frequent either.
Suppose we have a transaction dataset containing shopping baskets, and we want to find frequent itemsets
with a minimum support threshold of 3.
Transaction Dataset:
Checking P(I) against the minimum support threshold: in our example, P(I) = 3 for I = {Milk, Bread}, which is equal to the minimum support threshold of 3. Therefore, the itemset I = {Milk, Bread} is frequent.
Checking P(I + A) against the minimum support threshold: P(I + A) = 3 for I + A = {Milk, Bread, Eggs}, which also meets the minimum support threshold of 3. Therefore, the itemset I + A = {Milk, Bread, Eggs} is frequent as well.
In this example, both itemset I and itemset I + A are considered frequent because their support is equal to or greater than the minimum support threshold. If either of them had a support less than the minimum threshold, it would be considered infrequent.
If an itemset has a support value less than the minimum support, then all of its supersets will also fall below the minimum support and can thus be ignored. This property is called the antimonotone property.
1. Join Step: This step generates (k+1)-itemset candidates from the frequent k-itemsets by joining each itemset with the others.
2. Prune Step: This step scans the database to obtain the count of each candidate itemset. If a candidate does not meet the minimum support, it is regarded as infrequent and is removed. This step is performed to reduce the size of the candidate itemsets.
Steps In Apriori
Apriori algorithm is a sequence of steps to be followed to find the most frequent itemset in the given database.
This data mining technique follows the join and the prune steps iteratively until the most frequent itemset is
achieved. A minimum support threshold is given in the problem or it is assumed by the user.
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose occurrence satisfies min_sup is determined. Only those candidates whose count is greater than or equal to min_sup are taken ahead to the next iteration; the others are pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-itemset candidates are generated by combining the frequent 1-itemsets with each other.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The table will now contain only the 2-itemsets that satisfy min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration follows the antimonotone property: the 2-itemset subsets of each candidate 3-itemset must all satisfy min_sup. If all the 2-itemset subsets are frequent, the candidate is kept; otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the 3-itemsets with each other and pruning any candidate whose subsets do not meet the min_sup criteria. The algorithm stops when the largest frequent itemset has been found.
The minimum support threshold represents the minimum frequency or support count that an itemset must
meet to be considered "frequent" in the dataset.
Minimum Support Count = (Minimum Support Percentage / 100) * Total Number of Transactions
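As a quick illustration of this formula with made-up numbers (a fractional count is rounded up, since an itemset cannot appear in a fraction of a transaction):

import math
# Hypothetical example: 30% minimum support over 8 transactions
min_support_count = math.ceil((30 / 100) * 8)   # 0.30 * 8 = 2.4, rounded up to 3
print(min_support_count)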
Apriori helps in mining the frequent itemsets.
Result:
Only one itemset, {Eggs, Tea, Cold Drink}, is frequent, because this itemset meets the minimum support of 2.
The Apriori Algorithm: Pseudo Code
function apriori(dataset, min_sup):
    # Start with the candidate 1-itemsets and keep the frequent ones
    Ck = generate_candidate_1_itemsets(dataset)
    Lk = prune_infrequent_itemsets(Ck, min_sup)
    frequent_itemsets = Lk
    while Lk is not empty:
        Ck = generate_candidates_by_join(Lk)         # join step
        Lk = prune_infrequent_itemsets(Ck, min_sup)  # prune step
        frequent_itemsets = frequent_itemsets + Lk
    return frequent_itemsets
In this pseudo-code, Ck holds the candidate k-itemsets, Lk holds the candidates that survive the minimum-support pruning, and the loop repeats the join and prune steps until no new frequent itemsets are found.
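Below is a minimal runnable Python sketch of this level-wise join-and-prune procedure. The function and variable names and the basket data at the end are illustrative assumptions, not part of the original notes.

from itertools import combinations

def apriori(transactions, min_sup):
    # transactions: list of sets of items; min_sup: absolute support count
    # Pass 1: count single items and keep the frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in current}
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                # Prune step (Apriori property): every (k-1)-subset must be frequent
                if len(union) == k and all(
                        frozenset(sub) in current for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Support counting: one scan of the database for this level
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        current = {c for c, n in cand_counts.items() if n >= min_sup}
        frequent.update({c: cand_counts[c] for c in current})
        k += 1
    return frequent

# Hypothetical toy data for illustration
baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "bread", "eggs"}]
print(apriori(baskets, min_sup=2))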
Disadvantages
It requires high computation if the itemsets are very large and the minimum support is kept very low.
The entire database needs to be scanned.
Many methods are available for improving the efficiency of the algorithm.
Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and their corresponding counts. It uses a hash function for generating the table (a short code sketch follows after this list of methods).
Transaction Reduction: This method reduces the number of transactions scanned in later iterations. Transactions which do not contain any frequent items are marked or removed.
Partitioning: This method requires only two database scans to mine the frequent itemsets. It says that for
any itemset to be potentially frequent in the database, it should be frequent in at least one of the partitions of
the database.
Sampling: This method picks a random sample S from the database D and then searches for frequent itemsets in S. It is possible to miss a globally frequent itemset; this risk can be reduced by lowering the min_sup.
Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the
database during the scanning of the database.
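As a rough sketch of the hash-based technique mentioned above (in the spirit of the PCY refinement of Apriori): while the first pass counts single items, every pair occurring in a transaction is hashed into a bucket; a pair whose bucket total is below the minimum support cannot itself be frequent, so it can be skipped when the 2-itemset candidates are generated. The bucket count and hash function below are arbitrary illustrative choices.

from itertools import combinations

def bucket_counts(transactions, n_buckets=11):
    # Hash every pair seen in any transaction into one of n_buckets counters
    buckets = [0] * n_buckets
    for t in transactions:
        for a, b in combinations(sorted(t), 2):
            buckets[hash((a, b)) % n_buckets] += 1
    return buckets

def pair_may_be_frequent(pair, buckets, min_sup, n_buckets=11):
    # If the pair's bucket total is below min_sup, the pair itself cannot be frequent
    a, b = sorted(pair)
    return buckets[hash((a, b)) % n_buckets] >= min_sup

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}]
buckets = bucket_counts(baskets)
print(pair_may_be_frequent({"milk", "bread"}, buckets, min_sup=2))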
PARTITIONING
The partitioning algorithm finds the frequent elements on the basis of partitioning the database into n parts. It overcomes the memory problem for a large database which does not fit into main memory, because the small parts of the database easily fit into main memory. The algorithm works in two passes.
A partition concept has been proposed to increase the execution speed with minimum cost. For each itemset size, that is 1-itemsets, 2-itemsets, 3-itemsets, etc., a separate partition is created as data is inserted into the table. Initially, the set of frequent 1-itemsets is found by scanning the database: the number of occurrences of each item is obtained from the partition of that particular item using a pointer, and the items satisfying the minimum support count are included in the frequent 1-itemsets.
It should be noted that if the minimum support for transactions in the whole database is given as a fraction min_sup, then the minimum support count for the transactions of a partition is min_sup multiplied by the number of transactions in that partition.
Example: suppose we have a transaction dataset containing 8 transactions.
Now, let's say we want to mine association rules and set a minimum support threshold (min_sup) of 30% for the entire dataset. This means that an itemset must appear in at least 30% of the 8 transactions to be considered frequent.
Minimum Support Count = 0.30 * 8 transactions = 2.4 (rounded up to 3 for practical purposes)
So, for the entire dataset, an itemset must appear in at least 3 transactions to be considered frequent.
Now, let's consider that we want to partition this dataset into two subsets based on some criteria (e.g., geographical
location or time period). We have two partitions, Partition A and Partition B, and each has its own set of transactions:
Partition A:
The minimum support count for each partition is determined based on the number of transactions in that partition. Let's calculate the minimum support for each partition:
To obtain the final set of frequent itemsets for the entire dataset, the locally frequent itemsets from Partition A and Partition B are combined into one set of global candidates. Since each partition has its own frequent itemsets based on its local minimum support threshold, one further scan of the full database is then used to verify which of these candidates meet the global minimum support; those are the frequent itemsets for the entire dataset.
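A small sketch of how the local thresholds for such partitions might be computed; the partition contents and numbers below are made up for illustration, and each partition would then be mined (for example with Apriori) using its own local count before the final verification scan.

import math

# Hypothetical partitions of a transaction database
partition_a = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}]
partition_b = [{"beer", "chips"}, {"beer"}, {"milk", "bread"}, {"beer", "chips"}]
min_sup_fraction = 0.30  # global minimum support expressed as a fraction

# The same fraction is applied to each partition's own number of transactions
local_min_a = math.ceil(min_sup_fraction * len(partition_a))  # 0.30 * 3 -> 1
local_min_b = math.ceil(min_sup_fraction * len(partition_b))  # 0.30 * 4 -> 2
print(local_min_a, local_min_b)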
DYNAMIC ITEMSET COUNTING (DIC)
This algorithm is also used to reduce the number of database scans. It is based upon the downward closure property and adds candidate itemsets at different points of time during the scan. Dynamic blocks, marked by start points, are formed from the database, and unlike the earlier Apriori-style techniques the set of candidates changes dynamically during the database scan. Unlike Apriori, it does not have to wait for the end of one level's scan to start the next level; counting of new candidates starts at the start label attached to each dynamic partition of candidate sets.
In this way it reduces the number of database scans needed to find the frequent itemsets by adding new candidates at any point of time during the run. However, it generates a large number of candidates, and computing their frequencies is the performance bottleneck, while the database scans themselves take only a small part of the runtime.
In the itemset lattice, DIC marks itemsets as follows:
o Solid box: confirmed frequent itemset - an itemset we have finished counting and that exceeds the support threshold minsupp.
o Solid circle: confirmed infrequent itemset - an itemset we have finished counting and that is below minsupp.
o Dashed box: suspected frequent itemset - an itemset we are still counting that exceeds minsupp.
o Dashed circle: suspected infrequent itemset - an itemset we are still counting that is below minsupp.
DIC Algorithm
Itemset counting refers to the process of counting the occurrences of specific combinations of items or attributes in a dataset.
Algorithm:
1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with dashed circles.
Leave all other itemsets unmarked.
2. While any dashed itemsets remain:
1. Read M transactions (if we reach the end of the transaction file, continue from the
beginning). For each transaction, increment the respective counters for the itemsets
that appear in the transaction and are marked with dashes.
2. If a dashed circle's count exceeds minsupp, turn it into a dashed square. If any
immediate superset of it has all of its subsets as solid or dashed squares, add a new
counter for it and make it a dashed circle.
3. Once a dashed itemset has been counted through all the transactions, make it solid
and stop counting it.
Itemset lattices: An itemset lattice contains all of the possible itemsets for a transaction database.
Each itemset in the lattice points to all of its supersets. When represented graphically, an itemset lattice can help us to understand the concepts behind the DIC algorithm.
Example: minsupp = 25% and M = 2.
Transaction Database:
TID  A  B  C
T1   1  1  0
T2   1  0  0
T3   0  1  1
T4   0  0  0
Applications Of Apriori Algorithm
Some fields where Apriori is used:
In Education Field: Extracting association rules in data mining of admitted students through characteristics and
specialties.
In the Medical field: For example, analysis of the patients' database.
In Forestry: Analysis of probability and intensity of forest fire with the forest fire data.
Apriori is used by many companies like Amazon in the Recommender System and by Google for the auto-complete
feature.
Conclusion
The Apriori algorithm is a simple, easily implemented algorithm that scans the database once per iteration.
It reduces the size of the candidate itemsets considerably, providing good performance.
Thus, data mining helps consumers and industries in the decision-making process.
Using Apriori needs the generation of candidate itemsets. These candidates may be very large in number if the number of itemsets in the database is huge.
Apriori needs multiple scans of the database to check the support of each itemset generated and this
leads to high costs.
These shortcomings can be overcome using the FP growth algorithm.
This algorithm is an improvement to the Apriori method. A frequent pattern is generated without
the need for candidate generation. FP growth algorithm represents the database in the form of a
tree called a frequent pattern tree or FP tree.
This tree structure will maintain the association between the itemsets. The database is
fragmented using one frequent item. This fragmented part is called “pattern fragment”. The
itemsets of these fragmented patterns are analyzed. Thus with this method, the search for
frequent itemsets is reduced comparatively.
Frequent pattern growth (FP-Growth) mines the itemsets, subsequences, and substructures that appear frequently in a dataset.
A frequent itemset refers to items that are commonly bought together. A frequent subsequence, such as the order in which a customer buys items over time, is called a frequent sequential pattern.
A substructure refers to different structural forms, such as subgraphs and subtrees; when these occur frequently they are called frequent structural patterns and can be analyzed to find relations with the items of frequent itemsets.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the database.
The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree
represents an item of the itemset.
The root node represents null while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Let us see the steps followed to mine the frequent pattern using the frequent pattern growth algorithm (a short code sketch of the tree construction follows after these steps):
#1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of 1-itemsets in the database is called
support count or frequency of 1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by null.
#3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the itemset in it. The itemset with the max count is taken at the top, the
next itemset with lower count and so on. It means that the branch of the tree is constructed with
transaction itemsets in descending order of count.
#4) The next transaction in the database is examined. Its items are ordered in descending order of count. If any itemset of this transaction is already present in another branch (for example from the 1st transaction), then this transaction's branch shares a common prefix starting from the root. This means that the common itemset is linked to a new node for the other itemset in this transaction.
#5) Also, the count of each itemset is incremented as it occurs in the transactions. Both the common node count and the new node count are increased by 1 as nodes are created and linked according to the transactions.
#6) The next step is to mine the created FP tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From this, traverse the paths in the FP tree. This path or these paths are called the conditional pattern base.
A conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the lowest node (the suffix).
#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
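The following compact Python sketch builds an FP tree along the lines of steps #1 to #5 above. The header table, node-links, and the mining of conditional pattern bases are omitted for brevity, and the basket data is made up for illustration.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count item frequencies and keep only the frequent items
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}

    # The root of the FP tree is labelled null
    root = FPNode(None, None)
    root.count = 0

    # Pass 2: insert each transaction with its items in descending order of count
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item in node.children:
                # Shared prefix: just increment the count of the existing node
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "bread", "eggs"}]
tree = build_fp_tree(baskets, min_sup=2)
print(list(tree.children))  # items that start the prefix paths below the root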
Summary:
1. In the first scan of the database, we find the support count (frequency) of each item.
2. If the frequency of items is less than the specified minimum support, we simply delete them from the
database.
3. The set of frequent items is sorted in the descending order of support.
4. Then, an FP-tree is constructed.
5. The initial node of the FP-tree is called the root node. It is labeled null.
6. The items in each transaction are processed in their sorted order as nodes are created.
7. As each transaction's itemset is inserted, a child node is created and the branch is extended below the root node.
8. If a branch encounters a common prefix with an existing path, we increment the counts of the nodes along that common prefix.
9. We repeat the process for each itemset until we reach the end of the itemsets in the transaction.
10. To facilitate tree traversal, we build an item header table so that each item points to its occurrences in the tree via a chain of node-links.
This algorithm needs to scan the database only twice when compared to Apriori which scans the
transactions for each iteration.
The pairing of items is not done in this algorithm and this makes it faster.
The database is stored in a compact version in memory.
It is efficient and scalable for mining both long and short frequent patterns.
FP Growth vs Apriori
Pattern Generation
FP Growth: generates patterns by constructing an FP tree.
Apriori: generates patterns by pairing the items into singletons, pairs and triplets.
Process
FP Growth: the process is faster; the runtime increases linearly with the number of itemsets.
Apriori: the process is comparatively slower; the runtime increases exponentially with the number of itemsets.
Memory Usage
FP Growth: a compact version of the database is saved in memory.
Apriori: the candidate combinations are saved in memory.
Conclusion
The Apriori algorithm is used for mining association rules. It works on the principle, “the non-empty subsets
of frequent itemsets must also be frequent”. It forms k-itemset candidates from (k-1) itemsets and scans the
database to find the frequent itemsets.
Frequent Pattern Growth Algorithm is the method of finding frequent patterns without candidate generation.
It constructs an FP Tree rather than using the generate and test strategy of Apriori. The focus of the FP Growth
algorithm is on fragmenting the paths of the items and mining frequent patterns.
There are two forms of data analysis that can be used for extracting models describing important classes or to
predict future data trends. These two forms are as follows −
Classification
Prediction
Classification models predict categorical class labels; and prediction models predict continuous valued
functions. For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer
equipment given their income and occupation.
What is classification?
Following are the examples of cases where the data analysis task is Classification −
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze customers with a given profile to determine whether they will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.
What is prediction?
Following are the examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are asked to predict a numeric value. Therefore the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
Classification Examples:
Email Spam Detection: Classify emails as either spam or not spam (ham). Features may include the sender,
subject line, and content of the email.
Image Classification: Classify images into categories, such as recognizing whether an image contains a cat
or a dog.
Sentiment Analysis: Determine the sentiment (positive, negative, or neutral) of a piece of text, such as a
product review or a social media post.
Medical Diagnosis: Classify medical images, like X-rays or MRIs, to diagnose diseases or conditions like
cancer.
Credit Risk Assessment: Determine whether a loan applicant is high, medium, or low risk based on their
financial history, credit score, and other relevant factors.
Species Identification: Identify the species of a plant or animal based on characteristics like size, shape, and
color.
Fraud Detection: Detect fraudulent transactions in financial data by classifying transactions as either
legitimate or fraudulent.
Prediction Examples:
Stock Price Prediction: Predict the future price of a stock based on historical price data, trading volume, and
other financial indicators.
Weather Forecasting: Predict future weather conditions like temperature, precipitation, and wind speed
based on historical weather data and atmospheric variables.
Sales Forecasting: Predict future sales for a product or service based on historical sales data, seasonality, and
marketing efforts.
Demand Forecasting: Predict the demand for a product in a supply chain to optimize inventory and
production planning.
Energy Consumption Prediction: Predict future energy consumption in a building or a city based on
historical data and weather conditions.
Traffic Flow Prediction: Predict traffic congestion and flow patterns on roads to optimize transportation
systems.
Healthcare Outcome Prediction: Predict patient outcomes or disease progression based on medical data,
such as patient records and test results.
Crop Yield Prediction: Predict agricultural crop yields based on factors like soil quality, weather conditions,
and farming practices.
Customer Lifetime Value Prediction: Predict the future value of a customer to a business based on their
past behavior and purchase history.
With the help of the bank loan application that we have discussed above, let us understand the working of
classification. The Data Classification process includes two steps −
Building the Classifier or Model
Using Classifier for Classification
Dataset: You have a dataset of emails, each labeled as "spam" or "not spam," and you want to build a model
that can classify new, unlabeled emails as spam or not spam based on the email's content and other features.
Data Preparation:
Feature Engineering:
Extract relevant features from the text data. Common text features include word frequency counts (using
techniques like TF-IDF), email length, number of links, etc.
Select the most important features if you have many, using techniques like feature selection.
Data Splitting:
Split your dataset into a training set and a testing set. This allows you to train the model on one portion of the
data and evaluate its performance on another.
Model Selection:
Select a classification algorithm. In this example, we'll use a simple one: Logistic Regression.
Model Training:
Train the chosen model on the training data. In Python with scikit-learn, it can be as simple as:
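The training snippet itself is not reproduced in these notes; a minimal sketch with scikit-learn might look like the following. The email texts, labels, and variable names are hypothetical, chosen only to make the example self-contained.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: raw email texts and their spam / not-spam labels
emails = ["buy now discount offer", "meeting agenda for monday",
          "cheap pills buy now", "project status report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

# Feature engineering: convert the text into TF-IDF word-frequency features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

# Data splitting: hold out part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Model training: fit a Logistic Regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict labels for the held-out test set
y_pred = model.predict(X_test)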
Model Evaluation:
Use the testing data to evaluate the model's performance. Common metrics for classification include accuracy, precision, recall, F1-score, and the confusion matrix.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Calculate accuracy on the held-out test set, then show per-class metrics
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Depending on the model's performance, you may need to fine-tune hyperparameters or consider more
advanced models.
Deployment:
If the model meets your requirements and performs well, you can deploy it in a production environment to
classify incoming emails.
This is a simplified example, but it illustrates the general steps involved in building a classification model in
data mining. Real-world problems may involve more complex data preprocessing, feature engineering, and
model selection. Additionally, handling imbalanced datasets and conducting cross-validation are important
considerations in classification tasks.
Classification rules are the decision criteria or patterns that a machine learning classifier, such as Logistic
Regression, Decision Trees, or Random Forests, learns from the training data to assign class labels to new or
unseen data points. These rules are essentially the "if-then" conditions that help the classifier make predictions.
Let's take an example with a simplified rule learned from a Logistic Regression classifier for email spam
classification:
If the sum of the frequencies of the words "buy," "now," and "discount" in an email is greater than 0.5, then
classify the email as "spam."
In this example, the classifier has learned a rule based on the presence and frequency of specific words in an
email. If the condition is met (the sum of the word frequencies exceeds 0.5), the email is classified as "spam."
Otherwise, it's classified as "not spam."
The specific rules learned by a classifier can vary significantly depending on the algorithm, the features used,
and the complexity of the problem. These rules are determined during the training phase when the classifier
adjusts its parameters to best fit the training data and minimize prediction errors.
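As an illustration of where such rule-like conditions come from, the sketch below continues the hypothetical spam example above (it assumes the vectorizer and model fitted there) and lists the word features whose logistic regression coefficients push most strongly toward the "spam" class.

import numpy as np

# Largest positive coefficients behave like "if these words appear, then spam" conditions
feature_names = vectorizer.get_feature_names_out()
weights = model.coef_[0]
top = np.argsort(weights)[-5:][::-1]
for i in top:
    print(feature_names[i], round(weights[i], 3))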
Data preparation for Classification and Prediction / Issues
The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −
Data Cleaning − Data cleaning involves removing the noise and the treatment of missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
Relevance Analysis − The database may also have irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
Data Transformation and Reduction − The data can be transformed by any of the following methods.
Normalization − The data is transformed using normalization, which scales the values of an attribute so that they fall within a small specified range (for example, 0.0 to 1.0).
Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use the concept hierarchies.
Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
Here are the criteria for comparing the methods of Classification and Prediction −
Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
Speed − This refers to the computational cost in generating and using the classifier or predictor.
Robustness − It refers to the ability of the classifier or predictor to make correct predictions from noisy data.
Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.