Generalized Sequential Pattern (GSP) Mining in Data Mining
Last Updated: 02 Feb, 2023
GSP (Generalized Sequential Pattern) is an important algorithm in data mining. It is used for sequence mining from large databases. Like other early sequence mining algorithms, it is based on the Apriori algorithm. GSP uses a level-wise paradigm to find all the sequence patterns in the data, and it makes multiple passes over the database. The first pass counts the support of every item and keeps the frequent ones; these are the frequent sequences of length 1, called 1-sequences. In each pass, GSP removes all the non-frequent sequences based on a threshold called the minimum support: only sequences whose support is at least the minimum support count are kept. The frequent 1-sequences are used to generate the candidate 2-sequences. At the end of the second pass, GSP has found all frequent 2-sequences, which are used to generate the candidate 3-sequences. The process is repeated, level by level, until no more frequent sequences are found.
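The level-wise loop described above can be sketched in a few lines. This is a simplified illustration, not the full GSP algorithm: it assumes every element of a sequence holds exactly one item, it skips the candidate pruning step, and the names `is_subsequence`, `support` and `gsp_single_items` are hypothetical helpers chosen for this example.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def gsp_single_items(database, min_support):
    """Level-wise mining; simplification: each element holds exactly one item."""
    items = sorted({item for seq in database for item in seq})
    # Pass 1: frequent 1-sequences.
    frequent = [(i,) for i in items if support((i,), database) >= min_support]
    result = {}
    while frequent:
        for pattern in frequent:
            result[pattern] = support(pattern, database)
        # Candidate generation: join s1 and s2 when s1 without its first item
        # equals s2 without its last item.
        candidates = {s1 + s2[-1:] for s1 in frequent for s2 in frequent
                      if s1[1:] == s2[:-1]}
        # Next pass: keep only the candidates that reach the minimum support.
        frequent = sorted(c for c in candidates
                          if support(c, database) >= min_support)
    return result


database = [("b", "x", "c", "a"), ("a", "b", "q"), ("a", "u", "b")]
print(gsp_single_items(database, min_support=2))
# {('a',): 3, ('b',): 3, ('a', 'b'): 2}
```

The loop stops as soon as a pass produces no frequent sequences, which is the termination condition of the level-wise paradigm.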
Basics of Sequential Pattern (GSP) Mining:
- Sequence: A sequence is formally defined as an ordered list of items {s1, s2, s3, …, sn}. As the name suggests, the items occur one after another in a fixed order. It can be considered as the ordered record of the items purchased by one customer.
- Subsequence: A sequence obtained by deleting zero or more items from a sequence, while keeping the remaining items in their original order, is called a subsequence. Suppose {a, b, g, q, y, e, c} is a sequence. A subsequence of this can be {a, b, c} or {y, e}. Observe that the items of a subsequence are not necessarily consecutive in the sequence, but their order is preserved. Subsequences are extracted from the sequences of the database, and the generalized sequence patterns are found among them.
- Sequence pattern: A subsequence is called a pattern when it is found in multiple sequences, that is, when the number of sequences containing it is equal to or more than the “support” value. The goal of the GSP algorithm is to mine the sequence patterns from a large database of sequences. For example, given the sequences {b, x, c, a}, {a, b, q}, and {a, u, b}, the pattern <a, b> is contained in {a, b, q} and {a, u, b} but not in {b, x, c, a}, where b comes before a. Its support is therefore 2, and it is a sequence pattern if the support threshold is 2.
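The definitions above can be checked with a short sketch. It relies on `itertools.combinations`, which emits items in the same order as the input, so every tuple it produces is an order-preserving subsequence. The helper name `subsequences` is an assumption made for this example, and enumerating all subsequences is only practical for short sequences.

```python
from itertools import combinations


def subsequences(sequence, k):
    """All k-item subsequences of `sequence`; order is kept, gaps are allowed."""
    return set(combinations(sequence, k))


seq = ("a", "b", "g", "q", "y", "e", "c")
print(("a", "b", "c") in subsequences(seq, 3))  # True
print(("y", "e") in subsequences(seq, 2))       # True
print(("e", "y") in subsequences(seq, 2))       # False: order is not preserved

# Count how many sequences of the example database contain the pattern <a, b>.
database = [("b", "x", "c", "a"), ("a", "b", "q"), ("a", "u", "b")]
pattern_count = sum(("a", "b") in subsequences(s, 2) for s in database)
print(pattern_count)  # 2
```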
Sequential Pattern (GSP) Mining uses:
Sequential pattern mining is a technique used to identify patterns in sequential data, and GSP (Generalized Sequential Pattern) is one of the algorithms used for it. The goal of GSP mining is to discover patterns in data that occur over time, such as customer buying habits, website navigation patterns, or sensor data.
Some of the main uses of GSP mining include:
- Market basket analysis: GSP mining can be used to analyze customer buying habits and identify products that are frequently purchased together. This can help businesses to optimize their product placement and marketing strategies.
- Fraud detection: GSP mining can be used to identify patterns of behavior that are indicative of fraud, such as unusual patterns of transactions or access to sensitive data.
- Website navigation: GSP mining can be used to analyze website navigation patterns, such as the sequence of pages visited by users, and identify areas of the website that are frequently accessed or ignored.
- Sensor data analysis: GSP mining can be used to analyze sensor data, such as data from IoT devices, and identify patterns in the data that are indicative of certain conditions or states.
- Social media analysis: GSP mining can be used to analyze social media data, such as posts and comments, and identify patterns in the data that indicate trends, sentiment, or other insights.
- Medical data analysis: GSP mining can be used to analyze medical data, such as patient records, and identify patterns in the data that are indicative of certain health conditions or trends.
Methods for Sequential Pattern Mining:
- Apriori-based Approaches
- Pattern-Growth-based Approaches
Sequence Database: A database that consists of ordered elements or events is called a sequence database. Example of a sequence database:
S.No. | SID | Sequences
1. | 100 | <a(ab)(ac)d(cef)> or <a{ab}{ac}d{cef}>
2. | 200 | <(ad)c(bcd)(abe)>
3. | 300 | <(ef)(ab)(def)cb>
4. | 400 | <eg(adf)cbc>
Transaction: A sequence consists of elements, which are sometimes referred to as transactions.
For example, <a(ab)(ac)d(cef)> is a sequence, whereas (a), (ab), (ac), (d) and (cef) are the elements of the sequence.
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically for convenience.
For example, (cef) is an element that consists of 3 items: c, e and f. Since all three items belong to the same element, their order does not matter.
The order of the elements within a sequence does matter, unlike the order of items within the same transaction.
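One possible in-memory representation, shown here only as a sketch, is a list of elements where each element is a sorted tuple of items. The hypothetical `parse_sequence` helper below assumes that every item is a single character, as in the notation used in this article.

```python
def parse_sequence(text):
    """Parse '<a(ab)(ac)d(cef)>' into a list of elements (sorted tuples of items).

    Assumption: every item is a single character.
    """
    body = text.strip().strip("<>")
    elements = []
    i = 0
    while i < len(body):
        if body[i] == "(":
            end = body.index(")", i)
            # Items inside one element are unordered, so store them sorted.
            elements.append(tuple(sorted(body[i + 1:end])))
            i = end + 1
        else:
            # A bare item is an element of its own.
            elements.append((body[i],))
            i += 1
    return elements


print(parse_sequence("<a(ab)(ac)d(cef)>"))
# [('a',), ('a', 'b'), ('a', 'c'), ('d',), ('c', 'e', 'f')]

# Item order inside an element does not matter, element order does.
print(parse_sequence("<(fec)>") == parse_sequence("<(cef)>"))  # True
print(parse_sequence("<(ab)c>") == parse_sequence("<c(ab)>"))  # False
```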
k-length Sequence:
The number of items in a sequence is its length, denoted by k. A sequence of 2 items is called a 2-length sequence. This term comes into use while finding the 2-length candidate sequences. Examples of 2-length sequences are: {ab}, {(ab)}, {bc} and {(bc)}.
- {bc} denotes a 2-length sequence where b and c belong to two different transactions. This can also be written as {(b)(c)}.
- {(bc)} denotes a 2-length sequence where b and c are items belonging to the same transaction, and are therefore enclosed in the same parentheses. This can also be written as {(cb)}, because the order of items in the same transaction does not matter.
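As a small sketch of this definition, the length of a sequence is the total number of items over all of its elements. The representation (a list of tuples, one tuple per element) and the name `sequence_length` are assumptions made for this example.

```python
def sequence_length(sequence):
    """k = total number of items, summed over all elements of the sequence."""
    return sum(len(element) for element in sequence)


two_elements = [("b",), ("c",)]  # {bc}: b and c in different transactions
one_element = [("b", "c")]       # {(bc)}: b and c in the same transaction

print(sequence_length(two_elements))  # 2
print(sequence_length(one_element))   # 2
print(two_elements == one_element)    # False: same length, different sequences
```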
Support in k-length Sequence:
Support means frequency. The support of a k-length sequence is the number of sequences in the sequence database that contain it; a sequence is counted once even if it contains the pattern several times. The order of the elements is taken into account while finding the support.
Illustration:
Suppose we have 2 sequences in the database.
s1: <a(bc)b(cd)>
s2: <b(ab)abc(de)>
We need to find the support of {ab} and {(bc)}
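Finding support of {ab}:
Since a and b belong to different elements, their order matters: a must occur in an earlier element than b.
s1: <a(bc)b(cd)> contains a followed by b (the b in (bc)), so it is counted.
s2: <b(ab)abc(de)> begins with b, and that b comes before every a, so it cannot be used. However, the a in (ab) is followed by a b in a later element, so s2 also contains {ab} and is counted.
Hence, support of {ab} is 2.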
Finding support of {(bc)}:
Since b and c must be present in the same element, their order does not matter.
s1: <a(bc)b(cd)> contains the element (bc), so it is counted.
s2: <b(ab)abc(de)> seems to match, but it does not. b and c are present in different elements here, so we don't consider it.
Hence, support of {(bc)} is 1.
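Support counting for sequences whose elements may hold several items can be sketched as follows. The sketch uses the standard containment test without time constraints: each element of the pattern must be a subset of some element of the data sequence, and the matched elements must appear in the same order. The names `contains` and `support` are hypothetical.

```python
def contains(sequence, pattern):
    """Greedy containment test: match each pattern element to the earliest
    remaining element of `sequence` that includes all of its items."""
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not set(wanted) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True


def support(pattern, database):
    """Number of data sequences that contain `pattern` (each counted once)."""
    return sum(contains(seq, pattern) for seq in database)


s1 = [("a",), ("b", "c"), ("b",), ("c", "d")]                # <a(bc)b(cd)>
s2 = [("b",), ("a", "b"), ("a",), ("b",), ("c",), ("d", "e")]  # <b(ab)abc(de)>
database = [s1, s2]

print(support([("b", "c")], database))    # 1: only s1 has b and c in one element
print(support([("a",), ("b",)], database))  # 2: in s2, the a in (ab) precedes a later b
print(support([("b",), ("a",)], database))  # 1: only s2 has b before a
```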
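How to join Lk-1 with itself to give Ck?
Lk-1 is the final set of (k-1)-length sequences after pruning. After pruning, all the entries left in the set have support of at least the threshold. Two sequences s1 and s2 from Lk-1 can be joined if removing the first item of s1 and removing the last item of s2 leave the same sequence. When L1 is joined with L1 to give C2, every pair of frequent items x and y produces both {xy} and {(xy)}. The cases below join 2-length sequences from L2 to build candidates for C3.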
Case 1: Join {ab} and {ac}
s1: {ab}, s2: {ac}
Remove the first item (a) from s1 and the last item (c) from s2.
s1' = {b}, s2' = {a}
s1' and s2' are not the same, so s1 and s2 can't be joined.
Case 2: Join {ab} and {be}
s1: {ab}, s2: {be}
Remove the first item (a) from s1 and the last item (e) from s2.
s1' = {b}, s2' = {b}
s1' and s2' are exactly the same, so s1 and s2 can be joined.
s1 + s2 = {abe}
Case 3: Join {(ab)} and {be}
s1: {(ab)}, s2: {be}
Remove the first item (a) from s1 and the last item (e) from s2.
s1' = {(b)}, s2' = {(b)}
s1' and s2' are exactly the same, so s1 and s2 can be joined.
s1 + s2 = {(ab)e}
s1 and s2 are joined in such a way that the items belong to the correct elements or transactions.
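The three cases can be reproduced with a short sketch. Sequences are represented as tuples of elements, and each element is a sorted tuple of items; this representation and the names `drop_first`, `drop_last` and `join` are assumptions made for this example. The sketch covers the join of sequences with two or more items; generating C2 from L1 needs separate handling and is not shown.

```python
def drop_first(seq):
    """Remove the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def join(s1, s2):
    """Return the joined candidate, or None if s1 and s2 cannot be joined."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last item of s2 was an element of its own: append a new element.
        return s1 + ((last_item,),)
    # Otherwise it shared an element: merge it into the last element of s1.
    return s1[:-1] + (tuple(sorted(s1[-1] + (last_item,))),)


ab = (("a",), ("b",))     # {ab}
ac = (("a",), ("c",))     # {ac}
be = (("b",), ("e",))     # {be}
ab_same = (("a", "b"),)   # {(ab)}

print(join(ab, ac))       # None
print(join(ab, be))       # (('a',), ('b',), ('e',))   i.e. {abe}
print(join(ab_same, be))  # (('a', 'b'), ('e',))       i.e. {(ab)e}
```

Whether the last item of s2 becomes a new element or is merged into the last element of s1 is what keeps the items in the correct elements after the join.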
Pruning Phase: While building Ck (the candidate set of k-length sequences), we delete a candidate sequence that has a contiguous (k-1)-subsequence whose support count is less than the minimum support (threshold). When no time constraint such as a maximum gap is used, we also delete a candidate sequence that has any subsequence without minimum support.
{abg} is a candidate sequence of C3.
To check whether {abg} is a proper candidate without counting its support, we check the support of its subsequences.
Because the candidate sets are built incrementally (1-length, 2-length and so on), the supports of the 1-length and 2-length subsequences of a 3-length sequence are already known.
The 2-length subsequences of {abg} are: {ab}, {bg} and {ag}.
Check the support of all three subsequences. If any of them has support less than the minimum support, delete the sequence {abg} from the set C3; otherwise keep it.
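The pruning check can be sketched as below. It follows the example above and tests every (k-1)-subsequence obtained by deleting one item, which corresponds to mining without time constraints. The representation (tuples of element tuples) and the names `drop_one_item` and `survives_pruning` are assumptions made for this example.

```python
def drop_one_item(seq):
    """Yield every (k-1)-subsequence obtained by deleting a single item."""
    for i, element in enumerate(seq):
        for j in range(len(element)):
            rest = element[:j] + element[j + 1:]
            # Drop the element entirely if it becomes empty.
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]


def survives_pruning(candidate, frequent_prev):
    """Keep a candidate only if all its (k-1)-subsequences are frequent."""
    return all(sub in frequent_prev for sub in drop_one_item(candidate))


abg = (("a",), ("b",), ("g",))
ab, bg, ag = (("a",), ("b",)), (("b",), ("g",)), (("a",), ("g",))

print(sorted(drop_one_item(abg)) == sorted([ab, bg, ag]))  # True
print(survives_pruning(abg, {ab, bg, ag}))  # True: all three are frequent
print(survives_pruning(abg, {ab, bg}))      # False: {ag} is not frequent
```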
Challenges in Generalized Sequential Pattern Data Mining
GSP makes many passes over the database, one for each pattern length, and it can generate a very large number of candidate sequences. Both increase the computational effort needed to mine the frequent patterns. When the sequence database is very large and the patterns to be mined are long, GSP struggles to mine them efficiently.