
Data Mining Techniques


Data mining includes the utilization of refined data analysis tools to find previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, machine learning techniques, and mathematical algorithms. Thus, data mining incorporates analysis and prediction.

Data mining draws on methods and technologies from the intersection of machine learning, database management, and statistics. So what are the major data mining techniques, and what are the methods they use to make it happen?

In recent data mining projects, various major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. This data mining technique helps to classify data into different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled. For example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved. For example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities. For example, discrimination, classification, clustering, characterization, etc.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, and so on.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
2. Clustering:
Clustering is a division of information into groups of connected objects. Describing the data by a few clusters loses certain fine details but achieves simplification: the data is modeled by its clusters. Historically, data modeling by clusters is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications: for example, scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.

In other words, we can say that clustering analysis is a data mining technique to identify similar data. This technique helps to recognize the differences and similarities between the data.

3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is primarily a form of planning and modeling. For example, we might use it to project a certain cost, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the relationship between two or more variables in the given data set.
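As a small illustration of this idea (not part of the original text), the sketch below fits a linear regression with scikit-learn that projects a cost from two hypothetical factors, consumer demand and availability; the numbers are made up.

```python
# A minimal regression sketch: projecting cost from illustrative factors.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [consumer_demand, availability] -> observed cost
X = np.array([[100, 50], [150, 40], [200, 30], [250, 20], [300, 10]])
y = np.array([10.0, 12.5, 15.0, 17.5, 20.0])

model = LinearRegression().fit(X, y)

# Project the cost for a new combination of demand and availability
print(model.predict([[220, 25]]))
```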

4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern in the data set.

Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases.

The way the algorithm works is that you have various data. For example, a list of grocery items that customers have been buying over the last six months; the algorithm calculates the percentage of items being bought together.

These are the three major measurement techniques (a small computational sketch follows the list):

o Lift:
This measurement technique measures the strength of a rule relative to how often item B occurs on its own:
Lift = (Confidence) / ((Item B) / (Entire dataset))
o Support:
This measurement technique measures how often items A and B are bought together, relative to the entire dataset:
Support = (Item A + Item B) / (Entire dataset)
o Confidence:
This measurement technique measures how often item B is bought when item A is bought:
Confidence = (Item A + Item B) / (Item A)
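The following minimal sketch (illustrative only; the transactions are made up) computes the three measures for a single rule {bread} -> {milk}:

```python
# Support, confidence, and lift for the rule {A} -> {B} over toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

A, B = {"bread"}, {"milk"}
support_ab = support(A | B)           # (Item A + Item B) / (Entire dataset)
confidence = support_ab / support(A)  # (Item A + Item B) / (Item A)
lift = confidence / support(B)        # Confidence / ((Item B) / (Entire dataset))

print(support_ab, confidence, lift)
```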

5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behaviour. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset, and most real-world datasets contain outliers. Outlier detection plays a significant role in data mining and is valuable in numerous fields, such as network intrusion identification, credit or debit card fraud detection, and detecting outlying values in wireless sensor network data.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of criteria such as length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over time.

7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.

Data mining techniques are methods and processes used to discover useful patterns, trends, correlations, and insights from large datasets. Commonly used techniques include the following:

1. **Classification:** Classification is the process of assigning a data point to a predefined class or category based on its features. Common algorithms include decision trees, support vector machines, and neural networks.

2. **Regression:** Regression analysis is used to predict a continuous numeric value based on input features. Linear regression and polynomial regression are some regression techniques.

3. **Clustering:** Clustering is the process of grouping similar data points together into clusters or segments without predefined labels. K-means and hierarchical clustering are common techniques.

4. **Association Rule Mining:** Association rule mining is used to discover interesting associations or relationships between items in transactional data. The Apriori and FP-Growth algorithms are widely used for association rule mining.

5. **Anomaly Detection:** Anomaly detection identifies unusual or rare data points that deviate from the norm. Statistical methods and isolation forests are often used for anomaly detection.

6. **Text Mining:** Text mining focuses on extracting valuable information from textual data. Techniques include sentiment analysis, topic modeling, and named entity recognition.

7. **Time Series Analysis:** Time series analysis is used for data where the order of observations matters, such as stock prices or sensor readings collected over time.

8. **Dimensionality Reduction:** Dimensionality reduction techniques reduce the number of variables or features in a dataset while preserving the important information. Principal Component Analysis (PCA) is commonly used for this purpose.

9. **Neural Networks and Deep Learning:** Deep learning techniques, including neural networks, can learn complex patterns from large amounts of data, for example in image and speech recognition.

10. **Ensemble Methods:** Ensemble methods combine the predictions of multiple models to improve overall accuracy. Random forests and gradient boosting are common examples.

11. **Web Mining:** Web mining involves extracting information from web data, including web pages, server logs, and hyperlink structures.

12. **Spatial Data Mining:** Spatial data mining focuses on geographical and spatial data, often used in applications such as geographic information systems (GIS).

These data mining techniques are often applied in combination to solve real-world problems. The choice of technique depends on the nature of the data and the goals of the analysis.

Fp growth algorithm
The FP-Growth (Frequent Pattern Growth) algorithm is a popular and efficient data mining algorithm used for discovering frequent itemsets in transactional datasets, a key step in association rule mining tasks such as market basket analysis. The FP-Growth algorithm was introduced by Jiawei Han, Jian Pei, and Yiwen Yin.

Here's an overview of how the FP-Growth algorithm works:

1. **Data Preprocessing:** First, the transaction data is preprocessed to identify the frequent items, i.e. the items that meet a minimum support threshold; infrequent items are discarded.
2. **Building the FP-Tree:** The FP-Tree is the core data structure of the algorithm. It represents the transactions in a compressed form and is built in two passes over the data:

- In the first pass, the algorithm counts the support (the number of transactions containing an item) of every item and removes items that fall below the minimum support threshold.

- In the second pass, the algorithm builds the FP-Tree. Each transaction is used to create a path in the tree, with its frequent items ordered by descending support; transactions that share a prefix share the same nodes, and the node counts are incremented.

3. **Mining Frequent Itemsets:** Once the FP-Tree is constructed, the algorithm recursively mines it. For each frequent item, it collects the conditional pattern base, i.e. the set of prefix paths that lead to the item and its associated branches. This conditional pattern base is used to construct a new FP-Tree (the conditional FP-tree), which is mined recursively to grow longer frequent itemsets.

4. **Generating Association Rules:** After identifying frequent itemsets, you can generate association rules from them by computing measures such as confidence and lift.

The FP-Growth algorithm is known for its efficiency, especially when dealing with large datasets, because it avoids the repeated candidate generation and database scans of Apriori. Instead, it uses the compact FP-Tree structure to significantly reduce the search space, making it faster and more scalable.

In summary, the FP-Growth algorithm is a powerful and efficient method for finding frequent itemsets in transactional data and is the basis for association rule mining in many applications.

Fp Growth Algorithm in Data Mining


In Data Mining, finding frequent patterns in large databases is very important and has been studied on a large scale in the past few years. Unfortunately, this task is computationally expensive, especially when many patterns exist.

The FP-Growth Algorithm, proposed by Han, is an efficient and scalable method for mining the complete set of frequent patterns. It uses an extended prefix-tree structure for storing compressed and crucial information about frequent patterns, named the frequent-pattern tree (FP-tree). In his study, Han showed that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori Algorithm and TreeProjection. The popularity and efficiency of the FP-Growth Algorithm have led to many studies that propose variations and improvements of it.

What is FP Growth Algorithm?


The FP-Growth Algorithm is an alternative way to find frequent item sets without using candidate generation, thus improving performance. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the item set association information.

This algorithm works as follows:


o First, it compresses the input database, creating an FP-tree instance to represent frequent items.
o After this first step, it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern.
o Finally, each such database is mined separately.

Using this strategy, FP-Growth reduces the search costs by recursively looking for short patterns and then concatenating them into longer frequent patterns.

In large databases, holding the FP-tree in main memory may be impossible. A strategy to cope with this problem is to partition the database into a set of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases.

FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information about frequent patterns in a database. Each transaction is read and then mapped onto a path in the FP-tree. This is done until all transactions have been read. Different transactions with common subsets keep the tree compact, because their paths overlap.

A frequent-pattern tree is built from the initial item sets of the database. The purpose of the FP-tree is to mine the most frequent patterns; each node of the FP-tree represents an item of an item set.

The root node represents null, while the lower nodes represent the item sets. The associations of the nodes with the lower nodes, that is, of the item sets with the other item sets, are maintained while forming the tree.

Han defines the FP-tree as the tree structure given below:

1. One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the path reaching the node;
o Node-link: links to the next node in the FP-tree carrying the same item name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
o Item-name: the same as in the node;
o Head of node-link: a pointer to the first node in the FP-tree carrying the item name.

Additionally, the frequent-item-header table can hold the support count of each item. The best-case scenario occurs when every transaction contains the same set of items; then the FP-tree is only a single branch of nodes. The worst-case scenario occurs when every transaction has a unique item set. In that case, the space needed to store the tree is greater than the space used to store the original database, because the FP-tree requires additional space to store pointers between nodes and the counters for each item; the tree grows with each transaction's uniqueness.
Algorithm by Han
The original algorithm to construct the FP-Tree defined by Han is given below:

Algorithm 1: FP-tree construction

Input: A transaction database DB and a minimum support threshold ξ.

Output: FP-tree, the frequent-pattern tree of DB.

Method: The FP-tree is constructed as follows.

1. The first step is to scan the database to find the occurrences of the item sets in the database. This step is the same as the first step of Apriori. The count of 1-itemsets in the database is called the support count or frequency of the 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and find out the item sets in it. The item set with the maximum count is taken at the top, followed by the next item set with the lower count. It means that the branch of the tree is constructed with the transaction item sets in descending order of count.
4. The next transaction in the database is examined. The item sets are ordered in descending order of count. If any item set of this transaction is already present in another branch, then this transaction branch shares a common prefix to the root. This means that the common item set is linked to the new node of another item set in this transaction.
5. Also, the count of the item set is incremented as it occurs in the transactions. Both the common node and the new node counts are increased by 1 as they are created and linked according to the transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents the frequency pattern of length 1. From this, traverse the path in the FP Tree. This path (or these paths) is called a conditional pattern base. A conditional pattern base is a sub-database consisting of the prefix paths in the FP tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed from the counts of the item sets in the path. The item sets meeting the threshold support are considered in the Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.

Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts the set of frequent items, and the second constructs the FP-tree.

Example

Support threshold=50%, Confidence= 60%

Table 1:

Transaction List of items

T1 I1,I2,I3
T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5

T6 I1,I2,I3,I4
Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3

Table 2: Count of each item

Item Count

I1 4

I2 5

I3 4

I4 4

I5 2
Table 3: Sort the itemset in descending order.

Item Count

I2 5

I1 4

I3 4

I4 4
Build FP Tree

Let's build the FP tree in the following steps, such as:

1. Considering the root node null.


2. The first scan of transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1}, where I2 is linked as a child to the root, I1 is linked to I2, and I3 is linked to I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to the root, I3 is linked to I2, and I4 is linked to I3. This branch shares the I2 node with T1.
4. Increment the count of I2 by 1; I3 is linked as a child to I2, and I4 is linked as a child to I3. The counts are {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node, hence it is incremented by 1. Similarly, I1 is incremented by 1 as it is already linked to I2 in T1, giving {I2:3}, {I1:2}, {I4:1}.
7. T5: I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.

Mining of FP-tree is summarized below:

1. The lowest node item, I5, is not considered, as it does not meet the min support count; hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths are {I2,I1,I3:1} and {I2,I3:1}; these form the conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP tree is constructed from it. It contains {I2:2, I3:2}; I1 is not considered as it does not meet the min support count.
4. This path will generate all combinations of frequent patterns: {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}.
5. For I3, the prefix paths are {I2,I1:3} and {I2:1}; this generates a 2-node conditional FP-tree {I2:4, I1:3} and the frequent patterns {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path is {I2:4}; this generates a single-node conditional FP-tree {I2:4} and the frequent pattern {I2,I1:4}.
Item | Conditional Pattern Base | Conditional FP-tree | Frequent Patterns Generated
I4 | {I2,I1,I3:1}, {I2,I3:1} | {I2:2, I3:2} | {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}
I3 | {I2,I1:3}, {I2:1} | {I2:4, I1:3} | {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1 | {I2:4} | {I2:4} | {I2,I1:4}


The diagram given below depicts the conditional FP tree associated with the conditional node I3.
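For readers who want to reproduce this worked example in code, the sketch below uses the third-party mlxtend library (assumed to be installed alongside pandas); fpgrowth and association_rules are mlxtend functions, and the transactions follow Table 1.

```python
# Sketch: FP-Growth on the T1-T6 transactions with support=50%, confidence=60%.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["I1", "I2", "I3"],        # T1
    ["I2", "I3", "I4"],        # T2
    ["I4", "I5"],              # T3
    ["I1", "I2", "I4"],        # T4
    ["I1", "I2", "I3", "I5"],  # T5
    ["I1", "I2", "I3", "I4"],  # T6
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets with support >= 50%, mined with FP-Growth
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent)

# Association rules with confidence >= 60%
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```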

FP-Growth Algorithm
After constructing the FP-Tree, it is possible to mine it to find the complete set of frequent patterns. To do this, Han presents a group of lemmas and properties and then describes the FP-Growth Algorithm.

Algorithm 2: FP-Growth

Input: A database DB, represented by the FP-tree constructed according to Algorithm 1, and a minimum support threshold ξ.

Output: The complete set of frequent patterns.

Method: Call FP-growth (FP-tree, null).

1. Procedure FP-growth(Tree, a)
2. {
3. If the tree contains a single prefix path, then.
4. {
5. // Mining single prefix-path FP-tree
6. let P be the single prefix-path part of the tree;
7. let Q be the multipath part with the top branching node replaced by a null root;
8. for each combination (denoted as ß) of the nodes in the path, P do
9. generate pattern ß ∪ a with support = minimum support of nodes in ß;
10. let freq pattern set(P) be the set of patterns so generated;
11. }
12. else let Q be Tree;
13. for each item ai in Q, do
14. {
15. // Mining multipath FP-tree
16. generate pattern ß = ai ∪ a with support = ai .support;
17. construct ß's conditional pattern base, and then ß's conditional FP-tree Tree ß;
18. if Tree ß ≠ Ø then
19. call FP-growth(Tree ß, ß);
20. let freq pattern set(Q) be the set of patterns so generated;
21. }
22. return(freq pattern set(P) ∪ freq pattern set(Q) ∪ (freq pattern set(P) × freq pattern set(Q)))
23. }

When the FP-tree contains a single prefix path, the complete set of frequent patterns can be generated in three parts:

1. The single prefix-path P,


2. The multipath Q,
3. And their combinations (lines 01 to 03 and 14).

The resulting patterns for a single prefix path are the enumerations of its subpaths that have the minimum support. After that, the multipath Q is mined, and finally the combined results are returned as the frequent patterns found.

Advantages of FP Growth Algorithm


Here are the following advantages of the FP growth algorithm, such as:

o This algorithm needs to scan the database only twice, compared to Apriori, which scans the transactions in each iteration.
o The pairing of items is not done in this algorithm, making it faster.
o The database is stored in a compact version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm


This algorithm also has some disadvantages, such as:

o FP Tree is more cumbersome and difficult to build than Apriori.


o It may be expensive.
o The algorithm may not fit in the shared memory when the database is large.

Difference between Apriori and FP Growth Algorithm


Apriori and FP-Growth algorithms are the most basic FIM (frequent itemset mining) algorithms. There are some basic differences between these algorithms, such as:

o Apriori generates frequent patterns by making candidate itemsets through pairings such as single itemsets, double itemsets, and triple itemsets, whereas FP Growth generates an FP-tree and mines the frequent patterns from it.
o Apriori uses candidate generation, where frequent subsets are extended one item at a time, whereas FP Growth builds a conditional FP-tree for every item in the data.
o Since Apriori scans the database in each step, it becomes time-consuming for data where the number of items is large, whereas the FP-tree requires only one scan of the database in its beginning steps, so it consumes less time.
o In Apriori, a converted version of the database is saved in memory, whereas in FP Growth a set of conditional FP-trees for every item is saved in memory.
o Apriori uses a breadth-first search, whereas FP Growth uses a depth-first search.

Data Mining Bayesian Classifiers


In numerous applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record can't be assumed with certainty even though its attribute set is the same as some of the training examples. These circumstances may emerge due to noisy data or the presence of certain confusing factors that influence classification but are not included in the analysis. For example, consider the task of predicting whether an individual is at risk of heart disease based on the individual's diet and workout frequency. Although most people who eat healthily and exercise consistently have less chance of heart disease, they may still develop it due to other factors such as heredity, excessive smoking, and alcohol abuse. Determining whether an individual's eating routine is healthy or the workout frequency is sufficient is also subject to analysis, which in turn may introduce uncertainties into the learning problem.

Bayesian classification uses Bayes theorem to predict the occurrence of any event. Bayesian classifiers are statistical classifiers based on Bayesian probability.

Bayes theorem is named after Thomas Bayes, who first used conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes's theorem is expressed mathematically by the following equation:

P(X/Y) = P(Y/X) P(X) / P(Y)

Where X and Y are the events and P(Y) ≠ 0.


P(X/Y) is a conditional probability: the probability of event X occurring given that Y is true.

P(Y/X) is a conditional probability: the probability of event Y occurring given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. These are known as the marginal probabilities.

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief." Bayes theorem connects the degree of belief in a proposition before and after accounting for evidence. For example, when tossing a fair coin, the prior belief is that heads and tails each occur 50% of the time. If the coin is flipped a number of times and the outcomes are observed, that degree of belief may rise, fall, or remain the same depending on the results.

For proposition X and evidence Y,

o P(X), the prior, is the primary degree of belief in X


o P(X/Y), the posterior is the degree of belief having accounted for Y.

o The quotient P(Y/X) / P(Y) represents the support Y provides for X.

Bayes theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X⋂Y) / P(Y) and P(Y/X) = P(X⋂Y) / P(X),

where P(X⋂Y) is the joint probability of both X and Y being true. Because P(X⋂Y) = P(Y/X) P(X), substituting into the first equation gives P(X/Y) = P(Y/X) P(X) / P(Y).
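A tiny numeric illustration of the theorem (the probabilities are made up for the example): let X be "a person has a condition" and Y be "a screening test is positive".

```python
# Illustrative Bayes theorem calculation with made-up probabilities.
p_x = 0.01            # prior P(X): 1% of people have the condition
p_y_given_x = 0.95    # likelihood P(Y/X): test is positive if the condition is present
p_y_given_not_x = 0.10  # false-positive rate P(Y/~X)

# Marginal probability of the evidence: P(Y) = P(Y/X)P(X) + P(Y/~X)P(~X)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Posterior P(X/Y) = P(Y/X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # roughly 0.088
```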

Bayesian network:

A Bayesian Network falls under the class of Probabilistic Graphical Modelling (PGM) techniques, which are used to compute uncertainties using the concept of probability. Generally known as Belief Networks, Bayesian Networks model uncertainties using Directed Acyclic Graphs (DAG).

A Directed Acyclic Graph is used to represent a Bayesian Network and, like any other statistical graph, a DAG consists of a set of nodes and links, where the links signify the connections between the nodes. The nodes represent random variables, and the edges define the relationships between these variables.

o Bayes' Theorem is a fundamental concept in probability theory and statistics that provides a way to update the probability of a hypothesis based on new evidence. In the context of data classification, Bayes' Theorem is often used in a statistical method called Naive Bayes classification.

o Bayes' Theorem can be expressed mathematically as follows:

\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]

o Where:
- \(P(A|B)\) is the posterior probability of event A given evidence B.
- \(P(B|A)\) is the probability of evidence B given that event A has occurred.
- \(P(A)\) is the prior probability of event A.
- \(P(B)\) is the probability of evidence B.

o In the context of data classification, let's say you have a dataset with different classes or categories and you want to assign a new data point to one of them. The general steps are as follows (a small classifier sketch follows this list):

o 1. **Define Classes:** First, you need to identify and define the classes or categories you want to classify data points into.

o 2. **Calculate Prior Probabilities:** Determine the prior probabilities \(P(A)\) for each class, typically estimated as the fraction of training data points that belong to that class.

o 3. **Estimate Likelihoods:** For each feature or attribute in your data, calculate the conditional probability \(P(B|A)\) of observing that feature value given each class.

o 4. **Apply Bayes' Theorem:** When you receive new data to classify, you calculate the posterior probability for each class:

\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]

o You compute this for each class, and the class with the highest posterior probability is the predicted class for the new data point.

o 5. **Handling Independence Assumptions:** In the Naive Bayes classifier, it is assumed that the features are conditionally independent given the class, so the probability of a combination of feature values is the product of the probabilities of the individual feature values.

o 6. **Smoothing:** To handle cases where some feature-value combinations have zero probability in the training data, smoothing techniques such as Laplace smoothing are applied.

o Bayes' Theorem, particularly in the Naive Bayes classifier, is commonly used in text classification, spam filtering, and similar applications where the independence assumption holds reasonably true. However, it may not perform as well in situations where features are strongly correlated.
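As the sketch promised above, here is a minimal Naive Bayes classifier using scikit-learn's GaussianNB on the bundled Iris dataset (illustrative only; the original text does not prescribe a specific library or dataset):

```python
# Minimal Naive Bayes classification sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit class priors P(A) and per-feature likelihoods P(B|A), then predict
# the class with the highest posterior probability for each test point.
clf = GaussianNB().fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```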

Classification method and software

In data mining, classification is a crucial technique that involves the categorization of data points into predefined classes or labels. Various classification methods and software tools play a vital role in this data mining process. Let's analyze classification methods and software in this context (a short comparison sketch follows the list of methods).

**Classification Methods in Data Mining:**

1. **Decision Trees:** Decision trees are widely used in data mining. They create a hierarchical structure of decision rules that split the data based on feature values.

2. **Random Forest:** Random Forest is an ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting.

3. **Support Vector Machines (SVM):** SVM is an effective method for classification, especially when the classes are separable by a clear margin in a high-dimensional feature space.

4. **Naive Bayes:** Naive Bayes is a probabilistic method based on Bayes' Theorem. It's often used for text classification and spam filtering.

5. **Neural Networks:** Deep learning models, such as neural networks, are powerful tools for complex classification tasks such as image and speech recognition.

6. **K-Nearest Neighbors (K-NN):** K-NN is a simple yet effective method for classification. It assigns a data point to the majority class among its k nearest neighbors.

7. **Gradient Boosting:** Algorithms like XGBoost, LightGBM, and CatBoost are ensemble methods that build trees sequentially, each one correcting the errors of the previous ones.

8. **Logistic Regression:** Logistic regression is a linear classification method that models the probability of a data point belonging to a class.
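The short comparison sketch referred to above evaluates a few of these methods with scikit-learn on a bundled toy dataset (illustrative only; the dataset and hyperparameters are arbitrary choices, not prescribed by the text):

```python
# Compare several classification methods with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=5000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```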
**Classification Software in Data Mining:**

1. **Weka:** Weka is a popular open-source data mining software that provides a wide range of classification algorithms along with preprocessing and visualization tools.

2. **RapidMiner:** RapidMiner is a data science platform that includes classification algorithms, data preparation, and model evaluation in a visual workflow.

3. **Orange:** Orange is an open-source data visualization and analysis tool that also offers machine learning and classification components.

4. **KNIME:** KNIME is an open-source data analytics platform that allows users to build and execute classification workflows.

5. **R:** R is a programming language and environment for statistical computing and graphics. It has many packages for classification, such as caret and randomForest.

6. **TensorFlow and Keras:** These libraries are used for deep learning and neural network-based classification tasks.

7. **SAS Enterprise Miner:** SAS Enterprise Miner is a comprehensive data mining and machine learning suite with built-in classification capabilities.

8. **IBM SPSS Modeler:** IBM SPSS Modeler is a data mining and predictive analytics software that supports a range of classification methods.

The choice of classification method and software in data mining depends on the specific needs of the project, the nature of the data, and the expertise available. It is essential to evaluate the strengths and limitations of each method and tool for a given data mining problem.

Hierarchical method
Hierarchical clustering is a popular method in the field of unsupervised machine learning and data mining. It builds a hierarchy of clusters represented by a tree-like structure called a dendrogram. This method doesn't require the number of clusters to be specified in advance.

**Hierarchical Clustering Algorithms:**

1. **Agglomerative (Bottom-Up):** This is the most common approach. It starts with each data point as its own cluster and merges the closest clusters step by step:

a. Start with each data point as a single cluster.

b. Compute the similarity (or dissimilarity) between all pairs of clusters.

c. Merge the two most similar clusters into a new cluster.

d. Repeat steps b and c until only one cluster remains.

2. **Divisive (Top-Down):** This approach starts with all data points in a single cluster and recursively splits clusters into smaller ones.

**Distance (or Similarity) Measures:**

Hierarchical clustering requires a metric to measure the similarity or dissimilarity between data points or clusters. Common measures include (a short sketch follows the list):
- **Euclidean Distance:** The straight-line distance between two data points in a multi-dimensional space.
- **Manhattan Distance:** The sum of the absolute differences between the coordinates of two data points.
- **Cosine Similarity:** Measures the cosine of the angle between two vectors. It's often used for text data.
- **Correlation:** Measures the linear relationship between two variables.
- **Jaccard Index:** Used for binary data to measure the similarity between sets.
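The short sketch below evaluates some of these measures for a pair of example vectors with SciPy (assumed available; note that SciPy's cosine() returns a distance, i.e. 1 minus the cosine similarity):

```python
# Distance / similarity measures on two illustrative vectors.
from scipy.spatial.distance import euclidean, cityblock, cosine

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

print(euclidean(a, b))   # Euclidean (straight-line) distance
print(cityblock(a, b))   # Manhattan distance
print(1 - cosine(a, b))  # cosine similarity (1.0 here: same direction)
```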

**Linkage Methods:**

In agglomerative hierarchical clustering, a linkage method determines how the distance between clusters is computed. Common linkage methods include:

- **Single Linkage:** The distance between two clusters is the minimum distance between any two points, one from each cluster.
- **Complete Linkage:** The distance between two clusters is the maximum distance between any two points, one from each cluster.
- **Average Linkage:** The distance between two clusters is the average distance between all pairs of points, one from each cluster.
- **Ward's Linkage:** Minimizes the variance within the clusters when merging. It aims to create compact, spherical clusters.

**Dendrogram:**

A dendrogram is a tree-like structure that represents the hierarchy of clusters in the data. It's a graphical record of the merge (or split) sequence and of the distances at which the merges occur.

**Cutting the Dendrogram:**

To obtain a specific number of clusters from a dendrogram, you can cut the tree at a certain height; the branches below the cut correspond to the clusters, as in the sketch below.
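A minimal sketch of this workflow with SciPy and matplotlib (assumed available; the points are made up): build the linkage matrix, draw the dendrogram, and cut it into two clusters.

```python
# Agglomerative clustering with SciPy: linkage matrix, dendrogram, and cut.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative 2-D points forming two loose groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method="ward")  # Ward's linkage on Euclidean distances

dendrogram(Z)
plt.title("Dendrogram")
plt.show()

# "Cut" the dendrogram to obtain exactly 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```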

**Advantages of Hierarchical Clustering:**

1. Doesn't require the number of clusters to be specified in advance.


2. Provides an interpretable representation of cluster relationships through the dendrogram.
3. Allows for both agglomerative and divisive approaches, depending on the problem.

**Disadvantages of Hierarchical Clustering:**

1. Computationally intensive for large datasets.


2. Sensitive to noise and outliers.
3. Once a merge or split is made, it can't be undone.

Hierarchical clustering is a powerful tool for exploring and understanding the structure of your data, especially when the number of clusters is not known in advance.
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters. It is also known as hierarchical cluster analysis, or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two techniques differ in how they work.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merging them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.

Why hierarchical clustering?


As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical clustering? As we have seen in K-means clustering, there are some challenges: the number of clusters must be predetermined, and it always tries to create clusters of the same size. To solve these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we don't need to know the number of clusters in advance.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering


The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the datasets into clusters, it follows the bottom-up approach. It means this algorithm considers each dataset as a single cluster at the beginning and then starts combining the closest pairs of clusters together. It does this until all the clusters are merged into a single cluster that contains all the datasets.

This hierarchy of clusters is represented in the form of the dendrogram.

How the Agglomerative Hierarchical clustering Works


The working of the AHC algorithm can be explained using the below steps:

o Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.
o Step-2: Take the two closest data points or clusters and merge them to form one cluster, leaving N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form one cluster, leaving N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left.
o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem.

Note: To better understand hierarchical clustering, it is advised to also have a look at k-means clustering.

Measure for the distance between two clusters


As we have seen, the closest distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering. These measures are called linkage methods. Some of the popular linkage methods are given below, followed by a short clustering sketch:

1. Single Linkage: It is the shortest distance between the closest points of the two clusters.
2. Complete Linkage: It is the farthest distance between two points of two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of datasets is added up and then divided by the total number of pairs to calculate the average distance between two clusters.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the clusters is calculated.

From the above-given approaches, we can apply any of them according to the type of problem or business requirement.
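The short clustering sketch mentioned above runs agglomerative clustering with scikit-learn on a handful of made-up points; the linkage parameter corresponds to the linkage measures just listed (centroid linkage is not offered by this particular class):

```python
# Minimal agglomerative hierarchical clustering sketch with scikit-learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

# n_clusters chooses where the hierarchy is cut; linkage can be
# "single", "complete", "average", or "ward"
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]
```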

Working of Dendrogram in Hierarchical clustering


The dendrogram is a tree-like structure that is mainly used to record each step that the hierarchical clustering algorithm performs. In the dendrogram plot, the Y-axis shows the distances between the data points, and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the below diagram:

In the above diagram, the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.

o As we have discussed above, firstly the data points P2 and P3 combine and form a cluster, and correspondingly a dendrogram link is created connecting P2 and P3; its height reflects the distance between the two points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram link is created. It is higher than the previous one, as the distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrogram links are created that combine P1, P2, and P3 in one cluster, and P4, P5, and P6 in another.
o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.

What is main objective of clustering give the categorization of clustering approaches


**The main objective of clustering** is to group a set of data points or objects into subsets or clusters such that:

1. Objects within the same cluster are more similar to each other than to those in other clusters.
2. Objects in different clusters are more dissimilar from each other.

In other words, clustering aims to discover natural groupings or structure within data, making it easier to understand, summarize, and analyze.

**Categorization of Clustering Approaches:**

Clustering approaches can be categorized based on various factors, including the algorithm used, the cluster structure produced, and the type of data handled. Common categories include:

1. **Partitioning Clustering:**
- In partitioning clustering, data points are divided into a predefined number of clusters.
- Algorithms: K-Means, K-Medoids, CLARA (Clustering Large Applications), PAM (Partitioning Aro
- Each data point belongs to exactly one cluster.

2. **Hierarchical Clustering:**
- Hierarchical clustering creates a tree-like structure of clusters (dendrogram) with a hierarchy of nested clusters.
- Algorithms: Agglomerative (bottom-up) and divisive (top-down) methods.
- No need to specify the number of clusters beforehand.

3. **Density-Based Clustering:**
- Density-based clustering identifies regions of high data point density and forms clusters based on those dense regions, treating sparse regions as noise.
- Algorithms: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure).

4. **Grid-Based Clustering:**
- Grid-based clustering divides the data space into a grid (or mesh) and assigns data points to grid cells, then forms clusters from dense cells.
- Algorithms: STING (Statistical Information Grid), CLIQUE (CLustering In QUEst).

5. **Model-Based Clustering:**
- Model-based clustering assumes that data points are generated from a mixture of probability distributions and fits such a model to the data.
- Algorithms: Expectation-Maximization (EM) with Gaussian Mixture Models (GMM), Latent Dirichlet Allocation (LDA).

6. **Graph-Based Clustering:**
- Graph-based clustering represents data points as nodes in a graph and clusters as connected components or densely connected subgraphs.
- Algorithms: Spectral Clustering, Normalized Cuts.
7. **Fuzzy Clustering:**
- Fuzzy clustering allows data points to belong to multiple clusters with different degrees of membership.
- Algorithms: Fuzzy C-Means (FCM), Gustafson-Kessel (GK) clustering.

8. **Constraint-Based Clustering:**
- Constraint-based clustering incorporates user-defined constraints to guide the clustering process.
- Algorithms: Semi-Supervised Clustering, Constrained K-Means.

9. **Subspace Clustering:**
- Subspace clustering identifies clusters in subspaces (subsets of features) of the data.
- Algorithms: CLIQUE (in the context of subspace clustering), SUBCLU (Subspace Clustering).

10. **Biological and Evolutionary Clustering:**


- These approaches are designed for specific domains, such as biology and evolutionary computation.

Each of these clustering approaches has its own strengths and weaknesses, and the choice depends on the nature of the data and the goals of the analysis. It is important to consider the characteristics of the data and the assumptions of each method when selecting an appropriate approach.

WEB TERMINOLOGY (WT)


Web terminology refers to the specialized vocabulary and terms associated with the World Wide
Web and the internet. Understanding these terms is essential for effectively navigating and working
on the web. Here are some key web terminology and characteristics:
1. **Website:** A collection of web pages, documents, or multimedia content that is accessible via a
unique URL (Uniform Resource Locator). Websites are typically hosted on web servers and can serve
various purposes, from information dissemination to e-commerce.

2. **Web Page:** A single document or page within a website. It can contain text, images,
multimedia, and hyperlinks to other web pages or external resources.

3. **URL (Uniform Resource Locator):** A web address that specifies the location of a resource on
the internet. It typically includes a protocol (e.g., http:// or https://), a domain name (e.g.,
www.example.com), and a specific path or resource identifier.

4. **HTML (Hypertext Markup Language):** The standard markup language used to create web
pages. HTML tags are used to structure and format content, and it is the backbone of web
development.

5. **Hyperlink:** A clickable element on a web page that, when clicked, directs the user to another
web page or resource. Hyperlinks are typically represented as underlined text or buttons.

6. **Web Browser:** Software used to access and view web pages. Popular web browsers include
Google Chrome, Mozilla Firefox, Microsoft Edge, and Safari.

7. **Web Server:** A computer or software that stores and serves web content to users upon
request. It uses protocols like HTTP (Hypertext Transfer Protocol) to communicate with web clients
(browsers).

8. **Web Hosting:** The service of storing and making websites accessible on the internet. Web
hosting providers offer various plans and resources for hosting websites.

9. **Domain Name:** A human-readable address used to identify a specific website or resource on


the internet. It corresponds to an IP (Internet Protocol) address, allowing users to access websites
using names instead of numerical addresses.

10. **CMS (Content Management System):** A software platform that simplifies website creation
and management. Popular CMSs include WordPress, Joomla, and Drupal.

11. **Responsive Design:** Designing websites to adapt and display optimally on various devices
and screen sizes, such as desktop computers, tablets, and smartphones.
12. **SEO (Search Engine Optimization):** The practice of optimizing a website's content and
structure to improve its visibility in search engine results, driving organic traffic.

13. **Web Development:** The process of creating websites or web applications, which can involve
front-end development (user interface) and back-end development (server-side functionality).

14. **E-commerce:** Online buying and selling of goods and services, often through dedicated e-
commerce websites or platforms.

15. **Cookies:** Small text files that websites store on a user's device to track and remember user
preferences and interactions.

16. **HTTPS (Hypertext Transfer Protocol Secure):** A secure version of HTTP that encrypts data
exchanged between a user's browser and a website, providing enhanced security for data
transmission.

17. **Web 2.0:** A term referring to the second generation of web development and design,
emphasizing user-generated content, collaboration, and interactive web applications.

18. **Web Hosting:** Services that allow individuals or organizations to make their websites
accessible on the internet. It involves storing website files on a web server.

19. **Web Accessibility:** Designing websites to be usable by people with disabilities, ensuring
equal access to information and services.

20. **Web Standards:** Guidelines and specifications set by organizations like the World Wide Web
Consortium (W3C) to ensure consistency and interoperability of web technologies.

These are just a few key terms and characteristics related to the web. The web is a dynamic and
evolving space, so new terminology and technologies continually emerge. Understanding these
fundamentals is crucial for anyone involved in web development, design, or digital marketing.
WEB TERMINOLOGY AND CHARACTERISTICS

When discussing web terminology and characteristics in the context of data mining, there are several
specific terms and concepts that are relevant. Data mining on the web involves extracting valuable
insights and patterns from large datasets collected from online sources. Here are some key terms and
characteristics related to web data mining:

1. **Web Data:** Data obtained from various online sources, including websites, social media,
online forums, and other web-based platforms.

2. **Web Scraping:** The process of automatically extracting data from websites. Web scraping
tools and techniques are used to collect data from web pages for further analysis.

3. **Web Crawling:** The automated process of navigating the web to index and collect data from
multiple websites. Search engines, like Google, use web crawlers to index web content.

4. **Structured and Unstructured Data:** Web data can be either structured (e.g., databases, tables)
or unstructured (e.g., text, images). Data mining techniques are often used to process and analyze
unstructured web data.

5. **Text Mining:** A subset of data mining that focuses on extracting insights and patterns from
textual data found on websites, including sentiment analysis, topic modeling, and keyword
extraction.
6. **Web Content Analysis:** The process of analyzing the content of web pages, which may include
text, images, videos, and other media, to gain insights into user preferences, trends, and behavior.

7. **User Behavior Tracking:** Monitoring and analyzing user interactions on websites, such as
clickstream data, to understand how users navigate and engage with web content.

8. **Web Analytics:** The practice of collecting, measuring, and analyzing web data to optimize
websites and online marketing strategies.

9. **Data Preprocessing:** Data mining often involves data cleaning, transformation, and reduction
to prepare the data for analysis. This step is critical in dealing with noisy or incomplete web data.

10. **Data Mining Algorithms:** Various algorithms are used to uncover patterns, associations, and
insights in web data. Common algorithms include decision trees, clustering, and association rule
mining.

11. **Data Visualization:** Representing data mining results through charts, graphs, and
visualizations to make complex patterns and insights more understandable.

12. **Recommendation Systems:** Using data mining to provide personalized recommendations to


users based on their web activity and preferences, as seen in e-commerce and content
recommendation systems.

13. **Anomaly Detection:** Identifying unusual patterns or outliers in web data, which can be useful
for fraud detection or monitoring network security.

14. **Big Data:** Web data mining often deals with large and complex datasets, making it a part of
the broader field of big data analytics.

15. **Privacy and Ethical Concerns:** Web data mining must consider issues related to user privacy,
data protection, and ethical data usage, especially with the increasing focus on data regulations and
user rights.

16. **Machine Learning:** Utilizing machine learning algorithms and techniques to improve the
accuracy and predictive capabilities of web data mining models.
17. **Web Mining Tools and Frameworks:** Various software tools and frameworks are available to
facilitate web data mining, including Scrapy for web scraping and Python libraries like scikit-learn for
data analysis.

Web data mining is an essential component of extracting valuable insights from the vast amount of
information available on the internet. Understanding these web-specific terms and characteristics is
crucial for data scientists, analysts, and businesses looking to leverage web data for informed
decision-making.

Search engine architecture in data mining refers to the underlying structure and
components of a search engine that are designed to retrieve and present information from vast
collections of data, such as web pages, databases, or documents. Data mining techniques are often
employed within the architecture of search engines to provide more relevant and efficient search
results. Here are the key components of search engine architecture in data mining:

1. **Crawling and Indexing:**

- **Crawler (Web Spider):** The crawler, also known as a web spider, is responsible for traversing
the web and collecting web pages or documents. It follows hyperlinks from one web page to another
and downloads their content.

- **Indexer:** The indexer processes the downloaded content, extracts textual information, and
builds an index of the documents. This index is crucial for efficient and quick retrieval of relevant
results.

2. **Data Preprocessing:**

- **Data Cleaning:** Removing duplicate content, correcting errors, and filtering out irrelevant
information to ensure high data quality.

- **Data Transformation:** Converting data into a suitable format for analysis, which may involve
text preprocessing, feature extraction, and dimensionality reduction.

3. **Query Processing:**

- **User Query Parsing:** Parsing and understanding user queries to extract relevant keywords and
concepts.

- **Query Expansion:** Expanding user queries with synonyms or related terms to improve recall
and precision in search results.
4. **Ranking and Retrieval:**

- **Information Retrieval Models:** Implementing models like TF-IDF (Term Frequency-Inverse


Document Frequency) or BM25 to rank documents based on their relevance to the user query.

- **Relevance Scoring:** Assigning a relevance score to each document based on factors like
keyword matches, document popularity, and user behavior.

5. **Data Mining Techniques:**

- **Classification and Clustering:** Using classification algorithms to categorize and group


documents into relevant classes or clusters.

- **Association Rule Mining:** Discovering patterns and associations within the data, which can be
used for query suggestions or content recommendations.

- **Text Mining:** Extracting insights from unstructured text data, such as sentiment analysis,
named entity recognition, and topic modeling.

6. **User Interface:**

- **Search Interface:** The user interface is the front-end that allows users to input queries and
view search results. It may include advanced search options, filters, and facets.

7. **Feedback Mechanisms:**

- **User Feedback:** Collecting user feedback to improve search relevance and accuracy. This may
involve feedback forms, click-through data analysis, and user surveys.

8. **Security and Privacy:**

- **Access Control:** Ensuring that sensitive or private information is not accessible to


unauthorized users.

- **Data Encryption:** Protecting data during transmission and storage to maintain user privacy.

9. **Scalability:**

- **Distributed Systems:** Many modern search engines are built on distributed architectures to
handle large-scale data and user requests efficiently.

10. **Machine Learning Integration:**

- Incorporating machine learning algorithms to enhance search results and personalize


recommendations based on user behavior and preferences.
11. **Monitoring and Maintenance:**

- Continuous monitoring to ensure the search engine is operational, detect and fix errors, and
update the index with new content.

12. **Relevancy Testing:**

- Testing and evaluating the performance of the search engine by comparing the quality of search
results against established metrics and benchmarks.

Search engine architecture in data mining is a complex and evolving field, with various algorithms,
models, and technologies constantly being developed to improve the accuracy and efficiency of
search engines. These systems play a vital role in helping users find relevant information within the
vast amount of data available on the internet and in various data repositories.
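As a small illustration of the ranking-and-retrieval step described above (not part of the original text), the sketch below scores a made-up document collection against a query using TF-IDF vectors and cosine similarity from scikit-learn:

```python
# Rank a tiny document collection against a user query with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "data mining techniques for frequent pattern discovery",
    "web crawling and indexing for search engines",
    "association rule mining in transactional databases",
]
query = "frequent pattern mining"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Relevance score of each document for the query
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for doc, score in sorted(zip(documents, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```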

Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP)


are two distinct but complementary approaches to data management and analysis, often used in the
context of data mining. They serve different purposes and are typically part of a broader data
warehousing and business intelligence architecture. Here's how OLTP and OLAP work together with
data mining:

1. **Online Transaction Processing (OLTP):**

- **Purpose:** OLTP systems are designed for efficient and real-time transaction processing. They
handle day-to-day operational tasks, such as order processing, inventory management, and customer
interactions.

- **Characteristics:** OLTP databases are optimized for fast and frequent data insertions, updates,
and deletions. They focus on maintaining data integrity and ensuring that transactions are processed
accurately and quickly.

- **Data Schema:** OLTP databases typically have normalized schemas to reduce data redundancy
and maintain data consistency.

- **Example:** A point-of-sale system in a retail store that records individual sales transactions is
an OLTP application.

2. **Online Analytical Processing (OLAP):**


- **Purpose:** OLAP systems are designed for complex queries and data analysis. They provide a
multidimensional view of data, allowing users to explore data from various angles, make informed
decisions, and discover trends and insights.

- **Characteristics:** OLAP databases are optimized for read-heavy operations, aggregations, and
complex queries. They may use denormalized data structures to improve query performance.

- **Data Schema:** OLAP databases use star or snowflake schemas, which involve fact tables and
dimension tables to enable efficient data analysis.

- **Example:** A data warehouse that stores historical sales data and allows business analysts to
generate reports, perform trend analysis, and make strategic decisions is an OLAP application.

3. **Data Mining in OLAP:**

- Data mining can be integrated with OLAP systems to discover hidden patterns, associations, and
insights within large datasets. OLAP cubes and reports can serve as a starting point for data mining
analysis.

- Data mining algorithms, such as clustering, classification, and association rule mining, can be
applied to OLAP data to uncover valuable information about customer behavior, market trends, or
operational efficiency.

- The results of data mining can feed back into the OLAP system, enriching the multidimensional
view and enabling more informed decision-making.

4. **Data Flow:**

- OLTP systems capture and store transactional data in real-time. These transactions are essential
for tracking business operations and maintaining data integrity.

- OLAP systems periodically extract and transform data from OLTP databases into a data
warehouse. This process may involve aggregation, cleaning, and structuring data for analytical
purposes.

- Data mining can be performed on the data warehouse, which contains historical and aggregated
data, making it suitable for discovering patterns and trends.

5. **Business Applications:**

- OLTP systems are primarily used for daily operational tasks and data recording.

- OLAP systems support strategic and business intelligence functions, enabling decision-makers to
analyze data and generate reports.

- Data mining provides a deeper layer of analysis, helping organizations make data-driven
predictions and discover hidden insights that might not be apparent through traditional reporting.
Integrating OLTP, OLAP, and data mining within an organization's data management and analysis
strategy allows for a comprehensive approach to handling data, from real-time transactions to
historical analysis and predictive modeling. This synergy helps organizations make informed decisions
and gain a competitive advantage in various industries.




Pre-Requisite: OLAP, OLTP
OLAP stands for Online Analytical Processing. OLAP systems have the capability to
analyze database information of multiple systems at the current time. The primary
goal of OLAP Service is data analysis and not data processing.
OLTP stands for Online Transaction Processing. OLTP has the work to administer
day-to-day transactions in any organization. The main goal of OLTP is data
processing not data analysis.

Online Analytical Processing (OLAP)


Online Analytical Processing (OLAP) consists of a type of software tool that is used
for data analysis for business decisions. OLAP provides an environment to get
insights from the database retrieved from multiple database systems at one time.

OLAP Examples

Any type of Data Warehouse System is an OLAP system. The uses of the OLAP
System are described below.
 Spotify analyzed songs by users to come up with a personalized homepage
of their songs and playlist.
 Netflix movie recommendation system.
OLAP

Benefits of OLAP Services

 OLAP services help in keeping consistency and calculation.


 We can store planning, analysis, and budgeting for business analytics within
one platform.
 OLAP services help in handling large volumes of data, which helps in
enterprise-level business applications.
 OLAP services help in applying security restrictions for data protection.
 OLAP services provide a multidimensional view of data, which helps in
applying operations on data in various ways.
Drawbacks of OLAP Services

 OLAP Services requires professionals to handle the data because of its


complex modeling procedure.
 OLAP services are expensive to implement and maintain in cases when
datasets are large.
 We can perform an analysis of data only after extraction and transformation
of data in the case of OLAP which delays the system.
 OLAP services are not efficient for decision-making, as it is updated on a
periodic basis.
Online Transaction Processing (OLTP)
Online transaction processing provides transaction-oriented applications in a 3-tier
architecture. OLTP administers the day-to-day transactions of an organization.

OLTP Examples

An example of an OLTP system is an ATM center: the person who authenticates first receives the amount first, on the condition that the amount to be withdrawn is present in the ATM. The uses of the OLTP System are described below.
 ATM center is an OLTP application.
 OLTP handles the ACID properties during data transactions via the
application.
 It’s also used for online banking, online airline ticket booking, sending a text message, and adding a book to a shopping cart.
OLTP vs OLAP

Benefits of OLTP Services

 OLTP services allow users to read, write and delete data operations quickly.
 OLTP services help in increasing users and transactions which helps in real-
time access to data.
 OLTP services help to provide better security by applying multiple security
features.
 OLTP services help in making better decision making by providing accurate
data or current data.
 OLTP Services provide Data Integrity, Consistency, and High Availability
to the data.

Drawbacks of OLTP Services

 OLTP has limited analysis capability, as it is not intended for complex analysis or reporting.
 OLTP has high maintenance costs because of frequent maintenance,
backups, and recovery.
 OLTP Services get hampered in the case whenever there is a hardware
failure which leads to the failure of online transactions.
 OLTP Services many times experience issues such as duplicate or
inconsistent data.
Difference between OLAP and OLTP

Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
---------|-------------------------------------|--------------------------------------
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists of only current operational data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
Purpose | It serves the purpose of extracting information for analysis and decision-making. | It serves the purpose of inserting, updating, and deleting information from the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, as the historical data is archived in MB or GB.
Queries | Relatively slow, as the amount of data involved is large; queries may take hours. | Very fast, as the queries operate on a small fraction (around 5%) of the data.
Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database.
Backup and Recovery | It only needs backup from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously.
Processing time | The processing of complex queries can take a lengthy time. | It is comparatively fast in processing because of simple and straightforward queries.
Types of users | This data is generally managed by CEOs, MDs, and GMs. | This data is managed by clerks and managers.
Operations | Only read and rarely write operations. | Both read and write operations.
Updates | With lengthy, scheduled batch operations, data is refreshed on a regular basis. | The user initiates data updates, which are brief and quick.
Nature of audience | The process is focused on the customer. | The process is focused on the market.
Database design | Design with a focus on the subject. | Design that is focused on the application.
Productivity | Improves the efficiency of business analysts. | Enhances the user's productivity.
Data cubes play a crucial role in both data mining and data warehousing, facilitating efficient
multidimensional data analysis. They are a fundamental concept in these fields, particularly in Online
Analytical Processing (OLAP) systems. Here's an explanation of data cubes in the context of data
mining and data warehousing:

**Data Cubes in Data Warehousing**:

In data warehousing, a data cube is a multidimensional representation of data that allows for
complex analysis and reporting. It stores data in a format that is optimized for querying and reporting
on multiple dimensions. The key components of a data cube in data warehousing include:

1. **Dimensions**: These are the attributes or characteristics by which you want to analyze data. For
example, in a retail context, dimensions could include time, products, and stores.

2. **Measures**: These are the numeric values or metrics you want to analyze. For instance, in
retail, measures could be sales revenue, units sold, or profit.

3. **Hierarchies**: Dimensions can have hierarchies, which represent levels of granularity. For
example, the time dimension might have hierarchies like year, quarter, month, and day.

4. **Cuboid**: A cuboid represents a specific subcube within the data cube, defined by a
combination of dimension values. It's essentially a slice of the cube used for analysis.

5. **Aggregations**: Data cubes often store aggregated values to speed up query performance,
especially for summarization at higher levels of detail.

**Data Cubes in Data Mining**:


In data mining, data cubes are used as a foundation for analyzing large datasets, discovering
patterns, and generating insights. Data cubes provide a structured way to organize and analyze data
from various dimensions, which is particularly useful in data mining applications:

1. **Pattern Discovery**: Data cubes are used to identify patterns and trends within the data,
helping data miners discover valuable insights. Patterns can include associations, correlations, and
anomalies.

2. **Data Visualization**: Data cubes can be visualized in the form of pivot tables or
multidimensional charts, making it easier for data analysts to understand complex relationships
within the data.

3. **Drill-Down and Roll-Up**: Data miners can drill down to finer levels of detail or roll up to higher-
level summaries within the data cube to explore patterns and trends more deeply.

4. **Hypothesis Testing**: Data cubes allow for hypothesis testing and the evaluation of data mining
models within the context of multiple dimensions.

5. **Advanced Analytics**: Data mining techniques, such as clustering, classification, and regression,
can be applied to data cubes to generate predictive models and make data-driven decisions.

In summary, data cubes in data warehousing provide a structured way to store and retrieve
multidimensional data for reporting and analysis, while data cubes in data mining are used for
discovering patterns and generating insights from multidimensional datasets. The concept of data
cubes is pivotal in both domains, enabling efficient and comprehensive data analysis.

Data Cube or OLAP approach in Data Mining



What is OLAP?
OLAP stands for Online Analytical Processing, which is a technology that enables
multi-dimensional analysis of business data. It provides interactive access to large
amounts of data and supports complex calculations and data aggregation. OLAP is
used to support business intelligence and decision-making processes.
Grouping of data in a multidimensional matrix is called a data cube. In data
warehousing, we generally deal with multidimensional data models, as the data is
represented by multiple dimensions and multiple attributes. This multidimensional
data is represented in the data cube, since the cube represents a high-dimensional
space. The data cube pictorially shows how the different attributes of the data are
arranged in the data model. Below is the diagram of a general data cube.

The example above is a 3D cube having attributes like branch (A, B, C, D), item
type (home, entertainment, computer, phone, security), and year (1997, 1998, 1999).
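
Purely as an illustration (not part of the source text), such a 3D cube can be pictured as a dense NumPy array indexed by branch, item type, and year; the cell values below are random placeholders.

```python
import numpy as np

branches   = ["A", "B", "C", "D"]
item_types = ["home", "entertainment", "computer", "phone", "security"]
years      = [1997, 1998, 1999]

# A dense 4 x 5 x 3 array: one cell per (branch, item type, year) combination.
rng = np.random.default_rng(0)
cube = rng.integers(0, 100, size=(len(branches), len(item_types), len(years)))

# Look up one cell: sales of "phone" items at branch "B" in 1998.
print(cube[branches.index("B"), item_types.index("phone"), years.index(1998)])
```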

Data cube classification:


The data cube can be classified into two categories:
• Multidimensional data cube: It helps in storing large amounts of data by making
use of a multidimensional array. It increases efficiency by keeping an index on
each dimension, and is therefore able to retrieve data quickly.
• Relational data cube: It helps in storing large amounts of data by making use of
relational tables. Each relational table stores the dimensions of the data cube. It
is slower compared to a multidimensional data cube.
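
The difference between the two storage styles can be sketched roughly as follows; the store and month labels are invented, and a NumPy array and a pandas table merely stand in for multidimensional (MOLAP-style) and relational (ROLAP-style) storage.

```python
import numpy as np
import pandas as pd

# Multidimensional storage: a dense array indexed by position (2 stores x 3 months).
cube = np.zeros((2, 3))
cube[0, 1] = 40.0   # store S0, month Feb
cube[1, 2] = 75.0   # store S1, month Mar

# Relational storage: one row per non-empty cell of the cube.
table = pd.DataFrame({
    "store": ["S0", "S1"],
    "month": ["Feb", "Mar"],
    "sales": [40.0, 75.0],
})

# The array answers positional lookups directly; the relational table answers
# the same question with a filter, which is slower but handles sparse data well.
print(cube[0, 1])
print(table.loc[(table["store"] == "S0") & (table["month"] == "Feb"), "sales"].iloc[0])
```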

Data cube operations:


Data cube operations are used to manipulate data to meet the needs of users. These
operations help to select particular data for analysis. There are mainly five
operations, listed below:

• Roll-up: this operation aggregates data along a dimension by climbing up its
concept hierarchy, grouping similar values of that dimension together. For
example, if the data cube displays the daily income of a customer, a roll-up
operation can summarize it into his monthly income.

• Drill-down: this operation is the reverse of the roll-up operation. It allows us to
take a particular piece of information and subdivide it further for finer-granularity
analysis; it zooms in to more detail. For example, if India is an attribute of a
country column and we wish to see villages in India, the drill-down operation
splits India into states, districts, towns, cities, and villages and then displays the
required information.

• Slicing: this operation filters out the unnecessary portions. Suppose that, in a
particular dimension, the user does not need everything for analysis but only a
particular attribute. For example, the selection country = "Jamaica" will display
only the data for Jamaica and will not display the other countries in the country
list.

• Dicing: this operation performs a multidimensional cut: rather than cutting along
a single dimension only, it can also select a certain range on other dimensions.
As a result, it looks like a subcube carved out of the whole cube (as depicted in
the figure). For example, the user wants to see the annual salary of Jharkhand
state employees.

• Pivot: this operation is important from a viewing point of view. It transforms the
presentation of the data cube without changing the data it contains. For example,
if the user is comparing year versus branch, the pivot operation lets the user
change the viewpoint and instead compare branch versus item type. A combined
pandas sketch of these five operations follows this list.
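
The following sketch mimics these five operations on a small, invented daily-income table using pandas; the branches, states, dates, and income figures are placeholders, and the pandas idioms only approximate what an OLAP engine does internally.

```python
import pandas as pd

# Hypothetical daily income per branch (all names and figures are made up).
df = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-17"]),
    "branch": ["A", "A", "B", "B"],
    "state":  ["Jharkhand", "Jharkhand", "Kerala", "Kerala"],
    "income": [100, 150, 200, 250],
})
df["month"] = df["date"].dt.to_period("M")

# Roll-up: climb the time hierarchy from day to month.
monthly = df.groupby(["branch", "month"])["income"].sum()

# Drill-down: go back to the finer, daily level.
daily = df.groupby(["branch", "date"])["income"].sum()

# Slice: fix a single value on one dimension.
jharkhand = df[df["state"] == "Jharkhand"]

# Dice: cut on two or more dimensions at once (state and a date range here).
subcube = df[(df["state"] == "Jharkhand") & (df["date"] < "2023-02-01")]

# Pivot: change the viewpoint (branches as rows, months as columns).
view = df.pivot_table(index="branch", columns="month", values="income", aggfunc="sum")

print(monthly, daily, jharkhand, subcube, view, sep="\n\n")
```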

Advantages of data cubes:

 Multi-dimensional analysis: Data cubes enable multi-dimensional


analysis of business data, allowing users to view data from different
perspectives and levels of detail.
 Interactivity: Data cubes provide interactive access to large amounts of
data, allowing users to easily navigate and manipulate the data to support
their analysis.
 Speed and efficiency: Data cubes are optimized for OLAP analysis,
enabling fast and efficient querying and aggregation of data.
 Data aggregation: Data cubes support complex calculations and data
aggregation, enabling users to quickly and easily summarize large
amounts of data.
 Improved decision-making: Data cubes provide a clear and
comprehensive view of business data, enabling improved decision-making
and business intelligence.
 Accessibility: Data cubes can be accessed from a variety of devices and
platforms, making it easy for users to access and analyze business data
from anywhere.
 Helps in giving a summarised view of data.
 Data cubes store large data in a simple way.
 Data cube operation provides quick and better analysis,
 Improve performance of data.

Disadvantages of data cube:

 Complexity: OLAP systems can be complex to set up and maintain,


requiring specialized technical expertise.
 Data size limitations: OLAP systems can struggle with very large data
sets and may require extensive data aggregation or summarization.
 Performance issues: OLAP systems can be slow when dealing with large
amounts of data, especially when running complex queries or calculations.
 Data integrity: Inconsistent data definitions and data quality issues can
affect the accuracy of OLAP analysis.
 Cost: OLAP technology can be expensive, especially for enterprise-level
solutions, due to the need for specialized hardware and software.
 Inflexibility: OLAP systems may not easily accommodate changing
business needs and may require significant effort to modify or extend.

Online Analytical Processing (OLAP) operations are a set of operations used to interact with
multidimensional data cubes in data warehousing and business intelligence systems. OLAP enables
users to explore, analyze, and gain insights from data in a multidimensional format. There are several
primary OLAP operations, including:

1. **Roll-Up (Aggregation):**

- **Operation:** Roll-up, also known as aggregation, involves summarizing data at a higher level of
a dimension hierarchy. It is moving from a detailed level to a more aggregated or summarized level.

- **Example:** Summarizing daily sales data into monthly or yearly totals.

2. **Drill-Down (Detail):**

- **Operation:** Drill-down, or detail operation, is the opposite of roll-up. It involves navigating to


a lower level of detail in a dimension hierarchy.

- **Example:** Exploring monthly sales data to view daily sales for a specific month.

3. **Slice:**

- **Operation:** Slicing involves fixing a single value (or level) on one dimension, producing a
lower-dimensional "slice" of the data cube.

- **Example:** Viewing sales data for a particular quarter only, across all products and regions.

4. **Dice:**

- **Operation:** Dicing allows the selection of two or more dimensions and specific levels within
those dimensions to create a subcube or a more focused view of the data.

- **Example:** Creating a subcube to examine sales data for a particular region, product category,
and time period.

5. **Pivot (Rotation):**

- **Operation:** Pivoting involves changing the orientation of the data cube to view it from a
different perspective. It typically involves interchanging rows and columns in the data representation.

- **Example:** Rotating a data cube to view sales by product category as rows and quarters as
columns.

6. **Query (Selection):**

- **Operation:** Query or selection allows users to specify criteria or conditions to filter the data in
the cube. It focuses on a specific subset of the data based on user-defined constraints.
- **Example:** Selecting only sales data for products that belong to a particular category and are
sold in a specific region.

7. **Drill-Through:**

- **Operation:** Drill-through provides a way to access detailed data at the lowest level of
granularity, often by connecting to the underlying relational or transactional databases.

- **Example:** Accessing individual transaction records from a summary-level sales report.

8. **Ranking:**

- **Operation:** Ranking involves calculating and displaying the rank or order of data values within
a particular dimension. It helps identify the highest or lowest values.

- **Example:** Ranking products by sales volume within a specific time frame.

9. **Top-N (Bottom-N):**

- **Operation:** Top-N and Bottom-N operations return the top (or bottom) N values based on a
specific measure. This is useful for identifying the best or worst performers.

- **Example:** Finding the top 10 best-selling products or the bottom 5 least profitable regions.
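
A rough pandas sketch of the query (selection), ranking, and Top-N/Bottom-N operations is given below; the products, categories, regions, and revenue figures are invented for illustration.

```python
import pandas as pd

# Hypothetical product-level sales summary.
sales = pd.DataFrame({
    "product":  ["p1", "p2", "p3", "p4", "p5"],
    "category": ["toys", "toys", "food", "food", "toys"],
    "region":   ["east", "west", "east", "west", "east"],
    "revenue":  [500, 120, 340, 90, 410],
})

# Query / selection: keep only the rows matching user-defined constraints.
selected = sales[(sales["category"] == "toys") & (sales["region"] == "east")]

# Ranking: rank each product by revenue within its category.
sales["rank_in_category"] = (
    sales.groupby("category")["revenue"].rank(ascending=False, method="dense")
)

# Top-N / Bottom-N: best and worst performers by revenue.
top2 = sales.nlargest(2, "revenue")
bottom2 = sales.nsmallest(2, "revenue")

print(selected, sales, top2, bottom2, sep="\n\n")
```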

These OLAP operations provide users with the flexibility to interact with multidimensional data
cubes, allowing them to explore data from various angles and levels of granularity, make informed
decisions, and discover valuable insights. OLAP tools and systems are widely used in business
intelligence and data analysis to support decision-making processes.

OLAP Operations in the Multidimensional Data Model
In the multidimensional model, the records are organized into various dimensions,
and each dimension includes multiple levels of abstraction described by concept
hierarchies. This organization supports users with the flexibility to view data from
various perspectives. A number of OLAP data cube operations exist to materialize
these different views, allowing interactive querying and searching of the records at
hand. Hence, OLAP supports a user-friendly environment for interactive data analysis.

Consider the OLAP operations to be performed on multidimensional data. The figure
shows a data cube for the sales of a shop. The cube contains the dimensions
location, time, and item, where location is aggregated with respect to city values,
time is aggregated with respect to quarters, and item is aggregated with respect to
item types.

Roll-Up
The roll-up operation (also known as drill-up or aggregation operation) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction. Roll-up is like zooming out on the data cube.
The figure shows the result of a roll-up operation performed on the dimension
location. The hierarchy for location is defined as the order street < city < province
or state < country. The roll-up operation aggregates the data by ascending the
location hierarchy from the level of city to the level of country.

When a roll-up is performed by dimension reduction, one or more dimensions are
removed from the cube. For example, consider a sales data cube having two
dimensions, location and time. Roll-up may be performed by removing the time
dimension, resulting in an aggregation of the total sales by location rather than by
location and by time.
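
A minimal sketch of roll-up by dimension reduction, assuming a small pandas table with invented location, quarter, and amount values:

```python
import pandas as pd

# Hypothetical sales cube with two dimensions, location and time (quarter).
sales = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "amount":   [100, 120, 90, 110],
})

# Roll-up by dimension reduction: drop the time dimension entirely and
# aggregate total sales by location alone.
print(sales.groupby("location")["amount"].sum())
```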

Example
Consider the following cube illustrating the temperature on certain days, recorded
weekly:

Temperature 64 65 68 69 70 71 72 75 80 8

Week1 1 0 1 0 1 0 0 0 0 0

Week2 0 0 0 1 0 0 1 2 0 1

Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in
temperature from the above cube.

To do this, we have to group the columns and add up the values according to the
concept hierarchy. This operation is known as a roll-up.

By doing this, we obtain the following cube:

Temperature cool mild

Week1 2 1
Week2 2 1

The roll-up operation groups the information by levels of temperature.
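
The same kind of roll-up can be sketched in pandas by binning the numeric temperature dimension into its concept-hierarchy levels; the readings below are invented (they are not the partially truncated table above), so the counts will differ from it.

```python
import pandas as pd

# Hypothetical temperature readings per week.
readings = pd.DataFrame({
    "week": ["Week1", "Week1", "Week1", "Week2", "Week2", "Week2"],
    "temperature": [64, 68, 70, 69, 72, 75],
})

# Concept hierarchy for temperature: cool (64-69), mild (70-75), hot (80-85).
readings["level"] = pd.cut(readings["temperature"],
                           bins=[63, 69, 75, 85],
                           labels=["cool", "mild", "hot"])

# Roll-up: count readings per week at the coarser "level" granularity.
rolled_up = readings.pivot_table(index="week", columns="level",
                                 values="temperature", aggfunc="count",
                                 fill_value=0, observed=False)
print(rolled_up)
```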

The following diagram illustrates how roll-up works.

Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up.
Drill-down is like zooming in on the data cube. It navigates from less detailed data
to more detailed data. Drill-down can be performed either by stepping down a
concept hierarchy for a dimension or by adding additional dimensions.

The figure shows a drill-down operation performed on the dimension time by
stepping down a concept hierarchy defined as day, month, quarter, and year.
Drill-down proceeds by descending the time hierarchy from the level of quarter to
the more detailed level of month.
Because a drill-down adds more details to the given data, it can also be performed
by adding a new dimension to a cube. For example, a drill-down on the central cubes
of the figure can occur by introducing an additional dimension, such as a customer
group.

Example
Drill-down adds more details to the given data

Temperature cool mild

Day 1 0 0

Day 2 0 0

Day 3 0 0

Day 4 0 1

Day 5 1 0

Day 6 0 0

Day 7 1 0

Day 8 0 0

Day 9 1 0

Day 10 0 1

Day 11 0 1

Day 12 0 1

Day 13 0 0

Day 14 0 0

The following diagram illustrates how Drill-down works.


Slice
A slice is a subset of the cube corresponding to a single value for one or more
members of a dimension. For example, a slice operation is executed when the user
wants a selection on one dimension of a three-dimensional cube, resulting in a
two-dimensional slice. So, the slice operation performs a selection on one dimension
of the given cube, thus resulting in a subcube.

For example, if we make the selection temperature = cool, we obtain the following
cube:

Temperature cool

Day 1 0

Day 2 0

Day 3 0

Day 4 0

Day 5 1
Day 6 1

Day 7 1

Day 8 1

Day 9 1

Day 11 0

Day 12 0

Day 13 0

Day 14 0

The following diagram illustrates how Slice works.


Here, slice is performed on the dimension "time" using the criterion time = "Q1".

It forms a new subcube by selecting one of the dimensions.

Dice
The dice operation defines a subcube by performing a selection on two or more
dimensions.

For example, applying the selection (time = day 3 OR time = day 4) AND
(temperature = cool OR temperature = hot) to the original cube, we get the
following subcube (still two-dimensional):
Temperature cool hot

Day 3 0 1

Day 4 0 0

Consider the following diagram, which shows the dice operations.

The dice operation on the cube based on the following selection criteria involves
three dimensions:

o (location = "Toronto" or "Vancouver")
o (time = "Q1" or "Q2")
o (item = "Mobile" or "Modem")
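
As a rough sketch, this three-dimensional dice can be expressed as boolean masks over a small, invented fact table in pandas:

```python
import pandas as pd

# Hypothetical fact table (locations, quarters, items, and amounts are invented).
sales = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "New York", "Vancouver"],
    "time":     ["Q1", "Q3", "Q2", "Q1", "Q1"],
    "item":     ["Mobile", "Modem", "Mobile", "Mobile", "Phone"],
    "amount":   [605, 825, 14, 400, 512],
})

# Dice: apply all three selection criteria simultaneously.
subcube = sales[
    sales["location"].isin(["Toronto", "Vancouver"])
    & sales["time"].isin(["Q1", "Q2"])
    & sales["item"].isin(["Mobile", "Modem"])
]
print(subcube)
```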

Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that
rotates the data axes in order to provide an alternative presentation of the data. It
may involve swapping the rows and columns, or moving one of the row dimensions
into the column dimensions.

Consider the following diagram, which shows the pivot operation.


Other OLAP Operations
The drill-across operation executes queries involving more than one fact table. The
drill-through operation makes use of relational SQL facilities to drill through the
bottom level of a data cube down to its back-end relational tables.

Other OLAP operations may include ranking the top-N or bottom-N elements in lists,
as well as computing moving averages, growth rates, interest, internal rates of
return, depreciation, currency conversions, and statistical functions.

OLAP offers analytical modeling capabilities, including a calculation engine for
deriving ratios, variances, and so on, and for computing measures across multiple
dimensions. It can generate summarizations, aggregations, and hierarchies at each
granularity level and at every dimension intersection. OLAP also provides functional
models for forecasting, trend analysis, and statistical analysis. In this context, the
OLAP engine is a powerful data analysis tool.
