Afrin
Data mining depends on various methods and technologies from the intersection of machine learning, database management, and statistics. What do data mining practitioners actually do, and what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed and used, as summarized below.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata, and to assign data items to predefined classes.
2. Clustering:
Clustering analysis is a data mining technique used to identify data items that are similar to each other. It helps to recognize the differences and similarities between data items.
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables. For example, you might use it to project a certain cost, depending on other factors such as availability, consumer demand, and competition.
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern in the data set.
Association rules are if-then statements that show the probability of interactions between data items within large data sets.
The way the algorithm works is that you have various data, for example, a list of grocery items that customers have purchased over a period of time. Three measures are used to evaluate a rule (a small worked sketch follows this list):
o Lift:
This measure tells how strongly items A and B are associated compared with chance:
(Confidence) / ((Item B) / (Entire dataset))
o Support:
This measure tells how frequently items A and B occur together in the dataset:
(Item A + Item B) / (Entire dataset)
o Confidence:
This measure tells how often the rule "if A then B" is found to be true:
(Item A + Item B) / (Item A)
5. Outlier Detection:
This data mining technique relates to the observation of data items in the data set that do not match an expected pattern or behavior; it is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset. The major applications include network intrusion identification, credit or debit card fraud detection, detecting outliers in wireless sensor network data, etc.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over time.
7. Prediction:
Prediction uses a combination of other data mining techniques, such as trend analysis, clustering, and classification, to analyze past events and project likely future values.
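To make the three association-rule measures concrete, here is a minimal Python sketch that computes support, confidence, and lift for a single rule A → B; the grocery transactions below are invented purely for illustration.

```python
# Minimal sketch of the three association-rule measures for one rule A -> B.
# The transactions are illustrative, not taken from the text.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

A, B = {"bread"}, {"milk"}

n = len(transactions)
count_A = sum(1 for t in transactions if A <= t)
count_B = sum(1 for t in transactions if B <= t)
count_AB = sum(1 for t in transactions if (A | B) <= t)

support = count_AB / n              # (Item A + Item B) / (Entire dataset)
confidence = count_AB / count_A     # (Item A + Item B) / (Item A)
lift = confidence / (count_B / n)   # (Confidence) / ((Item B) / (Entire dataset))

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items occur together more often than chance would predict; a lift below 1 suggests the opposite.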
Data mining techniques are methods and processes used to discover useful patterns, trends, correlations, and relationships in large datasets. Commonly used families of techniques include:
3. **Clustering:** Clustering is the process of grouping similar data points together into clusters or groups based on their similarity.
4. **Association Rule Mining:** Association rule mining is used to discover interesting associations and relationships among items in large datasets; market basket analysis is a typical application of association rule mining.
5. **Anomaly Detection:** Anomaly detection identifies unusual or rare data points that deviate from the norm; fraud detection and network-intrusion monitoring are common uses of anomaly detection.
6. **Text Mining:** Text mining focuses on extracting valuable information from textual data. Techniques include sentiment analysis, topic modeling, and keyword extraction.
7. **Time Series Analysis:** Time series analysis is used for data where the order of observations matters, such as forecasting sales or monitoring sensor readings over time.
9. **Neural Networks and Deep Learning:** Deep learning techniques, including neural networks, can model complex, non-linear relationships and are widely used for image, text, and speech data.
10. **Ensemble Methods:** Ensemble methods combine the predictions of multiple models to improve overall accuracy and robustness.
11. **Web Mining:** Web mining involves extracting information from web data, including web pages, hyperlinks, and usage logs.
12. **Spatial Data Mining:** Spatial data mining focuses on geographical and spatial data, often used in applications such as geographic information systems (GIS).
These data mining techniques are often applied in combination to solve real-world problems. The choice of technique depends on the nature of the data and the goal of the analysis.
FP-Growth Algorithm
The FP-Growth (Frequent Pattern Growth) algorithm is a popular and efficient data mining algorithm used to find frequent itemsets in transaction data, such as market basket analysis. The FP-Growth algorithm was introduced by Jiawei Han, Jian Pei, and Yiwen Yin. It works roughly as follows:
1. **Data Preprocessing:** First, the transaction data is preprocessed to identify the frequent items, i.e., items whose support meets a user-defined minimum support threshold.
2. **Building the FP-Tree:** The FP-Tree is the core data structure of the algorithm. It represents the transactions in a compressed form and is built in two passes over the data:
- In the first pass, the algorithm counts the support (the number of transactions containing an item) of every item and discards infrequent items.
- In the second pass, the algorithm builds the FP-Tree. Each transaction is used to create a path in the tree, with frequent items sorted in descending order of support; transactions that share prefixes share tree paths.
3. **Mining Frequent Itemsets:** Once the FP-Tree is constructed, the algorithm recursively mines it. For each frequent item it extracts the conditional pattern base, i.e., the prefix paths that lead to the item and its associated branches. This conditional pattern base is used to construct a new, conditional FP-Tree, which is mined in turn.
4. **Generating Association Rules:** After identifying frequent itemsets, you can generate association rules from them by computing support and confidence.
The FP-Growth algorithm is known for its efficiency, especially when dealing with large datasets, because it avoids explicit candidate generation and uses the compact FP-Tree structure to significantly reduce the search space, making it faster and more scalable.
In summary, the FP-Growth algorithm is a powerful and efficient method for finding frequent itemsets and association rules in transaction data.
The FP-Growth Algorithm was proposed by Han. It is an efficient and scalable method for mining the complete set of frequent patterns by pattern-fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns, named the frequent-pattern tree (FP-tree). It outperforms earlier methods for mining frequent patterns, e.g. the Apriori Algorithm and TreeProjection. In later works, the popularity and efficiency of the FP-Growth Algorithm have led to many studies that propose variations and improvements.
o First, it compresses the input database, creating an FP-tree instance to represent frequent items.
o After this first step, it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern.
o Finally, each such database is mined separately.
Using this strategy, FP-Growth reduces the search costs by recursively looking for short patterns and then concatenating them into the longer frequent patterns.
In large databases, holding the FP-tree in main memory may be impossible. A strategy to cope with this is to partition the database into a set of smaller databases and then construct an FP-tree from each of these smaller databases.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information about frequent patterns in a database. Each transaction is read and then mapped onto a path in the tree. This is done until all transactions have been read. Different transactions with common subsets of items share paths, which keeps the tree compact.
A frequent-pattern tree is built from the initial itemsets of the database. The purpose of the FP-tree is to mine the most frequent patterns efficiently.
The root node represents null, while the lower nodes represent the itemsets. The associations between the nodes (that is, between itemsets) are maintained while forming the tree.
1. One root is labelled as "null", with a set of item-prefix subtrees as children and a frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the path reaching this node;
o Node-link: a link to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
o Item-name: the same as in the corresponding node;
o Head of node-link: a pointer to the first node in the FP-tree carrying the item name.
Additionally, the frequent-item-header table can hold the support count for each item. In the best case, when every transaction contains the same set of items, the size of the FP-tree will be only a single branch of nodes.
The worst-case scenario occurs when every transaction has a unique itemset. In that case, the space needed to store the tree is greater than the space used to store the original data set, because the FP-tree requires additional space to store pointers between nodes and the counters for each item, and the tree grows with each transaction's uniqueness.
Algorithm by Han
The original algorithm to construct the FP-Tree defined by Han is given below:
1. The first step is to scan the database to find the occurrences of the items in the database. This step is the same as the first step of Apriori. The count of a 1-itemset in the database is called the support count or frequency of the 1-itemset.
2. The second step is to construct the FP-tree. For this, create the root of the tree. The root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and find the itemset in it. The itemset with the maximum count is taken at the top, followed by the next itemset with the lower count. It means that the branch of the tree is constructed with transaction itemsets in descending order of count.
4. The next transaction in the database is examined. Its itemsets are ordered in descending order of count. If any itemset of this transaction is already present in another branch, then this transaction's branch shares a common prefix with the root.
This means that the common itemset is linked to the new node of another itemset in this transaction.
5. Also, the count of the itemset is incremented as it occurs in the transactions. The common node and new node counts are increased by 1 as they are created and linked according to the transactions.
6. The next step is to mine the created FP-tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents a frequency pattern of length 1. From this, traverse the path in the FP-tree. This path (or paths) is called a conditional pattern base.
A conditional pattern base is a sub-database consisting of prefix paths in the FP-tree occurring with the lowest node (suffix).
7. Construct a Conditional FP-tree, which is formed by a count of itemsets in the path. The itemsets meeting the threshold support are considered in the Conditional FP-tree.
8. Frequent Patterns are generated from the Conditional FP-tree.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan collects and sorts the set of frequent items; the second constructs the FP-tree.
Example
Table 1: Transactions
Transaction   Items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4
Solution: Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3
Table 2: Count of each item
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
Table 3: Sort the itemsets in descending order of count.
Item   Count
I2     5
I1     4
I3     4
I4     4
Build FP Tree
1. The lowest node item, I5, is not considered, as it does not have the minimum support count; hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths are {I2,I1,I3:1} and {I2,I3:1}; these form the conditional pattern base.
3. The conditional pattern base is considered a transaction database, and an FP-tree is constructed from it. This conditional FP-tree contains {I2:2, I3:2}; I1 is not taken, as it does not meet the minimum support count.
4. This path will generate all combinations of frequent patterns: {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}.
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this generates a 2-node conditional FP-tree {I2:4, I1:3} and the frequent patterns {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be {I2:4}; this generates a single-node conditional FP-tree {I2:4} and the frequent pattern {I1,I2:4}.
Item   Conditional Pattern Base      Conditional FP-tree   Frequent Patterns Generated
I4     {I2,I1,I3:1}, {I2,I3:1}       {I2:2, I3:2}          {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}
I3     {I2,I1:3}, {I2:1}             {I2:4, I1:3}          {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}
I1     {I2:4}                        {I2:4}                {I1,I2:4}
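As a cross-check of the example above, here is a short sketch using the third-party mlxtend library (an assumption: it must be installed, e.g. with pip install mlxtend); it runs FP-Growth on the same six transactions with the same 50% support threshold.

```python
# Sketch: mining the example transactions T1-T6 with FP-Growth via mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["I1", "I2", "I3"],        # T1
    ["I2", "I3", "I4"],        # T2
    ["I4", "I5"],              # T3
    ["I1", "I2", "I4"],        # T4
    ["I1", "I2", "I3", "I5"],  # T5
    ["I1", "I2", "I3", "I4"],  # T6
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support=0.5 corresponds to the min_sup=3 threshold used above.
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)
print(frequent)

# Optionally derive association rules from the frequent itemsets.
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```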
FP-Growth Algorithm
After constructing the FP-tree, it is possible to mine it to find the complete set of frequent patterns using the FP-Growth algorithm.
Algorithm 2: FP-Growth
Input: A database DB, represented by the FP-tree constructed according to Algorithm 1, and a minimum support threshold.
Output: The complete set of frequent patterns.

Procedure FP-growth(Tree, α)
{
   if Tree contains a single prefix path then
   {
      // Mining a single prefix-path FP-tree
      let P be the single prefix-path part of Tree;
      let Q be the multipath part with the top branching node replaced by a null root;
      for each combination (denoted as β) of the nodes in path P do
         generate pattern β ∪ α with support = minimum support of the nodes in β;
      let freq_pattern_set(P) be the set of patterns so generated;
   }
   else let Q be Tree;
   for each item ai in Q do
   {
      // Mining a multipath FP-tree
      generate pattern β = ai ∪ α with support = ai.support;
      construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
      if Treeβ ≠ ∅ then
         call FP-growth(Treeβ, β);
      let freq_pattern_set(Q) be the set of patterns so generated;
   }
   return freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q));
}
When the FP-tree contains a single prefix path, the complete set of frequent patterns can be generated in three parts: from the single prefix path P, from the multipath part Q, and from their combinations.
The resulting patterns for a single prefix path are the enumerations of its subpaths, each with the minimum support of its nodes. After that, the multipath part Q is mined, and the combined results are returned as the frequent patterns found.
o This algorithm needs to scan the database only twice, whereas Apriori scans the transactions in each iteration.
o The pairing of items is not done in this algorithm, making it faster.
o The database is stored in a compact version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.
Apriori vs. FP-Growth

Apriori: Generates frequent patterns by making the itemsets using pairings such as single itemset, double itemset, and triple itemset.
FP-Growth: Generates an FP-tree for making frequent patterns.

Apriori: Uses candidate generation, where frequent subsets are extended one item at a time.
FP-Growth: Generates a conditional FP-tree for every item in the data.

Apriori: Since Apriori scans the database in each step, it becomes time-consuming for data where the number of items is larger.
FP-Growth: The FP-tree requires only one scan of the database in its beginning steps, so it consumes less time.

Apriori: A converted version of the database is saved in the memory.
FP-Growth: A set of conditional FP-trees for every item is saved in the memory.
Bayesian classification uses Bayes' theorem to predict the occurrence of any event; Bayesian classifiers are statistical classifiers used in data mining.
Bayes' theorem is named after Thomas Bayes, who first utilized conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes's theorem is expressed mathematically by the following equation:
P(Y|X) = P(X|Y) P(Y) / P(X)
P(Y|X) is a conditional probability that describes the occurrence of event Y given that X is true, and P(X|Y) is the probability of X given that Y is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is known as the marginal probability.
Bayesian interpretation:
In the Bayesian interpretation, probability measures a "degree of belief," and Bayes' theorem connects the degree of belief in a proposition before and after accounting for evidence. For example, for a fair coin, the chance of either heads or tails is 50%; if the coin is flipped a number of times and the outcomes are observed, the degree of belief may rise, fall, or stay the same depending on the results.
The theorem can also be written using the joint probability:
P(Y|X) = P(X⋂Y) / P(X)
where P(X⋂Y) is the joint probability of both X and Y being true, because P(X⋂Y) = P(X|Y) P(Y) = P(Y|X) P(X).
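A tiny numeric illustration of the theorem; the probabilities below are invented purely for demonstration (a hypothetical spam-filtering setting).

```python
# Numeric illustration of Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X).
# All probabilities are invented for illustration only.
p_y = 0.01            # prior P(Y): probability an email is spam
p_x_given_y = 0.9     # likelihood P(X|Y): a trigger word appears given spam
p_x_given_not_y = 0.05

# Total probability: P(X) = P(X|Y)P(Y) + P(X|~Y)P(~Y)
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

p_y_given_x = p_x_given_y * p_y / p_x
print(f"P(Y|X) = {p_y_given_x:.3f}")
```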
Bayesian network:
A Bayesian Network falls under the category of Probabilistic Graphical Modelling (PGM) techniques, which compute uncertainties using the concept of probability, and it is represented by a Directed Acyclic Graph (DAG).
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical graphs, a DAG consists of a set of nodes and links, where the links denote the relationships between the nodes.
The nodes here represent random variables, and the edges define the relationships between these variables.
Bayes' Theorem is a fundamental concept in probability theory and statistics that provides a way to update the probability of a hypothesis in the light of new evidence. In classification, Bayes' Theorem is often used in a statistical method called the Naive Bayes classifier.

Bayes' Theorem can be expressed mathematically as follows:

P(A|B) = P(B|A) · P(A) / P(B)

Where:
- P(A|B) is the posterior probability of event A given evidence B.
- P(B|A) is the probability of evidence B given that event A has occurred.
- P(A) is the prior probability of event A.
- P(B) is the probability of evidence B.

In the context of data classification, suppose you have a dataset with different classes or categories and you want to assign new data points to one of them. The process is as follows:

1. **Define Classes:** First, identify and define the classes or categories you want to predict.

2. **Calculate Prior Probabilities:** Determine the prior probabilities P(A) for each class, typically as the fraction of training data points belonging to that class.

3. **Estimate Likelihoods:** For each feature or attribute in your data, calculate the conditional probability P(B|A) of observing that feature value given each class.

4. **Apply Bayes' Theorem:** When you receive new data to classify, you calculate the posterior probability for each class:

P(A|B) = P(B|A) · P(A) / P(B)

You compute this for each class, and the class with the highest posterior probability is the predicted class.

5. **Handling Independence Assumptions:** In the Naive Bayes classifier, it is often assumed that features are conditionally independent given the class, so the likelihood of a combination of feature values is the product of the probabilities of the individual feature values.

6. **Smoothing:** To handle cases where some feature-value combinations have zero probability in the training data, smoothing techniques such as Laplace smoothing are applied.

Bayes' Theorem, particularly in the Naive Bayes classifier, is commonly used in text classification and spam filtering, where the independence assumption holds reasonably true. However, it may not perform as well in situations where features are strongly correlated.
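A minimal sketch of a Naive Bayes text classifier along the lines of the steps above, assuming scikit-learn is available; the toy documents and labels are invented for illustration.

```python
# Minimal Naive Bayes text-classification sketch using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "win a free prize now",
    "limited offer, claim your prize",
    "meeting agenda for tomorrow",
    "project status and meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# alpha=1.0 is Laplace smoothing, which handles zero-probability feature counts.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(docs, labels)

print(model.predict(["free prize meeting"]))  # class with the highest posterior wins
```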
In data mining, classification is a crucial technique that involves the categorization of data points into predefined classes or categories. Various methods and software tools play a vital role in this data mining process. Let's analyze classification methods and software in this context.
1. **Decision Trees:** Decision trees are widely used in data mining. They create a hierarchical structure of decision rules based on feature values, which makes them easy to interpret.
2. **Random Forest:** Random Forest is an ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting.
3. **Support Vector Machines (SVM):** SVM is an effective method for classification, especially when the classes can be separated by a clear margin, possibly after mapping the data to a higher-dimensional space.
4. **Naive Bayes:** Naive Bayes is a probabilistic method based on Bayes' Theorem. It's often used for text classification and spam filtering.
5. **Neural Networks:** Deep learning models, such as neural networks, are powerful tools for complex classification tasks like image and speech recognition.
6. **K-Nearest Neighbors (K-NN):** K-NN is a simple yet effective method for classification. It assigns a new data point to the class most common among its k nearest neighbors.
7. **Gradient Boosting:** Algorithms like XGBoost, LightGBM, and CatBoost are ensemble methods that build trees sequentially, with each new tree correcting the errors of the previous ones.
8. **Logistic Regression:** Logistic regression is a linear classification method that models the probability of class membership.
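As an illustration of applying one of these methods, here is a small sketch that trains a decision tree with scikit-learn (assumed installed) on the bundled iris dataset.

```python
# Sketch: training and evaluating a decision tree classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```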
**Classification Software in Data Mining:**
1. **Weka:** Weka is a popular open-source data mining software that provides a wide range of classification algorithms through a graphical interface.
3. **Orange:** Orange is an open-source data visualization and analysis tool that also offers machine learning and classification components.
4. **KNIME:** KNIME is an open-source data analytics platform that allows users to build and execute data mining workflows visually.
5. **R:** R is a programming language and environment for statistical computing and graphics. It has many packages for classification and data mining.
6. **TensorFlow and Keras:** These libraries are used for deep learning and neural network-based classification.
7. **SAS Enterprise Miner:** SAS Enterprise Miner is a comprehensive data mining and machine learning platform.
8. **IBM SPSS Modeler:** IBM SPSS Modeler is a data mining and predictive analytics software that supports a variety of classification algorithms.
The choice of classification method and software in data mining depends on the specific needs of the project, the nature of the data, and the requirements of the given data mining problem.
Hierarchical method
Hierarchical clustering is a popular method in the field of unsupervised machine learning and data mining. It builds a hierarchy of clusters, usually represented as a tree-like structure called a dendrogram. This method doesn't require the number of clusters to be specified in advance. There are two main approaches:
1. **Agglomerative (Bottom-Up):** This is the most common approach. It starts with each data point as its own cluster and repeatedly merges the closest pair of clusters until only one remains.
2. **Divisive (Top-Down):** This approach starts with all data points in a single cluster and recursively splits clusters into smaller ones.
Hierarchical clustering requires a metric to measure the similarity or dissimilarity between data points. Common choices include:
- **Euclidean Distance:** The straight-line distance between two data points in a multi-dimensional space.
- **Manhattan Distance:** The sum of the absolute differences between the coordinates of two data points.
- **Cosine Similarity:** Measures the cosine of the angle between two vectors. It's often used for text data.
- **Correlation:** Measures the linear relationship between two variables.
- **Jaccard Index:** Used for binary data to measure the similarity between sets.
**Linkage Methods:**
In agglomerative hierarchical clustering, a linkage method determines how the distance between clusters is computed when merging. Common linkage methods include:
- **Single Linkage:** The distance between two clusters is the minimum distance between any two points, one from each cluster.
- **Complete Linkage:** The distance between two clusters is the maximum distance between any two points, one from each cluster.
- **Average Linkage:** The distance between two clusters is the average distance between all pairs of points across the two clusters.
- **Ward's Linkage:** Minimizes the variance within the clusters when merging. It aims to create compact, roughly spherical clusters.
**Dendrogram:**
A dendrogram is a tree-like structure that represents the hierarchy of clusters in the data. It's a graphical record of the merges (or splits) performed during clustering and the distances at which they occurred.
To obtain a specific number of clusters from a dendrogram, you can cut the tree at a certain height; the branches below the cut form the clusters.
Hierarchical clustering is a powerful tool for exploring and understanding the structure of your data, especially when the number of clusters is not known in advance.
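A short sketch of agglomerative clustering with SciPy (assumed available), using Ward's linkage and a dendrogram cut into two clusters; the data points are synthetic and purely illustrative.

```python
# Sketch: agglomerative hierarchical clustering with SciPy and Ward's linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(10, 2)),  # one synthetic group
    rng.normal(loc=(3, 3), scale=0.3, size=(10, 2)),  # a second synthetic group
])

Z = linkage(points, method="ward")                 # builds the dendrogram bottom-up
labels = fcluster(Z, t=2, criterion="maxclust")    # "cut" the tree into 2 clusters
print(labels)
```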
Hierarchical Clustering in Machine Learning
Hierarchical clustering is another unsupervised machine learning algorithm, which is used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis (HCA).
In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work, since hierarchical clustering does not require a predetermined number of clusters.
1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merging them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering first.
3. Average Linkage: It is the linkage method in which the distance between each pair of datasets is added up and then divided by the total number of pairs to calculate the average distance between two clusters.
4. Centroid Linkage: It is the linkage method in which the distance between the centroids of the clusters is calculated.
From the above-given approaches, we can apply any of them according to the type of problem or business requirement.
The working of the dendrogram can be explained using the below diagram:
In the diagram, the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.
o As discussed above, firstly, the data points P2 and P3 combine together and form a cluster, and a corresponding dendrogram is created linking P2 and P3; its height reflects the distance between the points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the distance between P5 and P6 is a little bigger than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another.
o At last, the final dendrogram is created that combines all the data points together.
We can cut the dendrogram tree structure at any level as per our requirement.
1. Objects within the same cluster are more similar to each other than to those in other clusters.
2. Objects in different clusters are more dissimilar from each other.
In other words, clustering aims to discover natural groupings or structure within data, making it easier to understand and analyze.
Clustering approaches can be categorized based on various factors, including the algorithm used, the shape of the clusters produced, and the assumptions made about the data:
1. **Partitioning Clustering:**
- In partitioning clustering, data points are divided into a predefined number of clusters.
- Algorithms: K-Means, K-Medoids, CLARA (Clustering Large Applications), PAM (Partitioning Around Medoids).
- Each data point belongs to exactly one cluster.
2. **Hierarchical Clustering:**
- Hierarchical clustering creates a tree-like structure of clusters (dendrogram) with a hierarchy of nested clusters.
- Algorithms: Agglomerative (bottom-up) and divisive (top-down) methods.
- No need to specify the number of clusters beforehand.
3. **Density-Based Clustering:**
- Density-based clustering identifies regions of high data point density and forms clusters based on these dense regions, treating low-density points as noise or outliers.
- Algorithms: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure).
4. **Grid-Based Clustering:**
- Grid-based clustering divides the data space into a grid (or mesh) and assigns data points to grid cells, which are then merged into clusters.
- Algorithms: STING (Statistical Information Grid), CLIQUE (CLustering In QUEst).
5. **Model-Based Clustering:**
- Model-based clustering assumes that data points are generated from a mixture of probability distributions and fits such a model to the data.
- Algorithms: Expectation-Maximization (EM) with Gaussian Mixture Models (GMM), Latent Dirichlet Allocation (LDA).
6. **Graph-Based Clustering:**
- Graph-based clustering represents data points as nodes in a graph and clusters as connected components or partitions of the graph.
- Algorithms: Spectral Clustering, Normalized Cuts.
7. **Fuzzy Clustering:**
- Fuzzy clustering allows data points to belong to multiple clusters with different degrees of membership.
- Algorithms: Fuzzy C-Means (FCM), Gustafson-Kessel (GK) clustering.
8. **Constraint-Based Clustering:**
- Constraint-based clustering incorporates user-defined constraints to guide the clustering process.
- Algorithms: Semi-Supervised Clustering, Constrained K-Means.
9. **Subspace Clustering:**
- Subspace clustering identifies clusters in subspaces (subsets of features) of the data.
- Algorithms: CLIQUE (in the context of subspace clustering), SUBCLU (Subspace Clustering).
Each of these clustering approaches has its own strengths and weaknesses, and the choice depends on the data and the goal of the analysis; consider the assumptions of each method when selecting an appropriate approach.
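To contrast two of the approaches above, here is a hedged scikit-learn sketch that runs a partitioning method (K-Means) and a density-based method (DBSCAN) on the same synthetic two-moons dataset.

```python
# Sketch contrasting partitioning (K-Means) and density-based (DBSCAN) clustering.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-Means forces convex partitions; DBSCAN follows the dense, crescent-shaped
# regions and marks sparse points as noise (label -1).
print(set(kmeans_labels), set(dbscan_labels))
```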
2. **Web Page:** A single document or page within a website. It can contain text, images,
multimedia, and hyperlinks to other web pages or external resources.
3. **URL (Uniform Resource Locator):** A web address that specifies the location of a resource on
the internet. It typically includes a protocol (e.g., http:// or https://), a domain name (e.g.,
www.example.com), and a specific path or resource identifier.
4. **HTML (Hypertext Markup Language):** The standard markup language used to create web
pages. HTML tags are used to structure and format content, and it is the backbone of web
development.
5. **Hyperlink:** A clickable element on a web page that, when clicked, directs the user to another
web page or resource. Hyperlinks are typically represented as underlined text or buttons.
6. **Web Browser:** Software used to access and view web pages. Popular web browsers include
Google Chrome, Mozilla Firefox, Microsoft Edge, and Safari.
7. **Web Server:** A computer or software that stores and serves web content to users upon
request. It uses protocols like HTTP (Hypertext Transfer Protocol) to communicate with web clients
(browsers).
8. **Web Hosting:** The service of storing and making websites accessible on the internet. Web
hosting providers offer various plans and resources for hosting websites.
10. **CMS (Content Management System):** A software platform that simplifies website creation
and management. Popular CMSs include WordPress, Joomla, and Drupal.
11. **Responsive Design:** Designing websites to adapt and display optimally on various devices
and screen sizes, such as desktop computers, tablets, and smartphones.
12. **SEO (Search Engine Optimization):** The practice of optimizing a website's content and
structure to improve its visibility in search engine results, driving organic traffic.
13. **Web Development:** The process of creating websites or web applications, which can involve
front-end development (user interface) and back-end development (server-side functionality).
14. **E-commerce:** Online buying and selling of goods and services, often through dedicated e-
commerce websites or platforms.
15. **Cookies:** Small text files that websites store on a user's device to track and remember user
preferences and interactions.
16. **HTTPS (Hypertext Transfer Protocol Secure):** A secure version of HTTP that encrypts data
exchanged between a user's browser and a website, providing enhanced security for data
transmission.
17. **Web 2.0:** A term referring to the second generation of web development and design,
emphasizing user-generated content, collaboration, and interactive web applications.
18. **Web Hosting:** Services that allow individuals or organizations to make their websites
accessible on the internet. It involves storing website files on a web server.
19. **Web Accessibility:** Designing websites to be usable by people with disabilities, ensuring
equal access to information and services.
20. **Web Standards:** Guidelines and specifications set by organizations like the World Wide Web
Consortium (W3C) to ensure consistency and interoperability of web technologies.
These are just a few key terms and characteristics related to the web. The web is a dynamic and
evolving space, so new terminology and technologies continually emerge. Understanding these
fundamentals is crucial for anyone involved in web development, design, or digital marketing.
WEB TERMINOLOGY AND CHARACTERISTICS
When discussing web terminology and characteristics in the context of data mining, there are several
specific terms and concepts that are relevant. Data mining on the web involves extracting valuable
insights and patterns from large datasets collected from online sources. Here are some key terms and
characteristics related to web data mining:
1. **Web Data:** Data obtained from various online sources, including websites, social media,
online forums, and other web-based platforms.
2. **Web Scraping:** The process of automatically extracting data from websites. Web scraping tools and techniques are used to collect data from web pages for further analysis (a brief scraping sketch follows this list).
3. **Web Crawling:** The automated process of navigating the web to index and collect data from
multiple websites. Search engines, like Google, use web crawlers to index web content.
4. **Structured and Unstructured Data:** Web data can be either structured (e.g., databases, tables)
or unstructured (e.g., text, images). Data mining techniques are often used to process and analyze
unstructured web data.
5. **Text Mining:** A subset of data mining that focuses on extracting insights and patterns from
textual data found on websites, including sentiment analysis, topic modeling, and keyword
extraction.
6. **Web Content Analysis:** The process of analyzing the content of web pages, which may include
text, images, videos, and other media, to gain insights into user preferences, trends, and behavior.
7. **User Behavior Tracking:** Monitoring and analyzing user interactions on websites, such as
clickstream data, to understand how users navigate and engage with web content.
8. **Web Analytics:** The practice of collecting, measuring, and analyzing web data to optimize
websites and online marketing strategies.
9. **Data Preprocessing:** Data mining often involves data cleaning, transformation, and reduction
to prepare the data for analysis. This step is critical in dealing with noisy or incomplete web data.
10. **Data Mining Algorithms:** Various algorithms are used to uncover patterns, associations, and
insights in web data. Common algorithms include decision trees, clustering, and association rule
mining.
11. **Data Visualization:** Representing data mining results through charts, graphs, and
visualizations to make complex patterns and insights more understandable.
13. **Anomaly Detection:** Identifying unusual patterns or outliers in web data, which can be useful
for fraud detection or monitoring network security.
14. **Big Data:** Web data mining often deals with large and complex datasets, making it a part of
the broader field of big data analytics.
15. **Privacy and Ethical Concerns:** Web data mining must consider issues related to user privacy,
data protection, and ethical data usage, especially with the increasing focus on data regulations and
user rights.
16. **Machine Learning:** Utilizing machine learning algorithms and techniques to improve the
accuracy and predictive capabilities of web data mining models.
17. **Web Mining Tools and Frameworks:** Various software tools and frameworks are available to
facilitate web data mining, including Scrapy for web scraping and Python libraries like scikit-learn for
data analysis.
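A minimal web-scraping sketch, assuming the requests and beautifulsoup4 packages are installed; https://example.com is used only as a harmless placeholder URL.

```python
# Minimal web-scraping sketch with requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title and all hyperlink targets for later analysis.
print(soup.title.string if soup.title else "<no title>")
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)
```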
Web data mining is an essential component of extracting valuable insights from the vast amount of
information available on the internet. Understanding these web-specific terms and characteristics is
crucial for data scientists, analysts, and businesses looking to leverage web data for informed
decision-making.
Search engine architecture in data mining refers to the underlying structure and
components of a search engine that are designed to retrieve and present information from vast
collections of data, such as web pages, databases, or documents. Data mining techniques are often
employed within the architecture of search engines to provide more relevant and efficient search
results. Here are the key components of search engine architecture in data mining:
- **Crawler (Web Spider):** The crawler, also known as a web spider, is responsible for traversing
the web and collecting web pages or documents. It follows hyperlinks from one web page to another
and downloads their content.
- **Indexer:** The indexer processes the downloaded content, extracts textual information, and
builds an index of the documents. This index is crucial for efficient and quick retrieval of relevant
results.
2. **Data Preprocessing:**
- **Data Cleaning:** Removing duplicate content, correcting errors, and filtering out irrelevant
information to ensure high data quality.
- **Data Transformation:** Converting data into a suitable format for analysis, which may involve
text preprocessing, feature extraction, and dimensionality reduction.
3. **Query Processing:**
- **User Query Parsing:** Parsing and understanding user queries to extract relevant keywords and
concepts.
- **Query Expansion:** Expanding user queries with synonyms or related terms to improve recall
and precision in search results.
4. **Ranking and Retrieval:**
- **Relevance Scoring:** Assigning a relevance score to each document based on factors like keyword matches, document popularity, and user behavior (a small ranking sketch follows this list).
- **Association Rule Mining:** Discovering patterns and associations within the data, which can be
used for query suggestions or content recommendations.
- **Text Mining:** Extracting insights from unstructured text data, such as sentiment analysis,
named entity recognition, and topic modeling.
6. **User Interface:**
- **Search Interface:** The user interface is the front-end that allows users to input queries and
view search results. It may include advanced search options, filters, and facets.
7. **Feedback Mechanisms:**
- **User Feedback:** Collecting user feedback to improve search relevance and accuracy. This may
involve feedback forms, click-through data analysis, and user surveys.
- **Data Encryption:** Protecting data during transmission and storage to maintain user privacy.
9. **Scalability:**
- **Distributed Systems:** Many modern search engines are built on distributed architectures to
handle large-scale data and user requests efficiently.
- Continuous monitoring to ensure the search engine is operational, detect and fix errors, and
update the index with new content.
- Testing and evaluating the performance of the search engine by comparing the quality of search
results against established metrics and benchmarks.
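A small sketch of the relevance-scoring and ranking step, assuming scikit-learn is available; the documents and query are invented for illustration, and TF-IDF with cosine similarity stands in for a production scoring model.

```python
# Sketch: ranking a few illustrative documents against a user query with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "data mining techniques for association rule discovery",
    "web crawling and indexing for search engines",
    "clustering algorithms group similar data points",
]
query = "search engine indexing"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:        # highest relevance score first
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```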
Search engine architecture in data mining is a complex and evolving field, with various algorithms,
models, and technologies constantly being developed to improve the accuracy and efficiency of
search engines. These systems play a vital role in helping users find relevant information within the
vast amount of data available on the internet and in various data repositories.
- **Purpose:** OLTP systems are designed for efficient and real-time transaction processing. They
handle day-to-day operational tasks, such as order processing, inventory management, and customer
interactions.
- **Characteristics:** OLTP databases are optimized for fast and frequent data insertions, updates,
and deletions. They focus on maintaining data integrity and ensuring that transactions are processed
accurately and quickly.
- **Data Schema:** OLTP databases typically have normalized schemas to reduce data redundancy
and maintain data consistency.
- **Example:** A point-of-sale system in a retail store that records individual sales transactions is
an OLTP application.
- **Characteristics:** OLAP databases are optimized for read-heavy operations, aggregations, and
complex queries. They may use denormalized data structures to improve query performance.
- **Data Schema:** OLAP databases use star or snowflake schemas, which involve fact tables and
dimension tables to enable efficient data analysis.
- **Example:** A data warehouse that stores historical sales data and allows business analysts to
generate reports, perform trend analysis, and make strategic decisions is an OLAP application.
- Data mining can be integrated with OLAP systems to discover hidden patterns, associations, and
insights within large datasets. OLAP cubes and reports can serve as a starting point for data mining
analysis.
- Data mining algorithms, such as clustering, classification, and association rule mining, can be
applied to OLAP data to uncover valuable information about customer behavior, market trends, or
operational efficiency.
- The results of data mining can feed back into the OLAP system, enriching the multidimensional
view and enabling more informed decision-making.
4. **Data Flow:**
- OLTP systems capture and store transactional data in real-time. These transactions are essential
for tracking business operations and maintaining data integrity.
- OLAP systems periodically extract and transform data from OLTP databases into a data
warehouse. This process may involve aggregation, cleaning, and structuring data for analytical
purposes.
- Data mining can be performed on the data warehouse, which contains historical and aggregated
data, making it suitable for discovering patterns and trends.
5. **Business Applications:**
- OLTP systems are primarily used for daily operational tasks and data recording.
- OLAP systems support strategic and business intelligence functions, enabling decision-makers to
analyze data and generate reports.
- Data mining provides a deeper layer of analysis, helping organizations make data-driven
predictions and discover hidden insights that might not be apparent through traditional reporting.
Integrating OLTP, OLAP, and data mining within an organization's data management and analysis
strategy allows for a comprehensive approach to handling data, from real-time transactions to
historical analysis and predictive modeling. This synergy helps organizations make informed decisions
and gain a competitive advantage in various industries.
Pre-Requisite: OLAP, OLTP
OLAP stands for Online Analytical Processing. OLAP systems have the capability to
analyze database information of multiple systems at the current time. The primary
goal of OLAP Service is data analysis and not data processing.
OLTP stands for Online Transaction Processing. OLTP has the work to administer
day-to-day transactions in any organization. The main goal of OLTP is data
processing not data analysis.
OLAP Examples
Any type of Data Warehouse System is an OLAP system. The uses of the OLAP
System are described below.
Spotify analyzes the songs played by users to come up with a personalized homepage
of songs and playlists.
Netflix movie recommendation system.
OLTP Examples
An example of an OLTP system is an ATM center: the person who authenticates first
receives the amount first, and the condition is that the amount to be withdrawn must
be present in the ATM. The uses of the OLTP System are described below.
ATM center is an OLTP application.
OLTP handles the ACID properties during data transactions via the
application.
It’s also used for online banking, online airline ticket booking, sending a
text message, and adding a book to a shopping cart.
Benefits of OLTP Services
o OLTP services allow users to read, write, and delete data quickly.
o OLTP services support growing numbers of users and transactions, which helps in real-time access to data.
o OLTP services help provide better security by applying multiple security features.
o OLTP services help in better decision-making by providing accurate, current data.
o OLTP services provide data integrity, consistency, and high availability.
Drawbacks of OLTP Services
o OLTP has limited analysis capability, as it is not intended for complex analysis or reporting.
o OLTP has high maintenance costs because of frequent maintenance, backups, and recovery.
o OLTP services get hampered whenever there is a hardware failure, which leads to the failure of online transactions.
o OLTP services many times experience issues such as duplicate or inconsistent data.
Difference between OLAP and OLTP

Definition
  OLAP (Online Analytical Processing): It is well-known as an online database query management system.
  OLTP (Online Transaction Processing): It is well-known as an online database modifying system.
Application
  OLAP: It is subject-oriented. Used for data mining, analytics, decision making, etc.
  OLTP: It is application-oriented. Used for business tasks.
Task
  OLAP: It provides a multi-dimensional view of different business tasks.
  OLTP: It reveals a snapshot of present business tasks.
In data warehousing, a data cube is a multidimensional representation of data that allows for
complex analysis and reporting. It stores data in a format that is optimized for querying and reporting
on multiple dimensions. The key components of a data cube in data warehousing include:
1. **Dimensions**: These are the attributes or characteristics by which you want to analyze data. For
example, in a retail context, dimensions could include time, products, and stores.
2. **Measures**: These are the numeric values or metrics you want to analyze. For instance, in
retail, measures could be sales revenue, units sold, or profit.
3. **Hierarchies**: Dimensions can have hierarchies, which represent levels of granularity. For
example, the time dimension might have hierarchies like year, quarter, month, and day.
4. **Cuboid**: A cuboid represents a specific subcube within the data cube, defined by a
combination of dimension values. It's essentially a slice of the cube used for analysis.
5. **Aggregations**: Data cubes often store aggregated values to speed up query performance,
especially for summarization at higher levels of detail.
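As a rough illustration of these components, the sketch below builds one cuboid from a tiny, invented sales table with pandas: city and month act as dimensions, sales is the measure, and the pivot table is an aggregated view.

```python
# Sketch: a tiny "cube" with pandas; city and month are dimensions, sales is the
# measure, and the pivot table below is one cuboid (the data are illustrative).
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi", "Mumbai"],
    "month":   ["Jan",   "Feb",   "Jan",    "Feb",    "Jan",   "Jan"],
    "product": ["A",     "A",     "B",      "A",      "B",     "A"],
    "sales":   [100,     120,     90,       80,       60,      70],
})

cuboid = pd.pivot_table(sales, values="sales", index="city",
                        columns="month", aggfunc="sum", fill_value=0)
print(cuboid)
```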
1. **Pattern Discovery**: Data cubes are used to identify patterns and trends within the data,
helping data miners discover valuable insights. Patterns can include associations, correlations, and
anomalies.
2. **Data Visualization**: Data cubes can be visualized in the form of pivot tables or
multidimensional charts, making it easier for data analysts to understand complex relationships
within the data.
3. **Drill-Down and Roll-Up**: Data miners can drill down to finer levels of detail or roll up to higher-
level summaries within the data cube to explore patterns and trends more deeply.
4. **Hypothesis Testing**: Data cubes allow for hypothesis testing and the evaluation of data mining
models within the context of multiple dimensions.
5. **Advanced Analytics**: Data mining techniques, such as clustering, classification, and regression,
can be applied to data cubes to generate predictive models and make data-driven decisions.
In summary, data cubes in data warehousing provide a structured way to store and retrieve
multidimensional data for reporting and analysis, while data cubes in data mining are used for
discovering patterns and generating insights from multidimensional datasets. The concept of data
cubes is pivotal in both domains, enabling efficient and comprehensive data analysis.
Dicing: this operation performs a multidimensional cut; rather than cutting along only one dimension, it can also move to another dimension and cut a certain range of it. As a result, it produces a subcube of the whole cube. For example, the user wants to see the annual salary of Jharkhand state employees.
Online Analytical Processing (OLAP) operations are a set of operations used to interact with
multidimensional data cubes in data warehousing and business intelligence systems. OLAP enables
users to explore, analyze, and gain insights from data in a multidimensional format. There are several
primary OLAP operations, including:
1. **Roll-Up (Aggregation):**
- **Operation:** Roll-up, also known as aggregation, involves summarizing data at a higher level of
a dimension hierarchy. It is moving from a detailed level to a more aggregated or summarized level.
2. **Drill-Down (Detail):**
- **Example:** Exploring monthly sales data to view daily sales for a specific month.
3. **Slice:**
- **Operation:** Slicing involves selecting a single value of one dimension to view a specific "slice" of the data cube, reducing its dimensionality by one.
- **Example:** Viewing sales data for a particular quarter (one dimension fixed) across all products and regions.
4. **Dice:**
- **Operation:** Dicing allows the selection of two or more dimensions and specific levels within
those dimensions to create a subcube or a more focused view of the data.
- **Example:** Creating a subcube to examine sales data for a particular region, product category,
and time period.
5. **Pivot (Rotation):**
- **Operation:** Pivoting involves changing the orientation of the data cube to view it from a
different perspective. It typically involves interchanging rows and columns in the data representation.
- **Example:** Rotating a data cube to view sales by product category as rows and quarters as
columns.
6. **Query (Selection):**
- **Operation:** Query or selection allows users to specify criteria or conditions to filter the data in
the cube. It focuses on a specific subset of the data based on user-defined constraints.
- **Example:** Selecting only sales data for products that belong to a particular category and are
sold in a specific region.
7. **Drill-Through:**
- **Operation:** Drill-through provides a way to access detailed data at the lowest level of
granularity, often by connecting to the underlying relational or transactional databases.
8. **Ranking:**
- **Operation:** Ranking involves calculating and displaying the rank or order of data values within
a particular dimension. It helps identify the highest or lowest values.
9. **Top-N (Bottom-N):**
- **Operation:** Top-N and Bottom-N operations return the top (or bottom) N values based on a
specific measure. This is useful for identifying the best or worst performers.
- **Example:** Finding the top 10 best-selling products or the bottom 5 least profitable regions.
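A hedged pandas sketch of three of these operations (slice, dice, and pivot) on a small invented data frame with region, product, and quarter as dimensions and sales as the measure:

```python
# Sketch of slice, dice, and pivot on an illustrative pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["Books", "Toys",  "Books", "Toys",  "Books", "Books"],
    "quarter": ["Q1",    "Q1",    "Q1",    "Q2",    "Q2",    "Q2"],
    "sales":   [200,     150,     180,     120,     210,     160],
})

# Slice: fix one dimension value (quarter = Q1).
q1_slice = df[df["quarter"] == "Q1"]

# Dice: restrict two or more dimensions to chosen values (a subcube).
subcube = df[(df["region"] == "North") & (df["quarter"].isin(["Q1", "Q2"]))]

# Pivot (rotation): products as rows, quarters as columns.
pivoted = pd.pivot_table(df, values="sales", index="product",
                         columns="quarter", aggfunc="sum", fill_value=0)
print(q1_slice, subcube, pivoted, sep="\n\n")
```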
These OLAP operations provide users with the flexibility to interact with multidimensional data
cubes, allowing them to explore data from various angles and levels of granularity, make informed
decisions, and discover valuable insights. OLAP tools and systems are widely used in business
intelligence and data analysis to support decision-making processes
Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension
or by dimension reduction. Roll-up is like zooming out on the data cube. The figure
shows the result of a roll-up operation performed on the dimension location. The
hierarchy for location is defined as the order street < city < province or state <
country. The roll-up operation aggregates the data by ascending the location
hierarchy from the level of the city to the level of the country.
Example
Consider the following cubes illustrating temperature of certain days recorded
weekly:
Temperature   64  65  68  69  70  71  72  75  80  8
Week1          1   0   1   0   1   0   0   0   0  0
Week2          0   0   0   1   0   0   1   2   0  1
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in
temperature from the above cubes.
To do this, we have to group the columns and add up the values according to the
concept hierarchy. This operation is known as a roll-up.
Week1 2 1
Week2 2 1
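The same roll-up can be sketched in pandas by mapping the temperature dimension to the cool/mild/hot levels and summing per week; the readings below are adapted loosely from the example above and are partly illustrative, since the original table is truncated.

```python
# Sketch of a roll-up: map fine-grained temperatures to coarser levels and sum.
import pandas as pd

readings = pd.DataFrame({
    "week":        ["Week1", "Week1", "Week1", "Week2", "Week2", "Week2", "Week2"],
    "temperature": [64,      68,      70,      69,      72,      75,      85],  # 85 is illustrative
    "count":       [1,       1,       1,       1,       1,       2,       1],
})

def level(t):
    # Concept hierarchy from the text: cool 64-69, mild 70-75, hot 80-85.
    if t <= 69:
        return "cool"
    if t <= 75:
        return "mild"
    return "hot"

readings["level"] = readings["temperature"].map(level)
rolled_up = readings.groupby(["week", "level"])["count"].sum().unstack(fill_value=0)
print(rolled_up)
```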
Drill-Down
The drill-down operation (also called roll-down) is the reverse operation of roll-up.
Drill-down is like zooming-in on the data cube. It navigates from less detailed record
to more detailed data. Drill-down can be performed by either stepping down a
concept hierarchy for a dimension or adding additional dimensions.
Example
Drill-down adds more details to the given data
Day 1 0 0
Day 2 0 0
Day 3 0 0
Day 4 0 1
Day 5 1 0
Day 6 0 0
Day 7 1 0
Day 8 0 0
Day 9 1 0
Day 10 0 1
Day 11 0 1
Day 12 0 1
Day 13 0 0
Day 14 0 0
Slice
The slice operation selects a single value along one dimension of the given cube, producing a new sub-cube. For example, if we make the selection temperature = cool, we will obtain the following cube:
Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 1
Day 7 1
Day 8 1
Day 9 1
Day 11 0
Day 12 0
Day 13 0
Day 14 0
Dice
The dice operation defines a subcube by performing a selection on two or more
dimensions.
For example, Implement the selection (time = day 3 OR time = day 4) AND
(temperature = cool OR temperature = hot) to the original cubes we get the
following subcube (still two-dimensional)
Temperature cool hot
Day 3 0 1
Day 4 0 0
The dice operation on the cubes based on the following selection criteria involves
three dimensions.
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that
rotates the data axes in view to provide an alternative presentation of the data. It
may involve swapping the rows and columns, or moving one of the row dimensions
into the column dimensions.
Other OLAP operations may include ranking the top-N or bottom-N elements in lists,
as well as calculating moving averages, growth rates, interest, internal rates of
return, depreciation, currency conversions, and statistical functions.