Data Mining Micro PGDM
1) Explain the components of a Data Mining System?
The major components of a data mining system are the data source, data mining engine, data warehouse server, pattern evaluation module, graphical user interface, and knowledge base.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. A huge amount of historical data is needed for data mining to be successful. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes even plain text files or spreadsheets may contain useful information. Another primary source of data is the World Wide Web, or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. Because the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure; the data may be incomplete and inaccurate. So the data first needs to be cleaned and integrated. More information than needed is collected from the various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as easy as they sound; several methods may be performed on the data as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several modules for performing data mining tasks, including association, characterization, classification, clustering, prediction, and time-series analysis.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from the data collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
This module commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns. It may use an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure, so as to confine the search to only the interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user. This module helps the user use the system easily and efficiently without needing to know the complexity of the process. It interacts with the data mining system when the user specifies a query or a task, and it displays the results.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It may be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
2) Explain Naïve Bayes Classification?
Naïve Bayes classifiers can predict class membership probabilities, that is, the probability that a given data item belongs to a particular class.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms.
It helps in building fast machine learning models that can make quick predictions.
Bayes' Theorem:
P(Y|X) = P(X|Y) P(Y) / P(X)
Because the denominator P(X) is the same for every class, the posteriors for a class Y (Yes) and a class N (No) can be compared using:
P(Y|X1, X2, …, Xn) ∝ P(X1|Y) * P(X2|Y) * … * P(Xn|Y) * P(Y)
P(N|X1, X2, …, Xn) ∝ P(X1|N) * P(X2|N) * … * P(Xn|N) * P(N)
Example:
Suppose we want to predict whether a car is stolen. Assume a small dataset with attributes such as Color and Origin and a class label Stolen (Yes/No); the dataset table and the frequency and conditional probability tables built from it are not reproduced here.
According to this dataset, the algorithm makes the following assumptions: the features are independent of one another given the class, and each feature is given equal importance (for example, knowing only the Color and Origin would not be enough to predict the outcome correctly on its own).
Using the conditional probabilities from the frequency tables, the two class scores for the query instance are:
P(Yes | X) ∝ 3/5 x 1/5 x 2/5 x 1 = 0.048
P(No | X) ∝ 2/5 x 3/5 x 3/5 x 1 = 0.144
Since 0.144 > 0.048, the instance is classified as No (not stolen).
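The arithmetic of this worked example can be reproduced in a few lines of Python. This is only a minimal sketch: the likelihood values are copied from the calculation above, and the frequency tables they come from are assumed rather than shown.

```python
# Minimal sketch reproducing the Naive Bayes calculation above.
# The conditional probabilities are taken from the worked example; the
# frequency tables they come from are assumed, not reproduced here.

likelihoods_yes = [3/5, 1/5, 2/5]   # P(feature | Yes) for each attribute of the query
likelihoods_no = [2/5, 3/5, 3/5]    # P(feature | No) for each attribute of the query
prior_yes = prior_no = 1            # priors as written in the example (the "x 1" factor)

def class_score(likelihoods, prior):
    """Unnormalised posterior: product of the feature likelihoods times the prior."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

score_yes = class_score(likelihoods_yes, prior_yes)   # 0.048
score_no = class_score(likelihoods_no, prior_no)      # 0.144

print(score_yes, score_no)
print("Predicted class:", "Yes (stolen)" if score_yes > score_no else "No (not stolen)")
```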
Advantages
- It is simple, fast, and effective for classification tasks.
- It works well even with high-dimensional data and is computationally efficient.
Disadvantages
- Naive Bayes assumes that all predictors (features) are independent, which rarely happens in real life.
- Its probability estimates can be wrong in some cases, so its probability outputs should not be taken too seriously.
3) Explain FP Growth/Tree?
FP-Growth is an efficient algorithm for mining the complete set of frequent patterns; it uses a tree structure, called the frequent pattern (FP) tree, to store information about frequent patterns.
The FP-Growth algorithm represents the database in the form of a tree called a frequent pattern tree or FP tree.
This tree structure maintains the associations between the item sets.
The frequent pattern tree is a tree-like structure that is built from the initial item sets of the database.
The root node represents NULL, while the lower nodes represent the item sets.
The associations between nodes, that is, between an item set and the item sets below it, are maintained while forming the tree.
Steps of the FP-Growth algorithm (FP-tree construction and mining):
1. The first step is to scan the database once to find the support count of each item. Infrequent items are discarded, and the frequent items are sorted in descending order of their support counts.
2. The second step is to construct the FP tree. For this, create the root of the tree. The
root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the item set in it. The item set with the max count is taken
at the top, and then the next item set with the lower count. It means that the branch of
the tree is constructed with transaction item sets in descending order of count.
4. The next transaction in the database is examined. The item sets are ordered in
descending order of count. If any item set of this transaction is already present in
another branch, then this transaction branch would share a common prefix to the root.
This means that the common item set is linked to the new node of another item set in
this transaction.
5. Also, the count of the item set is incremented as it occurs in the transactions. The
common node and new node count are increased by 1 as they are created and linked
according to transactions.
6. The next step is to mine the created FP tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From this, traverse the path in the FP tree. This path, or set of paths, is called a conditional pattern base. A conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur together with the lowest node (the suffix).
7. Construct a conditional FP tree, formed from the counts of the item sets in these paths. Only the item sets that meet the threshold support are included in the conditional FP tree, and the frequent patterns are then generated from it.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan
collects and sorts the set of frequent items, and the second constructs the FP-Tree.
Advantages of FP-Growth Algorithm
o This algorithm needs to scan the database only twice, whereas Apriori scans the transactions for each iteration.
o The pairing of items is not done in this algorithm, making it faster.
o The database is stored in a compact version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages of FP-Growth Algorithm This algorithm also has some disadvantages,
such as:
o The algorithm may not fit in the shared memory when the database is large.
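The FP-Growth workflow described above can be sketched with the mlxtend library (the library choice is an assumption on my part; it is not mentioned in the notes). The transactions below are hypothetical and only illustrate the call sequence.

```python
# Sketch of FP-Growth using the mlxtend library (assumed to be installed).
# The transactions are hypothetical, for illustration only.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["Bread", "Milk", "Juice"],
    ["Bread", "Cheese"],
    ["Milk", "Cheese", "Juice"],
    ["Bread", "Milk", "Cheese"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Mine frequent item sets (FP-tree construction and mining happen internally)
frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
```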
Difference between Apriori and FP-Growth:
1. Apriori generates frequent patterns by forming item sets through pairing: single item sets, double item sets, and triple item sets. FP-Growth generates frequent patterns by constructing an FP tree from the database.
2. Apriori uses candidate generation, where frequent subsets are extended one item at a time. FP-Growth does not pair items or generate candidates, which makes it faster.
3. Since Apriori scans the database in each step, it becomes time-consuming for data where the number of items is large. The FP tree requires only two database scans, so it consumes less time.
Apriori Algorithm
The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent item sets in a dataset for Boolean association rules.
It is an important algorithm for mining frequent item sets and the relevant association rules in a dataset.
Let us take an example to understand the concept better. You must have noticed that a pizza shop seller often sells a pizza, soft drink, and breadsticks together as a combo, and offers a discount to customers who buy the combo. Why does he do so? He knows that customers who buy pizza also tend to buy soft drinks and breadsticks, so by making combos he makes buying easier for the customers and, at the same time, increases his sales.
In the same way, the Apriori algorithm helps customers buy products with ease, increases the sales performance of the store, and reduces cost.
The Apriori algorithm uses the concept of “Apriori property” which states that if an item
set is frequent, then all of its subsets must also be frequent.
Ex: if the item set {A, B, C} appears frequently in a dataset, then {A, B}, {A, C}, {B, C}, {A}, {B}, and {C} must also appear frequently in the dataset.
The Apriori algorithm generates singletons, pairs, and triplets by pairing the items within the transactions.
It uses candidate generation, a joining step in which frequent item sets are combined to form larger candidate item sets, extending them one item at a time.
The algorithm analyzes a dataset to determine which combinations of items occur together
frequently.
Apriori algorithm is an expensive method to find support since the calculation has to pass
through the whole database.
Steps for Apriori Algorithm
The Apriori algorithm has the following steps (a small code sketch of this workflow is shown after the steps):
Step 1: Compute the support of the item sets in the transactional database, and set the minimum support and minimum confidence thresholds.
Step 2: Take all the item sets whose support is greater than the minimum (chosen) support value.
Step 3: From these subsets, find all the rules whose confidence is greater than the minimum (threshold) confidence.
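As promised above, here is a sketch of these steps in code using the mlxtend library (the library, transactions, and thresholds are assumptions for illustration, not something the notes prescribe).

```python
# Sketch of the Apriori workflow using the mlxtend library (assumed installed).
# Transactions and thresholds are hypothetical, for illustration only.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Cheese", "Juice"],
    ["Bread", "Milk", "Juice"],
    ["Bread", "Milk"],
    ["Cheese", "Juice"],
    ["Bread", "Cheese", "Milk", "Juice"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Steps 1-2: keep item sets whose support meets the minimum support threshold
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Step 3: keep rules whose confidence meets the minimum confidence threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```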
Ex:
For the following given transaction data set, generate rules using the Apriori algorithm. (The transaction table is not reproduced here; it contains 5 transactions over the items Bread, Cheese, Egg, Juice, Milk, and Yogurt.)
Answer:
Step 1) Count the frequency of each item and calculate its support:
Item     Frequency   Support
Bread    4           4/5 = 80%
Cheese   3           3/5 = 60%
Egg      1           1/5 = 20%
Juice    4           4/5 = 80%
Milk     3           3/5 = 60%
Yogurt   1           1/5 = 20%
Step 2) Remove all the items whose support is below the given minimum support:
Item     Frequency   Support
Bread    4           4/5 = 80%
Cheese   3           3/5 = 60%
Juice    4           4/5 = 80%
Milk     3           3/5 = 60%
Step 3) Now form the two-item candidate sets and write their frequencies. (The two-item candidate table is not reproduced here.)
Step 4) Remove all the item sets whose support is below the given minimum support.
Therefore, the remaining frequent item sets are used to generate the association rules. (The resulting rules are not reproduced here.)
2 marks or 5 marks
Applications of Data Mining:
2. *Finance:* Fraud detection, credit scoring, and risk analysis.
3. *Healthcare:* Predicting disease risk and supporting diagnosis and treatment decisions.
4. *Telecommunications:* Predicting customer churn and analyzing network faults.
5. *Education:*
- Identifying students at risk of academic challenges.
9. *Environmental Science:*
- Analyzing climate, pollution, and sensor data to detect environmental patterns.
Market Basket Analysis (MBA) is a data mining technique that examines the associations
between items frequently purchased together. It originated from the retail industry but
has found applications in various sectors. The primary goal of market basket analysis is
to understand customer purchasing behavior and discover patterns in product co-
occurrence.
1. *How It Works:*
- Identify items that frequently appear together in customers' transactions.
- Express these relationships as rules, often in the form "If A, then B."
2. *Key Metrics:*
- *Support:* How frequently the item set appears in the transactions.
- *Confidence:* How often the rule has been found to be true, i.e., the likelihood that item B is purchased when item A is purchased.
- *Lift:* Measures how much more likely item B is purchased when item A is bought, compared to when item B is purchased without item A.
3. *Example:*
- If customers frequently buy bread (A) and butter (B) together, the association rule
might be: "If a customer buys bread, they are likely to buy butter."
- The rule would have high support if this combination is frequent, high confidence if
customers consistently buy both, and high lift if the combination is more likely than
random chance.
4. *Applications:*
- Retail shelf placement and combo offers, cross-selling, promotional bundling, and recommendation systems.
Classification and regression are both types of supervised learning in machine learning,
but they differ in the nature of the prediction they make:
1. **Classification:**
- **Prediction:** The model predicts a discrete class label or category (for example, spam vs. not spam).
- **Evaluation:** Common evaluation metrics include accuracy, precision, recall, and F1-score.
2. **Regression:**
- **Prediction:** The model predicts a continuous numerical value (for example, a price or a temperature).
- **Evaluation:** Common evaluation metrics include mean squared error (MSE), mean absolute error (MAE), R-squared, and other regression-specific metrics.
**Key Differences:**
- **Algorithms:** While some algorithms can be used for both classification and regression (e.g., decision trees), there are specific algorithms tailored for each task. For example, logistic regression and support vector machines are commonly used for classification, while linear regression and decision trees are commonly used for regression.
In summary, the key distinction lies in whether the goal is to predict categories or
continuous values. Classification deals with discrete outcomes, while regression deals
with continuous outcomes.
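A small scikit-learn sketch makes the contrast concrete: the classifier returns a category, the regressor returns a real number. The library choice and the synthetic data are assumptions for illustration only.

```python
# Sketch contrasting a classifier and a regressor with scikit-learn (assumed installed).
# The data is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification: discrete target labels (0 = "no", 1 = "yes")
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_class)
print("Predicted class for x=3.5:", clf.predict([[3.5]]))   # a category

# Regression: continuous numeric target
y_reg = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
reg = LinearRegression().fit(X, y_reg)
print("Predicted value for x=3.5:", reg.predict([[3.5]]))   # a real number
```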
1. **Supervised Learning:**
- **Definition:** In supervised learning, the algorithm is trained on a labeled dataset and learns a mapping from the input features to the known output.
- **Use of Labels:** The training data includes both the input features and their corresponding labeled output.
2. **Unsupervised Learning:**
- **Definition:** In unsupervised learning, the algorithm is given input data
without explicit output labels. The algorithm explores the patterns and
relationships within the data on its own.
**Key Differences:**
- Supervised learning uses labeled data to learn to predict a known output, whereas unsupervised learning works on unlabeled data to discover hidden patterns, groupings, or structure.
It's worth noting that there is also a third category called semi-supervised
learning, which combines elements of both supervised and unsupervised
learning by training on a dataset that contains both labeled and unlabeled
examples.
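The difference can also be seen in code. The sketch below, which assumes scikit-learn and uses tiny synthetic points, fits a classifier on labeled data and a clustering model on the same points without labels.

```python
# Sketch contrasting supervised and unsupervised learning with scikit-learn
# (assumed installed). The data points are synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# Supervised: the training data comes with labels
y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("Supervised prediction for [2, 2]:", clf.predict([[2, 2]]))

# Unsupervised: no labels are given; the algorithm finds the grouping on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Unsupervised cluster assignments:", km.labels_)
```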
1. **Classification:**
- **Objective:** Assigning data instances to predefined categories or class labels.
2. **Regression:**
- **Objective:** Predicting a continuous numerical value.
3. **Clustering:**
- **Objective:** Grouping similar data instances together without predefined labels.
5. **Anomaly Detection:**
- **Objective:** Identifying unusual instances that deviate significantly from the rest of the data.
6. **Dimensionality Reduction:**
- **Objective:** Reducing the number of input features while preserving as much relevant information as possible.
9. **Recommendation Systems:**
- **Objective:** Suggesting relevant items to users based on their behavior and preferences.
10. **Web Mining:**
- **Example:** Web content mining, web usage mining, social media analysis.
These tasks often involve the use of various algorithms and techniques to uncover
patterns, relationships, and insights hidden within large and complex datasets. The
choice of task depends on the specific goals and characteristics of the data being
analyzed.
Data preprocessing is a crucial step in the data analysis pipeline that involves cleaning
and transforming raw data into a format that is suitable for analysis. It aims to improve
the quality of the data, address issues such as missing values and outliers, and prepare
the data for further analysis or modeling. Data preprocessing includes several steps,
each designed to enhance the reliability and effectiveness of the data. Here are
common tasks involved in data preprocessing:
1. **Data Cleaning:**
- Identify and handle missing data: Fill in missing values or remove rows/columns with
missing data.
- Detect and handle outliers: Outliers may distort analysis; they can be detected and
either removed or adjusted.
- Correct inaccuracies: Check for errors or inaccuracies in the data and correct them.
2. **Data Transformation:**
- Normalize or scale numeric values, encode categorical variables, and create new features so that the data is in a suitable format for analysis.
3. **Data Reduction:**
- Reduce the volume or dimensionality of the data (for example, through feature selection, aggregation, or sampling) while preserving important information.
4. **Handling Time-Series Data:**
- For time-series data, ensure proper handling of temporal aspects, handle missing time points, and potentially aggregate or resample the data.
5. **Removing Duplicates:**
- Identify and remove duplicate records to avoid skewing the analysis or model training.
6. **Noise Reduction:**
- Noise in data can arise from various sources. Applying smoothing techniques or filters can help reduce noise.
8. **Data Integration:**
- Combining data from multiple sources to create a unified dataset for analysis. This may involve resolving discrepancies, matching variables, and ensuring consistency.
9. **Addressing Skewness:**
- Apply transformations (for example, a log transform) to reduce heavy skew in a variable's distribution.
10. **Text Preprocessing:**
- For text data, preprocessing may include tasks such as tokenization, stemming, and removing stop words to prepare the data for natural language processing (NLP) tasks.
Effective data preprocessing is essential for obtaining reliable and meaningful insights
from data. The specific steps taken will depend on the nature of the data and the goals
of the analysis or modeling process.
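A minimal pandas sketch of a few of these steps (missing values, duplicates, scaling, and encoding); the tiny DataFrame and its column names are hypothetical.

```python
# Sketch of common preprocessing steps with pandas (assumed installed).
# The DataFrame and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, None, 41, 32],
    "income": [50000, 64000, 58000, None, 64000],
    "city":   ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

# Data cleaning: fill missing numeric values with the column median
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Remove exact duplicate records
df = df.drop_duplicates()

# Data transformation: min-max scaling and one-hot encoding
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df = pd.get_dummies(df, columns=["city"])

print(df)
```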
1. **Itemset:**
- An itemset is a collection of one or more items that appear together in a transaction (for example, {Bread, Milk}).
2. **Support of an Itemset:**
- The support of an itemset is the fraction of transactions in the dataset that contain the itemset.
3. **Frequent Itemset:**
- An itemset is considered frequent if its support is equal to or exceeds a predefined minimum support threshold. Frequent itemsets are interesting because they represent patterns that occur frequently in the data.
### Example:
```
Transaction 1: {A, B, C}
Transaction 2: {A, C, D}
Transaction 3: {B, C, E}
Transaction 4: {A, B, D}
Transaction 5: {B, C, D}

The itemset {A, B} appears in 2 of the 5 transactions, so its support is 2/5 = 0.4.
With a minimum support threshold of 0.4, {A, B} is a frequent itemset.
```
### Applications:
Frequent itemset mining underlies market basket analysis, association rule mining, cross-selling, and recommendation systems.
### Apriori vs FP tree:
Apriori repeatedly scans the database and generates candidate item sets (singletons, pairs, triplets), whereas FP-Growth builds a compact FP tree in two database scans and mines frequent patterns without candidate generation.
Data warehousing plays a crucial role in the context of data mining. A data warehouse is a large,
centralized repository that integrates and stores data from various sources in a structured format. It
serves as a foundation for business intelligence and analytics, including data mining activities. Here
are key aspects of data warehousing in the context of data mining:
1. **Integration of Data:**
- Data warehouses consolidate data from different operational systems, such as transactional
databases and external sources. This integration provides a unified and consistent view of the
organization's data, making it suitable for analysis and decision-making.
2. **Structured Storage:**
- Data warehouses are optimized for complex analytical queries. They support the execution of ad-hoc queries and the generation of reports that involve aggregations, filtering, and grouping of data.
3. **Data Cleaning and Quality:**
- Data warehousing involves data cleaning and quality assurance processes to ensure that the data is accurate and reliable. Clean, high-quality data is essential for meaningful analysis and mining.
4. **Support for Data Mining:**
- Data mining algorithms operate on data warehouses to discover patterns, trends, and associations in the data. The structured and integrated nature of the data warehouse simplifies the application of data mining techniques.
5. **Business Intelligence:**
- Data warehouses are a foundation for business intelligence systems. BI tools often connect directly to the data warehouse to create reports, dashboards, and visualizations, providing insights to business users.
8. **Decision Support:**
- By consolidating integrated, historical data, the warehouse supports informed, data-driven decision-making across the organization.
9. **ETL Processes:**
- ETL (Extract, Transform, Load) processes are employed to extract data from source systems, transform it to fit the structure of the data warehouse, and load it into the warehouse. This ensures that data is consistent and suitable for analysis (a minimal code sketch of this step is shown after this list).
10. **Scalability and Performance:**
- Data warehouses are designed to handle large volumes of data and provide high performance for analytical queries. They often use techniques such as indexing, partitioning, and materialized views to optimize query performance.
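As referenced in the ETL item above, the extract-transform-load flow can be sketched with pandas and the built-in sqlite3 module. The sample rows, column names, and warehouse table name are hypothetical, so treat this as an illustration rather than a real pipeline.

```python
# Minimal ETL sketch with pandas and sqlite3.
# Sample rows, column names, and the table name are hypothetical.
import sqlite3
import pandas as pd

# Extract: in practice this reads from a source system (CSV export, API, OLTP database);
# here a small in-memory sample stands in for the extracted data.
raw = pd.DataFrame({
    "order_date":  ["2024-01-01", "2024-01-01", "2024-01-02"],
    "product_id":  ["P1", "P1", "P2"],
    "customer_id": ["C1", "C2", None],
    "amount":      [100.0, 150.0, 80.0],
})

# Transform: clean and reshape the data to fit the warehouse schema
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw = raw.dropna(subset=["customer_id"])
daily = raw.groupby(["order_date", "product_id"], as_index=False)["amount"].sum()

# Load: append the transformed rows into a warehouse table
conn = sqlite3.connect("warehouse.db")
daily.to_sql("fact_daily_sales", conn, if_exists="append", index=False)
conn.close()
```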
In summary, data warehousing creates a centralized, structured, and historical repository of data that
is conducive to data mining and analytics. It serves as the backbone for organizations seeking to
derive valuable insights from their data and make informed decisions.
What is a decision tree in data mining, and what are the types of nodes in a decision tree?
A decision tree is a popular supervised machine learning algorithm used for both classification and
regression tasks in data mining and machine learning. It models decisions and their possible
consequences in the form of a tree-like structure. The tree is constructed by recursively partitioning
the data based on the input features, and each leaf node of the tree represents the predicted
outcome.
1. **Root Node:**
- The topmost node of the tree, representing the entire dataset. It is the starting point for the
decision-making process.
2. **Internal Nodes:**
- Nodes that are not leaf nodes are called internal nodes. They represent decision points where the
dataset is split based on a certain feature and a corresponding threshold or condition.
3. **Leaf Nodes:**
- The terminal nodes of the tree, where the final prediction or decision is made. Each leaf node
corresponds to a specific class (in classification) or a predicted value (in regression).
4. **Edges (Branches):**
- The branches connecting nodes represent the decision paths. The decision tree algorithm
determines the conditions for moving from one node to another.
The nodes of a decision tree fall into two main types:
1. **Decision Nodes (Internal Nodes):**
- These nodes represent decisions or tests based on a particular feature and its value. The decision tree algorithm selects the best feature and threshold to split the data, aiming to maximize information gain or minimize impurity.
2. **Leaf Nodes (Terminal Nodes):**
- These nodes represent the final outcomes or predictions. In classification tasks, each leaf
corresponds to a specific class label, and in regression tasks, each leaf provides a predicted numerical
value.
The process of building a decision tree involves making splits at decision nodes based on the values
of features. The types of splits include:
1. **Binary Splits:**
- Each internal node makes a binary decision, leading to two branches (true or false). The binary
splits continue until the tree is fully grown.
2. **Multiway Splits:**
- Some decision tree variants allow for multiway splits, where a node can have more than two
branches. However, binary splits are more common.
Common decision tree algorithms include:
1. **ID3:**
- One of the earliest decision tree algorithms, developed by Ross Quinlan. It uses information gain as the criterion for selecting the best split at each node.
2. **C4.5:**
- An extension of ID3, developed by Ross Quinlan. C4.5 uses gain ratio as a refinement over
information gain and can handle both discrete and continuous features.
4. **Random Forest:**
- An ensemble learning method that builds multiple decision trees and combines their predictions. Each tree is trained on a random subset of the data and features, reducing overfitting.
Decision trees are interpretable, easy to visualize, and widely used for their simplicity and
effectiveness in various machine learning applications. They are especially useful for exploring
complex decision-making processes.
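As a small illustration (assuming scikit-learn and its bundled Iris dataset), the sketch below trains a shallow decision tree and prints it as text, so the root split, internal splits, and leaf predictions are visible.

```python
# Sketch of a decision tree classifier with scikit-learn (assumed installed),
# printed as text so root, internal, and leaf nodes are visible.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each "|---" test is a split at an internal node; "class:" lines are leaf nodes.
print(export_text(tree, feature_names=load_iris().feature_names))
```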
Dimensionality reduction is a technique used in data mining and machine learning to reduce the
number of input features or variables in a dataset. The primary goal of dimensionality reduction is to
simplify the dataset while retaining as much relevant information as possible. High-dimensional data,
where the number of features is large, can pose challenges such as increased computational
complexity, the curse of dimensionality, and the risk of overfitting. Dimensionality reduction methods
address these challenges by transforming the data into a lower-dimensional space.
1. **Computational Efficiency:**
- Reducing the number of features often leads to faster training and testing of machine learning
models.
2. **Overfitting Mitigation:**
- High-dimensional data may result in overfitting, where a model performs well on training data but
poorly on new, unseen data. Dimensionality reduction helps mitigate overfitting.
3. **Improved Visualization:**
- Data reduced to two or three dimensions can be plotted, making it easier to explore and understand its structure.
4. **Noise Reduction:**
- Dimensionality reduction may remove irrelevant features or noise, focusing on the most
significant aspects of the data.
### Common Techniques for Dimensionality Reduction:
1. **Principal Component Analysis (PCA):**
- PCA is a widely used linear technique that transforms the data into a new set of uncorrelated variables called principal components. These components capture the maximum variance in the data. By selecting a subset of principal components, the dimensionality of the data is reduced.
2. **t-Distributed Stochastic Neighbor Embedding (t-SNE):**
- t-SNE is a nonlinear technique commonly used for visualizing high-dimensional data in two or three dimensions. It emphasizes preserving local relationships between data points.
3. **Linear Discriminant Analysis (LDA):**
- LDA is a technique that maximizes the separation between classes in a dataset. It is often used for dimensionality reduction in the context of classification tasks.
4. **Autoencoders:**
- Autoencoders are a type of neural network that learns a compact representation of the input
data. The middle layer of the autoencoder serves as the reduced-dimensional representation.
5. **Factor Analysis:**
- Factor analysis models the observed variables as linear combinations of underlying factors and
error terms. It is used to identify latent factors that explain the observed data.
6. **Random Projection:**
- Random projection maps the data into a lower-dimensional space using a randomly generated projection matrix. It is computationally inexpensive and approximately preserves the distances between points.
### Considerations:
1. **Loss of Information:**
- Dimensionality reduction involves a trade-off between simplifying the data and retaining
important information. Some methods may result in a loss of information.
2. **Selection of Techniques:**
- The choice of dimensionality reduction technique depends on the nature of the data, the task at
hand, and the characteristics of the problem.
3. **Nonlinear Relationships:**
- Linear techniques may not capture complex nonlinear relationships in the data. Nonlinear
techniques, like t-SNE and kernel PCA, may be more appropriate for such scenarios.
4. **Interpretability:**
- While dimensionality reduction aids in simplifying data, it may make the interpretation of the
transformed features more challenging.
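A short PCA sketch (assuming scikit-learn and its bundled Iris data) shows the basic idea: four original features are projected onto two principal components that retain most of the variance.

```python
# Sketch of dimensionality reduction with PCA in scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)         # 4 original features

pca = PCA(n_components=2)                 # keep the 2 directions of maximum variance
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)         # (150, 4)
print("Reduced shape:", X_reduced.shape)  # (150, 2)
print("Variance explained:", pca.explained_variance_ratio_)
```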
Explain data smoothing methods?
Data smoothing techniques reduce noise and short-term fluctuations in data so that the underlying patterns and trends become easier to see. Common methods include:
1. **Moving Averages:**
- **Simple Moving Average (SMA):** In SMA, a fixed-size window slides through the data, and the
average value within each window is calculated. This helps smooth out short-term fluctuations.
- **Weighted Moving Average (WMA):** Similar to SMA, but it assigns different weights to
different points within the window, giving more importance to recent data points.
2. **Exponential Moving Average (EMA):**
- EMA assigns exponentially decreasing weights to past observations. It places more weight on recent data points, making it more responsive to changes compared to SMA.
3. **Lowess (Locally Weighted Scatterplot Smoothing):**
- Lowess is a non-parametric regression method that fits a smooth curve to the data by locally fitting a polynomial regression to subsets of the data. It adapts to the local characteristics of the dataset.
4. **Savitzky-Golay Smoothing:**
- Savitzky-Golay smoothing is a method that fits a polynomial to small subsets of adjacent data
points. The coefficients of the polynomial are determined through least squares regression, resulting
in a smoothed curve.
5. **Kernel Smoothing:**
- Kernel smoothing, or kernel density estimation, involves placing a kernel (a smooth, symmetric
function) at each data point and summing them to create a smooth curve. The bandwidth of the
kernel determines the level of smoothing.
6. **Hodrick-Prescott Filter:**
- The Hodrick-Prescott filter is commonly used for time-series data. It decomposes a time series
into a trend component and a cyclical component. The trend component represents the smoothed
version of the original time series.
7. **Butterworth Filters:**
- Butterworth filters are a type of linear filter that can be applied to time-series data. They are
commonly used in signal processing to remove high-frequency noise while preserving low-frequency
components.
8. **Kalman Filtering:**
- Kalman filtering is an algorithm that estimates the state of a system based on noisy
measurements. It is used for real-time data smoothing and is particularly effective when dealing with
time-series data.
9. **Moving Median:**
- Similar to moving averages, the moving median replaces each data point with the median value
within a moving window. It is less sensitive to outliers compared to the mean-based methods.
10. **LOESS (Local Regression):**
- LOESS is a non-parametric method that fits a locally weighted regression to the data. It adapts to the local behavior of the dataset and is effective for smoothing curves.
The choice of a specific data smoothing method depends on the characteristics of the data, the
nature of the noise, and the objectives of the analysis. It's important to consider the trade-offs
between smoothing and preserving relevant information in the dataset.
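Two of the methods above, the simple moving average and the exponential moving average, can be sketched with pandas on a synthetic noisy series (the series and window sizes are illustrative assumptions).

```python
# Sketch of SMA and EMA smoothing with pandas (assumed installed) on synthetic data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(100)
noisy = pd.Series(np.sin(t / 10) + rng.normal(0, 0.3, size=t.size))

sma = noisy.rolling(window=7, center=True).mean()   # simple moving average
ema = noisy.ewm(span=7, adjust=False).mean()        # exponential moving average

print(pd.DataFrame({"noisy": noisy, "sma": sma, "ema": ema}).head(10))
```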
What is Bayesian classification?
1. **Bayes' Theorem:**
- At the core of Bayesian classification is Bayes' theorem, which relates the conditional and marginal probabilities of random events. For a given event A and evidence B:
\( P(A \mid B) = \dfrac{P(B \mid A) \cdot P(A)}{P(B)} \)
- In the context of classification, A represents the class label, and B represents the features or
attributes of the data instance.
2. **Naive Bayes Classifier:**
- The Naive Bayes classifier is a specific type of Bayesian classifier that makes a simplifying assumption: it assumes that the features are conditionally independent given the class label. This assumption simplifies the computation of probabilities.
3. **Classification Process:**
- Given a new data instance with observed features, the Bayesian classifier calculates the
probability of the instance belonging to each possible class. The class with the highest probability is
then assigned as the predicted class.
4. **Prior Probability (P(A)):**
- \(P(A)\) represents the prior probability of a class, i.e., the probability of observing the class without considering any specific features. It is estimated from the training data.
5. **Likelihood (P(B | A)):**
- \(P(B | A)\) is the likelihood, i.e., the probability of observing the features given the class. In Naive Bayes, the assumption of conditional independence allows the likelihood to be computed as the product of individual feature probabilities.
6. **Evidence (P(B)):**
- \(P(B)\) is the probability of observing the features, which is also known as the evidence. It acts as
a normalization factor to ensure that the probabilities sum to 1.
1. **Training:**
- Estimate the class prior probabilities and the conditional probabilities of observing each feature
given the class from the training data.
2. **Prediction:**
- Given a new data instance with observed features, calculate the probability of the instance
belonging to each class using Bayes' theorem. Assign the class with the highest probability as the
predicted class.
### Types of Naive Bayes Classifiers:
1. **Multinomial Naive Bayes:**
- Suitable for classification tasks where the features represent counts or frequencies. Commonly used in document classification tasks.
2. **Gaussian Naive Bayes:**
- Assumes that the features are normally distributed. Suitable for continuous data.
3. **Bernoulli Naive Bayes:**
- Suitable for binary or boolean data, where features represent the presence or absence of certain attributes.
### Applications:
- **Spam Filtering:** Bayesian classification is often used for spam and ham (non-spam) email
classification based on the presence of certain keywords or features.
- **Document Classification:** Classifying documents into categories (e.g., topics) based on the
words or terms present in the document.
- **Medical Diagnosis:** Bayesian classifiers can be applied to medical data for diagnostic purposes,
considering the presence of specific symptoms or test results.
Bayesian classification is known for its simplicity and effectiveness, especially when dealing with
high-dimensional datasets. Despite the "naive" assumption of feature independence, Naive Bayes
classifiers often perform well in practice and are computationally efficient.
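In the spirit of the spam-filtering application above, here is a minimal sketch using scikit-learn's Multinomial Naive Bayes on word counts; the library choice, the tiny corpus, and the labels are hypothetical.

```python
# Sketch of Multinomial Naive Bayes for text with scikit-learn (assumed installed).
# The corpus and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "win a free prize now",         # spam
    "limited offer win money",      # spam
    "meeting agenda for monday",    # ham
    "lunch with the project team",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Features are word counts, which suits the multinomial variant
vec = CountVectorizer()
counts = vec.fit_transform(texts)

clf = MultinomialNB().fit(counts, labels)
print(clf.predict(vec.transform(["free money offer"])))   # expected: ['spam']
```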
Explain the KDD process in data mining?
KDD, or Knowledge Discovery in Databases, is a process of extracting valuable information, patterns,
and knowledge from large volumes of data. It involves various steps and techniques to transform raw
data into actionable insights. The KDD process is often represented as a sequence of steps, and it is
closely associated with the broader field of data mining. Here are the key steps in the KDD process:
1. **Understanding the Domain and Goals:**
- **Objective:** Define the problem, understand the goals of the data mining project, and gain domain knowledge. Clarify the objectives and requirements of the analysis.
2. **Data Selection:**
- **Objective:** Identify and select the relevant data for analysis. This step involves choosing the
datasets that contain the necessary information to address the problem at hand.
3. **Data Preprocessing:**
- **Objective:** Prepare the data for analysis by cleaning, transforming, and integrating it. This
includes handling missing values, removing noise, and converting data into a suitable format.
4. **Data Reduction:**
- **Objective:** Reduce the dimensionality of the data while preserving important information.
Techniques such as dimensionality reduction, aggregation, and sampling may be applied to manage
large datasets more effectively.
5. **Data Transformation:**
- **Objective:** Transform the data into a suitable format for analysis. This may involve
normalization, encoding categorical variables, and creating new features through feature
engineering.
6. **Data Mining:**
- **Objective:** Apply data mining algorithms to discover patterns, relationships, or trends in the
data. Common data mining techniques include classification, regression, clustering, association rule
mining, and anomaly detection.
7. **Pattern Evaluation:**
- **Objective:** Evaluate the patterns or models generated by the data mining algorithms. Assess
their quality, relevance, and usefulness in achieving the project goals.
8. **Knowledge Representation:**
- **Objective:** Represent the discovered patterns or models in a form that is understandable and
interpretable. This may involve creating visualizations, rules, or other representations that can be
communicated to stakeholders.
9. **Interpretation and Evaluation:**
- **Objective:** Interpret the results in the context of the domain and evaluate the effectiveness of the data mining process. Assess whether the discovered patterns meet the project objectives.
10. **Deployment:**
- **Objective:** Integrate the discovered knowledge into the decision-making process of the
organization. Deploy models, reports, or other outputs for practical use.
11. **Feedback and Iteration:**
- **Objective:** Reflect on the results, gather feedback, and improve the models or processes. Iteratively refine the knowledge discovery process based on insights and feedback.
The KDD process is often depicted as a cyclical or iterative process, emphasizing the importance of
feedback and continuous improvement. It is not a one-time activity but a dynamic and evolving
process that adapts to changing data, goals, and business requirements. The ultimate goal of KDD is
to transform raw data into actionable knowledge that can drive informed decision-making and
provide value to organizations.
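A compact sketch mapping a few KDD steps (selection, preprocessing, mining, evaluation) onto code, assuming pandas and scikit-learn; the customer table and its columns are hypothetical.

```python
# Compact sketch of a few KDD steps with pandas and scikit-learn (assumed installed).
# The dataset and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data selection: a small hypothetical customer table
data = pd.DataFrame({
    "age":     [22, 35, 47, 52, 28, 61, 39, 45],
    "income":  [20, 45, 60, 80, 30, 90, 50, 65],
    "churned": [1, 0, 0, 0, 1, 0, 1, 0],
})

# Data preprocessing / transformation: split and scale the features
X = data[["age", "income"]]
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)

# Data mining: fit a classification model
model = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

# Pattern evaluation: measure quality on held-out data
pred = model.predict(scaler.transform(X_test))
print("Accuracy:", accuracy_score(y_test, pred))
```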
What are the different types of data used in data mining?
1. **Structured Data:**
- **Description:** Structured data is organized in a tabular format with rows and columns. Each
column corresponds to a specific attribute, and each row represents a record or instance. Relational
databases are a common source of structured data.
2. **Unstructured Data:**
- **Description:** Unstructured data lacks a predefined data model and is not organized in a
tabular format. It includes text, images, audio, and video data.
3. **Semi-Structured Data:**
- **Description:** Semi-structured data has a flexible schema and may include some level of
organization, such as tags or attributes. It lies between structured and unstructured data.
4. **Time-Series Data:**
- **Description:** Time-series data consists of observations recorded sequentially over time, such as stock prices, sales figures, or sensor readings.
5. **Spatial Data:**
- **Description:** Spatial data has a geographic or positional component, such as maps, GPS coordinates, or satellite imagery.
6. **Text Data:**
- **Description:** Text data consists of unstructured textual information. It is often analyzed using
natural language processing (NLP) techniques for sentiment analysis, topic modeling, and text
classification.
- **Example:** Documents, articles, emails, social media posts.
7. **Numeric Data:**
- **Description:** Numeric data consists of numerical values that can be subject to mathematical
operations. It includes continuous and discrete numerical variables.
8. **Categorical Data:**
- **Description:** Categorical data represents categories or labels and does not have a numerical
scale. It is often used in classification tasks.
9. **Binary Data:**
- **Description:** Binary data consists of values that are binary or Boolean (e.g., true/false, 0/1). It
is common in classification problems with two classes.
10. **Graph Data:**
- **Description:** Graph data represents relationships between entities in the form of nodes and edges. Graph mining is used to analyze network structures.
11. **Multimedia Data:**
- **Description:** Multimedia data includes various forms of media, such as images, audio, and video. It often involves specialized techniques for feature extraction and analysis.
12. **Web Data:**
- **Description:** Web data includes information collected from the internet, such as web pages, web logs, and user interactions. It is often used for web mining and user behavior analysis.
Explain support and confidence in association rule mining?
### Support:
1. **Definition:**
- Support measures the frequency or occurrence of a particular itemset in the dataset. It indicates
how frequently the items in the rule co-occur together.
2. **Formula:**
- The support (\( \text{supp}(X) \)) of an itemset \(X\) is calculated as the ratio of the number of transactions containing \(X\) to the total number of transactions in the dataset:
\( \text{supp}(X) = \dfrac{\text{number of transactions containing } X}{\text{total number of transactions}} \)
3. **Interpretation:**
- A high support value indicates that the itemset is frequent in the dataset, suggesting that the
items in the set often appear together.
### Confidence:
1. **Definition:**
- Confidence measures the strength of the association between two itemsets in a rule. It indicates
the likelihood that the presence of one itemset implies the presence of another.
2. **Formula:**
- The confidence (\( \text{conf}(X \Rightarrow Y) \)) of a rule \(X \Rightarrow Y\) is calculated as the ratio of the number of transactions containing both \(X\) and \(Y\) to the number of transactions containing \(X\):
\( \text{conf}(X \Rightarrow Y) = \dfrac{\text{supp}(X \cup Y)}{\text{supp}(X)} \)
3. **Interpretation:**
- A high confidence value suggests that the occurrence of itemset \(X\) is strongly associated with
the occurrence of itemset \(Y\).
### Relationship:
- While support and confidence are distinct measures, they are often used together to filter and
select association rules. Analysts typically set minimum thresholds for both support and confidence
to identify meaningful and interesting rules.
- High support ensures that the discovered rules are applicable to a sufficient number of
transactions, and high confidence ensures that the rules are reliable and not merely coincidental.
### Example:
```
Transaction 1: {A, B, C}
Transaction 2: {A, C, D}
Transaction 3: {B, C, E}
Transaction 4: {A, B, D}
Transaction 5: {B, C, D}
```
1. **Support Example:**
- Calculate the support of the itemset {A, B}: {A, B} appears in Transaction 1 and Transaction 4, so supp({A, B}) = 2/5 = 0.4.
2. **Confidence Example:**
- Calculate the confidence of the rule {A} \(\Rightarrow\) {B}: item A appears in Transactions 1, 2, and 4, so conf({A} \(\Rightarrow\) {B}) = supp({A, B}) / supp({A}) = (2/5) / (3/5) = 2/3 ≈ 0.67.
In this example, the support of {A, B} is 0.4, indicating that this itemset is present in 40% of
transactions. The confidence of the rule {A} \(\Rightarrow\) {B} is approximately 0.67, indicating that
when item A is present, there is a 67% likelihood that item B will also be present.
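The support and confidence figures above can be checked with a few lines of Python; the code also computes lift, using the definition given in the market basket analysis section (lift is not stated in the example, so that value is additional).

```python
# Reproduces the support and confidence figures from the example above,
# and additionally computes lift as defined in the market basket analysis section.
transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "D"},
    {"B", "C", "D"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

supp_A = support({"A"})          # 3/5 = 0.6
supp_B = support({"B"})          # 4/5 = 0.8
supp_AB = support({"A", "B"})    # 2/5 = 0.4

confidence = supp_AB / supp_A    # 0.4 / 0.6 ≈ 0.67
lift = confidence / supp_B       # ≈ 0.83

print(f"support({{A, B}}) = {supp_AB:.2f}")
print(f"confidence(A => B) = {confidence:.2f}")
print(f"lift(A => B) = {lift:.2f}")
```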