unit 2
The data mining process may vary depending on your specific project and the
techniques employed, but it typically involves the 10 key steps described below.
1. Define Problem. Clearly define the objectives and goals of your data mining
project. Determine what you want to achieve and how mining data can help in
solving the problem or answering specific questions.
3. Prep Data. Clean and preprocess your collected data to ensure its quality and
suitability for analysis. This step involves tasks such as removing duplicate or
irrelevant records, handling missing values, correcting inconsistencies, and
transforming the data into a suitable format.
7. Train Model. Train your selected model using the prepared dataset. This
involves feeding the model with the input data and adjusting its parameters or
weights to learn from the patterns and relationships present in the data.
10. Monitor & Maintain Model. Continuously monitor your model's performance
and ensure its accuracy and relevance over time. Update the model as new data
becomes available, and refine the data mining process based on feedback and
changing requirements.
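As a rough illustration of the data preparation step above, the sketch below removes duplicate records and fills missing values. It assumes records are tuples of numbers with `None` marking a missing value, and it uses the column mean as the fill value; the function name `prepare` and the sample data are hypothetical, and real projects may need different cleaning rules.

```python
def prepare(records):
    """Drop exact duplicate rows, then fill missing values (None) with the column mean."""
    # Remove duplicates while preserving the original order.
    seen, unique = set(), []
    for row in records:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))

    # Fill missing values column by column.
    # Assumption: every column has at least one known value.
    for col in range(len(unique[0])):
        present = [row[col] for row in unique if row[col] is not None]
        mean = sum(present) / len(present)
        for row in unique:
            if row[col] is None:
                row[col] = mean
    return unique


raw = [(1.0, 10.0), (1.0, 10.0), (3.0, None), (5.0, 30.0)]
clean = prepare(raw)
print(clean)  # [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
```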
1. Classification
Classification is a technique used to categorize data into predefined classes or
categories based on the features or attributes of the data instances. It involves
training a model on labeled data and using it to predict the class labels of new,
unseen data instances.
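A minimal sketch of the idea, using a 1-nearest-neighbour rule as one possible classifier (the text does not prescribe a particular algorithm). The labeled points and the function name `predict_1nn` are made up for illustration.

```python
import math


def predict_1nn(train, query):
    """Return the label of the training point closest to `query` (Euclidean distance)."""
    nearest = min(train, key=lambda item: math.dist(item[0], query))
    return nearest[1]


# Labeled training data: (feature vector, class label).
train = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]

print(predict_1nn(train, (2.0, 1.0)))  # small
print(predict_1nn(train, (7.0, 9.0)))  # large
```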
2. Regression
Regression is employed to predict numeric or continuous values based on the
relationship between input variables and a target variable. It aims to find a
mathematical function or model that best fits the data to make accurate
predictions.
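For a single input variable, the best-fitting function can be a straight line found by ordinary least squares. The sketch below assumes that simple case; `fit_line` and the sample values are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predicted target for a new input x = 5
print(slope, intercept, prediction)
```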
3. Clustering
Clustering is a technique used to group similar data instances together based
on their intrinsic characteristics or similarities. It aims to discover natural patterns
or structures in the data without any predefined classes or labels.
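One common clustering algorithm is k-means, sketched below as an example (the text does not name a specific algorithm). The starting centroids are passed in by the caller so the run is deterministic; the points and function name are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic k-means with caller-supplied starting centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

No labels are supplied; the two groups emerge only from the distances between points.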
4. Association Rule
Association rule mining focuses on discovering interesting relationships or
patterns among a set of items in transactional or market basket data. It helps
identify frequently co-occurring items and generates rules such as "if X, then Y"
to reveal associations between items. This simple Venn diagram shows the
associations between itemsets X and Y of a dataset.
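Rules of the form "if X, then Y" are usually scored with support and confidence, two standard measures that the text above does not name. The sketch below computes both on a made-up set of market basket transactions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # about 0.667
```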
5. Anomaly Detection
Anomaly detection, sometimes called outlier analysis, aims to identify rare or
unusual data instances that deviate significantly from the expected patterns. It is
useful in detecting fraudulent transactions, network intrusions, manufacturing
defects, or any other abnormal behavior.
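A simple way to flag such deviations is a z-score test, shown below as one possible approach. The threshold of 2 standard deviations and the transaction amounts are assumptions chosen for the example, and the values are assumed not to be all identical.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]


amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = zscore_outliers(amounts)
print(outliers)  # [500]
```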
8. Decision Trees
Decision trees are graphical models that use a tree-like structure to represent
decisions and their possible consequences. They recursively split the data based
on different attribute values to form a hierarchical decision-making process.
9. Ensemble Methods
Ensemble methods combine multiple models to improve prediction accuracy
and generalization. Techniques like Random Forests and Gradient Boosting
utilize a combination of weak learners to create a stronger, more accurate
model.
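The combination idea can be sketched with plain majority voting, which is simpler than Random Forests or Gradient Boosting but rests on the same principle. The three rule-based weak learners and the feature names below are hypothetical.

```python
from collections import Counter


def majority_vote(models, x):
    """Return the label predicted by most of the models."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical weak learners, each looking at a single feature.
models = [
    lambda x: "spam" if x["links"] > 3 else "ham",
    lambda x: "spam" if x["caps_ratio"] > 0.5 else "ham",
    lambda x: "spam" if x["length"] < 20 else "ham",
]

message = {"links": 5, "caps_ratio": 0.2, "length": 10}
print(majority_vote(models, message))  # spam (two votes against one)
```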
Measuring Similarity and Dissimilarity
Measuring similarity and dissimilarity in data mining is an important task that helps identify
patterns and relationships in large datasets. To quantify the degree of similarity or
dissimilarity between two data points or objects, mathematical functions called similarity and
dissimilarity measures are used. Similarity measures produce a score that indicates the degree
of similarity between two data points, while dissimilarity measures produce a score that
indicates the degree of dissimilarity between two data points. These measures are crucial for
many data mining tasks, such as identifying duplicate records, clustering, classification, and
anomaly detection.
Similarity Measure
Dissimilarity Measure
• For nominal variables, these measures are binary, indicating whether two values are equal or
not.
• For ordinal variables, it is the difference between two values that are normalized by the max
distance. For the other variables, it is just a distance function.
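The two rules above can be written down directly. This sketch assumes the ordinal levels are given in order from lowest to highest; the function names and the example levels are illustrative.

```python
def nominal_dissimilarity(a, b):
    """0 if the two nominal values are equal, 1 otherwise."""
    return 0 if a == b else 1


def ordinal_dissimilarity(a, b, levels):
    """Rank difference normalized by the maximum possible distance."""
    rank = {level: i for i, level in enumerate(levels)}
    return abs(rank[a] - rank[b]) / (len(levels) - 1)


levels = ["low", "medium", "high"]
print(nominal_dissimilarity("red", "blue"))            # 1
print(ordinal_dissimilarity("low", "medium", levels))  # 0.5
print(ordinal_dissimilarity("low", "high", levels))    # 1.0
```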
Similarity Measures
• Similarity measures are mathematical functions used to determine the degree of similarity
between two data points or objects. These measures produce a score that indicates how
similar or alike the two data points are.
• It takes two data points as input and produces a similarity score as output, typically ranging
from 0 (completely dissimilar) to 1 (identical or perfectly similar).
• Similarity measures also have some well-known properties -
o sim(A, B) = 1 (or maximum similarity) only if A = B
o Typical range: 0 ≤ sim ≤ 1
Cosine Similarity
Cosine similarity is a widely used similarity measure in data mining and information
retrieval. It measures the cosine of the angle between two non-zero vectors in a multi-
dimensional space. In the context of data mining, these vectors represent the feature vectors
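of two data points. In general the cosine similarity score ranges from -1 to 1; for
feature vectors with non-negative components (such as term counts) it ranges from 0 to 1,
with 0 indicating no similarity and 1 indicating perfect similarity.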
The cosine similarity between two vectors is calculated as the dot product of the vectors
divided by the product of their magnitudes. This calculation can be represented
mathematically as follows -
\[
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
             = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]
where A and B are the feature vectors of two data points, "⋅" denotes the dot product, and
"‖ ‖" denotes the magnitude of the vector.
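The calculation can be sketched in a few lines. The function below assumes two equal-length, non-zero numeric vectors; the sample vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors: 0.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction: approximately 1
```

Because only the angle matters, scaling a vector (here by a factor of 2) does not change the score.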
Jaccard Similarity
The Jaccard similarity is another widely used similarity measure in data mining, particularly
in text analysis and clustering. It measures the similarity between two sets of data by
calculating the ratio of the intersection of the sets to their union. The Jaccard similarity score
ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfect similarity.
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
\]
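A short sketch of the Jaccard calculation on sets of words. Returning 1.0 when both sets are empty is a convention assumed here, not something the definition settles.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Assumed convention: two empty sets count as identical.
    return len(a & b) / len(union) if union else 1.0


doc1 = {"data", "mining", "rocks"}
doc2 = {"data", "mining", "tools"}
print(jaccard(doc1, doc2))  # 2 shared words out of 4 distinct words: 0.5
```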
Pearson Correlation Coefficient
The Pearson correlation coefficient is a widely used similarity measure in data mining and
statistical analysis. It measures the linear correlation between two continuous variables, X
and Y. The Pearson correlation coefficient ranges from -1 to +1, with -1 indicating a perfect
negative correlation, 0 indicating no correlation, and +1 indicating a perfect positive
correlation. The Pearson correlation coefficient is commonly used in data mining applications
such as feature selection and regression analysis. It can help identify variables that are highly
correlated with each other, which can be useful for reducing the dimensionality of a dataset.
In regression analysis, it indicates how well the value of one variable can be predicted
from the value of another variable by a linear model.
The Pearson correlation coefficient between two variables, X and Y, is calculated as follows -
\[
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}
           = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
                  {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]
where cov(X, Y) is the covariance between variables X and Y, and σ_X and σ_Y are the
standard deviations of variables X and Y, respectively.
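The formula translates into code almost term by term. The sketch assumes two equal-length samples that are not constant (otherwise the denominator is zero); the sample values are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return numerator / (spread_x * spread_y)


print(pearson([1, 2, 3], [2, 4, 6]))  # perfect positive correlation, approximately 1
print(pearson([1, 2, 3], [6, 4, 2]))  # perfect negative correlation, approximately -1
```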
Sørensen-Dice Coefficient
The Sørensen-Dice coefficient, also known as the Dice similarity index or Dice coefficient, is
a similarity measure used to compare the similarity between two sets of data, typically used
in the context of text or image analysis. The coefficient ranges from 0 to 1, with 0 indicating
no similarity and 1 indicating perfect similarity. The Sørensen-Dice coefficient is commonly
used in text analysis to compare the similarity between two documents based on the set of
words or terms they contain. It is also used in image analysis to compare the similarity
between two images based on the set of pixels they contain.
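The text gives no formula for this coefficient; the standard definition is twice the size of the intersection divided by the sum of the two set sizes, 2|A ∩ B| / (|A| + |B|). A minimal sketch under that definition, with the empty-set case handled by an assumed convention:

```python
def dice(a, b):
    """Sørensen-Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    # Assumed convention: two empty sets count as identical.
    return 2 * len(a & b) / total if total else 1.0


doc1 = {"data", "mining", "rocks"}
doc2 = {"data", "mining", "tools"}
print(dice(doc1, doc2))  # 2 * 2 / (3 + 3), about 0.667
```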
Choosing an appropriate similarity measure depends on the nature of the data and the specific
task at hand. Here are some factors to consider when choosing a similarity measure -
• Different similarity measures are suitable for different data types, such as continuous or
categorical data, text or image data, etc. For example, the Pearson correlation coefficient
is only suitable for continuous variables.
• Some similarity measures are sensitive to the scale of measurement of the data.
• The choice of similarity measure also depends on the specific task at hand. For example,
cosine similarity is often used in information retrieval and text mining, while Jaccard
similarity is commonly used in clustering and recommendation systems.
• Some similarity measures are more robust to noise and outliers in the data than others. For
example, the Sørensen-Dice coefficient is less sensitive to noise.
The structure of a decision tree consists of a root node, branches, and leaf nodes.
Each internal node represents a test on an attribute, each branch represents an
outcome of that test, and each leaf node represents a class label.
Working of a decision tree
1. A decision tree works under the supervised learning approach for both
discrete and continuous variables. The dataset is split into subsets on the basis
of its most significant attribute. The tree-building algorithm identifies this
attribute and performs the split.
2. The structure of the decision tree starts at the root node, which is the most
significant predictor. Splitting then continues from the decision nodes, which
are the sub-nodes of the tree. The nodes that do not split further are called
leaf or terminal nodes.
4. The tree keeps splitting until a stopping criterion is reached.
5. A fully grown decision tree often fits noise and outliers in the training data.
To remove the branches that reflect this noise, a method called “Tree pruning” is
applied, which usually increases the accuracy of the model on unseen data.
6. The accuracy of a model is checked on a test set consisting of test tuples and
their class labels. Accuracy is the percentage of test set tuples that the model
classifies correctly.
Figure 1: An example of an unpruned and a pruned tree
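The steps above leave open how the "most significant attribute" is found. One widely used criterion is information gain based on entropy; the sketch below assumes that criterion and a tiny made-up dataset, and it selects only the first split rather than building a whole tree.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting the data on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["play", "play", "stay", "stay"]

# The attribute with the highest gain becomes the root test.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)  # outlook
```

Here "outlook" separates the two classes perfectly (gain of 1 bit), while "windy" gives no information (gain of 0), so "outlook" is chosen as the root.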
• Classification
• Regression
2. Regression
Regression trees are used for the regression analysis of data, i.e. the prediction
of numerical attributes, also called continuous values. Instead of predicting
class labels, the regression model predicts continuous values.
List of Applications
1. Healthcare
3. Educational Sectors
Decision trees can help decide whether to shortlist a student based on merit
score, attendance, etc.
List of Advantages
• The interpretable results of a decision tree model can be presented to senior
management and stakeholders.
• Building a decision tree model does not require preprocessing such as
normalization or scaling of the data.
• A decision tree can handle both numerical and categorical data, which makes
it more flexible than algorithms restricted to one data type.
• Many decision tree algorithms can cope with missing values in the data,
which makes the method flexible.
Neural networks are powerful tools in data mining because of their ability to
learn complex patterns from large datasets.
Training of ANN:
We can train the neural network by feeding it training patterns and
letting it adjust its weights according to some learning rule. We can
categorize the learning situations as follows.
1. Supervised Learning: The network is trained by providing it with
input patterns and the matching output patterns. These input-output
pairs can be provided by an external teacher or by the system that
contains the neural network.
2. Unsupervised Learning: An output unit is trained to respond to
clusters of patterns within the input. Unsupervised learning uses a
machine learning algorithm to analyze and cluster unlabeled datasets.
3. Reinforcement Learning: This type of learning may be considered as
an intermediate form of the above two types of learning, which trains
the model to return an optimum solution for a problem by taking a
sequence of decisions by itself.
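As a small illustration of supervised weight adjustment, the sketch below trains a single artificial neuron with the classic perceptron learning rule. The choice of rule, the AND-gate training pairs, and the function names are assumptions made for the example; practical networks use many units and more elaborate rules.

```python
def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=10, learning_rate=1.0):
    """Perceptron rule: shift each weight by learning_rate * error * input."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Input-output pairs for the logical AND function (the "teacher" signal).
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])  # [0, 0, 0, 1]
```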
Training Algorithm:
A genetic algorithm (GA) works in six phases:
Initial population
Being the first phase of the algorithm, it includes a set of individuals where
each individual is a solution to the concerned problem. We characterize
each individual by the set of parameters that we refer to as genes.
Calculate Fitness
Selection
The selection process picks the individuals with the highest fitness scores
and allows them to pass on their genes to the next generation.
Crossover
Mutation
The mutation phase inserts random genes into the generated offspring to
maintain the population’s diversity. It is done by flipping random genes in
new offspring.
Termination
The iteration of the algorithm stops when it produces offspring that are not
significantly different from the previous generation. The population is then
said to have converged, and the algorithm has produced a set of solutions to
the problem.
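The phases can be put together in a compact sketch. It assumes a toy "OneMax" problem (maximize the number of 1-genes in a bit string), single-point crossover, and an elitist step that copies the best individual unchanged. For simplicity it terminates after a fixed number of generations or once the optimum is found, instead of testing for convergence. All names and parameter values are illustrative.

```python
import random


def fitness(individual):
    """Toy objective: the number of 1-genes in the bit string."""
    return sum(individual)


def run_ga(n_genes=12, pop_size=20, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Initial population: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        # Calculate fitness and rank the individuals.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Termination (simplified): stop once the optimum is reached.
        if history[-1] == n_genes:
            break
        # Selection: the fitter half of the population become parents.
        parents = population[: pop_size // 2]
        next_gen = [population[0][:]]  # elitism: keep the best individual
        while len(next_gen) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: combine the parents at a random cut point.
            point = rng.randint(1, n_genes - 1)
            child = mother[:point] + father[point:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness), history


best, history = run_ga()
print(fitness(best), history)
```

Because the best individual is always carried over, the best fitness per generation never decreases.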
Disadvantages
Applications of GA
GA is used in many applications; let's discuss a few of them.
• Economics: In the field of economics GA is used to implement certain
models that conduct competitive analysis, decision making, and effective
scheduling.
• Aircraft Design: GA is used to provide the parameters that must be
modified and upgraded in order to get a better design.
• DNA Analysis: GA is used to establish DNA structure using spectrometric
information.
• Transport: GA is used to develop a transport plan that is time and cost-
efficient.