
Data Mining Process

The data mining process may vary depending on your specific project and the
techniques employed, but it typically involves the 10 key steps described below.

1. Define Problem. Clearly define the objectives and goals of your data mining
project. Determine what you want to achieve and how mining data can help in
solving the problem or answering specific questions.

2. Collect Data. Gather relevant data from various sources, including databases, files, APIs, or online platforms. Ensure that the collected data is accurate, complete, and representative of the problem domain. Modern analytics and BI tools often have data integration capabilities. Otherwise, you’ll need someone with expertise in data management to clean, prepare, and integrate the data.

3. Prep Data. Clean and preprocess your collected data to ensure its quality and
suitability for analysis. This step involves tasks such as removing duplicate or
irrelevant records, handling missing values, correcting inconsistencies, and
transforming the data into a suitable format.

4. Explore Data. Explore and understand your data through descriptive statistics, visualization techniques, and exploratory data analysis. This step helps in identifying patterns, trends, and outliers in the dataset and gaining insights into the underlying data characteristics.

5. Select Predictors. This step, also called feature selection/engineering, involves identifying the relevant features (variables) in the dataset that are most informative for the task. This may involve eliminating irrelevant or redundant features and creating new features that better represent the problem domain.

6. Select Model. Choose an appropriate model or algorithm based on the nature of the problem, the available data, and the desired outcome. Common techniques include decision trees, regression, clustering, classification, association rule mining, and neural networks.

7. Train Model. Train your selected model using the prepared dataset. This
involves feeding the model with the input data and adjusting its parameters or
weights to learn from the patterns and relationships present in the data.

8. Evaluate Model. Assess the performance and effectiveness of your trained model using a validation set or cross-validation. This step helps in determining the model's accuracy, predictive power, or clustering quality and whether it meets the desired objectives. You may need to adjust the hyperparameters to prevent overfitting and improve the performance of your model. (A minimal code sketch of steps 3–8 appears after this list.)

9. Deploy Model. Deploy your trained model into a real-world environment where it can be used to make predictions, classify new data instances, or generate insights. This may involve integrating the model into existing systems or creating a user-friendly interface for interacting with the model.

10. Monitor & Maintain Model. Continuously monitor your model's performance
and ensure its accuracy and relevance over time. Update the model as new data
becomes available, and refine the data mining process based on feedback and
changing requirements.
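
To make the workflow concrete, here is a minimal sketch of steps 3–8 in Python using pandas and scikit-learn. The file name "customers.csv", its column names, and the choice of a logistic regression model are illustrative assumptions, not prescriptions from the text.

```python
# Minimal sketch of steps 3-8 (prep, select predictors, train, evaluate).
# "customers.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 3: prep data - drop duplicate rows and rows with missing values
df = pd.read_csv("customers.csv").drop_duplicates().dropna()

# Step 5: select predictors (features) and the target column
X = df[["age", "income", "visits_per_month"]]
y = df["churned"]

# Steps 6-7: choose and train a model on a training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 8: evaluate the model on held-out data
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```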

Data Mining Techniques

Your choice of technique depends on the nature of your problem, the available data, and the desired outcomes. The top 10 data mining techniques are:

1. Classification
Classification is a technique used to categorize data into predefined classes or
categories based on the features or attributes of the data instances. It involves
training a model on labeled data and using it to predict the class labels of new,
unseen data instances.
2. Regression
Regression is employed to predict numeric or continuous values based on the
relationship between input variables and a target variable. It aims to find a
mathematical function or model that best fits the data to make accurate
predictions.

3. Clustering
Clustering is a technique used to group similar data instances together based
on their intrinsic characteristics or similarities. It aims to discover natural patterns
or structures in the data without any predefined classes or labels.
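
As a brief illustration, the sketch below clusters a small synthetic 2-D dataset with k-means using scikit-learn; the data and the choice of three clusters are arbitrary assumptions made for the example.

```python
# Minimal k-means clustering sketch on synthetic data (3 assumed clusters).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # synthetic 2-D points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)   # coordinates of the learned cluster centres
```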

4. Association Rule
Association rule mining focuses on discovering interesting relationships or patterns among a set of items in transactional or market basket data. It helps identify frequently co-occurring items and generates rules such as "if X, then Y" to reveal associations between itemsets X and Y of a dataset.
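
To illustrate, the sketch below computes the support and confidence of a single hypothetical rule, "if bread, then butter", from a toy list of transactions; the items and transactions are made up purely for illustration.

```python
# Toy support/confidence calculation for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

antecedent, consequent = {"bread"}, {"butter"}
n = len(transactions)
support_x = sum(antecedent <= t for t in transactions) / n                   # P(X)
support_xy = sum((antecedent | consequent) <= t for t in transactions) / n   # P(X and Y)

print("support(X -> Y)    =", support_xy)               # 0.5
print("confidence(X -> Y) =", support_xy / support_x)   # 0.666...
```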
5. Anomaly Detection
Anomaly detection, sometimes called outlier analysis, aims to identify rare or
unusual data instances that deviate significantly from the expected patterns. It is
useful in detecting fraudulent transactions, network intrusions, manufacturing
defects, or any other abnormal behavior.

6. Time Series Analysis

Time series analysis focuses on analyzing and predicting data points collected over time. It involves techniques such as forecasting, trend analysis, seasonality detection, and anomaly detection in time-dependent datasets.
7. Neural Networks
Neural networks are a type of machine learning or AI model inspired by the
human brain's structure and function. They are composed of interconnected
nodes (neurons) and layers that can learn from data to recognize patterns,
perform classification, regression, or other tasks.

8. Decision Trees
Decision trees are graphical models that use a tree-like structure to represent
decisions and their possible consequences. They recursively split the data based
on different attribute values to form a hierarchical decision-making process.
9. Ensemble Methods
Ensemble methods combine multiple models to improve prediction accuracy
and generalization. Techniques like Random Forests and Gradient Boosting
utilize a combination of weak learners to create a stronger, more accurate
model.

10. Genetic Algorithm

Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, guided by historical data, that directs the search toward regions of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization and search problems. Genetic algorithms simulate the process of natural selection, in which individuals that can adapt to changes in their environment survive, reproduce, and pass their traits on to the next generation.
Statistical Perspective on Data Mining
From a statistical perspective, data mining involves the process of discovering
patterns and relationships within large datasets. This process typically includes the
following key aspects:
• Exploratory Data Analysis: Data mining often begins with exploratory data
analysis, where statistical techniques are used to summarize and visualize the main
characteristics of the dataset. This can involve measures of central tendency,
dispersion, and graphical representations such as histograms and scatter plots.
• Hypothesis Testing: Statistical hypothesis testing is used to determine whether
observed patterns or relationships in the data are statistically significant or simply
due to chance. This helps in identifying meaningful patterns that can be generalized
to the larger population.
• Predictive Modeling: Statistical techniques such as regression analysis, time series
analysis, and machine learning algorithms are used to build predictive models that
can forecast future trends or behavior based on historical data patterns.
• Model Evaluation: Statistical measures such as accuracy, precision, recall, and F1
score are used to evaluate the performance of predictive models and assess their
ability to generalize to new data.
• Inferential Statistics: Data mining often involves making inferences about the larger
population based on the patterns observed in the sample data. Statistical inference
techniques help in drawing conclusions and making predictions about the population
from which the data was sampled.
In summary, the statistical perspective on data mining emphasizes the use of
rigorous statistical methods to explore, analyze, and extract valuable insights from
large datasets, with a focus on understanding the underlying patterns and
relationships within the data.

Measuring Similarity and Dissimilarity
Measuring similarity and dissimilarity in data mining is an important task that helps identify
patterns and relationships in large datasets. To quantify the degree of similarity or
dissimilarity between two data points or objects, mathematical functions called similarity and
dissimilarity measures are used. Similarity measures produce a score that indicates the degree
of similarity between two data points, while dissimilarity measures produce a score that
indicates the degree of dissimilarity between two data points. These measures are crucial for
many data mining tasks, such as identifying duplicate records, clustering, classification, and
anomaly detection.

Similarity Measure

• A similarity measure is a mathematical function that quantifies the degree of similarity between two objects or data points. It is a numerical score measuring how alike two data points are.
• It takes two data points as input and produces a similarity score as output, typically ranging from 0 (completely dissimilar) to 1 (identical or perfectly similar).
• A similarity measure can be based on various mathematical techniques such as cosine similarity, Jaccard similarity, and the Pearson correlation coefficient.
• Similarity measures are generally used to identify duplicate records, equivalent instances, or clusters.

Dissimilarity Measure

• A dissimilarity measure is a mathematical function that quantifies the degree of dissimilarity between two objects or data points. It is a numerical score measuring how different two data points are.
• It takes two data points as input and produces a dissimilarity score as output, ranging from 0 (identical or perfectly similar) to 1 (completely dissimilar). A few dissimilarity measures also have infinity as their upper limit.
• A dissimilarity measure can be obtained by using different techniques such as Euclidean distance, Manhattan distance, and Hamming distance.

Similarity and Dissimilarity Measures by Data Type

• For nominal variables, these measures are binary, indicating whether two values are equal or not.
• For ordinal variables, the measure is the difference between two values normalized by the maximum distance. For the other variable types, it is simply a distance function.

Similarity Measures
• Similarity measures are mathematical functions used to determine the degree of similarity between two data points or objects. These measures produce a score that indicates how similar or alike the two data points are.
• They take two data points as input and produce a similarity score as output, typically ranging from 0 (completely dissimilar) to 1 (identical or perfectly similar).
• Similarity measures also have some well-known properties:
o sim(A, B) = 1 (or maximum similarity) only if A = B
o Typical range: 0 ≤ sim ≤ 1

Cosine Similarity

Cosine similarity is a widely used similarity measure in data mining and information retrieval. It measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In the context of data mining, these vectors represent the feature vectors of two data points. For non-negative feature vectors, the cosine similarity score ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfect similarity.

The cosine similarity between two vectors is calculated as the dot product of the vectors
divided by the product of their magnitudes. This calculation can be represented
mathematically as follows -

cos(θ) = (A · B) / (‖A‖ ‖B‖) = (Σᵢ AᵢBᵢ) / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) )

where A and B are the feature vectors of two data points, "·" denotes the dot product, "‖·‖" denotes the magnitude (norm) of a vector, and the sums run over i = 1, …, n.
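
A minimal NumPy sketch of this formula (the example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # 1.0 (parallel vectors)
print(cosine_similarity([1, 0], [0, 1]))         # 0.0 (orthogonal vectors)
```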
Jaccard Similarity

The Jaccard similarity is another widely used similarity measure in data mining, particularly
in text analysis and clustering. It measures the similarity between two sets of data by
calculating the ratio of the intersection of the sets to their union. The Jaccard similarity score
ranges from 0 to 1, with 0 indicating no similarity and 1 indicating perfect similarity.

The Jaccard similarity between two sets A and B is calculated as follows -

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

where |A ∩ B| is the size of the intersection of sets A and B, and |A ∪ B| is the size of their union.
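
A minimal Python sketch of this formula (the example sets are arbitrary):

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

print(jaccard_similarity({"data", "mining", "rules"}, {"data", "mining", "trees"}))  # 0.5
```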

Pearson Correlation Coefficient

The Pearson correlation coefficient is a widely used similarity measure in data mining and
statistical analysis. It measures the linear correlation between two continuous variables, X
and Y. The Pearson correlation coefficient ranges from -1 to +1, with -1 indicating a perfect
negative correlation, 0 indicating no correlation, and +1 indicating a perfect positive
correlation. The Pearson correlation coefficient is commonly used in data mining applications
such as feature selection and regression analysis. It can help identify variables that are highly
correlated with each other, which can be useful for reducing the dimensionality of a dataset.
In regression analysis, it can also be used to predict the value of one variable based on the
value of another variable.

The Pearson correlation coefficient between two variables, X and Y, is calculated as follows -

ρ(X, Y) = cov(X, Y) / (σX σY) = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / ( √(Σᵢ (Xᵢ − X̄)²) · √(Σᵢ (Yᵢ − Ȳ)²) )

where cov(X, Y) is the covariance between variables X and Y, σX and σY are the standard deviations of X and Y, respectively, and the sums run over i = 1, …, n.
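
A minimal NumPy sketch of this formula (the example sequences are arbitrary):

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfect positive)
print(pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (perfect negative)
```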

Sørensen-Dice Coefficient

The Sørensen-Dice coefficient, also known as the Dice similarity index or Dice coefficient, is
a similarity measure used to compare the similarity between two sets of data, typically used
in the context of text or image analysis. The coefficient ranges from 0 to 1, with 0 indicating
no similarity and 1 indicating perfect similarity. The Sørensen-Dice coefficient is commonly
used in text analysis to compare the similarity between two documents based on the set of
words or terms they contain. It is also used in image analysis to compare the similarity
between two images based on the set of pixels they contain.

The Sørensen-Dice coefficient between two sets, A and B, is calculated as follows -


S(A, B) = 2|A ∩ B| / (|A| + |B|)

where |A ∩ B| is the size of the intersection of sets A and B, and |A| and |B| are the sizes of sets A and B, respectively.
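
A minimal Python sketch of this formula, here comparing the word sets of two toy "documents":

```python
def dice_coefficient(a, b):
    """2|A ∩ B| / (|A| + |B|) for two sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

doc1 = {"data", "mining", "finds", "patterns"}
doc2 = {"data", "mining", "finds", "rules"}
print(dice_coefficient(doc1, doc2))   # 0.75
```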

Choosing The Appropriate Similarity Measure

Choosing an appropriate similarity measure depends on the nature of the data and the specific
task at hand. Here are some factors to consider when choosing a similarity measure -

• Different similarity measures are suitable for different data types, such as continuous or categorical data, text or image data, etc. For example, the Pearson correlation coefficient is only suitable for continuous variables.
• Some similarity measures are sensitive to the scale of measurement of the data.
• The choice of similarity measure also depends on the specific task at hand. For example,
cosine similarity is often used in information retrieval and text mining, while Jaccard
similarity is commonly used in clustering and recommendation systems.
• Some similarity measures are more robust to noise and outliers in the data than others. For
example, the Sørensen-Dice coefficient is less sensitive to noise.

Decision Tree in Data Mining

A decision tree is a data mining technique that builds a model for the classification of data. The models are built in the form of a tree structure and hence belong to the supervised form of learning. Besides classification models, decision trees are also used to build regression models for predicting class labels or values, aiding the decision-making process. A decision tree can use both numerical and categorical data, such as gender, age, etc.

Structure of a decision tree

The structure of a decision tree consists of a root node, branches, and leaf nodes. The internal nodes represent tests on an attribute, the branches represent the outcomes of those tests, and the leaf nodes represent class labels.
Working of a decision tree

1. A decision tree works under the supervised learning approach for both discrete and continuous variables. The dataset is split into subsets on the basis of the dataset's most significant attribute. Identification of this attribute and the splitting are done by the algorithm.

2. The structure of the decision tree starts from the root node, which is the most significant predictor node. Splitting occurs at the decision nodes, which are the sub-nodes of the tree. The nodes that do not split further are termed leaf or terminal nodes.

3. The dataset is divided into homogeneous and non-overlapping regions following a top-down approach. The top layer holds all the observations in a single place, which then split into branches. The process is termed a "greedy approach" because it focuses only on the current node rather than future nodes.

4. The decision tree keeps growing until a stopping criterion is reached.

5. While a decision tree is being built, it can absorb a lot of noise and outliers. To remove these, a method called "tree pruning" is applied, which increases the accuracy of the model.

6. The accuracy of a model is checked on a test set consisting of test tuples and class labels. Accuracy is measured as the percentage of test set tuples that the model classifies correctly (a code sketch follows Figure 1 below).
Figure 1: An example of an unpruned and a pruned tree
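
As an illustration of training, evaluating, and pruning a decision tree, here is a minimal scikit-learn sketch; the use of the Iris dataset and a maximum depth of 3 are illustrative assumptions, not part of the original text.

```python
# Minimal decision tree sketch on the Iris dataset (an assumed example dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# max_depth limits tree growth - a simple pre-pruning guard against noise/overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(load_iris().feature_names)))  # readable rules
```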

Types of Decision Tree


Decision trees lead to the development of models for classification and
regression based on a tree-like structure. The data is broken down into smaller
subsets. The result of a decision tree is a tree with decision nodes and leaf nodes.
Two types of decision trees are

• Classification
• Regression

1. Classification

Classification involves building models that describe important class labels. It is applied in the areas of machine learning and pattern recognition.

The two-step process of a classification model includes:

• Learning: A classification model is built based on the training data.
• Classification: The model's accuracy is checked, and the model is then used to classify new data. Class labels are in the form of discrete values like "yes" or "no", etc.
2. Regression

Regression models are used for the regression analysis of data, i.e. the prediction
of numerical attributes. These are also called continuous values. Therefore,
instead of predicting the class labels, the regression model predicts the
continuous values.

Functions of Decision Tree in Data Mining


• Classification: Decision trees serve as powerful tools for classification
tasks in data mining. They classify data points into distinct categories
based on predetermined criteria.
• Prediction: Decision trees can predict outcomes by analyzing input
variables and identifying the most likely outcome based on historical data
patterns.
• Visualization: Decision trees offer a visual representation of the decision-making process, making it easier for users to interpret and understand the underlying logic.
• Feature Selection: Decision trees assist in identifying the most relevant
features or variables that contribute to the classification or prediction
process.
• Interpretability: Decision trees provide transparent and interpretable
models, allowing users to understand the rationale behind each decision
made by the algorithm.

List of Applications
1. Healthcare

Decision trees allow the prediction of whether a patient is suffering from a particular disease based on attributes such as age, weight, sex, etc. Other predictions include deciding the effect of a medicine considering factors like its composition, period of manufacture, etc.
2. Banking sectors

Decision trees help in predicting whether a person is eligible for a loan based on their financial status, salary, family members, etc. They can also identify credit card fraud, loan defaults, etc.

3. Educational Sectors

Decision trees can help shortlist students based on their merit scores, attendance, and similar criteria.

List of Advantages
• The interpretable results of a decision tree model can be presented to senior management and stakeholders.
• While building a decision tree model, preprocessing of the data, i.e. normalization, scaling, etc., is not required.
• A decision tree can handle both numerical and categorical data, which makes it more broadly applicable than many other algorithms.
• Missing values in the data do not derail the decision tree building process, making it a flexible algorithm.

Neural Networks in Data Mining:

Neural networks, a subset of machine learning algorithms, play an essential role in data mining.

Neural networks are powerful tools in data mining because of their ability to learn complex patterns from massive datasets.

Applications include:

o Pattern Recognition: Neural networks excel at recognizing patterns within data, making them valuable for tasks such as image and speech recognition, fraud detection, and medical diagnosis.
o Classification: In classification tasks, neural networks categorize input data into predefined classes. Applications include email spam detection, sentiment analysis, and disease diagnosis.
o Regression: Neural networks can perform regression tasks by predicting numerical values. This is useful in scenarios such as predicting stock prices, sales forecasts, and housing prices.
o Clustering: Neural networks can be applied to clustering problems, grouping similar data points together. This is useful in customer segmentation, anomaly detection, and data compression.

Types of Neural Networks

1. Feedforward Neural Networks (FNN):

Feedforward neural networks are the simplest form of neural network, in which information flows in a single direction, from the input layer through the hidden layers to the output layer. They are commonly used for classification and regression tasks.

2. Recurrent Neural Networks (RNN):
Recurrent neural networks have connections that form cycles, letting them capture temporal dependencies in sequential data. RNNs are suitable for tasks involving time series analysis, natural language processing, and speech recognition.

3. Convolutional Neural Networks (CNN):
Convolutional neural networks are designed to process grid-like data such as images. They employ convolutional layers to automatically learn hierarchical representations of patterns, making them highly effective in image recognition and computer vision tasks.

4. Radial Basis Function Networks (RBFN):
Radial basis function networks use radial basis functions as activation functions in the hidden layers. They are often employed for pattern recognition and function approximation.

Data Preparation for Neural Networks:

o Feature Scaling: Neural networks benefit from feature scaling, which ensures that all input features have a similar scale. Common scaling strategies include normalization and standardization.
o Handling Missing Data: Addressing missing data is important for effective neural network training. Techniques like imputation or exclusion of incomplete records help maintain the data's integrity.
o Data Splitting: Datasets are generally split into training, validation, and test sets. The training set is used to fit the model, the validation set helps tune hyperparameters, and the test set evaluates the model's performance on unseen data. A minimal sketch of these steps follows this list.
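
A minimal scikit-learn sketch of these preparation steps follows; the toy feature array, labels, and split size are made-up examples.

```python
# Minimal data-preparation sketch: split the data, impute missing values, scale features.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 240.0],
              [4.0, 260.0], [5.0, 300.0], [6.0, 310.0]])   # np.nan marks a missing value
y = np.array([0, 0, 0, 1, 1, 1])

# split first (a validation set could be carved out of the training portion for tuning)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

imputer = SimpleImputer(strategy="mean").fit(X_train)        # handle missing data
scaler = StandardScaler().fit(imputer.transform(X_train))    # feature scaling

X_train = scaler.transform(imputer.transform(X_train))
X_test = scaler.transform(imputer.transform(X_test))
print(X_train.shape, X_test.shape)
```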

Neural Network Architecture for Data Mining:

o Input Layer: The input layer of a neural network consists of neurons corresponding to the features of the dataset. Each neuron represents a feature, and the feature values are fed into the network during training.
o Hidden Layers: Hidden layers are where the network learns and extracts features from the input data. The number of hidden layers and of neurons in each layer is a crucial part of the network architecture and is often determined through experimentation.
o Output Layer: The output layer produces the final predictions or classifications. The number of neurons in this layer depends on the nature of the task: binary classification, multi-class classification, or regression.

Training of ANN:
We can train a neural network by feeding it training patterns and letting it change its weights according to some learning rule. The learning situations can be categorized as follows.
1. Supervised Learning: The network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external system.
2. Unsupervised Learning: The output is trained to respond to clusters of patterns within the input. Unsupervised learning uses a machine learning algorithm to analyze and cluster unlabeled datasets.
3. Reinforcement Learning: This type of learning may be considered an intermediate form of the above two types; it trains the model to return an optimum solution to a problem by taking a sequence of decisions on its own.

Training Algorithm:

o Backpropagation: One of the most important algorithms for training neural networks is backpropagation. It iteratively adjusts the weights by following the gradient of the error with respect to those weights, ensuring that the difference between the predicted and actual outputs becomes minimal (see the sketch after this list).
o Activation Functions: Activation functions introduce nonlinearity into the neural network, enabling it to learn complex relationships. Typical activation functions are the sigmoid, the hyperbolic tangent (tanh), and rectified linear units (ReLU).
o Regularization: Regularization techniques such as dropout and weight decay are applied during training to prevent overfitting. These techniques help the model generalize well to new, unseen data.
o Hyperparameter Tuning: The selection of appropriate hyperparameters, such as the learning rate, batch size, and number of hidden layers, drastically influences the performance of a neural network. Hyperparameter tuning often involves grid search or random search methods.
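
As a small illustration of these ideas, the sketch below trains a feedforward network with scikit-learn's MLPClassifier on synthetic data; the layer sizes and hyperparameter values are arbitrary assumptions. MLPClassifier applies backpropagation internally, uses the chosen activation function in its hidden layers, and exposes L2 weight decay through the alpha parameter.

```python
# Minimal feedforward-network training sketch (backpropagation handled internally).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)   # synthetic data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),   # two hidden layers (a hyperparameter to tune)
    activation="relu",             # rectified linear unit activation
    alpha=1e-4,                    # L2 weight decay (regularization)
    learning_rate_init=0.001,      # learning rate (another key hyperparameter)
    max_iter=500,
    random_state=0,
)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```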

Challenges in Data Mining with Neural Networks:

o Overfitting: Neural networks are prone to memorizing the training data, which leads to poor generalization on new data. Regularization techniques and appropriate validation strategies mitigate this problem.
o Interpretability: Neural networks are often called 'black box' models because it is difficult to explain why a particular prediction was made. In domains that require transparency, this lack of interpretability becomes a problem.
o Computational Resources: Training large neural networks is a heavy computational task that requires powerful GPUs or TPUs. This can be a limiting factor, especially for small-scale projects or organizations with limited resources.

Genetic Algorithm in Data Mining

A genetic algorithm in data mining is an advanced method of data classification.

A genetic algorithm emulates the principles of natural evolution, i.e. survival of the fittest. Natural evolution propagates the genetic material of the fittest individuals from one generation to the next.

The genetic algorithm iteratively performs selection, crossover, mutation, and encoding to evolve successive generations of models.

The components of genetic algorithms consist of:

• A population incorporating individuals.
• An encoding or decoding mechanism for individuals.
• The objective function and an associated fitness evaluation criterion.
• A selection procedure.
• Genetic operators like recombination (crossover) and mutation.
• Probabilities of performing the genetic operations.
• A replacement technique.
• A termination condition.

Six Phases of a Genetic Algorithm

Initial population

This first phase of the algorithm starts with a set of individuals, where each individual is a candidate solution to the problem. Each individual is characterized by a set of parameters referred to as genes.

Calculate Fitness

A fitness function is implemented to compute the fitness of each individual in the population. The function assigns a fitness score to each individual, and this score determines the probability that the individual is selected in the reproduction process.

Selection

The selection process selects the individuals with the highest fitness scores, which are allowed to pass their genes on to the next generation.
Crossover

Crossover is a core phase of the genetic algorithm. The algorithm chooses a crossover point within the genes of the parents selected for mating. Offspring are generated by having the parents exchange their genes up to the crossover point, and the newly created offspring are added to the population.

Mutation

The mutation phase inserts random genes into the generated offspring to
maintain the population’s diversity. It is done by flipping random genes in
new offspring.

Termination

The algorithm stops iterating when it produces offspring that are not significantly different from the previous generation. At this stage, it is said to have produced a set of solutions to the problem.
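
To tie the six phases together, here is a toy genetic algorithm for the "one-max" problem (evolving a bit string of all 1s); the population size, mutation rate, and fitness function are illustrative choices, not part of the original text.

```python
# Toy genetic algorithm: evolve a bit string of all 1s ("one-max" problem).
import random

GENES, POP_SIZE, MUTATION_RATE, GENERATIONS = 20, 30, 0.02, 100

def fitness(individual):      # calculate fitness: number of 1-genes
    return sum(individual)

def select(population):       # selection: the fitter of two randomly drawn individuals
    return max(random.sample(population, 2), key=fitness)

def crossover(p1, p2):        # single-point crossover of two parents
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(child):            # mutation: flip random genes to keep diversity
    return [1 - g if random.random() < MUTATION_RATE else g for g in child]

# initial population of random individuals (each gene is 0 or 1)
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]

for generation in range(GENERATIONS):
    if fitness(max(population, key=fitness)) == GENES:   # termination: perfect individual
        break
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("Best individual:", "".join(map(str, best)), "fitness:", fitness(best))
```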

Advantages and Disadvantages


Advantages

• Easy to understand, as it is based on the concept of natural evolution.
• Identifies an optimal solution from a set of candidate solutions.
• GA uses payoff (fitness) information instead of derivatives to yield an optimal solution.
• GA supports multi-objective optimization.
• GA is an adaptive search algorithm.
• GA also operates well in noisy environments.

Disadvantages

• An improper implementation may lead to a solution that is not optimal.
• Evaluating the fitness function repeatedly may lead to computational challenges.
• GA is time-consuming as it involves a lot of computation.

Applications of GA
GA is used in many applications; let's discuss a few of them.
• Economics: In the field of economics, GA is used to implement models that conduct competitive analysis, decision making, and effective scheduling.
• Aircraft Design: GA is used to provide the parameters that must be
modified and upgraded in order to get a better design.
• DNA Analysis: GA is used to establish DNA structure using spectrometric
information.
• Transport: GA is used to develop a transport plan that is time- and cost-efficient.
