
Classification and Prediction in Data Mining

There are two forms of data analysis that can be used to extract models describing important
classes or predict future data trends. These two forms are as follows:
1. Classification
2. Prediction

What is Classification?
Classification is the task of identifying the category, or class label, of a new observation. First, a set of data is used as
training data: the input data and the corresponding class labels are given to the algorithm. So, the training data
set includes the input data and their associated class labels. Using the training dataset, the algorithm derives a
model, or classifier. The derived model can be a decision tree, a mathematical formula, or a neural network. In
classification, when unlabeled data is given to the model, it should find the class to which it belongs. The new
data provided to the model is called the test data set.

A simple example of classification is checking whether it is raining or not: the answer can be either
yes or no, so there is a fixed number of choices. Sometimes there are more than two classes to
choose from; that is called multiclass classification.

How does Classification Work?


The functioning of classification is often illustrated with the example of a bank loan application, where
the model must decide whether an applicant is likely to repay. There are two stages in the data
classification system: creating the classifier (model creation) and applying the classifier for classification.
1. Developing the Classifier or model creation: This stage is the learning stage, or the learning process.
The classification algorithm constructs the classifier in this stage. A classifier is constructed from a
training set composed of database records and their corresponding class labels. Each record in the
training set may also be referred to as a sample, an object, or a data point.
2. Applying the classifier for classification: The classifier is used for classification at this stage. The test data
are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed
sufficient, the classification rules can be applied to new data records. Applications include:
a. Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring; we
can use it to extract insights from social media. With advanced machine learning
algorithms, we can build sentiment analysis models that read and analyze even misspelled
words. Well-trained models provide consistently accurate results in a fraction of the time.
b. Document Classification: We can use document classification to organize the documents
into sections according to the content. Document classification refers to text classification;
we can classify the words in the entire document. And with the help of machine learning
classification algorithms, we can execute it automatically.
c. Image Classification: Image classification assigns an image to one of a set of trained
categories, such as a caption, a statistical value, or a theme. You can tag images to
train your model for the relevant categories by applying supervised learning algorithms.
d. Machine Learning Classification: It uses statistically demonstrable algorithmic rules to
execute analytical tasks that would take humans hundreds of hours to perform.
3. Data Classification Process: The data classification process can be broken into five steps:
a. Define the goals, strategy, workflows, and architecture of data classification.
b. Identify and classify the confidential data that is stored.
c. Apply labels to the data through data labelling.
d. Use the results to improve security and compliance.
e. Treat classification as a continuous process, since data keeps changing.

What is the Data Classification Lifecycle?


The data classification life cycle provides an excellent structure for controlling the flow of data in an
enterprise. Businesses need to account for data security and compliance at each stage. Data
classification makes this possible at every stage, from origin to deletion. The data life cycle has
the following stages:

1. Origin: Sensitive data is produced in various formats, including emails, Excel, Word, Google documents,
social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it
according to in-house security policies and compliance rules.
3. Storage: The collected data is stored with access controls and encryption.
4. Sharing: Data is continually distributed among agents, consumers, and co-workers from various devices
and platforms.
5. Archive: Here, data is eventually archived within the organization's storage systems.
6. Publication: Through publication, data can reach customers, who can then view and download
it in the form of dashboards.
What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. As in
classification, the training dataset contains the inputs and the corresponding numerical output values. The
algorithm derives the model, or predictor, from the training dataset. The model should find a
numerical output when new data is given. Unlike classification, this method does not use a
class label; the model predicts a continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts
such as the number of rooms, the total area, and so on, is an example of prediction.
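The house-price example can be sketched with simple linear regression, the most basic form of the regression mentioned above. The room counts and prices below are invented for illustration; a real predictor would use many more attributes.

```python
# A minimal sketch of prediction via simple linear regression:
# fit price = a + b * rooms by least squares. All figures are invented.

def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

rooms = [2, 3, 4, 5]
prices = [100, 150, 200, 250]     # in thousands, purely illustrative

a, b = fit_line(rooms, prices)
predicted = a + b * 6             # predict the price of a 6-room house
print(predicted)                  # 300.0: the output is a number, not a class
```

Note the contrast with classification: the model outputs a continuous value rather than choosing among a fixed set of labels.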

Classification and Prediction Issues


The major issue is preparing the data for Classification and Prediction. Preparing the data involves the
following activities, such as:

1. Data Cleaning: Data cleaning involves removing noise and treating missing values. Noise is removed
by applying smoothing techniques, and missing values are handled by replacing a missing value with
the most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also contain irrelevant attributes. Correlation analysis is used to
determine whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the following methods.
o Normalization: Normalization involves scaling all values of a given attribute so that they
fall within a small specified range. It is used when neural networks or methods involving
distance measurements are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to a higher-level concept.
For this purpose, we can use concept hierarchies.
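As a sketch of the normalization step, min-max scaling is one common way to bring all values of an attribute into a small specified range; the income figures below are invented for illustration.

```python
# A sketch of min-max normalization: scale an attribute's values so they
# fall within [new_min, new_max] (by default [0, 1]). Values are invented.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the [new_min, new_max] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [20000, 35000, 50000, 80000]
print(min_max_normalize(incomes))   # [0.0, 0.25, 0.5, 1.0]
```

Rescaling like this keeps attributes with large raw ranges (such as income) from dominating attributes with small ranges (such as age) in distance-based methods.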

Difference between Classification and Prediction


1. Classification is the process of identifying which category a new observation belongs to, based on a
training data set containing observations whose category membership is known. Prediction is the
process of identifying the missing or unavailable numerical data for a new observation.
2. In classification, the accuracy depends on finding the class label correctly. In prediction, the accuracy
depends on how well a given predictor can guess the value of a predicted attribute for new data.
3. In classification, the model is known as the classifier. In prediction, the model is known as the predictor.
4. A classifier is constructed to find categorical labels. A predictor is constructed to predict a
continuous-valued function or ordered value.
5. For example, grouping patients based on their medical records is classification. Predicting the correct
treatment for a particular disease for a person is prediction.
Decision tree
Decision trees are used on a wide range of problems. Decision tree cells are formed by splitting each dimension into a
number of evenly spaced partitions. A variable of interest, such as response rate, experience, or average order size,
is measured for each cell. New records are scored by determining which cell they belong to.
Decision trees use two techniques. A top-down approach recursively splits the data into smaller and smaller cells with
similar values of the target variable. The degree to which a cell contains similar values is known as the purity of the cell.
Each cell in the decision tree is treated independently, and a new split for a particular cell is found using an algorithm
that tests splits based on all available variables of interest. A decision tree is used for variable selection as well as for
building models or classifiers.
The other technique is a bottom-up approach, in which the decision tree uses the target variable to determine how the
input should be partitioned. The decision tree then breaks the data into segments using the split rules at each
step, and the rules for all segments taken together form the decision tree classifier. The rules can be expressed in
simple English, and then in a database query language to retrieve or score similar records. Decision-tree-based
models can be used for data mining tasks like classification, estimation, or prediction.
Decision Tree Induction
A decision tree is a supervised learning method used in data mining for classification and regression
tasks. It is a tree that helps us in decision-making. The decision tree creates classification or
regression models in the form of a tree structure. It separates a data set into smaller and smaller
subsets while, at the same time, the decision tree is incrementally developed. The final tree has
decision nodes and leaf nodes. A decision node has two or more branches; a leaf node represents a
classification or decision and cannot be split further. The uppermost decision node in a tree, which
corresponds to the best predictor, is called the root node. Decision trees can handle both categorical
and numerical data.
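A tree of decision nodes and leaf nodes, as described above, can be represented as a simple nested structure. The weather attributes and yes/no labels in the sketch below are invented for illustration.

```python
# A minimal sketch of a decision tree: decision nodes test an attribute,
# leaf nodes carry a class label. The tree and records are invented.

tree = {
    "attribute": "outlook",                              # root node
    "branches": {
        "sunny": {"label": "no"},                        # leaf node
        "rain": {"attribute": "windy",                   # decision node
                 "branches": {"yes": {"label": "no"},
                              "no": {"label": "yes"}}},
    },
}

def classify(node, record):
    """Walk the tree from the root until a leaf node is reached."""
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

print(classify(tree, {"outlook": "sunny"}))                 # "no"
print(classify(tree, {"outlook": "rain", "windy": "no"}))   # "yes"
```

Scoring a new record is just a walk from the root node down to a leaf, following the branch that matches the record's attribute value at each decision node.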

Why are decision trees useful?


It enables us to analyze the possible consequences of a decision thoroughly.
It provides us with a framework to quantify the values of outcomes and the probabilities of achieving
them.
It helps us make the best decisions based on existing data and reasonable speculation.
Advantages of using decision trees:
A decision tree does not require scaling or standardization of the data.
Missing values in the data do not significantly affect the process of building a decision tree.
A decision tree model is intuitive and simple to explain to technical teams as well as stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation during pre-
processing.
Decision tree Algorithm:
The decision tree algorithm may appear long, but it is quite simple. The basic algorithm works as
follows:
The algorithm takes three parameters: D, attribute_list, and Attribute_selection_method.
Generally, we refer to D as a data partition.
Initially, D is the entire set of training tuples and their associated class labels (the input training data).
The parameter attribute_list is the set of attributes describing the tuples.
Attribute_selection_method specifies a heuristic procedure for selecting the attribute that "best"
discriminates the given tuples according to class.
The Attribute_selection_method procedure applies an attribute selection measure.
Attribute Selection Measures
A number of criteria may be used to evaluate or select potential splits. Alternative splitting criteria lead to decision
trees that look quite different from each other but have similar performance, and different purity measures tend to
select different splits. An attribute selection measure selects the splitting criterion that provides the best separation
of a given partition D of class-labeled training tuples into individual classes. Splitting D into smaller partitions
according to the splitting criterion should ideally result in each partition being pure; the "best" splitting criterion is
the one that results in new partitions of the highest purity. The attribute selection measure provides a ranking for
each attribute, and the attribute with the best score is chosen as the splitting attribute for the given tuples.

Entropy:
Entropy is used to measure the disorder present in a given dataset. In simple terms, Entropy tells us
the impurity present in the dataset, and it is used to determine the best splits for partitioning the data.

Following is the mathematical formula for Entropy:

Entropy(D) = - Σ (i = 1 to n) p_i * log2(p_i)
In the above formula,

o Entropy (D) is the Entropy of the dataset "D".


o "n" is the dataset's number of distinct classes or categories.
o "pi" is the proportion of data points in class i to the total data points in "D".
To calculate the Entropy of a dataset, the formula sums the term p_i * log2(p_i) over each distinct
class. Since log2(p_i) is never positive for a proportion, the leading negative sign ensures the Entropy
is non-negative. The formula commonly uses logarithm base 2 (log2), making the unit of Entropy bits.

Information Gain:
Information Gain measures the change in Entropy after splitting the dataset on an attribute. It tells us
how much information an attribute provides about the class. We choose the feature that leads to the
greatest reduction in Entropy, as it provides the most information.

Information Gain can be calculated with the help of the following formula:

IG(D, A) = H(D) - Σ_v (|Dv| / |D|) * H(Dv)

or, written informally:

Information Gain = Entropy(D) - (weighted average) * Entropy(each subset)

In the above formula,

o "IG(D, A)" is the Information Gain of dataset "D" when split on attribute A.
o "H(D)" is the Entropy of dataset "D".
o "|Dv|" is the number of instances in the subset Dv, where attribute A takes value v.
o "|D|" is the total number of instances in dataset "D".
o "H(Dv)" is the Entropy of subset Dv.
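Putting the two measures together, Information Gain can be sketched in Python. The toy weather records below are invented; "outlook" splits them into two pure subsets, so its gain equals the full entropy of the dataset.

```python
# A sketch of IG(D, A) = H(D) - sum over v of (|Dv|/|D|) * H(Dv),
# using invented toy records.

from math import log2
from collections import Counter

def entropy(labels):
    """Return the entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the whole set minus the weighted entropy of each subset."""
    labels = [r[target] for r in records]
    n = len(records)
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]

# "outlook" yields two pure subsets, so its gain is the full H(D) = 1 bit.
print(information_gain(data, "outlook", "play"))   # 1.0
```

Decision tree induction repeats this calculation for every candidate attribute at each node and splits on the attribute with the highest gain.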
