Classification and Prediction in Data Mining
There are two forms of data analysis that can be used to extract models describing important
classes or predict future data trends. These two forms are as follows:
1. Classification
2. Prediction
What is Classification?
Classification identifies the category, or class label, of a new observation. First, a set of data is used as
training data: the input data and the corresponding class labels are given to the algorithm. Using this
training dataset, the algorithm derives a model, also called the classifier. The derived model can be a
decision tree, a mathematical formula, or a neural network. When unlabeled data is given to the model,
the model should find the class to which it belongs; the new data provided to the model is the test data set.
One simple example of classification is checking whether it is raining or not: the answer can be either
yes or no, so there is a fixed number of choices. Sometimes there are more than two classes to classify;
that is called multiclass classification.
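The idea above can be sketched in a few lines of code. The following is a minimal, illustrative classifier (a 1-nearest-neighbour rule, one of many possible classifiers); the weather features and labels are made-up toy data, not from the text.

```python
# Minimal sketch of classification: a training set of labelled examples
# is used to assign a class label ("yes"/"no" for rain) to new test data.

def classify(train, new_point):
    """Return the class label of the training example closest to new_point."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: dist(ex[0], new_point))
    return label

# Training data: (humidity %, cloud cover %) -> class label
training_data = [
    ((90, 85), "yes"),
    ((85, 90), "yes"),
    ((30, 20), "no"),
    ((40, 10), "no"),
]

print(classify(training_data, (88, 80)))  # humid, overcast day -> "yes"
print(classify(training_data, (35, 15)))  # dry, clear day -> "no"
```

With more than two labels in the training data, the same code performs multiclass classification.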
In practice, the data being classified moves through a lifecycle:
1. Origin: Sensitive data is produced in various formats, including emails, Excel, Word, and Google
documents, social media, and websites.
2. Role-based practice: Role-based security restrictions are applied to all sensitive data, which is tagged
according to in-house protection policies and compliance rules.
3. Storage: The collected data is stored, with access controls and encryption applied.
4. Sharing: Data is continually shared among agents, customers, and co-workers from various devices
and platforms.
5. Archive: Eventually, data is archived within the organization's storage systems.
6. Publication: By publishing data, it can reach customers, who can then view and download it in the
form of dashboards.
What is Prediction?
Another process of data analysis is prediction, which is used to find a numerical output. As in
classification, the training dataset contains the inputs and the corresponding numerical output values.
The algorithm derives a model, or predictor, from the training dataset, and the model should produce a
numerical output when new data is given. Unlike classification, this method does not use a class label;
instead, the model predicts a continuous-valued function or ordered value.
Regression is generally used for prediction. Predicting the value of a house based on facts such as the
number of rooms, the total area, and so on, is an example of prediction.
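The house-price example can be sketched with the simplest form of regression, a least-squares line fitted to one input variable. The areas and prices below are made-up toy values chosen for illustration.

```python
# Minimal sketch of prediction: fit a least-squares line mapping a house's
# total area to its price, then predict a numerical value for a new house.

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

areas  = [50, 80, 100, 120]     # square metres (toy data)
prices = [100, 160, 200, 240]   # price in thousands (toy data)

slope, intercept = fit_line(areas, prices)
print(slope * 90 + intercept)   # predicted price for a 90 m^2 house -> 180.0
```

Note the contrast with the classifier above: the output here is a continuous value, not a class label.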
The data is prepared for classification and prediction through the following preprocessing steps:
1. Data Cleaning: Data cleaning involves removing noise and treating missing values. Noise is
removed by applying smoothing techniques, and a missing value is handled by replacing it with the
most commonly occurring value for that attribute.
2. Relevance Analysis: The database may also contain irrelevant attributes. Correlation analysis is used
to determine whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by any of the following methods.
o Normalization: Normalization involves scaling all values of a given attribute so that they
fall within a small specified range. It is used when neural networks or methods involving
distance measurements are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to a higher-level
concept. For this purpose, we can use concept hierarchies.
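Two of the steps above can be sketched directly: replacing missing values with the most common value of an attribute, and min-max normalization, one common way to scale values into a small range (here [0, 1]). The input list is toy data.

```python
# Sketch of two preprocessing steps: filling missing values with the most
# frequent value of the attribute, and min-max normalization to [0, 1].

from collections import Counter

def fill_missing(values):
    """Replace None entries with the most frequently occurring value."""
    most_common = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

def min_max_normalize(values):
    """Scale numeric values so they fall within the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [10, None, 30, 10, 50]
cleaned = fill_missing(raw)        # -> [10, 10, 30, 10, 50]
print(min_max_normalize(cleaned))  # -> [0.0, 0.0, 0.5, 0.0, 1.0]
```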
| Classification | Prediction |
| --- | --- |
| Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known. | Prediction is the process of identifying the missing or unavailable numerical value for a new observation. |
| In classification, the accuracy depends on finding the class label correctly. | In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data. |
| In classification, the model can be known as the classifier. | In prediction, the model can be known as the predictor. |
| A model or classifier is constructed to find the categorical labels. | A model or predictor is constructed that predicts a continuous-valued function or ordered value. |
| For example, grouping patients based on their medical records can be considered classification. | For example, predicting the correct treatment for a particular disease for a person can be considered prediction. |
Decision tree
Decision trees are used on a wide range of problems. A decision tree divides the data into cells by
repeatedly splitting on the values of the input variables. A variable of interest, such as response rate,
experience, or average order size, is measured for each cell. New records are scored by determining
which cell they belong to.
Decision trees use two techniques. A top-down approach recursively splits the data into smaller and
smaller cells with similar values of the target variable. The degree to which a cell contains similar values
is known as the purity of the cell. Each cell in the decision tree is treated independently, and a new split
for a particular cell is found using an algorithm that tests candidate splits on all available input variables.
A decision tree can be used for variable selection as well as for building models or classifiers.
The other technique is a bottom-up approach, in which the decision tree uses the target variable to
determine how the input should be partitioned. The decision tree then breaks the data into segments
using the split rules at each step, and the rules for all segments taken together form the decision tree
classifier. The rules can be expressed in simple English and then translated into a database query
language to retrieve or score similar records. Decision-tree-based models can be used for data mining
tasks such as classification, estimation, or prediction.
Decision Tree Induction
A decision tree is a supervised learning method used in data mining for classification and regression.
It is a tree that helps us in decision-making. The decision tree represents a classification or regression
model as a tree structure: it separates a data set into smaller and smaller subsets while the tree is
incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node
has at least two branches, while a leaf node represents a classification or decision and cannot be split
further. The uppermost decision node in a tree, which corresponds to the best predictor, is called the
root node. Decision trees can handle both categorical and numerical data.
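The structure just described can be made concrete with a tiny hand-built tree: decision nodes test an attribute, leaf nodes hold a class label, and a record is classified by walking from the root node down to a leaf. The attributes and labels below are invented for illustration.

```python
# Sketch of a decision tree as nested dictionaries: decision nodes test an
# attribute and branch on its value; leaf nodes carry the final class label.

tree = {
    "attribute": "outlook",                 # root node (the "best predictor")
    "branches": {
        "sunny":    {"label": "play"},      # leaf node
        "overcast": {"label": "play"},      # leaf node
        "rainy": {                          # a further decision node
            "attribute": "windy",
            "branches": {
                True:  {"label": "stay home"},
                False: {"label": "play"},
            },
        },
    },
}

def classify(node, record):
    """Walk from the root to a leaf node, then return the leaf's label."""
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

print(classify(tree, {"outlook": "rainy", "windy": True}))   # -> stay home
print(classify(tree, {"outlook": "sunny", "windy": False}))  # -> play
```

Reading each root-to-leaf path as an if-then rule gives exactly the "rules in simple English" mentioned above.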
Entropy:
Entropy measures the disorder present in a given dataset. In simple terms, Entropy tells us the
impurity present in the dataset, and it is used to determine the best splits for partitioning the data.
For a dataset S, it is defined as Entropy(S) = -Σ p_i log2(p_i), where p_i is the proportion of
examples in S that belong to class i.
Information Gain:
It is used to find the change in Entropy after splitting the dataset on an attribute. It tells us how much
information an attribute provides about the class. We choose the feature that leads to the most
significant reduction in Entropy, as it provides the most information.
Information Gain can be calculated with the help of the following formula:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) × Entropy(S_v),
where the sum runs over the values v of attribute A and S_v is the subset of S for which attribute A
has value v.
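Both quantities can be computed directly from lists of class labels. This sketch implements the standard formulas Entropy(S) = -Σ p_i log2(p_i) and Gain(S, A) = Entropy(S) - Σ_v (|S_v|/|S|) Entropy(S_v); the labels are toy data.

```python
# Sketch of Entropy and Information Gain computed from class labels.

from math import log2

def entropy(labels):
    """Impurity of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(labels, partitions):
    """Reduction in entropy after splitting labels into the given partitions."""
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))  # two equally likely classes -> 1.0 bit

# A split on some attribute that separates the two classes perfectly
# removes all impurity, so the gain equals the full initial entropy:
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # -> 1.0
```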