
UNIT-II

Data Mining & Data Pre-processing

Syllabus:

Data Mining: Introduction, Data Mining, Motivating challenges, Origins of Data Mining, Data
Mining Tasks, Types of Data, Data Quality.

Data Preprocessing: Aggregation, Sampling, Dimensionality Reduction, Feature Subset Selection,


Feature creation, Discretization and Binarization, Variable Transformation, Measures of Similarity
and Dissimilarity. (Tan & Vipin)

End of Syllabus

What Is Data Mining?

Data mining is the process of automatically discovering useful information in large data repositories.
Data mining techniques are deployed to scour large databases in order to find novel and useful
patterns. They also provide capabilities to predict the outcome of a future observation. Data mining is
an integral part of knowledge discovery in databases (KDD), which is the overall process of
converting raw data into useful information, as shown in Figure 1.1. This process consists of a series
of transformation steps, from data pre-processing to post-processing of data mining results.

The input data can be stored in a variety of formats like flat files, spreadsheets, or relational tables.
The data may reside in a centralized data repository or be distributed across multiple sites. Pre-
processing transforms the raw input data into an appropriate format suitable for subsequent analysis.
The steps involved in data pre-processing include fusing data from multiple sources, cleaning data to
remove noise and duplicate observations, and selecting records and features that are relevant to the
data mining task at hand. Data pre-processing is the most laborious and time-consuming step in the
overall knowledge discovery process. A post-processing step ensures that only valid and useful results
are incorporated into the decision support system. An example of post-processing is visualization,
which allows analysts to explore the data and the data mining results from a variety of viewpoints.

Figure 1.1
Motivating Challenges:

Traditional data analysis techniques often cannot meet the challenges posed by new kinds of data sets. The following are some of the specific challenges that motivated the development of data mining.

Scalability: Because of advances in data generation and collection, data sets with huge sizes are
becoming common. The data mining algorithms used to handle these massive data sets must be
scalable. Many data mining algorithms employ special search strategies to handle exponential search
problems. Scalability may also require the implementation of novel data structures to access
individual records in an efficient manner.

High Dimensionality: Data sets may contain hundreds of attributes. In bioinformatics, gene expression data may involve a large number of features. Data sets with temporal or spatial components also tend to have high dimensionality. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for high-dimensional data. The computational complexity also increases rapidly as the dimensionality increases.

Heterogeneous and Complex Data: Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. Efficient techniques are needed to handle heterogeneous attributes. More complex data objects are also emerging. Examples of such data include collections of Web pages containing semi-structured text and hyperlinks, and DNA data with sequential and three-dimensional structure.

Data ownership and Distribution: Sometimes, the data needed for an analysis is not stored in one
location or owned by one organization. Instead, the data is geographically distributed among
resources belonging to multiple entities. This requires the development of distributed data mining
techniques. The key challenges in this task are: (1) how to reduce the amount of communication
needed to perform the distributed computation, (2) how to effectively consolidate the data mining
results obtained from multiple sources, and (3) how to address data security issues.

Non-traditional Analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, so the processes of hypothesis generation and evaluation must be automated. The data sets frequently involve non-traditional types of data and data distributions. All of these issues must be addressed.
The Origins of Data Mining: The goal of meeting these challenges is pursued by developing more efficient and scalable tools that can handle diverse types of data. This work builds upon the methodology and algorithms previously used by researchers. Data mining draws upon ideas such as (1) sampling, estimation, and hypothesis testing from statistics, and (2) search algorithms, modelling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining also adopts ideas from other areas such as optimization, evolutionary computing, and information theory. A number of other areas play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high-performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location. Figure 1.2 shows the relationship of data mining to other areas.

Data Mining Tasks: Data mining tasks are generally divided into two major categories:

Predictive Tasks: The objective of these tasks is to predict the value of a particular attribute based on
the values of other attributes. The attribute to be predicted is commonly known as the target or
dependent variable, while the attributes used for making the prediction are known as the explanatory
or independent variables.

Descriptive Tasks: Here, the objective is to derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks
frequently require post-processing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks.

Figure 1.3 Four of the core data mining tasks

Predictive Modelling refers to the task of building a model for the target variable as a function of the
explanatory variables. There are two types of predictive modelling tasks: classification, which is used
for discrete target variables, and regression, which is used for continuous target variables. For
example, predicting whether a Web user will make a purchase at an online bookstore is a
classification task because the target variable is binary-valued. On the other hand, forecasting the
future price of a stock is a regression task because price is a continuous-valued attribute. The goal of
both tasks is to learn a model that minimizes the error between the predicted and true values of the
target variable.

Examples of predictive modelling include identifying customers that will respond to a marketing campaign, predicting disturbances in the Earth's ecosystem, and judging whether a patient has a particular disease based on the results of medical tests.

Association Analysis is used to discover patterns that describe strongly associated features in the
data. The discovered patterns are typically represented in the form of implication rules or feature
subsets. Because of the exponential size of its search space, the goal of association analysis is to
extract the most interesting patterns in an efficient manner. Useful applications of association analysis
include finding groups of genes that have related functionality, identifying Web pages that are
accessed together, or understanding the relationships between different elements of Earth's climate
system.

Cluster Analysis seeks to find groups of closely related objects so that objects that belong to the same cluster are more similar to each other than to objects that belong to other clusters. Clustering has been used to group sets of related customers, to group related documents, to find areas of the ocean that have a significant impact on the Earth's climate, and to compress data.

Anomaly Detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies or outliers. The goal of
an anomaly detection algorithm is to discover the real anomalies and avoid falsely labelling normal
objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a
low false alarm rate. Applications of anomaly detection include the detection of fraud, network
intrusions, unusual patterns of disease, and ecosystem disturbances.

Types of Data

A data set can often be viewed as a collection of data objects. Other names for a data object are
record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects are
described by a number of attributes that capture the basic characteristics of an object, such as the mass
of a physical object or the time at which an event occurred. Other names for an attribute are variable,
characteristic, field, feature, or dimension.

General Characteristics of Data Sets: There are three major characteristics that apply to many data
sets and have a significant impact on the data mining techniques. They are: Dimensionality, Sparsity,
and Resolution.

Dimensionality

The dimensionality of a data set is the number of attributes that describe the objects in the data set. Data sets with low dimensionality are generally easier to analyze and tend to produce better data mining results. The difficulties associated with analyzing high-dimensional data are referred to as the curse of dimensionality. For this reason, an important motivation for pre-processing the data is dimensionality reduction.
Sparsity

For some data sets, such as those with asymmetric features, most attributes of an object have values of
0; in many cases, less than 1% of the entries are non-zero. In practical terms, sparsity is an advantage
because usually only the non-zero values need to be stored and manipulated. This results in significant
savings with respect to computation time and storage. Furthermore, some data mining algorithms
work well only for sparse data.

Resolution

It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions. For instance, the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers. The patterns found in the data also depend on the level of resolution. If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is too coarse, the pattern may disappear.

Types of Data or Data Sets

There are many types of data sets, and a greater variety of data sets continues to become available for analysis. Generally, data sets can be divided into three groups: record data, graph-based data, and ordered data.

Record Data

Much data mining work assumes that the data set is a collection of records (data objects), each of
which consists of a fixed set of data fields (attributes). In the most basic form of record data, there is
no explicit relationship among records or data fields, and every record (object) has the same set of
attributes. Record data is usually stored either in flat files or in relational databases. The database
serves as a convenient place to find records. Different types of record data are described below and
are illustrated in Figure 2.2.

Figure 2.2 Different variations of Record Data


Transaction or Market Basket Data

Transaction data is a special type of record data, where each record (transaction) involves a set of
items. Consider a grocery store. The set of products purchased by a customer during one shopping trip
constitutes a transaction, while the individual products that were purchased are the items. This type of
data is called market basket data.

The Data Matrix

If all data objects in a collection of data have the same set of numeric attributes, then the data objects
can be thought of as points (vectors) in a multidimensional space, where each dimension represents a
distinct attribute describing the object. A set of such data objects can be interpreted as an m by n
matrix, where there are m rows, one for each object, and n columns, one for each attribute.

The Sparse Data Matrix

A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and
are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse
data matrix that has only 0-1 entries. Another common example is document data. A document can be
represented as a term vector, where each term is an attribute of the vector and the value of each
component is the number of times the corresponding term occurs in the document. This representation
of a collection of documents is often called a document-term matrix.
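As a small illustration of the document-term representation, the following Python sketch (with invented documents) builds such a matrix by counting term occurrences; note that most entries are zero, so the matrix is sparse:

from collections import Counter

# Hypothetical mini-collection of documents (illustrative only).
documents = [
    "data mining finds patterns in data",
    "cluster analysis groups similar objects",
    "data cleaning improves data quality",
]

# Vocabulary: one attribute (column) per distinct term.
vocabulary = sorted({term for doc in documents for term in doc.split()})

# Document-term matrix: one row per document, one column per term.
# Each entry counts how often the term occurs in that document.
matrix = []
for doc in documents:
    counts = Counter(doc.split())
    matrix.append([counts.get(term, 0) for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)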

Graph-Based Data

A graph can sometimes be a convenient and powerful representation for data. Two cases are:

Data with Relationships among Objects

The relationships among objects frequently convey important information. In such cases, the data is
often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while
the relationships among objects are captured by the links between objects and link properties, such as
direction and weight. Consider Web pages on the World Wide Web, which contain both text and links
to other pages. Figure 2.3(a) shows a set of linked Web pages.

Data with Objects That Are Graphs

If objects have structure, that is, the objects contain sub objects that have relationships, then such
objects are frequently represented as graphs. For example, the structure of chemical compounds can
be represented by a graph, where the nodes are atoms and the links between nodes are chemical
bonds. Figure 2.3(b) shows a ball-and-stick diagram of the chemical compound Benzene.
Ordered Data

For some types of data, the attributes have relationships that involve order in time or space.
Sequential Data

Sequential or temporal data is an extension of record data, where each record has a time associated
with it. Consider a retail transaction data set that also stores the time at which the transaction took
place. This time information makes it possible to find useful patterns. Figure 2.4 (a) shows an
example of sequential transaction data.

Sequence Data

Sequence data consists of a data set that is a sequence of individual entities, like words or letters. In
this data, there are no time stamps; instead, there are positions in an ordered sequence. For example,
the genetic information of plants and animals can be represented in the form of sequences of
nucleotides that are known as genes. Figure 2.4(b) shows a section of the human genetic code
expressed using the four nucleotides from which all DNA is constructed: A, T, G, and C.

Time Series Data

Time series data is a special type of sequential data in which each record is a time series, i.e., a series
of measurements taken over time. For example, a financial data set might contain objects that are time
series of the daily prices of various stocks. As another example, consider Figure 2.4(c), which shows a
time series of the average monthly temperature for Minneapolis during the years 1982 to 1994.

Spatial Data

Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
An example of spatial data is weather data (precipitation, temperature, pressure) that is collected for a
variety of geographical locations. An important aspect of spatial data is spatial autocorrelation; i.e.,
objects that are physically close tend to be similar in other ways as well.

Data Quality

Data used for any mining application should be of good quality. When huge amounts of data are collected, data quality becomes a major issue. Data quality problems often cannot be prevented at the source. Hence, data mining focuses on (1) the detection and correction of data quality problems and (2) the use of algorithms that can tolerate poor data quality. The detection and correction of errors is often called data cleaning.

Measurement and Data Collection Issues

These issues refer to problems due to human error, limitations of measuring devices, or flaws in the
data collection process. Values or even entire data objects may be missing. In other cases, there may be spurious or duplicate objects. For example, there might be two different records for a person who has recently lived at two different addresses. There may also be inconsistencies, such as a record indicating that a person has a height of 2 meters but weighs only 2 kilograms.

Measurement and Data Collection Errors

The term measurement error refers to any problem resulting from the measuring process. A common
problem is that the value recorded differs from the true value to some extent. For continuous
attributes, the numerical difference of the measured and true value is called the error.
The term data collection error refers to errors such as omitting data objects or attribute values, or
inappropriately including a data object. Both measurement errors and data collection errors can be
either systematic or random. There are certain types of data errors that are commonplace, and there
often exist well-developed techniques for detecting and/or correcting these errors.

Noise and Artifacts

Noise is the random component of a measurement error. It may involve the distortion of a value or the
addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by
random noise. If a bit more noise were added to the time series, its shape would be lost. Data errors
may be the result of a more deterministic phenomenon, such as a streak in the same place on a set of
photographs. Such deterministic distortions of the data are often referred to as artifacts.

Precision, Bias, and Accuracy

These are used to measure the quality of the measurement process and the resulting data. Suppose that a set of repeated measurements of the same quantity is taken and that the mean of these values is used as an estimate of the true value.

Precision The closeness of repeated measurements (of the same quantity) to one another.

Bias A systematic variation of measurements from the quantity being measured.

Precision is often measured by the standard deviation of a set of values, while bias is measured by
taking the difference between the mean of the set of values and the known value of the quantity being
measured.
Accuracy The closeness of measurements to the true value of the quantity being measured. Accuracy
depends on precision and bias, but since it is a general concept, there is no specific formula for
accuracy in terms of these two quantities. One important aspect of accuracy is the use of significant
digits.
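As an illustration, precision and bias can be computed directly from a set of repeated measurements of a known quantity; the measurement values below are invented:

import statistics

# Hypothetical repeated measurements of a standard 1.000 g laboratory weight.
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.000

mean_value = statistics.mean(measurements)

# Precision: closeness of the measurements to one another,
# measured here by their sample standard deviation.
precision = statistics.stdev(measurements)

# Bias: systematic difference between the mean measurement and the true value.
bias = mean_value - true_value

print(f"mean = {mean_value:.3f}, precision = {precision:.3f}, bias = {bias:+.3f}")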

Outliers

Outliers are either (1) data objects that, in some sense, have characteristics that are different from
most of the other data objects in the data set, or (2) values of an attribute that are unusual with respect
to the typical values for that attribute.

Missing Values

It is not unusual for an object to be missing one or more attribute values. In some cases, the
information was not collected. In other cases, some attributes are not applicable to all objects.
Regardless, missing values should be taken into account during the data analysis. There are several
strategies for dealing with missing data.

Eliminate Data Objects or Attributes

A simple and effective strategy is to eliminate objects with missing values. However, even a partially specified data object contains some information, and if many objects have missing values, then a reliable analysis can be difficult or impossible. A related strategy is to eliminate attributes that have missing values, although this should be done with caution, since the eliminated attributes may be critical to the analysis.

Estimate Missing Values

Sometimes missing data can be reliably estimated. For example, consider a data set that has many
similar data points. In this situation, the missing values can be estimated using the attribute values of
the closest point. If the attribute is continuous, then the average attribute value of the nearest
neighbours is used.
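A minimal sketch of this strategy (the records and the missing value are invented): the missing continuous value is estimated as the average of that attribute over the nearest neighbours, where nearness is judged using the attributes that are present.

# Hypothetical records: (height_cm, weight_kg); one weight is missing (None).
data = [
    (160.0, 55.0),
    (162.0, 58.0),
    (175.0, 72.0),
    (178.0, 75.0),
    (161.0, None),   # weight to be estimated
]

def estimate_missing_weight(records, index, k=2):
    """Estimate the missing weight of records[index] from its k nearest
    neighbours, using the non-missing attribute (height) for the distance."""
    target_height = records[index][0]
    neighbours = sorted(
        (r for i, r in enumerate(records) if i != index and r[1] is not None),
        key=lambda r: abs(r[0] - target_height),
    )[:k]
    return sum(r[1] for r in neighbours) / len(neighbours)

print(estimate_missing_weight(data, index=4))   # average weight of the 2 closest heights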

Ignore the Missing Value during Analysis

Many data mining approaches can be modified to ignore missing values. For example, in a clustering problem, the similarity between pairs of data objects can be calculated using only the non-missing attribute values. Many classification schemes can also be modified to work with missing values.

Inconsistent Values

Data can contain inconsistent values. Consider an address field, where both a zip code and city are
listed, but the specified zip code area is not contained in that city. Some types of inconsistencies are
easy to detect. For instance, a person's height should not be negative. Sometimes inconsistent values
can be corrected.

Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another. Many
people receive duplicate mailings because they appear in a database multiple times under slightly
different names. These duplicate objects must be detected and removed. Care needs to be taken to
avoid accidentally combining data objects that are similar, but not duplicates.

Issues Related to Applications


Data quality issues can also be considered from an application viewpoint. There are many issues that are specific to particular applications and fields; a few of them are as follows.

Timeliness

Some data starts to age as soon as it has been collected. In particular, if the data provides a snapshot
of some ongoing phenomenon or process, such as the purchasing behaviour of customers or Web
browsing patterns, then this snapshot represents reality for only a limited time. If the data is out of
date, then so are the models and patterns that are based on it.

Relevance

The available data must contain the information necessary for the application. Consider the task of
building a model that predicts the accident rate for drivers. If information about the age and gender of
the driver is omitted, then it is likely that the model will have limited accuracy.

Knowledge about the Data

Ideally, data sets are accompanied by documentation that describes different aspects of the data; the
quality of this documentation can either aid or hinder the subsequent analysis. For example, if the
documentation identifies several attributes as being strongly related, these attributes are likely to
provide highly redundant information, and we may decide to keep just one.

Types of Attributes
Data Pre-Processing

Pre-processing of data should be done to make the data more suitable for data mining algorithms.

Aggregation

Aggregation combines two or more objects into a single object. Consider a data set consisting of transactions recording the daily sales of products in various store locations (Minneapolis, Chicago, Paris, ...) for different days over the course of a year. One way to aggregate transactions for this data set is to replace all the transactions of a single store with a single storewide transaction.
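A small sketch of this kind of aggregation using pandas (the transaction values are invented): daily per-product transactions are collapsed into one storewide total per store and date.

import pandas as pd

# Hypothetical daily sales transactions (store, date, product, sales).
transactions = pd.DataFrame({
    "store":   ["Minneapolis", "Minneapolis", "Chicago", "Chicago", "Paris"],
    "date":    ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02", "2024-01-01"],
    "product": ["milk", "bread", "milk", "bread", "cheese"],
    "sales":   [120.0, 80.0, 95.0, 60.0, 150.0],
})

# Aggregate: replace all transactions of a single store on a single day
# with one storewide transaction (the sum of its sales).
storewide = transactions.groupby(["store", "date"], as_index=False)["sales"].sum()
print(storewide)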

Sampling

Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed
instead of taking the entire data. Sampling is very useful in data mining. Data miners sample because
it is too expensive or time consuming to process all the data. In some cases, using a sampling
algorithm can reduce the data size to the point where a better, but more expensive algorithm can be
used. Using a sample will work almost as well as using the entire data set if the sample is
representative.

Sampling Approaches

The simplest type of sampling is simple random sampling. For this type of sampling, there is an
equal probability of selecting any particular item. There are two variations on random sampling: (1)
sampling without replacement (2) sampling with replacement. In sampling with replacement, the
same object can be picked more than once.
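Both variations can be sketched with Python's standard library (the data objects are just placeholder integers):

import random

data = list(range(1, 101))          # 100 hypothetical data objects
random.seed(42)                     # for reproducibility

# Sampling without replacement: each object can be picked at most once.
without_replacement = random.sample(data, k=10)

# Sampling with replacement: the same object may be picked more than once.
with_replacement = random.choices(data, k=10)

print(without_replacement)
print(with_replacement)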

Progressive Sampling
The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes are sometimes used. These approaches start with a small sample and then increase the sample size until a sample of sufficient size has been obtained. This technique does not require computing the correct sample size initially, but it requires that there be a way to evaluate the sample to judge whether it is large enough.
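A minimal sketch of the progressive-sampling idea, assuming a hypothetical evaluate() function that judges whether a sample is good enough (for example, by checking whether a model's accuracy has stopped improving):

import random

def progressive_sample(data, evaluate, initial_size=100, growth=2.0):
    """Grow the sample until evaluate(sample) says it is large enough,
    or until the whole data set has been used."""
    size = initial_size
    while True:
        size = min(int(size), len(data))
        sample = random.sample(data, size)
        if evaluate(sample) or size == len(data):
            return sample
        size *= growth   # increase the sample size and try again

# Example usage with a toy evaluation criterion (purely illustrative):
data = list(range(10_000))
sample = progressive_sample(data, evaluate=lambda s: len(s) >= 800)
print(len(sample))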

Dimensionality Reduction

Data sets can have a large number of features. Dimensionality refers to the number of attributes or
features. Consider a set of documents, where each document is represented by a vector whose
components are the frequencies with which each word occurs in the document. In such cases, there
are typically many attributes, one for each word in the vocabulary. Many data mining algorithms work
better if the dimensionality is lower. Dimensionality reduction eliminates irrelevant features, reduces
noise, and provides a more understandable model.

The Curse of Dimensionality

The curse of dimensionality refers to the phenomenon that many types of data analysis become
significantly harder as the dimensionality of the data increases. As the dimensionality increases, the
data becomes increasingly sparse in the space that it occupies.

Linear Algebra Techniques for Dimensionality Reduction

Some of the most common approaches for dimensionality reduction, for continuous data, use
techniques from linear algebra to project the data from a high-dimensional space into a lower-
dimensional space. Principal Components Analysis (PCA) is a linear algebra technique for
continuous attributes that finds new attributes (principal components) that (1) are linear combinations
of the original attributes, (2) are orthogonal (perpendicular) to each other, and (3) capture the
maximum amount of variation in the data. For example, the first two principal components capture as
much of the variation in the data as is possible with two orthogonal attributes that are linear
combinations of the original attributes.
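A compact sketch of the PCA computation using only NumPy (the data is random and purely illustrative): center the data, take the eigenvectors of the covariance matrix, and project onto the top components.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 continuous attributes

# 1. Center the data (subtract the mean of each attribute).
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the attributes and its eigen-decomposition.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance (eigenvalue).
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]    # keep the first two principal components

# 4. Project the data into the lower-dimensional space.
X_reduced = X_centered @ components
print(X_reduced.shape)                     # (100, 2)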

Feature Subset Selection

Another way to reduce the dimensionality is to use only a subset of the features. Redundant features
duplicate much or all of the information contained in one or more other attributes. Irrelevant features
contain no useful information for the data mining task. For instance, students' ID numbers are
irrelevant to the task of predicting students' grade point averages. There are three standard approaches
to feature selection: embedded, filter, and wrapper.

Embedded approaches

Feature selection occurs naturally as part of the data mining algorithm. During the operation of the
data mining algorithm, the algorithm itself decides which attributes to use and which to ignore.
Algorithms for building decision tree classifiers often operate in this manner.

Filter approaches

Features are selected before the data mining algorithm is run, using some approach that is independent of the data mining task. For example, we might select sets of attributes whose pairwise correlation is as low as possible.
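A rough sketch of one possible filter criterion (random illustrative data; the greedy low-correlation rule below is just one simple choice, not a standard named algorithm):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] * 0.95 + rng.normal(scale=0.1, size=200)   # attribute 3 is nearly a copy of 0

def low_correlation_subset(X, threshold=0.7):
    """Greedy filter: keep an attribute only if its absolute Pearson
    correlation with every already-selected attribute is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in selected):
            selected.append(j)
    return selected

print(low_correlation_subset(X))   # attribute 3 is dropped as redundant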
Wrapper approaches

These methods use the target data mining algorithm as a black box to find the best subset of attributes, evaluating candidate subsets by running the algorithm on them, but typically without enumerating all possible subsets. By contrast, the embedded approaches described earlier are algorithm-specific.

Architecture for Feature Subset Selection

It is possible to encompass both the filter and wrapper approaches within a common architecture. The feature selection process is viewed as consisting of four parts: a measure for evaluating a subset, a search strategy that controls the generation of a new subset of features, a stopping criterion, and a validation procedure.

Feature Weighting

Feature weighting is an alternative to keeping or eliminating features. More important features are
assigned a higher weight, while less important features are given a lower weight. These weights are
sometimes assigned based on domain knowledge about the relative importance of features.

Feature Creation

It is frequently possible to create, from the original attributes, a new set of attributes that captures the
important information in a data set much more effectively. Three related methodologies for creating
new attributes are: feature extraction, mapping the data to a new space, and feature construction.

Feature Extraction

The creation of a new set of features from the original raw data is known as feature extraction.
Consider a set of photographs, where each photograph is to be classified according to whether or not
it contains a human face. The raw data is a set of pixels and is not suitable for many types of
classification algorithms. If the data is processed to provide higher level features, such as the presence
or absence of certain types of edges and areas that are highly correlated with the presence of human
faces, then a much broader set of classification techniques can be applied to this problem.

Mapping the Data to a New Space

A totally different view of the data can reveal important and interesting features. Consider time series
data, which often contains periodic patterns. If there is only a single periodic pattern and not much
noise, then the pattern is easily detected. If there are a number of periodic patterns and a significant
amount of noise is present, then these patterns are hard to detect. Such patterns can be detected by
applying a Fourier transform to the time series in order to change to a representation in which
frequency information is explicit. Besides the Fourier transform, the wavelet transform is also very
useful for time series and other types of data.
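A brief NumPy sketch of this idea (the time series is synthetic): a periodic pattern that is hard to see in the noisy series appears as a clear peak after mapping the data to the frequency domain with the fast Fourier transform.

import numpy as np

# Synthetic time series: a 5 Hz periodic pattern buried in noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 500, endpoint=False)          # 1 second sampled at 500 Hz
series = np.sin(2 * np.pi * 5 * t) + rng.normal(scale=1.0, size=t.size)

# Map the data to a new space: the frequency domain.
spectrum = np.abs(np.fft.rfft(series))
frequencies = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# The dominant non-zero frequency reveals the hidden periodic pattern.
dominant = frequencies[np.argmax(spectrum[1:]) + 1]
print(f"dominant frequency is about {dominant:.1f} Hz")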

Feature Construction

Sometimes the features in the original data sets have the necessary information, but it is not in a form
suitable for the data mining algorithm. In this situation, one or more new features constructed out of
the original features can be more useful than the original features.

Discretization and Binarization

Some data mining algorithms, especially certain classification algorithms, require that the data be in
the form of categorical attributes. Algorithms that find association patterns require that the data be in
the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a
categorical attribute (discretization), and both continuous and discrete attributes may need to be
transformed into one or more binary attributes (binarization).

Binarization

A simple technique to binarize a categorical attribute is the following: if there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m - 1]. If the attribute is ordinal, then order must be maintained by the assignment. Next, convert each of these m integers to a binary number.

Education Level: Binarize into three binary attributes.

 HighSchool: 1 if Education Level is High School, 0 otherwise
 BachelorsDegree: 1 if Education Level is Bachelor's Degree, 0 otherwise
 MastersDegree: 1 if Education Level is Master's Degree, 0 otherwise

Employment Status: Binarize into one binary attribute.

 Employed: 1 if Employment Status is Employed, 0 otherwise
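As a small illustration, the one-binary-attribute-per-value encoding above can be written in plain Python (the records below are invented):

# Hypothetical categorical records.
records = [
    {"EducationLevel": "High School", "EmploymentStatus": "Employed"},
    {"EducationLevel": "Master's Degree", "EmploymentStatus": "Unemployed"},
    {"EducationLevel": "Bachelor's Degree", "EmploymentStatus": "Employed"},
]

def binarize(record):
    """Create one asymmetric binary attribute per categorical value."""
    return {
        "HighSchool": int(record["EducationLevel"] == "High School"),
        "BachelorsDegree": int(record["EducationLevel"] == "Bachelor's Degree"),
        "MastersDegree": int(record["EducationLevel"] == "Master's Degree"),
        "Employed": int(record["EmploymentStatus"] == "Employed"),
    }

for record in records:
    print(binarize(record))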

Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. In general, the best discretization depends on the algorithm being used, as well as on the other attributes being considered; in practice, however, the discretization of an attribute is usually considered in isolation. The transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories to have, and determining how to map the values of the continuous attribute to these categories. In the first step, the values of the continuous attribute are sorted and then divided into n intervals by specifying n-1 split points. In the second step, all the values in one interval are mapped to the same categorical value.

Example:
Age: Discretize into three categories: Young, Middle-aged, and Old.

 Young: Age <= 25
 Middle-aged: 25 < Age <= 40
 Old: Age > 40

Income: Discretize into two categories: LowIncome and HighIncome.

 LowIncome: Income <= $50,000
 HighIncome: Income > $50,000

Unsupervised Discretization

A basic distinction between discretization methods for classification is whether class information is
used (supervised) or not (unsupervised).

The equal width approach divides the range of the attribute into a user-specified number of intervals
each having the same width.

Example: If we have a continuous attribute "Age" ranging from 20 to 60 and we want to discretize it into 4 intervals using the equal width approach, each interval will have a width of (60 - 20) / 4 = 10 years.

So, the intervals would be [20-30), [30-40), [40-50), and [50-60].

An equal frequency (equal depth) approach tries to put the same number of objects into each
interval.

Example: If we have a dataset of 100 individuals and we want to discretize their ages into 4 intervals using the equal frequency approach, each interval will contain 100 / 4 = 25 individuals. The interval boundaries are chosen so that each interval holds approximately the same number of individuals.
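Both unsupervised schemes can be sketched with NumPy (the ages are randomly generated for illustration):

import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(20, 61, size=100)      # 100 hypothetical ages in [20, 60]

n_bins = 4

# Equal width: split the range [min, max] into 4 intervals of the same width.
equal_width_edges = np.linspace(ages.min(), ages.max(), n_bins + 1)

# Equal frequency (equal depth): choose edges so each interval holds about 25 objects.
equal_freq_edges = np.percentile(ages, np.linspace(0, 100, n_bins + 1))

print("equal width edges:    ", equal_width_edges)
print("equal frequency edges:", equal_freq_edges)

# Map each age to its equal-width interval index (0..3) and count objects per interval.
labels = np.digitize(ages, equal_width_edges[1:-1], right=True)
print(np.bincount(labels))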

Supervised Discretization

Using class label information for discretization often produces better results. A simple approach is to place the splits in a way that maximizes the purity of the intervals. Some statistically based approaches start with small intervals and create larger intervals by merging adjacent intervals that are similar. Entropy-based approaches are among the most promising approaches to discretization.

Example:

Suppose we have a dataset containing information about loan applicants, including their income and whether their loan applications were approved or denied. We want to discretize the income variable into intervals that help predict the likelihood of loan approval.

Applicant   Income    Loan Approved
1           25,000    Approved
2           35,000    Denied
3           40,000    Approved
4           50,000    Approved
5           60,000    Denied
6           70,000    Approved
7           80,000    Approved
8           90,000    Denied
9           100,000   Denied
10          120,000   Approved

Step 1: Explore Potential Split Points

Let's consider potential split points based on common income thresholds or significant values in the
dataset, such as 40,000, 60,000, and 80,000.

Step 2: Calculate Entropy for Each Split Point

For each potential split point, calculate the entropy of the resulting intervals based on the loan approval labels, and compare the entropies to assess the purity of the resulting intervals.

Splitting at 40,000:

Interval [0-40,000]:

 Approved: 2 (Applicants 1, 3)
 Denied: 1 (Applicant 2)

Entropy of interval [0-40,000]:
E = -(2/3) log2(2/3) - (1/3) log2(1/3) = approximately 0.918

Interval (40,000-120,000]:

 Approved: 4 (Applicants 4, 6, 7, 10)
 Denied: 3 (Applicants 5, 8, 9)

Entropy of interval (40,000-120,000]:
E = -(4/7) log2(4/7) - (3/7) log2(3/7) = approximately 0.985

The candidate splits at 60,000 and 80,000 are evaluated in the same way. Note that an interval containing only approved (or only denied) applicants would be perfectly pure and would have an entropy of 0.

To choose the split point that minimizes entropy, we compare the entropy values calculated for each candidate. The split point with the lowest (weighted) entropy is the one that maximizes the purity, or homogeneity, of the resulting intervals.

Based on these entropy values, the split that minimizes entropy and maximizes the purity of the resulting intervals is the split at 60,000: the entropy of the interval [0-60,000) is lower than that of the intervals produced by the other candidate splits, indicating higher purity with respect to loan approval.

Interval [0-60,000): Lower to middle-income applicants

Interval [60,000-120,000): Higher-income applicants

These intervals provide a discretized representation of income, dividing the applicants into two distinct categories. This representation is useful for further analysis, such as predicting the likelihood of loan approval based on income.
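For reference, the entropy values for the split at 40,000 can be reproduced with a short Python script, using the interval membership listed above; the weighted entropy of the split is also shown, since it is a common way to score a candidate split:

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Split at 40,000: class labels in each interval, as listed in the example.
lower = ["Approved", "Denied", "Approved"]                                  # applicants 1-3
upper = ["Approved", "Denied", "Approved", "Approved", "Denied", "Denied",  # applicants 4-10
         "Approved"]

e_lower = entropy(lower)
e_upper = entropy(upper)
print(round(e_lower, 3), round(e_upper, 3))        # 0.918 0.985

# Overall quality of the split: entropy of each interval weighted by its size.
n = len(lower) + len(upper)
weighted = (len(lower) / n) * e_lower + (len(upper) / n) * e_upper
print(round(weighted, 3))                          # 0.965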

Variable Transformation

A variable transformation refers to a transformation that is applied to all the values of a variable. For
each object, the transformation is applied to the value of the variable for that object. For example, if
only the magnitude of a variable is important, then the values of the variable can be transformed by
taking the absolute value. Two important types of variable transformations are simple functional
transformations and normalization.

Simple Functions

For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, examples of such transformations include x^k, log x, and e^x.

Normalization or Standardization

Another common type of variable transformation is standardization or normalization of a variable.


The goal of standardization or normalization is to make an entire set of values have a particular
property. If different variables are to be combined in some way, then such a transformation is
necessary to avoid a variable with large values dominating the results of the calculation.

Consider comparing people based on two variables: age and income. For any two people, the
difference in income will likely be much higher in absolute terms than the difference in age.

 Let's say we have two variables: "Age" and "Income".
 Age ranges from 20 to 70, while Income ranges from 10,000 to 100,000.
 Without normalization, Income's larger range could dominate the analysis.
 Normalization ensures both variables are on a similar scale, preventing one from dominating the analysis.
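A brief sketch of two common normalizations, z-score standardization and min-max scaling, applied to invented Age and Income values with NumPy:

import numpy as np

age = np.array([20, 35, 50, 70], dtype=float)
income = np.array([10_000, 40_000, 65_000, 100_000], dtype=float)

def standardize(x):
    """Z-score standardization: mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

def min_max(x):
    """Min-max normalization: rescale the values to the interval [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

print(standardize(age), standardize(income))
print(min_max(age), min_max(income))
# After either transformation, Age and Income are on comparable scales,
# so neither variable dominates a combined calculation such as a distance.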

Note: Prepare the similarity and dissimilarity measures from the class notes (Pearson correlation, cosine similarity, Jaccard coefficient, Euclidean distance, Manhattan distance).
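As a quick reference, here is a minimal NumPy sketch of the listed measures (the example vectors are invented; the Jaccard coefficient is computed on binary vectors):

import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    return float(np.sum(np.abs(x - y)))

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_correlation(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: f11 / (f11 + f10 + f01)."""
    x, y = x.astype(bool), y.astype(bool)
    return float(np.sum(x & y) / np.sum(x | y))

x = np.array([1, 0, 3, 2, 0], dtype=float)
y = np.array([2, 0, 1, 1, 1], dtype=float)
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y), pearson_correlation(x, y))
print(jaccard(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 1])))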

-------XXXXXXX-------
