DWDM UNIT-2
Introduction to Data Mining & Data Pre-processing
7. Knowledge presentation: In this step, visualization and knowledge representation techniques are used to present the mined knowledge to users.
User Interaction
The user plays an important role in the data mining process.
Interesting areas of research include how to interact with a data mining
system, how to incorporate a user’s background knowledge in mining, and
how to visualize and comprehend data mining results.
Interactive mining: The data mining process should be highly
interactive. Thus, it is important to build flexible user interfaces and an
exploratory mining environment, facilitating the user’s interaction with
the system.
Incorporation of background knowledge: Background knowledge,
constraints, rules, and other information regarding the domain under study
should be incorporated into the knowledge discovery process. Such
knowledge can be used for pattern evaluation as well as to guide the
search toward interesting patterns.
Ad hoc data mining and data mining query languages: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Efficiency and Scalability
Efficiency and scalability are always considered when comparing data
mining algorithms.
As data amounts continue to multiply, these two factors are especially
critical.
Efficiency and scalability of data mining algorithms: Data mining
algorithms must be efficient and scalable in order to effectively extract
information from huge amounts of data in many data repositories or in
dynamic data streams. In other words, the running time of a data mining
algorithm must be predictable, short, and acceptable by applications.
Efficiency, scalability, performance, optimization, and the ability to
execute in real time are key criteria that drive the development of many
new data mining algorithms.
Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. Such algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged.
Example: Consider comparing two groups of customers: those who shop for computer products frequently (e.g., more than twice a month) and those who rarely shop for such products (e.g., less than three times a year). The resulting description provides a general comparative profile of these customers, such as that 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either seniors or youths and have no university degree.
The output of data characterization can be presented in various forms.
Examples include pie charts, bar charts, curves, multidimensional data
cubes, and multidimensional tables, including crosstabs. The resulting
descriptions can also be presented as generalized relations or in rule form
(called characteristic rules).
The forms of output presentation of discrimination are similar to those for
characteristic descriptions, although discrimination descriptions should
include comparative measures that help to distinguish between the target
and contrasting classes. Discrimination descriptions expressed in the form
of rules are referred to as discriminant rules.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data. There are
many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent
substructures.
A frequent itemset typically refers to a set of items that often appear
together in a transactional data set—for example, milk and bread, which
are frequently bought together in grocery stores by many customers.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
A substructure can refer to different structural forms (e.g., graphs, trees, or
lattices) that may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations
and correlations within data.
Example: Suppose that, as a marketing manager of a company, you want to know which items are frequently purchased together (i.e., within the same transaction). An example of such a rule, mined from the company's transactional database, is
features of the items, such as price, brand, place made, type, and
category.
Example2: A common example of classification comes with detecting
spam emails. To write a program to filter out spam emails, a computer
programmer can train a machine learning algorithm with a set of spam-
like emails labelled as spam and regular emails labelled as not-spam. The
idea is to make an algorithm that can learn characteristics of spam emails
from this training set so that it can filter out spam emails when it
encounters new emails.
Cluster Analysis
Clustering analyzes data objects without consulting class labels.
Clustering can be used to generate class labels for a group of data.
The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is,
clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather dissimilar to objects
in other clusters.
Each cluster so formed can be viewed as a class of objects, from which
rules can be derived.
Clustering can also facilitate taxonomy formation, that is, the
organization of observations into a hierarchy of classes that group similar
events together.
Example: Cluster analysis can be performed on store customer data to
identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing. The following figure shows a 2-D plot of customers with respect to customer locations in a city. Three clusters of data points are evident.
Outlier Analysis
Data objects in a data set that do not comply with the general behaviour or model of the data are called outliers, anomalies, or deviations.
Many data mining methods discard outliers as noise or exceptions.
However, in some applications (e.g., fraud detection) the rare events can
be more interesting than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier analysis, anomaly mining, or deviation detection.
Example: Outlier analysis may uncover fraudulent usage of credit cards
by detecting purchases of unusually large amounts for a given account
number in comparison to regular charges incurred by the same account.
Outlier values may also be detected with respect to the locations and types
of purchase, or the purchase frequency.
Data Evolution Analysis
Data evolution analysis describes and models regularities or trends for
objects whose behaviour changes over time.
Although this may include characterization, discrimination, association,
classification, or clustering of time-related data, distinct features of such
an analysis include time-series data analysis, sequence or periodicity
pattern matching, and similarity-based data analysis.
Are All Patterns Interesting?
A data mining system has the potential to generate thousands or even
millions of patterns, or rules.
If the data objects are stored in a database, they are data tuples. That is,
the rows of a database correspond to the data objects, and the columns
correspond to the attributes.
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a
data object.
The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
The term dimension is commonly used in data warehousing. Machine
learning literature tends to use the term feature, while statisticians prefer
the term variable. Data mining and database professionals commonly use
the term attribute.
Attributes describing a customer object can include, for example,
customer ID, name, and address.
Observed values for a given attribute are known as observations.
A set of attributes used to describe a given object is called an attribute
vector (or feature vector). The distribution of data involving one attribute
(or variable) is called univariate. A bivariate distribution involves two
attributes, and so on.
Nominal Attributes
Nominal means “relating to names.”
The values of a nominal (or categorical) attribute are symbols or names
of things, where each value represents some kind of category, code, or
state.
The values do not have any meaningful order.
In computer science, the values are also known as enumerations.
Examples: Suppose that hair color, marital status and occupation are three
attributes describing person objects.
hair color with values black, brown, blond, red, auburn, gray, and
white.
marital status with values single, married, divorced, and widowed.
occupation, with the values teacher, dentist, programmer, farmer,
and so on.
Binary Attributes
A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent and 1
means that it is present.
Binary attributes are referred to as Boolean if the two states correspond to
true and false.
Examples:
Given the attribute smoker describing a patient object, 1 indicates
that the patient smokes, while 0 indicates that the patient does not.
Similarly, suppose the patient undergoes a medical test that has two
possible outcomes. The attribute medical test is binary, where a
value of 1 means the result of the test for the patient is positive,
while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable
and carry the same weight; that is, there is no preference on which
outcome should be coded as 0 or 1. One such example could be the
attribute gender having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a
medical test for HIV. By convention, we code the most important
outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the
other by 0 (e.g., HIV negative).
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Examples:
Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+) and professional rank.
The dissimilarity between two objects i and j described by nominal attributes can be computed from the ratio of mismatches, $d(i, j) = \frac{p - m}{p}$, where m is the number of matches (i.e., the number of attributes for which i and j are in the same state), and p is the total number of attributes describing the objects.
For binary attributes, a 2×2 contingency table can be built for objects i and j: q is the number of attributes that equal 1 for both objects, r is the number that equal 1 for object i but 0 for object j, s is the number that equal 0 for object i but 1 for object j, and t is the number of attributes that equal 0 for both objects i and j. The total number of attributes is p, where p = q + r + s + t.
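As an illustration (a minimal sketch: the formulas d(i, j) = (r + s)/(q + r + s + t) for symmetric and d(i, j) = (r + s)/(q + r + s) for asymmetric binary attributes follow the standard treatment, and the example object vectors are hypothetical):

```python
# A minimal sketch of binary dissimilarity from the contingency counts
# q, r, s, t described above; the example object vectors are hypothetical.
def binary_dissimilarity(obj_i, obj_j, asymmetric=False):
    q = sum(a == 1 and b == 1 for a, b in zip(obj_i, obj_j))  # 1 in both
    r = sum(a == 1 and b == 0 for a, b in zip(obj_i, obj_j))  # 1 in i, 0 in j
    s = sum(a == 0 and b == 1 for a, b in zip(obj_i, obj_j))  # 0 in i, 1 in j
    t = sum(a == 0 and b == 0 for a, b in zip(obj_i, obj_j))  # 0 in both
    if asymmetric:
        # Negative matches (t) are considered unimportant and are ignored.
        return (r + s) / (q + r + s)
    return (r + s) / (q + r + s + t)

print(binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))        # 0.4
print(binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0], True))  # 0.666...
```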
Manhattan Distance
Minkowski Distance
Supremum Distance
Note:
The Euclidean distance of two n-dimensional vectors x and y: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
The Manhattan distance of two n-dimensional vectors x and y: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
The Minkowski distance of two n-dimensional vectors x and y: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^h \right)^{1/h}$
The Supremum distance of two n-dimensional vectors x and y: $d(x, y) = \max_{1 \le i \le n} |x_i - y_i|$
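As a quick sketch of these measures (assuming NumPy; the two vectors are the objects used in Question 6 at the end of this unit):

```python
# A minimal sketch of the four distance measures above, assuming NumPy.
import numpy as np

x = np.array([22.0, 1.0, 42.0, 10.0])
y = np.array([20.0, 0.0, 36.0, 8.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))
manhattan = np.sum(np.abs(x - y))
h = 3
minkowski = np.sum(np.abs(x - y) ** h) ** (1.0 / h)
supremum = np.max(np.abs(x - y))

print(euclidean, manhattan, minkowski, supremum)  # ~6.71, 11.0, ~6.15, 6.0
```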
Cosine Similarity
A document can be represented by thousands of attributes, each recording
the frequency of a particular word (such as a keyword) or phrase in the
document. Thus, each document is an object represented by what is called
a term-frequency vector.
For example, in the following table, we see that Document1 contains five instances of the word team, while hockey occurs three times. The word coach is absent from the entire document, as indicated by a count value of 0.
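Cosine similarity compares two term-frequency vectors by the cosine of the angle between them: the dot product of the vectors divided by the product of their Euclidean lengths, so documents with similar word-frequency profiles score close to 1. A minimal sketch, assuming NumPy and illustrative counts (not the exact table above):

```python
# A minimal cosine-similarity sketch for term-frequency vectors (NumPy
# assumed); the word counts are illustrative.
import numpy as np

doc1 = np.array([5.0, 0.0, 3.0, 0.0, 2.0])  # e.g. counts of team, coach, hockey, ...
doc2 = np.array([3.0, 0.0, 2.0, 0.0, 1.0])

cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(cosine, 3))  # close to 1.0: similar frequency profiles
```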
Pre-processing: An Overview
Real-world data tend to be inaccurate (incorrect, noisy, or dirty), incomplete (missing values), and inconsistent, owing to their typically huge size and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results.
The process of transforming raw data into an appropriate, useful, and efficient format for subsequent analysis is called data preprocessing.
Data preprocessing prepares raw data for further processing.
Data preprocessing techniques substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.
Why Preprocess the Data?
(Need of Data Pre-processing)
Real world data tend to be inaccurate (incorrect or noisy or dirty), incomplete
and inconsistent.
incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data
inaccurate or noisy: containing errors, or values that deviate from the expected
inconsistent: containing discrepancies in the department codes used to categorize items
Data preprocessing techniques improve the quality, accuracy and
efficiency of the subsequent mining.
Data preprocessing is an important step in the knowledge discovery process because quality decisions must be based on quality data.
Detecting data anomalies, rectifying them early and reducing the data to
be analyzed can lead to huge payoffs for decision making.
Major Tasks in Data Pre-processing
(Data Pre-processing Techniques (or) Forms of data Pre-processing)
The major steps involved in data preprocessing are:
Data Cleaning
Data Integration
Data Reduction
Data Transformation.
Data cleaning can be applied to remove noise and correct inconsistencies in
data. Data integration merges data from multiple sources into a coherent data
store such as a data warehouse. Data reduction can reduce data size by, for
instance, aggregating, eliminating redundant features, or clustering. Data
transformations (e.g., normalization) may be applied, where data are scaled
to fall within a smaller range like 0.0 to 1.0.
Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies.
Data integration is the merging of data from multiple data stores. Careful
integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set. This can help improve the accuracy and speed of the
subsequent data mining process.
Data reduction obtains a reduced representation of the data set that is much
smaller in volume, yet produces the same (or almost the same) analytical
results.
Data reduction strategies include dimensionality reduction and
numerosity reduction.
In dimensionality reduction, data encoding schemes are applied so as to
obtain a reduced or “compressed” representation of the original data.
Examples include data compression techniques (e.g., wavelet transforms
and principal components analysis), attribute subset selection (e.g.,
removing irrelevant attributes), and attribute construction (e.g., where a
small set of more useful attributes is derived from the original set).
In numerosity reduction, the data are replaced by alternative, smaller
representations using parametric models (e.g., regression or log-linear
models) or nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation).
Data Transformation: In data transformation, the data are transformed or
consolidated into forms appropriate for mining. In this preprocessing step,
the data are transformed or consolidated so that the resulting mining
process may be more efficient, and the patterns found may be easier to
understand.
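As a small illustration of one such transformation, scaling values into the range [0.0, 1.0], the sketch below applies min-max normalization (NumPy assumed; the values are the group of data from Question 20 at the end of this unit):

```python
# A minimal min-max normalization sketch, assuming NumPy; the values are
# taken from the normalization exercise at the end of this unit.
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
new_min, new_max = 0.0, 1.0

scaled = (values - values.min()) / (values.max() - values.min())
scaled = scaled * (new_max - new_min) + new_min
print(scaled)  # [0.    0.125 0.25  0.5   1.   ]
```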
Data Cleaning
Data cleaning (or data cleansing) routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in
the data.
Basic methods for data cleaning are:
Handling missing values
Data smoothing techniques
Handling Missing Values
A tuple that has no recorded value for an attribute is said to have a missing value. Reasons for missing values may include:
The person originally asked to provide a value for the attribute refuses and/or finds that the information requested is not applicable (e.g., a license number attribute left blank by nondrivers).
The data entry person does not know the correct value.
The value is to be provided by a later step of the process.
Methods for filling in the missing values:
Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not
very effective, unless the tuple contains several attributes with missing
values. It is especially poor when the percentage of missing values per
attribute varies considerably.
Fill in the missing value manually: In general, this approach is time
consuming and may not be feasible given a large data set with many
missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown." Hence, although this method is simple, it is not foolproof.
Use a measure of central tendency for the attribute (e.g., the mean
or median) to fill in the missing value: For normal (symmetric) data
distributions, the mean can be used, while skewed data distribution
should employ the median
Use the attribute mean or median for all samples belonging to the
same class as the given tuple: For example, if classifying customers
according to credit risk, we may replace the missing value with the
mean income value for customers in the same credit risk category as
that of the given tuple. If the data distribution for a given class is
skewed, the median value is a better choice.
Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
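A minimal sketch of two of these strategies (a global constant and the attribute mean), assuming pandas is available; the column names and values are hypothetical:

```python
# A minimal sketch of two missing-value strategies with pandas (assumed
# available); the DataFrame contents are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", np.nan, "farmer", np.nan],
    "income":     [40000.0, 52000.0, np.nan, 61000.0],
})

# Strategy: fill a nominal attribute with a global constant.
df["occupation"] = df["occupation"].fillna("Unknown")

# Strategy: fill a numeric attribute with a measure of central tendency.
df["income"] = df["income"].fillna(df["income"].mean())

print(df)
```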
Noisy Data
Noise is a random error or variance in a measured variable.
We can “smooth” out the data to remove the noise.
Data smoothing techniques.
Binning
Regression
Outlier Analysis
Binning
Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing.
In smoothing by bin means, each value in a bin is replaced by the mean
value of the bin
Smoothing by bin medians can be employed, in which each bin value is
replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then
replaced by the closest boundary value.
As an example of these binning techniques, suppose the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.
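A minimal sketch of the three smoothing variants, assuming NumPy and equal-frequency bins of depth 3; the price values are the sorted data used later in Question 22:

```python
# A minimal sketch of smoothing by bin means, medians, and boundaries,
# assuming NumPy; prices are the sorted values from the exercise below.
import numpy as np

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

for b in bins:
    means = [round(float(np.mean(b)), 1)] * len(b)       # smoothing by bin means
    medians = [float(np.median(b))] * len(b)             # smoothing by bin medians
    boundaries = [min(b) if v - min(b) <= max(b) - v else max(b)
                  for v in b]                            # smoothing by bin boundaries
    print(b, means, medians, boundaries)
```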
Regression:
Data smoothing can also be done by regression, a technique that conforms
data values to a function.
Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
Outlier analysis:
Outliers may be detected by clustering, for example, where similar values
are organized into groups, or “clusters.” Intuitively, values that fall outside
of the set of clusters may be considered outliers.
Data Integration
Data integration combines data from multiple sources to form a coherent
data store.
Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set. This can help improve the
accuracy and speed of the subsequent data mining process.
The resolution of semantic heterogeneity, metadata, correlation analysis,
tuple duplication detection, and data conflict detection contribute to
smooth data integration.
Issues that must be considered during such integration include:
Schema integration: The metadata from the different data sources
must be integrated in order to match up equivalent real-world
entities. This is referred to as the entity identification problem.
Handling redundant data: Derived attributes may be redundant,
and inconsistent attribute naming may also lead to redundancies in
the resulting data set. Some redundancies can be detected by
correlation analysis
Data reduction strategies covered in the following sections include:
Dimensionality Reduction (e.g., wavelet transforms)
Numerosity Reduction: parametric methods (regression, log-linear models)
Data Compression
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector X, transforms it to a numerically
different vector, X’, of wavelet coefficients. The two vectors are of the same
length.
The usefulness lies in the fact that the wavelet transformed data can be
truncated. A compressed approximation of the data can be retained by storing
only a small fraction of the strongest of the wavelet coefficients.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines.
Wavelet transforms can be applied to multidimensional data such as a data
cube. This is done by first applying the transform to the first dimension, then to
the second, and so on.
Wavelet transforms have many real world applications, including the
compression of fingerprint images, computer vision, analysis of time- series
data, and data cleaning.
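A minimal sketch of wavelet-based reduction, assuming the PyWavelets (pywt) package is available; the data vector and the threshold value are illustrative:

```python
# A minimal DWT sketch assuming PyWavelets; values are illustrative.
import numpy as np
import pywt

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# Single-level Haar transform: approximation + detail coefficients.
approx, detail = pywt.dwt(data, 'haar')

# Keep only the strongest coefficients by zeroing the small detail values.
detail_truncated = pywt.threshold(detail, value=2.0, mode='hard')

# Reconstruct from the truncated coefficients: an approximation of the data.
reconstructed = pywt.idwt(approx, detail_truncated, 'haar')
print(reconstructed)
```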
Principal Components Analysis (PCA)
Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis (PCA; also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction.
The principal components are sorted in order of decreasing "significance" or strength; they essentially serve as a new set of axes for the data, providing important information about variance. That is, the sorted axes are such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on. For example, the figure below shows the first two principal components, Y1 and Y2, for a given set of data originally mapped to the axes X1 and X2. This information helps identify groups or patterns within the data.
Because the components are sorted in decreasing order of "significance," the data size can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
PCA can be applied to ordered and unordered attributes, and can
handle sparse data and skewed data.
Multidimensional data of more than two dimensions can be handled by
reducing the problem to two dimensions.
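A minimal PCA sketch using NumPy's eigendecomposition; the data matrix X and the choice of k = 1 component are illustrative assumptions:

```python
# A minimal PCA sketch via the covariance matrix, assuming NumPy.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Center the data so each attribute has zero mean.
X_centered = X - X.mean(axis=0)

# 2. Eigendecomposition of the covariance matrix gives orthogonal axes
#    (principal components), sorted here by decreasing variance.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]

# 3. Project onto the k strongest components (dimensionality reduction).
k = 1
X_reduced = X_centered @ components[:, :k]
print(X_reduced.shape)  # (6, 1)
```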
Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may
be irrelevant to the mining task or redundant.
Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions).
The goal of attribute subset selection is to find a minimum set of attributes
such that the resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes.
Basic heuristic methods of attribute subset selection include the following:
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining original attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise
forward selection and backward elimination methods can be combined so that,
at each step, the procedure selects the best attribute and removes the worst
from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and
CART) were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the
algorithm chooses the “best” attribute to partition the data into individual
classes. When decision tree induction is used for attribute subset selection, a
tree is constructed from the given data. All attributes that do not appear in the
tree are assumed to be irrelevant. The set of attributes appearing in the tree
form the reduced subset of attributes.
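A minimal sketch of the stepwise forward selection heuristic; the `score` callable is a hypothetical stand-in for whatever evaluation the mining task uses (e.g., cross-validated accuracy of a classifier on that attribute subset):

```python
# A minimal sketch of stepwise forward selection; `score` is a hypothetical
# evaluation function supplied by the caller.
from typing import Callable, List, Set

def forward_selection(attributes: List[str],
                      score: Callable[[Set[str]], float],
                      k: int) -> Set[str]:
    """Greedily add, at each step, the attribute that most improves the score."""
    selected: Set[str] = set()
    while len(selected) < k:
        best_attr, best_score = None, float("-inf")
        for attr in attributes:
            if attr in selected:
                continue
            candidate_score = score(selected | {attr})
            if candidate_score > best_score:
                best_attr, best_score = attr, candidate_score
        if best_attr is None:  # no attributes left to add
            break
        selected.add(best_attr)
    return selected
```

Stepwise backward elimination is the mirror image: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.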
Regression and Log-Linear Models:
(Parametric Data Reduction)
Regression and log-linear models can be used to approximate the given data.
In (simple) linear regression, the data are modeled to fit a straight line. For
example, a random variable, y (called a response variable), can be modeled as
a linear function of another random variable, x (called a predictor variable),
with the equation y = wx + b, where the variance of y is assumed to be
constant.
In the context of data mining, x and y are numeric database attributes. The
coefficients, w and b (called regression coefficients), specify the slope of the
line and the y-intercept, respectively. These coefficients can be solved for by
the method of least squares, which minimizes the error between the actual line
separating the data and the estimate of the line.
Multiple linear regression is an extension of (simple) linear regression, which
allows a response variable, y, to be modeled as a linear function of two or
more predictor variables.
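A minimal least-squares sketch, assuming NumPy; the x and y values are illustrative:

```python
# A minimal least-squares linear regression sketch, assuming NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = w*x + b by minimizing the squared error (method of least squares).
w, b = np.polyfit(x, y, deg=1)
print(w, b)          # slope and y-intercept
print(w * 6.0 + b)   # use the fitted model to estimate y for a new x
```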
Log-linear models approximate discrete multidimensional probability
distributions. Given a set of tuples in n dimensions (e.g., described by n
attributes), we can consider each tuple as a point in an n-dimensional space.
Log-linear models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations
Histograms:
Histograms use binning to approximate data distributions and are a popular
form of data reduction.
A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
For example, a data cube for multidimensional analysis of sales data can store annual sales per item type for each AllElectronics branch.
Data Compression
Data Compression specifically refers to a data reduction method by which
files are shrunk at the bit level.
Data Compression works by using formulas or algorithms to reduce the
number of bits needed to represent the data.
The data reduction is lossless if the original data can be reconstructed from the
compressed data without any loss of information; otherwise, it is lossy.
String Compression:
There are extensive theories and well-tuned algorithms.
Typically lossless.
However, only limited manipulation of the compressed data is possible.
Audio/video Compression:
Typically lossy compression, with progressive refinement.
Sometimes small fragments of the signal can be reconstructed without reconstructing the whole.
Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
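A minimal sketch of lossless compression using Python's standard zlib module; the sample byte string is illustrative:

```python
# A minimal lossless (string) compression sketch using zlib.
import zlib

original = b"data mining " * 100                 # highly repetitive byte string
compressed = zlib.compress(original)

print(len(original), len(compressed))            # compressed form is much smaller
assert zlib.decompress(compressed) == original   # lossless: exact reconstruction
```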
Data Discretization and Concept Hierarchy Generation for Numerical Data
Data discretization transforms numeric data by mapping values to interval or
concept labels. Such methods can be used to automatically generate concept
hierarchies for the data, which allows for mining at multiple levels of
granularity.
Discretization techniques can be categorized based on how the discretization is
performed, such as whether it uses class information or which direction it
proceeds (i.e., top-down vs. bottom-up).
If the discretization process uses class information, then we say it is supervised
discretization. Otherwise, it is unsupervised.
If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting. This contrasts with bottom-up discretization or merging, which starts by considering all of the continuous values as potential split points, removes some by merging neighborhood values to form intervals, and then applies this process recursively to the resulting intervals.
In an equal-width histogram, for example, the values are partitioned into equal-
size partitions or ranges.
With an equal-frequency histogram, the values are partitioned so that, ideally,
each partition contains the same number of data tuples.
The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of concept levels has been reached. A minimum interval size can also be used per level to control the recursive procedure. This specifies the minimum width of a partition, or the minimum number of values for each partition at each level.
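A minimal sketch contrasting equal-width and equal-frequency partitioning, assuming NumPy; the price values are the sorted records from Question 21 at the end of this unit:

```python
# A minimal sketch of equal-width vs. equal-frequency partitioning (NumPy assumed).
import numpy as np

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
k = 3

# Equal-width: split the range [min, max] into k intervals of equal size.
width_edges = np.linspace(prices.min(), prices.max(), k + 1)

# Equal-frequency: choose edges so each interval holds roughly the same
# number of values (quantiles of the data).
freq_edges = np.quantile(prices, np.linspace(0, 1, k + 1))

print(width_edges)  # [  5.  75. 145. 215.]
print(freq_edges)   # quantile-based edges
```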
Histograms can also be partitioned based on cluster analysis of the data
distribution, as described next.
Discretization by Cluster, Decision Tree, and Correlation Analyses:
Clustering, decision tree analysis, and correlation analysis can be used for data
discretization.
Cluster analysis is a popular data discretization method. A clustering algorithm
can be applied to discretize a numeric attribute, A, by partitioning the values of
A into clusters or groups.
Clustering takes the distribution of A into consideration, as well as the
closeness of data points, and therefore is able to produce high-quality
discretization results.
Clustering can be used to generate a concept hierarchy for A by following
either a top-down splitting strategy or a bottom-up merging strategy, where
each cluster forms a node of the concept hierarchy. In the former, each initial
cluster or partition may be further decomposed into several subclusters,
forming a lower level of the hierarchy. In the latter, clusters are formed by
repeatedly grouping neighboring clusters in order to form higher-level
concepts.
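A minimal sketch of cluster-based discretization, assuming scikit-learn is available; the attribute values reuse the price records from Question 21 and the choice of three clusters is illustrative:

```python
# A minimal cluster-based discretization sketch, assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

values = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215],
                  dtype=float).reshape(-1, 1)

# Partition the values of attribute A into 3 clusters; each cluster
# becomes one interval (a node of the concept hierarchy).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
for label in range(3):
    members = values[kmeans.labels_ == label].ravel()
    print(label, members.min(), members.max())  # interval covered by the cluster
```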
Techniques to generate decision trees for classification can be applied to
discretization. Such techniques employ a top-down splitting approach.
Measures of correlation can be used for discretization. ChiMerge is a χ² (chi-square)-based discretization method. It employs a bottom-up approach by finding the best neighboring intervals and then merging them to form larger intervals, recursively.
Concept Hierarchy Generation for Nominal Data
1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of these attributes at the schema level, such as street < city < province or state < country.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. However, we can easily specify explicit groupings for a small portion of intermediate-level data. For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as "{Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada" and "{British Columbia, prairies_Canada} ⊂ Western_Canada."
3. Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy. Consider the observation that since higher-level concepts generally cover several subordinate lower-level concepts, an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street). Based on
this observation, a concept hierarchy can be automatically generated based on
the number of distinct values per attribute in the given attribute set. The
attribute with the most distinct values is placed at the lowest hierarchy level.
The lower the number of distinct values an attribute has, the higher it is in the
generated concept hierarchy.
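A minimal sketch of this distinct-value heuristic, assuming pandas and a hypothetical location table:

```python
# A minimal sketch of automatic hierarchy ordering by distinct-value
# counts; the location table is hypothetical.
import pandas as pd

df = pd.DataFrame({
    "country":  ["Canada", "Canada", "Canada", "Canada"],
    "province": ["Alberta", "Alberta", "Manitoba", "Manitoba"],
    "city":     ["Calgary", "Edmonton", "Winnipeg", "Winnipeg"],
    "street":   ["1st Ave", "Main St", "5th St", "Oak Rd"],
})

# Fewer distinct values => higher level in the generated hierarchy.
order = df.nunique().sort_values()
print(list(order.index))  # ['country', 'province', 'city', 'street']
```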
2. List and briefly describe the different data mining functionalities, with a real-life example of each.
4. Define data object and attribute. Describe different attribute types with examples.
(or) What is data set? Describe different characteristics and types of data sets
used in data mining.
6. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q=3.
(d) Compute the supremum distance between the two objects.
7. Why preprocess the data? Briefly describe the major steps involved in data
pre-processing.
9. What is the need for data cleaning? Explain the steps in the process of data
cleaning. Discuss briefly about data cleaning techniques.
10. Describe the various approaches to remove the noisy data from the original data.
(or) What is noisy data? Explain the binning methods for data smoothening.
11. What is data integration? Discuss the issues to consider during data integration.
13. What is data reduction? Describe briefly the strategies for data reduction.
14. What are the techniques used to produce smaller forms of data
representation using numerosity reduction? Explain each with an example
15. What is attribute subset selection? Describe heuristic methods of attribute subset
selection.
16. What is concept hierarchy generation? Describe the various methods for automatic
generation of concept hierarchies for categorical data.
17. What is need of dimensionality reduction? Describe briefly different methods for
dimensionality reduction.
20. Use the three methods below to normalize the following group of data: 200, 300,
400, 600, 1000
a. Min-max normalization by setting min = 0 and max = 1
b. z-score normalization
c. Normalization by decimal scaling
21. Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.Partition them into three bins by each
of the following methods:
a. equal-frequency (equal-depth) partitioning
b. equal-width partitioning
c. clustering
22. Using sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
a) Use smoothing by bin means to smooth these data, using a bin depth of 3.
b) Use smoothing by bin medians to smooth these data, using a bin depth of 3.
c) Use smoothing by bin boundaries to smooth these data, using a bin depth of 3